
Best Practices

 1. When reading a large file, define the schema explicitly instead of letting Spark infer it. You get the following benefits:
    - It relieves Spark of the onus of inferring the schema.
    - It prevents Spark from launching a separate job just to read a large portion of the file to ascertain the schema, which for a large file can be expensive and time consuming.
    - Schema mismatches are detected early as errors.
 2. Alternatively, set the "samplingRatio" option to a small value such as 0.001 so that Spark infers the schema from only a small sample of the rows rather than scanning a large portion of the file.
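A minimal sketch of both approaches in Scala. The file path and the column names (`id`, `name`, `ts`) are hypothetical placeholders; adjust them to your data:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object SchemaExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("schema-example")
      .master("local[*]")
      .getOrCreate()

    // Option 1: explicit schema. No extra job is launched to sample the file,
    // and rows that do not match the declared types surface as errors early.
    val schema = StructType(Seq(
      StructField("id", LongType, nullable = false),
      StructField("name", StringType, nullable = true),
      StructField("ts", TimestampType, nullable = true)
    ))
    val df = spark.read
      .option("header", "true")
      .schema(schema)
      .csv("/path/to/large_file.csv")

    // Option 2: infer the schema, but only from a small sample of rows
    // so the inference pass does not read a large portion of the file.
    val sampled = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .option("samplingRatio", "0.001")
      .csv("/path/to/large_file.csv")

    spark.stop()
  }
}
```

Option 1 is preferable for production pipelines, since sampling-based inference can still guess a wrong type if the sampled rows are not representative.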

RDD Interface

 RDD is the basic abstraction in Spark. Internally, an RDD is characterized by five properties:
 1. Dependencies (on parent RDDs)
 2. Partitions
 3. A compute function
 4. A Partitioner (for key-value RDDs)
 5. A list of preferred locations
 Properties 4 and 5 are optional. An RDD achieves resilience through its dependencies: if a partition is lost, Spark can recompute it from the lineage of parent RDDs.
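These properties can be inspected on a live RDD. A minimal sketch assuming a local Spark installation:

```scala
import org.apache.spark.sql.SparkSession

object RddPropertiesExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-properties")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val rdd = sc.parallelize(1 to 10, numSlices = 4)
    println(rdd.getNumPartitions)   // property 2: partitions (4 here)
    println(rdd.partitioner)        // property 4: None, not a key-value RDD

    val pairs = rdd.map(x => (x % 2, x))
    println(pairs.dependencies)     // property 1: narrow dependency on the parent
    println(pairs.toDebugString)    // lineage Spark uses to recompute lost partitions

    val reduced = pairs.reduceByKey(_ + _)
    println(reduced.partitioner)    // property 4: set for a key-value RDD after a shuffle

    spark.stop()
  }
}
```

`toDebugString` prints the lineage graph, which is exactly what makes the RDD resilient: any lost partition can be rebuilt by replaying that chain of dependencies.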