
Best Practices

 1. When reading a large file, define the schema explicitly instead of letting Spark infer it. You get the following benefits:
    - It relieves Spark of the onus of inferring the schema.
    - It prevents Spark from launching a separate job just to read a large portion of the file to ascertain the schema, which for a large file can be expensive and time consuming.
    - Schema mismatches are detected early as errors.
 2. Alternatively, set the "samplingRatio" option to a small value such as 0.001 so that Spark infers the schema from only a small sample of the rows rather than scanning a large portion of the file.
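A minimal sketch of both approaches in Scala. The file path and the column names (`id`, `name`, `ts`) are hypothetical placeholders; adjust them to your data:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object SchemaExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("schema-example")
      .master("local[*]")
      .getOrCreate()

    // Option 1: explicit schema. No extra job is launched to sample the file,
    // and rows that do not match the declared types surface as errors early.
    val schema = StructType(Seq(
      StructField("id", LongType, nullable = false),
      StructField("name", StringType, nullable = true),
      StructField("ts", TimestampType, nullable = true)
    ))
    val df = spark.read
      .option("header", "true")
      .schema(schema)
      .csv("/path/to/large_file.csv")

    // Option 2: infer the schema, but only from a small sample of rows
    // so the inference pass does not read a large portion of the file.
    val sampled = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .option("samplingRatio", "0.001")
      .csv("/path/to/large_file.csv")

    spark.stop()
  }
}
```

Option 1 is preferable for production pipelines, since sampling-based inference can still guess a wrong type if the sampled rows are not representative.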

RDD Interface

 RDD is the basic abstraction in Spark. Internally, an RDD is characterized by five properties:
 1. Dependencies (on parent RDDs)
 2. Partitions
 3. A compute function
 4. A Partitioner (for key-value RDDs)
 5. A list of preferred locations
 Properties 4 and 5 are optional. An RDD achieves resilience through its dependencies: if a partition is lost, Spark can recompute it from the lineage of parent RDDs.
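These properties can be inspected on a live RDD. A minimal sketch assuming a local Spark installation:

```scala
import org.apache.spark.sql.SparkSession

object RddPropertiesExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-properties")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val rdd = sc.parallelize(1 to 10, numSlices = 4)
    println(rdd.getNumPartitions)   // property 2: partitions (4 here)
    println(rdd.partitioner)        // property 4: None, not a key-value RDD

    val pairs = rdd.map(x => (x % 2, x))
    println(pairs.dependencies)     // property 1: narrow dependency on the parent
    println(pairs.toDebugString)    // lineage Spark uses to recompute lost partitions

    val reduced = pairs.reduceByKey(_ + _)
    println(reduced.partitioner)    // property 4: set for a key-value RDD after a shuffle

    spark.stop()
  }
}
```

`toDebugString` prints the lineage graph, which is exactly what makes the RDD resilient: any lost partition can be rebuilt by replaying that chain of dependencies.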