Best Practices

 1. Whenever we infer the Schema for a large file define the Schema explicitly. Will get the following  benefits:

  •        Relieve Spark from the onus of inferring the schema
  •        Prevent spark from creating a separate job just to read a large portion of file to ascertain the schema, which for a large file can be expensive and time consuming.
  •        Early detection of errors for schema mismatches.

2. The other way is to use the option setting "sampleRatio" to 0.001 to infer the schema from the header itself.


Comments

Popular posts from this blog

Out Of Memory in Spark(OOM) - Typical causes and resolutions

map vs flatMap in Spark

Spark Persistence(Caching)