Best Practices

Best Practices

1. Whenever we infer the Schema for a large file define the Schema explicitly. Will get the following benefits:

Relieve Spark from the onus of inferring the schema
Prevent spark from creating a separate job just to read a large portion of file to ascertain the schema, which for a large file can be expensive and time consuming.
Early detection of errors for schema mismatches.

2. The other way is to use the option setting "sampleRatio" to 0.001 to infer the schema from the header itself.

Comments