
[Figure: example of a time-saving optimization on a use case]

Spark is currently a must-have tool for processing large datasets. This technology has become the leading choice for many business applications in data engineering. The momentum is supported by managed services such as Databricks, which reduce part of the costs related to the purchase and maintenance of a distributed computing cluster, while the most famous cloud providers also offer Spark integration services (AWS EMR, Azure HDInsight, GCP Dataproc). Spark is commonly used to apply transformations on data, structured in most cases.
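As an illustration, here is a minimal PySpark sketch of the kind of structured transformation Spark is typically used for. The input path, column names, and aggregation are hypothetical placeholders, not details of the use case discussed in this article.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Illustrative pipeline: read structured data, transform it, write the result.
# The paths and column names ("sales", "country", "amount") are hypothetical.
spark = SparkSession.builder.appName("illustrative-transformation").getOrCreate()

sales = spark.read.parquet("s3://my-bucket/sales/")

revenue_by_country = (
    sales
    .filter(F.col("amount") > 0)                 # narrow transformation: no shuffle
    .groupBy("country")                          # wide transformation: triggers a shuffle
    .agg(F.sum("amount").alias("total_amount"))
)

revenue_by_country.write.mode("overwrite").parquet("s3://my-bucket/revenue_by_country/")
```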

There are two scenarios in which Spark is particularly useful. The first is when the data to be processed is too large for the available computing and memory resources: this is what we call the big data phenomenon. The second is when one wants to accelerate a calculation by distributing it over several machines within the same network. In both cases, a major concern is to optimise the computation time of a Spark job.

In response to this problem, we often increase the resources allocated to a computing cluster, a trend encouraged by the ease of renting computing power from cloud providers. The objective of this article is to propose a strategy for optimizing a Spark job when resources are limited. Indeed, many Spark configuration settings can be adjusted before resorting to cluster elasticity. To avoid an exhaustive search for the best configuration, which is naturally very costly, this post exhibits actionable solutions that maximise our chances of reducing computation time.
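To make this concrete, here is a sketch of how such settings can be fixed when the session is created (the same properties can also be passed to spark-submit through --conf). The property names are standard Spark settings, but the values below are arbitrary examples rather than recommendations from this guideline.

```python
from pyspark.sql import SparkSession

# Illustrative only: standard Spark properties with arbitrary example values.
spark = (
    SparkSession.builder
    .appName("tuned-job")
    .config("spark.executor.memory", "4g")          # memory per executor
    .config("spark.executor.cores", "2")            # CPU cores per executor
    .config("spark.sql.shuffle.partitions", "200")  # partitions produced by shuffles
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)
```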

The approach is illustrated by a guideline of six recommendations. The strategy presented is said to be greedy: at each stage of the process we make the best choice without going back on it. Each step is materialized by a recommendation, justified as far as possible. The purpose is to provide a clear methodology that is easy to test on various use cases.
