Fast Data Processing with Spark

Author: Holden Karau
This Month on Stack Overflow, Part 1


by anonymous   2019-01-13

My understanding was that Spark performed all operations in memory by default?

No. Most operators do not cache their results in memory; you need to call cache explicitly to keep a result in memory.

So what happens when the result of an operation is not cached? Is it persisted to disk by default?

For most operators, Spark just creates a new RDD that wraps the old RDD. From "Fast Data Processing with Spark":

It is crucial to understand that even though an RDD is defined, it does not actually contain data. This means that when you go to access the data in an RDD it could fail. The computation to create the data in an RDD is only done when the data is referenced; for example, it is created by caching or writing out the RDD. This means that you can chain a large number of operations together, and not have to worry about excessive blocking. It's important to note that during the application development, you can write code, compile it, and even run your job, and unless you materialize the RDD, your code may not have even tried to load the original data.

So the computation does not start until you call a method that fetches the result. The materializing operators are actions such as first, collect, and saveAsTextFile. The result is not stored in memory unless you call cache.
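The deferred-execution behavior can also be sketched in plain Python (again, a toy illustration, not Spark's API): chained "transformations" only record work, and nothing runs until an "action" iterates the chain. The event log shows when each step actually executes.

```python
# Toy demonstration that chained transformations run nothing until an action.
events = []

def transform(source, fn, name):
    # Lazily wrap the upstream iterable; fn only runs during an action.
    def gen():
        events.append(name)           # record when this step actually starts
        for x in source():
            yield fn(x)
    return gen

source = lambda: iter([1, 2, 3, 4])
step1 = transform(source, lambda x: x + 1, "add_one")
step2 = transform(step1, lambda x: x * 10, "times_ten")
print(events)           # []  -- defining the chain did no work

result = list(step2())  # the "action": now the whole chain executes
print(result)           # [20, 30, 40, 50]
print(events)           # ['times_ten', 'add_one']
```

This mirrors why, as the book notes, you can write, compile, and even run a job without ever touching the input data: until an action materializes the RDD, the lineage is just a description of work to do.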

Incidentally, "Fast Data Processing with Spark" is a great book for learning Spark.