Spark Off-Heap Memory Management Summary

Current state

As of Spark 1.6, only execution memory can use off-heap memory; storage memory, which holds cached RDDs, cannot.

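Non-SQL Spark applications currently make little use of off-heap memory. It comes into play mainly in shuffle and aggregation (the exact call sites still need to be traced through the code and categorized), because the schema of their data is more complex than that of SQL RDDs. Data in SQL RDDs consists only of simple types, so SQL applications can also use off-heap memory during joins and operate directly on the binary representation.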

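Off-heap support is still incomplete. For SQL, the switch was added in Spark 1.5 and is on by default as of 1.6, although whether the default path uses on-heap rather than off-heap memory still needs to be confirmed.
Non-SQL applications are controlled by a single master switch, which is off by default.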

Future

Spark 2.0 will drop Tachyon support, in preparation for using Spark's own off-heap storage.

Reference: https://github.com/apache/spark/pull/10752

This pull request removes the external block store API. This is rarely used, and the file system interface is actually a better, more standard way to interact with external storage systems.

In Spark 2.x, the OFF_HEAP storage level for RDDs will be backed by Spark's own off-heap storage mechanism. An RDD that is stored off-heap must be serialized.

Reference: https://github.com/apache/spark/pull/11805

Updated semantics of OFF_HEAP storage level: In Spark 1.x, the OFF_HEAP storage level indicated that an RDD should be cached in Tachyon. Spark 2.x removed the external block store API that Tachyon caching was based on (see #10752 / SPARK-12667), so OFF_HEAP became an alias for MEMORY_ONLY_SER. As of this patch, OFF_HEAP means “serialized and cached in off-heap memory or on disk”. Via the StorageLevel constructor, useOffHeap can be set if serialized == true and can be used to construct custom storage levels which support replication.
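The constraint quoted above (useOffHeap is only allowed when the data is serialized) can be illustrated with a small model. This is a sketch, not Spark's StorageLevel class: `StorageLevelSketch` and `make_level` are hypothetical names, and the flag values chosen for OFF_HEAP are an assumption read off the phrase "serialized and cached in off-heap memory or on disk".

```python
from collections import namedtuple

# Hypothetical stand-in for a storage level; the fields mirror the flags
# mentioned in the quoted PR description, not Spark's real class.
StorageLevelSketch = namedtuple(
    "StorageLevelSketch",
    ["use_disk", "use_memory", "use_off_heap", "serialized", "replication"],
)


def make_level(use_disk, use_memory, use_off_heap, serialized, replication=1):
    """Build a level, rejecting off-heap storage of deserialized data."""
    # Rule from the PR text: useOffHeap can be set only if serialized == true.
    if use_off_heap and not serialized:
        raise ValueError("off-heap storage requires serialized data")
    return StorageLevelSketch(
        use_disk, use_memory, use_off_heap, serialized, replication
    )


# Assumed flags for OFF_HEAP: serialized, off-heap memory, disk allowed.
off_heap = make_level(True, True, True, True)

# A custom off-heap level with replication, as the PR text says is possible.
off_heap_2 = make_level(True, True, True, True, replication=2)

print(off_heap)
print(off_heap_2)
```

Asking for a deserialized off-heap level, for example `make_level(False, True, True, False)`, raises `ValueError` in this model.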

Off-heap memory configuration parameters

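spark.unsafe.offHeap
Controls whether execution memory uses off-heap allocation; off by default. It applies to execution memory only. In 1.6 this key is a deprecated alias of spark.memory.offHeap.enabled.
spark.sql.unsafe.enabled
Controls whether SQL uses off-heap memory; added in 1.5 and on by default in 1.6.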
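As a rough illustration of how the two switches above might be passed to a job, the sketch below turns a property map into `--conf key=value` arguments for spark-submit. `build_submit_args` is a hypothetical helper, the values are illustrative, and the property names are the ones listed in these notes (spelled with Spark's camel-case `offHeap`); check them against the Spark version in use.

```python
def build_submit_args(props):
    """Hypothetical helper: render Spark properties as spark-submit flags."""
    args = []
    # Sort keys so the generated command line is stable across runs.
    for key in sorted(props):
        args += ["--conf", "{}={}".format(key, props[key])]
    return args


# Property names as listed in these notes; values are illustrative only.
offheap_props = {
    "spark.unsafe.offHeap": "true",      # execution memory off-heap switch
    "spark.sql.unsafe.enabled": "true",  # SQL switch
}

submit_args = build_submit_args(offheap_props)
print(" ".join(["spark-submit"] + submit_args))
```

Keeping the properties in one map makes it easy to see which switches a given job overrides, since the non-SQL switch is off unless set explicitly.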

Memory request sizing

YARN mode

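executorMemory: set by spark.executor.memory; if that is not set, the SPARK_EXECUTOR_MEMORY environment variable is used.
overhead: set by spark.yarn.executor.memoryOverhead; if that is not set, it is executorMemory * 0.1, with a floor of 384 MB. It is reserved for off-heap allocations and JVM bookkeeping (PermGen, thread stacks, and so on).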
- Reserved memory: headroom so that heap plus off-heap usage does not exceed executorMemory + overhead.
- PermGen: used by the JVM permanent generation.

YARN only allocates the Container's logical size within the scheduler; it does not look at how much memory the Container actually needs.
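The sizing rule described in the YARN mode section can be worked through as arithmetic. The sketch below assumes sizes in MB and truncates the 10% overhead to a whole number; the function and constant names are illustrative, and any rounding the YARN scheduler applies to the request is not modeled.

```python
MIN_OVERHEAD_MB = 384    # floor stated in the notes
OVERHEAD_FACTOR = 0.10   # fraction of executorMemory stated in the notes


def executor_overhead_mb(executor_memory_mb, configured_overhead_mb=None):
    """Overhead in MB: the explicit setting if given, else max(10%, 384)."""
    if configured_overhead_mb is not None:
        return configured_overhead_mb
    return max(int(OVERHEAD_FACTOR * executor_memory_mb), MIN_OVERHEAD_MB)


def container_request_mb(executor_memory_mb, configured_overhead_mb=None):
    """Logical container size: executorMemory + overhead."""
    return executor_memory_mb + executor_overhead_mb(
        executor_memory_mb, configured_overhead_mb
    )


# 2 GB executor: 10% is 204 MB, below the floor, so overhead is 384 MB.
print(container_request_mb(2048))        # 2432
# 8 GB executor: 10% is 819 MB, above the floor.
print(container_request_mb(8192))        # 9011
# Explicit overhead of 1 GB on a 4 GB executor.
print(container_request_mb(4096, 1024))  # 5120
```

For small executors the 384 MB floor dominates, so the effective overhead is well above 10% of the heap.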