Spark Fundamentals: Memory Management

2023-10-26

Spark Memory Management

Starting with Apache Spark 1.6.0, the memory management model has changed. The old model is implemented by the StaticMemoryManager class and is now called “legacy”. Legacy mode is disabled by default, which means that running the same code on Spark 1.5.x and 1.6.0 can result in different behavior, so be careful with that. For compatibility, you can re-enable the legacy model by setting the spark.memory.useLegacyMode parameter to true.
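As a rough illustration of how the legacy switch could be passed at submit time, here is a minimal sketch that builds spark-submit `--conf` arguments in plain Python. The helper name `conf_flags` and the script name `my_app.py` are made up for this example; only the property name comes from the text above.

```python
def conf_flags(settings):
    """Turn a dict of Spark properties into spark-submit --conf arguments."""
    flags = []
    for key, value in settings.items():
        flags.extend(["--conf", f"{key}={value}"])
    return flags


# my_app.py is a hypothetical application script used only for illustration.
command = (
    ["spark-submit"]
    + conf_flags({"spark.memory.useLegacyMode": "true"})
    + ["my_app.py"]
)
print(" ".join(command))
```

The same property could equally be placed in the application's configuration; the command line is used here only because it is easy to show.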

I described the “legacy” memory management model almost a year ago in my article about Spark Architecture. I have also written an article on Spark Shuffle implementations that briefly touches on memory management.

This article describes the new memory management model used in Apache Spark starting with version 1.6.0, which is implemented as UnifiedMemoryManager.

Long story short, the new memory management model looks like this:

[Figure: Apache Spark Unified Memory Manager, introduced in v1.6.0]

You can see three main memory regions in the diagram:

Reserved Memory. This is the memory reserved by the system, and its size is hardcoded. As of Spark 1.6.0 its value is 300MB, which means that this 300MB of RAM does not participate in Spark memory region size calculations. Its size cannot be changed without recompiling Spark or setting spark.testing.reservedMemory, which is not recommended because it is a testing parameter not intended for production use. Be aware that this memory is only called “reserved”: Spark does not account for it in any of its pools, but it does limit what you can allocate for Spark usage. Even if you want to give the whole Java heap to Spark for caching your data, you won’t be able to, because this “reserved” part stays set aside (it is not really spare, since it ends up holding lots of Spark internal objects). Note also that if you don’t give a Spark executor a heap of at least 1.5 * Reserved Memory = 450MB, it will fail with a “please use larger heap size” error message.
User Memory. This is the memory pool that remains after the allocation of Spark Memory, and it is completely up to you how to use it. You can store your own data structures there to be used in RDD transformations. For example, you can rewrite a Spark aggregation with the mapPartitions transformation, maintaining a hash table for the aggregation, and that hash table would consume so-called User Memory. In Spark 1.6.0 the size of this pool can be calculated as (“Java Heap” - “Reserved Memory”) * (1.0 - spark.memory.fraction), which by default equals (“Java Heap” - 300MB) * 0.25. For example, with a 4GB heap you would have 949MB of User Memory. Again, this is User Memory, and it is entirely up to you what is stored in this RAM and how: Spark does no accounting of what you do there or whether you respect this boundary. Not respecting this boundary in your code might cause an OOM error.
Spark Memory. Finally, this is the memory pool managed by Apache Spark. Its size can be calculated as (“Java Heap” - “Reserved Memory”) * spark.memory.fraction, and with Spark 1.6.0 defaults this gives (“Java Heap” - 300MB) * 0.75. For example, with a 4GB heap this pool would be 2847MB in size. The whole pool is split into two regions, Storage Memory and Execution Memory, and the boundary between them is set by the spark.memory.storageFraction parameter, which defaults to 0.5. The advantage of the new memory management scheme is that this boundary is not static: under memory pressure the boundary moves, i.e. one region grows by borrowing space from the other. I will discuss how this boundary moves a bit later; for now let’s focus on how this memory is used:
1. Storage Memory. This pool is used both for storing Apache Spark cached data and as temporary space for “unrolling” serialized data. All “broadcast” variables are also stored there as cached blocks. In case you’re curious, here’s the code of unroll. As you may see, unrolling does not require enough memory for the whole unrolled block to be available: if there is not enough memory to fit the whole unrolled partition, it is put directly on disk, provided the desired persistence level allows this. As for “broadcast”, all broadcast variables are stored in the cache with the MEMORY_AND_DISK persistence level.
2. Execution Memory. This pool is used for storing the objects required during the execution of Spark tasks. For example, it holds the shuffle intermediate buffer on the map side in memory, and it also holds the hash table for the hash aggregation step. This pool also supports spilling to disk if not enough memory is available, but blocks from this pool cannot be forcefully evicted by other threads (tasks).
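To make the User Memory example from the list above more concrete, here is a minimal sketch of a per-partition aggregation that keeps its own hash table. It is plain Python with no Spark dependency; the function name `sum_by_key` and the sample data are assumptions made for illustration. A function of this shape is what one would hand to mapPartitions. Note that the article talks about the JVM heap, so the User Memory accounting applies to Scala or Java jobs; Python is used here only to show the shape of the code.

```python
from collections import defaultdict


def sum_by_key(partition):
    """Aggregate the (key, value) pairs of one partition in a local hash table."""
    # This table is a user-owned structure: Spark does not track its size.
    # In a JVM job, an equivalent map would count against User Memory.
    totals = defaultdict(int)
    for key, value in partition:
        totals[key] += value
    return iter(totals.items())


# Plain-Python stand-in for the iterator Spark would pass for one partition.
sample_partition = iter([("a", 1), ("b", 2), ("a", 3)])
result = dict(sum_by_key(sample_partition))
print(result)
```

Because nothing bounds the size of `totals`, a partition with many distinct keys is exactly the case where ignoring the User Memory boundary can lead to an OOM error.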

OK, so now let’s focus on the moving boundary between Storage Memory and Execution Memory. Due to the nature of Execution Memory, you cannot forcefully evict blocks from this pool: this is data used in intermediate computations, and the process requiring this memory would simply fail if the block it refers to could not be found. This is not the case for Storage Memory. It is just a cache of blocks stored in RAM, and if we evict a block from it we can simply update the block metadata to reflect that the block was evicted to disk (or simply removed). When Spark later tries to access this block, it reads it from disk (or recomputes it if your persistence level does not allow spilling to disk).

So, we can forcefully evict blocks from Storage Memory, but cannot do so from Execution Memory. When can the Execution Memory pool borrow space from Storage Memory? It happens when either:

There is free space available in the Storage Memory pool, i.e. cached blocks don’t use all the memory available there. Then it just reduces the Storage Memory pool size, increasing the Execution Memory pool.
The Storage Memory pool size exceeds the initial Storage Memory region size and all of this space is utilized. This situation causes forceful eviction of blocks from the Storage Memory pool until it shrinks back to its initial size.

In turn, the Storage Memory pool can borrow space from the Execution Memory pool only if there is free space available in the Execution Memory pool.

The initial Storage Memory region size, as you might remember, is calculated as “Spark Memory” * spark.memory.storageFraction = (“Java Heap” - “Reserved Memory”) * spark.memory.fraction * spark.memory.storageFraction. With default values, this equals (“Java Heap” - 300MB) * 0.75 * 0.5 = (“Java Heap” - 300MB) * 0.375. For a 4GB heap this results in 1423.5MB of RAM in the initial Storage Memory region.
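The arithmetic for all the regions can be sketched in a few lines of Python. This is only a calculator for the formulas quoted in this article, using the Spark 1.6.0 defaults stated above (300MB reserved, spark.memory.fraction = 0.75, spark.memory.storageFraction = 0.5); the function name `memory_regions` and the dictionary keys are made up for this example.

```python
RESERVED_MB = 300.0  # hardcoded Reserved Memory in Spark 1.6.0, per the article


def memory_regions(heap_mb, memory_fraction=0.75, storage_fraction=0.5):
    """Return the region sizes in MB for a given executor heap size."""
    # The article states executors need at least 1.5 * Reserved Memory of heap.
    min_heap_mb = 1.5 * RESERVED_MB
    if heap_mb < min_heap_mb:
        raise ValueError(
            f"heap of {heap_mb} MB is below the {min_heap_mb} MB minimum"
        )
    usable_mb = heap_mb - RESERVED_MB
    spark_mb = usable_mb * memory_fraction
    storage_mb = spark_mb * storage_fraction
    return {
        "reserved": RESERVED_MB,
        "user": usable_mb - spark_mb,
        "spark": spark_mb,
        "storage_initial": storage_mb,
        "execution_initial": spark_mb - storage_mb,
    }


regions = memory_regions(4096)  # 4GB heap, as in the article's examples
for name, size_mb in regions.items():
    print(f"{name}: {size_mb} MB")
```

For a 4GB heap this reproduces the figures used in the text: 949MB of User Memory, 2847MB of Spark Memory, and 1423.5MB for the initial Storage Memory region. Pass different fractions to see how the split changes with other settings.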

This implies that if we use the Spark cache and the total amount of data cached on an executor is at least as large as the initial Storage Memory region size, the storage region is guaranteed to be at least as big as its initial size, because execution cannot evict data from it below that size. However, if your Execution Memory region has grown beyond its initial size before you filled the Storage Memory region, you won’t be able to forcefully evict entries from Execution Memory, so you end up with a smaller Storage Memory region for as long as execution holds its blocks in memory.
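The borrowing rules described above can be illustrated with a small toy model. This is a simplified sketch of the rules as stated in this article, not Spark's actual UnifiedMemoryManager code: the class name `UnifiedPoolModel` and its methods are invented for the example, sizes are plain numbers in MB, and details such as storage evicting its own older blocks to make room are left out.

```python
class UnifiedPoolModel:
    """Toy model of the storage/execution boundary; not Spark's real code."""

    def __init__(self, spark_memory_mb, storage_fraction=0.5):
        self.total = float(spark_memory_mb)
        self.storage_initial = self.total * storage_fraction
        self.storage_used = 0.0
        self.execution_used = 0.0

    def free(self):
        return self.total - self.storage_used - self.execution_used

    def acquire_execution(self, mb):
        """Grant up to `mb` to execution, evicting cached blocks if allowed."""
        shortfall = mb - self.free()
        if shortfall > 0:
            # Storage may only be evicted down to its initial region size.
            evictable = max(0.0, self.storage_used - self.storage_initial)
            self.storage_used -= min(shortfall, evictable)
        granted = min(mb, self.free())
        self.execution_used += granted
        return granted

    def acquire_storage(self, mb):
        """Cache a block only if free space exists; execution is never evicted."""
        if mb > self.free():
            return False
        self.storage_used += mb
        return True


# Scenario 1: the cache grows past its initial region, then execution claims
# the borrowed space back, but only down to the initial storage size.
pool = UnifiedPoolModel(1000)
pool.acquire_storage(800)
first_grant = pool.acquire_execution(400)
second_grant = pool.acquire_execution(300)
print(first_grant, second_grant, pool.storage_used)

# Scenario 2: execution grows first, so storage ends up below its initial size.
busy = UnifiedPoolModel(1000)
busy.acquire_execution(700)
too_big = busy.acquire_storage(400)
fits = busy.acquire_storage(300)
print(too_big, fits, busy.storage_used)
```

In the first scenario the second execution request is only partially granted, because the cache cannot be pushed below its initial 500MB. In the second, the 400MB block cannot be cached because execution already holds 700MB and cannot be evicted.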

I hope this article helps you better understand Apache Spark memory management principles and design your applications accordingly. If you have any questions, feel free to ask them in the comments.
