# spark_myself_commit

**Repository Path**: jsqf/spark_myself_commit

## Basic Information

- **Project Name**: spark_myself_commit
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: v2.3.4
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2020-01-22
- **Last Updated**: 2023-12-13

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# SparkSubmit Steps

SparkSubmit as a whole goes through the following steps:

1. Run the `spark-submit` script.
2. `spark-submit` calls the `spark-class` script, which runs up to `build_command`.
3. `build_command` in `spark-class` launches the Java launcher:

   ```
   java -Xmx128m -cp $LAUNCH_CLASSPATH org.apache.spark.launcher.Main org.apache.spark.deploy.SparkSubmit --master xx --deploy-mode cluster ...
   ```

   The launcher validates and parses the command-line arguments, then prints the command it has assembled, deliberately appending a trailing space and `0`:

   ```
   CMD=${JAVA_HOME}/bin/java conf/java-opts xxx -cp org.apache.spark.deploy.SparkSubmit --master xx --deploy-mode x --name xx --conf xxx=yyy --jars xx,yy,zz --class zzz userJar userArgs 0
   ```

4. The part of `spark-class` after `build_command` validates the command above, strips the trailing "space 0", and runs it with `exec $CMD`.
5. The JVM starts `org.apache.spark.deploy.SparkSubmit`; from here on we are inside that class's lifecycle.
6. `org.apache.spark.deploy.SparkSubmit` first parses the arguments again and then enters the `submit` branch, where `prepareSubmitEnvironment` assembles:
   - `sparkConf` = the `--conf` entries
   - `childClasspath` = [user jar, `--jars`, ...]
   - `childMainClass` = `org.apache.spark.deploy.yarn.YarnClusterApplication`
   - `childArgs` = `--jar <user jar> --class <user main class> --arg <user arg 1> --arg <user arg 2> ...`

   Note the configuration that `sparkConf` must contain in yarn-cluster mode:

   ```
   # yarn-cluster sparkConf:
   spark.master, spark.submit.deployMode, spark.app.name,
   spark.driver.extraClassPath=null, spark.driver.extraJavaOptions=null, spark.driver.extraLibraryPath=null,
   spark.yarn.queue=null, spark.executor.instances=--num-executors,
   spark.yarn.dist.pyFiles=null, spark.yarn.dist.jars=--jars,
   spark.yarn.dist.files=--files, spark.yarn.dist.archives=--archives,
   spark.yarn.principal=--principal, spark.yarn.keytab=--keytab,
   spark.executor.cores=--executor-cores, spark.executor.memory=--executor-memory,
   spark.driver.memory=--driver-memory, spark.driver.cores=--driver-cores
   ```

   Note that `spark.jars` and `spark.files` are absent in yarn-cluster mode, and that this conf already contains the values from `spark-defaults.conf`.

   A classLoader is created here that loads everything on `childClasspath`, and `Thread.currentThread.setContextClassLoader(loader)` is called. Finally the `start` method of `childMainClass` is invoked: `org.apache.spark.deploy.yarn.YarnClusterApplication.start(childArgs, sparkConf)` (see the sketch after this list).
7. Inside `YarnClusterApplication`'s `start` method there is again an argument-parsing phase: `org.apache.spark.deploy.yarn.ClientArguments` parses `(childArgs, sparkConf)`, and if `spark.jars` or `spark.files` are present in the conf they are removed. Then `new Client(new ClientArguments(args), conf).run()` hands control to `org.apache.spark.deploy.yarn.Client`.
8. Inside `Client`'s `run` method:
   - create the `yarnClient`
   - validate YARN resources: check whether the requested total driver memory and the executor memory exceed the maximum memory YARN allows for a single container
   - create the container launch context and prepare the resources: jars, confs, classpath, etc.
   - assemble the launch command
   - create the application submission context
   - submit this appContext

   The assembled command:

   ```
   javaOpts = -Xmx?m -Djava.io.tmpdir=./tmp -Dspark.yarn.app.container.log.dir=<log dir>
   amArgs   = org.apache.spark.deploy.yarn.ApplicationMaster --class xxx --jar xxx (user jar plus --jars) --arg args1 --arg args2 --properties-file <pwd>/__spark_conf__/__spark_conf__.properties
   commands = $JAVA_HOME/bin/java -server <javaOpts> <amArgs> 1> <log dir>/stdout 2> <log dir>/stderr
   ```

   8.1 Once `yarnClient` has submitted, the whole flow moves onto the YARN cluster, where the first thing to run is `org.apache.spark.deploy.yarn.ApplicationMaster`.
9. `org.apache.spark.deploy.yarn.ApplicationMaster` then goes through the following steps:
   - Step 1: parse arguments and prepare. For example, load `sparkConf` from `<pwd>/__spark_conf__/__spark_conf__.properties` in the Spark HDFS staging area and set the entries as system properties; build a classLoader from the configuration; prepare the resource list for launching containers.
   - Step 2: start the user thread. The ApplicationMaster waits for the user thread's SparkContext initialization to finish (which in turn starts other background threads that respond to container registration and so on), then pauses the user thread.
   - Step 3: once the user thread's SparkContext initialization is complete, the ApplicationMaster thread starts requesting container resources, assembles the classpath and launch commands, and launches the containers, which register themselves with the driver.
   - Step 4: the ApplicationMaster thread installs the AMEndpoint and related pieces, then resumes the user thread that was paused earlier, which continues running.
   - Step 5: wait for the user thread to finish.
10. The command line (the submitting side) reports the job's running status.

The user's jar and the third-party jars are present on both the driver and the executors.
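To make the class-loading hand-off in step 6 concrete, here is a minimal Scala sketch of the same pattern: build a `URLClassLoader` over `childClasspath`, install it as the thread's context classloader, then load `childMainClass` through it and call its `start` method. `ChildApplication` and `ChildMainLauncher` are illustrative names standing in for Spark's internal `SparkApplication` trait and the corresponding logic inside `SparkSubmit`; this is a sketch of the pattern, not the actual Spark source (the real `YarnClusterApplication.start` takes a `SparkConf`, for which a plain `Map` stands in here).

```scala
import java.io.File
import java.net.{URL, URLClassLoader}

// Illustrative stand-in for Spark's SparkApplication trait; the real
// YarnClusterApplication.start takes (Array[String], SparkConf).
trait ChildApplication {
  def start(args: Array[String], conf: Map[String, String]): Unit
}

object ChildMainLauncher {
  def launch(childMainClass: String,
             childClasspath: Seq[String],
             childArgs: Seq[String],
             sparkConf: Map[String, String]): Unit = {
    // Step 6: put the user jar and the --jars entries on a classloader and make
    // it the thread's context classloader before any child class is touched.
    val urls: Array[URL] = childClasspath.map(p => new File(p).toURI.toURL).toArray
    val loader = new URLClassLoader(urls, Thread.currentThread.getContextClassLoader)
    Thread.currentThread.setContextClassLoader(loader)

    // Load childMainClass (org.apache.spark.deploy.yarn.YarnClusterApplication in
    // yarn-cluster mode, assumed here to implement ChildApplication) through that
    // loader and invoke its start method with childArgs and the conf.
    val app = Class.forName(childMainClass, true, loader)
      .getConstructor()
      .newInstance()
      .asInstanceOf[ChildApplication]
    app.start(childArgs.toArray, sparkConf)
  }
}
```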
# A typical spark-defaults.conf

```properties
spark.authenticate=false
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.executorIdleTimeout=60
spark.dynamicAllocation.minExecutors=0
spark.dynamicAllocation.schedulerBacklogTimeout=1
spark.eventLog.enabled=true
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.shuffle.service.enabled=true
spark.shuffle.service.port=7337
spark.ui.killEnabled=true
spark.master=yarn
spark.submit.deployMode=client
spark.sql.hive.metastore.jars=${env:HADOOP_COMMON_HOME}/../hive/lib/*:${env:HADOOP_COMMON_HOME}/client/*
spark.sql.hive.metastore.version=1.1.0
spark.sql.catalogImplementation=hive
spark.eventLog.dir=hdfs://nameservice1/user/spark/spark2ApplicationHistory
spark.yarn.historyServer.address=http://njtest-cdh5-nn02.nj:18089
spark.yarn.jars=local:/opt/cloudera/parcels/SPARK2-2.1.0.cloudera2-1.cdh5.7.0.p0.171658/lib/spark2/jars/*
spark.driver.extraLibraryPath=/opt/cloudera/parcels/CDH-5.8.4-1.cdh5.8.4.p0.5/lib/hadoop/lib/native
spark.executor.extraLibraryPath=/opt/cloudera/parcels/CDH-5.8.4-1.cdh5.8.4.p0.5/lib/hadoop/lib/native
spark.yarn.am.extraLibraryPath=/opt/cloudera/parcels/CDH-5.8.4-1.cdh5.8.4.p0.5/lib/hadoop/lib/native
spark.hadoop.mapreduce.application.classpath=
spark.hadoop.yarn.application.classpath=
spark.yarn.config.gatewayPath=/opt/cloudera/parcels
spark.yarn.config.replacementPath={{HADOOP_COMMON_HOME}}/../../..
```
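As noted in step 6, the conf that reaches the YARN client already includes the values from `spark-defaults.conf`, and settings passed explicitly with `--conf` take precedence over the file's defaults. The sketch below shows that merge rule in isolation; `DefaultsLoader.loadDefaults` is an illustrative helper, not Spark's own API.

```scala
import java.io.{File, FileInputStream, InputStreamReader}
import java.nio.charset.StandardCharsets
import java.util.Properties
import scala.collection.JavaConverters._

object DefaultsLoader {
  // Minimal sketch, assuming spark-defaults.conf is in java.util.Properties syntax
  // (both "key value" and "key=value" lines parse): fold its spark.* entries into
  // a conf map without overriding keys that were set explicitly via --conf.
  def loadDefaults(explicitConf: Map[String, String], defaultsFile: File): Map[String, String] = {
    val props = new Properties()
    val reader = new InputStreamReader(new FileInputStream(defaultsFile), StandardCharsets.UTF_8)
    try props.load(reader) finally reader.close()

    val defaults = props.asScala
      .collect { case (k, v) if k.startsWith("spark.") => k -> v.trim }
      .toMap

    // Defaults only fill gaps; explicit --conf entries win.
    defaults ++ explicitConf
  }
}
```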
# A typical __spark_conf__.properties from the Spark HDFS staging directory

```properties
#Spark configuration.
#Wed Jul 15 13:50:49 CST 2020
spark.dynamicAllocation.minExecutors=0
spark.shuffle.service.enabled=true
spark.yarn.secondary.jars=expression-Engine-1.0-SNAPSHOT.jar,mysql-connector-java-5.1.39.jar,spark-streaming-kafka-0-10_2.11-2.1.0.jar,kafka_2.11-0.10.0-kafka-2.1.0.jar,kafka-clients-0.10.0-kafka-2.1.0.jar
spark.executor.cores=3
spark.yarn.am.extraLibraryPath=/opt/cloudera/parcels/CDH-5.8.4-1.cdh5.8.4.p0.5/lib/hadoop/lib/native
spark.executor.extraLibraryPath=/opt/cloudera/parcels/CDH-5.8.4-1.cdh5.8.4.p0.5/lib/hadoop/lib/native
spark.yarn.jars=local\:/opt/cloudera/parcels/SPARK2-2.1.0.cloudera2-1.cdh5.7.0.p0.171658/lib/spark2/jars/*
spark.hadoop.mapreduce.application.classpath=
spark.executor.memoryoverhead=2g
spark.sql.hive.metastore.jars=${env\:HADOOP_COMMON_HOME}/../hive/lib/*\:${env\:HADOOP_COMMON_HOME}/client/*
spark.executor.memory=8g
spark.yarn.cache.types=FILE,FILE,FILE,FILE,FILE,FILE
spark.master=yarn
spark.driver.memory=4g
spark.hadoop.yarn.application.classpath=
spark.authenticate=false
spark.driver.extraLibraryPath=/opt/cloudera/parcels/CDH-5.8.4-1.cdh5.8.4.p0.5/lib/hadoop/lib/native
spark.sql.catalogImplementation=hive
spark.submit.deployMode=cluster
spark.dynamicAllocation.enabled=true
spark.sql.hive.metastore.version=1.1.0
spark.app.name=com.saic.portrait.stream.StreamJob
spark.eventLog.enabled=true
spark.shuffle.service.port=7337
spark.yarn.dist.jars=file\:/home/center/script/jars/expression-Engine-1.0-SNAPSHOT.jar,file\:/home/center/script/jars/mysql-connector-java-5.1.39.jar,file\:/home/center/script/jars/spark-streaming-kafka-0-10_2.11-2.1.0.jar,file\:/home/center/script/jars/kafka_2.11-0.10.0-kafka-2.1.0.jar,file\:/home/center/script/jars/kafka-clients-0.10.0-kafka-2.1.0.jar
spark.yarn.cache.visibilities=PRIVATE,PRIVATE,PRIVATE,PRIVATE,PRIVATE,PRIVATE
spark.yarn.config.replacementPath={{HADOOP_COMMON_HOME}}/../../..
spark.yarn.cache.timestamps=1594792255974,1594792256542,1594792256674,1594792256750,1594792256908,1594792256973
spark.dynamicAllocation.executorIdleTimeout=60
spark.dynamicAllocation.schedulerBacklogTimeout=1
spark.yarn.cache.filenames=hdfs\://nameservice1/user/center/.sparkStaging/application_1594720537016_0012/personportrait-1.0-SNAPSHOT.jar\#__app__.jar,hdfs\://nameservice1/user/center/.sparkStaging/application_1594720537016_0012/expression-Engine-1.0-SNAPSHOT.jar\#expression-Engine-1.0-SNAPSHOT.jar,hdfs\://nameservice1/user/center/.sparkStaging/application_1594720537016_0012/mysql-connector-java-5.1.39.jar\#mysql-connector-java-5.1.39.jar,hdfs\://nameservice1/user/center/.sparkStaging/application_1594720537016_0012/spark-streaming-kafka-0-10_2.11-2.1.0.jar\#spark-streaming-kafka-0-10_2.11-2.1.0.jar,hdfs\://nameservice1/user/center/.sparkStaging/application_1594720537016_0012/kafka_2.11-0.10.0-kafka-2.1.0.jar\#kafka_2.11-0.10.0-kafka-2.1.0.jar,hdfs\://nameservice1/user/center/.sparkStaging/application_1594720537016_0012/kafka-clients-0.10.0-kafka-2.1.0.jar\#kafka-clients-0.10.0-kafka-2.1.0.jar
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.yarn.config.gatewayPath=/opt/cloudera/parcels
spark.yarn.cache.sizes=1933911,9994102,989495,216259,5156768,747732
spark.yarn.cache.confArchive=hdfs\://nameservice1/user/center/.sparkStaging/application_1594720537016_0012/__spark_conf__.zip
spark.eventLog.dir=hdfs\://nameservice1/user/spark/spark2ApplicationHistory
spark.executor.instances=2
spark.ui.killEnabled=true
spark.yarn.historyServer.address=http\://njtest-cdh5-nn02.nj\:18089
```
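The parallel `spark.yarn.cache.*` lists in the dump above (filenames, sizes, timestamps, visibilities, types) describe one staged resource per position, and are used on the YARN side to rebuild the local-resource table. Below is an illustrative Scala sketch of zipping them back into per-file records; `CacheEntry` and `CacheEntries` are hypothetical names, not Spark's internal types.

```scala
final case class CacheEntry(
  uri: String,          // hdfs path plus "#linkName" fragment
  size: Long,           // bytes, from spark.yarn.cache.sizes
  timestamp: Long,      // millis, from spark.yarn.cache.timestamps
  visibility: String,   // e.g. PRIVATE
  resourceType: String  // e.g. FILE
)

object CacheEntries {
  private def list(conf: Map[String, String], key: String): Seq[String] =
    conf.getOrElse(key, "").split(",").map(_.trim).filter(_.nonEmpty).toSeq

  // Zip the parallel comma-separated lists back into one record per staged file.
  def parse(conf: Map[String, String]): Seq[CacheEntry] = {
    val names = list(conf, "spark.yarn.cache.filenames")
    val sizes = list(conf, "spark.yarn.cache.sizes").map(_.toLong)
    val times = list(conf, "spark.yarn.cache.timestamps").map(_.toLong)
    val vis   = list(conf, "spark.yarn.cache.visibilities")
    val types = list(conf, "spark.yarn.cache.types")
    names.indices.map(i => CacheEntry(names(i), sizes(i), times(i), vis(i), types(i)))
  }
}
```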