yanqianglifei 2020-07-07
DAGScheduler主要用于在任务正式提交给TaskSchedulerImpl提交之前做一些准备工作,包括:创建job,将DAG中的RDD划分到不同的Stage,提交Stage等等。SparkContext中创建DAGScheduler的代码如下所示:
_dagScheduler = new DAGScheduler(this)
在DAGScheduler维护了jobId和StageId的关系,Stage,ActiveJob以及缓存的RDD的partition的位置信息。
代码如下:
private[scheduler] val nextJobId = new AtomicInteger(0) private[scheduler] def numTotalJobs: Int = nextJobId.get() private val nextStageId = new AtomicInteger(0) private[scheduler] val jobIdToStageIds = new HashMap[Int, HashSet[Int]] private[scheduler] val stageIdToStage = new HashMap[Int, Stage] /** * Mapping from shuffle dependency ID to the ShuffleMapStage that will generate the data for * that dependency. Only includes stages that are part of currently running job (when the job(s) * that require the shuffle stage complete, the mapping will be removed, and the only record of * the shuffle data will be in the MapOutputTracker). */ private[scheduler] val shuffleIdToMapStage = new HashMap[Int, ShuffleMapStage] private[scheduler] val jobIdToActiveJob = new HashMap[Int, ActiveJob] // Stages we need to run whose parents aren‘t done private[scheduler] val waitingStages = new HashSet[Stage] // Stages we are running right now private[scheduler] val runningStages = new HashSet[Stage] // Stages that must be resubmitted due to fetch failures private[scheduler] val failedStages = new HashSet[Stage] private[scheduler] val activeJobs = new HashSet[ActiveJob] /** * Contains the locations that each RDD‘s partitions are cached on. This map‘s keys are RDD ids * and its values are arrays indexed by partition numbers. Each array value is the set of * locations where that RDD partition is cached. * * All accesses to this map should be guarded by synchronizing on it (see SPARK-4454). */ private val cacheLocs = new HashMap[Int, IndexedSeq[Seq[TaskLocation]]] // For tracking failed nodes, we use the MapOutputTracker‘s epoch number, which is sent with // every task. When we detect a node failing, we note the current epoch number and failed // executor, increment it for new tasks, and use this to ignore stray ShuffleMapTask results. // // TODO: Garbage collect information about failure epochs when we know there are no more // stray messages to detect. private val failedEpoch = new HashMap[String, Long] private [scheduler] val outputCommitCoordinator = env.outputCommitCoordinator // A closure serializer that we reuse. // This is only safe because DAGScheduler runs in a single thread. private val closureSerializer = SparkEnv.get.closureSerializer.newInstance()
后续完善。。。。