Sojourn time minimization of successful jobs
Y. Yao, M. Paolieri, L. Golubchik
Abstract: Due to a growing interest in deep learning applications, compute-intensive and long-running (hours to days) training jobs have become a significant component of datacenter workloads. A large fraction of these jobs is often exploratory, with the goal of determining the best model structure (e.g., the number of layers and channels in a convolutional neural network), hyperparameters (e.g., the learning rate), and data augmentation strategies for the target application. Notably, training jobs are often terminated early if their learning metrics (e.g., training and validation accuracy) are not converging, with only a few completing successfully. For this motivating application, we consider the problem of scheduling a set of jobs that can be terminated at predetermined checkpoints with known probabilities estimated from historical data. We prove that, in order to minimize the time to complete the first K successful jobs on a single server, optimal scheduling does not require preemption (even when preemption overhead is negligible) and provide an optimal policy; advantages of this policy are quantified through simulation.
SIGMETRICS Perform. Eval. Rev., 50(2):24-26, 2022.
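Below is a minimal simulation sketch of the job model described in the abstract, assuming each job consists of checkpoint segments with known durations and known per-checkpoint survival probabilities (as estimated from historical data). It estimates the time to reach the first K successful completions on a single non-preemptive server under an arrival-order schedule and under a greedy ratio ordering. The `Job` fields, the greedy heuristic, and all parameter values are illustrative assumptions for this sketch; the greedy rule is not the optimal policy derived in the paper.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Job:
    durations: List[float]  # running time of each checkpoint segment
    survive: List[float]    # probability the job is NOT terminated at each checkpoint

    def success_prob(self) -> float:
        p = 1.0
        for s in self.survive:
            p *= s
        return p

    def expected_runtime(self) -> float:
        # Expected time the job occupies the server before finishing or being terminated.
        t, alive = 0.0, 1.0
        for d, s in zip(self.durations, self.survive):
            t += alive * d
            alive *= s
        return t


def simulate(jobs: List[Job], order: List[int], k: int,
             rng: random.Random) -> Optional[float]:
    """Run jobs non-preemptively in the given order; return the time of the k-th success."""
    clock, successes = 0.0, 0
    for idx in order:
        job = jobs[idx]
        for d, s in zip(job.durations, job.survive):
            clock += d
            if rng.random() > s:   # terminated early at this checkpoint
                break
        else:                      # passed every checkpoint: successful job
            successes += 1
            if successes == k:
                return clock
    return None                    # fewer than k successes in this run


def greedy_order(jobs: List[Job]) -> List[int]:
    # Illustrative heuristic (not the paper's optimal policy):
    # schedule jobs by expected server time per unit of success probability.
    return sorted(range(len(jobs)),
                  key=lambda i: jobs[i].expected_runtime() / max(jobs[i].success_prob(), 1e-12))


if __name__ == "__main__":
    rng = random.Random(0)
    jobs = [Job(durations=[rng.uniform(1, 4) for _ in range(3)],
                survive=[rng.uniform(0.5, 0.95) for _ in range(3)])
            for _ in range(20)]
    K, runs = 3, 2000
    for name, order in [("arrival order", list(range(len(jobs)))),
                        ("greedy ratio", greedy_order(jobs))]:
        times = [t for t in (simulate(jobs, order, K, rng) for _ in range(runs)) if t is not None]
        print(f"{name:>13}: mean time to first {K} successes = {sum(times) / len(times):.2f} "
              f"({len(times)}/{runs} runs reached {K} successes)")
```

In this toy setup, ordering jobs by the ratio of expected server occupancy to overall success probability typically reaches the K-th success sooner than running jobs in arrival order; the paper's analysis characterizes the actual optimal non-preemptive policy and quantifies its advantages via simulation.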