The first Bay Area Scheduling & Container Technology Salon: possibly the Bay Area's top gathering of Chinese engineers. Take a look.

2018-04-27 11:50:55 +08:00
 AlibabaSS


Hello, Infrastructure Engineer!

Welcome to the very first event of the Bay Area Cluster Management Meetup. Our goal is to share technical insights in this area and to connect engineers.

We are going to hold a series of events at Alibaba's new office in Sunnyvale, and we look forward to your participation. If you are interested, please use the link below to register.

If you are interested in sharing your experiences, either as a speaker or as a user, please contact us: alibabass@service.alibaba.com

__Sign up now:__ https://www.meetup.com/Alibaba-AIOps-Meetup/events/250165871/?_xtd=gqFyqTI0MjQ4Mzk4MqFwo3dlYg&from=ref

More details: You're Invited! Join the Bay Area Scheduler & Container Meetup

Speakers

Agenda

Talks

1

The Challenges and Possibilities for Alibaba Cluster Management System

Sigma is Alibaba's core cluster management infrastructure and manages most of its online services. Built on our in-house PouchContainer technology, Sigma underpins the goal of managing the machines in Alibaba's data centers as a single computer. In this talk, we will introduce the goals and positioning of the Alibaba cluster management system and its business scenarios. We will also share the problems we have solved, insights from our architecture design, the challenges and opportunities we face, and our future plans for Alibaba cluster management.

2

PaddlePaddle Fluid: Elastic Deep Learning on Kubernetes

Industrial deep learning requires significant computation power. Traditional management systems like SLURM, MPI, and SGE do not support elastic scheduling. A job that requires 100 nodes and is submitted to a cluster with 99 idle nodes has to wait for a long time, and the cluster suffers from low utilization. PaddlePaddle EDL introduces a scheduler that implements elastic scheduling. Our scheduler takes priorities into account, so it can elastically schedule all kinds of jobs running on a general-purpose cluster (e.g., web servers, log collectors, data processors, and deep learning jobs) and build a highly efficient data pipeline. The third part of our work is to make PaddlePaddle support fault-tolerant distributed training, so that killing or starting processes of a training job doesn't stop it. On a bare-metal cluster shared with academia, we observed ~91% overall utilization, which is several times higher than the average of 18% observed on MPI and SLURM clusters.
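The 100-node job on a 99-node cluster can be sketched in a few lines of Python. This is only a toy illustration of the difference between fixed-size and elastic allocation, not PaddlePaddle EDL's actual scheduler or API. The `Job` class, both functions, and the minimum of 10 nodes are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Job:
    # Hypothetical job description; not an EDL or Kubernetes type.
    name: str
    min_nodes: int   # smallest size at which the job can make progress
    max_nodes: int   # size the job would like to have
    priority: int = 0


def schedule_fixed(jobs, idle_nodes):
    """Fixed-size allocation: a job starts only if its full request fits."""
    alloc = {}
    for job in sorted(jobs, key=lambda j: -j.priority):
        if job.max_nodes <= idle_nodes:
            alloc[job.name] = job.max_nodes
            idle_nodes -= job.max_nodes
    return alloc


def schedule_elastic(jobs, idle_nodes):
    """Elastic allocation: start at the minimum, then grow toward the maximum."""
    alloc = {}
    ordered = sorted(jobs, key=lambda j: -j.priority)
    # First pass: admit every job whose minimum fits, highest priority first.
    for job in ordered:
        if job.min_nodes <= idle_nodes:
            alloc[job.name] = job.min_nodes
            idle_nodes -= job.min_nodes
    # Second pass: hand leftover nodes to admitted jobs in priority order.
    for job in ordered:
        if job.name in alloc:
            extra = min(job.max_nodes - alloc[job.name], idle_nodes)
            alloc[job.name] += extra
            idle_nodes -= extra
    return alloc


def utilization(alloc, total_nodes):
    """Fraction of the cluster's nodes that are assigned to some job."""
    return sum(alloc.values()) / total_nodes


# The abstract's scenario: a job that wants 100 nodes, 99 nodes idle.
training = Job("training", min_nodes=10, max_nodes=100)
fixed = schedule_fixed([training], idle_nodes=99)
elastic = schedule_elastic([training], idle_nodes=99)
```

Under these assumptions the fixed-size scheduler leaves the job waiting and all 99 nodes idle, while the elastic one starts it on the 99 available nodes. A real elastic scheduler would also shrink running jobs when higher-priority work arrives, which this sketch leaves out.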

3

The engine of Sigma: the Sigma scheduler

The Sigma scheduler is a policy-rich, micro-topology-aware, workload-specific control plane component that places workloads on nodes. The scheduler needs to take into account individual and collective resource requirements, quality-of-service requirements, hardware/software/policy constraints, anti-affinity specifications, data locality, workload interference, and so on. The quality of the scheduler significantly impacts overall cluster performance and utilization. In this talk, we will present the overall design principles of the Sigma scheduler and its architecture. We will also explore some of the interesting functionality designed to handle large-scale, low-latency workloads.


