Latest Posts Mentioning Zhiyuan
Spring is here, and so is the PhD defence of Zhiyuan Yao: May 17, at 10:00 at Ecole Polytechnique
All good things must come to an end … and so, after about 3 years of scientific collaborations, a global pandemic, and 10 or so scientific papers published or in the pipeline, my PhD student Zhiyuan Yao is preparing to…
Paper: CIKM’22 – Multi-Agent Reinforcement Learning for Network Load Balancing in Data Center
Intro Glad to announce that, together with my friend at Princeton University, I have pushed a paper on multi-agent reinforcement learning algorithms for load balancing problems to CIKM. Building on the paper “Reinforced Workload Distribution Fairness” which we have…
Paper: Aquarius – Enable Fast, Scalable, Data-Driven Service Management in the Cloud
Intro Building on the paper Efficient Data-Driven Network Functions, which I will present in person at the 30th International Symposium on the Modeling, Analysis, and Simulation of Computer and Telecommunication Systems in Nice, France, we have developed a platform…
Zhiyuan’s Publications
2022
Yao, Zhiyuan; Desmouceaux, Yoann; Cordero, Juan Antonio; Townsley, Mark; Clausen, Thomas Heide
Aquarius-Enable Fast, Scalable, Data-Driven Service Management in the Cloud Journal Article
In: IEEE Transactions on Network and Service Management, 2022, ISSN: 1932-4537.
@article{nokeyi,
title = {Aquarius-Enable Fast, Scalable, Data-Driven Service Management in the Cloud},
author = {Zhiyuan Yao and Yoann Desmouceaux and Juan Antonio Cordero and Mark Townsley and Thomas Heide Clausen},
url = {https://ieeexplore.ieee.org/abstract/document/9852806},
doi = {10.1109/TNSM.2022.3197130},
issn = {1932-4537},
year = {2022},
date = {2022-12-01},
urldate = {2022-12-01},
journal = {IEEE Transactions on Network and Service Management},
abstract = {In order to dynamically manage and update networking policies in cloud data centers, Virtual Network Functions (VNFs) use, and therefore actively collect, networking state information -- and in the process, incur additional control signaling and management overhead, especially in larger data centers. In the meantime, VNFs in production prefer distributed and straightforward heuristics over advanced learning algorithms to avoid intractable additional processing latency under high-performance and low-latency networking constraints. This paper identifies the challenges of deploying learning algorithms in the context of cloud data centers, and proposes Aquarius to bridge the application of machine learning (ML) techniques on distributed systems and service management. Aquarius passively yet efficiently gathers reliable observations, and enables the use of ML techniques to collect, infer, and supply accurate networking state information -- without incurring additional signaling and management overhead. It offers fine-grained and programmable visibility to distributed VNFs, and enables both open- and closed-loop control over networking systems. This paper illustrates the use of Aquarius with a traffic classifier, an auto-scaling system, and a load balancer -- and demonstrates the use of three different ML paradigms -- unsupervised, supervised, and reinforcement learning -- within Aquarius, for network state inference and service management. Testbed evaluations show that Aquarius suitably improves network state visibility and brings notable performance gains for various scenarios with low overhead.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
Yao, Zhiyuan; Ding, Zihan
Learning Distributed and Fair Policies for Network Load Balancing as Markov Potential Game Proceedings Article
In: 36th Conference on Neural Information Processing Systems (NeurIPS 2022), 2022.
@inproceedings{nokeyj,
title = {Learning Distributed and Fair Policies for Network Load Balancing as Markov Potential Game},
author = {Zhiyuan Yao and Zihan Ding},
url = {https://arxiv.org/pdf/2206.01451},
year = {2022},
date = {2022-11-28},
urldate = {2022-11-28},
booktitle = {36th Conference on Neural Information Processing Systems (NeurIPS 2022)},
abstract = {This paper investigates the network load balancing problem in data centers (DCs) where multiple load balancers (LBs) are deployed, using the multi-agent reinforcement learning (MARL) framework. The challenges of this problem consist of the heterogeneous processing architecture and dynamic environments, as well as limited and partial observability of each LB agent in distributed networking systems, which can largely degrade the performance of in-production load balancing algorithms in real-world setups. Centralised-training-decentralised-execution (CTDE) RL scheme has been proposed to improve MARL performance, yet it incurs -- especially in distributed networking systems, which prefer distributed and plug-and-play design scheme -- additional communication and management overhead among agents. We formulate the multi-agent load balancing problem as a Markov potential game, with a carefully and properly designed workload distribution fairness as the potential function. A fully distributed MARL algorithm is proposed to approximate the Nash equilibrium of the game. Experimental evaluations involve both an event-driven simulator and real-world system, where the proposed MARL load balancing algorithm shows close-to-optimal performance in simulations, and superior results over in-production LBs in the real-world system.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Yao, Zhiyuan; Desmouceaux, Yoann; Cordero, Juan Antonio; Townsley, Mark; Clausen, Thomas Heide
Efficient Data-Driven Network Functions Proceedings Article
In: 30th International Symposium on the Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS 2022), 2022.
@inproceedings{nokeyg,
title = {Efficient Data-Driven Network Functions},
author = {Zhiyuan Yao and Yoann Desmouceaux and Juan Antonio Cordero and Mark Townsley and Thomas Heide Clausen},
url = {https://arxiv.org/pdf/2208.11385},
year = {2022},
date = {2022-10-18},
urldate = {2022-10-18},
booktitle = {30th International Symposium on the Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS 2022)},
abstract = {Cloud environments require dynamic and adaptive networking policies. It is preferred to use heuristics over advanced learning algorithms in Virtual Network Functions (VNFs) in production because of high-performance constraints. This paper proposes Aquarius to passively yet efficiently gather observations and enable the use of machine learning to collect, infer, and supply accurate networking state information -- without incurring additional signalling and management overhead. This paper illustrates the use of Aquarius with a traffic classifier, an autoscaling system, and a load balancer -- and demonstrates the use of three different machine learning paradigms -- unsupervised, supervised, and reinforcement learning -- within Aquarius, for inferring network state. Testbed evaluations show that Aquarius increases network state visibility and brings notable performance gains with low overhead.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Yao, Zhiyuan; Ding, Zihan; Clausen, Thomas Heide
Multi-agent reinforcement learning for network load balancing in data center Proceedings Article
In: 31st ACM International Conference on Information and Knowledge Management (CIKM'22), 2022.
@inproceedings{nokeyh,
title = {Multi-agent reinforcement learning for network load balancing in data center},
author = {Zhiyuan Yao and Zihan Ding and Thomas Heide Clausen},
url = {https://www.researchgate.net/profile/Zhiyuan_Yao13/publication/358163217_Multi-Agent_Reinforcement_Learning_for_Network_Load_Balancing_in_Data_Center/links/62fe5fd3e3c7de4c34666311/Multi-Agent-Reinforcement-Learning-for-Network-Load-Balancing-in-Data-Center.pdf},
doi = {10.1145/3511808.3557133},
year = {2022},
date = {2022-10-17},
urldate = {2022-10-17},
booktitle = {31st ACM International Conference on Information and Knowledge Management (CIKM'22)},
abstract = {This paper presents the network load balancing problem, a challenging real-world task for multi-agent reinforcement learning (MARL) methods. Conventional heuristic solutions like Weighted-Cost Multi-Path (WCMP) and Local Shortest Queue (LSQ) are less flexible to the changing workload distributions and arrival rates, with a poor balance among multiple load balancers. The cooperative network load balancing task is formulated as a Dec-POMDP problem, which naturally induces the MARL methods. To bridge the reality gap for applying learning-based methods, all models are directly trained and evaluated on a real-world system from moderate- to large-scale setups. Experimental evaluations show that the independent and “selfish” load balancing strategies are not necessarily the globally optimal ones, while the proposed MARL solution has a superior performance over different realistic settings. Additionally, the potential difficulties of the application and deployment of MARL methods for network load balancing are analysed, which helps draw the attention of the learning and network communities to such challenges.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Yao, Zhiyuan; Desmouceaux, Yoann; Cordero, Juan Antonio; Clausen, Thomas Heide
HLB: Towards Load-Aware Load-Balancing Journal Article
In: IEEE/ACM Transactions on Networking, 2022, ISSN: 1558-2566.
@article{nokey,
title = {HLB: Towards Load-Aware Load-Balancing},
author = {Zhiyuan Yao and Yoann Desmouceaux and Juan Antonio Cordero and Thomas Heide Clausen},
doi = {10.1109/TNET.2022.3177163},
issn = {1558-2566},
year = {2022},
date = {2022-06-05},
urldate = {2022-06-05},
journal = {IEEE/ACM Transactions on Networking},
abstract = {The purpose of network load balancers is to optimize quality of service to the users of a set of servers - basically, to improve response times and to reduce computing resources - by properly distributing workloads. This paper proposes a distributed, application-agnostic, Hybrid Load Balancer (HLB) that - without explicit monitoring or signaling - infers server occupancies and processing speeds, which allows making optimised workload placement decisions. This approach is evaluated both through simulations and extensive experiments, including synthetic workloads and Wikipedia replays on a real-world testbed. Results show significant performance gains, in terms of both response time and system utilisation, when compared to existing load-balancing algorithms.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
2021
Yao, Zhiyuan; Ding, Zihan; Clausen, Thomas Heide
Reinforced Workload Distribution Fairness Proceedings Article
In: Machine Learning for Systems at 35th Conference on Neural Information Processing Systems (NeurIPS 2021), 2021.
@inproceedings{yao2021reinforced,
title = {Reinforced Workload Distribution Fairness},
author = {Zhiyuan Yao and Zihan Ding and Thomas Heide Clausen},
url = {https://www.thomasclausen.net/wp-content/uploads/2021/11/2111.00008-1.pdf},
year = {2021},
date = {2021-12-01},
urldate = {2021-12-01},
booktitle = {Machine Learning for Systems at 35th Conference on Neural Information Processing Systems (NeurIPS 2021)},
abstract = {Network load balancers are central components in data centers, that distribute workloads across multiple servers and thereby contribute to offering scalable services. However, when load balancers operate in dynamic environments with limited monitoring of application server loads, they rely on heuristic algorithms that require manual configurations for fairness and performance. To alleviate that, this paper proposes a distributed asynchronous reinforcement learning mechanism to -- with no active load balancer state monitoring and limited network observations -- improve the fairness of the workload distribution achieved by a load balancer. The performance of the proposed mechanism is evaluated and compared with state-of-the-art load balancing algorithms in a simulator, under configurations with progressively increasing complexities. Preliminary results show promise in RL-based load balancing algorithms, and identify additional challenges and future research directions, including reward function design and model scalability.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Yao, Zhiyuan; Desmouceaux, Yoann; Townsley, Mark; Clausen, Thomas Heide
Towards Intelligent Load Balancing in Data Centers Proceedings Article
In: Machine Learning for Systems at 35th Conference on Neural Information Processing Systems (NeurIPS 2021), Dec 2021, Sydney, Australia, 2021.
@inproceedings{yao2021intelligent,
title = {Towards Intelligent Load Balancing in Data Centers},
author = {Zhiyuan Yao and Yoann Desmouceaux and Mark Townsley and Thomas Heide Clausen},
url = {https://www.thomasclausen.net/wp-content/uploads/2021/11/2110.15788.pdf},
year = {2021},
date = {2021-12-01},
urldate = {2021-12-01},
booktitle = {Machine Learning for Systems at 35th Conference on Neural Information Processing Systems (NeurIPS 2021), Dec 2021, Sydney, Australia},
abstract = {Network load balancers are important components in data centers to provide scalable services. Workload distribution algorithms are based on heuristics, e.g., Equal-Cost Multi-Path (ECMP), Weighted-Cost Multi-Path (WCMP) or naive machine learning (ML) algorithms, e.g., ridge regression. Advanced ML-based approaches help achieve performance gain in different networking and system problems. However, it is challenging to apply ML algorithms on networking problems in real-life systems. It requires domain knowledge to collect features from low-latency, high-throughput, and scalable networking systems, which are dynamic and heterogeneous. This paper proposes Aquarius to bridge the gap between ML and networking systems and demonstrates its usage in the context of network load balancers. This paper demonstrates its ability to conduct both offline data analysis and online model deployment in realistic systems. The results show that the ML model trained and deployed using Aquarius improves load balancing performance, yet they also reveal more challenges to be resolved to apply ML for networking systems.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Rizzi, Carmine; Yao, Zhiyuan; Desmouceaux, Yoann; Townsley, Mark; Clausen, Thomas Heide
Charon: Load-Aware Load-Balancing in P4 Proceedings Article
In: 1st Joint International Workshop on Network Programmability & Automation (NetPA) at 17th International Conference on Network and Service Management (CNSM 2021), 2021.
@inproceedings{rizzi2021charon,
title = {Charon: Load-Aware Load-Balancing in P4},
author = {Carmine Rizzi and Zhiyuan Yao and Yoann Desmouceaux and Mark Townsley and Thomas Heide Clausen},
url = {https://www.thomasclausen.net/wp-content/uploads/2021/11/2110.14389.pdf},
year = {2021},
date = {2021-10-01},
urldate = {2021-01-01},
booktitle = {1st Joint International Workshop on Network Programmability \& Automation (NetPA) at 17th International Conference on Network and Service Management (CNSM 2021)},
abstract = {Load-Balancers play an important role in data centers as they distribute network flows across application servers and guarantee per-connection consistency. It is hard, however, to make fair load balancing decisions so that all resources are efficiently occupied yet not overloaded. Tracking connection states allows inferring server load states and making informed decisions, but at the cost of additional memory space consumption. This makes it hard to implement on programmable hardware, which has constrained memory but offers line-rate performance. This paper presents Charon, a stateless load-aware load balancer that has line-rate performance implemented in P4-NetFPGA. Charon passively collects load states from application servers and employs the power-of-2-choices scheme to make data-driven load balancing decisions and improve resource utilization. Per-connection consistency is preserved statelessly by encoding server ID in a covert channel. The prototype design and implementation details are described in this paper. Simulation results show performance gains in terms of load distribution fairness, quality of service, throughput and processing latency.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}