JOURNAL ARTICLES

Towards understanding HPC users and systems: A NERSC case study

Authors:

Gonzalo P. Rodrigo, P-O Östberg, Erik Elmroth, Katie Antypas, Richard Gerber, Lavanya Ramakrishnan

Abstract:

The high performance computing (HPC) scheduling landscape currently faces new challenges due to changes in the workload. Previously, HPC centers were dominated by tightly coupled MPI jobs, but HPC workloads increasingly include high-throughput, data-intensive, and stream-processing applications. As a consequence, workloads are becoming more diverse at both the application and job levels, posing new challenges to classical HPC schedulers. There is a need to understand current HPC workloads and their evolution in order to facilitate informed future scheduling research and enable efficient scheduling in future HPC systems.

Power-Performance Tradeoffs in Data Center Servers: DVFS, CPU pinning, Horizontal, and Vertical Scaling

Authors:

Jakub Krzywda, Ahmed Ali-Eldin, Trevor E. Carlson, Per-Olov Östberg, Erik Elmroth

Abstract:

Dynamic Voltage and Frequency Scaling (DVFS), CPU pinning, horizontal scaling, and vertical scaling are four techniques that have been proposed as actuators to control the performance and energy consumption on data center servers. This work investigates the utility of these four actuators, and quantifies the power-performance tradeoffs associated with them. Using replicas of the German Wikipedia running on our local testbed, we perform a set of experiments to quantify the influence of DVFS, vertical and horizontal scaling, and CPU pinning on end-to-end response time (average and tail), throughput, and power consumption with different workloads. Results of the experiments show that DVFS rarely reduces the power consumption of underloaded servers by more than 5%, but it can be used to limit the maximal power consumption of a saturated server by up to 20% (at a cost of performance degradation). CPU pinning reduces the power consumption of an underloaded server (by up to 7%) at the cost of performance degradation, which can be limited by choosing an appropriate CPU pinning scheme. Horizontal and vertical scaling improve both the average and tail response time, but the improvement is not proportional to the amount of resources added. The load balancing strategy has a significant impact on the tail response time of horizontally scaled applications.
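The finding that DVFS caps a saturated server's peak power can be illustrated with the textbook CMOS dynamic power model, P = C·V²·f, with voltage scaling roughly linearly with frequency. This is a simplified sketch for intuition, not the paper's measurement methodology:

```python
def relative_dynamic_power(freq_scale: float) -> float:
    """Relative dynamic CPU power under DVFS, assuming the classic
    CMOS model P = C * V^2 * f with voltage scaling linearly with
    frequency, so power scales roughly with the cube of frequency."""
    voltage_scale = freq_scale  # simplifying assumption: V scales with f
    return voltage_scale ** 2 * freq_scale

# Lowering frequency by 10% cuts dynamic power by roughly 27%, which is
# why DVFS is effective for capping a saturated server's maximum power
# draw (at the cost of longer response times).
print(f"{relative_dynamic_power(0.9):.3f}")  # prints 0.729
```

On an underloaded server, by contrast, static power and idle states dominate, which is consistent with the paper's observation that DVFS saves little (under 5%) in that regime.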

Analyzing the Availability and Performance of an E-Health System Integrated with Edge, Fog and Cloud Infrastructures

Authors:

Guto Leoni Santos, Patricia Takako Endo, Matheus Felipe Ferreira da Silva Lisboa Tigre, Leylane Graziele Ferreira da Silva, Djamel Sadok, Judith Kelner and Theo Lynn

Abstract:

The Internet of Things has the potential to transform health systems through the collection and analysis of patient physiological data via wearable devices and sensor networks. Such systems can offer assisted living services in real time and provide a range of multimedia-based health services. However, service downtime, particularly in the case of emergencies, can lead to adverse outcomes and, in the worst case, death. In this paper, we propose an e-health monitoring architecture based on sensors that relies on cloud and fog infrastructures to handle and store patient data. Furthermore, we propose stochastic models to analyze the availability and performance of such systems, including models to understand how failures across the Cloud-to-Thing continuum impact e-health system availability and to identify potential bottlenecks. To feed our models with real data, we design and build a prototype and execute performance experiments. Our results identify that the sensors and fog devices are the components that have the most significant impact on the availability of the e-health monitoring system as a whole in the scenarios analyzed. Our findings suggest that in order to identify the best architecture to host the e-health monitoring system, there is a trade-off between performance and delays that must be resolved.
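The intuition that sensors and fog devices dominate overall availability can be sketched with the standard series-system formula, A = MTTF / (MTTF + MTTR), multiplied over the sensor-fog-cloud chain. The MTTF/MTTR figures below are hypothetical placeholders for illustration, not the paper's measured parameters:

```python
def availability(mttf_hours, mttr_hours):
    """Steady-state availability of a single component."""
    return mttf_hours / (mttf_hours + mttr_hours)

def series_availability(components):
    """An e-health monitoring pipeline fails if any stage fails
    (a series system), so overall availability is the product of
    the component availabilities."""
    total = 1.0
    for mttf, mttr in components.values():
        total *= availability(mttf, mttr)
    return total

# Hypothetical MTTF/MTTR values (hours), illustration only.
chain = {
    "sensor": (500.0, 2.0),     # wearables fail (or drain) most often
    "fog":    (2000.0, 1.0),
    "cloud":  (50000.0, 0.5),   # highly redundant data center
}
print(f"{series_availability(chain):.5f}")
```

With numbers of this shape, the sensor term is by far the smallest factor in the product, so improving sensor reliability yields the largest gain in end-to-end availability.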

Simulating Large vCDN Networks: A Parallel Approach

Authors:

Christos K. Filelis-Papadopoulos, Konstantinos M. Giannoutakis, George A. Gravvanis, Patricia Takako Endo, Dimitrios Tzovaras, Sergej Svorobej, Theo Lynn

Abstract:

Virtualization and cloud computing are being used by Communication Service Providers to deploy and utilize virtual Content Distribution Networks (vCDNs) to reduce costs and increase elasticity, thereby avoiding the performance, quality, reliability, and availability limitations that characterize traditional CDNs. As cache placement is based on both the content type and the geographic location of a user request, it has a significant impact on service delivery and network congestion. To study the effectiveness of cache placements and hierarchical network architectures composed of sites, a novel parallel simulation framework is proposed utilizing a discrete-time approach. Unlike other simulation approaches, the proposed simulation framework can update, in parallel, the state of sites and their resource utilization with respect to incoming requests in a significantly faster manner at hyperscale. It allows for simulations with multiple types of content, different virtual machine distributions, probabilistic caching, and forwarding of requests. In addition, power consumption models allow the estimation of the energy consumption of the physical resources that host virtual machines. The results of simulations conducted to assess the performance and applicability of the proposed simulation framework are presented. The results are promising regarding the potential of this simulation framework for the study of vCDNs and the optimization of network infrastructure.
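The discrete-time approach can be sketched as a synchronous state update: every site's next state is computed from the current global state and then committed at once, which is what makes the per-site updates trivially parallelizable. The cache-hierarchy model below is a hypothetical illustration, not the framework's actual code:

```python
def step(queues, arrivals, capacity):
    """One synchronous time step over all cache sites.

    Site i receives external arrivals plus the requests that site i-1
    could not serve in the previous step (forwarding up the hierarchy);
    it serves up to capacity[i] requests and queues the rest. Because
    every next-state reads only the *current* state, the loop body is
    independent per site and could run in parallel.
    """
    n = len(queues)
    served, overflow = [0] * n, [0] * n
    for i in range(n):  # embarrassingly parallel across sites
        incoming = arrivals[i] + (queues[i - 1] if i > 0 else 0)
        served[i] = min(incoming, capacity[i])
        overflow[i] = incoming - served[i]
    return served, overflow

# Three-tier hierarchy: edge cache -> regional cache -> origin server.
served, overflow = step(queues=[0, 0, 0],
                        arrivals=[120, 30, 0],
                        capacity=[100, 60, 10**9])
print(served, overflow)  # the edge overflow (20) is forwarded next step
```

A real vCDN simulator would add content types, probabilistic cache hits, and VM/power models per site, but the synchronous update skeleton is the part that enables the parallel speedup the abstract describes.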

Machine Learning Methods for Reliable Resource Provisioning in Edge-Cloud Computing: A Survey

Authors:

Thang Le Duc, Rafael García Leiva, Paolo Casari, and Per-Olov Östberg

Abstract:

Large-scale software systems are currently designed as distributed entities and deployed in cloud data centers. To overcome the limitations inherent to this type of deployment, applications are increasingly being supplemented with components instantiated closer to the edges of networks – a paradigm known as edge computing. The problem of how to efficiently orchestrate combined edge-cloud applications is, however, incompletely understood, and a wide range of techniques for resource and application management are currently in use. This paper investigates the problem of reliable resource provisioning in joint edge-cloud environments and surveys technologies, mechanisms, and methods that can be used to improve the reliability of distributed applications in diverse and heterogeneous network environments. Due to the complexity of the problem, special emphasis is placed on solutions to the characterization, management, and control of complex distributed applications using machine learning approaches. The survey is structured around a decomposition of the reliable resource provisioning problem into three categories of techniques: workload characterization and prediction, component placement and system consolidation, and application elasticity and remediation. Survey results are presented along with a problem-oriented discussion of the state of the art. Finally, a summary of identified challenges and an outline of future research directions are presented to conclude the paper.
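Of the three categories, workload characterization and prediction is the easiest to illustrate concretely. A common baseline before applying heavier machine learning models is an exponentially weighted moving average forecast of request rate; this sketch is a generic baseline, not a technique from any specific surveyed paper:

```python
def ewma_forecast(samples, alpha=0.3):
    """One-step-ahead workload forecast via an exponentially weighted
    moving average: recent observations are weighted more heavily.
    `alpha` controls how quickly old samples are forgotten."""
    forecast = samples[0]
    for x in samples[1:]:
        forecast = alpha * x + (1 - alpha) * forecast
    return forecast

# Rising request rate (requests/second) observed at an edge node.
requests_per_sec = [100, 110, 130, 125, 160]
print(round(ewma_forecast(requests_per_sec), 1))  # prints 128.7
```

Note that the forecast lags behind the latest spike (160), which is exactly the limitation that motivates the more expressive predictors (regression models, neural networks) covered by the survey.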

A Novel Hyperparameter-free Approach to Decision Tree Construction that Avoids Overfitting by Design

Authors:

Rafael Garcia Leiva, Antonio Fernandez Anta, Vincenzo Mancuso, Paolo Casari

Abstract:

Decision trees are an extremely popular machine learning technique. Unfortunately, overfitting in decision trees remains an open issue that sometimes prevents achieving good performance. In this work, we present a novel approach for the construction of decision trees that avoids overfitting by design, without losing accuracy. A distinctive feature of our algorithm is that it requires neither the optimization of any hyperparameters, nor the use of regularization techniques, thus significantly reducing the decision tree training time. Moreover, our algorithm produces much smaller and shallower trees than traditional algorithms, facilitating the interpretability of the resulting models.
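For contrast, the classical construction that the paper departs from greedily selects the split minimizing Gini impurity and then relies on hyperparameters (maximum depth, minimum leaf size, pruning) to curb overfitting. The sketch below shows that classical single-feature split search, not the authors' hyperparameter-free algorithm:

```python
def gini(labels):
    """Gini impurity of a set of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(xs, ys):
    """Greedy CART-style search: the threshold on one numeric feature
    that minimizes the size-weighted Gini impurity of the two child
    nodes. Repeating this recursively without a stopping criterion is
    what lets traditional trees overfit."""
    best_score, best_threshold = float("inf"), None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best_score:
            best_score, best_threshold = score, t
    return best_threshold

# Perfectly separable toy data: the threshold 3 splits classes cleanly.
print(best_split([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1]))  # prints 3
```

The paper's contribution is to replace the tuned stopping/pruning machinery around this kind of greedy growth with a criterion that needs no hyperparameters at all.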

Simulating Fog and Edge Computing Scenarios: An Overview and Research Challenges

Authors:

Sergej Svorobej, Patricia Takako Endo, Malika Bendechache, Christos Filelis-Papadopoulos, Konstantinos M. Giannoutakis, George A. Gravvanis, Dimitrios Tzovaras, James Byrne and Theo Lynn

Abstract:

The fourth industrial revolution heralds a paradigm shift in how people, processes, things, data and networks communicate and connect with each other. Conventional computing infrastructures are struggling to satisfy dramatic growth in demand from a deluge of connected heterogeneous endpoints located at the edge of networks while, at the same time, meeting required quality of service levels. The complexity of computing at the edge makes it increasingly difficult for infrastructure providers to plan for and provision resources to meet this demand. While simulation frameworks are used extensively in the modelling of cloud computing environments in order to test and validate technical solutions, they are at a nascent stage of development and adoption for fog and edge computing. This paper provides an overview of the challenges posed by fog and edge computing in relation to simulation.

Self-service Cybersecurity Monitoring as Enabler for DevSecOps

Authors:

Jessica Diaz, Jorge E. Pérez, Miguel A. Lopez-Peña, Gabriel A. Mena, Agustín Yagüe

Abstract:

Current IoT systems are highly distributed systems that integrate cloud, edge and fog computing approaches depending on where intelligence and processing capabilities are allocated. This distribution and heterogeneity make development and deployment pipelines very complex and fragmented, with multiple delivery endpoints above hardware. This fact prevents rapid development and makes the operation and monitoring of production systems a difficult and tedious task, including cybersecurity event monitoring. DevSecOps can be defined as a cultural approach to improve and accelerate the delivery of business value by making dev/sec/ops teams’ collaboration effective. This paper focuses on self-service cybersecurity monitoring as an enabler to introduce security practices in a DevOps environment. To that end, we have defined and formalized an activity that supports ‘Fast and Continuous Feedback from Ops to Dev’ by providing a flexible monitoring infrastructure so that teams can configure their monitoring and alerting services according to their criteria (you build, you run, and now you monitor) to obtain fast and continuous feedback from operation and thus better anticipate problems when a production deployment is performed. This activity has been formalized using the Software & Systems Process Engineering Metamodel by OMG, and its instantiation is described through a case study that shows the versioned and repeatable configuration of a cybersecurity monitoring infrastructure (Monitoring as Code) through virtualization and containerization technology. This self-service monitoring/alerting helps break down silos between dev, ops, and sec teams by opening access to key security metrics, which enables a sharing culture and continuous improvement.
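The 'Monitoring as Code' idea, keeping alerting configuration as versioned, repeatable artifacts rather than hand-edited dashboards, can be sketched as declarative rules evaluated against a metric snapshot. The rule format and metric names below are hypothetical illustrations; the paper's case study relies on virtualization and containerization tooling rather than this code:

```python
# Alert rules declared as data: they can live in version control
# alongside the service code ("you build, you run, you monitor").
ALERT_RULES = [
    {"name": "high_error_rate", "metric": "http_5xx_ratio",
     "op": ">", "threshold": 0.05},
    {"name": "auth_failures", "metric": "failed_logins_per_min",
     "op": ">", "threshold": 20},
]

def evaluate(rules, metrics):
    """Return the names of the rules whose condition holds for the
    current metric snapshot; missing metrics default to 0."""
    ops = {">": lambda a, b: a > b, "<": lambda a, b: a < b}
    return [r["name"] for r in rules
            if ops[r["op"]](metrics.get(r["metric"], 0), r["threshold"])]

snapshot = {"http_5xx_ratio": 0.08, "failed_logins_per_min": 3}
print(evaluate(ALERT_RULES, snapshot))  # prints ['high_error_rate']
```

Because the rules are plain data under version control, each team can review, diff, and roll back its own alerting criteria, which is the self-service property the activity formalizes.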