ON TRAFFIC-AWARE PARTITION AND AGGREGATION IN MAPREDUCE FOR BIG DATA APPLICATIONS

ABSTRACT:

To reduce the network traffic of a MapReduce job, we consider aggregating data with the same keys before sending them to remote reduce tasks. Although a similar function, called combine, has already been adopted by Hadoop, it operates immediately after a map task and only on that task's own output, failing to exploit the data aggregation opportunities among multiple tasks on different machines. We jointly consider data partition and aggregation for a MapReduce job with the objective of minimizing the total network traffic. In particular, we propose a distributed algorithm for big data applications by decomposing the original large-scale problem into several subproblems that can be solved in parallel. Moreover, an online algorithm is designed to handle data partition and aggregation in a dynamic manner. Finally, extensive simulation results demonstrate that our proposals can significantly reduce the network traffic cost in both the offline and online cases.

INTRODUCTION

MapReduce has emerged as the most popular computing framework for big data processing due to its simple programming model and automatic management of parallel execution. MapReduce and its open-source implementation Hadoop have been adopted by leading companies, such as Yahoo!, Google, and Facebook, for various big data applications, such as machine learning, bioinformatics, and cybersecurity. MapReduce divides a computation into two main phases, namely map and reduce, which are in turn carried out by several map tasks and reduce tasks, respectively. In the map phase, map tasks are launched in parallel to convert the original input splits into intermediate data in the form of key/value pairs. These key/value pairs are stored on the local machine and organized into multiple data partitions, one per reduce task. In the reduce phase, each reduce task fetches its own share of data partitions from all map tasks to generate the final result.

There is a shuffle step between the map and reduce phases.

In this step, the data produced by the map phase are ordered, partitioned, and transferred to the appropriate machines executing the reduce phase. The resulting all-to-all traffic pattern from map tasks to reduce tasks can generate a great volume of network traffic, imposing a serious constraint on the efficiency of data analytic applications. For example, with tens of thousands of machines, data shuffling accounts for 58.6% of the cross-pod traffic and amounts to over 200 petabytes in total in the analysis of SCOPE jobs. For shuffle-heavy MapReduce jobs, this traffic can incur a considerable performance overhead of up to 30-40%, as reported in prior studies. By default, intermediate data in Hadoop are shuffled according to a hash function, which can lead to large network traffic because it ignores network topology and the data size associated with each key.

We consider a toy example with two map tasks and two reduce tasks, where the intermediate data of three keys K1, K2, and K3 are denoted by rectangle bars under each machine. If the hash function assigns the data of K1 and K3 to reducer 1, and K2 to reducer 2, a large amount of traffic will go through the top switch. To tackle this problem incurred by the traffic-oblivious partition scheme, we take into account both task locations and the data size associated with each key in this paper. By assigning keys with larger data sizes to reduce tasks closer to the map tasks, network traffic can be significantly reduced. In the same example, if we assign K1 and K3 to reducer 2, and K2 to reducer 1, as shown in Fig. 1(b), the data transferred through the top switch will be significantly reduced.
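To make the idea concrete, the following is a minimal sketch (our own illustration, not the algorithm proposed in this paper) of a custom Hadoop Partitioner that routes keys known to carry large data volumes to a preferred reduce task and otherwise falls back to Hadoop's default hash rule. The class name and the static key-to-reducer table are hypothetical; in practice the table would be produced by a profiling or optimization step such as the one formulated in this work.

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Hypothetical traffic-aware partitioner: keys with known large data sizes are
    // sent to a preferred (closer) reduce task; all other keys use the default
    // hash rule, identical to Hadoop's HashPartitioner.
    public class TrafficAwarePartitioner extends Partitioner<Text, IntWritable> {

        // Assumed to come from a profiling/optimization step: key -> reduce task index.
        private static final Map<String, Integer> PREFERRED_REDUCER = new HashMap<String, Integer>();
        static {
            PREFERRED_REDUCER.put("K1", 1);  // reducer 2 in the toy example of Fig. 1(b)
            PREFERRED_REDUCER.put("K3", 1);
            PREFERRED_REDUCER.put("K2", 0);  // reducer 1
        }

        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            Integer preferred = PREFERRED_REDUCER.get(key.toString());
            if (preferred != null && preferred < numReduceTasks) {
                return preferred;
            }
            // Fallback: traffic-oblivious hash partitioning.
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }

Such a partitioner would be attached to a job with job.setPartitionerClass(TrafficAwarePartitioner.class), replacing the default hash-based assignment described above.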

To further reduce the network traffic within a MapReduce job, we consider aggregating data with the same keys before sending them to remote reduce tasks. Although a similar function, called combine, has already been adopted by Hadoop, it operates immediately after a map task and only on that task's own output, failing to exploit the data aggregation opportunities among multiple tasks on different machines. As shown in the example of Fig. 2(a), in the traditional scheme, two map tasks individually send data of key K1 to the reduce task. If we aggregate the data of the same keys before sending them over the top switch, as shown in Fig. 2(b), the network traffic will be reduced.
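For reference, the combine function mentioned above is ordinarily expressed in Hadoop as a Reducer applied to each map task's own output, as in the minimal word-count-style sketch below (class name ours). It performs per-map-task aggregation only; merging the outputs of several map tasks before they cross the top switch, as in Fig. 2(b), is what the scheme studied in this paper adds.

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Standard Hadoop combine step: partial sums are computed on each map task's
    // output before the shuffle, so fewer pairs of the same key cross the network.
    public class SumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));  // one pair per key leaves this map task
        }
    }

It is registered with job.setCombinerClass(SumCombiner.class); Hadoop may then apply it zero or more times to the output of each map task.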

In this paper, we jointly consider data partition and aggregation for a MapReduce job with the objective of minimizing the total network traffic. In particular, we propose a distributed algorithm for big data applications by decomposing the original large-scale problem into several subproblems that can be solved in parallel. Moreover, an online algorithm is designed to handle data partition and aggregation in a dynamic manner. Finally, extensive simulation results demonstrate that our proposals can significantly reduce the network traffic cost in both the offline and online cases.

LITERATURE SURVEY

MAPREDUCE: SIMPLIFIED DATA PROCESSING ON LARGE CLUSTERS

AUTHOR: J. Dean and S. Ghemawat

PUBLISH: Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.

EXPLANATION:

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google’s clusters every day.
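As an illustration of the model, the canonical word-count example can be written against the Hadoop API roughly as follows (a minimal sketch; class and field names are ours): the map function emits an intermediate (word, 1) pair per token, and the reduce function merges all values sharing the same intermediate key.

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Canonical word-count job expressed in the Hadoop MapReduce API; it mirrors the
    // user-defined map and reduce functions described above.
    public class WordCount {

        public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                // Each input line yields one intermediate (word, 1) pair per token.
                for (String token : line.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;  // merge all intermediate values that share the same key
                for (IntWritable c : counts) {
                    sum += c.get();
                }
                context.write(word, new IntWritable(sum));
            }
        }
    }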

CLOUDBLAST: COMBINING MAPREDUCE AND VIRTUALIZATION ON DISTRIBUTED RESOURCES FOR BIOINFORMATICS APPLICATIONS

AUTHOR: A. Matsunaga, M. Tsugawa, and J. Fortes

PUBLISH: IEEE Fourth International Conference on eScience (eScience 2008). IEEE, 2008, pp. 222–229.

EXPLANATION:

This paper proposes and evaluates an approach to the parallelization, deployment, and management of bioinformatics applications that integrates several emerging technologies for distributed computing. The proposed approach uses the MapReduce paradigm to parallelize tools and manage their execution, machine virtualization to encapsulate their execution environments and commonly used data sets into flexibly deployable virtual machines, and network virtualization to connect resources behind firewalls/NATs while preserving the necessary performance and communication environment. An implementation of this approach is described and used to demonstrate and evaluate its feasibility. The implementation integrates Hadoop, Virtual Workspaces, and ViNe as the MapReduce, virtual machine, and virtual network technologies, respectively, to deploy the commonly used bioinformatics tool NCBI BLAST on a WAN-based test bed consisting of clusters at two distinct locations, the University of Florida and the University of Chicago. This WAN-based implementation, called CloudBLAST, was evaluated against both non-virtualized and LAN-based implementations in order to assess the overheads of machine and network virtualization, which were shown to be insignificant. To compare the proposed approach against an MPI-based solution, CloudBLAST's performance was experimentally contrasted with that of the publicly available mpiBLAST on the same WAN-based test bed. Both versions demonstrated performance gains as the number of available processors increased, with CloudBLAST delivering a speedup of 57 compared with 52.4 for the MPI version when 64 processors on two sites were used. The results encourage the use of the proposed approach for the execution of large-scale bioinformatics applications on emerging distributed environments that provide access to computing resources as a service.

MAP TASK SCHEDULING IN MAPREDUCE WITH DATA LOCALITY: THROUGHPUT AND HEAVY-TRAFFIC OPTIMALITY

AUTHOR: W. Wang, K. Zhu, L. Ying, J. Tan, and L. Zhang

PUBLISH: INFOCOM, 2013 Proceedings IEEE. IEEE, 2013, pp. 1609–1617.

EXPLANATION:

Scheduling map tasks to improve data locality is crucial to the performance of MapReduce. Many works have been devoted to increasing data locality for better efficiency. However, to the best of our knowledge, fundamental limits of MapReduce computing clusters with data locality, including the capacity region and theoretical bounds on the delay performance, have not been studied. In this paper, we address these problems from a stochastic network perspective. Our focus is to strike the right balance between data-locality and load-balancing to simultaneously maximize throughput and minimize delay.

We present a new queueing architecture and propose a map task scheduling algorithm that combines the Join the Shortest Queue policy with the MaxWeight policy. We identify an outer bound on the capacity region, and then prove that the proposed algorithm stabilizes any arrival rate vector strictly within this outer bound. This shows that the algorithm is throughput-optimal and that the outer bound coincides with the actual capacity region. Further, we study the number of backlogged tasks under the proposed algorithm, which is directly related to the delay performance through Little's law. We prove that the proposed algorithm is heavy-traffic optimal, i.e., it asymptotically minimizes the number of backlogged tasks as the arrival rate vector approaches the boundary of the capacity region. Therefore, the proposed algorithm is also delay-optimal in the heavy-traffic regime.
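To give a flavor of the scheduling policy, the sketch below illustrates, in a heavily simplified form that is our own and not the paper's exact queueing architecture, the two ingredients named above: an arriving map task is routed to the shortest of its candidate queues (JSQ), and an idle machine serves the longest backlogged queue it is allowed to work on (a MaxWeight-style choice).

    import java.util.List;

    // Simplified illustration of JSQ routing plus MaxWeight-style service;
    // queue indices, candidate sets, and units are hypothetical.
    public class JsqMaxWeightSketch {

        // One backlog counter per machine queue.
        private final int[] queueLength;

        public JsqMaxWeightSketch(int numQueues) {
            this.queueLength = new int[numQueues];
        }

        // JSQ routing: place an arriving task on the shortest queue among the
        // machines that hold a copy of its input data.
        public int route(List<Integer> candidateQueues) {
            int best = candidateQueues.get(0);
            for (int q : candidateQueues) {
                if (queueLength[q] < queueLength[best]) {
                    best = q;
                }
            }
            queueLength[best]++;
            return best;
        }

        // MaxWeight-style service: an idle machine picks, among the queues it may
        // serve, the one with the largest backlog and processes one task from it.
        public int serve(List<Integer> servableQueues) {
            int best = -1;
            for (int q : servableQueues) {
                if (queueLength[q] > 0 && (best < 0 || queueLength[q] > queueLength[best])) {
                    best = q;
                }
            }
            if (best >= 0) {
                queueLength[best]--;
            }
            return best;  // -1 if every servable queue is empty
        }
    }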

SYSTEM ANALYSIS

EXISTING SYSTEM:

The existing system studies the problem of optimizing network usage in MapReduce scheduling. The reason we are interested in network usage is twofold. Firstly, network utilization is a quantity of independent interest, as it is directly related to the throughput of the system. Note that the total amount of data processed per unit time is simply (CPU utilization) · (CPU capacity) + (network utilization) · (network capacity). CPU utilization will always be 1 as long as there are enough jobs in the map queue, but network utilization can be very sensitive to scheduling; indeed, network utilization has been identified as a key component in the optimization of MapReduce systems in several previous works.
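As a purely illustrative calculation (the capacities below are hypothetical, not measured values), suppose each machine can process 1 GB/s on the CPU and transfer 0.5 GB/s over the network. With CPU utilization fixed at 1 and network utilization at 0.6, the system processes 1 · 1 + 0.6 · 0.5 = 1.3 GB/s per machine; a schedule that raises network utilization to 0.9 lifts this to 1 · 1 + 0.9 · 0.5 = 1.45 GB/s, which is why network-aware scheduling directly improves throughput.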

Secondly, better network usage could lead to algorithms with a smaller mean response time. The main motivation for this direction comes from earlier schedulers that overlap the map and shuffle phases, which were shown to yield a significantly better mean response time than Hadoop's fair scheduler. However, neither of these two algorithms explicitly attempts to optimize network usage, which suggests room for improvement. Since MapReduce has become one of the most popular frameworks for large-scale distributed computing, there exists a huge body of work on the performance optimization of MapReduce.

For instance, researchers have tried to optimize MapReduce systems by efficiently detecting and eliminating so-called "stragglers", by providing better data locality, by preventing starvation caused by large jobs, and by analyzing the problem from a purely theoretical viewpoint. The shuffle workload available at any given time is closely related to the output rate of the map phase, due to the inherent dependency between the map and shuffle phases. In particular, when the job being processed is 'map-heavy', the available workload of the same job in the shuffle phase is upper-bounded by the output rate of the map phase. Therefore, poor scheduling of map tasks can have adverse effects on the throughput of the shuffle phase, causing the network to be idle and the efficiency of the entire system to decrease.

DISADVANTAGES:

The existing model, called the overlapping tandem queue model, is a job-level model for MapReduce in which the map and shuffle phases are modeled as two queues placed in tandem. Since it is a job-level model, each job is represented only by its map size and its shuffle size. This simplification is justified by two main assumptions; the first states that each job consists of a large number of small-sized tasks, which allows the progress of each phase to be represented by real numbers.
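For concreteness, the following is a minimal sketch (our own illustration, not code from the cited work) of how a job could be represented under this job-level model: only the total map size and shuffle size are kept, and the progress of each phase is a real number, with the shuffle phase limited to the intermediate data the map phase has already produced.

    // Job-level (overlapping tandem queue) representation of a MapReduce job.
    // Field names and units are hypothetical.
    public class JobLevelJob {

        private final double mapSize;      // total map workload of the job (e.g., in GB)
        private final double shuffleSize;  // total shuffle workload of the job (e.g., in GB)

        private double mapDone;            // map progress, tracked as a real number
        private double shuffleDone;        // shuffle progress, tracked as a real number

        public JobLevelJob(double mapSize, double shuffleSize) {
            this.mapSize = mapSize;
            this.shuffleSize = shuffleSize;
        }

        // The shuffle phase can only work on intermediate data the map phase has
        // already produced, which is what couples the two tandem queues.
        public double availableShuffleWork() {
            double producedShare = (mapSize == 0.0) ? 1.0 : mapDone / mapSize;
            return producedShare * shuffleSize - shuffleDone;
        }

        public void advanceMap(double amount) {
            mapDone = Math.min(mapSize, mapDone + amount);
        }

        public void advanceShuffle(double amount) {
            double allowed = Math.max(0.0, availableShuffleWork());
            shuffleDone += Math.min(amount, allowed);
        }

        public boolean finished() {
            return mapDone >= mapSize && shuffleDone >= shuffleSize;
        }
    }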

The job-level model offers two big advantages over the more complicated task-level models, although it rests on simplifying assumptions whose general applicability remains open.

Firstly, it gives rise to algorithms that are much simpler than those of task-level models, which improves the chances of their being deployed in an actual system.

Secondly, the number of jobs in a system is often smaller than the number of tasks by several orders of magnitude, making the problem computationally much less strenuous. Note, however, that there are still open questions regarding the general applicability of the additional assumptions of the job-level model, which are interesting research questions in their own right.

PROPOSED SYSTEM:

In this paper, we jointly consider data partition and aggregation for a MapReduce job with the objective of minimizing the total network traffic. In particular, we propose a distributed algorithm for big data applications by decomposing the original large-scale problem into several subproblems that can be solved in parallel. Moreover, an online algorithm is designed to handle data partition and aggregation in a dynamic manner. Finally, extensive simulation results demonstrate that our proposals can significantly reduce the network traffic cost in both the offline and online cases.

A MapReduce resource allocation system has been proposed to enhance the performance of MapReduce jobs in the cloud by placing intermediate data on local or nearby physical machines; this locality awareness reduces the network traffic generated in the shuffle phase within the cloud data center. However, little work has studied how to optimize the network performance of the shuffle process, which generates large amounts of data traffic in MapReduce jobs. A critical factor for network performance in the shuffle phase is the intermediate data partition. The default scheme adopted by Hadoop is hash-based partition, which can yield unbalanced loads among reduce tasks because it is unaware of the data size associated with each key.

A fairness-aware key partition approach has been developed that keeps track of the distribution of intermediate keys' frequencies and guarantees a fair distribution among reduce tasks. A combiner function has been introduced that reduces the amount of data to be shuffled and merged to reduce tasks, and an in-mapper combining scheme exploits the fact that mappers can preserve state across the processing of multiple input key/value pairs and defer the emission of intermediate data until all input records have been processed. Both proposals are constrained to a single map task, ignoring the data aggregation opportunities from multiple map tasks. A MapReduce-like system has also been proposed to decrease the traffic by pushing aggregation from the edge into the network.
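The in-mapper combining idea mentioned above can be sketched in the Hadoop API as follows (our own minimal illustration of the published technique, not code from this paper): the mapper accumulates per-key partial sums in memory across all of its input records and emits them only once, in cleanup(), so far fewer intermediate pairs are written and shuffled. As noted, the saving is still confined to a single map task.

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // In-mapper combining sketch: per-key partial sums are kept in memory and
    // emitted only after all input records of this map task have been processed.
    public class InMapperCombiningMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        private final Map<String, Integer> partialSums = new HashMap<String, Integer>();

        @Override
        protected void map(LongWritable offset, Text line, Context context) {
            for (String token : line.toString().split("\\s+")) {
                if (token.isEmpty()) {
                    continue;
                }
                Integer current = partialSums.get(token);   // aggregate in memory instead of emitting
                partialSums.put(token, current == null ? 1 : current + 1);
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            // Deferred emission: one (key, partial sum) pair per distinct key of this task.
            for (Map.Entry<String, Integer> e : partialSums.entrySet()) {
                context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
            }
        }
    }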

ADVANTAGES:

  • We compare our proposed distributed algorithm with the optimal solution obtained by solving the MILP formulation. Due to the high computational complexity of the MILP formulation, we consider small-scale problem instances with 10 keys in this set of simulations.
  • Our distributed algorithm is very close to the optimal solution. Although the network traffic cost increases as the number of keys grows for all algorithms, the performance advantage of our proposed algorithm over the other two schemes becomes larger.
  • To compare our distributed algorithm with the other two schemes, we define a default simulation setting with a number of parameters and then study the performance by changing one parameter while fixing the others. We consider a MapReduce job with 100 keys, with the other parameters as above; the network traffic cost increases with the number of keys from 1 to 100 under all algorithms.

HARDWARE & SOFTWARE REQUIREMENTS:

HARDWARE REQUIREMENT:

  • Processor      –    Pentium IV
  • Speed          –    1 GHz
  • RAM            –    256 MB (minimum)
  • Hard Disk      –    20 GB
  • Floppy Drive   –    1.44 MB
  • Keyboard       –    Standard Windows keyboard
  • Mouse          –    Two- or three-button mouse
  • Monitor        –    SVGA

SOFTWARE REQUIREMENTS:

  • Operating System   :    Windows XP or Windows 7
  • Front End          :    Java JDK 1.7
  • Script             :    JavaScript
  • Tool               :    NetBeans 7
  • Document           :    MS Office 2007