Optimal Configuration of Network Coding in Ad Hoc Networks

05/08/201902/07/2019 by admin

Abstract:

We analyze the impact of network coding (NC) configuration on the performance of ad hoc networks, taking into account two significant factors, namely the throughput loss and the decoding loss, which are jointly treated as the overhead of NC. In particular, physical-layer NC and random linear NC are adopted in static and mobile ad hoc networks (MANETs), respectively. Furthermore, we characterize the goodput and the delay/goodput tradeoff in static networks. The same metrics are also analyzed in MANETs for different mobility models (i.e., the random independent and identically distributed mobility model and the random walk model) and transmission schemes.

Introduction:

Network coding was initially designed as a kind of source coding. Further studies showed that the capacity of wired networks can be improved by network coding (NC), which can fully utilize the network resources.

Due to this advantage, how to employ NC in wireless ad hoc networks has been intensively studied in recent years with the purpose of improving the throughput and delay performance. The main difference between wired networks and wireless networks is that there is non-negligible interference between nodes in wireless networks.

Therefore, it is important to design NC for wireless ad hoc networks with interference so as to improve system performance measures such as the goodput and the delay/goodput tradeoff.

Existing System:

The probability that the random linear NC was valid for a multicast connection problem on an arbitrary network with independent sources was at least (1 − d/q)^η, where η was the number of links with associated random coefficients, d was the number of receivers, and q was the size of the Galois field F_q.
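The bound (1 − d/q)^η can be evaluated numerically to see how quickly it approaches 1 as the field size grows. The following is a minimal sketch; the values d = 4 and η = 20 are hypothetical and chosen only for illustration.

```python
def rlnc_validity_lower_bound(d: int, q: int, eta: int) -> float:
    """Lower bound (1 - d/q)**eta on the probability that random linear NC is valid.

    d   -- number of receivers
    q   -- size of the Galois field
    eta -- number of links with associated random coefficients
    """
    if q <= d:
        raise ValueError("the bound is only informative when q > d")
    return (1.0 - d / q) ** eta


if __name__ == "__main__":
    # Hypothetical scenario: 4 receivers, 20 coded links, growing field size.
    for q in (2 ** 4, 2 ** 8, 2 ** 16):
        print(q, rlnc_validity_lower_bound(d=4, q=q, eta=20))
```

The printed values increase with q, which matches the observation that a large field is needed for a high validity guarantee.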

It was obvious that a large q was required to guarantee that the system with random linear NC (RLNC) was valid. When the throughput loss and the decoding loss are considered, the traditional definition of throughput in ad hoc networks is no longer appropriate, since it does not account for the bits spent on NC coefficients or for linearly correlated packets that do not carry any valuable data. Instead, this paper investigates the goodput and the delay/goodput tradeoff, which only take into account the successfully decoded data.

Moreover, if we treat the data size of each packet, the generation size (the number of packets that are combined by NC as a group), and the NC coefficient Galois field as the configuration of NC, it is necessary to find the scaling laws of the optimal configuration for a given network model and transmission scheme.

Disadvantages:

Throughput loss.
The decoding loss.
Time delay.

Proposed System:

The proposed system starts from the basic idea of NC and the scaling laws of throughput loss and decoding loss. Some useful concepts and parameters are then listed. Finally, we give the definitions of some network performance metrics.

Physical-layer network coding (PNC) is designed based on the channel state information (CSI) and the network topology. PNC is appropriate for static networks since the CSI and the network topology are known in advance in the static case.

There are G nodes in one cell, and node i (i = 1, 2, ..., G) holds packet x_i. All of the G packets are independent, and they belong to the same unicast session. The packets are transmitted to a node i' in the next cell simultaneously. g_{ii'} is a complex number that represents the CSI between i and i' in the frequency domain.
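To illustrate simultaneous transmission, the sketch below assumes that the receiver i' observes the sum of the G transmitted symbols, each scaled by its channel coefficient. This superposition model and the omission of noise are assumptions made for illustration; they are not stated in the text above.

```python
def received_symbol(channel_gains, symbols):
    """Noise-free superposition at the receiver: sum over i of g_ii' * x_i.

    channel_gains -- complex CSI values g_ii', one per transmitting node
    symbols       -- transmitted symbols x_i, one per transmitting node
    """
    if len(channel_gains) != len(symbols):
        raise ValueError("one channel gain is needed per transmitted symbol")
    return sum(g * x for g, x in zip(channel_gains, symbols))


if __name__ == "__main__":
    # Hypothetical cell with G = 2 transmitting nodes.
    print(received_symbol([1 + 1j, 0.5j], [1, -1]))
```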

Advantages:

System minimizes data loss.
System reduces time delay.

Modules:

Network Topology:

The network consists of n static nodes that are randomly and evenly distributed in a unit square area. These nodes are randomly grouped into source-destination (S–D) pairs.
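Such a topology can be generated as in the following sketch. The pairing rule (shuffle the nodes, then pair consecutive ones, which requires an even n) is an assumption chosen for simplicity; the text does not specify how the S–D pairs are formed.

```python
import random


def random_topology(n, seed=None):
    """Place n nodes uniformly in the unit square and group them into S-D pairs."""
    if n % 2:
        raise ValueError("this sketch pairs nodes two by two, so n must be even")
    rng = random.Random(seed)
    # Uniform positions in [0, 1) x [0, 1).
    positions = [(rng.random(), rng.random()) for _ in range(n)]
    # Random grouping: shuffle the node indices and pair neighbours in the shuffled order.
    order = list(range(n))
    rng.shuffle(order)
    pairs = [(order[i], order[i + 1]) for i in range(0, n, 2)]
    return positions, pairs


if __name__ == "__main__":
    positions, pairs = random_topology(6, seed=1)
    for source, destination in pairs:
        print(source, positions[source], "->", destination, positions[destination])
```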

Transmission Model:

We adopt the protocol model, which is a simplified version of the physical model since it ignores long-distance interference and transmission. Moreover, it has been shown that the physical model can be treated as the protocol model on scaling laws when a transmission is allowed only if the signal-to-interference-plus-noise ratio (SINR) is larger than a given threshold.
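A commonly used formulation of the protocol model declares a transmission successful when the receiver is within range r of its transmitter and every other simultaneous transmitter is at least (1 + Δ)·r away from the receiver. The sketch below follows that formulation; the exact rule and the parameters r and Δ are assumptions, since the text does not give them.

```python
import math


def protocol_model_ok(tx, rx, other_txs, r, delta):
    """Check one transmission under a protocol-model style rule.

    tx, rx    -- (x, y) positions of the transmitter and the receiver
    other_txs -- positions of all other nodes transmitting at the same time
    r         -- transmission range (hypothetical parameter)
    delta     -- guard factor > 0 (hypothetical parameter)
    """
    if math.dist(tx, rx) > r:
        return False  # receiver is out of range
    guard_radius = (1 + delta) * r
    # Interferers outside the guard radius are ignored entirely.
    return all(math.dist(k, rx) >= guard_radius for k in other_txs)


if __name__ == "__main__":
    print(protocol_model_ok((0, 0), (0.1, 0), [(1.0, 0)], r=0.2, delta=1.0))
    print(protocol_model_ok((0, 0), (0.1, 0), [(0.2, 0)], r=0.2, delta=1.0))
```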

Transmission Schemes for Mobile Networks:

When the relay receives a new packet, it combines the packet it holds with the one it receives using randomly selected coefficients and generates a new packet. Simultaneous transmission in one cell is not allowed since it is hard for the receiver to obtain the CSI of several transmitters at the same time. Hence, we employ random linear NC for the mobile models.
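The relay operation can be sketched as follows. For simplicity the sketch works over a prime field GF(257) rather than a binary extension field, and each packet carries its global coefficient vector next to its payload; both choices are assumptions made for illustration.

```python
import random

Q = 257  # hypothetical prime field size; arithmetic is done modulo Q


def combine(held, received, rng):
    """Mix two coded packets with random coefficients from GF(Q).

    Each packet is a pair (coeffs, payload): coeffs records how the payload was
    built from the original packets, payload is a list of symbols in [0, Q).
    """
    a = rng.randrange(1, Q)  # random nonzero coefficient for the held packet
    b = rng.randrange(1, Q)  # random nonzero coefficient for the received packet
    coeffs = [(a * c1 + b * c2) % Q for c1, c2 in zip(held[0], received[0])]
    payload = [(a * s1 + b * s2) % Q for s1, s2 in zip(held[1], received[1])]
    return coeffs, payload


if __name__ == "__main__":
    x1, x2 = [10, 20, 30], [1, 2, 3]  # two original packets (generation size 2)
    mixed = combine(([1, 0], x1), ([0, 1], x2), random.Random(7))
    print(mixed)
```

The coefficient vector travels with the packet, which is one source of the throughput loss discussed earlier: a larger field or generation size means more bits spent on coefficients.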

Conclusion:

We analyzed the NC configuration in both static and mobile ad hoc networks to optimize the delay/goodput tradeoff and the goodput, taking into account the throughput loss and decoding loss of NC. These results reveal the impact of network scale on the NC system, which has not been studied in previous works. Moreover, we also compared the performance with the corresponding networks without NC.

On Traffic-Aware Partition and Aggregation in MapReduce for Big Data Applications

05/08/201902/07/2019 by admin

To reduce network traffic within a MapReduce job, we consider aggregating data with the same keys before sending them to remote reduce tasks. Although a similar function, called combine, has already been adopted by Hadoop, it operates immediately after a map task and only on that task's own data, failing to exploit the data aggregation opportunities among multiple tasks on different machines. We jointly consider data partition and aggregation for a MapReduce job with the objective of minimizing the total network traffic. In particular, we propose a distributed algorithm for big data applications by decomposing the original large-scale problem into several subproblems that can be solved in parallel. Moreover, an online algorithm is designed to deal with the data partition and aggregation in a dynamic manner. Finally, extensive simulation results demonstrate that our proposals can significantly reduce network traffic cost in both offline and online cases.

1.2 INTRODUCTION

MapReduce has emerged as the most popular computing framework for big data processing due to its simple programming model and automatic management of parallel execution. MapReduce and its open source implementation Hadoop have been adopted by leading companies, such as Yahoo!, Google and Facebook, for various big data applications, such as machine learning, bioinformatics and cybersecurity. MapReduce divides a computation into two main phases, namely map and reduce, which in turn are carried out by several map tasks and reduce tasks, respectively. In the map phase, map tasks are launched in parallel to convert the original input splits into intermediate data in the form of key/value pairs. These key/value pairs are stored on the local machine and organized into multiple data partitions, one per reduce task. In the reduce phase, each reduce task fetches its own share of data partitions from all map tasks to generate the final result.

There is a shuffle step between the map and reduce phases.

In this step, the data produced by the map phase are ordered, partitioned and transferred to the appropriate machines executing the reduce phase. The resulting all-to-all traffic pattern from map tasks to reduce tasks can cause a great volume of network traffic, imposing a serious constraint on the efficiency of data analytic applications. For example, with tens of thousands of machines, data shuffling accounts for 58.6% of the cross-pod traffic and amounts to over 200 petabytes in total in the analysis of SCOPE jobs. For shuffle-heavy MapReduce tasks, the high traffic could incur a considerable performance overhead of up to 30-40%. By default, intermediate data are shuffled according to a hash function in Hadoop, which leads to large network traffic because it ignores the network topology and the data size associated with each key.

We consider a toy example with two map tasks and two reduce tasks, where intermediate data of three keys K1, K2, and K3 are denoted by rectangle bars under each machine. If the hash function assigns data of K1 and K3 to reducer 1, and K2 to reducer 2, a large amount of traffic will go through the top switch. To tackle this problem caused by the traffic-oblivious partition scheme, we take into account both task locations and the data size associated with each key in this paper. By assigning keys with larger data size to reduce tasks closer to the map tasks that produce them, network traffic can be significantly reduced. In the same example, if we assign K1 and K3 to reducer 2, and K2 to reducer 1, as shown in Fig. 1(b), the data transferred through the top switch will be significantly reduced.
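The toy example can be reproduced numerically. In the sketch below the data sizes and distances are hypothetical (the figure's actual values are not given in the text), traffic cost is taken as data size times distance, and each key is greedily assigned to its cheapest reducer. This greedy rule is a simplification for illustration, not the algorithm proposed in the paper.

```python
# Hypothetical data sizes per map task and key.
SIZES = {
    "m1": {"K1": 2, "K2": 8, "K3": 1},
    "m2": {"K1": 9, "K2": 1, "K3": 7},
}
# Hypothetical distances: 0 on the same machine, 2 across the top switch.
DIST = {("m1", "r1"): 0, ("m1", "r2"): 2, ("m2", "r1"): 2, ("m2", "r2"): 0}
# Hash-style assignment described in the text.
HASH_ASSIGNMENT = {"K1": "r1", "K3": "r1", "K2": "r2"}


def traffic_cost(assignment, sizes, dist):
    """Total cost = sum of (data size) x (distance from map task to assigned reducer)."""
    return sum(
        size * dist[(mapper, assignment[key])]
        for mapper, per_key in sizes.items()
        for key, size in per_key.items()
    )


def traffic_aware_partition(sizes, dist, reducers):
    """Greedy sketch: send each key to the reducer with the lowest cost for that key."""
    keys = sorted({key for per_key in sizes.values() for key in per_key})
    assignment = {}
    for key in keys:
        def cost_at(reducer, key=key):
            return sum(per_key.get(key, 0) * dist[(m, reducer)] for m, per_key in sizes.items())
        assignment[key] = min(reducers, key=cost_at)
    return assignment


if __name__ == "__main__":
    aware = traffic_aware_partition(SIZES, DIST, ["r1", "r2"])
    print("hash :", HASH_ASSIGNMENT, traffic_cost(HASH_ASSIGNMENT, SIZES, DIST))
    print("aware:", aware, traffic_cost(aware, SIZES, DIST))
```

With these numbers the greedy rule moves K1 and K3 to reducer 2 and K2 to reducer 1, mirroring the assignment described in the text. The sketch ignores reducer load balancing, which a real partitioner would also have to consider.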

To further reduce network traffic within a MapReduce job, we consider aggregating data with the same keys before sending them to remote reduce tasks. Although a similar function, called combine, has already been adopted by Hadoop, it operates immediately after a map task and only on that task's own data, failing to exploit the data aggregation opportunities among multiple tasks on different machines. As an example shown in Fig. 2(a), in the traditional scheme, two map tasks individually send data of key K1 to the reduce task. If we aggregate the data of the same keys before sending them over the top switch, as shown in Fig. 2(b), the network traffic will be reduced.
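The effect of aggregating across map tasks can be sketched as follows. The sketch assumes that the reduce function is a sum, so that partial results can be merged early, and the record values are hypothetical.

```python
from collections import defaultdict


def aggregate(*map_outputs):
    """Merge key/value records from several map tasks, summing values per key."""
    merged = defaultdict(int)
    for output in map_outputs:
        for key, value in output:
            merged[key] += value
    return sorted(merged.items())


if __name__ == "__main__":
    map1 = [("K1", 3), ("K2", 1)]  # hypothetical output of map task 1
    map2 = [("K1", 4)]             # hypothetical output of map task 2
    print("records sent without aggregation:", len(map1) + len(map2))
    print("records sent with aggregation   :", len(aggregate(map1, map2)))
```

Hadoop's combine step would only merge records inside map1 or inside map2; here the two K1 records from different map tasks become one record before crossing the top switch.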

In this paper, we jointly consider data partition and aggregation for a MapReduce job with an objective that is to minimize the total network traffic. In particular, we propose a distributed algorithm for big data applications by decomposing the original large-scale problem into several subproblems that can be solved in parallel. Moreover, an online algorithm is designed to deal with the data partition and aggregation in a dynamic manner. Finally, extensive simulation results demonstrate that our proposals can significantly reduce network traffic cost in both offline and online cases.

1.3 LITERATURE SURVEY

MAPREDUCE: SIMPLIFIED DATA PROCESSING ON LARGE CLUSTERS

AUTHOR: J. Dean and S. Ghemawat

PUBLISH: Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.

EXPLANATION:

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google’s clusters every day.
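The programming model can be illustrated with a single-process word count. This is a minimal sketch of the map/group/reduce flow only; the function names are hypothetical, and nothing here reflects how Google's or Hadoop's runtime is actually implemented.

```python
from collections import defaultdict


def map_fn(_doc_id, text):
    """Map: emit an intermediate (word, 1) pair for every word in the document."""
    for word in text.split():
        yield word.lower(), 1


def reduce_fn(word, counts):
    """Reduce: merge all intermediate values that share the same key."""
    return word, sum(counts)


def run_mapreduce(documents):
    """Run map, group intermediate values by key, then run reduce per key."""
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in map_fn(doc_id, text):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in groups.items())


if __name__ == "__main__":
    print(run_mapreduce({"a": "the cat", "b": "The dog"}))
```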

CLOUDBLAST: COMBINING MAPREDUCE AND VIRTUALIZATION ON DISTRIBUTED RESOURCES FOR BIOINFORMATICS APPLICATIONS

AUTHOR: A. Matsunaga, M. Tsugawa, and J. Fortes

PUBLISH: IEEE Fourth International Conference on. IEEE, 2008, pp. 222–229.

EXPLANATION:

This paper proposes and evaluates an approach to the parallelization, deployment and management of bioinformatics applications that integrates several emerging technologies for distributed computing. The proposed approach uses the MapReduce paradigm to parallelize tools and manage their execution, machine virtualization to encapsulate their execution environments and commonly used data sets into flexibly deployable virtual machines, and network virtualization to connect resources behind firewalls/NATs while preserving the necessary performance and the communication environment. An implementation of this approach is described and used to demonstrate and evaluate the proposed approach. The implementation integrates Hadoop, Virtual Workspaces, and ViNe as the MapReduce, virtual machine and virtual network technologies, respectively, to deploy the commonly used bioinformatics tool NCBI BLAST on a WAN-based test bed consisting of clusters at two distinct locations, the University of Florida and the University of Chicago. This WAN-based implementation, called CloudBLAST, was evaluated against both non-virtualized and LAN-based implementations in order to assess the overheads of machine and network virtualization, which were shown to be insignificant. To compare the proposed approach against an MPI-based solution, CloudBLAST performance was experimentally contrasted against the publicly available mpiBLAST on the same WAN-based test bed. Both versions demonstrated performance gains as the number of available processors increased, with CloudBLAST delivering speedups of 57 against 52.4 of MPI version, when 64 processors on 2 sites were used. The results encourage the use of the proposed approach for the execution of large-scale bioinformatics applications on emerging distributed environments that provide access to computing resources as a service.

MAP TASK SCHEDULING IN MAPREDUCE WITH DATA LOCALITY: THROUGHPUT AND HEAVY-TRAFFIC OPTIMALITY

AUTHOR: W. Wang, K. Zhu, L. Ying, J. Tan, and L. Zhang

PUBLISH: INFOCOM, 2013 Proceedings IEEE. IEEE, 2013, pp. 1609–1617.

EXPLANATION:

Scheduling map tasks to improve data locality is crucial to the performance of MapReduce. Many works have been devoted to increasing data locality for better efficiency. However, to the best of our knowledge, fundamental limits of MapReduce computing clusters with data locality, including the capacity region and theoretical bounds on the delay performance, have not been studied. In this paper, we address these problems from a stochastic network perspective. Our focus is to strike the right balance between data-locality and load-balancing to simultaneously maximize throughput and minimize delay.

We present a new queueing architecture and propose a map task scheduling algorithm constituted by the Join the Shortest Queue policy together with the MaxWeight policy. We identify an outer bound on the capacity region, and then prove that the proposed algorithm stabilizes any arrival rate vector strictly within this outer bound. It shows that the algorithm is throughput optimal and the outer bound coincides with the actual capacity region. Further, we study the number of backlogged tasks under the proposed algorithm, which is directly related to the delay performance based on Little’s law. We prove that the proposed algorithm is heavy-traffic optimal, i.e., it asymptotically minimizes the number of backlogged tasks as the arrival rate vector approaches the boundary of the capacity region. Therefore, the proposed algorithm is also delay optimal in the heavy-traffic regime.

CHAPTER 2

2.0 SYSTEM ANALYSIS

2.1 EXISTING SYSTEM:

The existing work addresses the problem of optimizing network usage in MapReduce scheduling. The reason for the interest in network usage is twofold. Firstly, network utilization is a quantity of independent interest, as it is directly related to the throughput of the system. Note that the total amount of data processed in unit time is simply (CPU utilization) · (CPU capacity) + (network utilization) · (network capacity). CPU utilization will always be 1 as long as there are enough jobs in the map queue, but network utilization can be very sensitive to scheduling. Network utilization has been identified as a key component in the optimization of MapReduce systems in several previous works.

Secondly, optimizing network usage could lead us to algorithms with smaller mean response time. We find the main motivation for this direction of our work in earlier results on the overlap between the map and shuffle phases, in which two algorithms are shown to yield significantly better mean response time than Hadoop’s fair scheduler. However, we observed that neither of these two algorithms explicitly attempted to optimize network usage, which suggested room for improvement. Since MapReduce has become one of the most popular frameworks for large-scale distributed computing, there exists a huge body of work regarding performance optimization of MapReduce.

For instance, researchers have tried to optimize MapReduce systems by efficiently detecting and eliminating the so-called “stragglers”, providing better locality of data, preventing starvation caused by large jobs, and analyzing the problem from a purely theoretical viewpoint. The shuffle workload available at any given time is closely related to the output rate of the map phase, due to the inherent dependency between the map and shuffle phases. In particular, when the job that is being processed is ‘map-heavy,’ the available workload of the same job in the shuffle phase is upper-bounded by the output rate of the map phase. Therefore, poor scheduling of map tasks can have adverse effects on the throughput of the shuffle phase, causing the network to be idle and the efficiency of the entire system to decrease.

2.1.1 DISADVANTAGES:

The existing model, called the overlapping tandem queue model, is a job-level model for MapReduce where the map and shuffle phases of the MapReduce framework are modeled as two queues that are put in tandem. Since it is a job-level model, each job is represented by only the map size and the shuffle size. This simplification is justified by the introduction of two main assumptions. The first assumption states that each job consists of a large number of small-sized tasks, which allows us to represent the progress of each phase by real numbers.

The job-level model offers two big advantages over the more complicated task-level models.

Firstly, it gives rise to algorithms that are much simpler than those of task-level models, which enhances their chances of being deployed in an actual system.

Secondly, the number of jobs in a system is often smaller than the number of tasks by several orders of magnitude, making the problem computationally much less strenuous. Note, however, that there are still some questions to be studied regarding the general applicability of the additional assumptions of the job-level model, which are interesting research questions in their own right.

2.2 PROPOSED SYSTEM:

A MapReduce resource allocation system has been proposed to enhance the performance of MapReduce jobs in the cloud by locating intermediate data on the local machines or on close-by physical machines. This locality-awareness reduces the network traffic generated in the shuffle phase in the cloud data center. However, little work has studied how to optimize the network performance of the shuffle process, which generates large amounts of data traffic in MapReduce jobs. A critical factor for the network performance in the shuffle phase is the intermediate data partition. The default scheme adopted by Hadoop is hash-based partition, which can yield unbalanced loads among reduce tasks because it is unaware of the data size associated with each key.

A fairness-aware key partition approach has been developed that keeps track of the distribution of intermediate keys’ frequencies and guarantees a fair distribution among reduce tasks. A combiner function has been introduced that reduces the amount of data to be shuffled and merged to reduce tasks. An in-mapper combining scheme has also been proposed, which exploits the fact that mappers can preserve state across the processing of multiple input key/value pairs and defer emission of intermediate data until all input records have been processed. Both proposals are constrained to a single map task, ignoring the data aggregation opportunities from multiple map tasks. Finally, a MapReduce-like system has been proposed to decrease the traffic by pushing aggregation from the edge into the network.

2.2.1 ADVANTAGES:

We compare our proposed distributed algorithm with the optimal solution obtained by solving the MILP formulation. Due to the high computational complexity of the MILP formulation, we consider small-scale problem instances with 10 keys in this set of simulations.

The performance of our distributed algorithm is very close to the optimal solution. Although network traffic cost increases as the number of keys grows for all algorithms, the performance enhancement of our proposed algorithms over the other two schemes becomes larger.

We also compare our distributed algorithm with the other two schemes under a default simulation setting with a number of parameters, and then study the performance by changing one parameter while fixing the others. We consider a MapReduce job with 100 keys, with the other parameters the same as above. The network traffic cost is an increasing function of the number of keys from 1 to 100 under all algorithms.

2.3 HARDWARE & SOFTWARE REQUIREMENTS:

2.3.1 HARDWARE REQUIREMENT:

- Processor – Pentium IV
- Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Key Board – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA

2.3.2 SOFTWARE REQUIREMENTS:

Operating System : Windows XP or Win7
Front End : JAVA JDK 1.7
Script : JavaScript
Tool : NetBeans 7
Document : MS-Office 2007

CHAPTER 3

3.0 SYSTEM DESIGN:

Data Flow Diagram / Use Case Diagram / Flow Diagram:

The DFD is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on these data, and the output data generated by the system.

The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components. These components are the system process, the data used by the process, an external entity that interacts with the system and the information flows in the system.

DFD shows how the information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.

A DFD may be used to represent a system at any level of abstraction. A DFD may be partitioned into levels that represent increasing information flow and functional detail.

NOTATION:

SOURCE OR DESTINATION OF DATA:

External sources or destinations, which may be people or organizations or other entities

DATA STORE:

Here the data referenced by a process is stored and retrieved.

PROCESS:

People, procedures or devices that produce data. The physical component is not identified.

DATA FLOW:

Data moves in a specific direction from an origin to a destination. The data flow is a “packet” of data.

MODELING RULES:

There are several common modeling rules when creating DFDs:

All processes must have at least one data flow in and one data flow out.
All processes should modify the incoming data, producing new forms of outgoing data.
Each data store must be involved with at least one data flow.
Each external entity must be involved with at least one data flow.
A data flow must be attached to at least one process.
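The structural rules above can be checked mechanically. The following is a minimal sketch that represents a DFD as sets of named processes, data stores and external entities plus a list of (source, target) flows; this representation and the element names are hypothetical. The rule that a process should modify its incoming data is semantic and is not checked here.

```python
def check_dfd(processes, stores, entities, flows):
    """Return a list of modeling-rule violations for a data flow diagram.

    flows is a list of (source, target) name pairs.
    """
    problems = []
    for process in sorted(processes):
        if not any(target == process for _, target in flows):
            problems.append(f"process '{process}' has no incoming data flow")
        if not any(source == process for source, _ in flows):
            problems.append(f"process '{process}' has no outgoing data flow")
    for node in sorted(set(stores) | set(entities)):
        if not any(node in flow for flow in flows):
            problems.append(f"'{node}' is not involved in any data flow")
    for source, target in flows:
        if source not in processes and target not in processes:
            problems.append(f"flow '{source}' -> '{target}' is not attached to a process")
    return problems


if __name__ == "__main__":
    flows = [("Source", "Partition"), ("Partition", "Intermediate data")]
    print(check_dfd({"Partition"}, {"Intermediate data"}, {"Source"}, flows))
```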

3.1 ARCHITECTURE DIAGRAM

3.2 DATAFLOW DIAGRAM

[Data flow diagram. Labels: Server, Access Layer, Cross Layer, Use Hash Partition, Traffic Aware Partition, Send data through head node, Mapper, Receiver, Aggregation Layer, Map Reducer, OHRA, OHNA]

UML DIAGRAMS:

3.2 USE CASE DIAGRAM:

[Use case diagram. Labels: Source, Destination, Establish connection, Send the data, Data send into destination, Data Aggregation Layer, Receive data, Neighbor Nodes, View data, Base station, Form the cluster]

3.3 CLASS DIAGRAM:

[Class diagram. Labels: Source, Base station, Destination, Move Nodes, System Address, System Address (), Data Send, Data Send (), Data info, Destination address, Transmitting (), Maintain Details, Verify (), Receive data (), View data (), Connection (), Node info, Data length, Hop routing ()]

3.4 SEQUENCE DIAGRAM:

[Sequence diagram. Labels: Source, Base station, Destination, Data Aggregation Layer, Traffic Aware Partition, Map Reducer, Establish communication, Connection established, Send data, Form routing, Routing Finished, Receiving Ack, Data received, Connection terminate]

3.5 ACTIVITY DIAGRAM:

[Activity diagram. Labels: Source, Destination, Base station, Aggregation Node, Map Reducer, Connection establish (True/False), Send data, Using Mapper, Data transfer, Receive Ack (True/False), Receive data, View data]

CHAPTER 4

4.0 IMPLEMENTATION:

ONLINE EXTENSION OF HRA AND HNA

In this section, we conduct extensive simulations to evaluate the performance of our proposed distributed algorithm DA. We compare DA with HNA, which is the default method in Hadoop. To the best of our knowledge, we are the first to propose an aggregator placement algorithm, so we also compare DA with HRA, which uses random aggregator placement. All simulation results are averaged over 30 random instances.

• HNA: Hash-based partition with No Aggregation. It exploits the traditional hash partitioning for the intermediate data, which are transferred to reducers without going through aggregators. It is the default method in Hadoop.

• HRA: Hash-based partition with Random Aggregation. It adds a random aggregator placement algorithm to traditional Hadoop. By randomly placing aggregators in the shuffle phase, it aims to reduce the network traffic cost compared with the traditional method in Hadoop.

We first compare our proposed distributed algorithm with the optimal solution obtained by solving the MILP formulation. Due to the high computational complexity of the MILP formulation, we consider small-scale problem instances with 10 keys in this set of simulations. Each key is associated with a random data size within [1, 50]. There are 20 mappers and 2 reducers on a cluster of 20 machines. The parameter α is set to 0.5. The distance between any two machines is randomly chosen within [1, 60]. As shown in Fig. 7, the performance of our distributed algorithm is very close to the optimal solution. Although network traffic cost increases as the number of keys grows for all algorithms, the performance enhancement of our proposed algorithms over the other two schemes becomes larger. When the number of keys is set to 10, the default algorithm HNA has a cost of 5.0 × 10^4 while the optimal solution costs only 2.7 × 10^4, a 46% traffic reduction.
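The simulation setting and the reported reduction can be sketched as follows. Several details are assumptions made for illustration: one mapper per machine, an independent data size for every (mapper, key) combination, symmetric distances, reducers placed on machines 0 and 1, and key index modulo the number of reducers as the hash partition. The sketch therefore will not reproduce the exact costs reported above; only the 46% figure is checked directly.

```python
import random


def make_instance(num_keys=10, num_machines=20, seed=0):
    """Random instance: data sizes in [1, 50], pairwise machine distances in [1, 60]."""
    rng = random.Random(seed)
    # sizes[m][k]: data size of key k produced by the mapper on machine m.
    sizes = [[rng.randint(1, 50) for _ in range(num_keys)] for _ in range(num_machines)]
    dist = [[0] * num_machines for _ in range(num_machines)]
    for a in range(num_machines):
        for b in range(a + 1, num_machines):
            dist[a][b] = dist[b][a] = rng.randint(1, 60)
    return sizes, dist


def hna_cost(sizes, dist, reducer_machines):
    """Hash partition, no aggregation: every record travels straight to its reducer."""
    total = 0
    for machine, per_key in enumerate(sizes):
        for key, size in enumerate(per_key):
            reducer = reducer_machines[key % len(reducer_machines)]
            total += size * dist[machine][reducer]
    return total


def traffic_reduction(baseline_cost, improved_cost):
    """Relative saving of improved_cost with respect to baseline_cost."""
    return (baseline_cost - improved_cost) / baseline_cost


if __name__ == "__main__":
    sizes, dist = make_instance()
    print("HNA cost on one random instance:", hna_cost(sizes, dist, [0, 1]))
    # Costs reported in the text: HNA 5.0e4, optimal 2.7e4.
    print("reported reduction:", round(100 * traffic_reduction(5.0e4, 2.7e4)), "%")
```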

4.1 ALGORITHM

DISTRIBUTED ALGORITHM

The problem above can be solved by highly efficient approximation algorithms, e.g., branch-and-bound, and fast off-the-shelf solvers, e.g., CPLEX, for moderate-sized input. An additional challenge arises in dealing with a MapReduce job for big data. In such a job, there are hundreds or even thousands of keys, each of which is associated with a set of variables (e.g., x_{ij}^p and y_k^p) and constraints in our formulation, leading to a large-scale optimization problem that can hardly be handled by existing algorithms and solvers in practice.

ONLINE ALGORITHM

We take the data size m_i^p and the data aggregation ratio α_j as input of our algorithms. In order to get their values, we need to wait for all mappers to finish before starting reduce tasks, or conduct estimation via profiling on a small set of data. In practice, map and reduce tasks may partially overlap in execution to increase system throughput, and it is difficult to estimate system parameters with high accuracy for big data applications. These observations motivate us to design an online algorithm that dynamically adjusts data partition and aggregation during the execution of map and reduce tasks.

4.2 MODULES:

SERVER CLIENTS:

DISTRIBUTED DATA:

SCHEDULING TASK:

NETWORK TRAFFIC TRACES:

MAPREDUCE TASK:

4.3 MODULE DESCRIPTION:

SERVER CLIENTS:

Client-server computing or networking is a distributed application architecture that partitions tasks or workloads between service providers (servers) and service requesters, called clients. Often clients and servers operate over a computer network on separate hardware. A server machine is a high-performance host that runs one or more server programs which share its resources with clients. A client does not share any of its resources, but requests a server's content or service function. Clients therefore initiate communication sessions with servers, which await (listen for) incoming requests.

DISTRIBUTED DATA:

We develop a distributed algorithm to solve the problem on multiple machines in a parallel manner. Our basic idea is to decompose the original large-scale problem into several distributively solvable subproblems that are coordinated by a high-level master problem. We jointly consider data partition and aggregation for a MapReduce job with an objective that is to minimize the total network traffic. In particular, we propose a distributed algorithm for big data applications by decomposing the original large-scale problem into several subproblems that can be solved in parallel. Moreover, an online algorithm is designed to deal with the data partition and aggregation in a dynamic manner. Finally, extensive simulation results demonstrate that our proposals can significantly reduce network traffic cost in both offline and online cases.

SCHEDULING TASK:

MapReduce divides a computation into two main phases, namely map and reduce, which in turn are carried out by several map tasks and reduce tasks, respectively. In the map phase, map tasks are launched in parallel to convert the original input splits into intermediate data in the form of key/value pairs. These key/value pairs are stored on the local machine and organized into multiple data partitions, one per reduce task. In the reduce phase, each reduce task fetches its own share of data partitions from all map tasks to generate the final result. There is a shuffle step between the map and reduce phases. In this step, the data produced by the map phase are ordered, partitioned and transferred to the appropriate machines executing the reduce phase. The resulting all-to-all traffic pattern from map tasks to reduce tasks can cause a great volume of network traffic, imposing a serious constraint on the efficiency of data analytic applications.

NETWORK TRAFFIC TRACES:

To reduce network traffic within a MapReduce job, we consider aggregating data with the same keys before sending them to remote reduce tasks. Although a similar function, called combiner, has already been adopted by Hadoop, it operates immediately after a map task and only on that task's own data, failing to exploit the data aggregation opportunities among multiple tasks on different machines. As an example shown in Fig. 2(a), in the traditional scheme, two map tasks individually send data of key K1 to the reduce task. If we aggregate the data of the same keys before sending them over the top switch, as shown in Fig. 2(b), the network traffic will be reduced. We tested the real network traffic cost in Hadoop using real data from the latest dump files in Wikimedia (http://dumps.wikimedia.org/enwiki/latest/). In the meantime, we executed our distributed algorithm using the same data source for comparison. Since our distributed algorithm is based on a known aggregation ratio α, we have done some experiments to evaluate it in the Hadoop environment.

MAPREDUCE TASK:

We focus on MapReduce performance improvement by optimizing its data transmission optimizing network usage can lead to better system performance and found that high network utilization and low network congestion should be achieved simultaneously for a job with good performance. MapReduce resource allocation system, to enhance the performance of MapReduce jobs in the cloud by locating intermediate data to the local machines or close-by physical machines locality-awareness reduces network traffic in the shuffle phase generated in the cloud data center. However, little work has studied to optimize network performance of the shuffle process that generates large amounts of data traffic in MapReduce jobs. A critical factor to the network performance in the shuffle phase is the intermediate data partition. The default scheme adopted by Hadoop is hash-based partition that would yield unbalanced loads among reduce tasks due to its unawareness of the data size associated with each key.

To overcome this shortcoming, we have developed a fairness-aware key partition approach that keeps track of the distribution of intermediate keys’ frequencies and guarantees a fair distribution among reduce tasks. In addition to data partition, many efforts have been made on local aggregation, in-mapper combining and in-network aggregation to reduce network traffic within MapReduce jobs. A combiner function has been introduced that reduces the amount of data to be shuffled and merged to reduce tasks. An in-mapper combining scheme has also been proposed; it exploits the fact that mappers can preserve state across the processing of multiple input key/value pairs and defer emission of intermediate data until all input records have been processed. Both proposals are constrained to a single map task, ignoring the data aggregation opportunities from multiple map tasks. A MapReduce-like system has also been proposed to decrease traffic by pushing aggregation from the edge into the network.
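In-mapper combining, as described above, can be sketched in plain Java without the Hadoop API. The class and method names are hypothetical, and word counting is assumed as the job; the point is only that state is kept across records and emission is deferred until the input is exhausted.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of in-mapper combining for a word-count style map task. */
class InMapperCombiningSketch {
    // State preserved across calls to map(): one running count per distinct key.
    private final Map<String, Integer> partialCounts = new HashMap<>();

    /** Processes one input record; nothing is emitted yet. */
    void map(String line) {
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                partialCounts.merge(word, 1, Integer::sum);
            }
        }
    }

    /** Called once after the last record: emits one pair per distinct key and resets the state. */
    Map<String, Integer> close() {
        Map<String, Integer> emitted = new HashMap<>(partialCounts);
        partialCounts.clear();
        return emitted;
    }

    public static void main(String[] args) {
        InMapperCombiningSketch mapper = new InMapperCombiningSketch();
        mapper.map("to be or not to be");
        mapper.map("to do");
        // Eight words were read, but only five key/value pairs are emitted.
        System.out.println(mapper.close());
    }
}
```

The sketch still aggregates within a single map task only, which is exactly the limitation noted above.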

CHAPTER 5

5.0 SYSTEM STUDY:

5.1 FEASIBILITY STUDY:

The feasibility of the project is analyzed in this phase, and a business proposal is put forth with a very general plan for the project and some cost estimates. During system analysis, the feasibility study of the proposed system is carried out. This is to ensure that the proposed system is not a burden to the company. For feasibility analysis, some understanding of the major requirements for the system is essential.

Three key considerations involved in the feasibility analysis are

ECONOMICAL FEASIBILITY
TECHNICAL FEASIBILITY
SOCIAL FEASIBILITY

5.1.1 ECONOMICAL FEASIBILITY:

This study is carried out to check the economic impact that the system will have on the organization. The amount of funds that the company can put into the research and development of the system is limited, so the expenditures must be justified. The developed system is well within the budget, which was achieved because most of the technologies used are freely available. Only the customized products had to be purchased.

5.1.2 TECHNICAL FEASIBILITY

This study is carried out to check the technical feasibility, that is, the technical requirements of the system. Any system developed must not place a high demand on the available technical resources, as this would in turn place high demands on the client. The developed system must have modest requirements, as only minimal or no changes are required to implement it.

5.1.3 SOCIAL FEASIBILITY:

This aspect of the study checks the level of acceptance of the system by the user. This includes the process of training the user to use the system efficiently. The user must not feel threatened by the system, but must instead accept it as a necessity. The level of acceptance by users depends solely on the methods that are employed to educate them about the system and to make them familiar with it. Their level of confidence must be raised so that they are also able to offer constructive criticism, which is welcomed, as they are the final users of the system.

5.2 SYSTEM TESTING:

Errors that are not caught by testing may surface only long after they were introduced. This creates two problems: the time lag between the cause and the appearance of the problem, and the effect of the system errors on the files and records within the system. A small system error can conceivably explode into a much larger problem. Effective testing early in the process translates directly into long-term cost savings from a reduced number of errors. Another reason for system testing is its utility as a user-oriented vehicle before implementation. The best programs are worthless if they do not produce the correct outputs.

5.2.1 UNIT TESTING:

Description	Expected result
Test for application window properties.	All the properties of the windows are to be properly aligned and displayed.
Test for mouse operations.	All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions.

A program represents the logical elements of a system. For a program to run satisfactorily, it must compile, handle test data correctly, and tie in properly with other programs. Achieving an error-free program is the responsibility of the programmer. Program testing checks for two types of errors: syntax and logic. A syntax error is a program statement that violates one or more rules of the language in which it is written. An improperly defined field dimension or an omitted keyword is a common syntax error. These errors are reported through error messages generated by the computer. For logic errors, the programmer must examine the output carefully.

5.2.2 FUNCTIONAL TESTING:

Functional testing of an application is used to prove that the application delivers correct results, using enough inputs to give an adequate level of confidence that it will work correctly for all sets of inputs. Functional testing needs to prove that the application works for each client type and that the personalization functions work correctly. When a program is tested, the actual output is compared with the expected output. When there is a discrepancy, the sequence of instructions must be traced to determine the problem. The process is facilitated by breaking the program into self-contained portions, each of which can be checked at certain key points. The idea is to compare program values against desk-calculated values to isolate the problems.

Description	Expected result
Test for all modules.	All peers should communicate in the group.
Test for various peers in a distributed network framework, which should display all users available in the group.	The result after execution should be accurate.

5.2.3 NON-FUNCTIONAL TESTING:

Non-functional software testing encompasses a rich spectrum of testing strategies, describing the expected results for every test case. It uses symbolic analysis techniques. This testing is used to check that an application will work in the operational environment. Non-functional testing includes:

Load testing
Performance testing
Usability testing
Reliability testing
Security testing

5.2.4 LOAD TESTING:

An important tool for implementing system tests is a load generator. A load generator is essential for testing quality requirements such as performance and stress. A load can be a real load, that is, the system under test can be put to real usage by having actual telephone users connected to it. They will generate test input data for the system test.

Description	Expected result
It is necessary to ascertain that the application behaves correctly under loads when ‘Server busy’ response is received.	Should designate another active node as a Server.

5.2.5 PERFORMANCE TESTING:

Performance tests are utilized in order to determine the widely defined performance of the software system such as execution time associated with various parts of the code, response time and device utilization. The intent of this testing is to identify weak points of the software system and quantify its shortcomings.

Description	Expected result
This is required to assure that an application performs adequately, having the capability to handle many peers, delivering its results in the expected time and using an acceptable level of resources; it is an aspect of operational management.	Should handle large input values and produce accurate results in the expected time.

5.2.6 RELIABILITY TESTING:

Software reliability is the ability of a system or component to perform its required functions under stated conditions for a specified period of time, and this is what reliability testing ensures. Reliability testing can be described as exercising the software to reveal defects under testing conditions, according to the specified requirements. Reliability is the probability that a software system will operate without failure under given conditions for a given time interval, and it focuses on the behavior of the software element. It forms a part of the work of the software quality control team.

Description	Expected result
This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in providing the application.	In case of failure of the server, an alternate server should take over the job.

5.2.7 SECURITY TESTING:

Security testing evaluates system characteristics that relate to the availability, integrity and confidentiality of the system data and services. Users/Clients should be encouraged to make sure their security needs are very clearly known at requirements time, so that the security issues can be addressed by the designers and testers.

Description	Expected result
Checking that the user identification is authenticated.	In case of failure, the user should not be connected to the framework.
Check whether group keys in a tree are shared by all peers.	Peers in the same group should know the group key.

5.2.8 WHITE BOX TESTING:

White box testing, sometimes called glass-box testing, is a test case design method that uses the control structure of the procedural design to derive test cases. White box testing focuses on the inner structure of the software to be tested.

Description	Expected result
Exercise all logical decisions on their true and false sides.	All the logical decisions must be valid.
Execute all loops at their boundaries and within their operational bounds.	All the loops must be finite.
Exercise internal data structures to ensure their validity.	All the data structures must be valid.

5.2.9 BLACK BOX TESTING:

Black box testing, also called behavioral testing, focuses on the functional requirements of the software. That is, black box testing enables the software engineer to derive sets of input conditions that will fully exercise all functional requirements for a program. Black box testing is not an alternative to white box techniques. Rather, it is a complementary approach that is likely to uncover a different class of errors than white box methods. Black box testing attempts to find errors by focusing on the inputs, outputs, and principal function of a software module. The starting point of black box testing is either a specification or code. The contents of the box are hidden, and the stimulated software should produce the desired results.

Description	Expected result
To check for incorrect or missing functions.	All the functions must be valid.
To check for interface errors.	The entire interface must function normally.
To check for errors in data structures or external database access.	Database updates and retrieval must complete successfully.
To check for initialization and termination errors.	All the functions and data structures must be initialized properly and terminated normally.

All of the above system testing strategies are carried out, as the development, documentation and institutionalization of the proposed goals and related policies are essential.

CHAPTER 6

6.0 SOFTWARE DESCRIPTION:

6.1 JAVA TECHNOLOGY:

Java technology is both a programming language and a platform.

The Java Programming Language

The Java programming language is a high-level language that can be characterized by all of the following buzzwords:

- Simple
- Architecture neutral
- Object oriented
- Portable
- Distributed
- High performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure

With most programming languages, you either compile or interpret a program so that you can run it on your computer. The Java programming language is unusual in that a program is both compiled and interpreted. With the compiler, first you translate a program into an intermediate language called Java byte codes —the platform-independent codes interpreted by the interpreter on the Java platform. The interpreter parses and runs each Java byte code instruction on the computer. Compilation happens just once; interpretation occurs each time the program is executed. The following figure illustrates how this works.

You can think of Java byte codes as the machine code instructions for the Java Virtual Machine (Java VM). Every Java interpreter, whether it’s a development tool or a Web browser that can run applets, is an implementation of the Java VM. Java byte codes help make “write once, run anywhere” possible. You can compile your program into byte codes on any platform that has a Java compiler. The byte codes can then be run on any implementation of the Java VM. That means that as long as a computer has a Java VM, the same program written in the Java programming language can run on Windows 2000, a Solaris workstation, or on an iMac.

6.2 THE JAVA PLATFORM:

A platform is the hardware or software environment in which a program runs. We’ve already mentioned some of the most popular platforms like Windows 2000, Linux, Solaris, and MacOS. Most platforms can be described as a combination of the operating system and hardware. The Java platform differs from most other platforms in that it’s a software-only platform that runs on top of other hardware-based platforms.

The Java platform has two components:

The Java Virtual Machine (Java VM)
The Java Application Programming Interface (Java API)

You’ve already been introduced to the Java VM. It’s the base for the Java platform and is ported onto various hardware-based platforms.

The Java API is a large collection of ready-made software components that provide many useful capabilities, such as graphical user interface (GUI) widgets. The Java API is grouped into libraries of related classes and interfaces; these libraries are known as packages. The next section, What Can Java Technology Do?, highlights the functionality that some of the packages in the Java API provide.

The following figure depicts a program that’s running on the Java platform. As the figure shows, the Java API and the virtual machine insulate the program from the hardware.

Native code is code that, once compiled, runs on a specific hardware platform. As a platform-independent environment, the Java platform can be a bit slower than native code. However, smart compilers, well-tuned interpreters, and just-in-time byte code compilers can bring performance close to that of native code without threatening portability.

6.3 WHAT CAN JAVA TECHNOLOGY DO?

The most common types of programs written in the Java programming language are applets and applications. If you’ve surfed the Web, you’re probably already familiar with applets. An applet is a program that adheres to certain conventions that allow it to run within a Java-enabled browser.

However, the Java programming language is not just for writing cute, entertaining applets for the Web. The general-purpose, high-level Java programming language is also a powerful software platform. Using the generous API, you can write many types of programs.

An application is a standalone program that runs directly on the Java platform. A special kind of application known as a server serves and supports clients on a network. Examples of servers are Web servers, proxy servers, mail servers, and print servers. Another specialized program is a servlet.

A servlet can almost be thought of as an applet that runs on the server side. Java Servlets are a popular choice for building interactive web applications, replacing the use of CGI scripts. Servlets are similar to applets in that they are runtime extensions of applications. Instead of working in browsers, though, servlets run within Java Web servers, configuring or tailoring the server.

How does the API support all these kinds of programs? It does so with packages of software components that provide a wide range of functionality. Every full implementation of the Java platform gives you the following features:

The essentials: Objects, strings, threads, numbers, input and output, data structures, system properties, date and time, and so on.
Applets: The set of conventions used by applets.
Networking: URLs, TCP (Transmission Control Protocol), UDP (User Datagram Protocol) sockets, and IP (Internet Protocol) addresses.
Internationalization: Help for writing programs that can be localized for users worldwide. Programs can automatically adapt to specific locales and be displayed in the appropriate language.
Security: Both low level and high level, including electronic signatures, public and private key management, access control, and certificates.
Software components: Known as JavaBeans™, these can plug into existing component architectures.
Object serialization: Allows lightweight persistence and communication via Remote Method Invocation (RMI).
Java Database Connectivity (JDBC™): Provides uniform access to a wide range of relational databases.

The Java platform also has APIs for 2D and 3D graphics, accessibility, servers, collaboration, telephony, speech, animation, and more. The following figure depicts what is included in the Java 2 SDK.

6.4 HOW WILL JAVA TECHNOLOGY CHANGE MY LIFE?

We can’t promise you fame, fortune, or even a job if you learn the Java programming language. Still, it is likely to make your programs better and require less effort than other languages. We believe that Java technology will help you do the following:

Get started quickly: Although the Java programming language is a powerful object-oriented language, it’s easy to learn, especially for programmers already familiar with C or C++.
Write less code: Comparisons of program metrics (class counts, method counts, and so on) suggest that a program written in the Java programming language can be four times smaller than the same program in C++.
Write better code: The Java programming language encourages good coding practices, and its garbage collection helps you avoid memory leaks. Its object orientation, its JavaBeans component architecture, and its wide-ranging, easily extendible API let you reuse other people’s tested code and introduce fewer bugs.
Develop programs more quickly: Your development time may be as much as twice as fast versus writing the same program in C++. Why? You write fewer lines of code and it is a simpler programming language than C++.
Avoid platform dependencies with 100% Pure Java: You can keep your program portable by avoiding the use of libraries written in other languages. The 100% Pure Java™ Product Certification Program has a repository of historical process manuals, white papers, brochures, and similar materials online.
Write once, run anywhere: Because 100% Pure Java programs are compiled into machine-independent byte codes, they run consistently on any Java platform.
Distribute software more easily: You can upgrade applets easily from a central server. Applets take advantage of the feature of allowing new classes to be loaded “on the fly,” without recompiling the entire program.

6.5 ODBC:

Microsoft Open Database Connectivity (ODBC) is a standard programming interface for application developers and database systems providers. Before ODBC became a de facto standard for Windows programs to interface with database systems, programmers had to use proprietary languages for each database they wanted to connect to. Now, ODBC has made the choice of the database system almost irrelevant from a coding perspective, which is as it should be. Application developers have much more important things to worry about than the syntax that is needed to port their program from one database to another when business needs suddenly change.

Through the ODBC Administrator in Control Panel, you can specify the particular database that is associated with a data source that an ODBC application program is written to use. Think of an ODBC data source as a door with a name on it. Each door will lead you to a particular database. For example, the data source named Sales Figures might be a SQL Server database, whereas the Accounts Payable data source could refer to an Access database. The physical database referred to by a data source can reside anywhere on the LAN.

The ODBC system files are not installed on your system by Windows 95. Rather, they are installed when you setup a separate database application, such as SQL Server Client or Visual Basic 4.0. When the ODBC icon is installed in Control Panel, it uses a file called ODBCINST.DLL. It is also possible to administer your ODBC data sources through a stand-alone program called ODBCADM.EXE. There is a 16-bit and a 32-bit version of this program and each maintains a separate list of ODBC data sources.

From a programming perspective, the beauty of ODBC is that the application can be written to use the same set of function calls to interface with any data source, regardless of the database vendor. The source code of the application doesn’t change whether it talks to Oracle or SQL Server. We only mention these two as an example. There are ODBC drivers available for several dozen popular database systems. Even Excel spreadsheets and plain text files can be turned into data sources. The operating system uses the Registry information written by ODBC Administrator to determine which low-level ODBC drivers are needed to talk to the data source (such as the interface to Oracle or SQL Server). The loading of the ODBC drivers is transparent to the ODBC application program. In a client/server environment, the ODBC API even handles many of the network issues for the application programmer.

The advantages of this scheme are so numerous that you are probably thinking there must be some catch. The only disadvantage of ODBC is that it isn’t as efficient as talking directly to the native database interface. ODBC has had many detractors make the charge that it is too slow. Microsoft has always claimed that the critical factor in performance is the quality of the driver software that is used. In our humble opinion, this is true. The availability of good ODBC drivers has improved a great deal recently. And anyway, the criticism about performance is somewhat analogous to those who said that compilers would never match the speed of pure assembly language. Maybe not, but the compiler (or ODBC) gives you the opportunity to write cleaner programs, which means you finish sooner. Meanwhile, computers get faster every year.

6.6 JDBC:

In an effort to set an independent database standard API for Java, Sun Microsystems developed Java Database Connectivity, or JDBC. JDBC offers a generic SQL database access mechanism that provides a consistent interface to a variety of RDBMSs. This consistent interface is achieved through the use of “plug-in” database connectivity modules, or drivers. A database vendor that wishes to have JDBC support must provide the driver for each platform that the database and Java run on.

To gain a wider acceptance of JDBC, Sun based JDBC’s framework on ODBC. As you discovered earlier in this chapter, ODBC has widespread support on a variety of platforms. Basing JDBC on ODBC will allow vendors to bring JDBC drivers to market much faster than developing a completely new connectivity solution.

JDBC was announced in March of 1996. It was released for a 90 day public review that ended June 8, 1996. Because of user input, the final JDBC v1.0 specification was released soon after.

The remainder of this section will cover enough information about JDBC for you to know what it is about and how to use it effectively. This is by no means a complete overview of JDBC. That would fill an entire book.

6.7 JDBC Goals:

Few software packages are designed without goals in mind. JDBC is one whose many goals drove the development of the API. These goals, in conjunction with early reviewer feedback, have finalized the JDBC class library into a solid framework for building database applications in Java.

The goals that were set for JDBC are important. They will give you some insight as to why certain classes and functionalities behave the way they do. The design goals for JDBC are as follows:

SQL Level API

The designers felt that their main goal was to define a SQL interface for Java. Although not the lowest database interface level possible, it is at a low enough level for higher-level tools and APIs to be created. Conversely, it is at a high enough level for application programmers to use it confidently. Attaining this goal allows for future tool vendors to “generate” JDBC code and to hide many of JDBC’s complexities from the end user.

SQL Conformance

SQL syntax varies as you move from database vendor to database vendor. In an effort to support a wide variety of vendors, JDBC will allow any query statement to be passed through it to the underlying database driver. This allows the connectivity module to handle non-standard functionality in a manner that is suitable for its users.

JDBC must be implementable on top of common database interfaces

The JDBC SQL API must “sit” on top of other common SQL level APIs. This goal allows JDBC to use existing ODBC level drivers by the use of a software interface. This interface would translate JDBC calls to ODBC and vice versa.

Provide a Java interface that is consistent with the rest of the Java system

Because of Java’s acceptance in the user community thus far, the designers feel that they should not stray from the current design of the core Java system.

Keep it simple

This goal probably appears in all software design goal listings. JDBC is no exception. Sun felt that the design of JDBC should be very simple, allowing for only one method of completing a task per mechanism. Allowing duplicate functionality only serves to confuse the users of the API.

Use strong, static typing wherever possible

Strong typing allows more error checking to be done at compile time; as a result, fewer errors appear at runtime.

Keep the common cases simple

Because the SQL calls most often used by the programmer are simple SELECTs, INSERTs, DELETEs and UPDATEs, these queries should be simple to perform with JDBC. However, more complex SQL statements should also be possible.

Finally, we decided to proceed with the implementation using Java Networking.

To update the cache table dynamically, we use an MS Access database.

Java is two things: a programming language and a platform.

Java is a high-level programming language that is all of the following:

- Simple
- Architecture-neutral
- Object-oriented
- Portable
- Distributed
- High-performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure

Java is also unusual in that each Java program is both compiled and interpreted. With a compiler, you translate a Java program into an intermediate language called Java byte codes, the platform-independent code that the interpreter parses and runs on the computer.

Compilation happens just once; interpretation occurs each time the program is executed. The figure illustrates how this works.

Figure: Java Program -> Compiler -> Interpreter -> My Program

6.7 NETWORKING TCP/IP STACK:

The TCP/IP stack is shorter than the OSI one:

TCP is a connection-oriented protocol; UDP (User Datagram Protocol) is a connectionless protocol.

IP datagrams:

The IP layer provides a connectionless and unreliable delivery system. It considers each datagram independently of the others. Any association between datagrams must be supplied by the higher layers. The IP layer supplies a checksum that covers its own header. The header includes the source and destination addresses. The IP layer handles routing through an internet. It is also responsible for breaking up large datagrams into smaller ones for transmission and reassembling them at the other end.

UDP:

UDP is also connectionless and unreliable. What it adds to IP is a checksum for the contents of the datagram and port numbers. These are used to give a client/server model – see later.

TCP:

TCP supplies logic to give a reliable connection-oriented protocol above IP. It provides a virtual circuit that two processes can use to communicate.

Internet addresses

In order to use a service, you must be able to find it. The Internet uses an address scheme for machines so that they can be located. The address is a 32 bit integer which gives the IP address.

Network address:

Class A uses 8 bits for the network address with 24 bits left over for other addressing. Class B uses 16-bit network addressing. Class C uses 24-bit network addressing. Class D addresses are reserved for multicast and are not split into network and host parts.
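The classful scheme can be made concrete with a short Java sketch that derives the class from the first octet of an address. The class and method names are hypothetical; the octet ranges follow the standard classful convention, in which class D is the multicast range and class E is reserved.

```java
/** Sketch: classify an IPv4 address by the classful addressing scheme. */
class AddressClassSketch {

    /** Returns 'A' to 'E' based on the leading bits of the first octet. */
    static char addressClass(int firstOctet) {
        if (firstOctet < 0 || firstOctet > 255) {
            throw new IllegalArgumentException("octet out of range: " + firstOctet);
        }
        if (firstOctet < 128) return 'A'; // leading bit  0
        if (firstOctet < 192) return 'B'; // leading bits 10
        if (firstOctet < 224) return 'C'; // leading bits 110
        if (firstOctet < 240) return 'D'; // leading bits 1110 (multicast)
        return 'E';                       // leading bits 1111 (reserved)
    }

    /** Network-prefix length for classes A, B and C; -1 where no network/host split applies. */
    static int networkBits(char addressClass) {
        switch (addressClass) {
            case 'A': return 8;
            case 'B': return 16;
            case 'C': return 24;
            default:  return -1;
        }
    }

    public static void main(String[] args) {
        for (int octet : new int[] {10, 172, 192, 224}) {
            char c = addressClass(octet);
            System.out.println(octet + ".x.x.x -> class " + c + ", network bits: " + networkBits(c));
        }
    }
}
```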

Subnet address:

Internally, the UNIX network is divided into sub networks. Building 11 is currently on one sub network and uses 10-bit addressing, allowing 1024 different hosts.

Host address:

8 bits are finally used for host addresses within our subnet. This places a limit of 256 machines that can be on the subnet.

Total address:

The 32 bit address is usually written as 4 integers separated by dots.
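As a minimal illustration of the dotted notation, the following Java sketch converts between a 32-bit address held in an `int` and its four-integer form. The class and method names are hypothetical and the sample address is arbitrary.

```java
/** Sketch: convert between a 32-bit IPv4 address and dotted notation. */
class DottedQuadSketch {

    /** Splits the 32-bit value into four 8-bit fields, most significant first. */
    static String toDottedQuad(int address) {
        return ((address >>> 24) & 0xFF) + "."
                + ((address >>> 16) & 0xFF) + "."
                + ((address >>> 8) & 0xFF) + "."
                + (address & 0xFF);
    }

    /** Parses "a.b.c.d" back into a single 32-bit value. */
    static int fromDottedQuad(String text) {
        String[] parts = text.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("expected four fields: " + text);
        }
        int address = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("field out of range: " + part);
            }
            address = (address << 8) | octet;
        }
        return address;
    }

    public static void main(String[] args) {
        int address = 0xC0A80001; // arbitrary sample value
        String text = toDottedQuad(address);
        System.out.println(text);
        System.out.println(fromDottedQuad(text) == address);
    }
}
```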

Port addresses

A service exists on a host, and is identified by its port. This is a 16 bit number. To send a message to a server, you send it to the port for that service of the host that it is running on. This is not location transparency! Certain of these ports are “well known”.

Sockets:

A socket is a data structure maintained by the system to handle network connections. A socket is created using the call socket. It returns an integer that is like a file descriptor. In fact, under Windows, this handle can be used with the ReadFile and WriteFile functions.

#include <sys/types.h>

#include <sys/socket.h>

int socket(int family, int type, int protocol);

Here “family” will be AF_INET for IP communications, protocol will be zero, and type will depend on whether TCP or UDP is used. Two processes wishing to communicate over a network create a socket each. These are similar to two ends of a pipe – but the actual pipe does not yet exist.
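Since the project is implemented with Java networking, the same idea can be sketched with the `java.net` classes rather than the C call above. This is a minimal illustration, not the project's code: the class name, the loopback address, the one-shot server thread and the "echo: " prefix are all assumptions. `ServerSocket` and `Socket` create TCP sockets, so they correspond to the TCP choice of the type argument.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

/** Sketch: two TCP sockets on the loopback interface exchanging one line of text. */
class EchoSocketSketch {

    private static BufferedReader reader(Socket s) throws IOException {
        return new BufferedReader(new InputStreamReader(s.getInputStream(), StandardCharsets.UTF_8));
    }

    private static PrintWriter writer(Socket s) throws IOException {
        // autoFlush = true so println() pushes the line onto the connection immediately.
        return new PrintWriter(new OutputStreamWriter(s.getOutputStream(), StandardCharsets.UTF_8), true);
    }

    /** Starts a one-shot echo server, sends it one line, and returns the reply. */
    static String echoOnce(String message) {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // Port 0 asks the operating system for any free port.
        try (ServerSocket server = new ServerSocket(0, 1, loopback)) {
            Thread serverThread = new Thread(() -> {
                try (Socket peer = server.accept();
                     BufferedReader in = reader(peer);
                     PrintWriter out = writer(peer)) {
                    out.println("echo: " + in.readLine());
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            serverThread.start();

            // The client end: connecting is what establishes the "pipe" between the two sockets.
            try (Socket client = new Socket(loopback, server.getLocalPort());
                 PrintWriter out = writer(client);
                 BufferedReader in = reader(client)) {
                out.println(message);
                String reply = in.readLine();
                serverThread.join();
                return reply;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the server", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(echoOnce("hello"));
    }
}
```

Each process (here, each thread) owns one socket, and the server is reached through its host address and port number, matching the port addressing described earlier.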

6.8 JFREE CHART:

JFreeChart is a free 100% Java chart library that makes it easy for developers to display professional quality charts in their applications. JFreeChart’s extensive feature set includes:

A consistent and well-documented API, supporting a wide range of chart types;

A flexible design that is easy to extend, and targets both server-side and client-side applications;

Support for many output types, including Swing components, image files (including PNG and JPEG), and vector graphics file formats (including PDF, EPS and SVG);

JFreeChart is “open source” or, more specifically, free software. It is distributed under the terms of the GNU Lesser General Public Licence (LGPL), which permits use in proprietary applications.

6.8.1. Map Visualizations:

Charts showing values that relate to geographical areas. Some examples include: (a) population density in each state of the United States, (b) income per capita for each country in Europe, (c) life expectancy in each country of the world. The tasks in this project include: Sourcing freely redistributable vector outlines for the countries of the world, states/provinces in particular countries (USA in particular, but also other areas);

Creating an appropriate dataset interface (plus default implementation) and a renderer, and integrating these with the existing XYPlot class in JFreeChart; testing, documenting, testing some more, documenting some more.

6.8.2. Time Series Chart Interactivity

Implement a new (to JFreeChart) feature for interactive time series charts — to display a separate control that shows a small version of ALL the time series data, with a sliding “view” rectangle that allows you to select the subset of the time series data to display in the main chart.

6.8.3. Dashboards

There is currently a lot of interest in dashboard displays. Create a flexible dashboard mechanism that supports a subset of JFreeChart chart types (dials, pies, thermometers, bars, and lines/time series) that can be delivered easily via both Java Web Start and an applet.

6.8.4. Property Editors

The property editor mechanism in JFreeChart only handles a small subset of the properties that can be set for charts. Extend (or reimplement) this mechanism to provide greater end-user control over the appearance of the charts.

CHAPTER 7

7.0 APPENDIX

7.1 SAMPLE SCREEN SHOTS:

7.2 SAMPLE SOURCE CODE:

CHAPTER 8

8.1 CONCLUSION

In this paper, we study the joint optimization of intermediate data partition and aggregation in MapReduce to minimize network traffic cost for big data applications. We propose a three-layer model for this problem and formulate it as a mixed-integer nonlinear problem, which is then transferred into a linear form that can be solved by mathematical tools. To deal with the large-scale formulation due to big data, we design a distributed algorithm to solve the problem on multiple machines. Furthermore, we extend our algorithm to handle the MapReduce job in an online manner when some system parameters are not given. Finally, we conduct extensive simulations to evaluate our proposed algorithm under both offline cases and online cases. The simulation results demonstrate that our proposals can effectively reduce network traffic cost under various network settings.
CHAPTER 9

Network-Based Modeling and Intelligent Data Mining of Social Media for Improving Care

05/08/201902/07/2019 by admin

Abstract: Intelligently extracting knowledge from social media has recently attracted great interest from the Biomedical and Health Informatics community to simultaneously improve healthcare outcomes and reduce costs using consumer-generated opinion. We propose a two-step analysis framework that focuses on positive and negative sentiment, as well as the side effects of treatment, in users’ forum posts, and identifies user communities (modules) and influential users for the purpose of ascertaining user opinion of cancer treatment. We used a self-organizing map to analyze word frequency data derived from users’ forum posts. We then introduced a novel network-based approach for modeling users’ forum interactions and employed a network partitioning method based on optimizing a stability quality measure. This allowed us to determine consumer opinion and identify influential users within the retrieved modules using information derived from both word-frequency data and network-based properties. Our approach can expand research into intelligently mining social media data for consumer opinion of various treatments to provide rapid, up-to-date information for the pharmaceutical industry, hospitals, and medical staff on the effectiveness (or ineffectiveness) of future treatments.

Index Terms: Data mining, complex networks, neural networks, semantic web, social computing.

I. INTRODUCTION

Social media is providing limitless opportunities for patients to discuss their experiences with drugs and devices, and for companies to receive feedback on their products and services [1]–[3]. Pharmaceutical companies are prioritizing social network monitoring within their IT departments, creating an opportunity for rapid dissemination and feedback of products and services to optimize and enhance delivery, increase turnover and profit, and reduce costs [4].
Social media data harvesting for bio-surveillance has also been reported [5].

Social media enables a virtual networking environment. Modeling social media using available network modeling and computational tools is one way of extracting knowledge and trends from the information ‘cloud’: a social network is a structure made of nodes and edges that connect nodes in various relationships. Graphical representation is the most common method to visually represent the information. Network modeling could also be used for studying the simulation of network properties and its internal dynamics.

Manuscript received January 24, 2014; revised May 4, 2014 and June 19, 2014; accepted June 30, 2014. Date of publication July 10, 2014; date of current version December 30, 2014. A. Akay and B.-E. Erlandsson are with the School of Technology and Health, Royal Institute of Technology, Stockholm SE-141 52, Sweden (e-mail: altu@kth.se; bjorn-erik.erlandsson@sth.kth.se). A. Dragomir is with the Department of Biomedical Engineering, University of Houston, Houston, TX 77204-5060 USA (e-mail: adragomir@uh.edu). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/JBHI.2014.2336251

A sociomatrix can be used to construct representations of a social network structure. Node degree, network density, and other large-scale parameters can yield information about the importance of certain entities within the network. Communities within the network are referred to as clusters or modules. Specific algorithms can perform network clustering, one of the fundamental tasks in network analysis. Detecting particular user communities requires identifying specific, networked nodes that will allow information extraction. Healthcare providers could use patient opinion to improve their services. Physicians could collect feedback from other doctors and patients to improve their treatment recommendations and results.
Patients could use other consumers’ knowledge in making better-informed healthcare decisions.

The nature of social networks makes data collection difficult. Several methods have been employed, such as link mining [6], classification through links [7], predictions based on objects [8], links [9], existence [10], estimation [11], object [12], group [13], and subgroup detection [14], and mining the data [15], [16]. Link prediction, viral marketing, and online discussion groups (and rankings) allow for the development of solutions based on user feedback.

Traditional social sciences use surveys and involve subjects in the data collection process, resulting in small sample sizes per study. With social media, more content is readily available, particularly when combined with web-crawling and scraping software that would allow real-time monitoring of changes within the network.

Previous studies used technical solutions to extract user sentiment on influenza [17], technology stocks [18], context and sentence structure [19], online shopping [20], multiple classifications [21], government health monitoring [22], specific terms relating to consumer satisfaction [23], polarity of newspaper articles [24], and assessment of user satisfaction from companies [25], [26]. Despite the extensive literature, none have identified influential users or how forum relationships affect network dynamics.

In the first stage of our current study, we employ exploratory analysis using self-organizing maps (SOMs) to assess correlations between user posts and positive or negative opinion on the drug. In the second stage, we model the users and their posts using a network-based approach. We build on our previous study [27] and use an enhanced method for identifying user communities (modules) and influential users therein. The current approach effectively searches for potential levels of organization (scales) within the networks and uncovers dense modules using a partition stability quality measure [28]. The approach enables us to find the optimal network partition. We subsequently enrich the retrieved modules with word frequency information from module-contained users’ posts to derive local and global measures of users’ opinion and raise flags on potential side effects of Erlotinib, a drug used in the treatment of one of the most prevalent cancers: lung cancer [29].

2168-2194 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

AKAY et al.: NETWORK-BASED MODELING AND INTELLIGENT DATA MINING OF SOCIAL MEDIA FOR IMPROVING CARE

Fig. 1. Processing tree in Rapidminer to ascertain the TF-IDF scores of words in the data.

II. METHODS

A. Initial Data Search and Collection

We first searched for the most popular cancer message boards. We initially focused on the number of posts on lung cancer. The table below gives the number of posts on lung cancer per forum:

Forums                      Posts on Lung Cancer
Cancer-forums.net           36 051
cancerforums.net            34 328
forums.stupidcancer.org     17
csn.cancer.org/forum        7959

We chose lung cancer because, according to the most recent statistics, it is the most commonly diagnosed cancer in the world for both sexes [30], and the second most prevalent cancer in the US for both sexes [31], [32]. We then compiled a list of drugs used by lung cancer patients to ascertain which drug was the most discussed in the forums. The drug Erlotinib (trade name Tarceva) was the most frequently discussed drug in the message boards. A further search revealed that Cancerforums.net, despite having slightly fewer posts on lung cancer, had more posts dedicated to Erlotinib than the other three message boards mentioned above.

Next, we performed a search of the drug, using both the trade name (Tarceva) and drug name (Erlotinib).
The trade name garnered more results (498) compared to the drug name (66). The search using the trade name returned 920 posts, from 2009 to the present date.

B. Initial Text Mining and Preprocessing

A Rapidminer (www.rapidminer.com) [33] data collection and processing tree was developed to look for the most common positive and negative words, and their term frequency-inverse document frequency (TF-IDF) scores within each post. Fig. 1 shows the data collection and processing tree. We initially uploaded the data into the first component (‘Read Excel’). The uploaded data was then processed in the second component (‘Process Documents to Data’) using several subcomponents (‘Extract Content’, ‘Tokenize’, ‘Transform Cases’, ‘Filter Stopwords’, and ‘Filter Tokens’, respectively) that filtered excess noise (misspelled words, common stop words, etc.) to ensure a uniform set of variables that can be measured. The final component (‘Processed Data’) contained the final word list, with each word carrying a specific TF-IDF score.

We then assigned weights to each of the words found in the user posts using the following formula:

\[
\mathrm{weight}_{t,d} =
\begin{cases}
\log(tf_{t,d} + 1)\,\log\dfrac{n}{x_t} & \text{if } tf_{t,d} \ge 1 \\
0 & \text{otherwise}
\end{cases}
\]

in which tf_{t,d} represents the frequency of word t in document d, n represents the number of documents within the entire collection, and x_t represents the number of documents where t occurs [30].

C. Cataloging and Tagging Text Data

Text data containing the highest TF-IDF scores were tagged with a modified NLTK toolkit (http://www.nltk.org/) [34] using MATLAB to ensure that they reflected the negativity of a negative word and the positivity of a positive word in context. This approach has been used before to place negative tags on positive words [35]; we additionally placed positive tags on negative words. We used the NLTK toolkit for the analysis and classification of words to match their exact meanings within the contextual settings.
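The term weight defined in Section II-B can be sketched in a few lines of Python. This is a minimal illustration, not the Rapidminer implementation used in the study: the function name `tfidf_weight`, the use of natural logarithms, and the example counts are assumptions made here.

```python
import math

def tfidf_weight(tf, n_docs, doc_freq):
    """Weight of one term in one document, following the formula in Section II-B.

    tf       -- occurrences of the term in the document (tf_{t,d})
    n_docs   -- number of documents in the collection (n)
    doc_freq -- number of documents containing the term (x_t)
    """
    if tf < 1:
        return 0.0  # the "otherwise" branch: the term is absent from the document
    # Natural logarithm is an assumption; the source does not state the base.
    return math.log(tf + 1) * math.log(n_docs / doc_freq)

# Hypothetical counts: a word used 3 times in one post and found in 10 of 920 posts.
w = tfidf_weight(tf=3, n_docs=920, doc_freq=10)
```

A word that occurs in every document receives weight zero under this formula, because log(n / x_t) vanishes when x_t = n.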
Context matters in phrases such as ‘I do not feel great’: the term ‘great’ should be tagged as a negative one (in our case it is tagged as ‘great_n’ before it is returned to its specific position). Das and Chen used a similar approach in classifying words [18]. We went one step further and considered positive tags on negative words. A sentence that states ‘No side effects so I am happy!’ resulted in the word ‘No’ being tagged as ‘No_p’ (reflecting its positive context) before it is returned to its specific position. These tagged words were thus reclassified based on the context of the post.

We then reduced the number of similar words, both manually (checking the words using online dictionaries such as Merriam-Webster (http://www.merriam-webster.com/)) and automatically (synonym database software such as the Thesaurus Synonym Database (http://www.language-databases.com/) and Google’s synonym search finder).

Our final wordlist was pruned using the aforementioned methods, with the results displayed in Table I, divided into positive and negative words.

We eliminated each word that appeared fewer than ten times. This allowed us to achieve a uniform set of measurements while eliminating statistically insignificant outliers. The end result was a modified wordlist of 110 words (55 positive words and 55 negative words) shown in Table I.

IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, VOL. 19, NO. 1, JANUARY 2015

TABLE I
FINAL POST-ANALYSIS WORDLIST

Positive: Agree, Appreciate, Beneficial, Benefit, Comfort, Comfortable, Ease, Easier, Effective, Enjoy, Favorable, Favorably, Feasible, Good, Grateful, Great, Greater, Greatest, Greatly, Help, Helped, Helpful, Helping, Helps, Honest, Honestly, Hope, Hoped, Hopeful, Hopefully, Hopes, Hoping, Importance, Important, Importantly, Impresses, Improve, Improved, Improvement, Improves, Improving, Inspiration, Like, Love, Loved, Positive, Right, Success, Successful, Support, Thank, Thanks, Useful, Well, Wonderful

Negative: Bad, Cannot, Concern, Concerns, Damage, Dangerous, Depression, Didn, Died, Difficult, Discomfort, Don’t, Doubt, Error, Failure, Fear, Hard, Hasn, Hate, Hurt, Impossible, Isn, Lack, Limited, Lose, Loss, Lost, Miss, Nasty, Nausea, Negative, No, Not, Pain, Painful, Poor, Problem, Problems, Sad, Sacred, Scary, Severe, Sorry, Sucks, Suffer, Suffering, Terrible, Unable, Unfortunately, Wasn, Weak, Worried, Worse, Worst, Wrong

In a parallel procedure, we automatically browsed the user posts to look for side effects of Erlotinib. To this end, we used the National Library of Medicine’s Medical Subject Headings (MeSH) (http://www.nlm.nih.gov/mesh/), a controlled vocabulary that consists of a hierarchy of descriptors and qualifiers that are used to annotate medical terms. A custom-designed program was used to map words in the forum to the MeSH database. A list of words present in forum posts that were associated with treatment side effects was thus compiled. This was done by selecting the words simultaneously annotated with a specific list of qualifiers in MeSH (CI: chemically induced; CO: complications; DI: diagnosis; PA: pathology; and PP: physiopathology).

We then compared the full list of side effect words with the results that were fed into the Rapidminer processing tree: we kept the side effect words with the highest TF-IDF scores (ensuring that each word appeared at least ten times in the forum posts).

Table II shows the final wordlist of the side effects.

TABLE II
FINAL SIDE EFFECTS WORDLIST

Acne, Cachexia, Headaches, Itching, Lesion, Pneumonia, Rash, Tremor, Weakness, Vomiting

Fig. 2. Thread model where nodes represent users/posts and the edges represent information transferred among users.
We subjected the initial side effect wordlist to the same methods that were used for Table I.

After these preprocessing steps, our forum data was represented as two sets of vectors containing the TF-IDF scores of the words in the two wordlists. Namely, each user post in the forum was transformed into a vector of 110 variables representing the TF-IDF scores of positive and negative words, and a 10-variable vector containing the TF-IDF scores corresponding to the side effect terms (see Fig. 3, steps A1-A3).

D. Consumer Sentiment Using a SOM

For this part of the analysis, all posts were manually labeled as positive or negative according to the general user opinion observed within the post, before feeding the collected data for exploratory analysis via SOMs. The manual labeling allowed us to use this as a method of results validation.

SOMs are neural networks that produce a low-dimensional representation of high-dimensional data [33]. Within this network, a layer represents the output space, with each neuron assigned a specific weight. The weight values reflect the cluster content. The SOM presents the data to the network, bringing similar data weights together at similar neurons.

The benefits and capabilities of SOMs have been demonstrated in cases where, despite the reduction of the space size, the information and identification schema of the clusters remained the same [36]. When new data is fed into the network, the closest weights matching the data change to reflect the new data. The neurons farther from the new data rarely change. This process continues until data is no longer fed, resulting in a two-dimensional map.

The SOM toolbox (www.cis.hut.fi/projects/somtoolbox) [37] was used, and the SOM was fed with the TF-IDF vectors of our first wordlist (see Table I). The purpose was to assess the existence of clusters in the data and how the SOM weights of these clusters would correlate to positive and negative opinion.
The SOM was trained using various map sizes, using quantization and topographic errors as validation measures. The former is the average distance between every input vector and its best matching neuron (BMN), and measures how well the trained map fits the input data [33]. The latter measures how well the map preserves the topology of the data: it is calculated as the proportion of input vectors whose first and second BMNs are not adjacent on the map.

The best map size was based on the minimum values of the quantization (0.24) and topographic (10^{-5}) errors. The wordlist data was mapped and the emerging weights were analyzed for positive and negative variable correlations of the wordlist. Words of no interest, and groups containing three or fewer words, were eliminated.

Subgroups were visually identified and analyzed for further information on the consumer opinion of Erlotinib.

E. Modeling Forum Postings Using Network Analysis

Discovering influential users was the next step in our analysis. To this end, we built networks from forum posts and their replies, while accounting for content-based grouping of posts resulting from the existing forum threads. Networks are composed of nodes and their connections: they are either nondirectional (a connection between two nodes without a direction) or directional (a connection with an origin and an end). The nodal degree of the latter measures the number of connections from the origin to the destination. Four node types have been identified [38] within a network: isolated, transmitter, receptor, and carrier. The network’s density measures the number of existing connections relative to the number of possible connections.

Network-based analysis is widely used in social network analysis because of its ability to both model and analyze intersocial dynamics.
We devised a directional network model due to the nature of the forum under scrutiny (multiple threads with multiple thread initiators) and its internal dynamics among the members (members reply to thread initiators as well as to other users). Fig. 2 describes the approach we chose to build our network, which shows how each posting-reply pair is modeled. Based on the nature of the forum, all of the posters within each thread are context posters for the thread initiator (e.g., Node 1 is the thread initiator in Fig. 2 and Nodes 2, 3, 4, and 5 represent context posters). Thus, all of the posters receive an incoming edge from the thread initiator. Some context posters respond directly to another poster, using the forum option ‘Reply.’ We used bidirectional edges to reflect the ensuing information transfer from the poster to the replier and vice versa (in Fig. 2, Node 5 is a direct replier to Node 4, as is Node 3 to Node 2). This user-interaction model allowed us to build a network that faithfully reflects the information content of the forum.

Fig. 3. Diagram describing the framework of our network-based analysis. First, the posts collected from the forum via Rapidminer are preprocessed using the NLTK Toolbox (Step A1) and transformed into two wordlists (Step A2). For this step, direct mapping to the MeSH vocabulary is used to identify words representing side effects. Based on the two wordlists, forum posts are transformed into numerical vectors containing word-frequency based TF-IDF scores (Step A3). In parallel, forum posts and replies are modeled as a directed network (Step B1). The obtained network is further refined to identify communities/modules of highly interacting users, based on the MCSD method [28] (Step B2). Finally, the two wordlist vector datasets (reflecting the information content of the forum) are overlaid onto the network modules to identify influential users and highlight side effects intensively discussed within the modules, respectively (Step B3).

F. Identifying Subgraphs

Our modeling framework has consequently converted the forum posts into several large directional networks containing a number of densely connected units (or modules) (see Fig. 3, step B1). These modules have the characteristic that they are more densely connected internally (within the unit) than externally (outside the unit). We chose a multiscale method that uses local and global criteria for identifying the modules, while maximizing a partition quality measure called stability [28].

The stability measure considers the network as a Markov chain, with nodes representing states and edges being possible transitions among these states. In [28], the authors proposed an approach in which transition probabilities for a random walk of length t (t being the Markov time) enable multiscale analysis. With increasing scale t, larger and larger modules are found. The stability of a walk of length t can be expressed as

\[
Q_{M_t} = \frac{1}{2m} \sum_{i,j} \left[ (A_t)_{ij} - \frac{d_i d_j}{2m} \right] \delta(i, j) \qquad (1)
\]

where A_t is the adjacency matrix, t is the length of the walk, m is the number of edges, i and j are nodes, d_i is node i’s strength (and d_j node j’s), and the function δ(i, j) equals one if nodes i and j belong to the same module and zero otherwise. A_t is computed as follows (in order to account for the random walk): A_t = D · M^t, where M = D^{-1} · A (D being the diagonal matrix containing the degree vector, giving for each node its degree) [28].

The method for identifying the optimal modules is based on alternating local and global criteria that expand modules by adding neighbor nodes, reassigning nodes to different modules, and merging significantly overlapping modules until no further optimization is feasible, according to (1).
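To make (1) concrete, the sketch below evaluates the stability of a given partition in plain Python. It is an illustration under stated assumptions, not the optimization procedure of [28]: the network is taken to be undirected (symmetric adjacency matrix) with no isolated nodes, the Markov time t is restricted to non-negative integers, and the names `mat_mul` and `stability` are hypothetical.

```python
def mat_mul(X, Y):
    """Multiply two square matrices given as lists of lists."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def stability(A, labels, t):
    """Evaluate Eq. (1) for adjacency matrix A, module labels, and integer t."""
    n = len(A)
    d = [sum(row) for row in A]            # node strengths d_i
    two_m = sum(d)                         # equals 2m for an undirected network
    M = [[A[i][j] / d[i] for j in range(n)] for i in range(n)]  # M = D^-1 A
    # M^t by repeated multiplication, starting from the identity (t = 0).
    P = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(t):
        P = mat_mul(P, M)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:     # delta(i, j) = 1 within a module
                a_t = d[i] * P[i][j]       # (A_t)_ij, since A_t = D M^t
                total += a_t - d[i] * d[j] / two_m
    return total / two_m

# Hypothetical network: two triangles (0-1-2 and 3-4-5) joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = [[0] * 6 for _ in range(6)]
for u, v in edges:
    A[u][v] = A[v][u] = 1

q_two_modules = stability(A, [0, 0, 0, 1, 1, 1], t=1)
q_one_module = stability(A, [0] * 6, t=1)
```

At t = 1 the matrix A_t reduces to A, so (1) coincides with the familiar modularity of the partition. Placing every node in a single module gives a stability of zero, which is why a partition into the two triangles scores higher on this toy network.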
The approach follows similar methods presented in [28], [39], and [40].

Several partitioning schemes were obtained depending on the range of scales employed by the method, with the optimal partitioning having the largest stability. We named the modules thus retrieved information modules (see step B2 in Fig. 3).

G. Module Average Opinion and User Average Opinion

We then proceeded to refine the information modules by feeding them with the information obtained from the forum posts (using the wordlist vectors). In a first step, we aimed at identifying influential users within our networks. Influential users are users who broker most of the information transfer within network modules and whose opinion, in terms of positive or negative sentiment towards the treatment, is ‘spread’ to the other users within their containing modules. To this end, we enriched the information modules obtained as described in Section II-F with the TF-IDF scores of the user posts corresponding to the users found in each module. The TF-IDF scores from the wordlist of positive and negative words (see Table I) were used to build two forms of measurement. The global measure (pertaining to the whole information module) is represented by the module average opinion (MAO). It examines the TF-IDF scores of postings matching the nodes in a specific module:

\[
\mathrm{MAO} = \frac{Sum_{+} - Sum_{-}}{Sum_{all}}
\]

Sum_{+} = \sum_{i} \sum_{j \in \mathcal{P}} x_{ij} is the total sum of the TF-IDF scores matching the positive words in the wordlist vectors within the module. The index i runs over the posts and the index j over the wordlist positions of the positive words (the index set \mathcal{P}).

Sum_{-} = \sum_{i} \sum_{j \in \mathcal{N}} x_{ij} is the total sum of the TF-IDF scores matching the negative words in the wordlist vectors within the module. The index i runs over the posts and the index j over the wordlist positions of the negative words (the index set \mathcal{N}).

Sum_{all} = \sum_{i=1}^{N} \sum_{k=1}^{M} x_{ik} is the sum of both of the aforementioned sums.
The index k runs across all variables in the entire wordlist.

The local measure is the user average opinion (UAO), which captures the opinion of the specific user at each node in the module. It examines the TF-IDF scores averaged over the collected posts of the user:

\[
\mathrm{UAO}_i = \frac{Sum_{i+} - Sum_{i-}}{Sum_{i,all}}
\]

Sum_{i+} = \sum_{j \in \mathcal{P}} x_{ij} is the sum of the TF-IDF scores matching positive words in the ith user’s wordlist vector. \mathcal{P} is the index set denoting the wordlist’s positive variables.

Sum_{i-} = \sum_{j \in \mathcal{N}} x_{ij} is the sum of the TF-IDF scores matching negative words in the ith user’s wordlist vector. \mathcal{N} is the index set denoting the wordlist’s negative variables.

Sum_{i,all} = \sum_{j=1}^{M} x_{ij} is the total of both sums, and j is the index over the whole wordlist.

Fig. 4. U-Matrix of the posts from the Cancerforums.net forum.

H. Information Brokers Within the Information Modules

We first ranked individual nodes in terms of their total number of connecting edges (in- and out-degree) to identify influential users within the modules.

We then selected nodes in each module based on the following criteria:

1) The nodes have the highest degrees within the module (highest number of edges).
2) The sign of the UAO score matches the sign of the MAO of the containing module.

The nodes that met these criteria were dubbed information brokers. Their large nodal degrees ensure increased information transfer compared to other nodes, while their matching UAO and MAO scores reflect consistency of positive or negative opinion within the containing module.

I. Network-Based Identification of Side Effects

In the second step of our network-based analysis, we devised a strategy for identifying potential side effects that occur during the treatment and that user posts on the forum highlight. To this end, we overlay the TF-IDF scores of the second wordlist (see Table II) onto the modules obtained in Section II-F.
The TF-IDF scores within each module thus directly reflect how frequently a certain side effect is mentioned in module posts. Subsequently, a statistical test (such as the t-test) can be used to compare the values of the TF-IDF scores within the module to those of the overall forum population and identify variables (side effects) that have significantly higher scores.

Fig. 3 presents a diagram that visually describes the steps in our network-based analysis.

III. RESULTS

Fig. 4 shows the unified matrix resulting from the SOM analysis for the wordlist vectors corresponding to the positive and negative terms from the message board Cancerforums.net. A subset consisting of 30% of the data was used for training the SOM. We used a 12 × 12 map size with 110 variables corresponding to the positive and negative terms to ascertain how the weights of the words corresponded to the opinion of the drug Erlotinib. As mentioned in Section II, each word from the list appeared at least ten times.

TABLE III
USER OPINION OF ERLOTINIB

Satisfaction                           Dissatisfaction
70 percent                             30 percent

BREAKDOWN OF USER OPINION

Fully Satisfied (23)                   Full Dissatisfaction (4)
Satisfied Despite Side Effects (37)    Dissatisfaction because of Side Effects (20)
Satisfied Despite Costs (10)           Dissatisfaction because of Costs (6)
This achieved a uniform measurement set while eliminating statistically insignificant outliers.

Many of the users’ posts converged on three areas of the map. We checked the respective nodes’ correlation with their weight vectors’ values corresponding to positive or negative words to define the positive and negative areas of the map.

The user opinion of Erlotinib was overall satisfactory, with Table III summarizing the satisfaction/dissatisfaction. According to Table III, and from our readings of both the user posts and the SOM, the most pressing concern from both camps was the side effects, which are extensively documented in the medical literature [41]–[46]. The cost of the drug was another matter of concern (albeit limited).

We then proceeded to identify influential users. Our modeling approach initially yielded a single loosely connected network, linking all users within the forum. Subsequent module identification using the methods described in Section II-F yielded an optimal partitioning containing five densely connected modules. We varied our scale parameter within the interval t ∈ [0, 2] in 0.1 increments, as suggested by [28]. Varying the scale parameter resulted in a set of partitions ranging from modules based on single individual users (for scale parameter t = 0) to large modules (for values of t close to the upper limit of the interval). The optimal partition (maximizing the quality measure in (1)) was obtained for t = 1.

On the Cancerforums.net message board, ten users out of the 920 posts were identified as information brokers, as shown in Fig. 5(a)–(e).

Densities of the retrieved modules range from 0.2 to 0.6. These density values fall within the interval of density values generally noted in social networks (towards the upper limit), thus confirming our network modeling approach [47], [48].
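The broker selection behind this result (Sections II-G and II-H) can be sketched as follows. This is a minimal sketch, not the authors' code: the function names, the `top_k` cutoff used to stand in for "highest degrees", and the toy four-word wordlist are all assumptions introduced here.

```python
def average_opinion(vectors, positive_idx, negative_idx):
    """(Sum+ - Sum-) / Sum_all over one or more TF-IDF wordlist vectors."""
    pos = sum(v[j] for v in vectors for j in positive_idx)
    neg = sum(v[j] for v in vectors for j in negative_idx)
    total = sum(sum(v) for v in vectors)
    return (pos - neg) / total if total else 0.0

def information_brokers(module_posts, degree, positive_idx, negative_idx, top_k=2):
    """module_posts: {user: [tfidf_vector, ...]}; degree: {user: in+out degree}.

    Returns the module average opinion (MAO) and the users among the top_k
    highest-degree nodes whose user average opinion (UAO) has the MAO's sign.
    top_k is a hypothetical cutoff; the source does not state one.
    """
    all_vectors = [v for posts in module_posts.values() for v in posts]
    mao = average_opinion(all_vectors, positive_idx, negative_idx)
    ranked = sorted(module_posts, key=lambda u: degree[u], reverse=True)
    brokers = []
    for user in ranked[:top_k]:
        uao = average_opinion(module_posts[user], positive_idx, negative_idx)
        if uao * mao > 0:  # same sign as the module average opinion
            brokers.append(user)
    return mao, brokers

# Hypothetical module: wordlist positions 0-1 are positive, 2-3 are negative.
module_posts = {
    "u1": [[0.4, 0.2, 0.0, 0.1]],
    "u2": [[0.0, 0.1, 0.5, 0.3]],
    "u3": [[0.3, 0.3, 0.1, 0.0]],
}
degree = {"u1": 5, "u2": 4, "u3": 1}
mao, brokers = information_brokers(module_posts, degree, [0, 1], [2, 3])
```

Summing a user's post vectors instead of averaging them leaves the UAO unchanged, because the scaling factor cancels between numerator and denominator.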
Information brokers were identified following the procedure described in Sections II-G–H. Further scrutinizing these users and their containing modules, we confirmed that their connections were the densest. A thorough reading of these ten users’ posts throughout the threads they started and participated in revealed that they were informative and actively interacting with users across many threads. Other members sought out these ten posters for their wisdom and experience. Their forum ‘behavior’ confirmed to us that these users were the premier information brokers of the drug Erlotinib on the Cancerforums.net forum.

Fig. 5. Ten users were identified as information brokers on the Cancerforums.net forum. Modules in parts (a)–(e) show where these ten users reside in the forum.

TABLE IV
SIDE-EFFECT FREQUENCY AND LOCATION IN SELECTED MODULES

Module 1 (A)    ‘rash’ (p-value < 0.01), ‘itch’ (p-value < 0.05)
Module 2 (B)    ‘rash’ (p-value < 0.05)
Module 5 (E)    ‘rash’ (p-value < 0.01)

In the last part of our analysis, we investigated which modules were significantly involved in discussing specific side effects. As described in Section II-I, retrieved modules were enriched with the TF-IDF scores corresponding to the side-effect wordlist vectors. For each module and each side-effect scores sample, t-tests were performed to assess the significance of the difference between the in-module sample and the overall forum population scores. Rash and itching were identified as the side effect terms with significantly higher scores in Modules 1, 2, and 5 when compared to the overall scores population in the forum, as described in Table IV. This reflects the fact that users grouped within these modules repeatedly discussed these side effects in their posts. This was confirmed by subsequent scrutiny of the respective posts.
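The module-versus-forum comparison can be illustrated with a two-sample t statistic. The source says only that t-tests were performed; the unequal-variance (Welch) variant, the function name `welch_t`, and the scores below are assumptions made for this sketch.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(module_scores, forum_scores):
    """Return (t statistic, approximate degrees of freedom) for two samples."""
    na, nb = len(module_scores), len(forum_scores)
    # Squared standard errors of the two sample means.
    se_a = variance(module_scores) / na
    se_b = variance(forum_scores) / nb
    t = (mean(module_scores) - mean(forum_scores)) / sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (na - 1) + se_b ** 2 / (nb - 1))
    return t, df

# Hypothetical TF-IDF scores of the term 'rash': posts in one module vs. the forum.
module_rash = [0.8, 0.9, 0.7, 1.0]
forum_rash = [0.1, 0.2, 0.0, 0.3, 0.1, 0.2]
t_stat, dof = welch_t(module_rash, forum_rash)
```

A large positive t indicates that the term scores higher inside the module than across the forum. Converting t and the degrees of freedom into a p-value requires the Student t distribution, which the Python standard library does not provide; SciPy's `scipy.stats.ttest_ind` with `equal_var=False` runs the same test and returns the p-value.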
A literature search confirmed that rash and itching are indeed two of the most common side effects of Erlotinib, with as many as 70% of the patients affected, as indicated by clinical studies [44]–[46].

IV. DISCUSSION

We converted a forum focused on oncology into weighted vectors to measure consumer thoughts on the drug Erlotinib using positive and negative terms alongside another list containing the side effects. Our methods were able to investigate positive and negative sentiment on lung cancer treatment using the drug by mapping the high-dimensional data onto a lower-dimensional space using the SOM. Most of the user data was clustered in the area of the map linked to positive sentiment, thus reflecting the general positive view of the users. Subsequent network-based modeling of the forum yielded interesting insights on the underlying information exchange among users. Modules of strongly interacting users were identified using a multiscale community detection method described in [28]. By overlaying these modules with content-based information in the form of word-frequency scores retrieved from user posts, we were able to identify information brokers who seem to play important roles in shaping the information content of the forum. Additionally, we were able to identify potential side effects consistently discussed by groups of users. Such an approach could be used to raise red flags in future clinical surveillance operations, as well as to highlight various other treatment-related issues. The results have opened new possibilities for developing advanced solutions, as well as revealing challenges in developing such solutions.

The consensus on Erlotinib depends on individual patient experience. Social media, by its nature, will bring together different individuals with different experiences and viewpoints. We sifted through the data to find positive and negative sentiment, which was later confirmed by research that emerged regarding Erlotinib’s effectiveness and side effects.
Future studies will require more up-to-date information for a clearer picture of user feedback on drugs and services.

Future solutions will require more advanced detection of intersocial dynamics and its effects on the members: such interests of study may include rankings, ‘likes’ of posts, and friendships. Further emphasis on context posting will require formal language dictionaries that include medical terms for specific diseases, and informal language terms (‘slang’) to clarify posts. Finally, different platforms will allow up-to-date information on the status of the drug in case one social platform ceases to discuss the drug. Another solution can look at multiple wordlists that can include multiple treatments that, when combined with contextual posting and medical lexical dictionaries, can pinpoint the source (or multiple sources) of user satisfaction (or dissatisfaction), which can open the door towards mapping consumer sentiment of multidrug therapies for advanced diseases. The combined solutions can open new avenues of postmarketing surveillance research as companies seek real-time, ‘intelligent’ data of their products and services to remain competitive.

This solution can be envisioned on future medical devices that can serve as a postmarketing feedback loop that consumers can use to express their satisfaction (or dissatisfaction) directly to the company. The company benefits from real-time feedback that can then be used to assess if there are any problems and rapidly address such problems.

Social media can open the door for the health care sector in addressing cost reduction, product and service optimization, and patient care.

Neighbor Similarity Trust against Sybil Attack in P2P E-Commerce

05/08/201902/07/2019 by admin

In this paper, we present a distributed structured approach to Sybil attack. This is derived from the fact that our approach is based on the neighbor similarity trust relationship among the neighbor peers. Given a P2P e-commerce trust relationship based on interest, the transactions among peers are flexible as each peer can decide to trade with another peer any time. A peer doesn’t have to consult others in a group unless a recommendation is needed. This approach shows the advantage in exploiting the similarity trust relationship among peers in which the peers are able to monitor each other.

Our contribution in this paper is threefold:

1) We propose SybilTrust, which can identify Sybil peers and protect honest peers from Sybil attack. Sybil peers can have their trust canceled and be dismissed from a group.

2) Based on the group infrastructure in P2P e-commerce, each neighbor is connected to the peers by the success of the transactions it makes or by its trust evaluation level. A peer is recognized as a neighbor only if its trust level is sustained above a threshold value.

3) SybilTrust enables neighbor peers to carry recommendation identifiers among the peers in a group. This ensures that the group detection algorithms that identify Sybil attack peers are efficient and scalable in large P2P e-commerce networks.
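
The neighbor rule in contribution 2 can be sketched as follows. This is only an illustration: it assumes trust is estimated as the fraction of successful transactions, and the function names and the 0.5 threshold are hypothetical, not values given in the paper.

```python
def trust_level(successful: int, total: int) -> float:
    """Fraction of successful transactions; 0.0 when there is no history.

    Assumption: the paper does not give this formula, it is a stand-in.
    """
    if total <= 0:
        return 0.0
    return successful / total


def is_neighbor(successful: int, total: int, threshold: float = 0.5) -> bool:
    """A peer counts as a neighbor only while its trust stays above the threshold."""
    return trust_level(successful, total) > threshold
```

For example, a peer with 8 successful transactions out of 10 would be kept as a neighbor under this rule, while a peer with 2 out of 10 would not.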

GOAL OF THE PROJECT:

The goal of trust systems is to ensure that honest peers are accurately identified as trustworthy and Sybil peers as untrustworthy. To unify terminology, we refer to all identities created by malicious users as Sybil peers. In a P2P e-commerce application scenario, most of the trust considerations depend on the historical factors of the peers. The influence of Sybil identities can be reduced based on the historical behavior and recommendations from other peers. For example, a peer can give a positive recommendation to a peer that is later discovered to be a Sybil or malicious peer. This can diminish the influence of Sybil identities and hence reduce Sybil attack. A peer that has been giving dishonest recommendations will have its trust level reduced. If its trust level reaches a certain threshold, the peer can be expelled from the group. Each peer has an identity, which is either honest or Sybil.
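
A minimal sketch of the trust reduction and expulsion rule described above. The `Peer` class, the penalty of 0.25 per dishonest recommendation, and the expulsion threshold of 0.5 are illustrative assumptions; the paper does not specify these values.

```python
from dataclasses import dataclass


@dataclass
class Peer:
    peer_id: str
    trust: float = 1.0      # assumed starting trust level
    expelled: bool = False


def penalize_dishonest_recommendation(
    peer: Peer, penalty: float = 0.25, expel_threshold: float = 0.5
) -> Peer:
    """Lower the peer's trust and expel it once trust falls to the threshold."""
    peer.trust = max(0.0, peer.trust - penalty)
    if peer.trust <= expel_threshold:
        peer.expelled = True
    return peer
```

With these assumed values, a peer survives one dishonest recommendation and is expelled from the group after the second.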

A Sybil identity can be an identity owned by a malicious user, a bribed or stolen identity, or a fake identity obtained through a Sybil attack. These Sybil attack peers are employed to target honest peers and hence subvert the system. In a Sybil attack, a single malicious user creates a large number of peer identities called sybils. These sybils are used to launch security attacks, both at the application level and at the overlay level. At the application level, sybils can target other honest peers while transacting with them, whereas at the overlay level, sybils can disrupt the services offered by the overlay layer, such as routing, data storage, and lookup. In trust systems, colluding Sybil peers may artificially increase a (malicious) peer’s rating (e.g., eBay).

1.2 INTRODUCTION:

P2P networks range from communication systems like email and instant messaging to collaborative content rating, recommendation, and delivery systems such as YouTube, Gnutella, Facebook, Digg, and BitTorrent. They allow any user to join the system easily at the expense of trust, with very little validation control. P2P overlay networks are known for their many desired attributes like openness, anonymity, decentralized nature, self-organization, scalability, and fault tolerance. Each peer plays the dual role of client as well as server, meaning that each has its own control. All the resources utilized in the P2P infrastructure are contributed by the peers themselves, unlike traditional methods where a central authority is in control. Peers can collude and carry out all sorts of malicious activities in open-access distributed systems. These malicious behaviors lead to service quality degradation and monetary loss among business partners. Peers are vulnerable to exploitation due to the open nature of the system and the near-zero cost of creating new identities. The peer identities are then utilized to influence the behavior of the system.

However, if a single defective entity can present multiple identities, it can control a substantial fraction of the system, thereby undermining the redundancy. The number of identities that an attacker can generate depends on the attacker’s resources such as bandwidth, memory, and computational power.

Systems like Credence rely on a trusted central authority to prevent maliciousness.

Defending against Sybil attack is quite a challenging task. A peer can pretend to be trusted with a hidden motive. The peer can pollute the system with bogus information, which interferes with genuine business transactions and the functioning of the system. This must be prevented to protect the honest peers. The link between an honest peer and a Sybil peer is known as an attack edge. As each edge resembles a human-established trust relationship, it is difficult for the adversary to introduce an excessive number of attack edges. The only known promising defense against Sybil attack is to use social networks to perform user admission control and limit the number of bogus identities admitted to a system. A social network link between two peers represents a real-world trust relationship between users. In addition, authentication-based mechanisms are used to verify the identities of the peers using shared encryption keys or location information.

1.3 LITERATURE SURVEY:

KEEP YOUR FRIENDS CLOSE: INCORPORATING TRUST INTO SOCIAL NETWORK-BASED SYBIL DEFENSES

AUTHOR: A. Mohaisen, N. Hopper, and Y. Kim

PUBLISH: Proc. IEEE Int. Conf. Comput. Commun., 2011, pp. 1–9.

EXPLANATION:

Social network-based Sybil defenses exploit the algorithmic properties of social graphs to infer the extent to which an arbitrary node in such a graph should be trusted. However, these systems do not consider the different amounts of trust represented by different graphs, and different levels of trust between nodes, though trust is a crucial requirement in these systems. For instance, co-authors in an academic collaboration graph are trusted in a different manner than social friends. Furthermore, some social friends are more trusted than others. However, previous designs for social network-based Sybil defenses have not considered the inherent trust properties of the graphs they use. In this paper we introduce several designs to tune the performance of Sybil defenses by accounting for differential trust in social graphs and modeling these trust values by biasing random walks performed on these graphs. Surprisingly, we find that the cost function, the required length of random walks to accept all honest nodes with overwhelming probability, is much greater in graphs with high trust values, such as co-author graphs, than in graphs with low trust values such as online social networks. We show that this behavior is due to the community structure in high-trust graphs, requiring longer walks to traverse multiple communities. Furthermore, we show that our proposed designs to account for trust, while increasing the cost function of graphs with low trust value, decrease the advantage of the attacker.

FOOTPRINT: DETECTING SYBIL ATTACKS IN URBAN VEHICULAR NETWORKS

AUTHOR: S. Chang, Y. Qi, H. Zhu, J. Zhao, and X. Shen

PUBLISH: IEEE Trans. Parallel Distrib. Syst., vol. 23, no. 6, pp. 1103–1114, Jun. 2012.

EXPLANATION:

In urban vehicular networks, where privacy, especially the location privacy of anonymous vehicles, is a major concern, anonymous verification of vehicles is indispensable. Consequently, an attacker who succeeds in forging multiple hostile identities can easily launch a Sybil attack, gaining a disproportionately large influence. In this paper, we propose a novel Sybil attack detection mechanism, Footprint, using the trajectories of vehicles for identification while still preserving their location privacy. More specifically, when a vehicle approaches a road-side unit (RSU), it actively demands an authorized message from the RSU as the proof of the appearance time at this RSU. We design a location-hidden authorized message generation scheme for two objectives: first, RSU signatures on messages are signer ambiguous so that the RSU location information is concealed from the resulting authorized message; second, two authorized messages signed by the same RSU within the same given period of time (temporarily linkable) are recognizable so that they can be used for identification. With the temporal limitation on the linkability of two authorized messages, authorized messages used for long-term identification are prohibited. With this scheme, vehicles can generate a location-hidden trajectory for location-privacy-preserved identification by collecting a consecutive series of authorized messages. Utilizing social relationship among trajectories according to the similarity definition of two trajectories, Footprint can recognize and therefore dismiss “communities” of Sybil trajectories. Rigorous security analysis and extensive trace-driven simulations demonstrate the efficacy of Footprint.

SYBILLIMIT: A NEAR-OPTIMAL SOCIAL NETWORK DEFENSE AGAINST SYBIL ATTACK

AUTHOR: H. Yu, P. Gibbons, M. Kaminsky, and F. Xiao

PUBLISH: IEEE/ACM Trans. Netw., vol. 18, no. 3, pp. 3–17, Jun. 2010.

EXPLANATION:

Decentralized distributed systems such as peer-to-peer systems are particularly vulnerable to sybil attacks, where a malicious user pretends to have multiple identities (called sybil nodes). Without a trusted central authority, defending against sybil attacks is quite challenging. Among the small number of decentralized approaches, our recent SybilGuard protocol [H. Yu et al., 2006] leverages a key insight on social networks to bound the number of sybil nodes accepted. Although its direction is promising, SybilGuard can allow a large number of sybil nodes to be accepted. Furthermore, SybilGuard assumes that social networks are fast mixing, which has never been confirmed in the real world. This paper presents the novel SybilLimit protocol that leverages the same insight as SybilGuard but offers dramatically improved and near-optimal guarantees. The number of sybil nodes accepted is reduced by a factor of Θ(√n), or around 200 times in our experiments for a million-node system. We further prove that SybilLimit’s guarantee is at most a log n factor away from optimal, when considering approaches based on fast-mixing social networks. Finally, based on three large-scale real-world social networks, we provide the first evidence that real-world social networks are indeed fast mixing. This validates the fundamental assumption behind SybilLimit’s and SybilGuard’s approach.

CHAPTER 2

2.0 SYSTEM ANALYSIS

2.1 EXISTING SYSTEM:

Existing work on Sybil attack makes use of social networks to eliminate Sybil attack, and the findings are based on preventing Sybil identities. In this paper, we propose the use of neighbor similarity trust in a group P2P e-commerce based on interest relationships, to eliminate maliciousness among the peers. This is referred to as SybilTrust. In SybilTrust, the peers of the interest-based group infrastructure have a neighbor similarity trust between each other, hence they are able to prevent Sybil attack. SybilTrust gives a better relationship in e-commerce transactions as the peers create a link between peer neighbors. This provides an important avenue for peers to advertise their products to other interested peers and to learn about new market destinations and contacts as well. In addition, the group enables a peer to join P2P e-commerce network and makes identity more difficult.

Peers use self-certifying identifiers that are exchanged when they initially come into contact. These can be used as public keys to verify digital signatures on the messages sent by their neighbors. We note that all communications between peers are digitally signed. In this kind of relationship, we use neighbors as our point of reference to address Sybil attack. In a group, whatever admission policy we set, there are honest, malicious, and Sybil peers who are authenticated by an admission control mechanism to join the group. More honest peers are admitted compared to malicious peers, where the trust association is aimed at positive results. The knowledge of the graph may reside in a single party, or be distributed across all users.
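
One common way to make an identifier self-certifying is to derive it from a hash of the peer's public key, so that anyone can check a presented key against the identifier. The sketch below assumes that construction (SHA-256 over the raw key bytes); the paper does not say how its identifiers are built, and the function names are hypothetical. Signature verification itself is left out because it needs a public-key library.

```python
import hashlib
import hmac


def self_certifying_id(public_key: bytes) -> str:
    """Identifier derived from the key itself (assumed construction)."""
    return hashlib.sha256(public_key).hexdigest()


def key_matches_id(peer_id: str, presented_key: bytes) -> bool:
    """Check that a presented public key really belongs to the identifier."""
    expected = self_certifying_id(presented_key)
    # compare_digest avoids leaking information through comparison timing
    return hmac.compare_digest(expected, peer_id)
```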

2.1.0 DISADVANTAGES:

If a peer trades with very few unsuccessful transactions, we can deduce that the peer is a Sybil peer. This is supported by our approach, which proposes that peers existing in a group have six types of keys.

The keys which exist mostly are pairwise keys supported by the group keys. We also note that if an honest group has a link with another group which has Sybil peers, the Sybil group tends to have information which is not complete.

Fake users can enter easily.
This enables Sybil attacks.

2.2 PROPOSED SYSTEM:

In this paper, we assume there are three kinds of peers in the system: legitimate peers, malicious peers, and Sybil peers. Each malicious peer cheats its neighbors by creating multiple identities, referred to as Sybil peers. In this paper, P2P e-commerce communities are organized in several groups. A group can be either open or restrictive depending on the interest of the peers. We investigate the peers belonging to a certain interest group. In each group, there is a group leader who is responsible for coordinating the activities in the group.

The principal building block of the SybilTrust approach is the identifier distribution process. In the approach, all the peers with similar behavior in a group can be used as identifier sources. They can send identifiers to others as the system regulates. If a peer sends fewer or more identifiers than regulated, the system may contain a Sybil attack peer. The information can be broadcast to the rest of the peers in a group. When peers join a group, they acquire different identities in reference to the group. Each peer has neighbors in the group and outside the group. Sybil attack peers forged by the same malicious peer have the same set of physical neighbors that the malicious peer has.
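
The check on the identifier distribution process can be sketched as below. It assumes the system regulates a single fixed number of identifiers per source; the function name and the sample counts are hypothetical.

```python
from typing import Dict, Set


def flag_irregular_senders(sent_counts: Dict[str, int], regulated: int) -> Set[str]:
    """Return the peers that sent fewer or more identifiers than regulated."""
    return {peer for peer, count in sent_counts.items() if count != regulated}
```

For instance, with a regulated count of 3, `flag_irregular_senders({"a": 3, "b": 5, "c": 1}, 3)` flags peers `b` and `c`, whose behavior could then be broadcast to the rest of the group.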

Each neighbor is connected to the peers by the success of the transactions it makes or by its trust evaluation level. To detect a Sybil attack, where a peer can have different identities, a peer is evaluated in reference to its trustworthiness and its similarity to the neighbors. If the neighbors do not have the same trust data as the concerned peer, including its position, it can be detected that the peer has multiple identities and is cheating.
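
Because Sybil identities forged by one malicious peer share that peer's physical neighbors, one simple way to look for them is to compare neighbor sets. The sketch below uses Jaccard similarity for the comparison; that metric, the threshold of 1.0, and the peer names are assumptions for illustration, not the paper's detection algorithm.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def suspect_sybil_pairs(
    neighbors: Dict[str, Set[str]], threshold: float = 1.0
) -> List[Tuple[str, str]]:
    """Pairs of identities whose neighbor sets overlap at least `threshold`."""
    pairs = []
    for p, q in combinations(sorted(neighbors), 2):
        # Ignore the pair's own link so two identities are compared on
        # the neighbors they have in common with the rest of the group.
        np_ = neighbors[p] - {q}
        nq = neighbors[q] - {p}
        if np_ and jaccard(np_, nq) >= threshold:
            pairs.append((p, q))
    return pairs
```

Given `{"A1": {"x", "y"}, "A2": {"x", "y"}, "B": {"x", "z"}}`, only the pair `("A1", "A2")` is reported, since those two identities have identical neighbor sets.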

2.2.0 ADVANTAGES:

Our perception is that the attacker controls a number of neighbor similarity peers, whereby a randomly chosen identifier source is relatively “far away” from most Sybil attack peer relationships. Every peer uses a “reversed” routing table. The source peer will always send some information to the peers which have neighbor similarity trust. However, if they do not reply, it can blacklist them. If they do reply and the source is overwhelmed by the overhead of such replies, then the adversary is effectively launching a DoS attack. Notice that the adversary can launch a DoS attack against the source. This enables two peers to propagate their public keys and IP addresses backward along the route to learn about the peers.

It is helpful for finding Sybil attacks.
It is used to find fake user IDs.
It is feasible to limit the number of attack edges in online social networks by relationship rating.

2.3 HARDWARE & SOFTWARE REQUIREMENTS:

2.3.0 HARDWARE REQUIREMENT:

- Processor – Pentium IV
- Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Keyboard – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA

2.3.1 SOFTWARE REQUIREMENTS:

Operating System : Windows XP or Win7
Front End : JAVA JDK 1.7
Script : JavaScript
Tools : Netbeans 7
Document : MS-Office 2007

CHAPTER 3

3.0 SYSTEM DESIGNS

3.1 ARCHITECTURE DIAGRAM:

3.2 DATAFLOW DIAGRAM:

LEVEL 0: Neighbor Nodes; Source

LEVEL 1: P2P Sybil Trust Mode; Send Data Request

LEVEL 2: Data Receive; P2P ACK; Active Attack (Malicious Node); Send Data Request

LEVEL 3:

3.3 UML DIAGRAMS

3.3.0 USECASE DIAGRAM:

SERVER CLIENT

3.3.1 CLASS DIAGRAM:

3.3.2 SEQUENCE DIAGRAM:

3.4 ACTIVITY DIAGRAM:

CHAPTER 4

4.0 IMPLEMENTATION:

In this paper, P2P e-commerce communities are organized in several groups. A group can be either open or restrictive depending on the interest of the peers. We investigate the peers belonging to a certain interest group. In each group, there is a group leader who is responsible for coordinating the activities in the group. When peers join a group, they acquire different identities in reference to the group. Each peer has neighbors in the group and outside the group. Sybil attack peers forged by the same malicious peer have the same set of physical neighbors that the malicious peer has. Each neighbor is connected to the peers by the success of the transactions it makes or by its trust evaluation level. To detect a Sybil attack, where a peer can have different identities, a peer is evaluated in reference to its trustworthiness and its similarity to the neighbors. If the neighbors do not have the same trust data as the concerned peer, including its position, it can be detected that the peer has multiple identities and is cheating. The method of detection of Sybil attack is depicted in Fig. 2. A1 and A2 refer to the same peer but with different identities.

In our approach, the identifiers are only propagated by the peers who exhibit neighbor similarity trust. Our perception is that the attacker controls a number of neighbor similarity peers, whereby a randomly chosen identifier source is relatively “far away” from most Sybil attack peer relationships. Every peer uses a “reversed” routing table. The source peer will always send some information to the peers which have neighbor similarity trust. However, if they do not reply, it can blacklist them. If they do reply and the source is overwhelmed by the overhead of such replies, then the adversary is effectively launching a DoS attack. Notice that the adversary can launch a DoS attack against the source. This enables two peers to propagate their public keys and IP addresses backward along the route to learn about the peers. SybilTrust proposes that an honest peer should not have an excessive number of neighbors. The neighbors we refer to should be member peers existing in a group. The restriction helps to bound the number of peers against any additional attack among the neighbors. If there are too many neighbors, SybilTrust will (internally) only use a subset of the peer’s edges while ignoring all others. Following Liben-Nowell and Kleinberg, we define the attributes of the given pair of peers as the intersection of the sets of similar products.
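
The last two ideas in this paragraph, the pair attribute as a set intersection and the cap on the number of neighbors used, can be sketched as follows. How SybilTrust chooses which edges to keep is not stated, so keeping the highest-trust edges is an assumption here, as are the function names and sample data.

```python
from typing import Dict, Iterable, List, Set


def common_products(products_u: Iterable[str], products_v: Iterable[str]) -> Set[str]:
    """Attribute of a pair of peers: the products both of them deal in."""
    return set(products_u) & set(products_v)


def capped_neighbors(edge_trust: Dict[str, float], max_neighbors: int) -> List[str]:
    """Keep at most `max_neighbors` edges, ignoring all others.

    Assumption: edges are ranked by trust (ties broken by peer name).
    """
    ranked = sorted(edge_trust.items(), key=lambda kv: (-kv[1], kv[0]))
    return [peer for peer, _ in ranked[:max_neighbors]]
```

For example, `capped_neighbors({"a": 0.9, "b": 0.5, "c": 0.7}, 2)` keeps `a` and `c`, and `common_products({"book", "phone"}, {"phone", "tv"})` yields `{"phone"}`.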

4.1 MODULES:

SIMILARITY TRUST RELATIONSHIP:

NEIGHBOR SIMILARITY TRUST:

DETECTION OF SYBIL ATTACK:

SECURITY AND PERFORMANCE:

4.2 MODULES DESCRIPTION:

SIMILARITY TRUST RELATIONSHIP:

We focus on the active attacks in P2P e-commerce. When a peer is compromised, all the information will be extracted. In our work, we have proposed the use of SybilTrust, which is based on the neighbor similarity relationship of the peers. SybilTrust is efficient and scalable to group P2P e-commerce networks. Sybil attack peers may attempt to compromise the edges or the peers of the group P2P e-commerce. The Sybil attack peers can execute further malicious actions in the network. The threat being addressed is active identity attacks, since peers continuously carry out transactions with one another; the aim is to show that each controller admits only honest peers.

Our method assumes that the controller undergoes synchronization to prove whether the peers which acted as distributors of identifiers had similarity or not. If a peer never had similarity, the peer is assumed to have been a Sybil attack peer. Pairing method is used to generate an expander graph with expansion factor of high probability. Every pair of neighbor peers shares a unique symmetric secret key (the edge key), established out of band, for authenticating each other. Peers may deliberately cause Byzantine faults in which their multiple identities and incorrect behavior end up undetected.
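
A shared symmetric edge key lets two neighbors authenticate each other's messages with a message authentication code. The sketch below uses HMAC-SHA256 from the Python standard library as one possible choice; the paper does not name the primitive, and the function names are hypothetical.

```python
import hashlib
import hmac


def tag_message(edge_key: bytes, message: bytes) -> bytes:
    """Compute an authentication tag with the edge key (HMAC-SHA256 assumed)."""
    return hmac.new(edge_key, message, hashlib.sha256).digest()


def verify_message(edge_key: bytes, message: bytes, tag: bytes) -> bool:
    """Accept the message only if the tag was made with the same edge key."""
    return hmac.compare_digest(tag_message(edge_key, message), tag)
```

A neighbor that does not hold the edge key cannot produce a valid tag, and any change to the message after tagging makes verification fail.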

The Sybil attack peers can create more non-existent links. The protocols and services for P2P, such as routing protocols, must operate efficiently regardless of the group size. In the neighbor similarity trust, peers must be self-healing in order to recover automatically from any state. Sybil attack can defeat replication and fragmentation performed in distributed hash tables. Geographic routing in P2P is another routing mechanism which can be compromised by Sybil peers.

NEIGHBOR SIMILARITY TRUST:

We present a Sybil identification algorithm that takes place in a neighbor similarity trust. The directed graph has edges and vertices. In our work, we assume V is the set of peers and E is the set of edges. The edges in a neighbor similarity have attack edges which are safeguarded from Sybil attacks. A peer u and a Sybil peer v can trade whether one is Sybil or not. Being in a group, comparison can be done to determine the number of peers which trade with a peer. If the peer trades with very few unsuccessful transactions, we can deduce that the peer is a Sybil peer. This is supported by our approach, which proposes that a peer existing in a group has six types of keys. The keys which exist mostly are pairwise keys supported by the group keys. We also note that if an honest group has a link with another group which has Sybil peers, the Sybil group tends to have information which is not complete. Our algorithm adaptively tests the suspected peer while maintaining the neighbor similarity trust connection based on time.
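
To make the graph terms concrete, the sketch below represents E as a list of directed (u, v) pairs and picks out the attack edges, that is, the edges joining an honest peer and a Sybil peer. It assumes the set of Sybil peers is already known, which is only true in a simulation or after detection; the names and sample data are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Edge = Tuple[str, str]


def attack_edges(edges: Iterable[Edge], sybil_peers: Set[str]) -> List[Edge]:
    """Edges with exactly one endpoint in the Sybil set."""
    return [(u, v) for u, v in edges if (u in sybil_peers) != (v in sybil_peers)]
```

With edges `[("h1", "h2"), ("h2", "s1"), ("s1", "s2")]` and Sybil peers `{"s1", "s2"}`, the only attack edge is `("h2", "s1")`.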

DETECTION OF SYBIL ATTACK:

In a Sybil attack, a malicious peer must try to present multiple distinct identities. This can be achieved by either generating legal identities or by impersonating other normal peers. Some peers may launch arbitrary attacks to interfere with P2P e-commerce operations, or the normal functioning of the network. An attacker can succeed in launching a Sybil attack through:

- Heterogeneous configuration: in this case, malicious peers can have more communication and computation resources than the honest peers.

- Message manipulation: the attacker can eavesdrop on nearby communications with other parties. This means an attacker gets and interpolates information needed to impersonate others.

Major attacks in P2P e-commerce can be classified as passive and active attacks.

- Passive attack: the attacker listens to incoming and outgoing messages in order to infer the relevant information from the transmitted recommendations, i.e., eavesdropping, but doesn’t harm the system. A peer can be in passive mode and later in active mode.

- Active attack: when a malicious peer receives a recommendation for forwarding, it can modify it, or when requested to provide recommendations on another peer, it can inflate the rating or bad-mouth the peer. Bad-mouthing is a situation where a malicious peer may collude with other malicious peers to take revenge on an honest peer. In the Sybil attack, a malicious peer generates a large number of identities and uses them together to disrupt normal operation.

SECURITY AND PERFORMANCE:

We evaluate the performance of the proposed SybilTrust. We measure two metrics, namely, non-trustworthy rate and detection rate. Non-trustworthy rate is the ratio of the number of honest peers which are erroneously marked as Sybil/malicious peers to the total number of honest peers. Detection rate is the proportion of detected Sybil/malicious peers to the total number of Sybil/malicious peers.

Communication Cost. The trust level is sent with the recommendation feedback from one peer to another. If a peer is compromised, the information is broadcast to all peers as revocation of the trust level is being done.

Computation Cost. The SybilTrust approach is efficient in the computation of polynomial evaluation. The calculation of the trust level evaluation is based on a pseudo-random function (PRF). A PRF is a deterministic function.
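
The two metrics defined above are plain ratios and can be computed as in the sketch below. The function and parameter names are illustrative, and the numbers in the usage note are made up, not results from the paper.

```python
def non_trustworthy_rate(honest_marked_sybil: int, total_honest: int) -> float:
    """Honest peers wrongly marked as Sybil/malicious, over all honest peers."""
    if total_honest <= 0:
        raise ValueError("total_honest must be positive")
    return honest_marked_sybil / total_honest


def detection_rate(detected_sybil: int, total_sybil: int) -> float:
    """Detected Sybil/malicious peers, over all Sybil/malicious peers."""
    if total_sybil <= 0:
        raise ValueError("total_sybil must be positive")
    return detected_sybil / total_sybil
```

As a hypothetical example, 5 wrongly marked peers out of 100 honest peers gives a non-trustworthy rate of 0.05, and 45 detected out of 50 Sybil peers gives a detection rate of 0.9.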

In our simulation, we use the C# .NET tool. Each honest and malicious peer interacted with a random number of peers defined by a uniform distribution. All the peers are restricted to the group. In our approach, the P2P e-commerce community has a total of 3 different categories of interest. The transaction interactions between peers with similar interest can be defined as successful or unsuccessful, expressed as positive or negative respectively. The impact of the first two parameters on the performance of the mechanism is evaluated; the percentage of replies is randomly chosen by each malicious peer. Transactions with 10 to 40 percent malicious peers are done.

Our SybilTrust approach detects more malicious peers compared to Eigen Trust and Eigen Group Trust [26], as shown in Fig. 4. Fig. 4 shows the detection rates of the P2P network when the number of malicious peers increases. When the number of deployed peers is small, e.g., 40 peers, the chance that no peers are around a malicious peer is high. Fig. 4 illustrates the variation of non-trustworthy rates of different numbers of honest peers as the number of malicious peers increases. It is shown that the non-trustworthy rate increases as the number of honest peers and malicious peers increases. The reason is that when there are more malicious peers, the number of target groups is larger. Moreover, this is because neighbor relationship is used to categorize peers in the proposed approach.

The number of target groups also increases when the number of honest peers is higher. As a result, the honest peers are examined more times, and the chance that an honest peer is erroneously determined as a Sybil/malicious peer increases, although more Sybil attack peers can also be identified. Fig. 4 displays the detection rate when the reply rate of each malicious peer is the same. The detection rate does not decrease when the reply rate is more than 80 percent, because of the enhancement.

The enhancement could still be found even when a malicious peer replies to almost all of its Sybil attack peer requests. Furthermore, the detection rate is higher as the number of malicious peers grows, which means the proposed mechanism is able to resist the Sybil attack from more malicious peers. The detection rate is still more than 80 percent in the sparse network, and it reaches 95 percent when the number of legitimate nodes is 300. It is also because the number of target groups increases as the number of malicious peers increases and the honest peers are examined more times. Therefore, the rate at which an honest peer is erroneously identified as a Sybil/malicious peer also increases.

CHAPTER 5

5.0 SYSTEM STUDY:

5.1 FEASIBILITY STUDY:

Three key considerations involved in the feasibility analysis are

ECONOMICAL FEASIBILITY
TECHNICAL FEASIBILITY
SOCIAL FEASIBILITY

5.1.1 ECONOMICAL FEASIBILITY:

5.1.2 TECHNICAL FEASIBILITY:

5.1.3 SOCIAL FEASIBILITY:

5.2 SYSTEM TESTING:

Testing is a process of checking whether the developed system is working according to the original objectives and requirements. It is a set of activities that can be planned in advance and conducted systematically. Testing is vital to the success of the system. System testing makes a logical assumption that if all the parts of the system are correct, the overall goal will be successfully achieved. Inadequate testing or no testing leads to errors that may not appear until many months later. This creates two problems: the time lag between the cause and the appearance of the problem, and the effect of the system errors on the files and records within the system. A small system error can conceivably explode into a much larger problem. Effective testing early in the process translates directly into long-term cost savings from a reduced number of errors. Another reason for system testing is its utility as a user-oriented vehicle before implementation. The best programs are worthless if they do not produce the correct outputs.

5.2.1 UNIT TESTING:

Description	Expected result
Test for application window properties.	All the properties of the windows are to be properly aligned and displayed.
Test for mouse operations.	All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions.

5.1.3 FUNCTIONAL TESTING:

Description	Expected result
Test for all modules.	All peers should communicate in the group.
Test for various peers in a distributed network framework, which should display all users available in the group.	The result after execution should be accurate.

5.1.4 NON-FUNCTIONAL TESTING:

Load testing
Performance testing
Usability testing
Reliability testing
Security testing

5.1.5 LOAD TESTING:

Description	Expected result
It is necessary to ascertain that the application behaves correctly under load when a ‘Server busy’ response is received.	The application should designate another active node as a server.

5.1.5 PERFORMANCE TESTING:

Description	Expected result
This is required to assure that the application performs adequately, having the capability to handle many peers, delivering its results in the expected time and using an acceptable level of resources. It is an aspect of operational management.	The application should handle large input values and produce accurate results in the expected time.

5.1.6 RELIABILITY TESTING:

Description	Expected result
This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in providing the application.	In case of failure of the server, an alternate server should take over the job.

5.1.7 SECURITY TESTING:

Description	Expected result
Check that the user identification is authenticated.	In case of failure, the user should not be connected to the framework.
Check whether group keys in a tree are shared by all peers.	All peers in the same group should know the group key.

5.1.8 WHITE BOX TESTING:

Description	Expected result
Exercise all logical decisions on their true and false sides.	All the logical decisions must be valid.
Execute all loops at their boundaries and within their operational bounds.	All the loops must be finite.
Exercise internal data structures to ensure their validity.	All the data structures must be valid.
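
The white box checks in the table above can be illustrated on a small function. This is a minimal sketch with a hypothetical function `clamp_sum`, not code from the project: the assertions exercise the loop at its boundaries (zero, one, and many iterations) and the decision on both its true and false sides.

```python
def clamp_sum(values, limit):
    """Sum values in order, stopping once the running total reaches limit."""
    total = 0
    for v in values:
        if total >= limit:  # decision with a true side and a false side
            break
        total += v
    return total


# Loop boundaries: zero iterations, exactly one iteration, several iterations.
assert clamp_sum([], 10) == 0
assert clamp_sum([4], 10) == 4
assert clamp_sum([4, 4, 4, 4], 10) == 12  # stops after the third element

# Decision true on the very first check: nothing is added.
assert clamp_sum([1, 2], 0) == 0
```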

5.1.9 BLACK BOX TESTING:

Description	Expected result
To check for incorrect or missing functions.	All the functions must be valid.
To check for interface errors.	The entire interface must function normally.
To check for errors in data structures or external database access.	Database updates and retrieval must be performed correctly.
To check for initialization and termination errors.	All the functions and data structures must be initialized properly and terminated normally.

All the above system testing strategies are carried out, as the development, documentation and institutionalization of the proposed goals and related policies are essential.

CHAPTER 7

7.0 SOFTWARE SPECIFICATION:

7.1 FEATURES OF .NET:

Microsoft .NET is a set of Microsoft software technologies for rapidly building and integrating XML Web services, Microsoft Windows-based applications, and Web solutions. The .NET Framework is a language-neutral platform for writing programs that can easily and securely interoperate. There’s no language barrier with .NET: there are numerous languages available to the developer, including Managed C++, C#, Visual Basic and JScript.

The .NET framework provides the foundation for components to interact seamlessly, whether locally or remotely on different platforms. It standardizes common data types and communications protocols so that components created in different languages can easily interoperate.

“.NET” is also the collective name given to various software components built upon the .NET platform. These will be both products (Visual Studio .NET and Windows .NET Server, for instance) and services (like Passport, .NET My Services, and so on).

7.2 THE .NET FRAMEWORK

The .NET Framework has two main parts:

1. The Common Language Runtime (CLR).

2. A hierarchical set of class libraries.

The CLR is described as the “execution engine” of .NET. It provides the environment within which programs run. The most important features are

Conversion from a low-level assembler-style language, called Intermediate Language (IL), into code native to the platform being executed on.
Memory management, notably including garbage collection.
Checking and enforcing security restrictions on the running code.
Loading and executing programs, with version control and other such features.

The following features of the .NET framework are also worth description:

Managed Code

The code that targets .NET, and which contains certain extra Information – “metadata” – to describe itself. Whilst both managed and unmanaged code can run in the runtime, only managed code contains the information that allows the CLR to guarantee, for instance, safe execution and interoperability.

Managed Data

With Managed Code comes Managed Data. The CLR provides memory allocation and deallocation facilities, and garbage collection. Some .NET languages use Managed Data by default, such as C#, Visual Basic.NET and JScript.NET, whereas others, namely C++, do not. Targeting the CLR can, depending on the language you’re using, impose certain constraints on the features available. As with managed and unmanaged code, one can have both managed and unmanaged data in .NET applications – data that doesn’t get garbage collected but instead is looked after by unmanaged code.

Common Type System

The CLR uses the Common Type System (CTS) to strictly enforce type-safety. This ensures that all classes are compatible with each other, by describing types in a common way. The CTS defines how types work within the runtime, which enables types in one language to interoperate with types in another language, including cross-language exception handling. As well as ensuring that types are only used in appropriate ways, the runtime also ensures that code doesn’t attempt to access memory that hasn’t been allocated to it.

Common Language Specification

The CLR provides built-in support for language interoperability. To ensure that you can develop managed code that can be fully used by developers using any programming language, a set of language features and rules for using them called the Common Language Specification (CLS) has been defined. Components that follow these rules and expose only CLS features are considered CLS-compliant.

7.3 THE CLASS LIBRARY

.NET provides a single-rooted hierarchy of classes, containing over 7000 types. The root of the namespace is called System; this contains basic types like Byte, Double, Boolean, and String, as well as Object. All objects derive from System.Object. As well as objects, there are value types. Value types can be allocated on the stack, which can provide useful flexibility. There are also efficient means of converting value types to object types if and when necessary.

The set of classes is pretty comprehensive, providing collections, file, screen, and network I/O, threading, and so on, as well as XML and database connectivity.

The class library is subdivided into a number of sets (or namespaces), each providing distinct areas of functionality, with dependencies between the namespaces kept to a minimum.

7.4 LANGUAGES SUPPORTED BY .NET

The multi-language capability of the .NET Framework and Visual Studio .NET enables developers to use their existing programming skills to build all types of applications and XML Web services. The .NET framework supports new versions of Microsoft’s old favorites Visual Basic and C++ (as VB.NET and Managed C++), but there are also a number of new additions to the family.

Visual Basic .NET has been updated to include many new and improved language features that make it a powerful object-oriented programming language. These features include inheritance, interfaces, and overloading, among others. Visual Basic also now supports structured exception handling, custom attributes and also supports multi-threading.

Visual Basic .NET is also CLS compliant, which means that any CLS-compliant language can use the classes, objects, and components you create in Visual Basic .NET.

Managed Extensions for C++ and attributed programming are just some of the enhancements made to the C++ language. Managed Extensions simplify the task of migrating existing C++ applications to the new .NET Framework.

C# is Microsoft’s new language. It’s a C-style language that is essentially “C++ for Rapid Application Development”. Unlike other languages, its specification is just the grammar of the language. It has no standard library of its own, and instead has been designed with the intention of using the .NET libraries as its own.

Microsoft Visual J# .NET provides the easiest transition for Java-language developers into the world of XML Web Services and dramatically improves the interoperability of Java-language programs with existing software written in a variety of other programming languages.

ActiveState has created Visual Perl and Visual Python, which enable .NET-aware applications to be built in either Perl or Python. Both products can be integrated into the Visual Studio .NET environment. Visual Perl includes support for ActiveState’s Perl Dev Kit.

Other languages for which .NET compilers are available include

FORTRAN
COBOL
Eiffel

Fig. 1: .NET Framework layers, from top to bottom: ASP.NET XML Web Services | Windows Forms; Base Class Libraries; Common Language Runtime; Operating System.

C#.NET is also compliant with the CLS (Common Language Specification) and supports structured exception handling. The CLS is a set of rules and constructs that are supported by the CLR (Common Language Runtime). The CLR is the runtime environment provided by the .NET Framework; it manages the execution of the code and also makes the development process easier by providing services.

Because C#.NET is a CLS-compliant language, any objects, classes, or components that are created in C#.NET can be used in any other CLS-compliant language. In addition, we can use objects, classes, and components created in other CLS-compliant languages in C#.NET. The use of the CLS ensures complete interoperability among applications, regardless of the languages used to create the application.

CONSTRUCTORS AND DESTRUCTORS:

Constructors are used to initialize objects, whereas destructors are used to destroy them. In other words, destructors are used to release the resources allocated to the object. In C#.NET, a destructor (also called a finalizer) is available, which corresponds to the Finalize method. The finalizer is used to complete the tasks that must be performed when an object is destroyed. It is called automatically when an object is destroyed. In addition, the finalizer cannot be called explicitly; it is invoked only by the runtime.

GARBAGE COLLECTION

Garbage Collection is another new feature in C#.NET. The .NET Framework monitors allocated resources, such as objects and variables. In addition, the .NET Framework automatically releases memory for reuse by destroying objects that are no longer in use.

In C#.NET, the garbage collector checks for the objects that are not currently in use by applications. When the garbage collector comes across an object that is marked for garbage collection, it releases the memory occupied by the object.

OVERLOADING

Overloading is another feature in C#. Overloading enables us to define multiple procedures with the same name, where each procedure has a different set of arguments. Besides using overloading for procedures, we can use it for constructors and properties in a class.

MULTITHREADING:

C#.NET also supports multithreading. An application that supports multithreading can handle multiple tasks simultaneously. We can use multithreading to decrease the time taken by an application to respond to user interaction.

STRUCTURED EXCEPTION HANDLING

C#.NET supports structured exception handling, which enables us to detect and handle errors at runtime. In C#.NET, we use try…catch…finally statements to create exception handlers. Using try…catch…finally statements, we can create robust and effective exception handlers to improve the robustness of our application.

7.5 THE .NET FRAMEWORK

The .NET Framework is a new computing platform that simplifies application development in the highly distributed environment of the Internet.

OBJECTIVES OF .NET FRAMEWORK

1. To provide a consistent object-oriented programming environment whether object code is stored and executed locally, executed locally but Internet-distributed, or executed remotely.

2. To provide a code-execution environment that minimizes software deployment and versioning conflicts and guarantees safe execution of code.

3. To provide a code-execution environment that eliminates the performance problems of scripted or interpreted environments.

There are different types of application, such as Windows-based applications and Web-based applications.

7.6 FEATURES OF SQL-SERVER

The OLAP Services feature available in SQL Server version 7.0 is now called SQL Server 2000 Analysis Services. The term OLAP Services has been replaced with the term Analysis Services. Analysis Services also includes a new data mining component. The Repository component available in SQL Server version 7.0 is now called Microsoft SQL Server 2000 Meta Data Services. References to the component now use the term Meta Data Services. The term repository is used only in reference to the repository engine within Meta Data Services.

A SQL-SERVER database consists of the following types of objects:

1. TABLE

2. QUERY

3. FORM

4. REPORT

5. MACRO

7.7 TABLE:

A database is a collection of data about a specific topic.

VIEWS OF TABLE:

We can work with a table in two views:

1. Design View

2. Datasheet View

Design View

To build or modify the structure of a table, we work in the table design view. We can specify what kind of data will be held.

Datasheet View

To add, edit or analyze the data itself, we work in the table’s datasheet view.

QUERY:

A query is a question that is asked of the data. Access gathers data that answers the question from one or more tables. The data that make up the answer is either a dynaset (if it can be edited) or a snapshot (if it cannot be edited). Each time we run a query, we get the latest information in the dynaset. Access either displays the dynaset or snapshot for us to view, or performs an action on it, such as deleting or updating.

CHAPTER 7

APPENDIX

7.1 SAMPLE SOURCE CODE

7.2 SAMPLE OUTPUT

CHAPTER 8

8.0 CONCLUSION AND FUTURE:

We presented SybilTrust, a defense against Sybil attack in P2P e-commerce. Compared to other approaches, our approach is based on neighborhood similarity trust in a group P2P e-commerce community. This approach exploits the relationship between peers in a neighborhood setting. Our results on real-world P2P e-commerce confirmed the fast-mixing property and hence validated the fundamental assumption behind SybilGuard’s approach. We also describe defense types such as key validation, distribution, and position verification. These methods can be applied simultaneously with neighbor similarity trust, which gives a better defense mechanism. For future work, we intend to implement SybilTrust within the context of peers which exist in many groups. Neighbor similarity trust helps to weed out the Sybil peers and isolate maliciousness to specific Sybil peer groups rather than allow attacks in honest groups with all honest peers.

Maximizing P2P File Access Availability in Mobile Ad Hoc Networks though Replication for Efficient Fi

05/08/201902/07/2019 by admin

File sharing applications in mobile ad hoc networks (MANETs) have attracted more and more attention in recent years. The efficiency of file querying suffers from the distinctive properties of such networks including node mobility and limited communication range and resource. An intuitive method to alleviate this problem is to create file replicas in the network. However, despite the efforts on file replication, no research has focused on the global optimal replica creation with minimum average querying delay.

Specifically, current file replication protocols in mobile ad hoc networks have two shortcomings. First, they lack a rule to allocate limited resources to different files in order to minimize the average querying delay. Second, they simply consider storage as available resources for replicas, but neglect the fact that the file holders’ frequency of meeting other nodes also plays an important role in determining file availability. Actually, a node that has a higher meeting frequency with others provides higher availability to its files. This becomes even more evident in sparsely distributed MANETs, in which nodes meet disruptively.

In this paper, we introduce a new concept of resource for file replication, which considers both node storage and node meeting ability. We theoretically study the influence of resource allocation on the average querying delay and derive an optimal file replication rule (OFRR) that allocates resources to each file based on its popularity and size. We then propose a file replication protocol based on the rule, which approximates the minimum global querying delay in a fully distributed manner. Our experiment and simulation results show the superior performance of the proposed protocol in comparison with other representative replication protocols.

1.2 INTRODUCTION

With the increasing popularity of mobile devices, e.g., smartphones and laptops, we envision future MANETs consisting of these mobile devices. By MANETs, we refer to both normal MANETs and disconnected MANETs, also known as delay tolerant networks (DTNs). The former has a relatively dense node distribution in an area while the latter has sparsely distributed nodes that meet each other opportunistically. On the other hand, the emergence of mobile file sharing applications motivates the investigation of peer-to-peer (P2P) file sharing over such MANETs. The local P2P file sharing model provides three advantages. First, it enables file sharing when no base stations are available (e.g., in rural areas). Second, with the P2P architecture, the bottleneck on overloaded servers in current client-server based file sharing systems can be avoided. Third, it exploits otherwise wasted peer-to-peer communication opportunities among mobile nodes. As a result, nodes can freely and unobtrusively access and share files in the distributed MANET environment, which can possibly support interesting applications.

For example, mobile nodes can share files based on users’ proximity in the same building or in a local community. Tourists can share their travel experiences or emergency information with other tourists through digital devices directly even when no base station is available in remote areas. Drivers can share road information through the vehicle-to-vehicle communication. However, the distinctive properties of MANETs, i.e., node mobility, limited communication range and resource, have rendered many difficulties in realizing such a P2P file sharing system. For example, file searching turns out to be difficult since nodes in MANETs move around freely and can exchange information only when they are within the communication range. Broadcasting can quickly discover files, but it leads to the broadcast storm problem with high energy consumption.

Probabilistic routing and file discovery protocols avoid broadcasting by forwarding a query to a node with higher probability of meeting the destination. But the opportunistic encountering of nodes in MANETs makes file searching and retrieval non-deterministic. File replication is an effective way to enhance file availability and reduce file querying delay. It creates replicas for a file to improve its probability of being encountered by requests. Unfortunately, it is impractical and inefficient to enable every node to hold the replicas of all files in the system considering limited node resources. Also, file querying delay is always a main concern in a file sharing system. Users often desire to receive their requested files quickly no matter whether the files are popular or not. Thus, a critical issue is raised for further investigation: how to allocate the limited resource in the network to different files for replication so that the overall average file querying delay is minimized? Recently, a number of file replication protocols have been proposed for MANETs. In these protocols, each individual node replicates files it frequently queries or a group of nodes create one replica for each file they frequently query. In the former, redundant replicas are easily created in the system, thereby wasting resources.

In the latter, though redundant replicas are reduced by group based cooperation, neighboring nodes may separate from each other due to node mobility, leading to large query delay. There are also some works addressing content caching in disconnected MANETs/DTNs for efficient data retrieval or message routing. They basically cache data that are frequently queried on places that are visited frequently by mobile nodes. Both categories of replication methods fail to thoroughly consider that a node’s mobility affects the availability of its files. In spite of these efforts, current file replication protocols lack a rule to allocate limited resources to files for replica creation in order to achieve the minimum average querying delay, i.e., global search efficiency optimization under limited resources. They simply consider storage as the resource for replicas, but neglect that a node’s frequency of meeting other nodes (meeting ability in short) also influences the availability of its files. Files in a node with a higher meeting ability have higher availability.

1.3 LITERATURE SURVEY

CONTACT DURATION AWARE DATA REPLICATION IN DELAY TOLERANT NETWORKS

AUTHOR: X. Zhuo, Q. Li, W. Gao, G. Cao, and Y. Dai

PUBLISH: Proc. IEEE 19th Int’l Conf. Network Protocols (ICNP), 2011.

EXPLANATION:

The recent popularization of hand-held mobile devices, such as smartphones, enables the inter-connectivity among mobile users without the support of Internet infrastructure. When mobile users move and contact each other opportunistically, they form a Delay Tolerant Network (DTN), which can be exploited to share data among them. Data replication is one of the common techniques for such data sharing. However, the unstable network topology and limited contact duration in DTNs make it difficult to directly apply traditional data replication schemes. Although there are a few existing studies on data replication in DTNs, they generally ignore the contact duration limits. In this paper, we recognize the deficiency of existing data replication schemes which treat the complete data item as the replication unit, and propose to replicate data at the packet level. We analytically formulate the contact duration aware data replication problem and give a centralized solution to better utilize the limited storage buffers and the contact opportunities. We further propose a practical contact Duration Aware Replication Algorithm (DARA) which operates in a fully distributed manner and reduces the computational complexity. Extensive simulations on both synthetic and realistic traces show that our distributed scheme achieves close-to-optimal performance, and outperforms other existing replication schemes.

SOCIAL-BASED COOPERATIVE CACHING IN DTNS: A CONTACT DURATION AWARE APPROACH

AUTHOR: X. Zhuo, Q. Li, G. Cao, Y. Dai, B.K. Szymanski, and T.L. Porta,

PUBLISH: Proc. IEEE Eighth Int’l Conf. Mobile Adhoc and Sensor Systems (MASS), 2011.

EXPLANATION:

Data access is an important issue in Delay Tolerant Networks (DTNs), and a common technique to improve the performance of data access is cooperative caching. However, due to the unpredictable node mobility in DTNs, traditional caching schemes cannot be directly applied. In this paper, we propose DAC, a novel caching protocol adaptive to the challenging environment of DTNs. Specifically, we exploit the social community structure to combat the unstable network topology in DTNs. We propose a new centrality metric to evaluate the caching capability of each node within a community, and solutions based on this metric are proposed to determine where to cache. More importantly, we consider the impact of the contact duration limitation on cooperative caching, which has been ignored by the existing works. We prove that the marginal caching benefit that a node can provide diminishes when more data is cached. We derive an adaptive caching bound for each mobile node according to its specific contact patterns with others, to limit the amount of data it caches. In this way, both the storage space and the contact opportunities are better utilized. To mitigate the coupon collector’s problem, network coding techniques are used to further improve the caching efficiency. Extensive trace-driven simulations show that our cooperative caching protocol can significantly improve the performance of data access in DTNs.

SEDUM: EXPLOITING SOCIAL NETWORKS IN UTILITY-BASED DISTRIBUTED ROUTING FOR DTNS

AUTHOR: Z. Li and H. Shen

PUBLISH: IEEE Trans. Computers, vol. 62, no. 1, pp. 83-97, Jan. 2012.

EXPLANATION:

However, current probabilistic forwarding methods only consider node contact frequency in calculating the utility while neglecting the influence of contact duration on the throughput, though both contact frequency and contact duration reflect the node movement pattern in a social network. In this paper, we theoretically prove that considering both factors leads to higher throughput than considering only contact frequency. To fully exploit a social network for high throughput and low routing delay, we propose a Social network oriented and duration utility-based distributed multicopy routing protocol (SEDUM) for DTNs. SEDUM is distinguished by three features. First, it considers both contact frequency and duration in node movement patterns of social networks. Second, it uses multicopy routing and can discover the minimum number of copies of a message to achieve a desired routing delay. Third, it has an effective buffer management mechanism to increase throughput and decrease routing delay. Theoretical analysis and simulation results show that SEDUM provides high throughput and low routing delay compared to existing routing approaches. The results conform to our expectation that considering both contact frequency and duration for delivery utility in routing can achieve higher throughput than considering only contact frequency, especially in a highly dynamic environment with large routing messages.

CHAPTER 2

2.0 SYSTEM ANALYSIS

2.1 EXISTING SYSTEM:

This work focuses on Delay Tolerant Networks (DTNs) in a social network environment. DTNs do not have a complete path from a source to a destination most of the time. Previous data routing approaches in DTNs are primarily based on either flooding or single-copy routing. However, these methods incur either high overhead due to excessive transmissions or long delays due to suboptimal choices for relay nodes. Probabilistic forwarding that forwards a message to a node with a higher delivery utility enhances single-copy routing.

File sharing applications in mobile ad hoc networks (MANETs) have attracted more and more attention. The efficiency of file querying suffers from the distinctive properties of MANETs, including node mobility and limited communication range and resource. An intuitive method to alleviate this problem is to create file replicas in the network. However, despite the efforts on file replication, no research has focused on the global optimal replica sharing with minimum average querying delay. Communication links between mobile nodes are transient, and network maintenance overhead is a major performance bottleneck for data transmission. Low node density makes it difficult to establish an end-to-end connection, thus impeding a continuous end-to-end path between a source and a destination.

DTNs were originally designed for communication in outer space, but the technology is now directly accessible from our pockets. Considering both the characteristics of MANETs and the requirements of P2P file sharing, we use an application layer overlay network. We port a DTN-type solution into an infrastructure-less environment like MANETs and leverage peer mobility to reach data in other disconnected networks. This is done by implementing an asynchronous communication model, store-delegate-and-forward, like DTNs, where a peer can delegate unaccomplished file download or query tasks to special peers. To improve data transmission performance while reducing communication overhead, we select these special peers by the expectation of encountering them again in the future and assign them different download starting points on the file.

2.1.1 DISADVANTAGES:

Limited communication range and resource have rendered many difficulties in realizing such a P2P file sharing system. For example, file searching turns out to be difficult since nodes in MANETs move around freely and can exchange information only when they are within the communication range.

The disadvantage is that it lacks transparency: a received URL explicitly points to a certain data replica, so the browser becomes aware of the switching between the different machines.
As for scalability, clients must always contact the same single service machine, which can become a bottleneck as the number of clients increases.

2.2 PROPOSED SYSTEM:

We propose a distributed file replication protocol that can approximately realize the optimal file replication rule (OFRR) with the two mobility models in a distributed manner. Since the OFRR in the two mobility models (i.e., Equations (22) and (28)) has the same form, we present the protocol in this section without indicating the specific mobility model. We first introduce the challenges to realize the OFRR and our solutions. We then propose a replication protocol to realize OFRR and analyze the effect of the protocol.

We propose the priority competition and split file replication protocol (PCS). We first introduce how a node retrieves the parameters needed in PCS and then present the details of PCS. We then briefly prove the effectiveness of PCS. We refer to the process in which a node tries to copy a file to its neighbors as one round of replica distribution. Recall that when a replica is created for a file with priority P, the two copies will replicate files with priority P/2 in the next round. This means that the creation of replicas will not increase the overall P of the file. Also, after each round, the priority value of each file or replica is updated based on the received requests for the file.
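
The priority split described above can be checked with a few lines of code. This is a minimal sketch, not the PCS implementation: it assumes, for illustration only, that every copy succeeds in creating one replica in every round, and the function name `replicate` is hypothetical.

```python
def replicate(priorities):
    """Simulate one round in which every copy creates one replica.

    Each copy with priority P is replaced by two copies with priority P/2,
    so the sum of priorities is unchanged.
    """
    next_round = []
    for p in priorities:
        next_round.extend([p / 2, p / 2])
    return next_round


copies = [8.0]  # a single original file with an assumed priority of 8
for _ in range(3):
    copies = replicate(copies)

print(len(copies), sum(copies))
```

After three rounds there are eight copies, yet their priorities still sum to the original value, which is the invariant the text relies on.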

Then, though some replicas may be deleted in the competition, the total amount of requests for the file remains stable, making the sum of the Ps of all replicas and the original file roughly equal to the overall priority value of the file. Then, we can regard the replicas of a file as an entity that competes for available resource in the system with accumulated priority P in each round. Therefore, in each round of replica distribution, based on our design of PCS, the overall probability of creating a replica for an original file

2.2.1 ADVANTAGES:

The community-based mobility model has been used in content dissemination or routing algorithms for disconnected MANETs/DTNs to depict node mobility. In this model, the entire test area is split into different sub-areas, denoted as caves. Each cave holds one community.

In the RWP model, we can assume that the inter-meeting time among nodes follows an exponential distribution. Then, the probability of meeting a node is independent of the previously encountered node. Therefore, we define the meeting ability of a node as the average number of nodes it meets in a unit time and use it to investigate the optimal file replication.
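
The definition of meeting ability above can be sketched as follows. This is an illustrative simulation under stated assumptions, not the experiment code: the meeting rate of 2.0 per unit time, the observation window, and the function names are all hypothetical.

```python
import random


def meeting_ability(num_meetings, duration):
    """Average number of nodes met per unit time."""
    if duration <= 0:
        raise ValueError("duration must be positive")
    return num_meetings / duration


def simulate_meetings(rate, duration, seed=0):
    """Count meetings in [0, duration] with exponential inter-meeting times."""
    rng = random.Random(seed)
    t, count = 0.0, 0
    while True:
        t += rng.expovariate(rate)  # time until the next meeting
        if t > duration:
            return count
        count += 1


count = simulate_meetings(rate=2.0, duration=1000.0)
estimate = meeting_ability(count, 1000.0)
print(estimate)
```

With exponentially distributed inter-meeting times, the estimate over a long window should land close to the assumed rate.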

For PCS, we used two routing protocols in the experiments. We first used the Static Wait protocol in the GENI experiment, in which each query stays on the source node waiting for the destination. We then used a probabilistic routing protocol (PROPHET) in which a node routes requests to the neighbor with the highest meeting ability.
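
The two forwarding rules described above can be contrasted in a short sketch. This is not the full PROPHET protocol, only the selection rule as stated in the text; the function names and the neighbor table are hypothetical.

```python
def static_wait(neighbor_abilities):
    """Static Wait: the query stays on the source node, so no next hop is chosen."""
    return None


def highest_meeting_ability(neighbor_abilities):
    """Pick the neighbor with the highest meeting ability, or None if there is none."""
    if not neighbor_abilities:
        return None
    return max(neighbor_abilities, key=neighbor_abilities.get)


# Hypothetical neighbor table: node id -> meeting ability (meetings per unit time).
neighbors = {"n1": 0.5, "n2": 2.0, "n3": 1.2}
print(static_wait(neighbors), highest_meeting_ability(neighbors))
```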

2.3 HARDWARE & SOFTWARE REQUIREMENTS:

2.3.1 HARDWARE REQUIREMENT:

- Processor – Pentium IV
- Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Key Board – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA

2.3.2 SOFTWARE REQUIREMENTS:

Operating System : Windows XP or Win7
Front End : Java JDK 1.7
Tools : NetBeans 7
Script : JavaScript
Document : MS-Office 2007

CHAPTER 3

3.0 SYSTEM DESIGN:

Data Flow Diagram / Use Case Diagram / Flow Diagram:

The DFD is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on these data, and the output data generated by the system.

The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components. These components are the system process, the data used by the process, an external entity that interacts with the system and the information flows in the system.

DFD shows how the information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.

A DFD may be used to represent a system at any level of abstraction. DFDs may be partitioned into levels that represent increasing information flow and functional detail.

NOTATION:

SOURCE OR DESTINATION OF DATA:

External sources or destinations, which may be people or organizations or other entities

DATA STORE:

Here the data referenced by a process is stored and retrieved.

PROCESS:

People, procedures, or devices that produce data. The physical component is not identified.

DATA FLOW:

Data moves in a specific direction from an origin to a destination. The data flow is a “packet” of data.

MODELING RULES:

There are several common modeling rules when creating DFDs:

All processes must have at least one data flow in and one data flow out.
All processes should modify the incoming data, producing new forms of outgoing data.
Each data store must be involved with at least one data flow.
Each external entity must be involved with at least one data flow.
A data flow must be attached to at least one process.
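The structural rules above can be checked mechanically. The following is a minimal sketch, assuming a DFD is given as three sets of names plus a list of `(source, target)` flows; `check_dfd` and the sample node names are hypothetical. The rule that a process should modify its incoming data is a semantic one and is not checked here.

```python
def check_dfd(processes, stores, entities, flows):
    """Return a list of violations of the structural DFD modeling rules.

    flows: list of (source, target) node names.
    """
    problems = []
    for p in sorted(processes):
        if not any(target == p for _, target in flows):
            problems.append(f"process {p!r} has no incoming flow")
        if not any(source == p for source, _ in flows):
            problems.append(f"process {p!r} has no outgoing flow")
    for node in sorted(stores) + sorted(entities):
        if not any(node in flow for flow in flows):
            problems.append(f"{node!r} is not involved in any flow")
    for source, target in flows:
        if source not in processes and target not in processes:
            problems.append(f"flow {source!r} -> {target!r} touches no process")
    return problems


# Hypothetical diagram: a node feeds a replication process that writes a store.
good = check_dfd({"Replicate"}, {"Replica table"}, {"Node"},
                 [("Node", "Replicate"), ("Replicate", "Replica table")])
bad = check_dfd({"Replicate"}, {"Replica table"}, {"Node"},
                [("Node", "Replica table")])
print(good)
print(bad)
```

An empty list means the diagram passes the structural rules; each string in the second result names one broken rule.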

3.1 ARCHITECTURE DIAGRAM

3.2 DATAFLOW DIAGRAM

UML DIAGRAMS:

3.2 USE CASE DIAGRAM:

3.3 CLASS DIAGRAM:

3.4 SEQUENCE DIAGRAM:

3.5 ACTIVITY DIAGRAM:

CHAPTER 4

4.0 IMPLEMENTATION:

OFRR PROTOCOL:

4.1 ALGORITHM

PSEUDO-CODE FOR PCS ALGORITHM:

4.2 MODULES:

DELAY TOLERANT NETWORKS (DTNS):

P2P FILE SHARING IN MANETS:

MANETS WITH RWP MODEL:

DISTRIBUTED FILE REPLICATION:

EXPERIMENTAL RESULTS:

REPLICA COST:

REPLICA DISTRIBUTION:

AVERAGE DELAY:

4.3 MODULE DESCRIPTION:

DELAY TOLERANT NETWORKS (DTNS):

P2P FILE SHARING IN MANETS:

MANETS WITH RWP MODEL:

DISTRIBUTED FILE REPLICATION:

EXPERIMENTAL RESULTS:

REPLICA COST:

REPLICA DISTRIBUTION:

AVERAGE DELAY:

CHAPTER 8

8.1 CONCLUSION & FUTURE:

In this paper, we investigated the problem of how to allocate limited resources for file replication for the purpose of global optimal file searching efficiency in MANETs. Unlike previous protocols that only consider storage as resources, we also consider file holder’s ability to meet nodes as available resources since it also affects the availability of files on the node. We first theoretically analyzed the influence of replica distribution on the average querying delay under constrained available resources with two mobility models, and then derived an optimal replication rule that can allocate resources to file replicas with minimal average querying delay.

Finally, we designed the priority competition and split replication protocol (PCS), which realizes the optimal replication rule in a fully distributed manner. Extensive experiments on the GENI testbed, NS-2, and an event-driven simulator with real traces and synthesized mobility confirm both the correctness of our theoretical analysis and the effectiveness of PCS in MANETs. In this study, we focus on a static set of files in the network. In our future work, we will theoretically analyze a more complex environment including file dynamics (file addition and deletion, file timeout) and dynamic node querying patterns.

k-Nearest Neighbor Classification over Semantically Secure Encrypted Relational Data

05/08/201902/07/2019 by admin

k-Nearest Neighbor Classification over Semantically Secure Encrypted Relational Data

Bharath K. Samanthula, Member, IEEE, Yousef Elmehdwi, and Wei Jiang, Member, IEEE

Abstract:

Data Mining has wide applications in many areas such as banking, medicine, scientific research and among government agencies. Classification is one of the commonly used tasks in data mining applications. For the past decade, due to the rise of various privacy issues, many theoretical and practical solutions to the classification problem have been proposed under different security models. However, with the recent popularity of cloud computing, users now have the opportunity to outsource their data, in encrypted form, as well as the data mining tasks to the cloud. Since the data on the cloud is in encrypted form, existing privacy-preserving classification techniques are not applicable. In this paper, we focus on solving the classification problem over encrypted data. In particular, we propose a secure k-NN classifier over encrypted data in the cloud. The proposed protocol protects the confidentiality of data, privacy of user's input query, and hides the data access patterns. To the best of our knowledge, our work is the first to develop a secure k-NN classifier over encrypted data under the semi-honest model. Also, we empirically analyze the efficiency of our proposed protocol using a real-world dataset under different parameter settings.

Index Terms: Security, k-NN classifier, outsourced databases, encryption

1 INTRODUCTION

Recently, the cloud computing paradigm [1] is revolutionizing the organizations' way of operating their data, particularly in the way they store, access and process data. As an emerging computing paradigm, cloud computing attracts many organizations to consider seriously regarding cloud potential in terms of its cost-efficiency, flexibility, and offload of administrative overhead. Most often, organizations delegate their computational operations in addition to their data to the cloud.
Despite the tremendous advantages that the cloud offers, privacy and security issues in the cloud are preventing companies from utilizing those advantages. When data are highly sensitive, the data need to be encrypted before outsourcing to the cloud. However, when data are encrypted, irrespective of the underlying encryption scheme, performing any data mining tasks becomes very challenging without ever decrypting the data. There are other privacy concerns, demonstrated by the following example.

Example 1. Suppose an insurance company outsourced its encrypted customers database and relevant data mining tasks to a cloud. When an agent from the company wants to determine the risk level of a potential new customer, the agent can use a classification method to determine the risk level of the customer. First, the agent needs to generate a data record q for the customer containing certain personal information of the customer, e.g., credit score, age, marital status, etc. Then this record can be sent to the cloud, and the cloud will compute the class label for q. Nevertheless, since q contains sensitive information, to protect the customer's privacy, q should be encrypted before sending it to the cloud.

The above example shows that data mining over encrypted data (denoted by DMED) on a cloud also needs to protect a user's record when the record is a part of a data mining process. Moreover, the cloud can also derive useful and sensitive information about the actual data items by observing the data access patterns even if the data are encrypted [2], [3].
Therefore, the privacy/security requirements of the DMED problem on a cloud are threefold: (1) confidentiality of the encrypted data, (2) confidentiality of a user's query record, and (3) hiding data access patterns.

Existing work on privacy-preserving data mining (PPDM) (either perturbation or secure multi-party computation (SMC) based approach) cannot solve the DMED problem. Perturbed data do not possess semantic security, so data perturbation techniques cannot be used to encrypt highly sensitive data. Also, the perturbed data do not produce very accurate data mining results. The secure multi-party computation based approach assumes data are distributed and not encrypted at each participating party. In addition, many intermediate computations are performed based on non-encrypted data. As a result, in this paper, we propose novel methods to effectively solve the DMED problem assuming that the encrypted data are outsourced to a cloud. Specifically, we focus on the classification problem since it is one of the most common data mining tasks. Because each classification technique has its own advantage, to be concrete, this paper concentrates on executing the k-nearest neighbor classification method over encrypted data in the cloud computing environment.

- B.K. Samanthula is with the Department of Computer Science, Purdue University, 305 N. University Street, West Lafayette, IN 47907. E-mail: bsamanth@purdue.edu.
- Y. Elmehdwi and W. Jiang are with the Department of Computer Science, Missouri University of Science and Technology, 310 CS Building, 500 West 15th St., Rolla, MO 65409. E-mail: {ymez76, wjiang}@mst.edu.

Manuscript received 23 Oct. 2013; revised 10 Sept. 2014; accepted 29 Sept. 2014. Date of publication 19 Oct. 2014; date of current version 27 Mar. 2015. Recommended for acceptance by G. Miklau. For information on obtaining reprints of this article, please send e-mail to: reprints@ieee.org, and reference the Digital Object Identifier below. Digital Object Identifier no.
10.1109/TKDE.2014.2364027
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 27, NO. 5, MAY 2015. 1041-4347 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

1.1 Problem Definition

Suppose Alice owns a database $D$ of $n$ records $t_1, \ldots, t_n$ and $m + 1$ attributes. Let $t_{i,j}$ denote the $j$th attribute value of record $t_i$. Initially, Alice encrypts her database attribute-wise, that is, she computes $E_{pk}(t_{i,j})$, for $1 \le i \le n$ and $1 \le j \le m + 1$, where column $(m + 1)$ contains the class labels. We assume that the underlying encryption scheme is semantically secure [4]. Let the encrypted database be denoted by $D'$. We assume that Alice outsources $D'$ as well as the future classification process to the cloud.

Let Bob be an authorized user who wants to classify his input record $q = \langle q_1, \ldots, q_m \rangle$ by applying the k-NN classification method based on $D'$. We refer to such a process as privacy-preserving k-NN (PPkNN) classification over encrypted data in the cloud. Formally, we define the PPkNN protocol as:

$$\mathrm{PPkNN}(D', q) \rightarrow c_q,$$

where $c_q$ denotes the class label for $q$ after applying the k-NN classification method on $D'$ and $q$.

1.2 Our Contributions

In this paper, we propose a novel PPkNN protocol, a secure k-NN classifier over semantically secure encrypted data. In our protocol, once the encrypted data are outsourced to the cloud, Alice does not participate in any computations. Therefore, no information is revealed to Alice. In addition, our protocol meets the following privacy requirements:

- Contents of $D$ or any intermediate results should not be revealed to the cloud.
- Bob's query $q$ should not be revealed to the cloud.
- $c_q$ should be revealed only to Bob.
  Also, no other information should be revealed to Bob.
- Data access patterns, such as the records corresponding to the k-nearest neighbors of $q$, should not be revealed to Bob and the cloud (to prevent any inference attacks).

We emphasize that the intermediate results seen by the cloud in our protocol are either newly generated randomized encryptions or random numbers. Thus, which data records correspond to the k-nearest neighbors and the output class label are not known to the cloud. In addition, after sending his encrypted query record to the cloud, Bob is not involved in any computations. Hence, data access patterns are further protected from Bob (see Section 5 for more details).

The rest of the paper is organized as follows. We discuss the existing related work and some concepts as a background in Section 2. A set of privacy-preserving protocols and their possible implementations are provided in Section 3. The formal security proofs for the mentioned privacy-preserving primitives are provided in Section 4. The proposed PPkNN protocol is explained in detail in Section 5. Section 6 discusses the performance of the proposed protocol under different parameter settings. We conclude the paper along with future work in Section 7.

2 RELATED WORK AND BACKGROUND

Due to space limitations, here we briefly review the existing related work and provide some definitions as a background. Please refer to our technical report [5] for a more elaborate discussion of related work and background.

At first, it seems that fully homomorphic cryptosystems (e.g., [6]) can solve the DMED problem since they allow a third party (that hosts the encrypted data) to execute arbitrary functions over encrypted data without ever decrypting them. However, we stress that such techniques are very expensive and their usage in practical applications has yet to be explored.
For example, it was shown in [7] that even for weak security parameters one "bootstrapping" operation of the homomorphic operation would take at least 30 seconds on a high performance machine.

It is possible to use the existing secret sharing techniques in SMC, such as Shamir's scheme [8], to develop a PPkNN protocol. However, our work is different from the secret sharing based solution in the following aspect. Solutions based on the secret sharing schemes require at least three parties whereas our work requires only two parties. For example, the constructions based on Sharemind [9], a well-known SMC framework which is based on the secret sharing scheme, assume that the number of participating parties is three. Thus, our work is orthogonal to Sharemind and other secret sharing based schemes.

2.1 Privacy-Preserving Data Mining

Agrawal and Srikant [10] and Lindell and Pinkas [11] were the first to introduce the notion of privacy-preserving under data mining applications. The existing PPDM techniques can broadly be classified into two categories: (i) data perturbation and (ii) data distribution. Agrawal and Srikant [10] proposed the first data perturbation technique to build a decision-tree classifier, and many other methods were proposed later (e.g., [12], [13], [14]). However, as mentioned earlier in Section 1, data perturbation techniques are not applicable to semantically secure encrypted data. Also, they do not produce accurate data mining results due to the addition of statistical noise to the data. On the other hand, Lindell and Pinkas [11] proposed the first decision tree classifier under the two-party setting assuming the data were distributed between them. Since then much work has been published using SMC techniques (e.g., [15], [16], [17]). We claim that the PPkNN problem cannot be solved using the data distribution techniques since the data in our case is encrypted and not distributed in plaintext among multiple parties.
For the same reasons, we also do not consider secure k-NN methods in which the data are distributed between two parties (e.g., [18]).

2.2 Query Processing over Encrypted Data

Various techniques related to query processing over encrypted data have been proposed, e.g., [19], [20], [21]. However, we observe that PPkNN is a more complex problem than the execution of simple kNN queries over encrypted data [22], [23]. For one, the intermediate k-nearest neighbors in the classification process should not be disclosed to the cloud or any users. We emphasize that the recent method in [23] reveals the k-nearest neighbors to the user. Second, even if we know the k-nearest neighbors, it is still very difficult to find the majority class label among these neighbors since they are encrypted in the first place to prevent the cloud from learning sensitive information. Third, the existing work did not address the access pattern issue, which is a crucial privacy requirement from the user's perspective.

In our most recent work [24], we proposed a novel secure k-nearest neighbor query protocol over encrypted data that protects data confidentiality and the user's query privacy, and hides data access patterns. However, as mentioned above, PPkNN is a more complex problem and it cannot be solved directly using the existing secure k-nearest neighbor techniques over encrypted data. Therefore, in this paper, we extend our previous work in [24] and provide a new solution to the PPkNN classifier problem over encrypted data.

More specifically, this paper is different from our preliminary work [24] in the following four aspects. First, in this paper, we introduce new security primitives, namely secure minimum (SMIN), secure minimum out of n numbers (SMIN$_n$), and secure frequency (SF), and propose new solutions for them. Second, the work in [24] did not provide any formal security analysis of the underlying sub-protocols.
On the other hand, this paper provides formal security proofs of the underlying sub-protocols as well as the PPkNN protocol under the semi-honest model. Additionally, we discuss various techniques through which the proposed PPkNN protocol can possibly be extended to a protocol that is secure under the malicious setting. Third, our preliminary work in [24] addresses only secure kNN query which is similar to Stage 1 of PPkNN. However, Stage 2 in PPkNN is entirely new. Finally, our empirical analyses in Section 6 are based on a real dataset whereas the results in [24] are based on a simulated dataset. Furthermore, new experimental results are included in this paper.

2.3 Threat Model

We adopt the security definitions in the literature of secure multi-party computation [25], [26], and there are three common adversarial models under SMC: semi-honest, covert and malicious. In this paper, to develop secure and efficient protocols, we assume that parties are semi-honest. Briefly, the following definition captures the properties of a secure protocol under the semi-honest model [27], [28].

Definition 1. Let $a_i$ be the input of party $P_i$, $\Pi_i(\pi)$ be $P_i$'s execution image of the protocol $\pi$ and $b_i$ be the output for party $P_i$ computed from $\pi$. Then, $\pi$ is secure if $\Pi_i(\pi)$ can be simulated from $a_i$ and $b_i$ such that distribution of the simulated image is computationally indistinguishable from $\Pi_i(\pi)$.

In the above definition, an execution image generally includes the input, the output and the messages communicated during an execution of a protocol. To prove a protocol is secure under semi-honest model, we generally need to show that the execution image of a protocol does not leak any information regarding the private inputs of participating parties [28].

2.4 Paillier Cryptosystem

The Paillier cryptosystem is an additive homomorphic and probabilistic public-key encryption scheme whose security is based on the Decisional Composite Residuosity Assumption [4].
Let $E_{pk}$ be the encryption function with public key $pk$ given by $(N, g)$, where $N$ is a product of two large primes of similar bit length and $g$ is a generator in $\mathbb{Z}^*_{N^2}$. Also, let $D_{sk}$ be the decryption function with secret key $sk$. For any given two plaintexts $a, b \in \mathbb{Z}_N$, the Paillier encryption scheme exhibits the following properties:

(1) Homomorphic addition.
$$D_{sk}(E_{pk}(a + b)) = D_{sk}(E_{pk}(a) * E_{pk}(b) \bmod N^2).$$
(2) Homomorphic multiplication.
$$D_{sk}(E_{pk}(a * b)) = D_{sk}(E_{pk}(a)^b \bmod N^2).$$
(3) Semantic security. The encryption scheme is semantically secure [28], [29]. Briefly, given a set of ciphertexts, an adversary cannot deduce any additional information about the plaintext(s).

For succinctness, we drop the $\bmod\ N^2$ term during homomorphic operations in the rest of this paper.

3 PRIVACY-PRESERVING PRIMITIVES

Here we present a set of generic sub-protocols that will be used in constructing our proposed k-NN protocol in Section 5. All of the below protocols are considered under two-party semi-honest setting. In particular, we assume the existence of two semi-honest parties $P_1$ and $P_2$ such that the Paillier's secret key $sk$ is known only to $P_2$ whereas $pk$ is public.

- Secure multiplication (SM). This protocol considers $P_1$ with input $(E_{pk}(a), E_{pk}(b))$ and outputs $E_{pk}(a * b)$ to $P_1$, where $a$ and $b$ are not known to $P_1$ and $P_2$. During this process, no information regarding $a$ and $b$ is revealed to $P_1$ and $P_2$.
- Secure squared Euclidean distance (SSED). In this protocol, $P_1$ with input $(E_{pk}(X), E_{pk}(Y))$ and $P_2$ with $sk$ securely compute the encryption of squared Euclidean distance between vectors $X$ and $Y$. Here $X$ and $Y$ are $m$ dimensional vectors where $E_{pk}(X) = \langle E_{pk}(x_1), \ldots, E_{pk}(x_m) \rangle$ and $E_{pk}(Y) = \langle E_{pk}(y_1), \ldots, E_{pk}(y_m) \rangle$. The output $E_{pk}(|X - Y|^2)$ will be known only to $P_1$.
- Secure bit-decomposition (SBD). Here $P_1$ with input $E_{pk}(z)$ and $P_2$ securely compute the encryptions of the individual bits of $z$, where $0 \le z < 2^l$. The output $[z] = \langle E_{pk}(z_1), \ldots, E_{pk}(z_l) \rangle$ is known only to $P_1$.
  Here $z_1$ and $z_l$ are the most and least significant bits of integer $z$, respectively.
- Secure minimum (SMIN). In this protocol, $P_1$ holds private input $(u', v')$ and $P_2$ holds $sk$, where $u' = ([u], E_{pk}(s_u))$ and $v' = ([v], E_{pk}(s_v))$. Here $s_u$ (resp., $s_v$) denotes the secret associated with $u$ (resp., $v$). The goal of SMIN is for $P_1$ and $P_2$ to jointly compute the encryptions of the individual bits of minimum number between $u$ and $v$. In addition, they compute $E_{pk}(s_{\min(u,v)})$. That is, the output is $([\min(u, v)], E_{pk}(s_{\min(u,v)}))$ which will be known only to $P_1$. During this protocol, no information regarding the contents of $u$, $v$, $s_u$, and $s_v$ is revealed to $P_1$ and $P_2$.
- Secure minimum out of n numbers (SMIN$_n$). In this protocol, we consider $P_1$ with $n$ encrypted vectors $([d_1], \ldots, [d_n])$ along with their respective encrypted secrets and $P_2$ with $sk$. Here $[d_i] = \langle E_{pk}(d_{i,1}), \ldots, E_{pk}(d_{i,l}) \rangle$ where $d_{i,1}$ and $d_{i,l}$ are the most and least significant bits of integer $d_i$ respectively, for $1 \le i \le n$. The secret of $d_i$ is given by $s_{d_i}$. $P_1$ and $P_2$ jointly compute $[\min(d_1, \ldots, d_n)]$. In addition, they compute $E_{pk}(s_{\min(d_1, \ldots, d_n)})$. At the end of this protocol, the output $([\min(d_1, \ldots, d_n)], E_{pk}(s_{\min(d_1, \ldots, d_n)}))$ is known only to $P_1$. During SMIN$_n$, no information regarding any of $d_i$'s and their secrets is revealed to $P_1$ and $P_2$.
- Secure Bit-OR (SBOR). $P_1$ with input $(E_{pk}(o_1), E_{pk}(o_2))$ and $P_2$ securely compute $E_{pk}(o_1 \vee o_2)$, where $o_1$ and $o_2$ are 2 bits. The output $E_{pk}(o_1 \vee o_2)$ is known only to $P_1$.
- Secure frequency (SF). Here $P_1$ with private input $(\langle E_{pk}(c_1), \ldots, E_{pk}(c_w) \rangle, \langle E_{pk}(c'_1), \ldots, E_{pk}(c'_k) \rangle)$ and $P_2$ securely compute the encryption of the frequency of $c_j$, denoted by $f(c_j)$, in the list $\langle c'_1, \ldots, c'_k \rangle$, for $1 \le j \le w$. Here we explicitly assume that $c_j$'s are unique and $c'_i \in \{c_1, \ldots, c_w\}$, for $1 \le i \le k$. The output $\langle E_{pk}(f(c_1)), \ldots, E_{pk}(f(c_w)) \rangle$ will be known only to $P_1$.
  During the SF protocol, no information regarding $c'_i$, $c_j$, and $f(c_j)$ is revealed to $P_1$ and $P_2$, for $1 \le i \le k$ and $1 \le j \le w$.

Now we either propose a new solution or refer to the most efficient known implementation of each of the above protocols. First of all, efficient solutions to SM, SSED, SBD and SBOR were discussed in [24]. Therefore, in this paper, we discuss the SMIN, SMIN$_n$, and SF problems in detail and propose new solutions to each one of them.

Secure minimum. In this protocol, we assume that $P_1$ holds private input $(u', v')$ and $P_2$ holds $sk$, where $u' = ([u], E_{pk}(s_u))$ and $v' = ([v], E_{pk}(s_v))$. Here $s_u$ and $s_v$ denote the secrets corresponding to $u$ and $v$, respectively. The main goal of SMIN is to securely compute the encryptions of the individual bits of $\min(u, v)$, denoted by $[\min(u, v)]$. Here $[u] = \langle E_{pk}(u_1), \ldots, E_{pk}(u_l) \rangle$ and $[v] = \langle E_{pk}(v_1), \ldots, E_{pk}(v_l) \rangle$, where $u_1$ (resp., $v_1$) and $u_l$ (resp., $v_l$) are the most and least significant bits of $u$ (resp., $v$), respectively. In addition, they compute $E_{pk}(s_{\min(u,v)})$, the encryption of the secret corresponding to the minimum value between $u$ and $v$. At the end of SMIN, the output $([\min(u, v)], E_{pk}(s_{\min(u,v)}))$ is known only to $P_1$.

We assume that $0 \le u, v < 2^l$ and propose a novel SMIN protocol. Our solution to SMIN is mainly motivated from the work of [24]. Precisely, the basic idea of the proposed SMIN protocol is for $P_1$ to randomly choose the functionality $F$ (by flipping a coin), where $F$ is either $u > v$ or $v > u$, and to obliviously execute $F$ with $P_2$. Since $F$ is randomly chosen and known only to $P_1$, the result of the functionality $F$ is oblivious to $P_2$. Based on the comparison result and chosen $F$, $P_1$ computes $[\min(u, v)]$ and $E_{pk}(s_{\min(u,v)})$ locally using homomorphic properties.

Algorithm 1. SMIN$(u', v') \rightarrow [\min(u, v)], E_{pk}(s_{\min(u,v)})$
Require: $P_1$ has $u' = ([u], E_{pk}(s_u))$ and $v' = ([v], E_{pk}(s_v))$, where $0 \le u, v < 2^l$; $P_2$ has $sk$
1: $P_1$:
   (a). Randomly choose the functionality $F$
   (b). for $i = 1$ to $l$ do:
        - $E_{pk}(u_i * v_i) \leftarrow \mathrm{SM}(E_{pk}(u_i), E_{pk}(v_i))$
        - $T_i \leftarrow E_{pk}(u_i \oplus v_i)$
        - $H_i \leftarrow H_{i-1}^{r_i} * T_i$; $r_i \in_R \mathbb{Z}_N$ and $H_0 = E_{pk}(0)$
        - $\Phi_i \leftarrow E_{pk}(-1) * H_i$
        - if $F : u > v$ then:
            $W_i \leftarrow E_{pk}(u_i) * E_{pk}(u_i * v_i)^{N-1}$
            $\Gamma_i \leftarrow E_{pk}(v_i - u_i) * E_{pk}(\hat{r}_i)$; $\hat{r}_i \in_R \mathbb{Z}_N$
          else
            $W_i \leftarrow E_{pk}(v_i) * E_{pk}(u_i * v_i)^{N-1}$
            $\Gamma_i \leftarrow E_{pk}(u_i - v_i) * E_{pk}(\hat{r}_i)$; $\hat{r}_i \in_R \mathbb{Z}_N$
        - $L_i \leftarrow W_i * \Phi_i^{r'_i}$; $r'_i \in_R \mathbb{Z}_N$
   (c). if $F : u > v$ then: $\delta \leftarrow E_{pk}(s_v - s_u) * E_{pk}(r)$
        else $\delta \leftarrow E_{pk}(s_u - s_v) * E_{pk}(r)$, where $r \in_R \mathbb{Z}_N$
   (d). $\Gamma' \leftarrow \pi_1(\Gamma)$ and $L' \leftarrow \pi_2(L)$
   (e). Send $\delta$, $\Gamma'$ and $L'$ to $P_2$
2: $P_2$:
   (a). Receive $\delta$, $\Gamma'$ and $L'$ from $P_1$
   (b). Decryption: $M_i \leftarrow D_{sk}(L'_i)$, for $1 \le i \le l$
   (c). if $\exists\, j$ such that $M_j = 1$ then $\alpha \leftarrow 1$
        else $\alpha \leftarrow 0$
   (d). if $\alpha = 0$ then:
        - $M'_i \leftarrow E_{pk}(0)$, for $1 \le i \le l$
        - $\delta' \leftarrow E_{pk}(0)$
        else
        - $M'_i \leftarrow \Gamma'_i * r^N$, where $r \in_R \mathbb{Z}_N$ and is different for $1 \le i \le l$
        - $\delta' \leftarrow \delta * r_\delta^N$, where $r_\delta \in_R \mathbb{Z}_N$
   (e). Send $M'$, $E_{pk}(\alpha)$ and $\delta'$ to $P_1$
3: $P_1$:
   (a). Receive $M'$, $E_{pk}(\alpha)$ and $\delta'$ from $P_2$
   (b). $\tilde{M} \leftarrow \pi_1^{-1}(M')$ and $\theta \leftarrow \delta' * E_{pk}(\alpha)^{N-r}$
   (c). $\lambda_i \leftarrow \tilde{M}_i * E_{pk}(\alpha)^{N-\hat{r}_i}$, for $1 \le i \le l$
   (d). if $F : u > v$ then:
        - $E_{pk}(s_{\min(u,v)}) \leftarrow E_{pk}(s_u) * \theta$
        - $E_{pk}(\min(u, v)_i) \leftarrow E_{pk}(u_i) * \lambda_i$, for $1 \le i \le l$
        else
        - $E_{pk}(s_{\min(u,v)}) \leftarrow E_{pk}(s_v) * \theta$
        - $E_{pk}(\min(u, v)_i) \leftarrow E_{pk}(v_i) * \lambda_i$, for $1 \le i \le l$

The overall steps involved in the SMIN protocol are shown in Algorithm 1. To start with, $P_1$ initially chooses the functionality $F$ as either $u > v$ or $v > u$ randomly. Then, using the SM protocol, $P_1$ computes $E_{pk}(u_i * v_i)$ with the help of $P_2$, for $1 \le i \le l$. After this, the protocol has the following key steps, performed by $P_1$ locally, for $1 \le i \le l$:

- Compute the encrypted bit-wise XOR between the bits $u_i$ and $v_i$ using the following formulation (see footnote 1):
  $$T_i = E_{pk}(u_i) * E_{pk}(v_i) * E_{pk}(u_i * v_i)^{N-2}$$
- Compute an encrypted vector $H$ by preserving the first occurrence of $E_{pk}(1)$ (if there exists one) in $T$ by initializing $H_0 = E_{pk}(0)$. The rest of the entries of $H$ are computed as $H_i = H_{i-1}^{r_i} * T_i$. We emphasize that at most one of the entries in $H$ is $E_{pk}(1)$ and the remaining entries are encryptions of either 0 or a random number.
- Then, $P_1$ computes $\Phi_i = E_{pk}(-1) * H_i$. Note that "$-1$" is equivalent to "$N - 1$" under $\mathbb{Z}_N$. From the above discussions, it is clear that $\Phi_i = E_{pk}(0)$ at most once since $H_i$ is equal to $E_{pk}(1)$ at most once. Also, if $\Phi_j = E_{pk}(0)$, then index $j$ is the position at which the bits of $u$ and $v$ differ first (starting from the most significant bit position).

Footnote 1. In general, for any two given bits $o_1$ and $o_2$, the property $o_1 \oplus o_2 = o_1 + o_2 - 2(o_1 * o_2)$ always holds.

Now, depending on $F$, $P_1$ creates two encrypted vectors $W$ and $\Gamma$ as follows, for $1 \le i \le l$:

- If $F : u > v$, compute
  $$W_i = E_{pk}(u_i * (1 - v_i)), \qquad \Gamma_i = E_{pk}(v_i - u_i) * E_{pk}(\hat{r}_i) = E_{pk}(v_i - u_i + \hat{r}_i).$$
- If $F : v > u$, compute
  $$W_i = E_{pk}(v_i * (1 - u_i)), \qquad \Gamma_i = E_{pk}(u_i - v_i) * E_{pk}(\hat{r}_i) = E_{pk}(u_i - v_i + \hat{r}_i),$$

where $\hat{r}_i$ is a random number (hereafter denoted by $\in_R$) in $\mathbb{Z}_N$. The observation is that if $F : u > v$, then $W_i = E_{pk}(1)$ iff $u_i > v_i$, and $W_i = E_{pk}(0)$ otherwise. Similarly, when $F : v > u$, we have $W_i = E_{pk}(1)$ iff $v_i > u_i$, and $W_i = E_{pk}(0)$ otherwise. Also, depending on $F$, $\Gamma_i$ stores the encryption of the randomized difference between $u_i$ and $v_i$ which will be used in later computations.

After this, $P_1$ computes $L$ by combining $\Phi$ and $W$. More precisely, $P_1$ computes $L_i = W_i * \Phi_i^{r'_i}$, where $r'_i$ is a random number in $\mathbb{Z}_N$. The observation here is that if there exists an index $j$ such that $\Phi_j = E_{pk}(0)$, denoting the first flip in the bits of $u$ and $v$, then $W_j$ stores the corresponding desired information, i.e., whether $u_j > v_j$ or $v_j > u_j$ in encrypted form. In addition, depending on $F$, $P_1$ computes the encryption of the randomized difference between $s_u$ and $s_v$ and stores it in $\delta$. Specifically, if $F : u > v$, then $\delta = E_{pk}(s_v - s_u + r)$. Otherwise, $\delta = E_{pk}(s_u - s_v + r)$, where $r \in_R \mathbb{Z}_N$.

After this, $P_1$ permutes the encrypted vectors $\Gamma$ and $L$ using two random permutation functions $\pi_1$ and $\pi_2$. Specifically, $P_1$ computes $\Gamma' = \pi_1(\Gamma)$ and $L' = \pi_2(L)$, and sends them along with $\delta$ to $P_2$. Upon receiving, $P_2$ decrypts $L'$ component-wise to get $M_i = D_{sk}(L'_i)$, for $1 \le i \le l$, and checks for index $j$.
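To see why a decrypted entry equal to 1 signals that the chosen functionality holds, the local computation of $P_1$ can be replayed on plaintext values. The sketch below is not the protocol: it drops encryption entirely, keeps every quantity as a number modulo a stand-in modulus `N` (an assumption; the paper's $N$ is a Paillier modulus), and uses the hypothetical helper names `bits_msb_first` and `smin_alpha`. The vectors $\Gamma$ and $\delta$ are not modeled.

```python
import random

N = 2**61 - 1  # stand-in modulus for this plaintext model (assumption)


def bits_msb_first(x, l):
    """Return the l bits of x, most significant bit first."""
    return [(x >> (l - 1 - i)) & 1 for i in range(l)]


def smin_alpha(u, v, l, f_is_u_gt_v, rng):
    """Plaintext replay of how L is built and what P2 would read from it."""
    h_prev, L = 0, []
    for ui, vi in zip(bits_msb_first(u, l), bits_msb_first(v, l)):
        t = ui + vi - 2 * ui * vi                     # T_i: XOR of the two bits
        h = (rng.randrange(1, N) * h_prev + t) % N    # H_i = r_i * H_{i-1} + T_i
        phi = (h - 1) % N                             # Phi_i = H_i - 1
        w = ui * (1 - vi) if f_is_u_gt_v else vi * (1 - ui)   # W_i
        L.append((w + rng.randrange(1, N) * phi) % N)         # L_i = W_i + r'_i * Phi_i
        h_prev = h
    rng.shuffle(L)                 # stands in for the permutation pi_2
    return 1 if 1 in L else 0      # P2's check on the decrypted entries M_i


rng = random.Random(0)
print(smin_alpha(55, 58, 6, False, rng))  # F: v > u
print(smin_alpha(55, 58, 6, True, rng))   # F: u > v
```

At the first differing bit position $\Phi_j$ is 0, so $L_j$ equals $W_j$; every other entry is masked by a random multiple of a non-zero value. The returned flag is therefore 1 exactly when the chosen $F$ is true, except for a random collision whose probability is on the order of $l/N$.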
That is, if $M_j = 1$, then $P_2$ sets $\alpha$ to 1, otherwise sets it to 0. In addition, $P_2$ computes a new encrypted vector $M'$ depending on the value of $\alpha$. Precisely, if $\alpha = 0$, then $M'_i = E_{pk}(0)$, for $1 \le i \le l$. Here $E_{pk}(0)$ is different for each $i$. On the other hand, when $\alpha = 1$, $P_2$ sets $M'_i$ to the re-randomized value of $\Gamma'_i$. That is, $M'_i = \Gamma'_i * r^N$, where the term $r^N$ comes from re-randomization and $r \in_R \mathbb{Z}_N$ should be different for each $i$. Furthermore, $P_2$ computes $\delta' = E_{pk}(0)$ if $\alpha = 0$. However, when $\alpha = 1$, $P_2$ sets $\delta'$ to $\delta * r_\delta^N$, where $r_\delta$ is a random number in $\mathbb{Z}_N$. Then, $P_2$ sends $M'$, $E_{pk}(\alpha)$ and $\delta'$ to $P_1$. After receiving $M'$, $E_{pk}(\alpha)$ and $\delta'$, $P_1$ computes the inverse permutation of $M'$ as $\tilde{M} = \pi_1^{-1}(M')$. Then, $P_1$ performs the following homomorphic operations to compute the encryption of the $i$th bit of $\min(u, v)$, i.e., $E_{pk}(\min(u, v)_i)$, for $1 \le i \le l$:

- Remove the randomness from $\tilde{M}_i$ by computing
  $$\lambda_i = \tilde{M}_i * E_{pk}(\alpha)^{N - \hat{r}_i}$$
- If $F : u > v$, compute $E_{pk}(\min(u, v)_i) = E_{pk}(u_i) * \lambda_i = E_{pk}(u_i + \alpha * (v_i - u_i))$. Otherwise, compute $E_{pk}(\min(u, v)_i) = E_{pk}(v_i) * \lambda_i = E_{pk}(v_i + \alpha * (u_i - v_i))$.

Also, depending on $F$, $P_1$ computes $E_{pk}(s_{\min(u,v)})$ as follows. If $F : u > v$, $P_1$ computes $E_{pk}(s_{\min(u,v)}) = E_{pk}(s_u) * \theta$, where $\theta = \delta' * E_{pk}(\alpha)^{N - r}$. Otherwise, he/she computes $E_{pk}(s_{\min(u,v)}) = E_{pk}(s_v) * \theta$.

In the SMIN protocol, one main observation (upon which we can also justify the correctness of the final output) is that if $F : u > v$, then $\min(u, v)_i = (1 - \alpha) * u_i + \alpha * v_i$ always holds, for $1 \le i \le l$. On the other hand, if $F : v > u$, then $\min(u, v)_i = \alpha * u_i + (1 - \alpha) * v_i$ always holds. Similar conclusions can be drawn for $s_{\min(u,v)}$. We emphasize that using similar formulations one can also design a SMAX protocol to compute $[\max(u, v)]$ and $E_{pk}(s_{\max(u,v)})$. Also, we stress that there can be multiple secrets of $u$ and $v$ that can be fed as input (in encrypted form) to SMIN and SMAX. For example, let $s^1_u$ and $s^2_u$ (resp., $s^1_v$ and $s^2_v$) be two secrets associated with $u$ (resp., $v$).
Then the SMIN protocol takes $([u], E_{pk}(s^1_u), E_{pk}(s^2_u))$ and $([v], E_{pk}(s^1_v), E_{pk}(s^2_v))$ as $P_1$'s input and outputs $[\min(u, v)]$, $E_{pk}(s^1_{\min(u,v)})$ and $E_{pk}(s^2_{\min(u,v)})$ to $P_1$.

Example 2. For simplicity, consider that $u = 55$, $v = 58$, and $l = 6$. Let $s_u$ and $s_v$ be the secrets associated with $u$ and $v$, respectively. Assume that $P_1$ holds $([55], E_{pk}(s_u))$ and $([58], E_{pk}(s_v))$. In addition, we assume that $P_1$'s random permutation functions are as given below.

$i$        = 1 2 3 4 5 6
$\pi_1(i)$ = 6 5 4 3 2 1
$\pi_2(i)$ = 2 1 5 6 3 4

Without loss of generality, suppose $P_1$ chooses the functionality $F: v > u$. Then, various intermediate results based on the SMIN protocol are as shown in Table 1. Following from Table 1, we observe that:
• At most one of the entries in $H$ is $E_{pk}(1)$, namely $H_3$, and the remaining entries are encryptions of either 0 or a random number in $\mathbb{Z}_N$.
• Index $j = 3$ is the first position at which the corresponding bits of $u$ and $v$ differ.
• $\Phi_3 = E_{pk}(0)$ since $H_3$ is equal to $E_{pk}(1)$. Also, since $M_5 = 1$, $P_2$ sets $\alpha$ to 1.
• $E_{pk}(s_{\min(u,v)}) = E_{pk}(\alpha * s_u + (1 - \alpha) * s_v) = E_{pk}(s_u)$.

At the end, only $P_1$ knows $[\min(u, v)] = [u] = [55]$ and $E_{pk}(s_{\min(u,v)}) = E_{pk}(s_u)$.

TABLE 1. $P_1$ chooses $F$ as $v > u$, where $u = 55$ and $v = 58$

$[u]$ | $[v]$ | $W_i$ | $\Gamma_i$ | $G_i$ | $H_i$ | $\Phi_i$ | $L_i$ | $\Gamma'_i$ | $L'_i$ | $M_i$ | $\lambda_i$ | $\min_i$
1 | 1 | 0 | r | 0 | 0 | -1 | r | 1+r | r | r | 0 | 1
1 | 1 | 0 | r | 0 | 0 | -1 | r | r | r | r | 0 | 1
0 | 1 | 1 | -1+r | 1 | 1 | 0 | 1 | 1+r | r | r | -1 | 0
1 | 0 | 0 | 1+r | 1 | r | r | r | -1+r | r | r | 1 | 1
1 | 1 | 0 | r | 0 | r | r | r | r | 1 | 1 | 0 | 1
1 | 0 | 0 | 1+r | 1 | r | r | r | r | r | r | 1 | 1

All column values are in encrypted form except the $M_i$ column. Also, $r \in_R \mathbb{Z}_N$ is different for each row and column.

Secure minimum out of n numbers. Consider $P_1$ with private input $([d_1], \ldots, [d_n])$ along with their encrypted secrets and $P_2$ with $sk$, where $0 \le d_i < 2^l$ and $[d_i] = \langle E_{pk}(d_{i,1}), \ldots, E_{pk}(d_{i,l}) \rangle$, for $1 \le i \le n$. Here the secret of $d_i$ is denoted by $E_{pk}(s_{d_i})$, for $1 \le i \le n$. The main goal of the SMIN$_n$ protocol is to compute $[\min(d_1, \ldots, d_n)] = [d_{\min}]$ without revealing any information about the $d_i$'s to $P_1$ and $P_2$. In addition, they compute the encryption of the secret corresponding to the global minimum, denoted by $E_{pk}(s_{d_{\min}})$. Here we construct a new SMIN$_n$ protocol by utilizing SMIN as the building block. The proposed SMIN$_n$ protocol is an iterative approach that computes the desired output in a hierarchical fashion. In each iteration, the minimum between a pair of values and the secret corresponding to the minimum value are computed (in encrypted form) and fed as input to the next iteration, thus generating a binary execution tree in a bottom-up fashion. At the end, only $P_1$ knows the final result $[d_{\min}]$ and $E_{pk}(s_{d_{\min}})$.

Algorithm 2. SMIN$_n(([d_1], E_{pk}(s_{d_1})), \ldots, ([d_n], E_{pk}(s_{d_n}))) \to ([d_{\min}], E_{pk}(s_{d_{\min}}))$
Require: $P_1$ has $(([d_1], E_{pk}(s_{d_1})), \ldots, ([d_n], E_{pk}(s_{d_n})))$; $P_2$ has $sk$
1: $P_1$:
   (a). $[d'_i] \leftarrow [d_i]$ and $s'_i \leftarrow E_{pk}(s_{d_i})$, for $1 \le i \le n$
   (b). $num \leftarrow n$
2: for $i = 1$ to $\lceil \log_2 n \rceil$:
   (a). for $1 \le j \le \lfloor num/2 \rfloor$:
      • if $i = 1$ then:
         - $([d'_{2j-1}], s'_{2j-1}) \leftarrow \mathrm{SMIN}(x, y)$, where $x = ([d'_{2j-1}], s'_{2j-1})$ and $y = ([d'_{2j}], s'_{2j})$
         - $[d'_{2j}] \leftarrow 0$ and $s'_{2j} \leftarrow 0$
        else
         - $([d'_{2^i(j-1)+1}], s'_{2^i(j-1)+1}) \leftarrow \mathrm{SMIN}(x, y)$, where $x = ([d'_{2^i(j-1)+1}], s'_{2^i(j-1)+1})$ and $y = ([d'_{2^i j-1}], s'_{2^i j-1})$
         - $[d'_{2^i j-1}] \leftarrow 0$ and $s'_{2^i j-1} \leftarrow 0$
   (b). $num \leftarrow \lceil num/2 \rceil$
3: $P_1$: $[d_{\min}] \leftarrow [d'_1]$ and $E_{pk}(s_{d_{\min}}) \leftarrow s'_1$

The overall steps involved in the proposed SMIN$_n$ protocol are highlighted in Algorithm 2. Initially, $P_1$ assigns $[d_i]$ and $E_{pk}(s_{d_i})$ to a temporary vector $[d'_i]$ and variable $s'_i$, for $1 \le i \le n$, respectively. Also, he/she creates a global variable $num$ and initializes it to $n$, where $num$ represents the number of (non-zero) vectors involved in each iteration. Since the SMIN$_n$ protocol executes in a binary tree hierarchy (bottom-up fashion), we have $\lceil \log_2 n \rceil$ iterations, and in each iteration the number of vectors involved varies. In the first iteration (i.e., $i = 1$), $P_1$ with private input $(([d'_{2j-1}], s'_{2j-1}), ([d'_{2j}], s'_{2j}))$ and $P_2$ with $sk$ engage in the SMIN protocol, for $1 \le j \le \lfloor num/2 \rfloor$.
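The bottom-up pairing performed by Algorithm 2 can be illustrated on plaintext values. The Python sketch below is only an illustration under stated assumptions: `smin` is a hypothetical plaintext stand-in for the SMIN sub-protocol (no encryption, no second party), the input list is assumed non-empty, and the survivors are kept in a compact list rather than with the zeroing and index bookkeeping of Algorithm 2.

```python
import math


def smin(x, y):
    """Plaintext stand-in for SMIN: keep the (distance, secret) pair
    with the smaller distance. The real protocol works on ciphertexts."""
    return x if x[0] <= y[0] else y


def smin_n(pairs):
    """Bottom-up pairwise minimum over a non-empty list of
    (distance, secret) pairs. Returns the winner and the round count."""
    survivors = list(pairs)
    rounds = 0
    while len(survivors) > 1:
        # floor(num / 2) pairwise comparisons in this round
        winners = [smin(survivors[j], survivors[j + 1])
                   for j in range(0, len(survivors) - 1, 2)]
        if len(survivors) % 2 == 1:
            winners.append(survivors[-1])  # unpaired entry advances as-is
        survivors = winners  # num becomes ceil(num / 2)
        rounds += 1
    return survivors[0], rounds


# Made-up distances with their associated "secrets" (n = 6).
records = [(55, "a"), (58, "b"), (17, "c"), (40, "d"), (23, "e"), (61, "f")]
best, rounds = smin_n(records)
assert rounds == math.ceil(math.log2(len(records)))
print(best, rounds)
```

The secret travels with its distance through every round, which is how the encrypted secret of the global minimum reaches the root of the execution tree.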
At the end of the first iteration, only $P_1$ knows $[\min(d'_{2j-1}, d'_{2j})]$ and $s'_{\min(d'_{2j-1}, d'_{2j})}$, and nothing is revealed to $P_2$, for $1 \le j \le \lfloor num/2 \rfloor$. Also, $P_1$ stores the result $[\min(d'_{2j-1}, d'_{2j})]$ and $s'_{\min(d'_{2j-1}, d'_{2j})}$ in $[d'_{2j-1}]$ and $s'_{2j-1}$, respectively. In addition, $P_1$ updates the values of $[d'_{2j}]$ and $s'_{2j}$ to 0, and $num$ to $\lceil num/2 \rceil$.

During the $i$th iteration, only the non-zero vectors (along with the corresponding encrypted secrets) are involved in SMIN, for $2 \le i \le \lceil \log_2 n \rceil$. For example, during the second iteration (i.e., $i = 2$), only $([d'_1], s'_1)$, $([d'_3], s'_3)$, and so on are involved. Note that in each iteration, the output is revealed only to $P_1$ and $num$ is updated to $\lceil num/2 \rceil$. At the end of SMIN$_n$, $P_1$ assigns the final encrypted binary vector of the global minimum value, i.e., $[\min(d_1, \ldots, d_n)]$, which is stored in $[d'_1]$, to $[d_{\min}]$. Also, $P_1$ assigns $s'_1$ to $E_{pk}(s_{d_{\min}})$.

Example 3. Suppose $P_1$ holds $\langle [d_1], \ldots, [d_6] \rangle$ (i.e., $n = 6$). For simplicity, here we assume that there are no secrets associated with the $d_i$'s. Then, based on the SMIN$_n$ protocol, the binary execution tree (in a bottom-up fashion) to compute $[\min(d_1, \ldots, d_6)]$ is shown in Fig. 1. Note that, initially, $[d'_i] = [d_i]$.

Secure frequency. Let us consider a situation where $P_1$ holds private input $(\langle E_{pk}(c_1), \ldots, E_{pk}(c_w) \rangle, \langle E_{pk}(c'_1), \ldots, E_{pk}(c'_k) \rangle)$ and $P_2$ holds the secret key $sk$. The goal of the SF protocol is to securely compute $E_{pk}(f(c_j))$, for $1 \le j \le w$. Here $f(c_j)$ denotes the number of times element $c_j$ occurs (i.e., its frequency) in the list $\langle c'_1, \ldots, c'_k \rangle$. We explicitly assume that $c'_i \in \{c_1, \ldots, c_w\}$, for $1 \le i \le k$.

The output $\langle E_{pk}(f(c_1)), \ldots, E_{pk}(f(c_w)) \rangle$ is revealed only to $P_1$. During the SF protocol, neither $c'_i$ nor $c_j$ is revealed to $P_1$ and $P_2$. Also, $f(c_j)$ is kept private from both $P_1$ and $P_2$, for $1 \le i \le k$ and $1 \le j \le w$.

The overall steps involved in the proposed SF protocol are shown in Algorithm 3. To start with, $P_1$ initially computes an encrypted vector $S_i$ such that $S_{i,j} = E_{pk}(c_j - c'_i)$, for $1 \le j \le w$.
Then, $P_1$ randomizes $S_i$ component-wise to get $S'_{i,j} = E_{pk}(r_{i,j} * (c_j - c'_i))$, where $r_{i,j}$ is a random number in $\mathbb{Z}_N$. After this, for $1 \le i \le k$, $P_1$ randomly permutes $S'_i$ component-wise using a random permutation function $\pi_i$ (known only to $P_1$). The output $Z_i \leftarrow \pi_i(S'_i)$ is sent to $P_2$. Upon receiving it, $P_2$ decrypts $Z_i$ component-wise, computes a vector $u_i$ and proceeds as follows:
• If $D_{sk}(Z_{i,j}) = 0$, then $u_{i,j}$ is set to 1. Otherwise, $u_{i,j}$ is set to 0.
• The observation is that, since $c'_i \in \{c_1, \ldots, c_w\}$, exactly one of the entries in vector $Z_i$ is an encryption of 0 and the rest are encryptions of random numbers. This further implies that exactly one of the decrypted values of $Z_i$ is 0 and the rest are random numbers. Precisely, if $u_{i,j} = 1$, then $c'_i = c_{\pi_i^{-1}(j)}$.
• Compute $U_{i,j} = E_{pk}(u_{i,j})$ and send it to $P_1$, for $1 \le i \le k$ and $1 \le j \le w$.

Then, $P_1$ performs a row-wise inverse permutation on it to get $V_i = \pi_i^{-1}(U_i)$, for $1 \le i \le k$. Finally, $P_1$ computes $E_{pk}(f(c_j)) = \prod_{i=1}^{k} V_{i,j}$ locally, for $1 \le j \le w$.

Fig. 1. Binary execution tree for $n = 6$ based on SMIN$_n$.

Algorithm 3. $\mathrm{SF}(L, L') \to \langle E_{pk}(f(c_1)), \ldots, E_{pk}(f(c_w)) \rangle$
Require: $P_1$ has $L = \langle E_{pk}(c_1), \ldots, E_{pk}(c_w) \rangle$, $L' = \langle E_{pk}(c'_1), \ldots, E_{pk}(c'_k) \rangle$ and $\langle \pi_1, \ldots, \pi_k \rangle$; $P_2$ has $sk$
1: $P_1$:
   (a). for $i = 1$ to $k$ do:
      • $T_i \leftarrow E_{pk}(c'_i)^{N-1}$
      • for $j = 1$ to $w$ do:
         - $S_{i,j} \leftarrow E_{pk}(c_j) * T_i$
         - $S'_{i,j} \leftarrow S_{i,j}^{\,r_{i,j}}$, where $r_{i,j} \in_R \mathbb{Z}_N$
      • $Z_i \leftarrow \pi_i(S'_i)$
   (b). Send $Z$ to $P_2$
2: $P_2$:
   (a). Receive $Z$ from $P_1$
   (b). for $i = 1$ to $k$ do:
      • for $j = 1$ to $w$ do:
         - if $D_{sk}(Z_{i,j}) = 0$ then $u_{i,j} \leftarrow 1$ else $u_{i,j} \leftarrow 0$
         - $U_{i,j} \leftarrow E_{pk}(u_{i,j})$
   (c). Send $U$ to $P_1$
3: $P_1$:
   (a). Receive $U$ from $P_2$
   (b). $V_i \leftarrow \pi_i^{-1}(U_i)$, for $1 \le i \le k$
   (c). $E_{pk}(f(c_j)) \leftarrow \prod_{i=1}^{k} V_{i,j}$, for $1 \le j \le w$

4 SECURITY ANALYSIS OF PRIVACY-PRESERVING PRIMITIVES UNDER THE SEMI-HONEST MODEL

First of all, we emphasize that the outputs in the above-mentioned protocols are always in encrypted format, and are known only to $P_1$.
Also, all the intermediate results revealed to $P_2$ are either random or pseudo-random.

Since the proposed SMIN protocol (which is used as a sub-routine in SMIN$_n$) is more complex than the other protocols mentioned above, and due to space limitations, we provide its security proof rather than proofs for each protocol. Therefore, here we only include a formal security proof for the SMIN protocol based on the standard simulation argument [28]. Nevertheless, we stress that similar proof strategies can be used to show that the other protocols are secure under the semi-honest model. For completeness, we provide the security proofs for the other protocols in our technical report [5].

4.1 Proof of Security for SMIN

As mentioned in Section 2.3, to formally prove that SMIN is secure [28] under the semi-honest model, we need to show that the simulated image of SMIN is computationally indistinguishable from the actual execution image of SMIN.

An execution image generally includes the messages exchanged and the information computed from these messages. Therefore, according to Algorithm 1, let the execution image of $P_2$ be denoted by $\Pi_{P_2}(\mathrm{SMIN})$, given by

$\Pi_{P_2}(\mathrm{SMIN}) = \{\langle \delta, s + \bar{r} \bmod N \rangle, \langle \Gamma'_i, \mu_i + \hat{r}_i \bmod N \rangle, \langle L'_i, \alpha \rangle\}.$

Observe that $s + \bar{r} \bmod N$ and $\mu_i + \hat{r}_i \bmod N$ are derived upon decrypting $\delta$ and $\Gamma'_i$, for $1 \le i \le l$, respectively. Note that the modulo operator is implicit in the decryption function. Also, $P_2$ receives $L'$ from $P_1$; let $\alpha$ denote the (oblivious) comparison result computed from $L'$. Without loss of generality, suppose the simulated image of $P_2$ is $\Pi^S_{P_2}(\mathrm{SMIN})$, given by

$\Pi^S_{P_2}(\mathrm{SMIN}) = \{\langle \delta^*, r^* \rangle, \langle s'_{1,i}, s'_{2,i} \rangle, \langle s'_{3,i}, \alpha' \rangle \mid \text{for } 1 \le i \le l\}.$

Here $\delta^*$, $s'_{1,i}$ and $s'_{3,i}$ are randomly generated from $\mathbb{Z}_{N^2}$, whereas $r^*$ and $s'_{2,i}$ are randomly generated from $\mathbb{Z}_N$. In addition, $\alpha'$ denotes a random bit. Since $E_{pk}$ is a semantically secure encryption scheme with resulting ciphertext size less than $N^2$, $\delta$ is computationally indistinguishable from $\delta^*$.
Similarly, $\Gamma'_i$ and $L'_i$ are computationally indistinguishable from $s'_{1,i}$ and $s'_{3,i}$, respectively. Also, as $\bar{r}$ and $\hat{r}_i$ are randomly generated from $\mathbb{Z}_N$, $s + \bar{r} \bmod N$ and $\mu_i + \hat{r}_i \bmod N$ are computationally indistinguishable from $r^*$ and $s'_{2,i}$, respectively. Furthermore, because the functionality is randomly chosen by $P_1$ (at step 1(a) of Algorithm 1), $\alpha$ is either 0 or 1 with equal probability. Thus, $\alpha$ is computationally indistinguishable from $\alpha'$. Combining all these results together, we can conclude that $\Pi_{P_2}(\mathrm{SMIN})$ is computationally indistinguishable from $\Pi^S_{P_2}(\mathrm{SMIN})$ based on Definition 1. This implies that during the execution of SMIN, $P_2$ does not learn any information regarding $u$, $v$, $s_u$, $s_v$ and the actual comparison result. Intuitively speaking, the information $P_2$ has during an execution of SMIN is either random or pseudo-random, so this information does not disclose anything regarding $u$, $v$, $s_u$ and $s_v$. Additionally, as $F$ is known only to $P_1$, the actual comparison result is oblivious to $P_2$.

On the other hand, the execution image of $P_1$, denoted by $\Pi_{P_1}(\mathrm{SMIN})$, is given by

$\Pi_{P_1}(\mathrm{SMIN}) = \{M'_i, E_{pk}(\alpha), \delta' \mid \text{for } 1 \le i \le l\}.$

$M'_i$ and $\delta'$ are encrypted values, which are random in $\mathbb{Z}_{N^2}$, received from $P_2$ (at step 3(a) of Algorithm 1). Let the simulated image of $P_1$ be $\Pi^S_{P_1}(\mathrm{SMIN})$, where

$\Pi^S_{P_1}(\mathrm{SMIN}) = \{s'_{4,i}, b', b'' \mid \text{for } 1 \le i \le l\}.$

The values $s'_{4,i}$, $b'$ and $b''$ are randomly generated from $\mathbb{Z}_{N^2}$. Since $E_{pk}$ is a semantically secure encryption scheme with resulting ciphertext size less than $N^2$, it follows that $M'_i$, $E_{pk}(\alpha)$ and $\delta'$ are computationally indistinguishable from $s'_{4,i}$, $b'$ and $b''$, respectively. Therefore, $\Pi_{P_1}(\mathrm{SMIN})$ is computationally indistinguishable from $\Pi^S_{P_1}(\mathrm{SMIN})$ based on Definition 1. As a result, $P_1$ cannot learn any information regarding $u$, $v$, $s_u$, $s_v$ and the comparison result during the execution of SMIN.
Putting everything together, we claim that the proposed SMIN protocol is secure under the semi-honest model (according to Definition 1).

5 THE PROPOSED PPKNN PROTOCOL

In this section, we propose a novel privacy-preserving k-NN classification protocol, denoted by PPkNN, which is constructed using the protocols discussed in Section 3 as building blocks. As mentioned earlier, we assume that Alice's database consists of $n$ records, denoted by $D = \langle t_1, \ldots, t_n \rangle$, and $m + 1$ attributes, where $t_{i,j}$ denotes the $j$th attribute value of record $t_i$. Initially, Alice encrypts her database attribute-wise, that is, she computes $E_{pk}(t_{i,j})$, for $1 \le i \le n$ and $1 \le j \le m + 1$, where column $(m + 1)$ contains the class labels. Let the encrypted database be denoted by $D'$. We assume that Alice outsources $D'$ as well as the future classification process to the cloud. Without loss of generality, we assume that all attribute values and their Euclidean distances lie in $[0, 2^l)$. In addition, let $w$ denote the number of unique class labels in $D$.

In our problem setting, we assume the existence of two non-colluding semi-honest cloud service providers, denoted by $C_1$ and $C_2$, which together form a federated cloud. Under this setting, Alice outsources her encrypted database $D'$ to $C_1$ and the secret key $sk$ to $C_2$. Here it is possible for the data owner Alice to replace $C_2$ with her private server. However, if Alice has a private server, we can argue that there is no need for data outsourcing from Alice's point of view. The main purpose of using $C_2$ can be motivated by the following two reasons. (i) With limited computing resources and technical expertise, it is in the best interest of Alice to completely outsource her data management and operational tasks to a cloud.
For example, Alice may want to access her data and analytical results using a smartphone or any device with very limited computing capability. (ii) Suppose Bob wants to keep his input query and access patterns private from Alice. In this case, if Alice uses a private server, then she has to perform the computations assumed by $C_2$, which negates the very purpose of outsourcing the encrypted data to $C_1$.

In general, whether Alice uses a private server or cloud service provider $C_2$ depends on her resources. In our problem setting, we prefer to use $C_2$ as this avoids the above-mentioned disadvantages (i.e., in case of Alice using a private server) altogether. In our solution, after outsourcing the encrypted data to the cloud, Alice does not participate in any future computations.

The goal of the PPkNN protocol is to classify users' query records using $D'$ in a privacy-preserving manner. Consider an authorized user Bob who wants to classify his query record $q = \langle q_1, \ldots, q_m \rangle$ based on $D'$ in $C_1$. The proposed PPkNN protocol mainly consists of the following two stages:
• Stage 1: Secure Retrieval of k-Nearest Neighbors (SRkNN). In this stage, Bob initially sends his query $q$ (in encrypted form) to $C_1$. After this, $C_1$ and $C_2$ engage in a set of sub-protocols to securely retrieve (in encrypted form) the class labels corresponding to the k-nearest neighbors of the input query $q$. At the end of this step, the encrypted class labels of the k-nearest neighbors are known only to $C_1$.
• Stage 2: Secure Computation of Majority Class (SCMC$_k$). Following from Stage 1, $C_1$ and $C_2$ jointly compute the class label with a majority voting among the k-nearest neighbors of $q$. At the end of this step, only Bob knows the class label corresponding to his input query record $q$.

The main steps involved in the proposed PPkNN protocol are shown in Algorithm 4. We now explain each of the two stages in PPkNN in detail.

Algorithm 4. $\mathrm{PPkNN}(D', q) \to c_q$
Require: $C_1$ has $D'$ and $\pi$; $C_2$ has $sk$; Bob has $q$
1: Bob:
   (a). Compute $E_{pk}(q_j)$, for $1 \le j \le m$
   (b). Send $E_{pk}(q) = \langle E_{pk}(q_1), \ldots, E_{pk}(q_m) \rangle$ to $C_1$
2: $C_1$ and $C_2$:
   (a). $C_1$ receives $E_{pk}(q)$ from Bob
   (b). for $i = 1$ to $n$ do:
      • $E_{pk}(d_i) \leftarrow \mathrm{SSED}(E_{pk}(q), E_{pk}(t_i))$
      • $[d_i] \leftarrow \mathrm{SBD}(E_{pk}(d_i))$
3: for $s = 1$ to $k$ do:
   (a). $C_1$ and $C_2$:
      • $([d_{\min}], E_{pk}(I), E_{pk}(c')) \leftarrow \mathrm{SMIN}_n(\theta_1, \ldots, \theta_n)$, where $\theta_i = ([d_i], E_{pk}(I_{t_i}), E_{pk}(t_{i,m+1}))$
      • $E_{pk}(c'_s) \leftarrow E_{pk}(c')$
   (b). $C_1$:
      • $\Delta \leftarrow E_{pk}(I)^{N-1}$
      • for $i = 1$ to $n$ do:
         - $\tau_i \leftarrow E_{pk}(i) * \Delta$
         - $\tau'_i \leftarrow \tau_i^{\,r_i}$, where $r_i \in_R \mathbb{Z}_N$
      • $\beta \leftarrow \pi(\tau')$; send $\beta$ to $C_2$
   (c). $C_2$:
      • $\beta'_i \leftarrow D_{sk}(\beta_i)$, for $1 \le i \le n$
      • Compute $U'$, for $1 \le i \le n$:
         - if $\beta'_i = 0$, then $U'_i = E_{pk}(1)$
         - otherwise, $U'_i = E_{pk}(0)$
      • Send $U'$ to $C_1$
   (d). $C_1$: $V \leftarrow \pi^{-1}(U')$
   (e). $C_1$ and $C_2$, for $1 \le i \le n$ and $1 \le \gamma \le l$:
      • $E_{pk}(d_{i,\gamma}) \leftarrow \mathrm{SBOR}(V_i, E_{pk}(d_{i,\gamma}))$
4: $\mathrm{SCMC}_k(E_{pk}(c'_1), \ldots, E_{pk}(c'_k))$

5.1 Stage 1: Secure Retrieval of k-Nearest Neighbors

During Stage 1, Bob initially encrypts his query $q$ attribute-wise, that is, he computes $E_{pk}(q) = \langle E_{pk}(q_1), \ldots, E_{pk}(q_m) \rangle$ and sends it to $C_1$. The main steps involved in Stage 1 are shown as steps 1 to 3 in Algorithm 4. Upon receiving $E_{pk}(q)$, $C_1$ with private input $(E_{pk}(q), E_{pk}(t_i))$ and $C_2$ with the secret key $sk$ jointly engage in the SSED protocol. Here $E_{pk}(t_i) = \langle E_{pk}(t_{i,1}), \ldots, E_{pk}(t_{i,m}) \rangle$, for $1 \le i \le n$. The output of this step, denoted by $E_{pk}(d_i)$, is the encryption of the squared Euclidean distance between $q$ and $t_i$, i.e., $d_i = |q - t_i|^2$. As mentioned earlier, $E_{pk}(d_i)$ is known only to $C_1$, for $1 \le i \le n$. We emphasize that the computation of the exact Euclidean distance between encrypted vectors is hard to achieve as it involves a square root. However, in our problem, it is sufficient to compare the squared Euclidean distances as this preserves the relative ordering. Then, $C_1$ with input $E_{pk}(d_i)$ and $C_2$ securely compute the encryptions of the individual bits of $d_i$ using the SBD protocol. Note that the output $[d_i] = \langle E_{pk}(d_{i,1}), \ldots, E_{pk}(d_{i,l}) \rangle$ is known only to $C_1$, where $d_{i,1}$ and $d_{i,l}$ are the most and least significant bits of $d_i$, for $1 \le i \le n$, respectively.

After this, $C_1$ and $C_2$ compute the encryptions of the class labels corresponding to the k-nearest neighbors of $q$ in an iterative manner. More specifically, they compute $E_{pk}(c'_1)$ in the first iteration, $E_{pk}(c'_2)$ in the second iteration, and so on. Here $c'_s$ denotes the class label of the $s$th nearest neighbor to $q$, for $1 \le s \le k$. At the end of $k$ iterations, only $C_1$ knows $\langle E_{pk}(c'_1), \ldots, E_{pk}(c'_k) \rangle$. To start with, consider the first iteration. $C_1$ and $C_2$ jointly compute the encryptions of the individual bits of the minimum value among $d_1, \ldots, d_n$ and the encryptions of the location and class label corresponding to $d_{\min}$ using the SMIN$_n$ protocol. That is, $C_1$ with input $(\theta_1, \ldots, \theta_n)$ and $C_2$ with $sk$ compute $([d_{\min}], E_{pk}(I), E_{pk}(c'))$, where $\theta_i = ([d_i], E_{pk}(I_{t_i}), E_{pk}(t_{i,m+1}))$, for $1 \le i \le n$. Here $d_{\min}$ denotes the minimum value among $d_1, \ldots, d_n$; $I_{t_i}$ and $t_{i,m+1}$ denote the unique identifier and class label corresponding to the data record $t_i$, respectively. Specifically, $(I_{t_i}, t_{i,m+1})$ is the secret information associated with $t_i$. For simplicity, this paper assumes $I_{t_i} = i$. In the output, $I$ and $c'$ denote the index and class label corresponding to $d_{\min}$. The output $([d_{\min}], E_{pk}(I), E_{pk}(c'))$ is known only to $C_1$. Now, $C_1$ performs the following operations locally:
• Assign $E_{pk}(c')$ to $E_{pk}(c'_1)$. Remember that, according to the SMIN$_n$ protocol, $c'$ is equivalent to the class label of the data record that corresponds to $d_{\min}$. Thus, it is the same as the class label of the nearest neighbor to $q$.
• Compute the encryption of the difference between $I$ and $i$, where $1 \le i \le n$. That is, $C_1$ computes $\tau_i = E_{pk}(i) * E_{pk}(I)^{N-1} = E_{pk}(i - I)$, for $1 \le i \le n$.
• Randomize $\tau_i$ to get $\tau'_i = \tau_i^{\,r_i} = E_{pk}(r_i * (i - I))$, where $r_i$ is a random number in $\mathbb{Z}_N$. Note that $\tau'_i$ is an encryption of either 0 or a random number, for $1 \le i \le n$.
  Also, it is worth noting that exactly one of the entries in $\tau'$ is an encryption of 0 (which happens iff $i = I$) and the rest are encryptions of random numbers. Permute $\tau'$ using a random permutation function $\pi$ (known only to $C_1$) to get $\beta = \pi(\tau')$ and send it to $C_2$.

Upon receiving $\beta$, $C_2$ decrypts it component-wise to get $\beta'_i = D_{sk}(\beta_i)$, for $1 \le i \le n$. After this, he/she computes an encrypted vector $U'$ of length $n$ such that $U'_i = E_{pk}(1)$ if $\beta'_i = 0$, and $E_{pk}(0)$ otherwise. Since exactly one of the entries in $\tau'$ is an encryption of 0, exactly one of the entries in $U'$ is an encryption of 1 and the rest are encryptions of 0. It is important to note that if $\beta'_k = 0$, then $\pi^{-1}(k)$ is the index of the data record that corresponds to $d_{\min}$. Then, $C_2$ sends $U'$ to $C_1$. After receiving $U'$, $C_1$ performs the inverse permutation on it to get $V = \pi^{-1}(U')$. Note that exactly one of the entries in $V$ is $E_{pk}(1)$ and the remaining ones are encryptions of 0. In addition, if $V_i = E_{pk}(1)$, then $t_i$ is the nearest tuple to $q$. However, $C_1$ and $C_2$ do not know which entry in $V$ corresponds to $E_{pk}(1)$.

Finally, $C_1$ updates the distance vectors $[d_i]$ for the following reason:
• It is important to note that the first nearest tuple to $q$ should be obliviously excluded from further computations. However, since $C_1$ does not know the record corresponding to $E_{pk}(c'_1)$, we need to obliviously eliminate the possibility of choosing this record again in the next iterations. For this, $C_1$ obliviously updates the distance corresponding to $E_{pk}(c'_1)$ to the maximum value, i.e., $2^l - 1$. More specifically, $C_1$ updates the distance vectors with the help of $C_2$ using the SBOR protocol as below, for $1 \le i \le n$ and $1 \le \gamma \le l$:

  $E_{pk}(d_{i,\gamma}) = \mathrm{SBOR}(V_i, E_{pk}(d_{i,\gamma})).$

  Note that when $V_i = E_{pk}(1)$, the corresponding distance vector $d_i$ is set to the maximum value. That is, in this case, $[d_i] = \langle E_{pk}(1), \ldots, E_{pk}(1) \rangle$.
  On the other hand, when $V_i = E_{pk}(0)$, the OR operation has no effect on the corresponding encrypted distance vector.

The above process is repeated for $k$ iterations, and in each iteration the $[d_i]$ corresponding to the currently chosen label is set to the maximum value. However, $C_1$ and $C_2$ do not know which $[d_i]$ is updated. In iteration $s$, $E_{pk}(c'_s)$ is known only to $C_1$. At the end of Stage 1, $C_1$ has $\langle E_{pk}(c'_1), \ldots, E_{pk}(c'_k) \rangle$, the list of encrypted class labels of the k-nearest neighbors to the query $q$.

5.2 Stage 2: Secure Computation of Majority Class

Without loss of generality, let us assume that Alice's dataset $D$ consists of $w$ unique class labels denoted by $c = \langle c_1, \ldots, c_w \rangle$. We assume that Alice outsources her list of encrypted classes to $C_1$. That is, Alice outsources $\langle E_{pk}(c_1), \ldots, E_{pk}(c_w) \rangle$ to $C_1$ along with her encrypted database $D'$ during the data outsourcing step. Note that, for security reasons, Alice may add dummy categories to the list to hide the number of class labels, i.e., $w$, from $C_1$ and $C_2$. However, for simplicity, we assume that Alice does not add any dummy categories to $c$.

During Stage 2, $C_1$ with private inputs $L = \langle E_{pk}(c_1), \ldots, E_{pk}(c_w) \rangle$ and $L' = \langle E_{pk}(c'_1), \ldots, E_{pk}(c'_k) \rangle$, and $C_2$ with $sk$ securely compute $E_{pk}(c_q)$. Here $c_q$ denotes the majority class label among $c'_1, \ldots, c'_k$. At the end of Stage 2, only Bob knows the class label $c_q$.

Algorithm 5. $\mathrm{SCMC}_k(E_{pk}(c'_1), \ldots, E_{pk}(c'_k)) \to c_q$
Require: $\langle E_{pk}(c_1), \ldots, E_{pk}(c_w) \rangle$ and $\langle E_{pk}(c'_1), \ldots, E_{pk}(c'_k) \rangle$ are known only to $C_1$; $sk$ is known only to $C_2$
1: $C_1$ and $C_2$:
   (a). $\langle E_{pk}(f(c_1)), \ldots, E_{pk}(f(c_w)) \rangle \leftarrow \mathrm{SF}(L, L')$, where $L = \langle E_{pk}(c_1), \ldots, E_{pk}(c_w) \rangle$, $L' = \langle E_{pk}(c'_1), \ldots, E_{pk}(c'_k) \rangle$
   (b). for $i = 1$ to $w$ do:
      • $[f(c_i)] \leftarrow \mathrm{SBD}(E_{pk}(f(c_i)))$
   (c). $([f_{\max}], E_{pk}(c_q)) \leftarrow \mathrm{SMAX}_w(\psi_1, \ldots, \psi_w)$, where $\psi_i = ([f(c_i)], E_{pk}(c_i))$, for $1 \le i \le w$
2: $C_1$:
   (a). $\gamma_q \leftarrow E_{pk}(c_q) * E_{pk}(r_q)$, where $r_q \in_R \mathbb{Z}_N$
   (b). Send $\gamma_q$ to $C_2$ and $r_q$ to Bob
3: $C_2$:
   (a). Receive $\gamma_q$ from $C_1$
   (b). $\gamma'_q \leftarrow D_{sk}(\gamma_q)$; send $\gamma'_q$ to Bob
4: Bob:
   (a). Receive $r_q$ from $C_1$ and $\gamma'_q$ from $C_2$
   (b). $c_q \leftarrow \gamma'_q - r_q \bmod N$

The overall steps involved in Stage 2 are shown in Algorithm 5. To start with, $C_1$ and $C_2$ jointly compute the encrypted frequencies of each class label using the k-nearest set as input. That is, they compute $E_{pk}(f(c_i))$ using $(L, L')$ as $C_1$'s input to the secure frequency (SF) protocol, for $1 \le i \le w$. The output $\langle E_{pk}(f(c_1)), \ldots, E_{pk}(f(c_w)) \rangle$ is known only to $C_1$. Then, $C_1$ with $E_{pk}(f(c_i))$ and $C_2$ with $sk$ engage in the secure bit-decomposition protocol to compute $[f(c_i)]$, that is, the vector of encryptions of the individual bits of $f(c_i)$, for $1 \le i \le w$. After this, $C_1$ and $C_2$ jointly engage in the SMAX$_w$ protocol. Briefly, SMAX$_w$ utilizes the sub-routine SMAX to eventually compute $([f_{\max}], E_{pk}(c_q))$ in an iterative fashion. Here $[f_{\max}] = [\max(f(c_1), \ldots, f(c_w))]$ and $c_q$ denotes the majority class out of $L'$. At the end, the output $([f_{\max}], E_{pk}(c_q))$ is known only to $C_1$. After this, $C_1$ computes $\gamma_q = E_{pk}(c_q + r_q)$, where $r_q$ is a random number in $\mathbb{Z}_N$ known only to $C_1$. Then, $C_1$ sends $\gamma_q$ to $C_2$ and $r_q$ to Bob. Upon receiving $\gamma_q$, $C_2$ decrypts it to get the randomized majority class label $\gamma'_q = D_{sk}(\gamma_q)$ and sends it to Bob. Finally, upon receiving $r_q$ from $C_1$ and $\gamma'_q$ from $C_2$, Bob computes the output class label corresponding to $q$ as $c_q = \gamma'_q - r_q \bmod N$.

5.3 Security Analysis of PPkNN under the Semi-Honest Model

First of all, we stress that due to the encryption of $q$ and by the semantic security of the Paillier cryptosystem, Bob's input query $q$ is protected from Alice, $C_1$ and $C_2$ in our PPkNN protocol. Apart from guaranteeing query privacy, the goal of PPkNN is to protect data confidentiality and hide data access patterns.

In this paper, to prove a protocol's security under the semi-honest model, we adopt the well-known security definitions from the literature of SMC. More specifically, as mentioned in Section 2.3, we adopt the security proofs based on the standard simulation paradigm [28].
For presentation purposes, we provide formal security proofs (under the semi-honest model) for Stages 1 and 2 of PPkNN separately. Note that the outputs returned by each sub-protocol are in encrypted form and known only to $C_1$.

5.3.1 Proof of Security for Stage 1

As mentioned earlier, the computations involved in Stage 1 of PPkNN are given as steps 1 to 3 in Algorithm 4. For simplicity, we consider the messages exchanged between $C_1$ and $C_2$ in a single iteration (a similar analysis can be deduced for the other iterations).

According to Algorithm 4, the execution image of $C_2$ is given by $\Pi_{C_2}(\mathrm{PPkNN}) = \{\langle \beta_i, \beta'_i \rangle \mid \text{for } 1 \le i \le n\}$, where $\beta_i$ is an encrypted value which is random in $\mathbb{Z}_{N^2}$. Also, $\beta'_i$ is derived upon decrypting $\beta_i$ by $C_2$. Remember that exactly one of the entries in $\beta'$ is 0 and the rest are random numbers in $\mathbb{Z}_N$. Without loss of generality, let the simulated image of $C_2$ be given by $\Pi^S_{C_2}(\mathrm{PPkNN}) = \{\langle a'_{1,i}, a'_{2,i} \rangle \mid \text{for } 1 \le i \le n\}$. Here $a'_{1,i}$ is randomly generated from $\mathbb{Z}_{N^2}$ and the vector $a'_2$ is randomly generated in such a way that exactly one of the entries is 0 and the rest are random numbers in $\mathbb{Z}_N$. Since $E_{pk}$ is a semantically secure encryption scheme with resulting ciphertext size less than $N^2$, we claim that $\beta_i$ is computationally indistinguishable from $a'_{1,i}$. In addition, since the random permutation function $\pi$ is known only to $C_1$, $\beta'$ is a random vector of exactly one 0 and random numbers in $\mathbb{Z}_N$. Thus, $\beta'$ is computationally indistinguishable from $a'_2$. By combining the above results, we can conclude that $\Pi_{C_2}(\mathrm{PPkNN})$ is computationally indistinguishable from $\Pi^S_{C_2}(\mathrm{PPkNN})$. This implies that $C_2$ does not learn anything during the execution of Stage 1.

On the other hand, the execution image of $C_1$ is given by $\Pi_{C_1}(\mathrm{PPkNN}) = \{U'\}$, where $U'$ is an encrypted value sent by $C_2$ (at step 3(c) of Algorithm 4). Let the simulated image of $C_1$ in Stage 1 be $\Pi^S_{C_1}(\mathrm{PPkNN}) = \{a'\}$. Here the value of $a'$ is randomly generated from $\mathbb{Z}_{N^2}$.
Since $E_{pk}$ is a semantically secure encryption scheme with resulting ciphertexts in $\mathbb{Z}_{N^2}$, we claim that $U'$ is computationally indistinguishable from $a'$. This implies that $\Pi_{C_1}(\mathrm{PPkNN})$ is computationally indistinguishable from $\Pi^S_{C_1}(\mathrm{PPkNN})$. Hence, $C_1$ cannot learn anything during the execution of Stage 1 in PPkNN. Combining all these results together, it is clear that Stage 1 is secure under the semi-honest model.

It is worth pointing out that, in each iteration, $C_1$ and $C_2$ do not know which data record corresponds to the current global minimum. Thus, data access patterns are protected from both $C_1$ and $C_2$. Informally speaking, at step 3(c) of Algorithm 4, a component-wise decryption of $\beta$ reveals to $C_2$ the tuple that satisfies the current global minimum distance. However, due to the random permutation by $C_1$, $C_2$ cannot trace it back to the corresponding data record. Also, note that the decryption operations on vector $\beta$ by $C_2$ will result in exactly one 0, and the rest of the results are random numbers in $\mathbb{Z}_N$. Similarly, since $U'$ is an encrypted vector, $C_1$ cannot know which tuple corresponds to the current global minimum distance.

5.3.2 Security Proof for Stage 2

In a similar fashion, we can formally prove that Stage 2 of PPkNN is secure under the semi-honest model. Briefly, since the sub-protocols SF, SBD, and SMAX$_w$ are secure, no information is revealed to $C_2$. Also, the operations performed by $C_1$ are entirely on encrypted data and thus no information is revealed to $C_1$.

Furthermore, the output data of Stage 1 which are passed as input to Stage 2 are in encrypted format. Therefore, the sequential composition of the two stages leads to our PPkNN protocol and we claim it to be secure under the semi-honest model according to the Composition Theorem [28]. In particular, based on the above discussions, it is clear that the proposed PPkNN protocol protects the confidentiality of the data and the user's input query, and also hides data access patterns from Alice, $C_1$, and $C_2$.
Note that Alice does not participate in any computations of PPkNN.

5.4 Security under the Malicious Model

The next step is to extend our PPkNN protocol into a secure protocol under the malicious model. Under the malicious model, an adversary (i.e., either $C_1$ or $C_2$) can arbitrarily deviate from the protocol to gain some advantage (e.g., learning additional information about inputs) over the other party. The deviations include, as an example, $C_1$ (acting as a malicious adversary) instantiating the PPkNN protocol with modified inputs (say $E_{pk}(q')$ and $E_{pk}(t'_i)$) and aborting the protocol after gaining partial information. However, in PPkNN, it is worth pointing out that neither $C_1$ nor $C_2$ knows the results of Stages 1 and 2. In addition, all the intermediate results are either random or pseudo-random values. Thus, even when an adversary modifies the intermediate computations, he/she cannot gain any additional information. Nevertheless, as mentioned above, the adversary can change the intermediate data or perform computations incorrectly before sending them to the honest party, which may eventually result in the wrong output. Therefore, we need to ensure that all the computations performed and messages sent by each party are correct.

Remember that the main goal of SMC is to ensure that the honest parties get the correct result and to protect their private input data from the malicious parties. Therefore, under the two-party SMC scenario, if both parties are malicious, there is no point in developing or adopting an SMC protocol in the first place. In the literature of SMC [30], it is the norm that at most one party can be malicious under the two-party scenario. When only one of the parties is malicious, the standard way of preventing the malicious party from misbehaving is to let the honest party validate the other party's work using zero-knowledge proofs [31].
However, checking the validity of operations at each step of PPkNN can significantly increase the cost.

An alternative approach, as proposed in [32], is to instantiate two independent executions of the PPkNN protocol by swapping the roles of the two parties in each execution. At the end of the individual executions, each party receives the output in encrypted form. This is followed by an equality test on their outputs. More specifically, let $E_{pk_1}(c_{q,1})$ and $E_{pk_2}(c_{q,2})$ be the outputs received by $C_1$ and $C_2$, respectively, where $pk_1$ and $pk_2$ are their respective public keys. Note that the outputs in our case are in encrypted format and the corresponding ciphertexts (resulting from the two executions) are under two different public key domains. Therefore, we stress that the equality test based on the additive homomorphic encryption properties which was used in [32] is not applicable to our problem. Nevertheless, $C_1$ and $C_2$ can perform the equality test based on the traditional garbled-circuit technique [33].

5.5 Complexity Analysis

The total computation complexity of Stage 1 is bounded by $O(n * (l + m + k * l * \log_2 n))$ encryptions and exponentiations. On the other hand, the total computation complexity of Stage 2 is bounded by $O(w * (l + k + l * \log_2 w))$ encryptions and exponentiations. Due to space limitations, we refer the reader to [5] for a detailed complexity analysis of PPkNN. In general, as $w \ll n$, the computation cost of Stage 1 should be significantly higher than that of Stage 2. This observation is further justified by our empirical results given in the next section.

6 EMPIRICAL RESULTS

In this section, we discuss some experiments demonstrating the performance of our PPkNN protocol under different parameter settings. We used the Paillier cryptosystem [4] as the underlying additive homomorphic encryption scheme and implemented the proposed PPkNN protocol in C.
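The protocols above rely on only a few additive homomorphic properties of Paillier: multiplying ciphertexts adds plaintexts, and raising a ciphertext to the power $N - 1$ negates its plaintext. The following Python sketch illustrates these properties and the masking step of Algorithm 5. It is a toy illustration, not the authors' C implementation: the primes 293 and 433, the function names, and the fixed choice of generator $g = N + 1$ are assumptions made here for brevity, and keys this small offer no security.

```python
import math
import random


def keygen(p=293, q=433):
    """Toy Paillier keys from two tiny illustrative primes (insecure)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # valid for generator g = n + 1 (Python 3.8+)
    return n, (lam, mu)


def encrypt(n, m):
    """E(m) = (1 + n)^m * r^n mod n^2 for a random r coprime to n."""
    n2 = n * n
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return pow(n + 1, m % n, n2) * pow(r, n, n2) % n2


def decrypt(n, sk, c):
    """D(c) = L(c^lam mod n^2) * mu mod n, with L(x) = (x - 1) / n."""
    lam, mu = sk
    return (pow(c, lam, n * n) - 1) // n * mu % n


n, sk = keygen()
n2 = n * n
a, b = 55, 58

# Multiplying ciphertexts adds plaintexts: E(a) * E(b) -> a + b.
assert decrypt(n, sk, encrypt(n, a) * encrypt(n, b) % n2) == a + b

# Exponent N - 1 negates: E(b) * E(a)^(N-1) -> b - a.
assert decrypt(n, sk, encrypt(n, b) * pow(encrypt(n, a), n - 1, n2) % n2) == b - a

# Masking as in Algorithm 5: the key holder sees only c_q + r_q mod N,
# and the party holding r_q removes the mask.
c_q, r_q = 3, random.randrange(n)
masked = decrypt(n, sk, encrypt(n, c_q) * encrypt(n, r_q) % n2)
assert (masked - r_q) % n == c_q
```

Because each encryption draws fresh randomness, two encryptions of the same value differ, which is the property the re-randomization steps in SMIN and SF depend on.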
Various experiments were conducted on a Linux machine with an Intel Xeon Six-Core CPU 3.07 GHz processor and 12 GB RAM running Ubuntu 12.04 LTS. To the best of our knowledge, our work is the first effort to develop a secure k-NN classifier under the semi-honest model. There is no existing work to compare with our approach. Hence, we evaluate the performance of our PPkNN protocol under different parameter settings.

6.1 Dataset and Experimental Setup

For our experiments, we used the Car Evaluation dataset from the UCI KDD archive [34]. It consists of 1,728 records (i.e., $n = 1{,}728$) and six attributes (i.e., $m = 6$). Also, there is a separate class attribute and the dataset is categorized into four different classes (i.e., $w = 4$). We encrypted this dataset attribute-wise, using Paillier encryption whose key size is varied in our experiments, and the encrypted data were stored on our machine. Based on our PPkNN protocol, we then executed a random query over this encrypted data. For the rest of this section, we do not discuss the performance of Alice since it is a one-time cost. Instead, we evaluate and analyze the performances of the two stages in PPkNN separately.

6.2 Performance of PPkNN

We first evaluated the computation costs of Stage 1 in PPkNN for a varying number of k-nearest neighbors. Also, the Paillier encryption key size $K$ is either 512 or 1,024 bits. The results are shown in Fig. 2a. For $K = 512$ bits, the computation cost of Stage 1 varies from 9.98 to 46.16 minutes when $k$ is changed from 5 to 25, respectively. On the other hand, when $K = 1{,}024$ bits, the computation cost of Stage 1 varies from 66.97 to 309.98 minutes when $k$ is changed from 5 to 25, respectively. In either case, we observed that the cost of Stage 1 grows almost linearly with $k$. For any given $k$, we identified that the cost of Stage 1 increases by almost a factor of 7 whenever $K$ is doubled. E.g., when $k = 10$, Stage 1 took 19.06 and 127.72 minutes to generate the encrypted class labels of the 10 nearest neighbors under $K = 512$ and 1,024 bits, respectively. Moreover, when $k = 5$, we observe that around 66.29 percent of the cost in Stage 1 is due to SMIN$_n$, which is initiated $k$ times in PPkNN (once in each iteration). Also, the share of the cost incurred by SMIN$_n$ increases from 66.29 to 71.66 percent when $k$ is increased from 5 to 25.

Fig. 2. Computation costs of PPkNN for varying number of k nearest neighbors and encryption key size K.

We now evaluate the computation costs of Stage 2 for varying $k$ and $K$. As shown in Fig. 2b, for $K = 512$ bits, the computation time for Stage 2 to generate the final class label corresponding to the input query varies from 0.118 to 0.285 seconds when $k$ is changed from 5 to 25. On the other hand, for $K = 1{,}024$ bits, Stage 2 took 0.789 and 1.89 seconds when $k = 5$ and 25, respectively. The low computation costs of Stage 2 are due to SMAX$_w$, which incurs significantly fewer computations than SMIN$_n$ in Stage 1. This further justifies our theoretical analysis in Section 5.5. Note that, in our dataset, $w = 4$ and $n = 1{,}728$. As in Stage 1, for any given $k$, the computation time of Stage 2 increases by almost a factor of 7 whenever $K$ is doubled. E.g., when $k = 10$, the computation time of Stage 2 varies from 0.175 to 1.158 seconds when the encryption key size $K$ is changed from 512 to 1,024 bits. As shown in Fig. 2b, a similar analysis can be observed for other values of $k$ and $K$.

It is clear that the computation cost of Stage 1 is significantly higher than that of Stage 2 in PPkNN. Specifically, we observed that the computation time of Stage 1 accounts for at least 99 percent of the total time in PPkNN.
For example, when k = 10 and K = 512 bits, the computation costs of Stage 1 and 2 are 19.06 minutes and 0.175 seconds, respectively. Under this scenario, the cost of Stage 1 is 99.98 percent of the total cost of PPkNN. We also observed that the total computation time of PPkNN grows almost linearly with n and k.

6.3 Performance Improvement of PPkNN

We now discuss two different ways to boost the efficiency of Stage 1 (as the performance of PPkNN depends primarily on Stage 1) and empirically analyze their efficiency gains. First, we observe that some of the computations in Stage 1 can be pre-computed. For example, encryptions of random numbers, 0s and 1s can be pre-computed (by the corresponding parties) in the offline phase. As a result, the online computation cost of Stage 1 (denoted by SRkNN_o) is expected to be improved. To see the actual efficiency gains of such a strategy, we computed the costs of SRkNN_o and compared them with the costs of Stage 1 without an offline phase (simply denoted by SRkNN); the results for K = 1,024 bits are shown in Fig. 2c. Irrespective of the values of k, we observed that SRkNN_o is around 33 percent faster than SRkNN. E.g., when k = 10, the computation costs of SRkNN_o and SRkNN are 84.47 and 127.72 minutes, respectively (boosting the online running time of Stage 1 by 33.86 percent).

Our second approach to improve the performance of Stage 1 is by using parallelism. Since operations on data records are independent of one another, we claim that most computations in Stage 1 can be parallelized. To empirically evaluate this claim, we implemented a parallel version of Stage 1 (denoted by SRkNN_p) using OpenMP programming and compared its cost with the costs of SRkNN (i.e., the serial version of Stage 1). The results for K = 1,024 bits are shown in Fig. 2c. The computation cost of SRkNN_p varies from 12.02 to 55.5 minutes when k is changed from 5 to 25. We observe that SRkNN_p is almost six times more efficient than SRkNN.
This is because our machine has six cores and thus computations can be run in parallel on six separate threads. Based on the above discussions, it is clear that the efficiency of Stage 1 can indeed be improved significantly using parallelism.

On the other hand, Bob's computation cost in PPkNN is mainly due to the encryption of his input query. In our dataset, Bob's computation cost is 4 and 17 milliseconds when K is 512 and 1,024 bits, respectively. It is apparent that PPkNN is very efficient from Bob's computational perspective, which is especially beneficial when he issues queries from a resource-constrained device (such as a mobile phone or PDA).

6.3.1 A Note on Practicality

Our PPkNN protocol is not very efficient without utilizing parallelization. However, ours is the first work to propose a PPkNN solution that is secure under the semi-honest model. Due to rising demands for data mining as a service in the cloud, we believe that our work will be very helpful to the cloud community to stimulate further research along that direction. Hopefully, more practical solutions to PPkNN will be developed (either by optimizing our protocol or investigating alternative approaches) in the near future.

7 CONCLUSIONS AND FUTURE WORK

To protect user privacy, various privacy-preserving classification techniques have been proposed over the past decade. The existing techniques are not applicable to outsourced database environments where the data resides in encrypted form on a third-party server. This paper proposed a novel privacy-preserving k-NN classification protocol over encrypted data in the cloud. Our protocol protects the confidentiality of the data and the user's input query, and hides the data access patterns. We also evaluated the performance of our protocol under different parameter settings.

Since improving the efficiency of SMIN_n is an important first step for improving the performance of our PPkNN protocol, we plan to investigate alternative and more efficient solutions to the SMIN_n problem in our future work.
Also, we will investigate and extend our research to other classification algorithms.

ACKNOWLEDGMENTS

The authors wish to thank the anonymous reviewers for their invaluable feedback and suggestions. This work has been partially supported by the US National Science Foundation under grant CNS-1011984.

Innovative Schemes for Resource Allocation in the Cloud for Media Streaming Applications

05/08/201902/07/2019 by admin

Abstract:

Media streaming applications have recently attracted a large number of users in the Internet. With the advent of these bandwidth-intensive applications, it is economically inefficient to provide streaming distribution with guaranteed QoS relying only on central resources at a media content provider. Cloud computing offers an elastic infrastructure that media content providers (e.g., Video on Demand (VoD) providers) can use to obtain streaming resources that match the demand. Media content providers are charged for the amount of resources allocated (reserved) in the cloud. Most of the existing cloud providers employ a pricing model for the reserved resources that is based on non-linear time-discount tariffs (e.g., Amazon CloudFront and Amazon EC2). Such a pricing scheme offers discount rates depending non-linearly on the period of time during which the resources are reserved in the cloud. In this case, an open problem is to decide on both the right amount of resources reserved in the cloud, and their reservation time such that the financial cost on the media content provider is minimized. We propose a simple, easy to implement algorithm for resource reservation that maximally exploits discounted rates offered in the tariffs, while ensuring that sufficient resources are reserved in the cloud. Based on the prediction of demand for streaming capacity, our algorithm is carefully designed to reduce the risk of making wrong resource allocation decisions. The results of our numerical evaluations and simulations show that the proposed algorithm significantly reduces the monetary cost of resource allocations in the cloud as compared to other conventional schemes.

Index Terms: Media streaming, cloud computing, non-linear pricing models, network economics

1 INTRODUCTION

Media streaming applications have recently attracted a large number of users in the Internet. In 2010, the number of video streams served increased 38.8 percent to 24.92 billion as compared to 2009 [1].
This huge demand creates a burden on centralized data centers at media content providers such as Video-on-Demand (VoD) providers to sustain the required QoS guarantees [2]. The problem becomes more critical with the increasing demand for the higher bit rates required by the growing amount of higher-definition video desired by consumers. In this paper, we explore new approaches that mitigate the cost of streaming distribution on media content providers using cloud computing.

A media content provider needs to equip its data-center with an over-provisioned (excessive) amount of resources in order to meet the strict QoS requirements of streaming traffic. Since it is possible to anticipate the size of usage peaks for streaming capacity on a daily, weekly, monthly, and yearly basis, a media content provider can make long term investments in infrastructure (e.g., bandwidth and computing capacities) to target the expected usage peak. However, this causes economic inefficiency problems in view of flash-crowd events. Since data-centers of a media content provider are equipped with resources that target the peak expected demand, most servers in a typical data-center of a media content provider are only used at about 30 percent of their capacity [3]. Hence, a huge amount of capacity at the servers will be idle most of the time, which is highly wasteful and inefficient.

Cloud computing creates the possibility for media content providers to convert the upfront infrastructure investment to operating expenses charged by cloud providers (e.g., Netflix moved its streaming servers to Amazon Web Services (AWS) [4], [5]). Instead of buying over-provisioned servers and building private data-centres, media content providers can use computing and bandwidth resources of cloud service providers. Hence, a media content provider can be viewed as a re-seller of cloud resources, where it pays the cloud service provider for the streaming resources (bandwidth) served from the cloud directly to clients of the media content provider.
This paradigm reduces the expenses of media content providers in terms of purchase and maintenance of over-provisioned resources at their data-centres.

In the cloud, the amount of allocated resources can be changed adaptively at a fine granularity, which is commonly referred to as auto-scaling. The auto-scaling ability of the cloud enhances resource utilization by matching the supply with the demand. So far, CPU and memory are the common resources offered by the cloud providers (e.g., Amazon EC2 [6]). However, recently, streaming resources (bandwidth) have become a feature offered by many cloud providers to users with intensive bandwidth demand (e.g., Amazon CloudFront and Octoshape) [5], [7], [8], [9].

A. Alasaad and H.M. Behairy are with the National Center for Electronics, Communications, and Photonics, King Abdulaziz City for Science and Technology, Riyadh, Saudi Arabia. E-mail: {alasaad, hbehairy}@kacst.edu.sa.
K. Shafiee and V.C.M. Leung are with the Department of Electrical and Computer Engineering, University of British Columbia, Vancouver, BC, Canada. E-mail: {kshafiee, vleung}@ece.ubc.ca.
Manuscript received 7 Nov. 2013; revised 23 Jan. 2014; accepted 24 Mar. 2014. Date of publication 10 Apr. 2014; date of current version 6 Mar. 2015.
Recommended for acceptance by H. Wu.
For information on obtaining reprints of this article, please send e-mail to: reprints@ieee.org, and reference the Digital Object Identifier below.
Digital Object Identifier no. 10.1109/TPDS.2014.2316827
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 26, NO. 4, APRIL 2015
1045-9219 © 2014 IEEE.
Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

The delay sensitive nature of media streaming traffic poses unique challenges due to the need for guaranteed throughput (i.e., download rate no smaller than the video playback rate) in order to enable users to smoothly watch video content on-line. Hence, the media content provider needs to allocate streaming resources in the cloud such that the demand for streaming capacity can be sustained at any instant of time.

The common type of resource provisioning plan that is offered by cloud providers is referred to as the on-demand plan. This plan allows the media content provider to purchase resources as needed. The pricing model that cloud providers employ for the on-demand plan is pay-per-use. Another type of streaming resource provisioning plan that is offered by many cloud providers is based on resource reservation. With the reservation plan, the media content provider allocates (reserves) resources in advance and is charged before the resources are utilized (upon receipt of the request by the cloud provider, i.e., prepaid resources). The reserved streaming resources are basically the bandwidth (streaming data-rate) that the cloud provider guarantees to deliver to clients of the media content provider (content viewers) according to the required QoS.

In general, the prices (tariffs) of the reservation plan are cheaper than those of the on-demand plan (i.e., time discount rates are only offered to the reserved (prepaid) resources). We consider a pricing model for resource reservation in the cloud that is based on non-linear time-discount tariffs. In such a pricing scheme, the cloud service provider offers higher discount rates to the resources reserved in the cloud for longer times.
Such a pricing scheme enables a cloud service provider to better utilize its abundantly available resources because it encourages consumers to reserve resources in the cloud for longer times. This pricing scheme is currently being used by many cloud providers [10]. See for example the pricing of virtual machines (VM) in the reservation phase defined by Amazon EC2 in February 2010. In this case, an open problem is to decide on both the optimum amount of resources reserved in the cloud (i.e., the prepaid allocated resources), and the optimum period of time during which those resources are reserved such that the monetary cost on the media content provider is minimized. In order for a media content provider to address this problem, prediction of future demand for streaming capacity is required to help with the resource reservation planning. Many methods have been proposed in prior works to predict the demand for streaming capacity [11], [12], [13], [14].

Our main contribution in this paper is a practical, easy to implement Prediction-Based Resource Allocation algorithm (PBRA) that minimizes the monetary cost of resource reservation in the cloud by maximally exploiting discounted rates offered in the tariffs, while ensuring that sufficient resources are reserved in the cloud with some level of confidence in a probabilistic sense. We first describe the system model. We formulate the problem based on the prediction of future demand for streaming capacity (Section 3). We then describe the design of our proposed algorithm for solving the problem (Section 4).

The results of our numerical evaluations and simulations show that the proposed algorithms significantly reduce the monetary cost of resource allocations in the cloud as compared to other conventional schemes.

2 RELATED WORK

The prediction of CPU utilization and user access demand for web-based applications has been extensively studied in the literature.
A prediction method for upcoming CPU utilization demand patterns, based on neural networking and linear regression and of interest in e-commerce applications, has been proposed in [15]. Y. Lee et al. proposed a prediction method based on radial basis function (RBF) networks to predict user access demand for web-type services in web-based applications [16].

Although the demand prediction for CPU utilization and web applications has been studied for a relatively long period of time, the prediction of demand for media streaming has gained popularity more recently [11], [12], [13], [14]. The access behaviour of users in peer-to-peer (P2P) streaming was predicted in [11] using time-series analysis techniques with non-stationary time-series models. The method of time-series prediction based on wavelet analysis was studied in [12]. In [13], principal component analysis is employed by the authors to extract the access pattern of streaming users. Although most of the above studies predict the average streaming capacity demands, a few papers have also studied the volatility of the capacity demand, i.e., the demand variance at any future point in time, which yields more accurate risk factors [14]. The prediction of streaming bandwidth demand is outside the scope of this paper. In this work, we formulate the problem considering a given probability distribution function of the prediction of future demand for streaming bandwidth.

In addition to demand prediction for resource reservation, other relevant studies have addressed the appropriate joint reservation of bandwidth resources on multiple cloud service providers with the purpose of maximizing bandwidth utilization [12], [14]. In [17], an adaptive resource provisioning scheme is presented that optimizes the bandwidth utilization while satisfying the required levels of QoS. Maximization of bandwidth utilization in turn helps cloud service providers reduce their expenses and maximize their revenues.
In [18], an optimization framework for making dynamic resource allocation decisions under risky and uncertain operating environments was developed to maximize revenue while reducing operating costs. This framework considered multiple client QoS classes under uncertainty of workloads.

Recently, streaming resources (e.g., bandwidth) have become a feature offered by many cloud providers to content providers with intensive bandwidth demand. The streaming of media content to content viewers located at different geographical regions at a guaranteed data-rate is a part of the service offered by the cloud provider. The common way of implementing this service in the cloud is by having multiple data-centres inside the networks of the access connection providers (e.g., Internet Service Providers, ISPs) located at appropriate geographical locations (Fig. 1) [5], [19], [20]. Cloud service providers may need to negotiate contracts with a number of ISPs to co-locate their servers into the networks of those ISPs. In this regard, another group of papers have focused on studying different types of contracts between cloud service providers and ISPs with the purpose of minimizing the expenses of cloud providers [21]. However, an interesting design approach is to look at the resource reservation problem from the viewpoint of content providers. Obviously, content providers are more interested in minimizing their costs, i.e., the amount of money that they are charged directly by cloud providers.

To the best of our knowledge, very few studies have investigated the problem of optimizing resource reservation with the objective of minimizing the monetary costs for content providers. A good example is presented in [22], wherein a resource reservation optimization problem was formulated to minimize the costs of content providers, so-called cloud consumers, using a stochastic programming model.
In the process of problem formulation, uncertain demand and uncertain cloud providers' resource prices are considered. In contrast, the optimization problem formulated in our work takes into account a given probability distribution function obtained from aforementioned studies for the prediction of media streaming demands. Furthermore, the problem of cost minimization is addressed by utilizing the discounted rates offered in the non-linear tariffs. To the best of our knowledge, none of the previous papers has investigated the problem of cost minimization for media content providers in terms of monetary expenses by taking into account both the penalties caused by the over-provisioned or under-provisioned reserved resources, and the advance purchase of resources at cloud providers for just the right period of time.

3 SYSTEM MODEL AND PROBLEM FORMULATION

The system model that we advocate in this paper for media streaming using cloud computing consists of the following components (Fig. 1).

• Demand forecasting module, which predicts the demand of streaming capacity for every video channel during a future period of time.
• Cloud broker, which is responsible on behalf of the media content provider for both allocating the appropriate amount of resources in the cloud, and reserving the time over which the required resources are allocated. Given the demand prediction, the broker implements our proposed algorithm to make decisions on resource allocations in the cloud. Both the demand forecasting module and the cloud broker are located in the media content provider site.
• Cloud provider, which provides the streaming resources and delivers streaming traffic directly to media viewers.

In this paper, we consider the case wherein the cloud provider charges media content providers for the reserved resources according to the period of time during which the resources are reserved in the cloud.
In this case, the cloud provider offers higher discount rates to the resources reserved in the cloud for longer times.

Non-linear time-discount is a very popular pricing model. Non-linear tariffs are those with marginal rates varying with quantity purchased and time rented. Time discount rates are available in purchasing most types of goods. Products or services with time usage (e.g., rental cars, rental real-estates, loans, long distance telephone cards, photocopiers) are typically offered with a variety of plans (pricing schemes) depending on the period of time the product is consumed (reserved). It has been shown that such pricing schemes enable sellers to increase their revenues [23]. Many cloud providers also use such a pricing scheme [10]. See for example the pricing of virtual machines in the reservation phase defined by Amazon EC2 in February 2010. An example of tariffs using such a pricing scheme is shown in Fig. 2. We can see that the tariff is a function of both units of allocated resources and reservation time.

Fig. 1. System model.

Fig. 2. An example of tariffs as function of allocated resources and reservation time.

We observe the following dilemma: how can the media content provider reserve sufficient resources in the cloud, based on the prediction of future streaming demand, such that no resource wastage is incurred, while QoS for the actual (real) streaming traffic is maintained with some level of confidence (η) in a probabilistic sense? Moreover, how can the media content provider utilize the non-linear tariffs (time discount rates offered to the reserved (prepaid) resources) to minimize its monetary cost?

Consider a video channel offered by a media content provider. Let D(t) be the actual demand for streaming capacity of the video channel at an instant of time t, measured as the number of users that stream the channel at instant of time t multiplied by the data rate required for every downloading user to meet QoS guarantees. It has been shown that D(t) is a random process that follows a log-normal distribution with mean E[D(t)] and variance (σ) characterized in [11] and [14], respectively.

We denote the amount of streaming bandwidth that the media content provider allocates in the cloud at any time instant t by Alloc(t). Since D(t) is a random process, the media content provider needs to maintain reserved resources in the cloud Alloc(t) such that at any instant of time,

$$\mathrm{Probability}(D(t) \le Alloc(t)) \ge \eta, \tag{1}$$

where η is a pre-determined threshold (level of confidence). Note that a higher η means a higher degree of confidence, in a probabilistic sense, that the reserved resources in the cloud Alloc(t) meet the QoS guarantees for the actual streaming traffic at any future time instant t. However, increasing η increases the probability of wastage of reserved bandwidth (i.e., over-subscribed cost). Hence, proper selection of η is necessary. We shall propose an algorithm that determines the best value of η in Section 5. In this section, our objective is to find the right amount of reserved resources and their corresponding reservation time such that the monetary cost required for streaming a video content (channel) is minimized given the constraint in Eq. (1).

4 ALGORITHM DESIGN

We summarize the assumptions that we use in our analysis as follows:

1) We assume that upon receiving the resource allocation request by the cloud provider from the media content provider, the resources required are immediately allocated in the cloud, i.e., updating the cloud configuration and launching instances in cloud data-centres incurs no delay.
2) Since the only resource that we consider in this work is bandwidth, it would be important to delve into the relation between the cloud provider and content delivery networks (CDN).
However, we assume that the provisioning of media content to media viewers (clients of the media content provider) located at different geographical regions at a guaranteed data-rate is a part of the service offered by the cloud provider. The common way of implementing this service in the cloud is by having multiple data-centres inside the networks of the access connection providers (e.g., ISPs) located at appropriate geographical locations (Fig. 1) [5], [19], [20].
3) We assume that the media content provider is charged for the reserved resources in the cloud upon making the request for resource reservation (i.e., prepaid resources); and therefore, the media content provider cannot revoke, cancel, or change a request for resource reservation previously submitted to the cloud.
4) In clouds, tariffs (prices of different amounts of reserved resources in $ per unit of reservation time) are often given in a tabular form. Therefore, the cloud service provider requires a minimum reservation time for any allocated resources, and only allows discrete levels (categories) of the amount of allocated resources in the cloud. See for example the reservation phase in the Amazon CloudFront resource provisioning plans [7].

We take into account the aforementioned constraints and propose a practical, easy to implement algorithm for resource reservation in the cloud, such that the financial cost on the media content provider is minimized.

Suppose that the media content provider can predict the demand for streaming capacity of a video channel (i.e., the statistical expected value of the demand E[D(t)] is known) over a future period of time L using one of the methods in [11], [12], [13], [14]. The content provider reserves resources in the cloud according to the predicted demand. The proposed algorithm is based on time-slots with varied durations (sizes). In every time-slot, the media content provider makes a decision to reserve an amount of resources in the cloud.
Both the amount of resources to be reserved and the period of time over which the reservation is made (duration of time-slots) vary from one time-slot to another, and are determined in our algorithm to yield the minimum overall monetary cost (Fig. 3).

Fig. 3. PBRA algorithm design.

We alternatively call a time-slot a window, and denote the window size (duration of the time-slot) by w. Since the actual demand varies during a window, while allocated resources in the cloud remain the same for the entire window (according to the third assumption above), the algorithm needs to reserve resources in every window j that are sufficient to handle the maximum predicted demand for streaming capacity during that window with some probabilistic level of confidence η.

We denote the amount of reserved resources in window j by Alloc_j. Since the decision on the amount of reserved resources is affected by wrong prediction of future streaming demand, our on-line algorithm is carefully designed to obtain accurate demand prediction (by enabling a mechanism that continuously updates the demand forecast module according to the actual demand received at the media content provider over time) in order to reduce the risk of making wrong resource reservation decisions (Fig. 1).

We denote the monetary cost of the reserved resources during window j by Cost(w_j, Alloc_j), which can be computed as

$$Cost(w_j, Alloc_j) = tariff(w_j, Alloc_j) \times w_j, \tag{2}$$

where tariff(w_j, Alloc_j) represents the price (in $ per time unit) charged by the cloud provider for the amount of resources Alloc_j reserved for the period of time (window size) w_j. Note that the values of tariff and Cost in any window j depend on both the amount of allocated resources (Alloc_j) and the period of time over which resources are reserved (w_j). Also note that the algorithm runs on-the-fly.
More specifically, the demand forecast module predicts streaming capacity demand in the upcoming period of time L and feeds this information to our algorithm. The algorithm, upon receiving the demand prediction, computes the right size of window j (i.e., w*_j), and the right amount of reserved resources in window j (i.e., Alloc*_j), such that the cost of the reserved resources during window j (i.e., Cost(w_j, Alloc_j) in (2)) is minimized; or equivalently, the discounted rates offered in the tariffs are maximally utilized.

Hence, the objective of our algorithm is to minimize Cost(w_j, Alloc_j) ∀ j, subject to

$$\mathrm{Probability}(D(t) \le Alloc(t)) \ge \eta, \quad \forall\, t \in L.$$

In other words, our objective is to minimize the monetary cost of reserved resources such that the amount of reserved resources at any instant of time is guaranteed to meet the actual demand with probabilistic confidence equal to η. As we have discussed earlier, D(t) is a random process that follows a log-normal distribution with mean E[D(t)] and variance (σ) characterized in [11] and [14], respectively. Thus, using the constraint above, and for any window size w_j, we can compute the minimum amount of required reserved resources during window j (Alloc_j) by solving the following formula for Alloc_j:

$$\int_{0}^{Alloc_j} \frac{1}{x\,\sigma\sqrt{2\pi}}\; e^{-\frac{1}{2}\left(\frac{\ln(x) - \mu_{max}}{\sigma}\right)^{2}}\, dx = \eta, \tag{3}$$

where μ_max is the maximum value of the predicted streaming demand during the window j (i.e., μ_max = arg max(E[D(t)]) ∀ t ∈ w_j). Note that Equation (3) follows from the log-normal probabilistic distribution of the demand for streaming capacity.

As we have discussed earlier, the cloud service provider often requires a minimum reservation time for any allocated resources (w_min), and only allows discrete levels (categories) of reservation times for any amount of allocated resources in the cloud.
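As a rough illustration of Eq. (3): its left-hand side is the log-normal cumulative distribution function, so solving it for the allocation amounts to evaluating a log-normal quantile. The sketch below is not taken from the paper. The function names `min_alloc` and `round_up_to_level`, the value sigma = 0.4, the unit-sized allocation levels, and the choice to feed the logarithm of a peak forecast in as the location parameter are all assumptions made for illustration.

```python
import math
from statistics import NormalDist


def min_alloc(mu_max: float, sigma: float, eta: float) -> float:
    """Solve Eq. (3) for the allocation: the eta-quantile of a log-normal law.

    mu_max and sigma are the location and scale that appear inside the
    exponent of Eq. (3); eta (0 < eta < 1) is the confidence level.
    """
    if sigma <= 0 or not 0.0 < eta < 1.0:
        raise ValueError("need sigma > 0 and 0 < eta < 1")
    z = NormalDist().inv_cdf(eta)  # standard normal quantile
    return math.exp(mu_max + sigma * z)


def round_up_to_level(alloc: float) -> int:
    """Round up to the next allowed level (assumed here: integer units >= 1)."""
    return max(1, math.ceil(alloc))


# Hypothetical inputs: sigma = 0.4 is an assumed value, not from the paper,
# and using log(0.63) as the location parameter is likewise an assumption.
raw = min_alloc(mu_max=math.log(0.63), sigma=0.4, eta=0.75)
level = round_up_to_level(raw)
```

Because the quantile has a closed form, no numerical integration is needed; a higher confidence level or a larger sigma pushes the required allocation up.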
We therefore assume that any reservation time required at the cloud has to be an integer multiple of w_min (i.e., w_j = k × w_min, where k is a positive integer). Thus, the algorithm employs a trial window (w_h) to assist in making the optimum decision on the size of every window j. In particular, for every window j, the algorithm starts an iteration process with a trial window of size w_h = w_min, and computes the cost rate X_h = tariff(w_h, Alloc_h), where h is the iteration index and Alloc_h is computed by solving Eq. (3) for Alloc.

Recall that due to the time discount rates offered in the tariffs, increasing the time during which the allocated resources are reserved may lead to less monetary cost (higher discounted rate) on the media content provider (Fig. 2). However, increasing the window size (time-slot) significantly may also result in high over-provisioning (over-subscribed) cost, as the media content provider has to allocate resources in the cloud that meet the highest demand during the window period. Thus, in order to recognize whether the cost is decreasing or increasing with increasing window size, the trial window size (w_h) is increased by one w_min unit in every iteration (i.e., w_h = w_h + w_min) and the cost rate of this new trial window size is computed (X_{h+1}). The algorithm keeps increasing the trial window size until w_h = L in order to scan the entire period of time over which the demand was predicted (L) (Fig. 3), and finds the value of w_h that yields the minimum cost; that is the optimum size of window j (w*_j). Since L is the period of time over which the future demand is predicted, w_min ≤ w*_j ≤ L.

During every window, the media content provider receives the real (actual) streaming demand for the video channel, which may be different from the predicted demand. According to the actual demand, the demand forecast module updates its prediction and feeds the algorithm with a newly predicted demand for another future period of time L (Fig. 1).
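The trial-window scan described above can be sketched as follows. This is only an illustration under stated assumptions, not the paper's Algorithm 1: `best_window` and `toy_tariff` are hypothetical names, time is counted in units of w_min, allocation levels are assumed to be positive integers, the logarithm of the peak forecast is assumed to be the location parameter of Eq. (3), and the tariff is an invented function whose price per time unit falls as the window grows.

```python
import math
from statistics import NormalDist
from typing import Callable, Sequence, Tuple


def best_window(
    predicted: Sequence[float],
    tariff: Callable[[int, int], float],
    sigma: float,
    eta: float,
) -> Tuple[int, int, float]:
    """Scan trial windows w_h = 1..L and return the (window size,
    allocation level, cost rate) triple with the lowest cost rate X_h.

    predicted[t] is the forecast E[D(t)] (> 0) for slot t of length w_min.
    """
    if not predicted:
        raise ValueError("predicted demand must not be empty")
    z = NormalDist().inv_cdf(eta)
    best = None
    for w in range(1, len(predicted) + 1):
        peak = max(predicted[:w])  # highest forecast inside the trial window
        # Log-normal eta-quantile with location ln(peak), rounded up to a level.
        alloc = max(1, math.ceil(peak * math.exp(sigma * z)))
        rate = tariff(w, alloc)  # X_h = tariff(w_h, Alloc_h)
        if best is None or rate < best[2]:
            best = (w, alloc, rate)
    return best


def toy_tariff(w: int, alloc: int) -> float:
    """Hypothetical tariff: $ per time unit, discounted for longer windows."""
    return 10.0 * alloc / (1.0 + 0.1 * (w - 1))


# Assumed forecast: low demand for four slots, then a spike.
forecast = [0.6, 0.6, 0.6, 0.6, 2.5, 2.5]
window, alloc, rate = best_window(forecast, toy_tariff, sigma=0.4, eta=0.75)
```

With this invented forecast the scan stops extending the window just before the spike, because covering the spike would force a higher allocation level for the whole window and outweigh the longer-term discount.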
Upon receiving the updated demand prediction, the algorithm computes the optimum size of the next window, reserves the optimum resources in the next window, and so on. The pseudo code for the proposed algorithm is shown in Algorithm 1. In order to further clarify the operations of the proposed algorithm (which we call the Prediction-Based Resource Allocation algorithm), an example is given in the following.

ALASAAD ET AL.: INNOVATIVE SCHEMES FOR RESOURCE ALLOCATION IN THE CLOUD FOR MEDIA STREAMING APPLICATIONS 1025

4.1 Example: Finding the Right Amount of Reserved Resources in Window j and Their Reservation Time

Consider the normalized predicted streaming demand given in Fig. 4 for a future period of time L = 12. Let \(w_{min} = 1\), and let \(\eta = 0.75\). Assume that the amount of reserved resources in the cloud can only take integer numbers of units of resources (i.e., the cloud provider applies certain levels (categories) on the amount of allowed reserved resources, \(\mathrm{Alloc}(t) \in \{1, 2, 3, \ldots\}\)).

For the given predicted demand, our algorithm finds the optimum size of every window j and the optimum amount of reserved resources in window j as follows. The algorithm starts iterations to determine the size of the first window (i.e., \(w_{j=1}\)). In the first iteration (h = 1), \(w_{h=1} = 1\), and we can see that the maximum predicted demand when \(w_{h=1} = 1\) is 0.63 (Fig. 4). Thus, we have \(\mu_{max,h} = 0.63\). Using Eq. (3), we have \(\mathrm{Alloc}_{h=1} = 0.81\). Since the cloud allows only discrete levels for reserved resources, \(\mathrm{Alloc}_{h=1}\) must be rounded up to the nearest value allowed in the cloud. Thus, \(\mathrm{Alloc}_{h=1} = 1\). Using the tariff functions shown in Fig. 2, we have the cost rate \(X_h = \mathrm{tariff}(w_{h=1} = 1, \mathrm{Alloc}_{h=1} = 1) = 11\). The iterations continue until \(w_h = L\).

We summarize the results of all iterations h performed for window j = 1 using our proposed algorithm in Table 1. From the table, we can see that the minimum value of the cost rate \(X_h\) is obtained when \(h^* = 10\).
Hence, the optimum window size is \(w_{j=1}^* = w_{h=10} = 10\), and the optimum amount of reserved resources during window j = 1 is \(\mathrm{Alloc}_{j=1}^* = \mathrm{Alloc}_{h=10} = 2\). Similarly, we can find the optimum window size and optimum amount of resources in the next window (j = 2) given an updated prediction of the demand in another period of future time L.

5 HYBRID APPROACH FOR RESOURCE PROVISIONING

In this section, we consider the case wherein the cloud provider offers two different types of streaming resource provisioning plans: the reservation plan and the on-demand plan. With the reservation plan, the media content provider reserves resources in advance and pricing is charged before the resources are utilized (upon receiving the request at the cloud provider, i.e., prepaid resources). With the on-demand plan, the media content provider allocates streaming resources when needed. Pricing in the on-demand plan is charged on a pay-per-use basis. In general, the prices (tariffs) of the reservation plan are cheaper than those of the on-demand plan (i.e., time discount rates are only offered to the reserved (prepaid) resources). Amazon CloudFront [7], Amazon EC2 [6], GoGrid [24], MS Azure, OpSource, and Terremark are examples of cloud providers which offer Infrastructure-as-a-Service (IaaS) with both plans [10].

When the media content provider only uses the resource reservation plan, the under-provisioning problem can occur if the reserved (prepaid) resources are unable to fully meet the actual demand due to highly fluctuating demand or prediction mismatch. Also, the over-provisioning problem can occur if the reserved (prepaid) resources are more than the actual demand, in which case parts of the reserved resources are wasted. However, when the cloud provider offers both the reservation plan and the on-demand plan, the media content provider can allocate resources in the cloud more efficiently.
In particular, the media content provider can use the reservation plan to benefit from the time-discounted rate, while using the on-demand plan to dynamically allocate streaming resources to its clients at the moment when the reserved resources allocated using the reservation plan are unable to meet the actual demand and extra resources are needed to fit the fluctuating and unpredictable demands (e.g., flash crowd). We call this approach hybrid resource provisioning. This hybrid approach eliminates both the over-provisioning (over-subscribed) cost and the under-provisioning problem that may occur when using the reservation plan only.

In this hybrid resource provisioning approach, the tradeoff between the amount of resources allocated using the on-demand plan and the amount of resources allocated using the reservation plan needs to be adjusted so that the hybrid approach can perform optimally. In this section, we propose an algorithm for this hybrid resource provisioning approach that maximally benefits from the time-discounted rate offered in the resource reservation plan, while eliminating any over-provisioning cost of reserved resources such that the overall monetary cost of resource allocations in the cloud (including both the reserved resources and the on-demand resources) is minimized.

Fig. 4. An example of predicted demand over a period of future time L = 12.

TABLE 1
Example: Summary of Results for Iterations Executed for Window j = 1

1026 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 26, NO. 4, APRIL 2015

As we have described in the previous section (Section 4), the cost of allocated resources using the reservation plan depends on the parameter \(\eta\). We referred to \(\eta\) as the level of confidence. We have shown that using a higher value of \(\eta\) results in a higher amount of reserved resources in the cloud, and vice versa. However, increasing the value of \(\eta\) for the reserved resources may lead to the over-provisioning problem, while decreasing the value of \(\eta\) may lead to the under-provisioning problem.
Since pricing of resource allocation in the on-demand plan is higher than in the reservation plan, one may erroneously believe that increasing the value of \(\eta\) would always reduce the overall monetary cost, since the portion of reserved (discounted) resources in the cloud is increased. However, reserving too many resources (i.e., using a high value of \(\eta\) for the reserved (prepaid) resources) may be far from optimal because it may significantly increase the over-provisioning (over-subscribed) cost. Hence, this hybrid approach requires that the content provider select the right value of \(\eta\) for the reserved resources. Our proposed algorithm in this section computes the optimum value of \(\eta\) (\(\eta^*\)) that yields the minimum overall monetary cost of resource allocations in the cloud (both reserved and on-demand resources) when the media content provider uses this hybrid resource provisioning approach.

Let us again assume that the media content provider can predict the demand for a future period of time L. Let \(C_{hybrid}\) be the price that the media content provider expects to pay to the cloud provider for all streaming resources allocated in the cloud using the hybrid approach (i.e., \(C_{hybrid}\) is the statistical mean of the cost). We can see that \(C_{hybrid}\) is the summation of two terms: the price charged for the reserved resources in every window j using the reservation plan (denoted by \(C_j^{RSV}\)), and the expected cost of resources allocated in the cloud during every window j using the on-demand plan (denoted by \(C_j^{OD}\)). Hence,

\[ C_{hybrid} = \sum_j \left( C_j^{RSV} + C_j^{OD} \right). \tag{4} \]

Let \(\mathrm{Alloc}_j^{RSV}\) be the amount of reserved resources in window j, and let \(\mathrm{Alloc}_j^{OD}\) be the amount of on-demand resources allocated in window j. Let \(\mathrm{tariff}(w_j^{RSV}, \mathrm{Alloc}_j^{RSV})\) be the tariff charged for the reserved resources in window j, and let \(\mathrm{tariff}(\mathrm{Alloc}_j^{OD})\) be the tariff charged for the on-demand resources in window j.
Note that the cost rate of the resource reservation plan, \(\mathrm{tariff}(w_j^{RSV}, \mathrm{Alloc}_j^{RSV})\), depends on both \(w_j^{RSV}\) and \(\mathrm{Alloc}_j^{RSV}\), while \(\mathrm{tariff}(\mathrm{Alloc}_j^{OD})\) depends only on the amount of allocated resources \(\mathrm{Alloc}_j^{OD}\). This is because no time discount rate is offered to the on-demand resources.

Let x be a random variable representing the demand for streaming capacity at any instant of time during window j, and let \(f(x)\) be the probability density function of x. Note that when the amount of reserved resources in window j (\(\mathrm{Alloc}_j^{RSV}\)) is known, \(C_j^{OD}\) can be computed by considering the event \(\mathrm{Alloc}_j^{RSV} < x < \infty\). This is because when \(x < \mathrm{Alloc}_j^{RSV}\), the amount of reserved resources in the cloud is sufficient to handle the actual streaming demand and there is no need to allocate extra resources using the on-demand plan. Thus, we can compute the cost of reserved resources in window j (in dollars) as

\[ C_j^{RSV} = w_j \cdot \mathrm{tariff}(w_j^{RSV}, \mathrm{Alloc}_j^{RSV}), \tag{5} \]

and consequently the expected (statistical mean) cost of the on-demand resources in window j can be computed as

\[ C_j^{OD} = w_j \cdot \int_{\mathrm{Alloc}_j^{RSV}}^{\infty} f(x) \cdot \mathrm{tariff}(x - \mathrm{Alloc}_j^{RSV}) \, dx. \tag{6} \]

We shall consider a log-normal probability distribution \(f(x)\) as discussed earlier [11], [14]. Thus, \(f(x)\) in Eq. (6) can be written as

\[ f(x) = \frac{1}{x \sigma \sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{\ln(x) - \mu_{max}}{\sigma}\right)^2}. \]

As we have described in Section 4, the right amount of reserved resources in window j (\(\mathrm{Alloc}_j^{RSV}\)) can be determined given the parameter \(\eta\). Thus, \(C_{hybrid}\) in Eq. (4) is a function of the parameter \(\eta\) only. Our objective is to minimize \(C_{hybrid}\) in Eq. (4), or equivalently to determine the value of \(\eta\) that minimizes the overall cost of allocated resources using the hybrid approach. It is straightforward to show that \(C_{hybrid}\) is convex with respect to \(\eta\).
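To make Eq. (6) concrete, the expected on-demand cost can be approximated numerically. The sketch below is only an illustration under assumptions: it treats \(\mu_{max}\) and \(\sigma\) as the log-scale parameters of \(f(x)\), uses a plain midpoint rule with a truncated upper limit, and takes the on-demand tariff as a caller-supplied function; the name `expected_on_demand_cost` and its parameters are hypothetical.

```python
import math
from typing import Callable


def expected_on_demand_cost(
    alloc_rsv: float,
    mu_max: float,
    sigma: float,
    window: float,
    od_tariff: Callable[[float], float],
    steps: int = 20_000,
) -> float:
    """Approximate Eq. (6) with the midpoint rule.

    Substituting u = (ln x - mu_max) / sigma turns f(x) dx into the standard
    normal density phi(u) du, so the integral runs over u from u0 upward.
    """
    u0 = (math.log(alloc_rsv) - mu_max) / sigma
    u_hi = max(u0, sigma) + 10.0  # truncation point; the tail beyond is negligible
    du = (u_hi - u0) / steps
    total = 0.0
    for i in range(steps):
        u = u0 + (i + 0.5) * du
        phi = math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
        excess = math.exp(mu_max + sigma * u) - alloc_rsv  # x - Alloc_RSV
        total += phi * od_tariff(excess) * du
    return window * total
```

For a linear on-demand tariff the integral has a closed form (the mean excess of a log-normal variable over a threshold), which gives a convenient check on the numerical result. Raising the reserved amount shrinks this on-demand term while the reserved cost in Eq. (5) grows, which is the tradeoff that the choice of \(\eta\) controls.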
Thus, in order to minimize \(C_{hybrid}\), we need to find the optimum value of \(\eta\) (\(\eta^*\)) using Equations (5) and (6).

We can see that \(\eta^*\) can be easily solved numerically for every window j if the tariff functions are given (i.e., \(\mathrm{tariff}(t, \mathrm{Alloc}^{RSV}(t))\) and \(\mathrm{tariff}(\mathrm{Alloc}^{OD}(t))\) for any duration of resource allocation). However, as we have discussed earlier, tariffs are often given in a tabular form. Moreover, the cloud service provider often requires a minimum reservation time for any allocated resources, and only allows discrete levels (categories) of allocated resources in the cloud. We take into account those constraints and propose an efficient heuristic algorithm for this hybrid resource provisioning approach. The pseudo code of the proposed algorithm is shown in Algorithm 2.

The algorithm works as follows. Suppose that \(\eta\) takes discrete values, and the total number of possible values of \(\eta\) is S. For every window j, the iteration process described in Algorithm 1 is performed for every value of \(\eta\) in order to compute both the right amount of reserved resources (\(\mathrm{Alloc}_j^{RSV}\)) and the right time over which these resources are reserved (\(w_j^{RSV}\)). When the amount of reserved resources in window j is determined, the amount of extra resources that must be allocated using the on-demand plan in order to fulfil the predicted streaming demand can be easily computed as \(\mathrm{Alloc}_j^{OD} = \mu_{max} - \mathrm{Alloc}_j^{RSV}\), where \(\mu_{max}\) is the maximum value of the predicted streaming demand during window j. Thus, the total corresponding cost rate of allocated resources in window j is computed as \(X_h = \mathrm{tariff}(w_j^{RSV}, \mathrm{Alloc}_j^{RSV}) + \mathrm{tariff}(\mathrm{Alloc}_j^{OD})\), where h is the iteration index. The iteration process continues, and out of all values of \(X_h\) computed for different values of \(\eta\), the algorithm finds the \(\eta^*\) corresponding to the minimum value. The algorithm is repeated for every window.

We can see that the complexity of the proposed algorithm (measured in terms of the number of iterations required for every window) is \(O\left(\frac{L}{w_{min}} \cdot S\right)\).
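The per-window search over \(\eta\) described above can be sketched as follows. This is a hedged illustration rather than the paper's Algorithm 2: the function `best_confidence` and its parameters are hypothetical, the `reserve` callback stands in for the Algorithm 1 iteration, and clamping the on-demand amount at zero is an added assumption for the case where the reservation already covers the predicted peak.

```python
from typing import Callable, Iterable, Tuple


def best_confidence(
    etas: Iterable[float],
    reserve: Callable[[float], Tuple[int, float]],
    peak_demand: float,
    rsv_tariff: Callable[[int, float], float],
    od_tariff: Callable[[float], float],
) -> Tuple[float, float]:
    """Try each candidate eta and return (eta_star, minimum cost rate)."""
    best_eta, best_rate = None, float("inf")
    for eta in etas:
        w_rsv, alloc_rsv = reserve(eta)  # stands in for the Algorithm 1 iteration
        # Shortfall bought on demand; clamped at zero (an assumption of this sketch).
        alloc_od = max(0.0, peak_demand - alloc_rsv)
        rate = rsv_tariff(w_rsv, alloc_rsv) + od_tariff(alloc_od)
        if rate < best_rate:
            best_eta, best_rate = eta, rate
    if best_eta is None:
        raise ValueError("no candidate eta values supplied")
    return best_eta, best_rate
```

The outer loop runs once per candidate value of \(\eta\) (S times), and each `reserve` call hides up to \(L / w_{min}\) trial-window iterations, which is where the stated iteration count comes from.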
Thus, increasing the size of S increases the complexity of the algorithm, but also increases the accuracy of the algorithm. However, the complexity of our algorithm scales linearly with the size of the input (S), which means that our algorithm executes efficiently.

6 PERFORMANCE EVALUATION

We first analytically derive a demand prediction function that we shall use in our performance evaluations (Section 6.1). We then investigate the performance of our simple “on-line” Prediction-Based Resource Allocation (PBRA) algorithm proposed for reserving resources in the cloud, in terms of both the monetary cost of reserved resources in the cloud and complexity (CPU time) (Section 6.2). We then compare the performance of PBRA against two other schemes: the fixed window size resource reservation scheme and the pay-as-you-go resource allocation scheme (Section 6.2.2). Finally, we evaluate the performance of our hybrid resource allocation algorithm proposed for the case when the cloud provider offers two streaming resource provisioning plans, the reservation plan and the on-demand plan, and show that our algorithm significantly reduces the overall cost of resource allocation (Section 6.3).

6.1 Demand Model

As we have discussed so far, prediction of the future demand for streaming capacity is required in order for the media content provider (e.g., VoD) to optimally reserve resources in the cloud. In this section, we use a special case of the demand in which the function of expected (mean) future streaming demand for a video channel (i.e., \(E[D(t)]\)) can be easily formulated analytically.
Specifically, we assume that all media streaming demand for a video channel available at a local VoD provider is generated from users located in a single private network (e.g., users on a college or office campus).

What distinguishes the evolution of interest in a media content among users of a private network from the Internet is that users in a private network are often socially connected (e.g., friends/colleagues in a social network). Those users form a community and share similar interests. Thus, the demand for a media content grows quickly in the private network as interested users contact others (by either broadcasting the knowledge about the existence of the media content to their friends in the social network, e.g., Facebook, or using e-mail group broadcast) and make them interested. However, the interest (demand) tapers off when a certain cumulative level of interest among users of the private network is reached. For example, a student in a class of 100 students can spread the knowledge about a video content to his classmates. If the popularity of this content among students in the class is 0.2, the demand increases quickly over time as interested users contact others, but tapers off when all potentially interested students in the class (20 students) have become interested in the content and viewed it. When all 20 students finish viewing the video content, the lifetime of that content in this community network expires.

We analytically characterize this viral evolution of interest in a media content among users of a private network. Let us assume that the number of friends to whom a user is connected in a social network (the node's degree) at any instant of time is on average N.
Let us further assume that a user who receives the notification about the existence of the content gets interested with probability p and re-broadcasts the notification, in turn, to his friends on the social network, where p is the expected popularity of the content among users of the private network. We further assume that users who receive multiple notifications for the same content do not rebroadcast the message.

If the social network graph is fully connected (i.e., a notification about the existence of the content reaches all users in the private network), we can then use the fluid-flow model to write the evolution of interest in a media content as

\[ \frac{dI(t)}{dt} = I(t) \left[ p \left( N - \gamma(t) \cdot N \right) \right], \]

where \(I(t)\) is the total number of users interested in the content at time t (cumulative interest). The term \(\gamma(t) \cdot N\) accounts for the fraction of N users who received multiple notifications by time instant t, with \(\gamma(t) := \frac{I(t)}{N_T}\), where \(N_T\) is the potential number of users in the network who will ultimately become interested in the content (\(N_T = 100\) in Fig. 5), i.e., \(N_T\) is the maximum expected level of cumulative interest in the content in the private network.

The above formula is a second-order Bernoulli differential equation and can be solved as

\[ I(t) = \frac{N_T \cdot I(0)}{I(0) + \left( N_T - I(0) \right) e^{-p \cdot N \cdot t}}, \tag{7} \]

where \(I(0)\) is the number of interested users at time t = 0. We note that \(I(t)\) has an S-shape (Fig. 5). It shows that the number of interested users increases quickly when the content becomes available, and that the growth then gradually slows and tapers off once the level of interest reaches \(N_T\). This is similar to the demand function that was obtained using word-of-mouth spread of information by interested users (Bass model).
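As a quick sanity check on Eq. (7), the closed form can be evaluated directly and compared against the right-hand side of the fluid-flow equation. The sketch below is illustrative only; the function names and any parameter values used with it are assumptions, not values taken from the paper.

```python
import math


def interest(t: float, n_total: float, i0: float, p: float, n_friends: float) -> float:
    """Closed-form cumulative interest I(t) from Eq. (7)."""
    decay = math.exp(-p * n_friends * t)
    return n_total * i0 / (i0 + (n_total - i0) * decay)


def interest_rate(i: float, n_total: float, p: float, n_friends: float) -> float:
    """Right-hand side of the fluid-flow equation: I * p * (N - (I / N_T) * N)."""
    return i * p * (n_friends - (i / n_total) * n_friends)
```

Evaluating `interest` at t = 0 returns I(0), and for large t it approaches \(N_T\), which is the S-shaped saturation described above. A finite-difference derivative of `interest` agrees with `interest_rate`, confirming that Eq. (7) satisfies the differential equation.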
Similar interest evolution was also observed when measuring user interest in a video file on a YouTube server [25], and when measuring user interest in a popular video hosted on a university infrastructure (CoralCDN) [26].

Given the evolution of interest in a media content \(I(t)\) in Eq. (7), we can now use the fluid-flow model to write the rate at which downloading users are completely served (finish downloading the media content) as

\[ \frac{dS(t)}{dt} = \mu_Q \cdot \left[ I(t) - S(t) \right], \]

where \(\mu_Q\) is the required QoS streaming rate for every downloading user (measured in bits/second), and \(S(t)\) is the number of completely served users at time instant t. The above differential equation can be easily solved for \(S(t)\). Hence, the expected value of demand for streaming capacity of the content at any time t (measured in bits/second) is

\[ E[D(t)] = \frac{dS(t)}{dt} = \mu_Q \cdot \left[ I(t) - S(t) \right]. \tag{8} \]

6.2 Evaluation of the Algorithm (PBRA) Proposed for Reserving Resources in the Cloud

The algorithm that we evaluate in this section is the first algorithm, proposed in Section 4 for resource reservation in the cloud. We used time-discount rates similar to those used in the pricing model employed by Amazon EC2 [6] in order to derive the tariff functions that we used in our evaluations. Those tariffs are non-linear functions of both the amount of reserved resources and the reservation time. An example of a tariff function that we used in our evaluations for units of reserved resources equal to 3 is depicted in Fig. 6. Note that time discounts are given to the reserved resources. For example, we can see that if the media content provider wants to reserve (prepaid purchase) 3 units of streaming resources for 6 time units, then the tariff is $13 per unit of reserved time; whereas the tariff is $14.25 if the same amount of resources is reserved for only 1 time unit. We consider a log-normal probability distribution of the demand for streaming capacity with mean (i.e., predicted demand \(E[D(t)]\)) computed by Eq. (8) for \(I(t)\) given in Fig.
5, \(\mu_Q = 1\), and variance of 3.

6.2.1 Performance versus Complexity

As we have discussed in Section 4, our proposed algorithm (PBRA) employs a trial window \(w_{try}\) whose size takes values that are integer multiples of \(w_{min}\), where \(w_{min}\) can be defined as the granularity of the resource allocation in the cloud (i.e., it is the minimum reservation time that the cloud provider requires for any amount of resources reserved in the cloud), and it is measured in units of time. To investigate the impact of the value of \(w_{min}\) on the performance of our algorithm, we compared the financial cost of media streaming when using our algorithm for varied sizes of \(w_{min}\) at \(\eta = 0.75\). To plot the comparison figure, we computed the ratio of the overall cost of resource reservation for every value of \(w_{min}\) to the overall cost when using \(w_{min} = 1\) (i.e., normalized cost) (Fig. 7).

Fig. 5. The evolution of interest in the video channel.

Fig. 6. A tariff function for units of reserved resources equal to 3.

Fig. 7. Performance versus complexity of the PBRA algorithm for resource reservation in the cloud.

The results show that the algorithm provides the least cost of resource allocation in the cloud when \(w_{min} = 1\). Hence, we can see that the finer the granularity of resource allocation in the cloud (i.e., the smaller the value of \(w_{min}\)), the better the performance of our algorithm. The better performance, however, comes at higher algorithm complexity, where complexity is measured in terms of the total number of iterations (h). We can see that h is higher for smaller \(w_{min}\) (Fig. 7). However, even for the highest number of iterations (when \(w_{min} = 1\)), the total CPU time was only 1.02 seconds using an Intel(R) Core(TM)2 Quad CPU @ 2.82 GHz. If we compare this execution time with the period of time over which the algorithm is operating, \(0 \le t\ (\mathrm{sec}) \le 1{,}000\) (Fig.
5), we can see that our algorithm executes very efficiently.

6.2.2 Comparison with Other Resource Provisioning Algorithms

Recall that our proposed algorithm for resource reservation in the cloud (PBRA) is based on windows with variable sizes (i.e., variable time slots as shown in Fig. 3). The size of every window and the amount of reserved resources in every window are determined to minimize the financial cost on the media content provider. We evaluate the performance of our PBRA algorithm against two other resource provisioning schemes: the fixed window size scheme (denoted by fixed-reserve-time), and the pay-as-you-go resource allocation scheme, which is widely used in the clouds (denoted by pay-as-you-go). The fixed window size scheme is based on resource reservation wherein all time slots (windows) are of the same size (i.e., \(w_j\) is the same \(\forall j\)). The pay-as-you-go scheme is based on on-demand resource allocation wherein resources are allocated when needed. The price of reserved resources is less than that of the on-demand resources since time-discounted rates are only given to the reserved resources.

We computed the overall financial cost when using each of the above schemes for resource allocation in the cloud. To plot the comparison figure, we computed the ratio of the overall cost for every value of \(w_{min}\) to the cost when using our PBRA algorithm with \(w_{min} = 1\) (normalized cost) (Fig. 8). In the case of fixed-reserve-time, we set \(w_j\) always fixed as \(w_j = w_{min}\ \forall j\), and \(w_j = 10\). We can see that PBRA outperforms the fixed-reserve-time scheme for all values of \(w_{min}\). This is because PBRA selects window sizes according to the predicted demand such that the right amount of resources is reserved in the cloud, which maximally benefits from the time-discount rates in the tariffs and ensures that reserved resources meet the actual demand without incurring wastage.
PBRA also outperforms the pay-as-you-go scheme because it maximally benefits from the time-discounted rates given to the reserved resources, while no discount is given to resources allocated using the on-demand scheme.

6.2.3 Impact of Different Probability Distributions of the Demand

In the next set of evaluations, we considered three log-normal probability distribution functions for the demand with the same mean but varied variances. The mean of all log-normal distributions \(E[D(t)]\) is given in Eq. (8), where \(I(t)\) is given in Fig. 5 and \(\mu_Q = 1\), while the variances of the log-normal distributions were set to 3, 6, and 8.

The stochastic effect of demand on the cost of reserved resources using PBRA is shown in Table 2 when \(\eta = 0.75\). We observe that the overall resource reservation cost increases as the variance of the log-normal distribution increases. This is because larger variance means higher likelihood that the reserved resources in the cloud do not meet the actual demand. Consequently, more reserved resources are required in the cloud to meet the actual demand given a certain probabilistic confidence \(\eta\), which results in higher cost for resource reservation in the cloud.

6.3 Evaluation of the Hybrid Approach for Resource Allocation in the Cloud

In this section, we evaluate the performance of our hybrid resource allocation algorithm proposed in Section 5. Our hybrid approach enables the media content provider to efficiently allocate resources in the cloud using both the reservation resource provisioning plan and the on-demand resource provisioning plan offered by the cloud provider.

As we have discussed in Section 5, the right value of parameter \(\eta\) has to be determined for this hybrid approach to perform optimally. To investigate the impact of different values of \(\eta\) on the performance of the hybrid approach, we considered continuous non-linear tariffs that are functions of both the allocated resources and reservation time.
We used time-discount rates similar to those used in the pricing model employed by Amazon EC2 [6] in order to derive the tariff functions that we used in our evaluations. Time discount rates are only offered to reserved resources, while no time discount rates are offered to resources allocated using the on-demand plan. An example of a tariff function that we used in our evaluations for units of allocated resources equal to 3 is depicted in Fig. 6. Referring to Fig. 6, if the average units of resources allocated in the cloud for 6 time units using the on-demand plan is 3, then the cost is 15 × 6 = $90; whereas if the media content provider reserves (prepaid purchase) the same amount of resources for 6 time units using the reservation plan, then the price charged is only 13 × 6 = $78.

Fig. 8. Performance comparisons.

TABLE 2
Media Streaming Cost Given Different Probability Distributions of the Demand (in $)

In the next set of simulations, we consider a demand with mean \(E[D(t)]\) given in Eq. (8), where \(I(t)\) is given in Fig. 5, \(\mu_Q = 1\), and variance of 3. Recall that our hybrid approach selects the right value of \(\eta\) in every window. In every window j, different values of \(\eta\) are tested to select the one that yields the least overall cost. Table 3 shows the cost of resources allocated using both the resource reservation plan and the on-demand plan when j = 7 (corresponding to t = 650), which results from using our hybrid algorithm. We observe that when \(\eta\) increases, the cost of the resources allocated using the reservation plan increases, while the cost of resources allocated using the on-demand plan decreases. This is because a higher amount of reserved resources is required in the cloud for higher \(\eta\) and, consequently, a smaller amount of on-demand resources is needed.
We also observe that when \(\eta\) increases from 0.75 to 0.8 the overall cost (i.e., the cost of both reservation and on-demand resources) decreases; whereas when \(\eta\) increases beyond 0.8 the overall cost increases. This is because the over-subscribed (over-provisioning) cost of the reserved resources becomes very high when \(\eta > 0.8\). We can see that the optimum value of \(\eta\) (i.e., the value of \(\eta\) that yields the least overall cost) when j = 7 is about 0.8.

To get a sense of how the optimal selection of the value of \(\eta\) can significantly reduce the overall monetary cost on the media content provider when using this hybrid streaming resource provisioning approach, let us compare the total cost when using our hybrid resource allocation algorithm at j = 7 against two cases: the case when the media content provider uses the on-demand plan only (pay-as-you-go), and the case when the media content provider uses the reservation plan only (fixed-reserve-time). We observed that the cost of our hybrid approach when \(\eta^* = 0.8\) is $45,833; while the cost of allocated resources in the case of pay-as-you-go is fixed at about $52,000 (it does not depend on the value of \(\eta\)), and the cost of allocated resources in the case of fixed-reserve-time when \(\eta = 0.8\) is about $48,000 (Fig. 9). Hence, our algorithm reduces the cost by about $6,200 compared to pay-as-you-go (i.e., about 12 percent cost saving), and reduces the cost by about $2,200 compared to fixed-reserve-time (i.e., 4.5 percent cost saving).

We note here that the cost was computed for only one video channel. However, a media content provider generally offers hundreds of video channels to its clients. Therefore, the overall cost saving using our proposed algorithm can be significant for a large number of video channels offered by the media content provider.

7 CONCLUSION AND FUTURE WORK

This paper studies the problem of resource allocation in the cloud for media streaming applications.
We have considered non-linear time-discount tariffs that a cloud provider charges for resources reserved in the cloud. We have proposed algorithms that optimally determine both the amount of reserved resources in the cloud and their reservation time, based on prediction of future demand for streaming capacity, such that the financial cost on the media content provider is minimized. The proposed algorithms exploit the time-discounted rates in the tariffs, while ensuring that sufficient resources are reserved in the cloud without incurring wastage. We have evaluated the performance of our algorithms numerically and using simulations. The results show that our algorithms adjust the tradeoff between resources reserved in the cloud and resources allocated on demand.

In future work, we shall perform experimental measurements to characterize the streaming demand in the Internet and develop our own demand forecasting module. We shall also investigate the case of multiple cloud providers and consider the market competition when allocating resources in the clouds.

ACKNOWLEDGMENTS

This work was supported by the National Center of Electronics, Communication, and Photonics at King Abdulaziz City for Science and Technology (Saudi Arabia). This paper was based in part on a paper that appeared in the proceedings of IEEE Globecom 2012.

TABLE 3
Media Streaming Cost Using Two Resource Allocation Plans Provided by the Cloud (Hybrid Resource Provisioning Approach) (in $)

Fig. 9. Hybrid approach performance comparisons.

Improving Web Navigation Usability by Comparing Actual and Anticipated Usage

05/08/201902/07/2019 by admin

We present a new method to identify navigation-related Web usability problems based on comparing actual and anticipated usage patterns. The actual usage patterns can be extracted from Web server logs routinely recorded for operational websites by first processing the log data to identify users, user sessions, and user task-oriented transactions, and then applying a usage mining algorithm to discover patterns among actual usage paths. The anticipated usage, including information about both the path and time required for user-oriented tasks, is captured by our ideal user interactive path models constructed by cognitive experts based on their cognition of user behavior.

The comparison is performed via the mechanism of test oracle for checking results and identifying user navigation difficulties. The deviation data produced from this comparison can help us discover usability issues and suggest corrective actions to improve usability. A software tool was developed to automate a significant part of the activities involved. With an experiment on a small service-oriented website, we identified usability problems, which were cross-validated by domain experts, and quantified usability improvement by the higher task success rate and lower time and effort for given tasks after suggested corrections were implemented. This case study provides an initial validation of the applicability and effectiveness of our method.

1.2 INTRODUCTION

As the World Wide Web becomes prevalent today, building and ensuring easy-to-use Web systems is becoming a core competency for business survival. Usability is defined as the effectiveness, efficiency, and satisfaction with which specific users can complete specific tasks in a particular environment. Three basic Web design principles, i.e., structural firmness, functional convenience, and presentational delight, were identified to help improve users’ online experience. Structural firmness relates primarily to the characteristics that influence the website security and performance. Functional convenience refers to the availability of convenient characteristics, such as a site’s ease of use and ease of navigation, that help users’ interaction with the interface. Presentational delight refers to the website characteristics that stimulate users’ senses. Usability engineering provides methods for measuring usability and for addressing usability issues. Heuristic evaluation by experts and user-centered testing are typically used to identify usability issues and to ensure satisfactory usability.

However, significant challenges exist, including 1) accuracy of problem identification due to false alarms common in expert evaluation, 2) unrealistic evaluation of usability due to differences between the testing environment and the actual usage environment, and 3) increased cost due to the prolonged evolution and maintenance cycles typical for many Web applications. On the other hand, log data routinely kept at Web servers represent actual usage. Such data have been used for usage-based testing and quality assurance and also for understanding user behavior and guiding user interface design.

Server-side logs can be automatically generated by Web servers, with each entry corresponding to a user request. By analyzing these logs, Web workload was characterized and used to suggest performance enhancements for Internet Web servers. Because of the vastly uneven Web traffic, massive user population, and diverse usage environment, coverage-based testing is insufficient to ensure the quality of Web applications. Therefore, server-side logs have been used to construct Web usage models for usage-based Web testing or to automatically generate test cases accordingly to improve test efficiency.

1.3 LITERATURE SURVEY

WEB USABILITY PROBE: A TOOL FOR SUPPORTING REMOTE USABILITY EVALUATION OF WEB SITES

PUBLICATION: Human-Computer Interaction – INTERACT 2011. New York, NY, USA: Springer, 2011, pp. 349–357.

AUTHORS: T. Carta, F. Paternò, and V. F. D. Santana

EXPLANATION:

Usability evaluation of Web sites is still a difficult and time-consuming task, often performed manually. This paper presents a tool that supports remote usability evaluation of Web sites. The tool considers client-side data on user interactions and JavaScript events. In addition, it allows the definition of custom events, giving evaluators the flexibility to add specific events to be detected and considered in the evaluation. The tool supports evaluation of any Web site by exploiting a proxy-based architecture and enables the evaluator to perform a comparison between actual user behavior and an optimal sequence of actions.

SUPPORTING ACTIVITY MODELLING FROM ACTIVITY TRACES

PUBLICATION: Expert Syst., vol. 29, no. 3, pp. 261–275, 2012.

AUTHORS: O. L. Georgeon, A. Mille, T. Bellet, B. Mathern, and F. E. Ritter

EXPLANATION:

We present a new method and tool for activity modelling through qualitative sequential data analysis. In particular, we address the question of constructing a symbolic abstract representation of an activity from an activity trace. We use knowledge engineering techniques to help the analyst build an ontology of the activity, that is, a set of symbols and hierarchical semantics that supports the construction of activity models. The ontology construction is pragmatic, evolutionist and driven by the analyst in accordance with their modelling goals and their research questions. Our tool helps the analyst define transformation rules to process the raw trace into abstract traces based on the ontology. The analyst visualizes the abstract traces and iteratively tests the ontology, the transformation rules and the visualization format to confirm the models of activity. With this tool and this method, we found innovative ways to represent a car-driving activity at different levels of abstraction from activity traces collected from an instrumented vehicle. As examples, we report two new strategies of lane changing on motorways that we have found and modelled with this approach.

TOOLS FOR REMOTE USABILITY EVALUATION OF WEB APPLICATIONS THROUGH BROWSER LOGS AND TASK MODELS

PUBLICATION: Behavior Res. Methods, Instrum., Comput., vol. 35, no. 3, pp. 369–378, 2003

AUTHORS: L. Paganelli and F. Paternò

EXPLANATION:

The dissemination of Web applications is extensive and still growing. The great penetration of Web sites raises a number of challenges for usability evaluators. Video-based analysis can be rather expensive and may provide limited results. In this article, we discuss what information can be provided by automatic tools able to process the information contained in browser logs and task models. To this end, we present a tool that can be used to compare log files of user behavior with the task model representing the actual Web site design, in order to identify where users’ interactions deviate from those envisioned by the system design.

CHAPTER 2

2.0 SYSTEM ANALYSIS

2.1 EXISTING SYSTEM:

Usability has long been addressed and discussed in previous studies. When people navigate the Web, they often encounter a number of usability issues. This is also due to the fact that Web surfers often decide on the spur of the moment what to do and whether to continue to navigate in a Web site. Usability evaluation is thus an important phase in the deployment of Web applications. For this purpose, automatic tools are very useful to gather larger amounts of usability data and support their analysis.

Remote evaluation implies that users and evaluators are separated in time and/or space. This is important in order to analyse users in their daily environments and decreases the costs of the evaluation without requiring the use of specific laboratories and asking the users to move. In addition, tools for remote Web usability evaluation should be sufficiently general so that they can be used to analyse user behaviour even when using various browsers or applications developed using different toolkits. We prefer logging on the client-side in order to be able to capture any user-generated events, which can provide useful hints regarding possible usability problems.

Existing approaches have been used to support usability evaluation. An example was WebRemUsine, which was a tool for remote usability evaluation of Web applications through browser logs and task models. Propp and Forbrig have used task models for supporting usability evaluation of a different type of application: cooperative behaviour of people interacting in smart environments. In a different use of models, other authors discuss how task models can enhance visualization of the usability test log. In our case, we do not require the effort of developing models to apply our tool. We only require that the designer provides an example of optimal use associated with each of the relevant tasks. The tool will then compare the logs of actual use with the optimal log in order to identify deviations, which may indicate potential usability problems.

2.1.1 DISADVANTAGES:

Web navigate used a logger to collect data from a user session test on a Web interface prototype running on a PDA simulator in order to evaluate different types of Web navigation tools and identify the best one for small display devices.

Users were asked to find the answer to specific questions using different types of navigation tools to move from one page to another. A database was used to store users’ actions, but only the answer given by the user to each specific question was logged. Moreover, every term searched by the user by means of the internal search tool was stored separately.

Client-side data encounters different challenges regarding the identification of the elements that users are interacting with, how to manage element identification when the page is changed dynamically, how to manage data logging when users are going from one page to another, amongst others. The following are some of the solutions we adopted in order to deal with these issues.

2.2 PROPOSED SYSTEM:

We propose a new method to identify navigation-related usability problems by comparing Web usage patterns extracted from server logs against anticipated usage represented in some cognitive user models. Fig. 1 shows the architecture of our method. It includes three major modules: Usage Pattern Extraction, IUIP Modeling, and Usability Problem Identification, where IUIP stands for ideal user interactive path. First, we extract actual navigation paths from server logs and discover patterns for some typical events. In parallel, we construct IUIP models for the same events. IUIP models are based on the cognition of user behavior and can represent anticipated paths for specific user-oriented tasks.

Our IUIP models are based on existing cognitive models, particularly the ACT-R model. Due to the complexity of ACT-R model development and the low-level rule-based programming language it relies on, we constructed our own cognitive architecture and supporting tool based on the ideas from ACT-R. In general, the user behavior patterns can be traced with a sequence of states and transitions. Our IUIP consists of a number of states and transitions. For a particular goal, a sequence of related operation rules can be specified for a series of transitions. Our IUIP model specifies both the path and the benchmark interactive time (no more than a maximum time) for some specific states (pages). The benchmark time can first be specified based on general rules for common types of Web pages. Humans usually try to complete their tasks in the most efficient manner by attempting to maximize their returns while minimizing the cost.

Typically, experts and novices will have different task performance. Novices need to learn task-specific knowledge while performing the task, but experts can complete the task in the most efficient manner. Based on this cognitive mechanism, IUIP models need to be constructed individually for novices and experts. Our method is cost-effective. It would be particularly valuable in two common situations: where an adequate number of actual users cannot be involved in testing, and where cognitive experts are in short supply. Server logs in our method represent real users’ operations in natural working conditions, and our IUIP models injected with human behavior cognition represent part of cognitive experts’ work. We are currently integrating these modeling and analysis tools into a tool suite that supports measurement, analysis, and overall quality improvement for Web applications.

2.2.1 ADVANTAGES:

1) Logical deviation calculation:

a) When the path choice anticipated by the IUIP model is available but not selected, a single deviation is counted.

b) Sum up all the above deviations over all the selected user transactions for each page.

2) Temporal deviation calculation:

a) When a user spends more time at a specific page than the benchmark specified for the corresponding state in the IUIP model, a single deviation is counted.

b) Sum up all the above deviations over all the selected user transactions for each page.

The successive pages related to furniture categories are grouped into a dashed box. The pages with deviations and the unanticipated follow-up pages below them are marked with solid rectangular boxes. Those unanticipated follow-up pages will not be used themselves for deviation calculations to avoid double counting.

2.3 HARDWARE & SOFTWARE REQUIREMENTS:

2.3.1 HARDWARE REQUIREMENT:

- Processor – Pentium IV
- Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Key Board – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA

2.3.2 SOFTWARE REQUIREMENTS:

Operating System : Windows XP or Win7
Front End : JAVA JDK 1.7
Back End : MySQL Server
Server : Apache Tomcat Server
Script : JSP Script
Document : MS-Office 2007

CHAPTER 3

3.0 SYSTEM DESIGN:

Data Flow Diagram / Use Case Diagram / Flow Diagram:

A data flow diagram (DFD) is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on these data, and the output data generated by the system.

The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components. These components are the system process, the data used by the process, an external entity that interacts with the system and the information flows in the system.

DFD shows how the information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.

A DFD may be used to represent a system at any level of abstraction. A DFD may be partitioned into levels that represent increasing information flow and functional detail.

NOTATION:

SOURCE OR DESTINATION OF DATA:

External sources or destinations, which may be people or organizations or other entities

DATA STORE:

Here the data referenced by a process is stored and retrieved.

PROCESS:

People, procedures, or devices that produce data. The physical component is not identified.

DATA FLOW:

Data moves in a specific direction from an origin to a destination. The data flow is a “packet” of data.

MODELING RULES:

There are several common modeling rules when creating DFDs:

All processes must have at least one data flow in and one data flow out.
All processes should modify the incoming data, producing new forms of outgoing data.
Each data store must be involved with at least one data flow.
Each external entity must be involved with at least one data flow.
A data flow must be attached to at least one process.

3.1 ARCHITECTURE DIAGRAM

3.2 DATAFLOW DIAGRAM

UML DIAGRAMS:

3.3 USE CASE DIAGRAM:

3.4 CLASS DIAGRAM:

3.5 SEQUENCE DIAGRAM:

3.6 ACTIVITY DIAGRAM:

CHAPTER 4

4.0 IMPLEMENTATION:

IUIP MODELS:

Our IUIP model specifies both the path and the benchmark interactive time (no more than a maximum time) for some specific states (pages). The benchmark time can first be specified based on general rules for common types of Web pages. For example, human factors guidelines specify the upper bound for the response time to mitigate the risk that users will lose interest in a website. Humans usually try to complete their tasks in the most efficient manner by attempting to maximize their returns while minimizing the cost. Typically, experts and novices will have different task performance. Novices need to learn task-specific knowledge while performing the task, but experts can complete the task in the most efficient manner. Based on this cognitive mechanism, IUIP models need to be constructed individually for novices and experts by cognitive experts, who draw on their domain expertise and their knowledge of different users’ interactive behavior.

We can adapt the durations by performing iterative tests with different users. Diagrammatic notation methods and tools are often used to support interaction modeling and task performance evaluation. To support IUIP model construction and reuse, we used C++ and XML to develop our IUIP modeling tool based on the open-source visual diagram software DIA. DIA allows users to draw customized diagrams, such as UML, data flow, and other diagrams. Existing shapes and lines in DIA form part of the graphic notations in our IUIP models. New ones can be easily added by writing simple XML files. The operations, operation rules, and computation rules can be embedded into the graphic notations with the XML schema we defined to form our IUIP symbols. Currently, about 20 IUIP symbols have been created to represent typical Web interactions. IUIP symbols used in subsequent examples are explained at the bottom of the corresponding figure. Cognitive experts can use our IUIP modeling tool to develop various IUIP models for different Web applications.

The actual users’ navigation trails we extracted from the aggregated trail tree are compared against corresponding IUIP models automatically. This comparison will yield a set of deviations between the two. We can identify some common problems of actual users’ interaction with the Web application by focusing on deviations that occur frequently. Combined with expertise in product internal and contextual information, our results can also help identify the root causes of some usability problems existing in the Web design. Based on logical choices made and time spent by users at each page, the calculation of deviations between actual users’ usage patterns and IUIP can be divided into two parts:

1) Logical deviation calculation:

a) When the path choice anticipated by the IUIP model is available but not selected, a single deviation is counted.

b) Sum up all the above deviations over all the selected user transactions for each page.

2) Temporal deviation calculation:

a) When a user spends more time at a specific page than the benchmark specified for the corresponding state in the IUIP model, a single deviation is counted.

b) Sum up all the above deviations over all the selected user transactions for each page.
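The two counting rules above can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the tool described in this report: the class names DeviationCounter, State, and Visit, the page category.php, and the benchmark times of 30 and 60 seconds are all hypothetical. The sketch also assumes that a logical deviation is counted only when the user actually selects a different follow-up page, and that pages without an IUIP state are the unanticipated follow-up pages that are skipped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class DeviationCounter {
    /** One IUIP state: the anticipated follow-up pages and the benchmark time for a page. */
    static class State {
        final int maxSeconds;
        final Set<String> anticipatedNext;

        State(int maxSeconds, String... anticipatedNext) {
            this.maxSeconds = maxSeconds;
            this.anticipatedNext = new HashSet<>(Arrays.asList(anticipatedNext));
        }
    }

    /** One page visit inside a user transaction. */
    static class Visit {
        final String page;
        final int seconds;

        Visit(String page, int seconds) {
            this.page = page;
            this.seconds = seconds;
        }
    }

    final Map<String, Integer> logical = new TreeMap<>();   // per-page logical deviations
    final Map<String, Integer> temporal = new TreeMap<>();  // per-page temporal deviations

    /** Adds the deviations of one user transaction to the per-page sums. */
    void count(Map<String, State> model, List<Visit> transaction) {
        for (int i = 0; i < transaction.size(); i++) {
            Visit visit = transaction.get(i);
            State state = model.get(visit.page);
            if (state == null) {
                continue; // unanticipated follow-up page: skipped to avoid double counting
            }
            if (visit.seconds > state.maxSeconds) {
                increment(temporal, visit.page);
            }
            boolean hasNext = i + 1 < transaction.size();
            if (hasNext && !state.anticipatedNext.isEmpty()
                    && !state.anticipatedNext.contains(transaction.get(i + 1).page)) {
                increment(logical, visit.page);
            }
        }
    }

    private static void increment(Map<String, Integer> counts, String page) {
        Integer old = counts.get(page);
        counts.put(page, old == null ? 1 : old + 1);
    }

    public static void main(String[] args) {
        Map<String, State> model = new HashMap<>();
        model.put("index.php", new State(30, "category.php")); // hypothetical benchmark and path
        model.put("category.php", new State(60));
        DeviationCounter counter = new DeviationCounter();
        List<Visit> transaction = new ArrayList<>();
        transaction.add(new Visit("index.php", 45));
        transaction.add(new Visit("login.php", 10));
        counter.count(model, transaction);
        System.out.println(counter.logical + " " + counter.temporal);
        // prints {index.php=1} {index.php=1}
    }
}
```

Calling count once per selected user transaction accumulates the per-page sums required by steps 1b and 2b.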

The IUIP model for the task “First Selection” is shown on the top. The corresponding user Trail 7, a part of a trail tree extracted from log data, is presented under it. Each node in the tree is annotated with the number of users having reached the node across the same trail prefix. The successive pages related to furniture categories are grouped into a dashed box. The pages with deviations and the unanticipated follow-up pages below them are marked with solid rectangular boxes. Those unanticipated follow-up pages will not be used themselves for deviation calculations to avoid double counting.

4.1 ALGORITHM

TRAIL TREE USAGE MINING ALGORITHM

The transactions identified from each user session form a collection of paths. We use the trie data structure to merge the paths along common prefixes. A trie, or a prefix tree, is an ordered tree used to store an associative array where the keys are usually strings. All the descendants of a node have a common prefix of the string associated with that node. The root is associated with the empty string. We adapted the trie algorithm to construct a tree structure that also captures user visit frequencies, which is called a trail tree in our work. In a trail tree, a complete path from the root to a leaf node is called a trail.

The leaf nodes of the trail tree are also annotated with the trail names. The transaction paths extracted from the Web server log are shown in the table to its left, together with path occurrence frequencies. Paths 1, 4, and 5 have the common first node a; therefore, they were merged together. For the second node of this subtree, Paths 1 and 4 both accessed Page b; therefore, the two paths were combined at Node b. Finally, Paths 1 and 4 were merged into a single trail, Trail 1, although Path 1 terminates at Node e. By the same method, the other paths can be integrated into the trail tree. The number at each edge indicates the number of users reaching the next node across the same trail prefix.
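The merging of paths along common prefixes can be sketched as below. This is an illustrative sketch, not the adapted trie algorithm itself: the class name TrailTree, its methods, and the three example paths with their frequencies are hypothetical. Each node stores the number of users who reached it across the same trail prefix, and a path that is a prefix of a longer path is absorbed into the longer trail.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TrailTree {
    static class Node {
        final String page;
        int visits; // users reaching this node across the same trail prefix
        final Map<String, Node> children = new TreeMap<>();

        Node(String page) {
            this.page = page;
        }
    }

    final Node root = new Node(""); // the root is associated with the empty string

    /** Merges one transaction path, seen 'frequency' times, into the tree. */
    void addPath(List<String> pages, int frequency) {
        Node node = root;
        node.visits += frequency;
        for (String page : pages) {
            Node child = node.children.get(page);
            if (child == null) {
                child = new Node(page);
                node.children.put(page, child);
            }
            child.visits += frequency;
            node = child;
        }
    }

    /** Lists every root-to-leaf trail as "a>b>c (n)", where n is the visit count of the leaf. */
    List<String> trails() {
        List<String> out = new ArrayList<>();
        collect(root, "", out);
        return out;
    }

    private static void collect(Node node, String prefix, List<String> out) {
        if (node.children.isEmpty()) {
            if (!prefix.isEmpty()) {
                out.add(prefix + " (" + node.visits + ")");
            }
            return;
        }
        for (Node child : node.children.values()) {
            collect(child, prefix.isEmpty() ? child.page : prefix + ">" + child.page, out);
        }
    }

    public static void main(String[] args) {
        TrailTree tree = new TrailTree();
        tree.addPath(Arrays.asList("a", "b", "e"), 2);      // hypothetical path ending at e
        tree.addPath(Arrays.asList("a", "b", "e", "f"), 1); // extends the path above
        tree.addPath(Arrays.asList("a", "c"), 4);
        System.out.println(tree.trails()); // prints [a>b>e>f (1), a>c (4)]
    }
}
```

A TreeMap keeps the children ordered, which makes the listing of trails deterministic.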

Based on the aggregated trail tree, further mining can be performed for some “interesting” pattern discovery. Typically, good mining results require a close interaction of the human experts to specify the characteristics that make navigation patterns interesting. In our method, we focus on the paths which are used by a sufficient number of users to finish a specific task. The paths can be initially prioritized by their usage frequencies and selected by using a threshold specified by the experts. Application-domain knowledge and contextual information, such as criticality of specific tasks, user privileges, etc., can also be used to identify “interesting” patterns. For the FG 2009 website, we extracted 30 trails each for Tasks 1, 2, and 3, and 5 trails for Task 4.
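Prioritizing trails by usage frequency and cutting them at an expert-specified threshold could look like the sketch below. The class name TrailSelector, the method select, and the example frequencies are hypothetical, and ties are broken alphabetically only to make the result deterministic.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TrailSelector {
    /** Returns the trails used by at least minUsers users, most frequent first. */
    static List<String> select(final Map<String, Integer> frequencies, int minUsers) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : frequencies.entrySet()) {
            if (entry.getValue() >= minUsers) {
                selected.add(entry.getKey());
            }
        }
        Collections.sort(selected, new Comparator<String>() {
            @Override
            public int compare(String x, String y) {
                int byFrequency = frequencies.get(y).compareTo(frequencies.get(x));
                return byFrequency != 0 ? byFrequency : x.compareTo(y);
            }
        });
        return selected;
    }

    public static void main(String[] args) {
        Map<String, Integer> frequencies = new HashMap<>();
        frequencies.put("Trail 1", 5); // hypothetical usage counts
        frequencies.put("Trail 2", 1);
        frequencies.put("Trail 3", 9);
        System.out.println(select(frequencies, 2)); // prints [Trail 3, Trail 1]
    }
}
```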

4.2 MODULES:

COGNITIVE USER MODEL:

WEB SERVER USER LOG:

USAGE PATTERN EXTRACTION:

USABILITY MEASURING:

4.3 MODULE DESCRIPTION:

COGNITIVE USER MODEL:

There is a growing need to incorporate insights from cognitive science about the mechanisms, strengths, and limits of human perception and cognition to understand the human factors involved in user interface design. For example, the various constraints on cognition (e.g., system complexity) and the mechanisms and patterns of strategy selection can help human factor engineers develop solutions and apply technologies that are better suited to human abilities.

Commonly used cognitive models include GOMS, EPIC, and ACT-R. The GOMS model consists of Goals, Operators, Methods, and Selection rules. As the high-level architecture, GOMS describes behavior and defines interactions as a static sequence of human actions. As the low-level cognitive architecture, EPIC (Executive-Process/Interactive Control) and ACT-R (Adaptive Control of Thought-Rational) can be taken as the specific implementation of the high-level architecture.

They provide detailed information about how to simulate human processing and cognition. An important feature of these low-level cognitive architectures is that they are all implemented as computer programming systems so that cognitive models may be specified, executed, and their outputs (e.g., error rates and response latencies) compared with human performance data.

WEB SERVER USER LOG:

Server logs have also been used by organizations to learn about the usability of their products. For example, search queries can be extracted from server logs to discover user information needs for usability task analysis. There are many advantages to using server logs for usability studies. Logs can provide insight into real users performing actual tasks in natural working conditions versus in an artificial setting of a lab. Logs also represent the activities of many users over a long period of time versus the small sample of users in a short time span in typical lab testing. Data preparation techniques and algorithms can be used to process the raw Web server logs, and then mining can be performed to discover users’ visitation patterns for further usability analysis.

For example, organizations can mine server-side logs to predict users’ behavior and context to satisfy users’ needs. Users’ revisitation patterns can be discovered by mining server logs to develop guidelines for browser history mechanisms that can be used to reduce users’ cognitive and physical effort. Client-side logs can capture accurate comprehensive usage data for usability analysis, because they allow low-level user interaction events such as keystrokes and mouse movements to be recorded.

For example, using these client-side data, the evaluator can accurately measure time spent on particular tasks or pages as well as study the use of “back” button and user click streams. Such data are often used with task based approaches and models for usability analysis by comparing discrepancies between the designer’s anticipation and a user’s actual behavior. However, the evaluator must program the UI, modify Web pages, or use an instrumented browser with plug-in tools or a special proxy server to collect such data.

USAGE PATTERN EXTRACTION:

Web server logs are our data source. Each entry in a log contains the IP address of the originating host, the timestamp, the requested Web page, the referrer, the user agent and other data. Typically, the raw data need to be preprocessed and converted into user sessions and transactions to extract usage patterns.
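As an illustration of the fields listed above, the following sketch parses a single log line. It assumes the common Combined Log Format, which this report does not confirm for the FG 2009 server; the class name LogEntry and the sample line are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** One parsed access-log entry. The Combined Log Format is an assumption of this sketch. */
class LogEntry {
    // host ident user [timestamp] "METHOD page PROTOCOL" status bytes "referrer" "user agent"
    private static final Pattern COMBINED = Pattern.compile(
        "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"\\S+ (\\S+)[^\"]*\" (\\d{3}) \\S+ \"([^\"]*)\" \"([^\"]*)\"$");

    final String ip;
    final String timestamp;
    final String page;
    final int status;
    final String referrer;
    final String userAgent;

    private LogEntry(Matcher m) {
        ip = m.group(1);
        timestamp = m.group(2);
        page = m.group(3);
        status = Integer.parseInt(m.group(4));
        referrer = m.group(5);
        userAgent = m.group(6);
    }

    /** Returns the parsed entry, or null when the line does not match the assumed format. */
    static LogEntry parse(String line) {
        Matcher m = COMBINED.matcher(line.trim());
        return m.matches() ? new LogEntry(m) : null;
    }

    public static void main(String[] args) {
        // Hypothetical log line, not taken from the FG 2009 data.
        LogEntry e = parse("192.0.2.10 - - [10/Oct/2009:13:55:36 +0000] "
            + "\"GET /index.php HTTP/1.1\" 200 2326 \"http://example.org/\" \"Mozilla/5.0\"");
        System.out.println(e.ip + " " + e.page + " " + e.referrer);
        // prints 192.0.2.10 /index.php http://example.org/
    }
}
```

Lines that do not match are returned as null so that the caller can decide whether to skip or report them.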

The data preparation and preprocessing include the following domain-dependent tasks.

1) Data cleaning: This task is usually site-specific and involves removing extraneous references to style files, graphics, or sound files that may not be important for the purpose of our analysis.

2) User identification: The remaining entries are grouped by individual users. Because no user authentication and cookie information is available in most server logs, we used the combination of IP, user agent, and referrer fields to identify unique users.

3) User session identification: The activity record of each user is segmented into sessions, with each representing a single visit to a site. Without additional authentication information from users and without mechanisms such as embedded session IDs, one must rely on heuristics for session identification. For example, we set an elapsed time of 15 min between two successive page accesses as a threshold to partition a user activity record into different sessions.

4) Path completion: Client or proxy side caching can often result in missing access references to some pages that have been cached. These missing references can often be heuristically inferred from the knowledge of site topology and referrer information, along with temporal information from server logs.

These tasks are time consuming and computationally intensive, but essential to the successful discovery of usage patterns.
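Steps 1 to 3 above can be sketched together as follows. This is a simplified illustration, not the tool described in this report: the class names SessionBuilder and Request and the list of skipped file suffixes are hypothetical. As a simplification, users are keyed by IP address and user agent only, so the referrer-based part of user identification is omitted, and the input is assumed to be sorted by time.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SessionBuilder {
    /** 15 min threshold between two successive page accesses, as stated in step 3. */
    static final long SESSION_GAP_MILLIS = 15L * 60L * 1000L;

    /** File suffixes dropped during data cleaning; this list is an assumption and is site-specific. */
    private static final String[] SKIPPED = {".css", ".js", ".gif", ".jpg", ".png", ".ico", ".wav"};

    /** Minimal request record; the field names are hypothetical. */
    static class Request {
        final String ip;
        final String userAgent;
        final String page;
        final long timeMillis;

        Request(String ip, String userAgent, String page, long timeMillis) {
            this.ip = ip;
            this.userAgent = userAgent;
            this.page = page;
            this.timeMillis = timeMillis;
        }
    }

    /** Step 1: true when the request targets a page rather than a style, graphics, or sound file. */
    static boolean isPageRequest(String page) {
        String path = page.toLowerCase();
        int query = path.indexOf('?');
        if (query >= 0) {
            path = path.substring(0, query);
        }
        for (String suffix : SKIPPED) {
            if (path.endsWith(suffix)) {
                return false;
            }
        }
        return true;
    }

    /** Steps 2 and 3: group cleaned requests by user, then split each user's record at long gaps. */
    static List<List<Request>> sessions(List<Request> sortedRequests) {
        Map<String, List<List<Request>>> byUser = new LinkedHashMap<>();
        for (Request r : sortedRequests) {
            if (!isPageRequest(r.page)) {
                continue;
            }
            String user = r.ip + "|" + r.userAgent; // simplified user key (assumption)
            List<List<Request>> userSessions = byUser.get(user);
            if (userSessions == null) {
                userSessions = new ArrayList<>();
                byUser.put(user, userSessions);
            }
            List<Request> current = userSessions.isEmpty() ? null : userSessions.get(userSessions.size() - 1);
            if (current == null
                    || r.timeMillis - current.get(current.size() - 1).timeMillis > SESSION_GAP_MILLIS) {
                current = new ArrayList<>();
                userSessions.add(current);
            }
            current.add(r);
        }
        List<List<Request>> all = new ArrayList<>();
        for (List<List<Request>> userSessions : byUser.values()) {
            all.addAll(userSessions);
        }
        return all;
    }

    public static void main(String[] args) {
        List<Request> log = new ArrayList<>();
        log.add(new Request("192.0.2.10", "Mozilla/5.0", "/index.php", 0L));
        log.add(new Request("192.0.2.10", "Mozilla/5.0", "/style.css", 1000L));
        log.add(new Request("192.0.2.10", "Mozilla/5.0", "/login.php", 60000L));
        log.add(new Request("192.0.2.10", "Mozilla/5.0", "/index.php", 60000L + 16L * 60L * 1000L));
        System.out.println(sessions(log).size()); // prints 2
    }
}
```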

We developed a tool to automate all these tasks except part of path completion. For path completion, the designers or developers first need to manually discover the rules of missing references based on site structure, referrer, and other heuristic information. Once the repeated patterns are identified, this work can be automatically carried out. Our tool can work with server logs of different Web applications by modifying the related parameters in the configuration file. The processed log data are stored into a database for further use.
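One common referrer-based heuristic for path completion can be sketched as below. It is offered only as an illustration of the idea and is not the site-specific rule set described above: when the referrer of a request is not the page the user was last on but does appear earlier in the session, the user is assumed to have pressed the "back" button through cached pages, and those pages are re-inserted. The class name PathCompletion is hypothetical, and referrers are assumed to be already normalized to the same identifiers as the requested pages.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class PathCompletion {
    /**
     * pages.get(i) was requested with referrers.get(i); an empty string means no referrer.
     * Returns the path with the inferred cached page views inserted.
     */
    static List<String> complete(List<String> pages, List<String> referrers) {
        List<String> path = new ArrayList<>();
        Deque<String> backStack = new ArrayDeque<>(); // models the browser's back history
        for (int i = 0; i < pages.size(); i++) {
            String referrer = referrers.get(i);
            if (!referrer.isEmpty() && backStack.contains(referrer)) {
                // Walk back through cached pages until the referrer is on top again.
                while (!backStack.peek().equals(referrer)) {
                    backStack.pop();
                    path.add(backStack.peek()); // cached page revisited via "back"
                }
            }
            backStack.push(pages.get(i));
            path.add(pages.get(i));
        }
        return path;
    }

    public static void main(String[] args) {
        List<String> pages = new ArrayList<>();
        List<String> referrers = new ArrayList<>();
        String[][] log = {{"a", ""}, {"b", "a"}, {"c", "b"}, {"d", "b"}}; // hypothetical session
        for (String[] entry : log) {
            pages.add(entry[0]);
            referrers.add(entry[1]);
        }
        System.out.println(complete(pages, referrers)); // prints [a, b, c, b, d]
    }
}
```

In the example, page d was requested from page b although the last logged page was c, so one cached visit to b is inserted between c and d.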

USABILITY MEASURING:

We applied our method to the FG 2009 website and report our specific results here. We collected Web server access log data for the first three days after its deployment. The server log includes more than 500 entries. After preprocessing the raw log data using our tool, we identified 58 unique users and 81 sessions. Then, we constructed four event models for four typical tasks. We extracted 95 trails for these tasks. Meanwhile, a designer with three years of GUI design experience and an expert with five years of experience in human factors practice for the Web constructed four IUIP models for the same tasks based on their cognition of users’ interactive behavior. By checking the extracted usage patterns against the four IUIP models, we obtained the logical and temporal deviations shown in Tables I and II and identified 17 usability issues or potential usability problems. Some usability issues were identified by both logical and temporal deviation analyses. Next, we further analyze these deviations for usability problem identification and improvement.

In Table I, 16 deviations took place in the page “index.php.” The unanticipated followup page is the page “login.php,” followed by the page “index.php?f=t” (login failure). Further reviewing the index page, we found that the page design is too simplistic: No instruction was provided to help users to login or register. We inferred that some users with limited online shopping experience were trying to use their regular email addresses and passwords to log in to the FG 2009 website.

We also found some structure design issues. For example, we observed that some users repeatedly visited the page “Selection Rules.” It is likely that when the users were not permitted to select any furniture in some categories (the FG website limited each user to select one piece of furniture under each category), they had to go to the page “Selection Rules” to find the reasons. To reduce these redundant operations and improve user experience, the help function for selection rules should be redesigned to make it more convenient for users to consult.

CHAPTER 5

5.0 SYSTEM STUDY:

5.1 FEASIBILITY STUDY:

Three key considerations involved in the feasibility analysis are

ECONOMICAL FEASIBILITY
TECHNICAL FEASIBILITY
SOCIAL FEASIBILITY

5.1.1 ECONOMICAL FEASIBILITY:

5.1.2 TECHNICAL FEASIBILITY

5.1.3 SOCIAL FEASIBILITY:

5.2 SYSTEM TESTING:

5.2.1 UNIT TESTING:

Description	Expected result
Test for application window properties.	All the properties of the windows are to be properly aligned and displayed.
Test for mouse operations.	All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions.

5.2.2 FUNCTIONAL TESTING:

Description	Expected result
Test for all modules.	All peers should communicate in the group.
Test for various peers in a distributed network framework, as it displays all users available in the group.	The result after execution should be accurate.

5.2.3 NON-FUNCTIONAL TESTING:

Load testing
Performance testing
Usability testing
Reliability testing
Security testing

5.2.4 LOAD TESTING:

Description	Expected result
It is necessary to ascertain that the application behaves correctly under loads when ‘Server busy’ response is received.	Should designate another active node as a Server.

5.2.5 PERFORMANCE TESTING:

Description	Expected result
This is required to assure that an application performs adequately, having the capability to handle many peers, delivering its results in the expected time and using an acceptable level of resources; it is an aspect of operational management.	Should handle large input values and produce accurate results in the expected time.

5.2.6 RELIABILITY TESTING:

Description	Expected result
This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in providing the application.	In case of failure of the server, an alternate server should take over the job.

5.2.7 SECURITY TESTING:

Description	Expected result
Check that the user identification is authenticated.	In case of failure, the user should not be connected to the framework.
Check whether group keys in a tree are shared by all peers.	The peers should know group key in the same group.

5.2.8 WHITE BOX TESTING:

Description	Expected result
Exercise all logical decisions on their true and false sides.	All the logical decisions must be valid.
Execute all loops at their boundaries and within their operational bounds.	All the loops must be finite.
Exercise internal data structures to ensure their validity.	All the data structures must be valid.

5.2.9 BLACK BOX TESTING:

Description	Expected result
To check for incorrect or missing functions.	All the functions must be valid.
To check for interface errors.	The entire interface must function normally.
To check for errors in data structures or external database access.	Database updates and retrieval must be done.
To check for initialization and termination errors.	All the functions and data structures must be initialized properly and terminated normally.

All the above system testing strategies are carried out, as the development, documentation, and institutionalization of the proposed goals and related policies are essential.

CHAPTER 6

6.0 SOFTWARE DESCRIPTION:

6.1 JAVA TECHNOLOGY:

Java technology is both a programming language and a platform.

The Java Programming Language

The Java programming language is a high-level language that can be characterized by all of the following buzzwords:

- Simple
- Architecture neutral
- Object oriented
- Portable
- Distributed
- High performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure

6.2 THE JAVA PLATFORM:

The Java platform has two components:

The Java Virtual Machine (Java VM)
The Java Application Programming Interface (Java API)

You’ve already been introduced to the Java VM. It’s the base for the Java platform and is ported onto various hardware-based platforms.

The following figure depicts a program that’s running on the Java platform. As the figure shows, the Java API and the virtual machine insulate the program from the hardware.

6.3 WHAT CAN JAVA TECHNOLOGY DO?

The essentials: Objects, strings, threads, numbers, input and output, data structures, system properties, date and time, and so on.
Applets: The set of conventions used by applets.
Networking: URLs, TCP (Transmission Control Protocol), UDP (User Datagram Protocol) sockets, and IP (Internet Protocol) addresses.
Internationalization: Help for writing programs that can be localized for users worldwide. Programs can automatically adapt to specific locales and be displayed in the appropriate language.
Security: Both low level and high level, including electronic signatures, public and private key management, access control, and certificates.
Software components: Known as JavaBeans™, can plug into existing component architectures.
Object serialization: Allows lightweight persistence and communication via Remote Method Invocation (RMI).
Java Database Connectivity (JDBC™): Provides uniform access to a wide range of relational databases.

The Java platform also has APIs for 2D and 3D graphics, accessibility, servers, collaboration, telephony, speech, animation, and more. The following figure depicts what is included in the Java 2 SDK.

6.4 HOW WILL JAVA TECHNOLOGY CHANGE MY LIFE?

Get started quickly: Although the Java programming language is a powerful object-oriented language, it’s easy to learn, especially for programmers already familiar with C or C++.
Write less code: Comparisons of program metrics (class counts, method counts, and so on) suggest that a program written in the Java programming language can be four times smaller than the same program in C++.
Write better code: The Java programming language encourages good coding practices, and its garbage collection helps you avoid memory leaks. Its object orientation, its JavaBeans component architecture, and its wide-ranging, easily extendible API let you reuse other people’s tested code and introduce fewer bugs.
Develop programs more quickly: Development may be up to twice as fast as writing the same program in C++. Why? You write fewer lines of code, and Java is a simpler programming language than C++.
Avoid platform dependencies with 100% Pure Java: You can keep your program portable by avoiding the use of libraries written in other languages. The 100% Pure Java™ Product Certification Program has a repository of historical process manuals, white papers, brochures, and similar materials online.
Write once, run anywhere: Because 100% Pure Java programs are compiled into machine-independent bytecodes, they run consistently on any Java platform.
Distribute software more easily: You can upgrade applets easily from a central server. Applets take advantage of the feature of allowing new classes to be loaded “on the fly,” without recompiling the entire program.

6.5 ODBC:

6.6 JDBC:

JDBC was announced in March 1996. It was released for a 90-day public review that ended June 8, 1996. Because of user input, the final JDBC v1.0 specification was released soon after.

6.6.1 JDBC Goals:

The goals that were set for JDBC are important. They give you some insight into why certain classes and functionalities behave the way they do. The design goals for JDBC are as follows:

SQL Level API

SQL Conformance

JDBC must be implementable on top of common database interfaces

Provide a Java interface that is consistent with the rest of the Java system

Because of Java’s acceptance in the user community thus far, the designers feel that they should not stray from the current design of the core Java system.

Keep it simple

Use strong, static typing wherever possible

Strong typing allows more error checking to be done at compile time, so fewer errors appear at runtime.

Keep the common cases simple

Finally, we decided to proceed with the implementation using Java networking.

For dynamically updating the cache table, we use an MS Access database.

Java is two things: a programming language and a platform.

Java is a high-level programming language that is all of the following:

- Simple
- Architecture-neutral
- Object-oriented
- Portable
- Distributed
- High-performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure

Compilation happens just once; interpretation occurs each time the program is executed. The figure illustrates how this works.

[Figure: Java Program -> Compiler -> Interpreter -> My Program]

6.7 NETWORKING TCP/IP STACK:

The TCP/IP stack is shorter than the OSI one:

TCP is a connection-oriented protocol; UDP (User Datagram Protocol) is a connectionless protocol.

IP datagrams:

UDP:

UDP is also connectionless and unreliable. What it adds to IP is a checksum for the contents of the datagram and port numbers. These are used to give a client/server model – see later.
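The checksum mentioned above is the Internet checksum: a 16-bit one's-complement sum. The following is a minimal sketch of that arithmetic only, not the full UDP procedure. A real UDP checksum also covers a pseudo-header (IP addresses, protocol, and length), which is omitted here. The class and method names are hypothetical.

```java
// Sketch: 16-bit one's-complement Internet checksum over a byte array.
// Assumption: only the supplied bytes are summed; the UDP pseudo-header is omitted.
class InternetChecksum {

    /** Returns the 16-bit checksum; an odd-length input is padded with a zero byte. */
    static int checksum(byte[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i += 2) {
            int hi = data[i] & 0xFF;
            int lo = (i + 1 < data.length) ? (data[i + 1] & 0xFF) : 0;
            sum += (hi << 8) | lo; // add the next 16-bit word
        }
        // Fold any carries above bit 15 back into the low 16 bits.
        while ((sum >> 16) != 0) {
            sum = (sum & 0xFFFF) + (sum >> 16);
        }
        return (int) (~sum & 0xFFFF); // one's complement of the folded sum
    }

    public static void main(String[] args) {
        byte[] sample = {0x00, 0x01, (byte) 0xF2, 0x03,
                         (byte) 0xF4, (byte) 0xF5, (byte) 0xF6, (byte) 0xF7};
        System.out.printf("checksum = 0x%04x%n", checksum(sample));
    }
}
```

A receiver can run the same sum over the data plus the transmitted checksum; a result of zero indicates that no error was detected.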

TCP:

TCP supplies logic to give a reliable connection-oriented protocol above IP. It provides a virtual circuit that two processes can use to communicate.

Internet addresses

In order to use a service, you must be able to find it. The Internet uses an address scheme for machines so that they can be located. Each address is a 32-bit integer, known as the IP address.

Network address:

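Class A uses 8 bits for the network address, with 24 bits left over for other addressing. Class B uses 16-bit network addressing, and class C uses 24-bit network addressing. Class D addresses are reserved for multicast groups, so all 32 bits identify the group rather than a network and a host.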
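As a rough illustration of classful addressing, the class of an address can be read off the leading bits of its first octet. The sketch below assumes the standard ranges (A: 0-127, B: 128-191, C: 192-223, D: 224-239 for multicast, E: 240-255 reserved). The class and method names are hypothetical.

```java
// Sketch: classify an IPv4 address by the leading bits of its first octet.
class Ipv4Class {

    /** Returns 'A' to 'E' for a first octet in the range 0-255. */
    static char classOf(int firstOctet) {
        if (firstOctet < 0 || firstOctet > 255) {
            throw new IllegalArgumentException("octet out of range: " + firstOctet);
        }
        if (firstOctet < 128) return 'A'; // leading bit 0
        if (firstOctet < 192) return 'B'; // leading bits 10
        if (firstOctet < 224) return 'C'; // leading bits 110
        if (firstOctet < 240) return 'D'; // leading bits 1110 (multicast)
        return 'E';                       // leading bits 1111 (reserved)
    }

    /** Network-prefix length in bits for classes A, B and C; -1 otherwise. */
    static int networkBits(char addressClass) {
        switch (addressClass) {
            case 'A': return 8;
            case 'B': return 16;
            case 'C': return 24;
            default:  return -1; // classes D and E have no network/host split
        }
    }

    public static void main(String[] args) {
        int[] samples = {10, 172, 192, 224};
        for (int octet : samples) {
            char c = classOf(octet);
            System.out.println(octet + ".x.x.x -> class " + c
                    + ", network bits " + networkBits(c));
        }
    }
}
```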

Subnet address:

Internally, the UNIX network is divided into subnetworks. Building 11 is currently on one subnetwork and uses 10-bit addressing, allowing 1024 different hosts.

Host address:

8 bits are finally used for host addresses within our subnet. This places a limit of 256 machines that can be on the subnet.

Total address:

The 32 bit address is usually written as 4 integers separated by dots.
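The dotted notation is just the four bytes of the 32-bit value written in decimal, most significant byte first. A minimal sketch of the conversion in both directions follows. The class and method names are hypothetical.

```java
// Sketch: convert between a 32-bit IPv4 address and dotted-quad text.
class DottedQuad {

    /** Formats a 32-bit address held in an int as four decimal octets. */
    static String format(int address) {
        return ((address >>> 24) & 0xFF) + "." + ((address >>> 16) & 0xFF) + "."
                + ((address >>> 8) & 0xFF) + "." + (address & 0xFF);
    }

    /** Parses "a.b.c.d" into a 32-bit value; rejects malformed input. */
    static int parse(String text) {
        String[] parts = text.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("expected 4 octets: " + text);
        }
        int address = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("bad octet: " + part);
            }
            address = (address << 8) | octet; // shift earlier octets up by one byte
        }
        return address;
    }

    public static void main(String[] args) {
        System.out.println(format(0xC0A80001));
        System.out.println(Integer.toHexString(parse("10.1.2.3")));
    }
}
```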

Port addresses

Sockets:

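#include <sys/types.h>
#include <sys/socket.h>

/* family: address family (e.g., AF_INET); type: e.g., SOCK_STREAM or SOCK_DGRAM; protocol: usually 0 */
int socket(int family, int type, int protocol);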
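Since the implementation uses Java networking, it may help to see the Java counterpart of this call. In Java, the java.net classes ServerSocket and Socket wrap socket creation together with bind, listen, accept, and connect. The following is a minimal sketch of a one-line TCP echo over the loopback interface. The class name LoopbackEcho and the method roundTrip are hypothetical and are not part of the project code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Sketch: one TCP round trip between a client and a throwaway echo server.
class LoopbackEcho {

    /** Sends one line to an echo server on the loopback interface and returns the reply. */
    static String roundTrip(String line) {
        // Port 0 asks the operating system for any free port.
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            Thread echo = new Thread(() -> {
                try (Socket peer = server.accept();
                     BufferedReader in = new BufferedReader(
                             new InputStreamReader(peer.getInputStream()));
                     PrintWriter out = new PrintWriter(peer.getOutputStream(), true)) {
                    out.println(in.readLine()); // echo a single line back
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            echo.start();
            try (Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
                 PrintWriter out = new PrintWriter(client.getOutputStream(), true);
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(client.getInputStream()))) {
                out.println(line);    // auto-flush sends the line immediately
                return in.readLine(); // block until the echoed line arrives
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("hello"));
    }
}
```

The sketch uses TCP because it gives the reliable virtual circuit described above; a UDP version would use DatagramSocket and DatagramPacket instead.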

6.8 JFREE CHART:

JFreeChart is a free 100% Java chart library that makes it easy for developers to display professional quality charts in their applications. JFreeChart’s extensive feature set includes:

A consistent and well-documented API, supporting a wide range of chart types;

A flexible design that is easy to extend, and targets both server-side and client-side applications;

Support for many output types, including Swing components, image files (including PNG and JPEG), and vector graphics file formats (including PDF, EPS and SVG);

JFreeChart is “open source” or, more specifically, free software. It is distributed under the terms of the GNU Lesser General Public Licence (LGPL), which permits use in proprietary applications.

6.8.1. Map Visualizations:

6.8.2. Time Series Chart Interactivity

6.8.3. Dashboards

6.8.4. Property Editors

CHAPTER 8

8.1 CONCLUSION:

We have developed a new method for the identification and improvement of navigation-related Web usability problems by checking extracted usage patterns against cognitive user models. As demonstrated by our case study, our method can identify areas with usability issues to help improve the usability of Web systems. Once a website is operational, our method can be continuously applied and drive ongoing refinements. In contrast with traditional software products and systems, Web based applications have shortened development cycles and prolonged maintenance cycles. Our method can contribute significantly to continuous usability improvement over these prolonged maintenance cycles. The usability improvement in successive iterations can be quantified by the progressively better effectiveness (higher task completion rate) and efficiency (less time for given tasks).

Our method is not intended to and cannot replace heuristic usability evaluation by experts and user-centered usability testing. It complements these traditional usability practices and can be incorporated into an integrated strategy for Web usability assurance. With automated tool support for a significant part of the activities involved, our method is cost-effective. It would be particularly valuable in the two common situations, where an adequate number of actual users cannot be involved in testing and cognitive experts are in short supply. Server logs in our method represent real users’ operations in natural working conditions, and our IUIP models injected with human behavior cognition represent part of cognitive experts’ work. We are currently integrating these modeling and analysis tools into a tool suite that supports measurement, analysis, and overall quality improvement for Web applications.

8.2 FUTURE ENHANCEMENT:

In the future, we plan to carry out validation studies with large-scale Web applications. We also plan to explore additional approaches to discover Web usage patterns and related usability problems generalizable to other interesting domains. For example, we have already started exploring deviation calculation and analysis at the trail level instead of at the individual page level. Such analyses might be more meaningful and yield more interesting results for Web applications with complex structure and operation sequences. Our IUIP modeling architecture and supporting tools also need to be further enhanced and optimized for more complex tasks. We will also further expand our usability research to cover more usability aspects to improve Web users’ overall satisfaction.

Improving Physical-Layer Security in Wireless Communications Using Diversity Techniques

05/08/201902/07/2019 by admin

Yulong Zou, Jia Zhu, Xianbin Wang, and Victor C.M. Leung

IEEE Network, January/February 2015. 0890-8044/15/$25.00 © 2015 IEEE

Yulong Zou and Jia Zhu are with the Nanjing University of Posts and Telecommunications. Xianbin Wang is with the University of Western Ontario. Victor C.M. Leung is with the University of British Columbia.

Abstract:

Due to the broadcast nature of radio propagation, wireless transmission can be readily overheard by unauthorized users for interception purposes and is thus highly vulnerable to eavesdropping attacks. To this end, physical-layer security is emerging as a promising paradigm to protect the wireless communications against eavesdropping attacks by exploiting the physical characteristics of wireless channels. This article is focused on the investigation of diversity techniques to improve physical-layer security differently from the conventional artificial noise generation and beamforming techniques, which typically consume additional power for generating artificial noise and exhibit high implementation complexity for beamformer design. We present several diversity approaches to improve wireless physical-layer security, including multiple-input multiple-output (MIMO), multiuser diversity, and cooperative diversity. To illustrate the security improvement through diversity, we propose a case study of exploiting cooperative relays to assist the signal transmission from source to destination while defending against eavesdropping attacks. We evaluate the security performance of cooperative relay transmission in Rayleigh fading environments in terms of secrecy capacity and intercept probability. It is shown that as the number of relays increases, both the secrecy capacity and intercept probability of cooperative relay transmission improve significantly, implying there is an advantage in exploiting cooperative diversity to improve physical-layer security against eavesdropping attacks.

In wireless networks, transmission between legitimate users can easily be overheard by an eavesdropper for interception due to the broadcast nature of the wireless medium, making wireless transmission highly vulnerable to eavesdropping attacks. In order to achieve confidential transmission, existing communications systems typically adopt the cryptographic techniques to prevent an eavesdropper from tapping data transmission between legitimate users [1, 2]. By considering symmetric key encryption as an example, the original data (called plaintext) is first encrypted at the source node by using an encryption algorithm along with a secret key that is shared only with the destination node. Then the encrypted plaintext (also known as ciphertext) is transmitted to the destination, which will decrypt its received ciphertext with the preshared secret key. In this way, even if an eavesdropper overhears the ciphertext transmission, it is still difficult for the eavesdropper to interpret the plaintext from its intercepted ciphertext without the secret key. It is pointed out that ciphertext transmission is not perfectly secure, since the ciphertext can still be decrypted by an eavesdropper through an exhaustive key search, which is also known as a brute-force attack. To this end, physical-layer security is emerging as an alternative paradigm to protect wireless communications against eavesdropping attacks, including brute-force attacks.

Physical-layer security work was pioneered by Wyner in [3], where a discrete memoryless wiretap channel was examined for secure communications in the presence of an eavesdropper. It was proved in [3] that perfectly secure data transmission can be achieved if the channel capacity of the main link (from source to destination) is higher than that of the wiretap link (from source to eavesdropper). Later on, in [4], Wyner’s results were extended from the discrete memoryless wiretap channel to the Gaussian wiretap channel, where a so-called secrecy capacity was developed, and shown as the difference between the channel capacity of the main link and that of the wiretap link. If the secrecy capacity falls below zero, the transmission from source to destination becomes insecure, and the eavesdropper can succeed in intercepting the source transmission (i.e., an intercept event occurs). In order to improve transmission security against eavesdropping attacks, it is of importance to reduce the probability of occurrence of an intercept event (called intercept probability) through enlarging secrecy capacity. However, in wireless communications, secrecy capacity is severely degraded due to the fading effect.

As a consequence, there are extensive works aimed at increasing the secrecy capacity of wireless communications by exploiting multiple antennas [5] and cooperative relays [6]. Specifically, the multiple-input multiple-output (MIMO) wiretap channel was studied in [7] to enhance the wireless secrecy capacity in fading environments. In [8], cooperative relays were examined for improving the physical-layer security in terms of the secrecy rate performance. A hybrid cooperative beamforming and jamming approach was investigated in [9] to enhance the wireless secrecy capacity, where partial relay nodes are allowed to assist the source transmission to the legitimate destination with the aid of distributed beamforming, while the remaining relay nodes are used to transmit artificial noise to confuse the eavesdropper.
More recently, a joint physical-application layer security framework was proposed in [10] for improving the security of wireless multimedia delivery by simultaneously exploiting physical-layer signal processing techniques as well as upper-layer authentication and watermarking methods. In [11], error control coding for secrecy was discussed for achieving the physical-layer security. Additionally, in [12, 13], physical-layer security was further investigated in emerging cognitive radio networks.

At the time of writing, most research efforts are devoted to examining the artificial noise and beamforming techniques to combat eavesdropping attacks, but they consume additional power resources to generate artificial noise and increase the computational complexity in performing beamformer design. Therefore, this article is motivated to enhance the physical-layer security through diversity techniques without additional power costs, including MIMO, multiuser diversity, and cooperative diversity, aimed at increasing the capacity of the main channel while degrading the wiretap channel. For illustration purposes, we present a case study of exploiting cooperative relays to improve the physical-layer security against eavesdropping attacks, where the best relay is selected and used to participate in forwarding the signal transmission from source to destination. We evaluate the secrecy capacity and intercept probability of the proposed cooperative relay transmission in Rayleigh fading environments. It is shown that with an increasing number of relays, the security performance of cooperative relay transmission significantly improves in terms of secrecy capacity and intercept probability. This confirms the advantage of using cooperative relays to protect wireless communications against eavesdropping attacks.

The remainder of this article is organized as follows. The next section presents the system model of physical-layer security in wireless communications.
After that, we focus on the physical-layer security enhancement through diversity techniques, including MIMO, multiuser diversity, and cooperative diversity. For the purpose of illustrating the security improvement through diversity, we present a case study of exploiting cooperative relays to assist signal transmission from source to destination against eavesdropping attacks. Finally, we provide some concluding remarks.

Physical-Layer Security in Wireless Communications

Figure 1 shows a wireless communications scenario with one source and one destination in the presence of an eavesdropper, where the solid and dashed lines represent the main channel (from source to destination) and the wiretap channel (from source to eavesdropper), respectively. When the source node transmits its signal to the destination, an eavesdropper may overhear such transmission due to the broadcast nature of the wireless medium. Considering the fact that today’s wireless systems are highly standardized, the eavesdropper can readily obtain the transmission parameters, including the signal waveform, coding and modulation scheme, encryption algorithm, and so on. Also, the secret key may be figured out at the eavesdropper (e.g., through an exhaustive search). Thus, the source signal could be interpreted at the eavesdropper by decoding its overheard signal, leading to insecurity of the legitimate transmission.

As a result, physical-layer security emerges as an alternative means to achieve perfect transmission secrecy from source to destination. In the physical-layer security literature [3, 4], a so-called secrecy capacity is developed and shown as the difference between the capacities of the main link and the wiretap link. It has been proven that perfect secrecy is achieved if the secrecy capacity is positive, meaning that when the main channel capacity is larger than the wiretap channel capacity, the transmission from source to destination can be perfectly secure.
This can be explained by using the Shannon coding theorem from which it is impossible for a receiver to recover the source signal if the channel capacity (from source to receiver) is smaller than the data rate. Thus, given a positive secrecy capacity, the data rate can be adjusted between the capacities of the main and wiretap channels so that the destination node successfully decodes the source signal and the eavesdropper fails to decode it. However, if the secrecy capacity is negative (i.e., the main channel capacity falls below the wiretap channel capacity), the eavesdropper is more likely than the destination to succeed in decoding the source signal. In an information-theoretic sense, when the main channel capacity becomes smaller than the wiretap channel capacity, it is impossible to guarantee that the destination succeeds and the eavesdropper fails to decode the source signal. Therefore, an intercept event is seen to occur when the secrecy capacity falls below zero, and the probability of occurrence of an intercept event is called intercept probability throughout this article.

Figure 1. A wireless communications scenario consisting of one source and one destination in the presence of an eavesdropping attack. (Labels: source S, destination D, eavesdropper E; main link, wiretap link.)

At present, most existing work is focused on improving physical-layer security by generating artificial noise to confuse an eavesdropping attack, where the artificial noise is sophisticatedly produced such that only the eavesdropper experiences interference, and the desired destination can easily cancel out such noise without performance degradation. More specifically, given a main channel matrix H_m, the artificial noise (denoted by w_n) is designed in the null space of matrix H_m such that H_m w_n = 0, making the desired destination unaffected by the noise.
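To make the null-space idea concrete, consider a toy case that is an assumption of this sketch rather than the article's setup: two transmit antennas, one receive antenna, and real-valued channel gains. Real wireless channels are complex-valued. For a main channel h = [h1, h2], the direction [h2, -h1] is orthogonal to h, so noise sent along it vanishes at the destination but generally not at an eavesdropper with an independent channel. All names below are hypothetical.

```java
// Sketch: null-space artificial noise for a 1x2 real-valued channel.
// Assumption: real scalars and a single receive antenna, for illustration only.
class NullSpaceNoise {

    /** For a nonzero channel h = [h1, h2], returns a unit vector w with h . w = 0. */
    static double[] nullVector(double[] h) {
        double norm = Math.hypot(h[0], h[1]);
        if (norm == 0.0) {
            throw new IllegalArgumentException("channel must be nonzero");
        }
        return new double[] {h[1] / norm, -h[0] / norm};
    }

    /** Inner product of two 2-element vectors: the noise amplitude seen by a receiver. */
    static double dot(double[] a, double[] b) {
        return a[0] * b[0] + a[1] * b[1];
    }

    public static void main(String[] args) {
        double[] mainChannel = {0.8, -0.6};   // hypothetical H_m
        double[] wiretapChannel = {0.3, 0.9}; // hypothetical wiretap channel
        double[] noiseDirection = nullVector(mainChannel);
        System.out.println("noise at destination:  " + dot(mainChannel, noiseDirection));
        System.out.println("noise at eavesdropper: " + dot(wiretapChannel, noiseDirection));
    }
}
```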
Since the wiretap channel is independent of the main channel, the null space of the wiretap channel is in general different from that of the main channel; thus, the eavesdropper cannot null out the artificial noise, which results in the performance degradation at the eavesdropper. Notice that the above-mentioned null space based noise generation approach needs the knowledge of main channel H_m only, which can be further optimized if the wiretap channel information is also known. It needs to be pointed out that additional power resources are required for generating artificial noise to confuse the eavesdropper. For a fair comparison, the total transmit power of artificial noise and desired signal should be constrained. Also, the power allocation between the artificial noise and desired signal is important, and should be adapted to the main and wiretap channels to optimize the physical-layer security performance, for example, in terms of secrecy capacity. Different from the artificial noise generation approach, this article is mainly focused on the investigation of diversity techniques for enhancing physical-layer security.

Diversity for Physical-Layer Security

In this section, we present several diversity techniques to improve physical-layer security against eavesdropping attacks. Traditionally, diversity techniques are exploited to increase transmission reliability, which also have great potential to enhance wireless security. In the following, we discuss the physical-layer security improvement through the use of MIMO, multiuser diversity, and cooperative diversity, respectively. Notice that the MIMO and multiuser diversity mechanisms are generally applicable to various cellular and WiFi networks, since the cellular and WiFi networks typically consist of multiple users, and, moreover, today’s cellular and WiFi devices are equipped with multiple antennas.
In contrast, the cooperative diversity mechanism is only applicable to some advanced cellular and WiFi networks that have adopted the relay architecture, such as the Long Term Evolution (LTE)-Advanced system and IEEE 802.16j/m, where relay stations are introduced to assist wireless data transmission.

MIMO Diversity

This subsection presents MIMO diversity for physical-layer security of wireless transmission against eavesdropping attacks. As shown in Fig. 2, all the network nodes are equipped with multiple antennas, where M, N_d, and N_e represent the number of antennas at the source, destination, and eavesdropper, respectively. As is known, MIMO has been shown as an effective means to combat wireless fading and increase the capacity of the wireless channel. However, the eavesdropper can also exploit the MIMO structure to enlarge the capacity of a wiretap channel from the source to the eavesdropper. Thus, without proper design, increasing the secrecy capacity of wireless transmission with MIMO may fail. For example, if conventional open-loop space-time block coding is considered, the destination should first estimate the main channel matrix H_m and then perform the space-time decoding process with the resulting estimate of H_m, so that diversity gain is achieved for the main channel. Similarly, the eavesdropper can also estimate the wiretap channel matrix H_w and then conduct the corresponding space-time decoding algorithm to obtain diversity gain for the wiretap channel. Hence, the conventional space-time block coding is not effective to improve physical-layer security against eavesdropping attacks.

Generally speaking, if the source node transmits its signal to the desired destination with M antennas, the eavesdropper also receives M signal copies for interception purposes.
In order to defend against eavesdropping attacks, the source node should adopt a preprocess that needs to be adapted to the main and wiretap channels H_m and H_w such that diversity gain can be achieved only at the destination, whereas the eavesdropper benefits nothing from the multiple transmit antennas at the source. This means that an adaptive transmit process should be included at the source node to increase the main channel capacity while decreasing the wiretap channel capacity. Ideally, the objective of such an adaptive transmit process is to maximize the secrecy capacity of MIMO transmission, which, however, requires the channel state information (CSI) of both the main and wiretap links (i.e., H_m and H_w). In practice, the wiretap channel information H_w may be unavailable, since the eavesdropper is usually passive and stays silent. If only the main channel information H_m is known, the adaptive transmit process can be designed to maximize the main channel capacity, which does not require knowledge of wiretap channel H_w. Since the adaptive transmit process is optimized based on the main channel information H_m, and the wiretap channel is typically independent of the main channel, the main channel capacity is significantly increased with MIMO, and no improvement is achieved for the wiretap channel capacity.

Figure 2. A MIMO wireless system consisting of one source and one destination in the presence of an eavesdropping attack. (Labels: source S with M antennas, destination D with N_d antennas, eavesdropper E with N_e antennas; desired link, wiretap link.)

As for the aforementioned adaptive transmit process, we here present three main concrete approaches: transmit beamforming, power allocation, and transmit antenna selection. Transmit beamforming is a signal processing technique combining multiple transmit antennas at the source node in such a way that desired signals transmit in a particular direction to the destination.
Considering that the eavesdropper and destination generally lie in different directions relative to the source node, the desired signals (with transmit beamforming) received at the eavesdropper experience destructive interference and become very weak. Thus, transmit beamforming is effective in defending against eavesdropping attacks when the destination and eavesdropper are spatially separated. The power allocation maximizes the main channel capacity (or secrecy capacity if both H_m and H_w are known) by allocating the transmit power among M antennas at the source. In this way, the secrecy capacity of MIMO transmission is significantly increased, showing the security benefits of using power allocation against eavesdropping attacks. In addition, the transmit antenna selection is also able to improve the physical-layer security of MIMO wireless systems. Depending on whether the global CSI of the main and wiretap channels (i.e., H_m and H_w) is available, an optimal transmit antenna at the source node is selected and used to transmit source signals. More specifically, if both H_m and H_w are available, the transmit antenna with the highest secrecy capacity is chosen. Studying the case of the global available CSI provides a theoretical upper bound on the security performance of wireless systems. Notice that the CSI of wiretap channels may be estimated and obtained by monitoring the eavesdroppers’ transmissions as discussed in [8] and [14]. If only H_m is known, the transmit antenna selection is to maximize the main channel capacity. One can observe that the above-mentioned three approaches (i.e., transmit beamforming, power allocation, and transmit antenna selection) all have great potential to improve the physical-layer security of MIMO wireless systems against eavesdropping attacks.

Multiuser Diversity

This subsection discusses the multiuser diversity for improving physical-layer security. Figure 3 shows that a base station (BS) serves multiple users where M users are denoted by U = {U_i | i = 1, 2, ..., M}.
In cellular networks, M users typically communicate with a BS through an orthogonal multiple access mechanism such as orthogonal frequency-division multiple access (OFDMA) and time-division multiple access (TDMA). Taking OFDMA as an example, orthogonal frequency-division multiplexing (OFDM) subcarriers are allocated to different users. In other words, given an OFDM subcarrier, we need to determine which user should be assigned to access and use the subcarrier for data transmission. Traditionally, the user with the highest throughput is selected to access the given OFDM subcarrier, aiming to maximize the transmission capacity. This relies on knowledge of main channel information H_m only and can provide significant multiuser diversity gain for performance improvement. However, if a user is far away from a BS and experiences severe propagation loss and deep fading, it may have no chance of being selected as the “best” user for channel access. To this end, user fairness should be further considered in multiuser scheduling, where two competing interests need to be balanced: maximizing the main channel capacity while at the same time guaranteeing each user with certain opportunities to access the channel.

With multiuser scheduling, a user is first selected to access a channel (i.e., an OFDM subcarrier in OFDMA or a time slot in TDMA) and then starts transmitting its signal to a BS. Meanwhile, due to the broadcast nature of wireless transmission, an eavesdropper overhears such transmission and attempts to interpret the source signal. In order to effectively defend against the eavesdropping attack, multiuser scheduling should be performed to minimize the wiretap channel capacity while maximizing the main channel capacity, which requires the CSI of both the main and wiretap links. If only the main channel information H_m is available, we may consider the use of conventional multiuser scheduling where the wiretap channel information H_w is not taken into account.
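The two scheduling rules can be contrasted in a small sketch. It assumes that each user's main and wiretap links are summarized by a scalar channel gain and that each capacity has the Shannon form log2(1 + SNR × gain). Both are simplifying assumptions of this example, and fairness is ignored. The class and method names are hypothetical.

```java
// Sketch: conventional vs. CSI-aware multiuser scheduling for one channel slot.
// Assumptions: scalar gains per user, Shannon capacities, no fairness constraint.
class SecrecyScheduler {

    static double log2(double v) {
        return Math.log(v) / Math.log(2.0);
    }

    /** Main-link capacity minus wiretap-link capacity, in bit/s/Hz. */
    static double secrecyCapacity(double mainGain, double wiretapGain, double snr) {
        return log2(1.0 + snr * mainGain) - log2(1.0 + snr * wiretapGain);
    }

    /** Conventional scheduling: index of the user with the largest main-channel gain. */
    static int pickByMainChannel(double[] mainGains) {
        int best = 0;
        for (int i = 1; i < mainGains.length; i++) {
            if (mainGains[i] > mainGains[best]) best = i;
        }
        return best;
    }

    /** CSI-aware scheduling: index of the user with the largest secrecy capacity. */
    static int pickBySecrecy(double[] mainGains, double[] wiretapGains, double snr) {
        int best = 0;
        double bestValue = secrecyCapacity(mainGains[0], wiretapGains[0], snr);
        for (int i = 1; i < mainGains.length; i++) {
            double value = secrecyCapacity(mainGains[i], wiretapGains[i], snr);
            if (value > bestValue) {
                best = i;
                bestValue = value;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] mainGains = {2.0, 1.5, 0.5};    // hypothetical main-link gains
        double[] wiretapGains = {1.8, 0.2, 0.1}; // hypothetical wiretap-link gains
        System.out.println("main-channel rule picks user " + pickByMainChannel(mainGains));
        System.out.println("secrecy rule picks user "
                + pickBySecrecy(mainGains, wiretapGains, 10.0));
    }
}
```

In this made-up example the two rules disagree: the user with the strongest main link is also the one the eavesdropper hears best.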
It needs to be pointed out that conventional multiuser scheduling still has great potential to enhance physical-layer security, since the main channel capacity is significantly improved with conventional multiuser scheduling while the wiretap channel capacity remains the same.

Cooperative Diversity

In this subsection, we focus mainly on cooperative diversity for wireless security against eavesdropping attacks. Figure 4 shows a cooperative wireless network including one source, M relays, and one destination in the presence of an eavesdropper, where M relays are exploited to assist the signal transmission from source to destination. To be specific, the source node first transmits its signal to M relays, which then forward their received source signals to the destination. At present, there are two basic relay protocols: amplify-and-forward (AF) and decode-and-forward (DF). In the AF protocol, a relay node simply amplifies and retransmits its received noisy version of the source signal to the destination. In contrast, the DF protocol requires the relay node to decode its received signal and forward its decoded outcome to the destination node. It is concluded that multiple-relay-assisted source signal transmission consists of two steps:
1. The source node broadcasts its signal.
2. Relay nodes retransmit their received signals.
Each of the two transmission steps is vulnerable to eavesdropping attack and needs to be carefully designed to prevent an eavesdropper from intercepting the source signal.

Typically, the main channel capacity with multiple relays can be significantly increased by using cooperative beamforming. More specifically, multiple relays can form a virtual antenna array and cooperate with each other to perform transmit beamforming such that the signals received at the intended destination experience constructive interference, while the others (received at the eavesdropper) experience destructive interference.
One can observe that with cooperative beamforming, the received signal strength of the destination is much higher than that of the eavesdropper, implying physical-layer security improvement. In addition to the aforementioned cooperative beamforming, the best relay selection is another approach to improve wireless transmission security against eavesdropping attacks. In the best relay selection, a relay node with the highest secrecy capacity (or highest main channel capacity if only the main channel information is available) is chosen to participate in assisting the signal transmission from source to destination. In this way, cooperative diversity gain is achieved for physical-layer security enhancement.

Figure 3. A multiuser wireless communications system consisting of one base station (BS) and multiple users in the presence of an eavesdropper. (Labels: BS, users U_1, U_2, ..., U_M, eavesdropper E; desired link, wiretap link.)

Case Study: Security Evaluation of Cooperative Relay Transmission

In this section, we present a case study to show the physical-layer security improvement by exploiting cooperative relays, where only a single best relay is selected to assist the signal transmission from source to destination. This differs from existing research efforts in [8], where multiple cooperative relays participate in forwarding the source signal to the destination. For comparison purposes, we first consider conventional direct transmission as a benchmark scheme, where the source node directly transmits its signal to the destination without relay. Meanwhile, an eavesdropper is present and attempts to intercept the signal transmission from source to destination.
As discussed in [3, 4], the secrecy capacity of conventional direct transmission is shown as the difference between the capacities of the main channel (from source to destination) and the wiretap channel (from source to eavesdropper), which is written as

$$C_s = \log_2\left(1 + \frac{P|h_{sd}|^2}{N_0}\right) - \log_2\left(1 + \frac{P|h_{se}|^2}{N_0}\right), \tag{1}$$

where $P$ is the transmit power at the source, $N_0$ is the variance of additive white Gaussian noise (AWGN), $\gamma_s = P/N_0$ is regarded as the signal-to-noise ratio (SNR), and $h_{sd}$ and $h_{se}$ represent fading coefficients of the channel from source to destination and from source to eavesdropper, respectively. Presently, there are three commonly used fading models (i.e., Rayleigh, Rician, and Nakagami), and we consider the use of the Rayleigh fading model to characterize the main and wiretap channels. Thus, $|h_{sd}|^2$ and $|h_{se}|^2$ are independent and exponentially distributed random variables with means $\sigma_{sd}^2$ and $\sigma_{se}^2$, respectively. Also, an ergodic secrecy capacity of the direct transmission can be obtained by averaging the instantaneous secrecy capacity $C_s^+$ over the fading coefficients $h_{sd}$ and $h_{se}$, where $C_s^+ = \max(C_s, 0)$. In addition, if the secrecy capacity $C_s$ falls below zero, the source transmission becomes insecure, and the eavesdropper will succeed in intercepting the source signal. Thus, using Eq. 1 and denoting $x = |h_{sd}|^2$ and $y = |h_{se}|^2$, an intercept probability of the direct transmission can be given by

$$P_{\text{intercept}} = \Pr(C_s < 0) = \Pr\left(|h_{sd}|^2 < |h_{se}|^2\right) = \iint_{x<y} \frac{1}{\sigma_{sd}^2 \sigma_{se}^2} \exp\left(-\frac{x}{\sigma_{sd}^2} - \frac{y}{\sigma_{se}^2}\right) dx\,dy = \frac{\sigma_{se}^2}{\sigma_{sd}^2 + \sigma_{se}^2}, \tag{2}$$

where the third equation arises from the fact that random variables $|h_{sd}|^2$ and $|h_{se}|^2$ are independent exponentially distributed, and $\sigma_{sd}^2$ and $\sigma_{se}^2$ are the expected values of $|h_{sd}|^2$ and $|h_{se}|^2$, respectively. As can be observed from Eq. 2, the intercept probability of conventional direct transmission is independent of the transmit power $P$, meaning that increasing the transmit power cannot improve physical-layer security in terms of intercept probability. This motivates us to explore the use of cooperative relays to decrease the intercept probability. For notational convenience, let $\lambda_{me}$ represent the ratio of average main channel gain $\sigma_{sd}^2$ to an eavesdropper’s average channel gain $\sigma_{se}^2$, that is, $\lambda_{me} = \sigma_{sd}^2/\sigma_{se}^2$, which is referred to as the main-to-eavesdropper ratio (MER) throughout this article.

In the following, we present the cooperative relay transmission scheme where multiple relays are used to assist the signal transmission from source to destination. Here, the AF relaying protocol is considered, and only the best relay will be selected to participate in forwarding the source signal to the destination. To be specific, the source node first broadcasts its signal to M relays. Then the best relay node is chosen to forward a scaled version of its received signal to the destination [15]. Note that during the above-mentioned cooperative relay transmission process, the total amount of transmit power at source and relay should be constrained to $P$ to make a fair comparison with the conventional direct transmission scheme. We here consider the equal power allocation; thus, the transmit power at the source and relay is given by $P/2$.

Now, given M relays, it is crucial to determine which relay should be selected as the best one to assist the source signal transmission. Ideally, the best relay selection should aim to maximize the secrecy capacity, which, however, requires the CSI of both the main and wiretap channels. Since the eavesdropper is passive, and the wiretap channel information is difficult to obtain in practice, we consider the main channel capacity as the objective of best relay selection, which relies on knowledge of the main channel only. Accordingly, the best relay selection criterion with AF protocol is expressed as

$$\text{Best Relay} = \arg\max_{i \in \mathcal{R}} \frac{|h_{si}|^2 |h_{id}|^2}{|h_{si}|^2 + |h_{id}|^2}, \tag{3}$$

where $\mathcal{R}$ denotes a set of M relays, and $|h_{si}|^2$ and $|h_{id}|^2$ represent fading coefficients of the channel from source to relay $R_i$ and that from relay $R_i$ to destination, respectively. One can see from Eq. 3 that the proposed best relay selection criterion only requires the main channel information, $|h_{si}|^2$ and $|h_{id}|^2$, with which the main channel capacity is maximized. Since the main and wiretap channels are independent of each other, the wiretap channel capacity will benefit nothing from the proposed best relay selection. Similar to Eq. 1, the secrecy capacity of best relay selection can be obtained by subtracting the corresponding wiretap channel capacity from the main channel capacity. Also, the intercept probability of best relay selection is easily determined by computing the probability that the secrecy capacity becomes less than zero.

Figure 4. A cooperative diversity system consisting of one source, M relays, and one destination in the presence of an eavesdropper.

In Fig. 5, we provide the ergodic secrecy capacity comparison between the conventional direct transmission and proposed best relay selection schemes for different numbers of relays M with $\gamma_s$ = 12 dB, $\sigma_{sd}^2$ = 0.5, and $\sigma_{sr}^2 = \sigma_{rd}^2$ = 2. It is shown in Fig. 5 that for the cases of M = 2, M = 4, and M = 8, the ergodic secrecy capacity of the best relay selection scheme is always higher than that of direct transmission, showing the wireless security benefits of using cooperative relays. Also, as the number of relays M increases from M = 2 to M = 8, the ergodic secrecy capacity of best relay selection significantly increases.
This means that increasing the number of cooperative relays can improve the physical-layer security of wireless transmission against eavesdropping attacks.

Figure 6 shows the intercept probability vs. MER of the conventional direct transmission and proposed best relay selection schemes for different numbers of relays M with $\gamma_s$ = 12 dB, $\sigma_{sd}^2$ = 0.5, and $\sigma_{sr}^2 = \sigma_{rd}^2$ = 2. Note that the intercept probability is obtained by calculating the rate of occurrence of an intercept event when the capacity of the main channel falls below that of the wiretap channel. Observe from Fig. 6 that the best relay selection scheme outperforms conventional direct transmission in terms of intercept probability. Moreover, as the number of cooperative relays M increases from M = 2 to M = 8, the intercept probability improvement of best relay selection over direct transmission becomes much more significant. It is also shown from Fig. 6 that the slope of the intercept probability curve of the best relay selection scheme in high MER regions becomes steeper with an increasing number of relays. In other words, as the number of relays increases, the intercept probability of best relay selection decreases at a much higher speed with an increasing MER. This further confirms that the diversity gain is achieved by the proposed relay selection scheme for physical-layer security improvement.

Conclusion

This article studies physical-layer security of wireless communications and presents several diversity techniques for improving wireless security against eavesdropping attacks. We discuss the use of MIMO, multiuser diversity, and cooperative diversity for the sake of increasing the secrecy capacity of wireless transmission. To illustrate the security benefits through diversity, we propose a case study of physical-layer security in cooperative wireless networks with multiple relays, where the best relay is selected to participate in forwarding the signal transmission from source to destination.
The secrecy capacity and intercept probability of the conventional direct transmission and proposed best relay selection schemes are evaluated in Rayleigh fading environments. It is shown that the best relay selection scheme outperforms direct transmission in terms of both secrecy capacity and intercept probability. Moreover, as the number of cooperative relays increases, the security improvement of the best relay selection scheme over direct transmission becomes much more significant.

Although extensive research efforts have been devoted to wireless physical-layer security, many challenging but interesting issues remain open for future work. Specifically, most of the existing works in this subject are focused on enhancing the wireless secrecy capacity against the eavesdropping attack only, but have neglected the joint consideration of different types of wireless physical-layer attacks, including both eavesdropping and denial of service (DoS) attacks. It is of great importance to explore new techniques of jointly defending against multiple different wireless attacks. Furthermore, security, reliability, and throughput are the main driving factors for the research and development of next-generation wireless networks, which are typically coupled and affect each other. For example, the security of the wireless physical layer may be improved by generating artificial noise to confuse an eavesdropping attack, which, however, comes at the expense of degrading wireless reliability and throughput performance, since artificial noise generation consumes some power resources, and less transmit power becomes available for the desired information transmission. Thus, it is of interest to investigate the joint optimization of security, reliability, and throughput for the wireless physical layer, which is a challenging issue to be solved in the future.

Figure 5. Ergodic secrecy capacity vs. MER of the direct transmission and best relay selection schemes with $\gamma_s$ = 12 dB, $\sigma_{sd}^2$ = 0.5, and $\sigma_{sr}^2 = \sigma_{rd}^2$ = 2. (Axes: main-to-eavesdropper ratio in dB vs. ergodic secrecy capacity in b/s/Hz; curves: direct transmission and relay selection with M = 2, 4, and 8.)

Figure 6. Intercept probability vs. MER of the direct transmission and best relay selection schemes with $\gamma_s$ = 12 dB, $\sigma_{sd}^2$ = 0.5, and $\sigma_{sr}^2 = \sigma_{rd}^2$ = 2. (Axes: main-to-eavesdropper ratio in dB vs. intercept probability; curves: direct transmission and relay selection with M = 2, 4, and 8.)

Acknowledgment

This work was supported by the “1000 Young Talents Program” of China, the National Natural Science Foundation of China (Grant No. 61302104), and the Scientific Research Foundation of Nanjing University of Posts and Telecommunications (Grant No. NY213014).
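
As a numerical cross-check of the case study above, the following sketch estimates the intercept probability of direct transmission by simulation, compares it with the closed form of Eq. 2, and applies the relay selection rule of Eq. 3 to a list of channel gains. It assumes Rayleigh fading, so the channel gains are drawn from exponential distributions. The function names, trial count, and seed are illustrative choices and do not come from the article.

```python
import random


def intercept_closed_form(var_sd, var_se):
    """Closed form of Eq. 2: var_se / (var_sd + var_se)."""
    return var_se / (var_sd + var_se)


def intercept_monte_carlo(var_sd, var_se, trials=200000, seed=1):
    """Estimate Pr(|h_sd|^2 < |h_se|^2) by sampling exponential channel gains."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        x = rng.expovariate(1.0 / var_sd)  # |h_sd|^2 with mean var_sd
        y = rng.expovariate(1.0 / var_se)  # |h_se|^2 with mean var_se
        if x < y:  # main channel weaker than wiretap channel
            hits += 1
    return hits / trials


def best_relay(gains):
    """Eq. 3: index of the relay maximizing g_si * g_id / (g_si + g_id).

    `gains` is a list of (|h_si|^2, |h_id|^2) pairs with positive entries.
    """
    def metric(i):
        g_si, g_id = gains[i]
        return g_si * g_id / (g_si + g_id)

    return max(range(len(gains)), key=metric)
```

With an MER of 4 (for example `var_sd = 2.0`, `var_se = 0.5`), the closed form gives 0.2, and the simulated estimate should land close to that value regardless of any transmit power, which never enters the calculation.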

Identity-Based Encryption with Outsourced Revocation in Cloud Computing

05/08/201902/07/2019 by admin

Identity-Based Encryption (IBE), which simplifies public key and certificate management in Public Key Infrastructure (PKI), is an important alternative to public key encryption. However, one of the main efficiency drawbacks of IBE is the computational overhead at the Private Key Generator (PKG) during user revocation. Efficient revocation has been well studied in the traditional PKI setting, but the cumbersome management of certificates is precisely the burden that IBE strives to alleviate. In this paper, aiming at tackling the critical issue of identity revocation, we introduce outsourcing computation into IBE for the first time and propose a revocable IBE scheme in the server-aided setting.

Our scheme offloads most of the key-generation-related operations during the key-issuing and key-update processes to a Key Update Cloud Service Provider (KU-CSP), leaving only a constant number of simple operations for the PKG and users to perform locally. This goal is achieved by utilizing a novel collusion-resistant technique: we employ a hybrid private key for each user, in which an AND gate is involved to connect and bind the identity component and the time component. Furthermore, we propose another construction which is provably secure under the recently formalized Refereed Delegation of Computation model. Finally, we provide extensive experimental results to demonstrate the efficiency of our proposed construction.

INTRODUCTION:

Identity-Based Encryption (IBE) is an interesting alternative to public key encryption, which is proposed to simplify key management in a certificate-based Public Key Infrastructure (PKI) by using human-intelligible identities (e.g., unique name, email address, IP address, etc.) as public keys. Therefore, a sender using IBE does not need to look up the public key and certificate, but directly encrypts the message with the receiver’s identity.

Accordingly, a receiver who obtains the private key associated with the corresponding identity from the Private Key Generator (PKG) is able to decrypt such a ciphertext. Though IBE allows an arbitrary string as the public key, which is considered an appealing advantage over PKI, it demands an efficient revocation mechanism. Specifically, if the private keys of some users get compromised, we must provide a means to revoke such users from the system. In the PKI setting, the revocation mechanism is realized by appending validity periods to certificates or by using involved combinations of techniques.

Nevertheless, the cumbersome management of certificates is precisely the burden that IBE strives to alleviate. As far as we know, though revocation has been thoroughly studied in PKI, few revocation mechanisms are known in the IBE setting. Boneh and Franklin suggested that users renew their private keys periodically and that senders use the receivers’ identities concatenated with the current time period. But this mechanism would result in an overhead load at the PKG. In other words, all users, regardless of whether their keys have been revoked or not, have to contact the PKG periodically to prove their identities and update their private keys. It requires that the PKG be online and that a secure channel be maintained for all transactions, which will become a bottleneck for the IBE system as the number of users grows.
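
The periodic-renewal idea can be illustrated with a small sketch. The `||` separator, the monthly period format, and both function names are assumptions made for illustration; the text does not specify an encoding. The sketch involves no cryptography. It only shows how the string a sender encrypts to changes with the period, and why the number of keys the PKG must issue per period is linear in the number of unrevoked users.

```python
from datetime import date


def period_identity(identity, when):
    """Return the identity string a sender would encrypt to for a given date.

    Assumptions: periods are calendar months and '||' is the separator.
    """
    return "{}||{:%Y-%m}".format(identity, when)


def keys_issued_per_period(users, revoked):
    """Count the private keys the PKG must issue in one period.

    Every user who is not revoked needs a fresh key, so the load
    grows linearly with the number of unrevoked users.
    """
    return sum(1 for user in users if user not in revoked)
```

For example, `period_identity("alice@example.com", date(2019, 5, 8))` yields `alice@example.com||2019-05`, so a key issued for April no longer matches ciphertexts created in May.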

Another work presented a revocable IBE scheme. That scheme is built on the idea of the fuzzy IBE primitive but utilizes a binary tree data structure to record users’ identities at leaf nodes. Therefore, the key-update cost at the PKG can be significantly reduced from linear to the height of such a binary tree (i.e., logarithmic in the number of users). Nevertheless, we point out that though the introduction of the binary tree achieves relatively high performance, it results in other problems:

1) The PKG has to generate a key pair for all the nodes on the path from the identity leaf node to the root node, which results in complexity logarithmic in the number of users in the system for issuing a single private key.

2) The size of the private key grows logarithmically in the number of users in the system, which makes private key storage difficult for users.

3) As the number of users in the system grows, the PKG has to maintain a binary tree with a large number of nodes, which introduces another bottleneck for the global system.

In tandem with the development of cloud computing, there has emerged the ability for users to buy on-demand computing from cloud-based services such as Amazon’s EC2 and Microsoft’s Windows Azure. This calls for a new working paradigm that introduces such cloud services into IBE revocation to fix the efficiency and storage overhead issues described above. A naive approach would be to simply hand over the PKG’s master key to the Cloud Service Providers (CSPs).

The CSPs could then simply update all the private keys by using the traditional key update technique [4] and transmit the private keys back to unrevoked users. However, the naive approach is based on the unrealistic assumption that the CSPs are fully trusted and are allowed to access the master key of the IBE system. On the contrary, in practice the public clouds are likely outside the trusted domain of users and are curious about users’ individual privacy. For this reason, the challenge is how to design a secure revocable IBE scheme that reduces the computational overhead at the PKG with an untrusted CSP.
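
Returning to the binary-tree approach discussed earlier, the logarithmic cost in problems 1) and 2) can be made concrete by listing the nodes on the path from a user's leaf to the root. The sketch below assumes a complete binary tree whose number of leaves is a power of two, numbered heap-style with the root as node 1; this numbering and the function name are illustrative assumptions, not details given in the text.

```python
def path_to_root(leaf_index, num_leaves):
    """Return the node numbers from a leaf up to the root (node 1).

    Assumption: a complete binary tree with a power-of-two leaf count,
    where the children of node k are numbered 2k and 2k + 1.
    """
    if num_leaves < 1 or num_leaves & (num_leaves - 1):
        raise ValueError("num_leaves must be a power of two")
    if not 0 <= leaf_index < num_leaves:
        raise ValueError("leaf_index out of range")
    node = num_leaves + leaf_index  # leaves occupy nodes num_leaves .. 2*num_leaves - 1
    path = []
    while node >= 1:
        path.append(node)
        node //= 2  # move to the parent
    return path
```

With 1024 users, each path contains 11 nodes (the leaf plus 10 ancestors), so both the per-user key-issuing work and the private key size grow with the logarithm of the number of users.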

In this paper, we introduce outsourcing computation into IBE revocation and, to the best of our knowledge, formalize the security definition of outsourced revocable IBE for the first time. We propose a scheme to offload all the key-generation-related operations during key-issuing and key-update, leaving only a constant number of simple operations for the PKG and eligible users to perform locally. In our scheme, as with Boneh and Franklin’s suggestion, we realize revocation through updating the private keys of the unrevoked users. But unlike that work, which trivially concatenates the time period with the identity for key generation/update and requires re-issuing the whole private key for unrevoked users, we propose a novel collusion-resistant key issuing technique: we employ a hybrid private key for each user, in which an AND gate is involved to connect and bind two sub-components, namely the identity component and the time component.

At first, a user obtains the identity component and a default time component (i.e., for the current time period) from the PKG as his/her private key in key-issuing. Afterwards, in order to maintain decryptability, unrevoked users need to periodically request a key update for the time component from a newly introduced entity named the Key Update Cloud Service Provider (KU-CSP).

Our scheme does not have to re-issue the whole private key, but only needs to update a lightweight component of it at a specialized entity, the KU-CSP. We also specify that 1) with the aid of the KU-CSP, a user need not contact the PKG during key-update; in other words, the PKG is allowed to be offline after sending the revocation list to the KU-CSP; and 2) no secure channel or user authentication is required during key-update between the user and the KU-CSP. Furthermore, we consider realizing revocable IBE with a semi-honest KU-CSP. To achieve this goal, we present a security-enhanced construction under the recently formalized Refereed Delegation of Computation (RDoC) model. Finally, we provide extensive experimental results to demonstrate the efficiency of our proposed construction.

EXISTING SYSTEM:

Identity-Based Encryption (IBE) is an interesting alternative to public key encryption, which is proposed to simplify key management in a certificate-based Public Key Infrastructure (PKI) by using human-intelligible identities (e.g., unique name, email address, IP address, etc.) as public keys.

Boneh and Franklin suggested that users renew their private keys periodically and that senders use the receivers’ identities concatenated with the current time period.

Hanaoka et al. proposed a way for users to periodically renew their private keys without interacting with the PKG.

Lin et al. proposed a space-efficient revocable IBE mechanism from non-monotonic Attribute-Based Encryption (ABE), but their construction requires a number of bilinear pairing operations for a single decryption that depends on the number of revoked users.

DISADVANTAGES:

Boneh and Franklin’s mechanism would result in an overhead load at the PKG. In other words, all users, regardless of whether their keys have been revoked or not, have to contact the PKG periodically to prove their identities and update their private keys. It requires that the PKG be online and that a secure channel be maintained for all transactions, which will become a bottleneck for the IBE system as the number of users grows.

Boneh and Franklin’s suggestion is viable in principle but impractical.
Hanaoka et al.’s system, however, assumes that each user possesses a tamper-resistant hardware device.
In mediator-based revocation, if an identity is revoked, the mediator is instructed to stop helping the user. This is impractical, since users cannot decrypt on their own and need to communicate with the mediator for each decryption.

PROPOSED SYSTEM:

In this paper, we introduce outsourcing computation into IBE revocation and, to the best of our knowledge, formalize the security definition of outsourced revocable IBE for the first time. We propose a scheme to offload all the key-generation-related operations during key-issuing and key-update, leaving only a constant number of simple operations for the PKG and eligible users to perform locally.

In our scheme, as with Boneh and Franklin’s suggestion, we realize revocation through updating the private keys of the unrevoked users. But unlike that work, which trivially concatenates the time period with the identity for key generation/update and requires re-issuing the whole private key for unrevoked users, we propose a novel collusion-resistant key issuing technique: we employ a hybrid private key for each user, in which an AND gate is involved to connect and bind two sub-components, namely the identity component and the time component.

At first, a user obtains the identity component and a default time component (i.e., for the current time period) from the PKG as his/her private key in key-issuing. Afterwards, in order to maintain decryptability, unrevoked users need to periodically request a key update for the time component from a newly introduced entity named the Key Update Cloud Service Provider (KU-CSP).
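
The key-issuing and key-update flow described above can be sketched as a bookkeeping model. This is only an illustration of the message flow under stated assumptions: it contains no cryptography, and the names `HybridKey`, `KeyUpdateService`, `issue_key`, and `can_decrypt` are hypothetical and are not taken from the scheme. Periods are modeled as integers, and the AND gate is modeled as a plain logical conjunction of the identity check and the time check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HybridKey:
    identity: str  # identity component, issued once by the PKG
    period: int    # time component, refreshed by the KU-CSP


class KeyUpdateService:
    """Hypothetical stand-in for the KU-CSP; it only needs the revocation list."""

    def __init__(self):
        self.revoked = set()
        self.current_period = 0

    def advance_period(self, revoked):
        """Start a new period with the revocation list received from the PKG."""
        self.revoked = set(revoked)
        self.current_period += 1

    def update(self, key):
        """Return a key with a fresh time component, or None for revoked users."""
        if key.identity in self.revoked:
            return None
        return HybridKey(key.identity, self.current_period)


def issue_key(identity, service):
    """PKG stand-in: identity component plus a default time component."""
    return HybridKey(identity, service.current_period)


def can_decrypt(key, identity, period):
    """AND gate: both the identity and the time component must match."""
    return key is not None and key.identity == identity and key.period == period
```

In this model a revoked user keeps the old key, but that key stops matching ciphertexts of later periods, while unrevoked users refresh only the small time component and never contact the PKG again.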

ADVANTAGES:

Compared with the previous work, our scheme does not have to re-issue the whole private key, but only needs to update a lightweight component of it at a specialized entity, the KU-CSP.
With the aid of the KU-CSP, a user need not contact the PKG during key-update; in other words, the PKG is allowed to be offline after sending the revocation list to the KU-CSP.
No secure channel or user authentication is required during key-update between the user and the KU-CSP.
Furthermore, we consider realizing revocable IBE with a semi-honest KU-CSP. To achieve this goal, we present a security-enhanced construction under the recently formalized Refereed Delegation of Computation (RDoC) model.
Finally, we provide extensive experimental results to demonstrate the efficiency of our proposed construction.

HARDWARE REQUIREMENT:

- Processor – Pentium IV
- Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Key Board – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA

SOFTWARE REQUIREMENTS:

Operating System : Windows XP or Win7
Front End : JAVA JDK 1.7
Back End : MySQL Server
Server : Apache Tomcat Server
Script : JSP Script
Document : MS-Office 2007

ARCHITECTURE DIAGRAM:

IMPLEMENTATION:

IBE SCHEME (IDENTITY-BASED ENCRYPTION)

ALGORITHM USED:

KEYCOMBINE ALGORITHM:

MODULES:

USER MODULES:

ADMIN:
OWNER:
USERS:

PKG (PRIVATE KEY GENERATOR):

KU-CSPS MODELS:

USERS REVOCATION:

PERFORMANCE EVALUATION:

CONCLUSION:

In this paper, focusing on the critical issue of identity revocation, we introduce outsourcing computation into IBE and propose a revocable scheme in which the revocation operations are delegated to a CSP. With the aid of the KU-CSP, the proposed scheme is full-featured: 1) it achieves constant efficiency for both computation at the PKG and private key size at the user; 2) a user need not contact the PKG during key-update; in other words, the PKG is allowed to be offline after sending the revocation list to the KU-CSP; 3) no secure channel or user authentication is required during key-update between the user and the KU-CSP. Furthermore, we consider realizing revocable IBE under a stronger adversary model. We present an advanced construction and show that it is secure under the RDoC model, in which at least one of the KU-CSPs is assumed to be honest. Therefore, even if a revoked user colludes with either of the KU-CSPs, the collusion cannot help such a user regain his/her decryptability. Finally, we provide extensive experimental results to demonstrate the efficiency of our proposed construction.

Identity-Based Distributed Provable Data Possession in Multicloud Storage

05/08/201902/07/2019 by admin

3.2 DATAFLOW DIAGRAM

PUBLISHER:

SUBSCRIBER:

UML DIAGRAMS:

3.2 USE CASE DIAGRAM:

PUBLISHER:

SUBSCRIBER:

3.3 CLASS DIAGRAM:

PUBLISHER:

SUBSCRIBER:

3.4 SEQUENCE DIAGRAM:

PUBLISHER:

SUBSCRIBER:

3.5 ACTIVITY DIAGRAM:

PUBLISHER:

SUBSCRIBER:

CHAPTER 4

4.0 IMPLEMENTATION:

4.1 ALGORITHM

4.2 MODULES:

4.3 MODULE DESCRIPTION:

CHAPTER 5

5.0 SYSTEM STUDY:

5.1 FEASIBILITY STUDY:

Three key considerations involved in the feasibility analysis are:

ECONOMICAL FEASIBILITY
TECHNICAL FEASIBILITY
SOCIAL FEASIBILITY

5.1.1 ECONOMICAL FEASIBILITY:

5.1.2 TECHNICAL FEASIBILITY:

This study is carried out to check the technical feasibility, that is, the technical requirements of the system. Any system developed must not place a high demand on the available technical resources, since this would lead to high demands being placed on the client. The developed system must have modest requirements, as only minimal or no changes are required to implement this system.

5.1.3 SOCIAL FEASIBILITY:

5.2 SYSTEM TESTING:

5.2.1 UNIT TESTING:

UNIT TESTING:

Description	Expected result
Test for application window properties.	All the properties of the windows are to be properly aligned and displayed.
Test for mouse operations.	All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions.

5.2.2 FUNCTIONAL TESTING:

FUNCTIONAL TESTING:

Description	Expected result
Test for all modules.	All peers should communicate in the group.
Test for various peers in a distributed network framework, which should display all users available in the group.	The result after execution should be accurate.

5.2.3 NON-FUNCTIONAL TESTING:

Load testing
Performance testing
Usability testing
Reliability testing
Security testing

5.2.4 LOAD TESTING:

LOAD TESTING:

Description	Expected result
It is necessary to ascertain that the application behaves correctly under load when a ‘Server busy’ response is received.	Should designate another active node as the server.

5.2.5 PERFORMANCE TESTING:

PERFORMANCE TESTING:

Description	Expected result
This is required to assure that an application performs adequately, having the capability to handle many peers, delivering its results in the expected time and using an acceptable level of resources; it is an aspect of operational management.	Should handle large input values and produce accurate results in the expected time.

5.2.6 RELIABILITY TESTING:

RELIABILITY TESTING:

Description	Expected result
This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in providing the application.	In case of failure of the server, an alternate server should take over the job.

5.2.7 SECURITY TESTING:

SECURITY TESTING:

Description	Expected result
Checking that the user identification is authenticated.	In case of failure, the user should not be connected to the framework.
Check whether group keys in a tree are shared by all peers.	Peers in the same group should know the group key.

5.2.8 WHITE BOX TESTING:

WHITE BOX TESTING:

Description	Expected result
Exercise all logical decisions on their true and false sides.	All the logical decisions must be valid.
Execute all loops at their boundaries and within their operational bounds.	All the loops must be finite.
Exercise internal data structures to ensure their validity.	All the data structures must be valid.
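
The white-box criteria in the table above can be illustrated with a small, hypothetical function and checks that exercise its decision on both sides and its loop at the boundaries. The function `count_above` and its test values are invented for illustration and are not part of the project under test.

```python
def count_above(values, threshold):
    """Count how many values are strictly greater than threshold."""
    count = 0
    for value in values:       # loop under test
        if value > threshold:  # decision under test
            count += 1
    return count


def run_white_box_checks():
    """Exercise the decision on both sides and the loop at its boundaries."""
    # Decision: false side, true side, and the boundary of the comparison.
    assert count_above([1], 5) == 0
    assert count_above([9], 5) == 1
    assert count_above([5], 5) == 0
    # Loop: zero iterations, one iteration, several iterations.
    assert count_above([], 5) == 0
    assert count_above([6], 5) == 1
    assert count_above([1, 6, 7, 5, 10], 5) == 3
    return True
```

Calling `run_white_box_checks()` raises an `AssertionError` at the first failed case, which points directly at the branch or loop boundary that misbehaves.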

5.2.9 BLACK BOX TESTING:

BLACK BOX TESTING:

Description	Expected result
To check for incorrect or missing functions.	All the functions must be valid.
To check for interface errors.	The entire interface must function normally.
To check for errors in data structures or external database access.	Database updates and retrieval must complete successfully.
To check for initialization and termination errors.	All the functions and data structures must be initialized properly and terminated normally.

All the above system testing strategies are carried out, as the development, documentation, and institutionalization of the proposed goals and related policies are essential.

CHAPTER 6

6.0 SOFTWARE DESCRIPTION:

6.1 JAVA TECHNOLOGY:

Java technology is both a programming language and a platform.

The Java Programming Language

The Java programming language is a high-level language that can be characterized by all of the following buzzwords:

- Simple
- Architecture neutral
- Object oriented
- Portable
- Distributed
- High performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure

6.2 THE JAVA PLATFORM:

The Java platform has two components:

The Java Virtual Machine (Java VM)
The Java Application Programming Interface (Java API)

You’ve already been introduced to the Java VM. It’s the base for the Java platform and is ported onto various hardware-based platforms.

The following figure depicts a program that’s running on the Java platform. As the figure shows, the Java API and the virtual machine insulate the program from the hardware.

6.3 WHAT CAN JAVA TECHNOLOGY DO?

The essentials: Objects, strings, threads, numbers, input and output, data structures, system properties, date and time, and so on.
Applets: The set of conventions used by applets.
Networking: URLs, TCP (Transmission Control Protocol), UDP (User Datagram Protocol) sockets, and IP (Internet Protocol) addresses.
Internationalization: Help for writing programs that can be localized for users worldwide. Programs can automatically adapt to specific locales and be displayed in the appropriate language.
Security: Both low level and high level, including electronic signatures, public and private key management, access control, and certificates.
Software components: Known as JavaBeans™, these can plug into existing component architectures.
Object serialization: Allows lightweight persistence and communication via Remote Method Invocation (RMI).
Java Database Connectivity (JDBC™): Provides uniform access to a wide range of relational databases.

The Java platform also has APIs for 2D and 3D graphics, accessibility, servers, collaboration, telephony, speech, animation, and more. The following figure depicts what is included in the Java 2 SDK.

6.4 WILL JAVA TECHNOLOGY CHANGE MY LIFE?

Get started quickly: Although the Java programming language is a powerful object-oriented language, it’s easy to learn, especially for programmers already familiar with C or C++.
Write less code: Comparisons of program metrics (class counts, method counts, and so on) suggest that a program written in the Java programming language can be four times smaller than the same program in C++.
Write better code: The Java programming language encourages good coding practices, and its garbage collection helps you avoid memory leaks. Its object orientation, its JavaBeans component architecture, and its wide-ranging, easily extendible API let you reuse other people’s tested code and introduce fewer bugs.
Develop programs more quickly: Development may be as much as twice as fast as writing the same program in C++. Why? You write fewer lines of code, and Java is a simpler programming language than C++.
Avoid platform dependencies with 100% Pure Java: You can keep your program portable by avoiding the use of libraries written in other languages. The 100% Pure Java™ Product Certification Program has a repository of historical process manuals, white papers, brochures, and similar materials online.
Write once, run anywhere: Because 100% Pure Java programs are compiled into machine-independent byte codes, they run consistently on any Java platform.
Distribute software more easily: You can upgrade applets easily from a central server. Applets take advantage of the feature of allowing new classes to be loaded “on the fly,” without recompiling the entire program.

6.5 ODBC:

6.6 JDBC:

JDBC was announced in March of 1996. It was released for a 90 day public review that ended June 8, 1996. Because of user input, the final JDBC v1.0 specification was released soon after.

6.7 JDBC Goals:

The goals that were set for JDBC are important. They will give you some insight as to why certain classes and functionalities behave the way they do. The design goals for JDBC are as follows:

SQL Level API

SQL Conformance

JDBC must be implementable on top of common database interfaces

Provide a Java interface that is consistent with the rest of the Java system

Because of Java’s acceptance in the user community thus far, the designers feel that they should not stray from the current design of the core Java system.

Keep it simple

Use strong, static typing wherever possible

Strong typing allows more error checking to be done at compile time, so fewer errors appear at runtime.

Keep the common cases simple

Finally, we decided to proceed with the implementation using Java Networking.

For dynamically updating the cache table, we use an MS Access database.

Java is two things: a programming language and a platform.

Java is a high-level programming language that is all of the following:

- Simple
- Architecture-neutral
- Object-oriented
- Portable
- Distributed
- High-performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure

Compilation happens just once; interpretation occurs each time the program is executed. The figure illustrates how this works.

6.7 NETWORKING TCP/IP STACK:

The TCP/IP stack is shorter than the OSI one:

TCP is a connection-oriented protocol; UDP (User Datagram Protocol) is a connectionless protocol.

IP datagrams:

UDP:

UDP is also connectionless and unreliable. What it adds to IP is a checksum for the contents of the datagram and port numbers. These are used to give a client/server model – see later.
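
The checksum mentioned above is a one's-complement sum over 16-bit words. The following is a minimal sketch of that arithmetic only; a real UDP checksum also covers a pseudo-header with the IP addresses, which is left out here. The class name InternetChecksum and its method are hypothetical names chosen for illustration.

```java
// Hypothetical illustration class; not part of any library.
final class InternetChecksum {
    // One's-complement sum of 16-bit big-endian words, then complemented.
    // An odd trailing byte is padded with a zero byte.
    static int checksum(byte[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i += 2) {
            int high = data[i] & 0xFF;
            int low = (i + 1 < data.length) ? (data[i + 1] & 0xFF) : 0;
            sum += (high << 8) | low;
        }
        // Fold any carries above bit 15 back into the low 16 bits.
        while ((sum >>> 16) != 0) {
            sum = (sum & 0xFFFF) + (sum >>> 16);
        }
        return (int) (~sum & 0xFFFF);
    }

    public static void main(String[] args) {
        byte[] sample = {0x00, 0x01, (byte) 0xf2, 0x03,
                         (byte) 0xf4, (byte) 0xf5, (byte) 0xf6, (byte) 0xf7};
        System.out.printf("%04x%n", checksum(sample)); // prints 220d
    }
}
```

A receiver that recomputes the sum over the data plus the transmitted checksum obtains zero when nothing was corrupted.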

TCP:

TCP supplies logic to give a reliable connection-oriented protocol above IP. It provides a virtual circuit that two processes can use to communicate.

Internet addresses

In order to use a service, you must be able to find it. The Internet uses an address scheme for machines so that they can be located. The address is a 32 bit integer which gives the IP address.

Network address:

Class A uses 8 bits for the network address with 24 bits left over for other addressing. Class B uses 16 bit network addressing. Class C uses 24 bit network addressing. Class D addresses are reserved for multicast and are not split into network and host parts.
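
As a rough illustration of the classful scheme, the class of an address can be read from the leading bits of its first octet. The sketch below assumes the standard classful ranges (0 to 127 for A, 128 to 191 for B, 192 to 223 for C, 224 to 239 for D); the class name AddressClass and its methods are hypothetical.

```java
// Hypothetical illustration class for classful IPv4 addressing.
final class AddressClass {
    // Determine the address class from the first octet (0..255).
    static char classOf(int firstOctet) {
        if (firstOctet < 0 || firstOctet > 255) {
            throw new IllegalArgumentException("octet out of range: " + firstOctet);
        }
        if (firstOctet < 128) return 'A'; // leading bit 0
        if (firstOctet < 192) return 'B'; // leading bits 10
        if (firstOctet < 224) return 'C'; // leading bits 110
        if (firstOctet < 240) return 'D'; // leading bits 1110 (multicast)
        return 'E';                       // leading bits 1111 (reserved)
    }

    // Number of network bits for classes A, B and C; -1 where this sketch
    // treats the address as having no network/host split.
    static int networkBits(char addressClass) {
        switch (addressClass) {
            case 'A': return 8;
            case 'B': return 16;
            case 'C': return 24;
            default:  return -1;
        }
    }

    public static void main(String[] args) {
        char c = classOf(172);
        System.out.println(c + " " + networkBits(c)); // prints B 16
    }
}
```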

Subnet address:

Internally, the UNIX network is divided into subnetworks. Building 11 is currently on one subnetwork and uses 10-bit addressing, allowing 1024 different hosts.

Host address:

8 bits are finally used for host addresses within our subnet. This places a limit of 256 machines that can be on the subnet.

Total address:

The 32 bit address is usually written as 4 integers separated by dots.
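
The dotted form is simply the four bytes of the 32 bit integer written in decimal. A small sketch of the conversion in both directions follows; the class name DottedQuad and its methods are hypothetical helpers for illustration.

```java
// Hypothetical illustration class; the names are not from any library.
final class DottedQuad {
    // Format a 32-bit IPv4 address as four decimal octets separated by dots.
    static String format(int address) {
        return ((address >>> 24) & 0xFF) + "."
             + ((address >>> 16) & 0xFF) + "."
             + ((address >>> 8) & 0xFF) + "."
             + (address & 0xFF);
    }

    // Parse dotted notation back into the 32-bit integer form.
    static int parse(String dotted) {
        String[] parts = dotted.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("expected four octets: " + dotted);
        }
        int address = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("octet out of range: " + part);
            }
            address = (address << 8) | octet;
        }
        return address;
    }

    public static void main(String[] args) {
        int address = 0xC0A80001; // 192.168.0.1 as a 32-bit integer
        System.out.println(format(address));                  // prints 192.168.0.1
        System.out.println(parse("192.168.0.1") == address);  // prints true
    }
}
```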

Port addresses

Sockets:

#include <sys/types.h>

#include <sys/socket.h>

int socket(int family, int type, int protocol);
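
Since this project uses Java networking, the same idea can be sketched with the java.net classes, which wrap the socket call shown above. The example below is only a sketch under stated assumptions: it runs a one-shot echo server on an ephemeral loopback port and sends a single line through a TCP connection. The class name LoopbackEcho and its helper methods are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class: one TCP round trip over loopback.
final class LoopbackEcho {
    private static BufferedReader reader(Socket s) throws IOException {
        return new BufferedReader(new InputStreamReader(s.getInputStream(), StandardCharsets.UTF_8));
    }

    private static PrintWriter writer(Socket s) throws IOException {
        // autoFlush = true so println pushes the line onto the connection
        return new PrintWriter(new OutputStreamWriter(s.getOutputStream(), StandardCharsets.UTF_8), true);
    }

    // Send one line to a throwaway echo server and return the reply.
    static String roundTrip(String message) {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback)) { // port 0: pick a free port
            Thread echo = new Thread(() -> {
                try (Socket peer = server.accept()) {
                    writer(peer).println(reader(peer).readLine());
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            echo.start();
            try (Socket client = new Socket(loopback, server.getLocalPort())) {
                writer(client).println(message);
                String reply = reader(client).readLine();
                echo.join();
                return reply;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("hello"));
    }
}
```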

6.8 JFREE CHART:

JFreeChart is a free 100% Java chart library that makes it easy for developers to display professional quality charts in their applications. JFreeChart’s extensive feature set includes:

A consistent and well-documented API, supporting a wide range of chart types;

A flexible design that is easy to extend, and targets both server-side and client-side applications;

Support for many output types, including Swing components, image files (including PNG and JPEG), and vector graphics file formats (including PDF, EPS and SVG);

JFreeChart is “open source” or, more specifically, free software. It is distributed under the terms of the GNU Lesser General Public Licence (LGPL), which permits use in proprietary applications.

6.8.1. Map Visualizations:

6.8.2. Time Series Chart Interactivity

6.8.3. Dashboards

6.8.4. Property Editors

CHAPTER 7

APPENDIX

7.1 SAMPLE SOURCE CODE

7.2 SAMPLE OUTPUT

CHAPTER 8

8.0 CONCLUSION

Generating Searchable Public-Key Ciphertexts with Hidden Structures for Fast Keyword Search

05/08/2019 02/07/2019 by admin

This paper proposes Searchable Public-Key Ciphertexts with Hidden Structures (SPCHS) to make keyword search as fast as possible without sacrificing the semantic security of the encrypted keywords. In SPCHS, all keyword-searchable ciphertexts are structured by hidden relations, and with the search trapdoor corresponding to a keyword, the minimum information of the relations is disclosed to a search algorithm as the guidance to find all matching ciphertexts efficiently.

We construct an SPCHS scheme from scratch in which the ciphertexts have a hidden star-like structure. We prove our scheme to be semantically secure in the Random Oracle (RO) model. The search complexity of our scheme depends on the actual number of ciphertexts containing the queried keyword, rather than the number of all ciphertexts.

Finally, we present a generic SPCHS construction from anonymous identity-based encryption and collision-free full-identity malleable Identity-Based Key Encapsulation Mechanism (IBKEM) with anonymity. We illustrate two collision-free full-identity malleable IBKEM instances, which are semantically secure and anonymous, respectively, in the RO and standard models. The latter instance enables us to construct an SPCHS scheme with semantic security in the standard model.

1.2 INTRODUCTION:

We start by formally defining the concept of Searchable Public-key Ciphertexts with Hidden Structures (SPCHS) and its semantic security. In this new concept, keyword-searchable ciphertexts with their hidden structures can be generated in the public key setting; with a keyword search trapdoor, partial relations can be disclosed to guide the discovery of all matching ciphertexts. Semantic security is defined for both the keywords and the hidden structures. It is worth noting that this new concept and its semantic security are suitable for keyword-searchable ciphertexts with any kind of hidden structures. In contrast, the concept of traditional PEKS does not contain any hidden structure among the PEKS ciphertexts; correspondingly, its semantic security is only defined for the keywords. Following the SPCHS definition, we construct a simple SPCHS from scratch in the random oracle (RO) model. The scheme generates keyword-searchable ciphertexts with a hidden star-like structure. The search performance mainly depends on the actual number of the ciphertexts containing the queried keyword. For security, the scheme is proven semantically secure based on the Decisional Bilinear Diffie-Hellman (DBDH) assumption in the RO model.

We build a generic SPCHS construction with Identity-Based Encryption (IBE) and collision-free full-identity malleable IBKEM. The resulting SPCHS can generate keyword-searchable ciphertexts with a hidden star-like structure. Moreover, if both the underlying IBKEM and IBE have semantic security and anonymity (i.e. the privacy of receivers’ identities), the resulting SPCHS is semantically secure. As there are known IBE schemes [4], [5], [6], [7] in both the RO model and the standard model, an SPCHS construction is reduced to collision-free full-identity malleable IBKEM with anonymity. Several IBKEM schemes have been proposed to construct Verifiable Random Functions (VRF). We show that one of these IBKEM schemes is anonymous and collision-free full-identity malleable in the RO model. A standard-model version of the Boneh-and-Franklin (BF) IBE scheme has also been constructed using the “approximation” of multilinear maps. We transform this IBE scheme into a collision-free full-identity malleable IBKEM scheme with semantic security and anonymity in the standard model. Hence, this new IBKEM scheme allows us to build SPCHS schemes secure in the standard model with the same search performance as the previous SPCHS construction from scratch in the RO model.

1.3 LITERATURE SURVEY

TITLE: FUZZY KEYWORD SEARCH OVER ENCRYPTED DATA IN CLOUD COMPUTING

AUTHOR: Li J., Wang Q., Wang C., Cao N., Ren K., Lou W

PUBLISH: IEEE INFOCOM 2010, pp. 1-5. (2010)

EXPLANATION:

As Cloud Computing becomes prevalent, more and more sensitive information are being centralized into the cloud. For the protection of data privacy, sensitive data usually have to be encrypted before outsourcing, which makes effective data utilization a very challenging task. Although traditional searchable encryption schemes allow a user to securely search over encrypted data through keywords and selectively retrieve files of interest, these techniques support only exact keyword search. That is, there is no tolerance of minor typos and format inconsistencies which, on the other hand, are typical user searching behavior and happen very frequently. This significant drawback makes existing techniques unsuitable in Cloud Computing as it greatly affects system usability, rendering user searching experiences very frustrating and system efficacy very low. In this paper, for the first time we formalize and solve the problem of effective fuzzy keyword search over encrypted cloud data while maintaining keyword privacy. Fuzzy keyword search greatly enhances system usability by returning the matching files when users’ searching inputs exactly match the predefined keywords or the closest possible matching files based on keyword similarity semantics, when exact match fails. In our solution, we exploit edit distance to quantify keywords similarity and develop an advanced technique on constructing fuzzy keyword sets, which greatly reduces the storage and representation overheads. Through rigorous security analysis, we show that our proposed solution is secure and privacy-preserving, while correctly realizing the goal of fuzzy keyword search.
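
The edit distance used above to quantify keyword similarity can be sketched on plaintext strings as follows. This is only the classic dynamic-programming distance (insertions, deletions and substitutions), not the fuzzy keyword set construction or any encrypted search from the cited paper; the class name EditDistance and the method fuzzyMatch are hypothetical.

```java
// Hypothetical illustration class: plaintext edit distance only.
final class EditDistance {
    // Minimum number of single-character insertions, deletions and
    // substitutions needed to turn a into b (two-row dynamic programming).
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j characters of b
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting the first i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[b.length()];
    }

    // A query matches a keyword when it is within maxEdits edits of it.
    static boolean fuzzyMatch(String query, String keyword, int maxEdits) {
        return distance(query, keyword) <= maxEdits;
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting"));   // prints 3
        System.out.println(fuzzyMatch("clould", "cloud", 1)); // prints true
    }
}
```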

TITLE: ANONYMOUS FUZZY IDENTITY-BASED ENCRYPTION FOR SIMILARITY SEARCH

AUTHOR: Cheung D. W., Mamoulis N., Wong W. K., Yiu S. M., Zhang

PUBLISH: ISAAC 2010. LNCS, vol. 6505, pp. 61-72. Springer, Heidelberg (2010)

EXPLANATION:

The predicate that was studied in the very beginning is “exact keyword matching”. That is, whether the value hidden by the token is equal to the attribute value hidden in the ciphertext. Schemes that only provide data item security are basically “Identity-Based Encryption”. Schemes protecting both the data item and the attributes were initiated in the private-key setting public-key setting. Relationship between and “Anonymous Identity-Based Encryption” was revisited in range query as the predicate was also considered. Boneh et al. devised an Augmented Broadcast Encryption which allows checking if the attribute value falls within a range on encrypted data. Their scheme also provides attribute protection. Then, Boneh and Waters extended it to multi-dimensional range query.

However, there is no practical scheme supporting this predicate with attribute protection in public-key settings investigated this problem in the private-key setting and is IND2-CKA secure. His scheme is in a public-key setting. However, the scheme requires the threshold value t to be fixed in the setup time. Our work is using as a framework provided schemes for handling predicates represented as inner products. Their formulation of using inner products with bounded disjunction is powerful. We show how to reduce inner products to hamming distance similarity comparison predicate, and then derive a slightly different encryption scheme for better performance when considering the inequality case. In our work, we consider the problem of attribute protection in public-key setting. In some applications, people may also want to provide protection to predicate (“the token”), which is inherently unachievable in public-key setting. Note that a predicate encryption supporting inner product in private-key setting has been devised in which can provide predicate privacy

TITLE: TRAPDOOR PRIVACY IN ASYMMETRIC SEARCHABLE ENCRYPTION SCHEMES

AUTHOR: Arriaga A., Tang Q., Ryan P

PUBLISH: AFRICACRYPT 2014. LNCS, vol. 8469, pp. 31-50. Springer, Heidelberg (2014)

EXPLANATION:

Asymmetric searchable encryption allows searches to be carried over ciphertexts, through delegation, and by means of trapdoors issued by the owner of the data. Public Key Encryption with Keyword Search (PEKS) is a primitive with such functionality that provides delegation of exact-match searches. As it is important that ciphertexts preserve data privacy, it is also important that trapdoors do not expose the user’s search criteria. The difficulty of formalizing a security model for trapdoor privacy lies in the verification functionality, which gives the adversary the power of verifying if a trapdoor encodes a particular keyword. In this paper, we provide a broader view on what can be achieved regarding trapdoor privacy in asymmetric searchable encryption schemes, and bridge the gap between previous definitions, which give limited privacy guarantees in practice against search patterns. We propose the notion of Strong Search Pattern Privacy for PEKS and construct a scheme that achieves this security notion.

CHAPTER 2

2.0 SYSTEM ANALYSIS

2.1 EXISTING SYSTEM:

Existing semantically secure PEKS schemes take search time linear in the total number of ciphertexts. This makes retrieval from large-scale databases prohibitive. Therefore, more efficient search performance is crucial for practically deploying PEKS schemes. One of the prominent works to accelerate the search over encrypted keywords in the public-key setting is deterministic encryption, which enables search over encrypted keywords to be as efficient as the search for unencrypted keywords, such that a ciphertext containing a given keyword can be retrieved in time complexity logarithmic in the total number of all ciphertexts.

This is reasonable because the encrypted keywords can form a tree-like structure when stored according to their binary values. However, deterministic encryption has two inherent limitations. First, keyword privacy can be guaranteed only for keywords that are a priori hard-to-guess by the adversary (i.e., keywords with high min-entropy to the adversary); second, certain information of a message leaks inevitably via the ciphertext of the keywords since the encryption is deterministic. Hence, deterministic encryption is only applicable in special scenarios.
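
Both the speed and the leak can be seen in a small sketch. Here a SHA-256 hash stands in for deterministic encryption (it is not an encryption scheme, only a deterministic mapping), and a sorted TreeMap plays the role of the tree-like structure. The class name DeterministicIndex and its methods are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

// Hypothetical illustration class: a hash stands in for deterministic encryption.
final class DeterministicIndex {
    // Sorted by tag value, so a lookup costs O(log n) comparisons.
    private final TreeMap<String, List<Integer>> index = new TreeMap<>();

    // The same keyword always yields the same tag. That is what makes the
    // sorted lookup possible, and also what leaks keyword equality.
    static String tag(String keyword) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(keyword.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xFF));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    void add(String keyword, int documentId) {
        index.computeIfAbsent(tag(keyword), k -> new ArrayList<>()).add(documentId);
    }

    List<Integer> search(String keyword) {
        return index.getOrDefault(tag(keyword), Collections.emptyList());
    }

    public static void main(String[] args) {
        DeterministicIndex idx = new DeterministicIndex();
        idx.add("cloud", 1);
        idx.add("audit", 2);
        idx.add("cloud", 3);
        System.out.println(idx.search("cloud")); // prints [1, 3]
    }
}
```

Anyone who can guess a low-entropy keyword can compute its tag and compare, which is the first limitation described above.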

Observe that a keyword space usually does not have high min-entropy in many scenarios. Semantic security is crucial to guarantee keyword privacy in such applications. Thus the linear search complexity of existing schemes is the major obstacle to their adoption. Unfortunately, the linear complexity seems to be inevitable because the server has to scan and test each ciphertext, due to the fact that these ciphertexts (corresponding to the same keyword or not) are indistinguishable to the server.

2.1.1 DISADVANTAGES:

Each sender should be able to generate the keyword-searchable ciphertexts with the hidden star-like structure by the receiver’s public-key; the server having a keyword search trapdoor should be able to disclose partial relations, which is related to all matching ciphertexts. Semantic security is preserved 1) if no keyword search trapdoor is known, all ciphertexts are indistinguishable, and no information is leaked about the structure, and 2) given a keyword search trapdoor, only the corresponding relations can be disclosed, and the matching ciphertexts leak no information about the rest of ciphertexts, except the fact that the rest do not contain the queried keyword.

Data integrity cannot be guaranteed in the existing system.
In the existing system, the public verifier does not check data across multiple clouds.

2.2 PROPOSED SYSTEM:

We propose the concept of Searchable Public-key Ciphertexts with Hidden Structures (SPCHS) and its semantic security. In this new concept, keyword-searchable ciphertexts with their hidden structures can be generated in the public key setting; with a keyword search trapdoor, partial relations can be disclosed to guide the discovery of all matching ciphertexts. Semantic security is defined for both the keywords and the hidden structures. Following the SPCHS definition, we construct a simple SPCHS from scratch in the random oracle (RO) model. The scheme generates keyword-searchable ciphertexts with a hidden star-like structure. The search performance mainly depends on the actual number of the ciphertexts containing the queried keyword.

We are also interested in providing a generic SPCHS construction to generate keyword-searchable ciphertexts with a hidden star-like structure. Our generic SPCHS is inspired by several interesting observations on Identity-Based Key Encapsulation Mechanism (IBKEM). We build a generic SPCHS construction with Identity-Based Encryption (IBE) and collision-free full-identity malleable IBKEM. The resulting SPCHS can generate keyword-searchable ciphertexts with a hidden star-like structure. Moreover, if both the underlying IBKEM and IBE have semantic security and anonymity (i.e. the privacy of receivers’ identities), the resulting SPCHS is semantically secure. As there are known IBE schemes in both the RO model and the standard model, an SPCHS construction is reduced to collision-free full-identity malleable IBKEM.

2.2.1 ADVANTAGES:

Several IBKEM schemes have been proposed to construct Verifiable Random Functions (VRF) [8]. We show that one of these IBKEM schemes is anonymous and collision-free full-identity malleable in the RO model. Prior work utilized the “approximation” of multilinear maps to construct a standard-model version of the Boneh-and-Franklin (BF) IBE scheme.

We transform this IBE scheme into a collision-free full-identity malleable IBKEM scheme with semantic security and anonymity in the standard model. Hence, this new IBKEM scheme allows us to build SPCHS schemes secure in the standard model with the same search performance as the previous SPCHS construction from scratch in the RO model.

In our proposed system, each client has a private key corresponding to his or her identity (e.g., name, ID or any…
The public verifier allows each user to be matched to his or her identity through the private key.

2.3 HARDWARE & SOFTWARE REQUIREMENTS:

2.3.1 HARDWARE REQUIREMENT:

- Processor – Pentium IV
- Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Key Board – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA

2.3.2 SOFTWARE REQUIREMENTS:

Operating System : Windows XP or Win7
Front End : JAVA JDK 1.7
Back End : MySQL Server
Server : Apache Tomcat Server
Script : JSP Script
Document : MS-Office 2007

CHAPTER 3

3.0 SYSTEM DESIGN:

Data Flow Diagram / Use Case Diagram / Flow Diagram:

The DFD is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on these data, and the output data generated by the system.

The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components. These components are the system process, the data used by the process, an external entity that interacts with the system and the information flows in the system.

DFD shows how the information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.

A DFD may be used to represent a system at any level of abstraction. A DFD may be partitioned into levels that represent increasing information flow and functional detail.

NOTATION:

SOURCE OR DESTINATION OF DATA:

External sources or destinations, which may be people or organizations or other entities

DATA STORE:

Here the data referenced by a process is stored and retrieved.

PROCESS:

People, procedures or devices that produce data. The physical component is not identified.

DATA FLOW:

Data moves in a specific direction from an origin to a destination. The data flow is a “packet” of data.

MODELING RULES:

There are several common modeling rules when creating DFDs:

All processes must have at least one data flow in and one data flow out.
All processes should modify the incoming data, producing new forms of outgoing data.
Each data store must be involved with at least one data flow.
Each external entity must be involved with at least one data flow.
A data flow must be attached to at least one process.
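
The structural rules above can be checked mechanically. The sketch below is one possible way to do so, assuming a diagram is given as a map of named nodes and a list of directed flows; the class DfdValidator, its Kind enum and its Flow type are hypothetical. The rule that a process should modify its incoming data is about meaning and is not checked here.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration class: checks the structural DFD rules only.
final class DfdValidator {
    enum Kind { PROCESS, DATA_STORE, EXTERNAL_ENTITY }

    static final class Flow {
        final String from;
        final String to;
        Flow(String from, String to) {
            this.from = from;
            this.to = to;
        }
    }

    // Returns one message per violated rule; an empty list means the
    // diagram passes the structural checks.
    static List<String> validate(Map<String, Kind> nodes, List<Flow> flows) {
        List<String> problems = new ArrayList<>();
        Set<String> hasIn = new HashSet<>();
        Set<String> hasOut = new HashSet<>();
        for (Flow f : flows) {
            hasOut.add(f.from);
            hasIn.add(f.to);
            // A data flow must be attached to at least one process.
            if (nodes.get(f.from) != Kind.PROCESS && nodes.get(f.to) != Kind.PROCESS) {
                problems.add("flow " + f.from + " -> " + f.to + " touches no process");
            }
        }
        for (Map.Entry<String, Kind> e : nodes.entrySet()) {
            String name = e.getKey();
            boolean in = hasIn.contains(name);
            boolean out = hasOut.contains(name);
            if (e.getValue() == Kind.PROCESS) {
                // Processes need at least one flow in and one flow out.
                if (!in || !out) {
                    problems.add("process " + name + " needs a flow in and a flow out");
                }
            } else if (!in && !out) {
                // Data stores and external entities need at least one flow.
                problems.add(name + " is not involved in any data flow");
            }
        }
        return problems;
    }
}
```

For example, a diagram with the flows Sender -> Encrypt -> Ciphertext store passes, while a direct flow from Sender to Ciphertext store is reported because it touches no process.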

3.1 ARCHITECTURE DIAGRAM

3.2 DATAFLOW DIAGRAM

LEVEL I:

LEVEL II:

UML DIAGRAMS:

3.2 USE CASE DIAGRAM:

3.3 CLASS DIAGRAM:

3.4 SEQUENCE DIAGRAM:

3.5 ACTIVITY DIAGRAM:

CHAPTER 4

4.0 IMPLEMENTATION:

SPCHS SCHEME:

We first explain intuitions behind SPCHS. We describe a hidden structure formed by ciphertexts as (C, Pri, Pub), where C denotes the set of all ciphertexts, Pri denotes the hidden relations among C, and Pub denotes the public parts. In case there is more than one hidden structure formed by ciphertexts, the description of multiple hidden structures formed by ciphertexts can be

In SPCHS, the encryption algorithm has two functionalities. One is to encrypt a keyword, and the other is to generate a hidden relation, which can associate the generated ciphertext to the hidden structure. Let (Pri, Pub) be the hidden structure. The encryption algorithm must take Pri as input, otherwise the hidden relation cannot be generated since Pub does not contain anything about the hidden relations. At the end of the encryption procedure, the Pri should be updated since a hidden relation is newly generated (but the specific method to update Pri relies on the specific instance of SPCHS). In addition, SPCHS needs an algorithm to initialize (Pri, Pub) by taking the master public key as input, and this algorithm will be run before the first time to generate a ciphertext. With a keyword search trapdoor, the search algorithm of SPCHS can disclose partial relations to guide the discovery of the ciphertexts containing the queried keyword with the hidden structure.
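
The flow described above (encrypt while updating Pri, then search by following the disclosed relations) can be pictured with a deliberately simplified sketch. It is not the SPCHS construction: the public-key, pairing-based scheme is replaced by a symmetric HMAC, so the sender needs the secret, Pub is not modeled, and no semantic security is claimed. The class name ToyChainedIndex, its methods and the lookups counter are all hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.ArrayList;
import java.util.Base64;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical toy: a symmetric-key stand-in, NOT the SPCHS scheme.
final class ToyChainedIndex {
    private final byte[] secret;
    // Stand-in for Pri: how many entries each keyword chain already holds.
    private final Map<String, Integer> pri = new HashMap<>();
    // What the server stores: opaque label -> document.
    private final Map<String, String> server = new HashMap<>();
    int lookups; // number of server probes made by search()

    ToyChainedIndex(byte[] secret) {
        this.secret = secret.clone();
    }

    private static byte[] hmac(byte[] key, String message) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(key, "HmacSHA256"));
            return mac.doFinal(message.getBytes(StandardCharsets.UTF_8));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Stand-in for a keyword search trapdoor.
    byte[] trapdoor(String keyword) {
        return hmac(secret, keyword);
    }

    // The label of the position-th entry can only be derived from the trapdoor.
    private static String label(byte[] trapdoor, int position) {
        return Base64.getEncoder().encodeToString(hmac(trapdoor, Integer.toString(position)));
    }

    // "Encrypt": attach the document to its keyword chain and update Pri.
    void store(String keyword, String document) {
        int position = pri.getOrDefault(keyword, 0);
        server.put(label(trapdoor(keyword), position), document);
        pri.put(keyword, position + 1);
    }

    // Search: follow the chain until the next label is absent.
    List<String> search(byte[] trapdoor) {
        List<String> matches = new ArrayList<>();
        for (int position = 0; ; position++) {
            lookups++;
            String document = server.get(label(trapdoor, position));
            if (document == null) {
                return matches;
            }
            matches.add(document);
        }
    }
}
```

In this toy, a search makes one probe per matching entry plus one final miss, regardless of how many other entries the server holds, which mirrors the search cost claimed for the hidden structure.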

4.1 ALGORITHM

IBKEM ALGORITHM:

In this section, we formalize collision-free full-identity malleable IBKEM and a generic SPCHS construction from IBKEM. Our generic construction also relies on a notion of collision-free full-identity malleable IBKEM. The following IBKEM definition is derived from [47]. A difference only appears in algorithm EncapsIBKEM. In order to highlight that the generator of an IBKEM encapsulation knows the chosen random value used in algorithm EncapsIBKEM, we take the random value as an input of the algorithm.

The collision-free full-identity malleable IBKEM implies the following characteristics: all identities’ decryption keys can decapsulate the same encapsulation; all decapsulated keys are collision-free; the generator of the encapsulation can also compute these decapsulated keys; the decapsulated keys of different encapsulations are also collision-free.

A collision-free full-identity malleable IBKEM scheme may preserve semantic security and anonymity. We incorporate the semantic security and anonymity into Anon-SS-ID-CPA secure IBKEM. But this security is different from the traditional version [47] of the Anon-SS-ID-CPA security due to the full-identity malleability of IBKEM.

4.2 MODULES:

USER MODULES:

IDENTITY BASED ENCRYPTION:

FAST SEARCHABLE ENCRYPTION:

SEMANTIC DATA SECURITY:

4.3 MODULE DESCRIPTION:

USER MODULE:

ADMIN:

This module helps the server view details and upload files securely. The admin uploads data to the database and views the subscriber and user details. The admin can also find the redistribution details, including who sent and who received the data. The data owner stores a large amount of data in the cloud and accesses it using the secure key provided by the admin after the data is encrypted. The data is encrypted using SECY. The user stores data after the auditor has viewed and verified it, and can also change the data. When the user views the data again, the admin notifies the user only of the changed data.

PROVIDER:

In this module, the subscriber chooses a document and downloads the data from the service provider. The subscriber pays the service provider, and the service provider gives the data key to the subscriber. The subscriber then downloads the data using the data key. A cloud computing service provider serves users’ service requests by using a server system, which is constructed and maintained by an infrastructure vendor and rented by the service provider.

USER:

In this module, users have authentication and security to access the details presented in the ontology system. Before accessing or searching the details, a user should have an account; otherwise, the user should register first. Users can register details such as user name, password, email and mobile number. We develop this module so that the cloud storage can be made secure.

IDENTITY BASED ENCRYPTION:

Batch identity-based key distribution: A direct application of collision-free full-identity malleable IBKEM is to achieve batch identity-based key distribution. In such an application, a sender would like to distribute different secret session keys to multiple receivers so that each receiver learns only his or her own session key. With collision-free full-identity malleable IBKEM, a sender just needs to broadcast an IBKEM encapsulation in the identity-based cryptography setting, e.g., encapsulating a session key K to a single user ID. According to the collision-freeness of IBKEM, each receiver ID′ can decapsulate and obtain a different key K′ with his/her secret key in the identity-based cryptosystem. Due to the full-identity malleability, the sender knows the decapsulated keys of all the receivers.

Anonymous identity-based broadcast encryption: A slightly more complicated application is anonymous identity-based broadcast encryption with efficient decryption. An analogous application has been proposed previously. This application will work if the IBKEM is collision-free full-identity malleable. It preserves the anonymity of receivers if the IBKEM is anonymous. Note that trivial anonymous broadcast encryption suffers decryption cost linear in the number of receivers. In contrast, our anonymous identity-based broadcast encryption enjoys constant decryption cost, plus logarithmic complexity to search the matching index in a set (K_1^1, …, K_N^1) organized by a certain partial order, e.g., a dictionary order according to their binary representations.

FAST SEARCHABLE ENCRYPTION:

As-fast-as-possible search in PEKS with semantic security. We proposed the concept of SPCHS as a variant of PEKS. The new concept allows keyword-searchable ciphertexts to be generated with a hidden structure. Given a keyword search trapdoor, the search algorithm of SPCHS can disclose part of this hidden structure for guidance on finding out the ciphertexts of the queried keyword. Semantic security of SPCHS captures the privacy of the keywords and the invisibility of the hidden structures. We proposed an SPCHS scheme from scratch with semantic security in the RO model. The scheme generates keyword-searchable ciphertexts with a hidden star-like structure. It has search complexity mainly linear with the exact number of the ciphertexts containing the queried keyword. It outperforms existing PEKS schemes with semantic security, whose search complexity is linear with the number of all ciphertexts.

We identified several interesting properties, i.e., collision-freeness and full-identity malleability in some IBKEM instances, and formalized these properties to build a generic SPCHS construction. We illustrated two collision-free full-identity malleable IBKEM instances, which are respectively secure in the RO and standard models. SPCHS seems a promising tool to solve some challenging problems in public-key searchable encryption. One application may be to achieve retrieval completeness verification which, to the best of our knowledge, has not been achieved in existing PEKS schemes. Specifically, by forming a hidden ring-like structure, i.e., letting the last hidden pointer always point to the head, one can obtain PEKS allowing to check the completeness of the retrieved ciphertexts by checking whether the pointers of the returned ciphertexts form a ring.
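
The ring idea for completeness checking can be sketched on plain identifiers. The example below assumes the pointers of the returned ciphertexts have already been disclosed as a simple map from each ciphertext id to the id it points to; in the actual scheme the pointers are hidden and only revealed through the trapdoor. The class name RingCheck and its method are hypothetical.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration class: completeness check on plaintext pointers.
final class RingCheck {
    // next maps each returned ciphertext id to the id its pointer names.
    // The result is complete when, starting at head, the pointers visit
    // every returned ciphertext exactly once and come back to head.
    static boolean formsRing(Map<String, String> next, String head) {
        if (!next.containsKey(head)) {
            return false;
        }
        Set<String> visited = new HashSet<>();
        String current = head;
        while (visited.add(current)) {
            current = next.get(current);
            if (current == null) {
                return false; // a pointer leads to a ciphertext that was not returned
            }
        }
        return current.equals(head) && visited.size() == next.size();
    }
}
```

If the server omits one ciphertext from its answer, the walk reaches a pointer whose target is missing and the check fails.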

SEMANTIC DATA SECURITY:

The SS-CKSA security of the above SPCHS scheme relies on the DBDH assumption in the RO model. Even in the case that a sender gets his local privacy Pri compromised, SPCHS still offers forward security. This means that the existing hidden structure of ciphertexts stays confidential, since the local privacy only contains the relationship of the newly generated ciphertexts. To offer backward security with SPCHS, the sender can initialize a new structure by algorithm Structure Initialization for the newly generated ciphertexts. A collision-free full-identity malleable IBKEM scheme may preserve semantic security and anonymity.

We incorporate the semantic security and anonymity into Anon-SS-ID-CPA secure IBKEM. But this security is different from the traditional version of the Anon-SS-ID-CPA security due to the full-identity malleability of IBKEM. The difference will be introduced after defining that security. In that security, a PPT adversary is allowed to query the decryption keys for adaptively chosen identities, and adaptively choose two challenge identities. The Anon-SS-ID-CPA security of IBKEM means that for a challenge key-and-encapsulation pair, the adversary cannot determine the correctness of this pair and the challenge identity of this pair, given that the adversary does not know the two challenging identities’ decryption keys in the Anon-SS-ID-CPA security of a collision-free full-identity malleable IBKEM scheme.

The SS-sK-CKSA security of the above generic SPCHS construction relies on the Anon-SS-sID-CPA security of the underlying IBKEM and the Anon-SS-ID-CPA security of the underlying IBE. In the security proof, we prove that if there is an adversary who can break the SS-sK-CKSA security of the above generic SPCHS construction, then there is another adversary who can break the Anon-SS-sID-CPA security of the underlying IBKEM or the Anon-SS-ID-CPA security of the underlying IBE.

CHAPTER 5

5.0 SYSTEM STUDY:

5.1 FEASIBILITY STUDY:

Three key considerations involved in the feasibility analysis are

ECONOMICAL FEASIBILITY
TECHNICAL FEASIBILITY
SOCIAL FEASIBILITY

5.1.1 ECONOMICAL FEASIBILITY:

5.1.2 TECHNICAL FEASIBILITY

5.1.3 SOCIAL FEASIBILITY:

5.2 SYSTEM TESTING:

5.2.1 UNIT TESTING:

Description	Expected result
Test for application window properties.	All the properties of the windows are to be properly aligned and displayed.
Test for mouse operations.	All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions.

5.1.2 FUNCTIONAL TESTING:

Description	Expected result
Test for all modules.	All peers should communicate in the group.
Test for various peers in a distributed network framework, as it displays all users available in the group.	The result after execution should be accurate.

5.1.3 NON-FUNCTIONAL TESTING:

Load testing
Performance testing
Usability testing
Reliability testing
Security testing

5.1.4 LOAD TESTING:

Description	Expected result
It is necessary to ascertain that the application behaves correctly under loads when ‘Server busy’ response is received.	Should designate another active node as a Server.

5.1.5 PERFORMANCE TESTING:

Description	Expected result
This is required to ensure that an application performs adequately, having the capability to handle many peers, delivering its results in the expected time and using an acceptable level of resources; it is an aspect of operational management.	Should handle large input values and produce accurate results in the expected time.

5.1.6 RELIABILITY TESTING:

Description	Expected result
This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in providing the application.	In case of failure of the server, an alternate server should take over the job.

5.1.7 SECURITY TESTING:

Description	Expected result
Checking that the user identification is authenticated.	In case of failure, the user should not be connected to the framework.
Check whether group keys in a tree are shared by all peers.	The peers should know group key in the same group.

5.1.8 WHITE BOX TESTING:

Description	Expected result
Exercise all logical decisions on their true and false sides.	All the logical decisions must be valid.
Execute all loops at their boundaries and within their operational bounds.	All the loops must be finite.
Exercise internal data structures to ensure their validity.	All the data structures must be valid.

5.1.9 BLACK BOX TESTING:

Description	Expected result
To check for incorrect or missing functions.	All the functions must be valid.
To check for interface errors.	The entire interface must function normally.
To check for errors in data structures or external database access.	Database updates and retrieval must complete correctly.
To check for initialization and termination errors.	All the functions and data structures must be initialized properly and terminated normally.

All the above system testing strategies are carried out, as the development, documentation and institutionalization of the proposed goals and related policies are essential.

CHAPTER 6

6.0 SOFTWARE DESCRIPTION:

6.1 JAVA TECHNOLOGY:

Java technology is both a programming language and a platform.

The Java Programming Language

The Java programming language is a high-level language that can be characterized by all of the following buzzwords:

- Simple
- Architecture neutral
- Object oriented
- Portable
- Distributed
- High performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure

6.2 THE JAVA PLATFORM:

The Java platform has two components:

The Java Virtual Machine (Java VM)
The Java Application Programming Interface (Java API)

You’ve already been introduced to the Java VM. It’s the base for the Java platform and is ported onto various hardware-based platforms.

The following figure depicts a program that’s running on the Java platform. As the figure shows, the Java API and the virtual machine insulate the program from the hardware.

6.3 WHAT CAN JAVA TECHNOLOGY DO?

The essentials: Objects, strings, threads, numbers, input and output, data structures, system properties, date and time, and so on.
Applets: The set of conventions used by applets.
Networking: URLs, TCP (Transmission Control Protocol), UDP (User Datagram Protocol) sockets, and IP (Internet Protocol) addresses.
Internationalization: Help for writing programs that can be localized for users worldwide. Programs can automatically adapt to specific locales and be displayed in the appropriate language.
Security: Both low level and high level, including electronic signatures, public and private key management, access control, and certificates.
Software components: Known as JavaBeans^TM, can plug into existing component architectures.
Object serialization: Allows lightweight persistence and communication via Remote Method Invocation (RMI).
Java Database Connectivity (JDBC^TM): Provides uniform access to a wide range of relational databases.

The Java platform also has APIs for 2D and 3D graphics, accessibility, servers, collaboration, telephony, speech, animation, and more. The following figure depicts what is included in the Java 2 SDK.

6.4 HOW WILL JAVA TECHNOLOGY CHANGE MY LIFE?

Get started quickly: Although the Java programming language is a powerful object-oriented language, it’s easy to learn, especially for programmers already familiar with C or C++.
Write less code: Comparisons of program metrics (class counts, method counts, and so on) suggest that a program written in the Java programming language can be four times smaller than the same program in C++.
Write better code: The Java programming language encourages good coding practices, and its garbage collection helps you avoid memory leaks. Its object orientation, its JavaBeans component architecture, and its wide-ranging, easily extendible API let you reuse other people’s tested code and introduce fewer bugs.
Develop programs more quickly: Your development time may be as much as twice as fast versus writing the same program in C++. Why? You write fewer lines of code and it is a simpler programming language than C++.
Avoid platform dependencies with 100% Pure Java: You can keep your program portable by avoiding the use of libraries written in other languages. The 100% Pure Java^TM Product Certification Program has a repository of historical process manuals, white papers, brochures, and similar materials online.
Write once, run anywhere: Because 100% Pure Java programs are compiled into machine-independent byte codes, they run consistently on any Java platform.
Distribute software more easily: You can upgrade applets easily from a central server. Applets take advantage of the feature of allowing new classes to be loaded “on the fly,” without recompiling the entire program.

6.5 ODBC:

6.6 JDBC:

JDBC was announced in March of 1996. It was released for a 90 day public review that ended June 8, 1996. Because of user input, the final JDBC v1.0 specification was released soon after.

6.7 JDBC Goals:

The goals that were set for JDBC are important. They will give you some insight as to why certain classes and functionalities behave the way they do. The design goals for JDBC are as follows:

SQL Level API

SQL Conformance

JDBC must be implementable on top of common database interfaces

Provide a Java interface that is consistent with the rest of the Java system

Because of Java’s acceptance in the user community thus far, the designers feel that they should not stray from the current design of the core Java system.

Keep it simple

Use strong, static typing wherever possible

Strong typing allows more error checking to be done at compile time, so fewer errors appear at runtime.

Keep the common cases simple

Finally, we decided to proceed with the implementation using Java Networking.

For dynamically updating the cache table, we use an MS Access database.

Java is two things: a programming language and a platform.

Java is a high-level programming language that is all of the following:

- Simple
- Architecture-neutral
- Object-oriented
- Portable
- Distributed
- High-performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure

Compilation happens just once; interpretation occurs each time the program is executed. The figure illustrates how this works.

6.7 NETWORKING TCP/IP STACK:

The TCP/IP stack is shorter than the OSI one:

TCP is a connection-oriented protocol; UDP (User Datagram Protocol) is a connectionless protocol.

IP datagrams:

UDP:

UDP is also connectionless and unreliable. What it adds to IP is a checksum for the contents of the datagram and port numbers. These are used to give a client/server model – see later.

TCP:

TCP supplies logic to give a reliable connection-oriented protocol above IP. It provides a virtual circuit that two processes can use to communicate.

Internet addresses

In order to use a service, you must be able to find it. The Internet uses an address scheme for machines so that they can be located. The address is a 32 bit integer which gives the IP address.

Network address:

Class A uses 8 bits for the network address, with 24 bits left over for other addressing. Class B uses 16-bit network addressing. Class C uses 24-bit network addressing, and class D, which is reserved for multicast, uses all 32 bits.

Subnet address:

Internally, the UNIX network is divided into sub networks. Building 11 is currently on one sub network and uses 10-bit addressing, allowing 1024 different hosts.

Host address:

8 bits are finally used for host addresses within our subnet. This places a limit of 256 machines that can be on the subnet.

Total address:

The 32 bit address is usually written as 4 integers separated by dots.
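The dotted notation and the address classes described above can be sketched in Java as follows. This is only an illustration, not code from the project: the class name `Ipv4Address` and its methods are hypothetical, and the class boundaries follow the classic classful scheme (leading bits of the first octet).

```java
public class Ipv4Address {
    /**
     * Packs a dotted-quad string such as "192.168.1.10" into a 32-bit value.
     * A long is used so the result is never negative.
     */
    public static long parse(String dotted) {
        String[] parts = dotted.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("expected 4 octets: " + dotted);
        }
        long value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("octet out of range: " + part);
            }
            value = (value << 8) | octet; // shift earlier octets up, append this one
        }
        return value;
    }

    /** Returns the classful address class, decided by the first octet. */
    public static char addressClass(long address) {
        int first = (int) ((address >> 24) & 0xFF);
        if (first < 128) return 'A'; // leading bit 0
        if (first < 192) return 'B'; // leading bits 10
        if (first < 224) return 'C'; // leading bits 110
        if (first < 240) return 'D'; // leading bits 1110 (multicast)
        return 'E';                  // leading bits 1111 (reserved)
    }

    public static void main(String[] args) {
        long address = parse("192.168.1.10");
        System.out.println(address + " is class " + addressClass(address));
    }
}
```

Each of the four integers fills one byte of the 32-bit address, which is why every octet must lie between 0 and 255.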

Port addresses

Sockets:

#include <sys/types.h>

#include <sys/socket.h>

int socket(int family, int type, int protocol);
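Since the project is implemented with Java Networking, the C call above has a rough Java counterpart in `java.net.ServerSocket` and `java.net.Socket`. The following is a minimal sketch, not the project's code: the class `LoopbackEcho` and the method `echoOnce` are hypothetical names, and the example only talks to itself over the loopback interface.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class LoopbackEcho {
    /** Sends one line to a one-shot echo server on the loopback interface and returns the reply. */
    public static String echoOnce(String message) throws IOException, InterruptedException {
        // Port 0 asks the operating system for any free port.
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            Thread serverThread = new Thread(() -> {
                try (Socket peer = server.accept();
                     BufferedReader in = new BufferedReader(
                             new InputStreamReader(peer.getInputStream(), StandardCharsets.UTF_8));
                     PrintWriter out = new PrintWriter(
                             new OutputStreamWriter(peer.getOutputStream(), StandardCharsets.UTF_8), true)) {
                    out.println(in.readLine()); // echo the received line back
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            serverThread.start();

            // The client side: TCP gives both ends a reliable byte stream.
            try (Socket client = new Socket(InetAddress.getLoopbackAddress(), server.getLocalPort());
                 PrintWriter out = new PrintWriter(
                         new OutputStreamWriter(client.getOutputStream(), StandardCharsets.UTF_8), true);
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(client.getInputStream(), StandardCharsets.UTF_8))) {
                out.println(message);
                String reply = in.readLine();
                serverThread.join();
                return reply;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(echoOnce("hello"));
    }
}
```

The server and client each hold one end of the virtual circuit that TCP provides; a UDP version would use `DatagramSocket` and would have to handle lost datagrams itself.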

6.8 JFREE CHART:

JFreeChart is a free 100% Java chart library that makes it easy for developers to display professional quality charts in their applications. JFreeChart’s extensive feature set includes:

A consistent and well-documented API, supporting a wide range of chart types;

A flexible design that is easy to extend, and targets both server-side and client-side applications;

Support for many output types, including Swing components, image files (including PNG and JPEG), and vector graphics file formats (including PDF, EPS and SVG);

JFreeChart is “open source” or, more specifically, free software. It is distributed under the terms of the GNU Lesser General Public Licence (LGPL), which permits use in proprietary applications.

6.8.1. Map Visualizations:

6.8.2. Time Series Chart Interactivity

6.8.3. Dashboards

6.8.4. Property Editors

CHAPTER 7

7.0 APPENDIX

7.1 SAMPLE SCREEN SHOTS:

7.2 SAMPLE SOURCE CODE:

CHAPTER 8

8.1 CONCLUSION

This paper investigated as-fast-as-possible search in PEKS with semantic security. We proposed the concept of SPCHS as a variant of PEKS. The new concept allows keyword-searchable ciphertexts to be generated with a hidden structure. Given a keyword search trapdoor, the search algorithm of SPCHS can disclose part of this hidden structure for guidance on finding out the ciphertexts of the queried keyword. Semantic security of SPCHS captures the privacy of the keywords and the invisibility of the hidden structures. We proposed an SPCHS scheme from scratch with semantic security in the RO model. The scheme generates keyword-searchable ciphertexts with a hidden star-like structure. It has search complexity mainly linear with the exact number of the ciphertexts containing the queried keyword. It outperforms existing PEKS schemes with semantic security, whose search complexity is linear with the number of all ciphertexts.

Friendbook: A Semantic-Based Friend Recommendation System for Social Networks

05/08/201902/07/2019 by admin

Existing social networking services recommend friends to users based on their social graphs, which may not be the most appropriate to reflect a user’s preferences on friend selection in real life. In this paper, we present Friendbook, a novel semantic-based friend recommendation system for social networks, which recommends friends to users based on their life styles instead of social graphs. By taking advantage of sensor-rich smartphones, Friendbook discovers life styles of users from user-centric sensor data, measures the similarity of life styles between users, and recommends friends to users if their life styles have high similarity. Inspired by text mining, we model a user’s daily life as life documents, from which his/her life styles are extracted by using the Latent Dirichlet Allocation algorithm.

We further propose a similarity metric to measure the similarity of life styles between users, and calculate users’ impact in terms of life styles with a friend-matching graph. Upon receiving a request, Friendbook returns a list of people with highest recommendation scores to the query user. Finally, Friendbook integrates a feedback mechanism to further improve the recommendation accuracy. We have implemented Friendbook on the Android-based smartphones, and evaluated its performance on both small-scale experiments and large-scale simulations. The results show that the recommendations accurately reflect the preferences of users in choosing friends.

1.2 INTRODUCTION:

What Is A Social Network?

Wikipedia defines a social network service as a service which “focuses on the building and verifying of online social networks for communities of people who share interests and activities, or who are interested in exploring the interests and activities of others, and which necessitates the use of software.”

A report published by OCLC provides the following definition of social networking sites: “Web sites primarily designed to facilitate interaction between users who share interests, attitudes and activities, such as Facebook, Mixi and MySpace.”

What Can Social Networks Be Used For?

Social networks can provide a range of benefits to members of an organization:

Support for learning: Social networks can enhance informal learning and support social connections within groups of learners and with those involved in the support of learning.

Support for members of an organisation: Social networks can potentially be used by all members of an organisation, and not just those involved in working with students. Social networks can help the development of communities of practice.

Engaging with others: Passive use of social networks can provide valuable business intelligence and feedback on institutional services (although this may give rise to ethical concerns).

Ease of access to information and applications: The ease of use of many social networking services can provide benefits to users by simplifying access to other tools and applications. The Facebook Platform provides an example of how a social networking service can be used as an environment for other tools.

Common interface: A possible benefit of social networks may be the common interface which spans work / social boundaries. Since such services are often used in a personal capacity the interface and the way the service works may be familiar, thus minimising training and support needed to exploit the services in a professional context. This can, however, also be a barrier to those who wish to have strict boundaries between work and social activities.

Examples of popular social networking services include:

Facebook: Facebook is a social networking Web site that allows people to communicate with their friends and exchange information. In May 2007 Facebook launched the Facebook Platform, which provides a framework for developers to create applications that interact with core Facebook features.

MySpace: MySpace is a social networking Web site offering an interactive, user-submitted network of friends, personal profiles, blogs and groups, commonly used for sharing photos, music and videos.

Ning: An online platform for creating social Web sites and social networks aimed at users who want to create networks around specific interests or have limited technical skills.

Twitter: Twitter is an example of a micro-blogging service. Twitter can be used in a variety of ways including sharing brief information with users and providing support for one’s peers.

Note that this brief list of popular social networking services omits popular social sharing services such as Flickr and YouTube.

Opportunities and Challenges

The popularity and ease of use of social networking services have excited institutions with their potential in a variety of areas. However, effective use of social networking services poses a number of challenges for institutions, including long-term sustainability of the services; user concerns over use of social tools in a work or study context; a variety of technical issues; and legal issues such as copyright, privacy and accessibility.

Institutions would be advised to consider carefully the implications before promoting significant use of such services.

Twenty years ago, people typically made friends with others who lived or worked close to them, such as neighbors or colleagues. We call friends made in this traditional fashion G-friends, which stands for geographical location-based friends, because they are influenced by the geographical distances between each other. With the rapid advances in social networks, services such as Facebook, Twitter and Google+ have provided us with revolutionary ways of making friends. According to Facebook statistics, a user has an average of 130 friends, perhaps more than at any other time in history. One challenge with existing social networking services is how to recommend a good friend to a user. Most of them rely on pre-existing user relationships to pick friend candidates.

For example, Facebook relies on a social link analysis among those who already share common friends and recommends symmetrical users as potential friends. Unfortunately, this approach may not be the most appropriate based on recent sociology findings. According to these studies, the rules to group people together include: 1) habits or life style; 2) attitudes; 3) tastes; 4) moral standards; 5) economic level; and 6) people they already know. Apparently, rule #3 and rule #6 are the mainstream factors considered by existing recommendation systems. Life styles, by contrast, are usually closely correlated with daily routines and activities. Therefore, if we can gather information on users’ daily routines and activities, we can exploit rule #1 and recommend friends to people based on their similar life styles. This recommendation mechanism can be deployed as a standalone app on smartphones or as an add-on to existing social network frameworks. In both cases, Friendbook can help mobile phone users find friends either among strangers or within a certain group as long as they share similar life styles.

1.3 LITERATURE SURVEY:

1. Probabilistic mining of socio-geographic routines from mobile phone data

AUTHORS: K. Farrahi and D. Gatica-Perez

There is relatively little work on the investigation of large-scale human data in terms of multimodality for human activity discovery. In this paper, we suggest that human interaction data, or human proximity, obtained by mobile phone Bluetooth sensor data, can be integrated with human location data, obtained by mobile cell tower connections, to mine meaningful details about human activities from large and noisy datasets. We propose a model, called bag of multimodal behavior, that integrates the modeling of variations of location over multiple time-scales and the modeling of interaction types from proximity. Our representation is simple yet robust to characterize real-life human behavior sensed from mobile phones, which are devices capable of capturing large-scale data known to be noisy and incomplete. We use an unsupervised approach, based on probabilistic topic models, to discover latent human activities in terms of the joint interaction and location behaviors of 97 individuals over the course of approximately a 10-month period using data from MIT’s Reality Mining project. Some of the human activities discovered with our multimodal data representation include “going out from 7 pm-midnight alone” and “working from 11 am-5 pm with 3-5 other people,” further finding that this activity dominantly occurs on specific days of the week. Our methodology also finds dominant work patterns occurring on other days of the week. We further demonstrate the feasibility of the topic modeling framework for human routine discovery by predicting missing multimodal phone data at specific times of the day.

2. Collaborative and structural recommendation of friends using weblog-based social network analysis

AUTHORS: W. H. Hsu, A. King, M. Paradesi, T. Pydimarri, and T. Weninger

In this paper, we address the problem of link recommendation in weblogs and similar social networks. First, we present an approach based on collaborative recommendation using the link structure of a social network and content-based recommendation using mutual declared interests. Next, we describe the application of this approach to a small representative subset of a large real-world social network: the user/community network of the blog service LiveJournal. We then discuss the ground features available in LiveJournal’s public user information pages and describe some graph algorithms for analysis of the social network. These are used to identify candidates, provide ground truth for recommendations, and construct features for learning the concept of a recommended link. Finally, we compare the performance of this machine learning approach to that of the rudimentary recommender system provided by LiveJournal.

3. Understanding Transportation Modes Based on GPS Data for Web Applications.

AUTHORS: Y. Zheng, Y. Chen, Q. Li, X. Xie, and W.-Y. Ma.

User mobility has given rise to a variety of Web applications, in which the global positioning system (GPS) plays many important roles in bridging between these applications and end users. As a kind of human behavior, people’s transportation modes, such as walking and driving, can provide pervasive computing systems with more contextual information and enrich a user’s mobility with informative knowledge. In this article, we report on an approach based on supervised learning to automatically infer users’ transportation modes, including driving, walking, taking a bus and riding a bike, from raw GPS logs. Our approach consists of three parts: a change point-based segmentation method, an inference model and a graph-based post-processing algorithm. First, we propose a change point-based segmentation method to partition each GPS trajectory into separate segments of different transportation modes. Second, from each segment, we identify a set of sophisticated features, which are not affected by differing traffic conditions (e.g., a person’s direction when in a car is constrained more by the road than any change in traffic conditions). Later, these features are fed to a generative inference model to classify the segments of different modes. Third, we conduct graph-based post-processing to further improve the inference performance. This post-processing algorithm considers both the commonsense constraints of the real world and typical user behaviors based on locations in a probabilistic manner. The advantages of our method over the related works include three aspects. 1) Our approach can effectively segment trajectories containing multiple transportation modes. 2) Our work mined the location constraints from user-generated GPS logs, while being independent of additional sensor data and map information like road networks and bus stops. 3) The model learned from the dataset of some users can be applied to infer GPS data from others. 
Using the GPS logs collected by 65 people over a period of 10 months, we evaluated our approach via a set of experiments. As a result, based on the change-point-based segmentation method and Decision Tree-based inference model, we achieved prediction accuracy greater than 71 percent. Further, using the graph-based post-processing algorithm, the performance attained a 4-percent enhancement.

4. Online friend recommendation through personality matching and collaborative filtering

AUTHORS: L. Bian and H. Holtzman

Most social network websites rely on people’s proximity on the social graph for friend recommendation. In this paper, we present Matchmaker, a collaborative filtering friend recommendation system based on personality matching. The goal of Matchmaker is to leverage the social information and mutual understanding among people in existing social network connections, and produce friend recommendations based on rich contextual data from people’s physical world interactions. Matchmaker allows users’ network to match them with similar TV characters, and uses relationships in the TV programs as parallel comparison matrix to suggest to the users friends that have been voted to suit their personality the best. The system’s ranking schema allows progressive improvement on the personality matching consensus and more diverse branching of users’ social network connections. Lastly, our user study shows that the application can also induce more TV content consumption by driving users’ curiosity in the ranking process.

CHAPTER 2

2.0 SYSTEM ANALYSIS:

2.1 EXISTING SYSTEM:

Most friend suggestion mechanisms rely on pre-existing user relationships to pick friend candidates. For example, Facebook relies on a social link analysis among those who already share common friends and recommends symmetrical users as potential friends. The rules to group people together include:

1) Habits or life style
2) Attitudes
3) Tastes
4) Moral standards
5) Economic level; and
6) People they already know.

Apparently, rule #3 and rule #6 are the mainstream factors considered by existing recommendation systems.

2.1.1 DISADVANTAGES:

Existing social networking services recommend friends to users based on their social graphs, which may not be the most appropriate to reflect a user’s preferences on friend selection in real life.

2.2 PROPOSED SYSTEM:

Friendbook is a novel semantic-based friend recommendation system for social networks, which recommends friends to users based on their life styles instead of social graphs.
By taking advantage of sensor-rich smartphones, Friendbook discovers life styles of users from user-centric sensor data, measures the similarity of life styles between users, and recommends friends to users if their life styles have high similarity.
We model a user’s daily life as life documents, from which his/her life styles are extracted by using the Latent Dirichlet Allocation algorithm.
We propose a similarity metric to measure the similarity of life styles between users, and calculate users’ impact in terms of life styles with a friend-matching graph.
We integrate a linear feedback mechanism that exploits the user’s feedback to improve recommendation accuracy.

2.2.1 ADVANTAGES:

Recommend potential friends to users if they share similar life styles.
The feedback mechanism allows us to measure the satisfaction of users by providing a user interface in which the user rates the friend list.

2.3 HARDWARE & SOFTWARE REQUIREMENTS:

2.3.1 HARDWARE REQUIREMENT:

- Processor – Pentium IV
- Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Key Board – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA

2.3.2 SOFTWARE REQUIREMENTS:

JAVA

Operating System : Windows XP or Win7
Front End : JAVA JDK 1.7
Back End : MYSQL Server
Server : Apache Tomcat Server
Script : JSP Script
Document : MS-Office 2007

CHAPTER 3

3.0 SYSTEM DESIGN:

Data Flow Diagram / Use Case Diagram / Flow Diagram:

The DFD is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on these data, and the output data generated by the system.

The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components. These components are the system process, the data used by the process, an external entity that interacts with the system and the information flows in the system.

DFD shows how the information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.

A DFD may be used to represent a system at any level of abstraction. A DFD may be partitioned into levels that represent increasing information flow and functional detail.

NOTATION:

SOURCE OR DESTINATION OF DATA:

External sources or destinations, which may be people or organizations or other entities

DATA STORE:

Here the data referenced by a process is stored and retrieved.

PROCESS:

People, procedures or devices that produce data. The physical component is not identified.

DATA FLOW:

Data moves in a specific direction from an origin to a destination. The data flow is a “packet” of data.

MODELING RULES:

There are several common modeling rules when creating DFDs:

All processes must have at least one data flow in and one data flow out.
All processes should modify the incoming data, producing new forms of outgoing data.
Each data store must be involved with at least one data flow.
Each external entity must be involved with at least one data flow.
A data flow must be attached to at least one process.

3.1 ARCHITECTURE DIAGRAM:

3.2 DATAFLOW DIAGRAM:

UML DIAGRAMS:

3.2 USE CASE DIAGRAM:

3.3 CLASS DIAGRAM:

3.4 SEQUENCE DIAGRAM:

3.5 ACTIVITY DIAGRAM:

CHAPTER 4

4.0 IMPLEMENTATION:

4.1 ALGORITHM

4.2 MODULES:

LIFE STYLE MODELING
ACTIVITY RECOGNITION
FRIEND-MATCHING GRAPH CONSTRUCTION
USER IMPACT RANKING

4.3 MODULE DESCRIPTION:

LIFE STYLE MODELING:

Life styles and activities are reflections of daily lives at two different levels, where daily lives can be treated as a mixture of life styles and life styles as a mixture of activities. This is analogous to the treatment of documents as an ensemble of topics and topics as an ensemble of words. By taking advantage of recent developments in the field of text mining, we model the daily lives of users as life documents, the life styles as topics, and the activities as words. Given “documents”, the probabilistic topic model can discover the probabilities of underlying “topics”. Therefore, we adopt the probabilistic topic model to discover the probabilities of hidden “life styles” from the “life documents”. Our objective is to discover the life style vector for each user given the life documents of all users.
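The mixture assumption behind this model can be illustrated with a small Java sketch. It shows only the forward direction (how a life style vector and per-style activity probabilities combine into the activity distribution of a life document); the topic model itself works in the opposite direction and infers the life style vector from observed activities. The class name, method name and all probabilities below are invented for illustration.

```java
public class LifeDocumentMixture {
    /**
     * activityGivenStyle[k][w] = p(activity w | life style k)
     * styleGivenUser[k]        = p(life style k | the user's life document)
     * Returns p(activity w | life document) = sum over k of the product of the two.
     */
    public static double[] activityDistribution(double[][] activityGivenStyle,
                                                double[] styleGivenUser) {
        int activities = activityGivenStyle[0].length;
        double[] result = new double[activities];
        for (int k = 0; k < styleGivenUser.length; k++) {
            for (int w = 0; w < activities; w++) {
                result[w] += activityGivenStyle[k][w] * styleGivenUser[k];
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical activities: sitting, walking, running.
        double[][] activityGivenStyle = {
            {0.7, 0.2, 0.1}, // hypothetical "office" life style
            {0.1, 0.3, 0.6}  // hypothetical "sporty" life style
        };
        double[] styleGivenUser = {0.5, 0.5}; // the user's life style vector
        for (double p : activityDistribution(activityGivenStyle, styleGivenUser)) {
            System.out.println(p);
        }
    }
}
```

In this picture the life style vector plays the role of the topic proportions of a document, and the activities play the role of words.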

ACTIVITY RECOGNITION:

We need to first classify or recognize the activities of users. Life styles are usually reflected as a mixture of motion activities with different occurrence probability. Generally speaking, there are two mainstream approaches: supervised learning and unsupervised learning. For both approaches, mature techniques have been developed and tested. In practice, the number of activities involved in the analysis is unpredictable and it is difficult to collect a large set of ground truth data for each activity, which makes supervised learning algorithms unsuitable for our system. Therefore, we use unsupervised learning approaches to recognize activities.
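The text does not say which unsupervised algorithm is used. As one possible illustration, the sketch below clusters one-dimensional sensor features with k-means; the choice of k-means, the class name `ActivityClustering` and the sample values are all assumptions made for this example.

```java
public class ActivityClustering {
    /**
     * Clusters 1-D feature values with k-means. The centroids array is the
     * initial guess and is updated in place; the returned array holds the
     * cluster index assigned to each sample.
     */
    public static int[] cluster(double[] samples, double[] centroids, int iterations) {
        int[] labels = new int[samples.length];
        for (int it = 0; it < iterations; it++) {
            // Assignment step: attach every sample to its nearest centroid.
            for (int i = 0; i < samples.length; i++) {
                int best = 0;
                for (int k = 1; k < centroids.length; k++) {
                    if (Math.abs(samples[i] - centroids[k]) < Math.abs(samples[i] - centroids[best])) {
                        best = k;
                    }
                }
                labels[i] = best;
            }
            // Update step: move every centroid to the mean of its samples.
            double[] sum = new double[centroids.length];
            int[] count = new int[centroids.length];
            for (int i = 0; i < samples.length; i++) {
                sum[labels[i]] += samples[i];
                count[labels[i]]++;
            }
            for (int k = 0; k < centroids.length; k++) {
                if (count[k] > 0) {
                    centroids[k] = sum[k] / count[k];
                }
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        // Hypothetical accelerometer magnitudes: low values ~ still, high values ~ moving.
        double[] samples = {0.1, 0.2, 0.15, 5.0, 5.2, 4.9};
        int[] labels = cluster(samples, new double[] {0.0, 1.0}, 10);
        System.out.println(java.util.Arrays.toString(labels));
    }
}
```

No labelled ground truth is needed: each cluster index simply stands for one recognized activity, which can then serve as a "word" in a life document.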

FRIEND-MATCHING GRAPH CONSTRUCTION:

To characterize relations among users, in this section, we propose the friend-matching graph to represent the similarity between their life styles and how they influence other people in the graph. In particular, we use the link weight between two users to represent the similarity of their life styles. Based on the friend-matching graph, we can obtain a user’s affinity reflecting how likely this user will be chosen as another user’s friend in the network. We define a new similarity metric to measure the similarity between two life style vectors. Based on the similarity metric, we model the relations between users in real life as a friend-matching graph. The friend-matching graph has been constructed to reflect life style relations among users.
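A friend-matching graph of this kind can be sketched as a symmetric weight matrix. The similarity metric proposed for Friendbook is not given in this text, so the sketch below substitutes plain cosine similarity as a stand-in; the similarity threshold and all names are likewise assumptions.

```java
public class FriendMatchingGraph {
    /** Cosine similarity of two life style vectors; a stand-in, not the Friendbook metric. */
    public static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // an all-zero vector is similar to nothing
        }
        return dot / Math.sqrt(normA * normB);
    }

    /**
     * Builds the graph as a symmetric matrix: weights[i][j] is the similarity
     * between users i and j, or 0 (no edge) when it falls below the threshold.
     */
    public static double[][] build(double[][] lifeStyleVectors, double threshold) {
        int n = lifeStyleVectors.length;
        double[][] weights = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                double s = cosine(lifeStyleVectors[i], lifeStyleVectors[j]);
                if (s >= threshold) {
                    weights[i][j] = s;
                    weights[j][i] = s;
                }
            }
        }
        return weights;
    }

    public static void main(String[] args) {
        double[][] users = {{1.0, 0.0}, {1.0, 0.0}, {0.0, 1.0}}; // hypothetical life style vectors
        double[][] graph = build(users, 0.5);
        System.out.println(graph[0][1] + " " + graph[0][2]);
    }
}
```

Users 0 and 1 have identical life style vectors and are linked; user 2 shares no life style with them and stays unconnected.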

USER IMPACT RANKING:

The impact ranking means a user’s capability to establish friendships in the network. In other words, the higher the ranking, the more easily the user can make friends, because he/she shares broader life styles with others. Once the ranking of a user is obtained, it provides guidelines to those who receive the recommendation list on how to choose friends. The ranking itself, however, should be independent of the query user. In other words, the ranking depends only on the graph structure of the friend-matching graph, which contains two aspects: 1) how the edges are connected; 2) how much weight there is on every edge. Moreover, the ranking should be used together with the similarity scores between the query user and the potential friend candidates, so that the recommended friends are those who not only share sufficient similarity with the query user, but are also popular ones through whom the query user can increase their own impact rankings.
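One way to obtain a ranking that depends only on edge connections and edge weights is a weighted PageRank-style iteration. The sketch below is an assumption made for illustration and is not taken from the Friendbook description; the damping factor, iteration count and names are hypothetical.

```java
import java.util.Arrays;

public class ImpactRanking {
    /**
     * Weighted PageRank-style iteration over a symmetric weight matrix.
     * Each user passes rank to neighbours in proportion to edge weight.
     */
    public static double[] rank(double[][] weights, double damping, int iterations) {
        int n = weights.length;
        double[] strength = new double[n]; // total edge weight per user
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                strength[i] += weights[i][j];
            }
        }
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);
        for (int it = 0; it < iterations; it++) {
            double dangling = 0.0; // rank held by isolated users, shared out evenly
            for (int j = 0; j < n; j++) {
                if (strength[j] == 0.0) {
                    dangling += rank[j];
                }
            }
            double[] next = new double[n];
            for (int i = 0; i < n; i++) {
                double incoming = 0.0;
                for (int j = 0; j < n; j++) {
                    if (strength[j] > 0.0) {
                        incoming += rank[j] * weights[j][i] / strength[j];
                    }
                }
                next[i] = (1.0 - damping) / n + damping * (incoming + dangling / n);
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Hypothetical graph: user 0 is linked to users 1 and 2, who are not linked to each other.
        double[][] weights = {{0, 1, 1}, {1, 0, 0}, {1, 0, 0}};
        System.out.println(Arrays.toString(rank(weights, 0.85, 50)));
    }
}
```

The result does not involve any query user; in a recommendation step it would still have to be combined with the similarity between the query user and each candidate.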

CHAPTER 5

5.0 SYSTEM STUDY:

5.1 FEASIBILITY STUDY:

Three key considerations involved in the feasibility analysis are:

ECONOMICAL FEASIBILITY
TECHNICAL FEASIBILITY
SOCIAL FEASIBILITY

5.1.1 ECONOMICAL FEASIBILITY:

5.1.2 TECHNICAL FEASIBILITY:

5.1.3 SOCIAL FEASIBILITY:

5.2 SYSTEM TESTING:

5.2.1 UNIT TESTING:

Description	Expected result
Test for application window properties.	All the properties of the windows are to be properly aligned and displayed.
Test for mouse operations.	All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions.

5.1.2 FUNCTIONAL TESTING:

Description	Expected result
Test for all modules.	All peers should communicate in the group.
Test for various peers in a distributed network framework, which should display all users available in the group.	The result after execution should be accurate.

5.1.3 NON-FUNCTIONAL TESTING:

Load testing
Performance testing
Usability testing
Reliability testing
Security testing

5.1.4 LOAD TESTING:

Description	Expected result
It is necessary to ascertain that the application behaves correctly under load when a ‘Server busy’ response is received.	Should designate another active node as a server.

5.1.5 PERFORMANCE TESTING:

Description	Expected result
This is required to assure that an application performs adequately, having the capability to handle many peers, delivering its results in the expected time and using an acceptable level of resources; it is an aspect of operational management.	Should handle large input values and produce accurate results in the expected time.

5.1.6 RELIABILITY TESTING:

Description	Expected result
This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in providing the application.	In case of failure of the server, an alternate server should take over the job.

5.2.7 SECURITY TESTING:

Description	Expected result
Check that the user identification is authenticated.	In case of failure, the user should not be connected to the framework.
Check whether group keys in a tree are shared by all peers.	The peers should know group key in the same group.

5.2.8 WHITE BOX TESTING:

Description	Expected result
Exercise all logical decisions on their true and false sides.	All the logical decisions must be valid.
Execute all loops at their boundaries and within their operational bounds.	All the loops must be finite.
Exercise internal data structures to ensure their validity.	All the data structures must be valid.

5.2.9 BLACK BOX TESTING:

Description	Expected result
To check for incorrect or missing functions.	All the functions must be valid.
To check for interface errors.	The entire interface must function normally.
To check for errors in data structures or external database access.	Database updates and retrievals must complete correctly.
To check for initialization and termination errors.	All the functions and data structures must be initialized properly and terminated normally.

All of the above system testing strategies are carried out, as the development, documentation and institutionalization of the proposed goals and related policies are essential.

CHAPTER 6

6.0 SOFTWARE DESCRIPTION:

6.1 JAVA TECHNOLOGY:

Java technology is both a programming language and a platform.

The Java Programming Language

The Java programming language is a high-level language that can be characterized by all of the following buzzwords:

- Simple
- Architecture neutral
- Object oriented
- Portable
- Distributed
- High performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure

6.2 THE JAVA PLATFORM:

The Java platform has two components:

The Java Virtual Machine (Java VM)
The Java Application Programming Interface (Java API)

You’ve already been introduced to the Java VM. It’s the base for the Java platform and is ported onto various hardware-based platforms.

The following figure depicts a program that’s running on the Java platform. As the figure shows, the Java API and the virtual machine insulate the program from the hardware.

6.3 WHAT CAN JAVA TECHNOLOGY DO?

The essentials: Objects, strings, threads, numbers, input and output, data structures, system properties, date and time, and so on.
Applets: The set of conventions used by applets.
Networking: URLs, TCP (Transmission Control Protocol), UDP (User Datagram Protocol) sockets, and IP (Internet Protocol) addresses.
Internationalization: Help for writing programs that can be localized for users worldwide. Programs can automatically adapt to specific locales and be displayed in the appropriate language.
Security: Both low level and high level, including electronic signatures, public and private key management, access control, and certificates.
Software components: Known as JavaBeans™, can plug into existing component architectures.
Object serialization: Allows lightweight persistence and communication via Remote Method Invocation (RMI).
Java Database Connectivity (JDBC™): Provides uniform access to a wide range of relational databases.

The Java platform also has APIs for 2D and 3D graphics, accessibility, servers, collaboration, telephony, speech, animation, and more. The following figure depicts what is included in the Java 2 SDK.

6.4 HOW WILL JAVA TECHNOLOGY CHANGE MY LIFE?

Get started quickly: Although the Java programming language is a powerful object-oriented language, it’s easy to learn, especially for programmers already familiar with C or C++.
Write less code: Comparisons of program metrics (class counts, method counts, and so on) suggest that a program written in the Java programming language can be four times smaller than the same program in C++.
Write better code: The Java programming language encourages good coding practices, and its garbage collection helps you avoid memory leaks. Its object orientation, its JavaBeans component architecture, and its wide-ranging, easily extendible API let you reuse other people’s tested code and introduce fewer bugs.
Develop programs more quickly: Your development time may be as much as twice as fast versus writing the same program in C++. Why? You write fewer lines of code and it is a simpler programming language than C++.
Avoid platform dependencies with 100% Pure Java: You can keep your program portable by avoiding the use of libraries written in other languages. The 100% Pure Java™ Product Certification Program has a repository of historical process manuals, white papers, brochures, and similar materials online.
Write once, run anywhere: Because 100% Pure Java programs are compiled into machine-independent byte codes, they run consistently on any Java platform.
Distribute software more easily: You can upgrade applets easily from a central server. Applets take advantage of the feature of allowing new classes to be loaded “on the fly,” without recompiling the entire program.

6.5 ODBC:

6.6 JDBC:

JDBC was announced in March of 1996. It was released for a 90 day public review that ended June 8, 1996. Because of user input, the final JDBC v1.0 specification was released soon after.

6.7 JDBC Goals:

The goals that were set for JDBC are important. They will give you some insight as to why certain classes and functionalities behave the way they do. The eight design goals for JDBC are as follows:

SQL Level API

SQL Conformance

JDBC must be implementable on top of common database interfaces

Provide a Java interface that is consistent with the rest of the Java system

Because of Java’s acceptance in the user community thus far, the designers feel that they should not stray from the current design of the core Java system.

Keep it simple

Use strong, static typing wherever possible

Strong typing allows more error checking to be done at compile time; as a result, fewer errors appear at run time.

Keep the common cases simple

Finally, we decided to proceed with the implementation using Java Networking.

For dynamically updating the cache table, we use an MS Access database.

Java is two things: a programming language and a platform.

Java is a high-level programming language that is all of the following:

- Simple
- Architecture-neutral
- Object-oriented
- Portable
- Distributed
- High-performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure

Compilation happens just once; interpretation occurs each time the program is executed. The figure illustrates how this works.

6.7 NETWORKING TCP/IP STACK:

The TCP/IP stack is shorter than the OSI one:

TCP is a connection-oriented protocol; UDP (User Datagram Protocol) is a connectionless protocol.

IP datagrams:

UDP:

UDP is also connectionless and unreliable. What it adds to IP is a checksum for the contents of the datagram and port numbers. These are used to give a client/server model – see later.

TCP:

TCP supplies logic to give a reliable connection-oriented protocol above IP. It provides a virtual circuit that two processes can use to communicate.
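Since the project is implemented with Java networking, the virtual circuit described above can be illustrated with the standard `java.net` classes. This is a minimal sketch, not code from the project: the class name `TcpEchoSketch` and the one-line echo exchange are hypothetical, and both endpoints run in one process on the loopback interface purely for demonstration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

/** Sketch: one TCP connection on the loopback interface that echoes a single line. */
class TcpEchoSketch {

    static BufferedReader reader(Socket s) throws IOException {
        return new BufferedReader(new InputStreamReader(s.getInputStream(), StandardCharsets.UTF_8));
    }

    static PrintWriter writer(Socket s) throws IOException {
        // autoflush = true, so println() pushes the line onto the connection
        return new PrintWriter(new OutputStreamWriter(s.getOutputStream(), StandardCharsets.UTF_8), true);
    }

    static String echoOnce(String message) {
        // Port 0 asks the operating system for any free port.
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            Thread serverThread = new Thread(() -> {
                try (Socket peer = server.accept();
                     BufferedReader in = reader(peer);
                     PrintWriter out = writer(peer)) {
                    out.println(in.readLine());      // send the received line straight back
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            serverThread.start();

            String reply;
            try (Socket client = new Socket(InetAddress.getLoopbackAddress(), server.getLocalPort());
                 PrintWriter out = writer(client);
                 BufferedReader in = reader(client)) {
                out.println(message);
                reply = in.readLine();
            }
            serverThread.join();
            return reply;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(echoOnce("hello"));
    }
}
```

Neither side handles retransmission or ordering itself; TCP supplies that reliability beneath the stream interface.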

Internet addresses

In order to use a service, you must be able to find it. The Internet uses an address scheme for machines so that they can be located. The address is a 32 bit integer which gives the IP address.

Network address:

Class A uses 8 bits for the network address with 24 bits left over for other addressing. Class B uses 16 bit network addressing. Class C uses 24 bit network addressing and class D uses all 32.
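In classful addressing, the class is determined by the leading bits of the first octet (0 for A, 10 for B, 110 for C, 1110 for D, with the remainder reserved as class E). The following sketch shows that test; the class and method names are hypothetical, and the return value of -1 for classes without a network/host split is a convention chosen for this illustration.

```java
/** Sketch: classful address class from the first octet of an IPv4 address. */
class AddressClassSketch {

    /** Returns 'A'..'E' according to the leading bits of the first octet (0..255). */
    static char addressClass(int firstOctet) {
        if (firstOctet < 0 || firstOctet > 255) {
            throw new IllegalArgumentException("octet out of range: " + firstOctet);
        }
        if ((firstOctet & 0x80) == 0x00) return 'A';   // 0xxxxxxx
        if ((firstOctet & 0xC0) == 0x80) return 'B';   // 10xxxxxx
        if ((firstOctet & 0xE0) == 0xC0) return 'C';   // 110xxxxx
        if ((firstOctet & 0xF0) == 0xE0) return 'D';   // 1110xxxx
        return 'E';                                    // 1111xxxx (reserved)
    }

    /** Network-address bits for classes A, B and C; -1 where no split applies. */
    static int networkBits(char addressClass) {
        switch (addressClass) {
            case 'A': return 8;
            case 'B': return 16;
            case 'C': return 24;
            default:  return -1;
        }
    }

    public static void main(String[] args) {
        int[] samples = {10, 172, 192, 224};
        for (int octet : samples) {
            char c = addressClass(octet);
            System.out.println(octet + " -> class " + c + ", network bits " + networkBits(c));
        }
    }
}
```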

Subnet address:

Internally, the UNIX network is divided into sub networks. Building 11 is currently on one sub network and uses 10-bit addressing, allowing 1024 different hosts.

Host address:

8 bits are finally used for host addresses within our subnet. This places a limit of 256 machines that can be on the subnet.

Total address:

The 32 bit address is usually written as 4 integers separated by dots.
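The conversion between the 32-bit integer and the dotted form is a matter of shifting out one byte at a time. A small sketch, with hypothetical class and method names:

```java
/** Sketch: converting between a 32-bit IPv4 address and dotted-decimal text. */
class DottedQuadSketch {

    /** Formats the four bytes of the address, most significant first. */
    static String toDotted(int address) {
        return ((address >>> 24) & 0xFF) + "."
             + ((address >>> 16) & 0xFF) + "."
             + ((address >>> 8) & 0xFF) + "."
             + (address & 0xFF);
    }

    /** Parses text such as "192.168.1.10" back into the 32-bit value. */
    static int fromDotted(String dotted) {
        String[] parts = dotted.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("expected four octets: " + dotted);
        }
        int address = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("octet out of range: " + part);
            }
            address = (address << 8) | octet;        // append the next byte
        }
        return address;
    }

    public static void main(String[] args) {
        int address = 0xC0A8010A;
        String text = toDotted(address);
        System.out.println(text + " round-trips: " + (fromDotted(text) == address));
    }
}
```

Java's `int` is signed, so addresses with the top bit set print as negative numbers; the unsigned shift `>>>` and the `0xFF` mask keep each octet in the range 0 to 255.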

Port addresses

Sockets:

#include <sys/types.h>
#include <sys/socket.h>

int socket(int family, int type, int protocol);

6.8 JFREE CHART:

JFreeChart is a free 100% Java chart library that makes it easy for developers to display professional quality charts in their applications. JFreeChart’s extensive feature set includes:

A consistent and well-documented API, supporting a wide range of chart types;

A flexible design that is easy to extend, and targets both server-side and client-side applications;

Support for many output types, including Swing components, image files (including PNG and JPEG), and vector graphics file formats (including PDF, EPS and SVG);

JFreeChart is “open source” or, more specifically, free software. It is distributed under the terms of the GNU Lesser General Public Licence (LGPL), which permits use in proprietary applications.

6.8.1. Map Visualizations:

6.8.2. Time Series Chart Interactivity

6.8.3. Dashboards

6.8.4. Property Editors

CHAPTER 7

7.0 APPENDIX

7.1 SAMPLE SCREEN SHOTS:

INDEX PAGE:

ADMIN LOGIN:

ADMIN HOME PAGE:

USER LIST:

NEW USER REGISTRATION:

USER LOGIN:

USER HOME PAGE:

ADDING FRIENDS:

MY FRIENDS LIST:

RECOMMEND SITES FROM FRIENDS:

INDEX PAGE:

7.2 SAMPLE SOURCE CODE:

CHAPTER 8

8.1 CONCLUSION & FUTURE ENHANCEMENT:

In this paper, we presented the design and implementation of Friendbook, a semantic-based friend recommendation system for social networks. Different from the friend recommendation mechanisms relying on social graphs in existing social networking services, Friendbook extracted life styles from user-centric data collected from sensors on the smartphone and recommended potential friends to users if they share similar life styles. We implemented Friendbook on Android-based smartphones, and evaluated its performance on both small-scale experiments and large-scale simulations. The results showed that the recommendations accurately reflect the preferences of users in choosing friends. Beyond the current prototype, the future work can be four-fold. First, we would like to evaluate our system on large-scale field experiments. Second, we intend to implement the life style extraction using LDA and the iterative matrix-vector multiplication method in user impact ranking incrementally, so that Friendbook would be scalable to large-scale systems. Third, the similarity threshold used for the friend-matching graph is fixed in our current prototype of Friendbook.

It would be interesting to explore the adaptation of the threshold for each edge and see whether it can better represent the similarity relationship on the friend-matching graph. Finally, we plan to incorporate more sensors on the mobile phones into the system and also utilize the information from wearable equipment (e.g., Fitbit, iwatch, Google glass, Nike+, and Galaxy Gear) to discover more interesting and meaningful life styles. For example, we can incorporate the sensor data source from Fitbit, which extracts the user’s daily fitness infograph, and the user’s places of interest from GPS traces to generate an infograph of the user as a “document”. From the infograph, one can easily visualize a user’s life style, which will make the recommendation more meaningful. We also expect to incorporate Friendbook into existing social services (e.g., Facebook, Twitter, LinkedIn) so that Friendbook can utilize more information for life style discovery, which should improve the recommendation experience in the future.

Energy Efficient Virtual Network Embedding for Cloud Networks

05/08/201902/07/2019 by admin

In this paper, we propose an energy efficient virtual network embedding (EEVNE) approach for cloud computing networks, where power savings are introduced by consolidating resources in the network and data centers. We model our approach in an IP over WDM network using mixed integer linear programming (MILP). The performance of the EEVNE approach is compared with two approaches from the literature: the bandwidth cost approach (CostVNE) and the energy aware approach (VNE-EA). The CostVNE approach optimizes the use of available bandwidth, while the VNE-EA approach minimizes the power consumption by reducing the number of activated nodes and links without taking into account the granular power consumption of the data centers and the different network devices.

The results show that the EEVNE model achieves a maximum power saving of 60% (average 20%) compared to the CostVNE model under an energy inefficient data center power profile. We develop a heuristic, real-time energy optimized VNE (REOViNE), with power savings approaching those of the EEVNE model. We also compare the different approaches adopting energy efficient data center power profile. Furthermore, we study the impact of delay and node location constraints on the energy efficiency of virtual network embedding. We also show how VNE can impact the design of optimally located data centers for minimal power consumption in cloud networks. Finally, we examine the power savings and spectral efficiency benefits that VNE offers in optical orthogonal division multiplexing networks.

INTRODUCTION:

The ever growing uptake of cloud computing as a widely accepted computing paradigm calls for novel architectures to support QoS and energy efficiency in networks and data centers. Estimates indicate that in the long term, if current trends continue, the annual energy bill paid by data center operators will exceed the cost of equipment. Given the ecological and economic impact, both academia and industry are focusing efforts on developing energy efficient paradigms for cloud computing. It has been stated that the success of future cloud networks, where clients are expected to be able to specify the data rate and processing requirements for hosted applications and services, will greatly depend on network virtualization. The form of cloud computing service offering under study here is Infrastructure as a Service (IaaS). IaaS is the delivery of virtualized and dynamically scalable computing power, storage and networking on demand to clients on a pay as you go basis.

Network virtualization allows multiple heterogeneous virtual network architectures (comprising virtual nodes and links) to coexist on a shared physical platform, known as the substrate network which is owned and operated by an infrastructure provider (InP) or cloud service provider whose aim is to earn a profit from leasing network resources to its customers (Service Providers (SPs)). It provides scalability, customised and on demand allocation of resources and the promise of efficient use of network resources. Network virtualization is therefore a strong proponent for the realization of an efficient IaaS framework in cloud networks. InPs should have a resource allocation framework that reserves and allocates physical resources to elements such as virtual nodes and virtual links. Resource allocation is done using a class of algorithms commonly known as “virtual network embedding (VNE)” algorithms. The dynamic mapping of virtual resources onto the physical hardware maximizes the benefits gained from existing hardware. The VNE problem can be either Offline or Online. In offline problems all the virtual network requests (VNRs) are known and scheduled in advance while for the online problem, VNRs arrive dynamically and can stay in the network for an arbitrary duration.

Both online and offline problems are known to be NP-hard. With constraints on virtual nodes and links, the offline VNE problem can be reduced to the NP-hard multiway separator problem. As a result, most of the work done in this area has focused on the design of heuristic algorithms and the use of networks with minimal complexity when solving mixed integer linear programming (MILP) models. Network virtualization has been proposed as an enabler of energy savings by means of resource consolidation. In all these proposals, the VNE models and/or algorithms do not address the link embedding problem as a multi-layer problem spanning from the virtualization layer through the IP layer and all the way to the optical layer. Except for the authors in [14], the others do not consider the power consumption of network ports/links as being related to the actual traffic passing through them.

In contrast, we take a generic, detailed and accurate approach towards energy efficient VNE (EEVNE) where we allow the model to decide the optimum approach to minimize the total network and data center server power consumption. We consider the granular power consumption of various network elements that form the network engine in backbone networks as well as the power consumption in data centers. We develop a MILP model and a real-time heuristic to represent the EEVNE approach for clouds in IP over WDM networks with data centers. We study the energy efficiency considering two different power consumption profiles for servers in data centers: an energy inefficient power profile and an energy efficient power profile. Our work also investigates the impact of location and delay constraints in a practical enterprise solution of VNE in clouds. Furthermore, we show how VNE can impact the design problem of optimally locating data centers for minimal power consumption in cloud networks.

LITERATURE SURVEY:

RESOURCE ALLOCATION IN A NETWORK-BASED CLOUD COMPUTING ENVIRONMENT: DESIGN CHALLENGES

AUTHOR: M. A. Sharkh, M. Jammal, A. Shami, and A. Ouda

PUBLISH: IEEE Commun. Mag., vol. 51, no. 11, pp. 46–52, 2013.

EXPLANATION:

Cloud computing is a utility computing paradigm that has become a solid base for a wide array of enterprise and end-user applications. Providers offer varying service portfolios that differ in resource configurations and provided services. A comprehensive solution for resource allocation is fundamental to any cloud computing service provider. Any resource allocation model has to consider computational resources as well as network resources to accurately reflect practical demands. Another aspect that should be considered while provisioning resources is energy consumption. This aspect is getting more attention from industrial and government parties. Calls for the support of green clouds are gaining momentum. With that in mind, resource allocation algorithms aim to accomplish the task of scheduling virtual machines on the servers residing in data centers and consequently scheduling network resources while complying with the problem constraints. Several external and internal factors that affect the performance of resource allocation models are introduced in this article. These factors are discussed in detail, and research gaps are pointed out. Design challenges are discussed with the aim of providing a reference to be used when designing a comprehensive energy-aware resource allocation model for cloud computing data centers.

DISTRIBUTED ENERGY EFFICIENT CLOUDS OVER CORE NETWORKS

AUTHOR: A. Q. Lawey, T. E. H. El-Gorashi, and J. M. H. Elmirghani

PUBLISH: IEEE J. Lightw. Technol., vol. 32, no. 7, pp. 1261–1281, Jan. 2014.

EXPLANATION:

In this paper, we introduce a framework for designing energy efficient cloud computing services over non-bypass IP/WDM core networks. We investigate network related factors including the centralization versus distribution of clouds and the impact of demand, content popularity and access frequency on the clouds placement, and cloud capability factors including the number of servers, switches and routers and amount of storage required in each cloud. We study the optimization of three cloud services: cloud content delivery, storage as a service (StaaS), and virtual machines (VMs) placement for processing applications. First, we develop a mixed integer linear programming (MILP) model to optimize cloud content delivery services. Our results indicate that replicating content into multiple clouds based on content popularity yields 43% total saving in power consumption compared to power un-aware centralized content delivery. Based on the model insights, we develop an energy efficient cloud content delivery heuristic, DEER-CD, with comparable power efficiency to the MILP results. Second, we extend the content delivery model to optimize StaaS applications. The results show that migrating content according to its access frequency yields up to 48% network power savings compared to serving content from a single central location. Third, we optimize the placement of VMs to minimize the total power consumption. Our results show that slicing the VMs into smaller VMs and placing them in proximity to their users saves 25% of the total power compared to a single virtualized cloud scenario. We also develop a heuristic for real time VM placement (DEER-VM) that achieves comparable power savings.

REDUCING POWER CONSUMPTION IN EMBEDDING VIRTUAL INFRASTRUCTURES

AUTHOR: B. Wang, X. Chang, J. Liu, and J. K. Muppala

PUBLISH: Proc. IEEE Globecom Workshops, Dec. 3–7, 2012, pp. 714–718.

EXPLANATION:

Network virtualization is considered to be not only an enabler to overcome the inflexibility of the current Internet infrastructure but also an enabler to achieve an energy-efficient Future Internet. Virtual network embedding (VNE) is a critical issue in network virtualization technology. This paper explores a joint power-aware node and link resource allocation approach to handle the VNE problem with the objective of minimizing energy consumption. We first present a generalized power consumption model of embedding a VN. Then we formulate the problem as a mixed integer program and propose embedding algorithms. Simulation results demonstrate that the proposed algorithms perform better than the existing algorithms in terms of the power consumption in the overprovisioned scenarios.

CHAPTER 2

2.0 SYSTEM ANALYSIS

2.1 EXISTING SYSTEM:

Existing work on disaster-resilient optical datacenter networks used integer linear programming (ILP) and heuristics to address content placement, routing, and protection of network and content for geographically distributed cloud services delivered by optical networks. Models and heuristics have been developed to minimize the delay and power consumption of clouds over IP/WDM networks. Other authors exploited anycast routing by intelligently selecting destinations and routes for user traffic served by clouds over optical networks, as opposed to unicast traffic, while switching off unused network elements. A unified, online, and weighted routing and scheduling algorithm has been presented for a typical optical cloud infrastructure considering the energy consumption of the network and IT resources.

Another study provided an optimization-based framework, where the objective functions range from minimizing the energy and bandwidth cost to minimizing the total carbon footprint subject to QoS constraints. Their model decides where to build a data center, how many servers are needed in each data center and how to route requests. In earlier work, we built a MILP model to study the energy efficiency of public clouds for content delivery over non-bypass IP/WDM core networks. The model optimizes cloud external factors, including the location of the cloud in the IP/WDM network and whether the cloud should be centralized or distributed, and cloud internal capability factors, including the number of servers, internal LAN switches, routers, and amount of storage required in each cloud.

2.1.1 DISADVANTAGES:

(i) Studying the impact of small content (storage) size on the energy efficiency of cloud content delivery

(ii) Developing a real time heuristic for energy aware content delivery based on the content delivery model insights,

(iii) Extending the content delivery model to study the Storage as a Service (StaaS) application,

(iv) ILP model for energy aware cloud VM placement and designing a heuristic to mimic the model behaviour in real time.

2.2 PROPOSED SYSTEM:

We developed a MILP model which attempts to minimize the bandwidth cost of embedding a VNR. The virtual network embedding energy aware (VNE-EA) model minimizes the energy consumption by imposing the notion that the power consumption is minimized by switching off substrate links and nodes. Its authors also assume that the power saved in switching off a substrate link is the same as the power saved by switching off a substrate node.

Other authors assumed that the power consumption in the network is insensitive to the number of ports used. They also seek to minimize the number of active working nodes and links. Botero and Hesselbach have proposed a model for energy efficiency using load balancing and have also developed a dynamic heuristic that reconfigures the embedding for energy efficiency once it is performed. They have implemented and evaluated their MILP models and heuristic algorithms using the ALEVIN Framework. The ALEVIN Framework is a good tool for developing, comparing and analyzing VNE algorithms.

The performance of the EEVNE approach is compared with two approaches from the literature: the bandwidth cost approach (CostVNE) and the energy aware approach (VNE-EA). The CostVNE approach optimizes the use of available bandwidth, while the VNE-EA approach minimizes the power consumption by reducing the number of activated nodes and links without taking into account the granular power consumption of the data centers and the different network devices.

The results show that the EEVNE model achieves a maximum power saving of 60% (average 20%) compared to the CostVNE model under energy inefficient data center power profile. We develop a heuristic, real-time energy optimized VNE (REOViNE), with power savings approaching those of the EEVNE model.

2.2.1 ADVANTAGES:

We are however unable to compare our model and heuristic to the implemented algorithms on the platform for the following reasons:

1. Our input parameters are not compatible with the existing models and algorithms on the platform. Extensive extensions to the algorithms and models would be needed for them to include the optical layer. Our parameters include, among others: the distance in km between links for us to determine the number of EDFAs or regenerators needed on a link, the wavelength rate, the number of wavelengths in a fiber, and the power consumption of EDFAs, transponders, regenerators, router ports, optical cross connects, multiplexers, de-multiplexers, etc.

2. The assumptions made in the calculation of power in our model and the models on the platform are different. We define the power consumption to its fine granularity to include power consumed due to traffic on each element that forms the network engine. One of our main contributions in this work is the inclusion of the optical layer in link embedding which is currently not supported by any of the algorithms on the ALEVIN platform.

An earlier study developed a generalized power consumption model of embedding a VNR and formulated it as a MILP model; however, its authors also assumed that the power consumption of the network ports is independent of traffic. Other authors propose a trade-off between maximizing the number of VNRs that can be accommodated by the InP and minimizing the energy cost of the whole system. They propose embedding requests in regions with the lowest electricity cost.

2.3 HARDWARE & SOFTWARE REQUIREMENTS:

2.3.1 HARDWARE REQUIREMENT:

- Processor – Pentium IV
- Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Key Board – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA

2.3.2 SOFTWARE REQUIREMENTS:

JAVA

Operating System : Windows XP or Win7
Front End : JAVA JDK 1.7
Document : MS-Office 2007

CHAPTER 3

3.0 SYSTEM DESIGN:

Data Flow Diagram / Use Case Diagram / Flow Diagram:

The DFD is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on these data, and the output data generated by the system.

The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components. These components are the system process, the data used by the process, an external entity that interacts with the system and the information flows in the system.

DFD shows how the information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.

DFD is also known as bubble chart. A DFD may be used to represent a system at any level of abstraction. DFD may be partitioned into levels that represent increasing information flow and functional detail.

NOTATION:

SOURCE OR DESTINATION OF DATA:

External sources or destinations, which may be people or organizations or other entities

DATA STORE:

Here the data referenced by a process is stored and retrieved.

PROCESS:

People, procedures or devices that produce data. The physical component is not identified.

DATA FLOW:

Data moves in a specific direction from an origin to a destination. The data flow is a “packet” of data.

MODELING RULES:

There are several common modeling rules when creating DFDs:

All processes must have at least one data flow in and one data flow out.
All processes should modify the incoming data, producing new forms of outgoing data.
Each data store must be involved with at least one data flow.
Each external entity must be involved with at least one data flow.
A data flow must be attached to at least one process.
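The structural rules above lend themselves to an automatic check. The sketch below is a hypothetical validator, not part of the project: the names `DfdRulesSketch`, `Kind` and `Flow` are invented for illustration, every flow endpoint is assumed to be a declared node, and the rule that a process should modify its incoming data is not checked because it cannot be decided from the diagram's structure alone.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch: checking a DFD against the structural modeling rules. */
class DfdRulesSketch {

    enum Kind { PROCESS, DATA_STORE, EXTERNAL_ENTITY }

    /** A directed data flow between two named nodes. */
    static final class Flow {
        final String from;
        final String to;
        Flow(String from, String to) { this.from = from; this.to = to; }
    }

    /** Returns one message per violated rule; an empty list means the DFD passes. */
    static List<String> violations(Map<String, Kind> nodes, List<Flow> flows) {
        List<String> problems = new ArrayList<>();
        // Rule: a data flow must be attached to at least one process.
        for (Flow f : flows) {
            if (nodes.get(f.from) != Kind.PROCESS && nodes.get(f.to) != Kind.PROCESS) {
                problems.add("flow " + f.from + " -> " + f.to + " touches no process");
            }
        }
        for (Map.Entry<String, Kind> e : nodes.entrySet()) {
            String name = e.getKey();
            int in = 0;
            int out = 0;
            for (Flow f : flows) {
                if (f.to.equals(name)) in++;
                if (f.from.equals(name)) out++;
            }
            if (e.getValue() == Kind.PROCESS) {
                // Rule: every process needs at least one flow in and one flow out.
                if (in == 0 || out == 0) {
                    problems.add("process " + name + " needs an inflow and an outflow");
                }
            } else if (in + out == 0) {
                // Rule: every data store and external entity takes part in a flow.
                problems.add(name + " is not involved in any data flow");
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        Map<String, Kind> nodes = new LinkedHashMap<>();
        nodes.put("User", Kind.EXTERNAL_ENTITY);
        nodes.put("Login", Kind.PROCESS);
        nodes.put("Accounts", Kind.DATA_STORE);
        List<Flow> flows = new ArrayList<>();
        flows.add(new Flow("User", "Login"));
        flows.add(new Flow("Login", "Accounts"));
        System.out.println(violations(nodes, flows));
    }
}
```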

3.1 ARCHITECTURE DIAGRAM:

3.2 DATAFLOW DIAGRAM:

SENSOR NODE:

MOBILE RELAY NODE:

SINK:

UML DIAGRAMS:

3.2 USE CASE DIAGRAM:

3.3 CLASS DIAGRAM:

3.4 SEQUENCE DIAGRAM:

3.5 ACTIVITY DIAGRAM:

CHAPTER 4

4.0 IMPLEMENTATION:

NSFNET NETWORK:

To evaluate the performance of the proposed model and heuristic, the NSFNET network is used as the substrate network. NSFNET comprises 14 nodes and 21 links as shown in Fig. 6. We consider a scenario in which each node in NSFNET hosts a small data center of 500 servers to offer cloud services. Table I shows the parameters used. The power consumption values of the network devices we have used are consistent with our previous work. The IP router ports are the most energy consuming devices in the network. We have adopted the Dell PowerEdge R720 [26] server power specifications. We adapted the CostVNE model and the VNE-EA model for the IP over WDM network architecture.

We compare the CostVNE model, the VNE-EA model, our EEVNE model and the REOViNE heuristic in terms of power consumption and number of accepted requests. The CostVNE model has resulted in the minimum network power consumption as it optimizes the use of bandwidth of the substrate network by consolidating wavelengths regardless of the number of data centers activated (see Fig. 7(a)). Compared to the EEVNE model, the CostVNE model has saved a maximum of 5% (average 3%) of the network power consumption. The EEVNE model, where the energy consumption is minimized by jointly optimizing the use of network resources and consolidating resources in data centers, has resulted in better power savings compared to the VNE-EA model where the power consumption is minimized by switching off substrate links and nodes. This is because the network power consumption is mainly a function of the number of wavelengths rather than the number of active links, as the number of wavelengths used determines the power consumption of router ports and transponders, the most power consuming devices in the network (see Table I). The REOViNE heuristic approaches the EEVNE model in terms of the network power consumption.
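The dependence of network power on the number of wavelengths can be made concrete with a back-of-the-envelope calculation. The sketch below is an illustration only: the class name, the 40 Gb/s wavelength rate and the wattages are placeholder assumptions, not the Table I values, and the model counts one router port and one transponder per wavelength while ignoring every other device.

```java
/** Sketch: network power as a function of the wavelengths a traffic demand needs. */
class WavelengthPowerSketch {

    /** Whole wavelengths required to carry the given traffic. */
    static int wavelengthsNeeded(double trafficGbps, double wavelengthRateGbps) {
        return (int) Math.ceil(trafficGbps / wavelengthRateGbps);
    }

    /** Power of the router ports and transponders lit for those wavelengths. */
    static double portAndTransponderWatts(double trafficGbps, double wavelengthRateGbps,
                                          double routerPortWatts, double transponderWatts) {
        return wavelengthsNeeded(trafficGbps, wavelengthRateGbps)
                * (routerPortWatts + transponderWatts);
    }

    public static void main(String[] args) {
        // Placeholder parameters, NOT taken from Table I.
        double rate = 40.0;
        double portWatts = 1000.0;
        double transponderWatts = 100.0;
        for (double traffic : new double[] {30.0, 100.0}) {
            System.out.println(traffic + " Gb/s -> "
                    + wavelengthsNeeded(traffic, rate) + " wavelength(s), "
                    + portAndTransponderWatts(traffic, rate, portWatts, transponderWatts) + " W");
        }
    }
}
```

Under these assumptions, packing demands into fewer wavelengths lowers the power directly, whereas switching off a lightly used link saves little if the same wavelengths are simply lit elsewhere.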

We next consider the power consumption of data centers under the different models and the heuristic. As mentioned above, the CostVNE model does not take into account the number of activated data centers; therefore, it performs very poorly as far as the power consumption in data centers is concerned. However, as the network gets fully loaded and all the data centers are activated, the EEVNE model loses its merit over the CostVNE model. For a limited number of requests, the VNE-EA model performs just as well as the EEVNE model. However, as the number of requests increases, the VNE-EA model tends to route the virtual links through multiple hops to minimize the number of activated links and data centers and therefore consumes more power. The REOViNE heuristic also approaches the EEVNE performance in terms of the data center power consumption.

4.1 ALGORITHM:

VIRTUAL NETWORK EMBEDDING (VNE):

Resource allocation is done using a class of algorithms commonly known as “virtual network embedding (VNE)” algorithms. The dynamic mapping of virtual resources onto the physical hardware maximizes the benefits gained from existing hardware. The VNE problem can be either offline or online. In offline problems all the virtual network requests (VNRs) are known and scheduled in advance, while in the online problem VNRs arrive dynamically and can stay in the network for an arbitrary duration. Both online and offline problems are known to be NP-hard. With constraints on virtual nodes and links, the offline VNE problem can be reduced to the NP-hard multiway separator problem. As a result, most of the work done in this area has focused on the design of heuristic algorithms and the use of networks with minimal complexity when solving mixed integer linear programming (MILP) models.

Network virtualization has been proposed as an enabler of energy savings by means of resource consolidation. In all these proposals, the VNE models and/or algorithms do not address the link embedding problem as a multi-layer problem spanning from the virtualization layer through the IP layer and all the way to the optical layer. Except for the authors in [14], the others do not consider the power consumption of network ports/links as being related to the actual traffic passing through them. On the contrary, we take a very generic, detailed and accurate approach towards energy efficient VNE (EEVNE) where we allow the model to decide the optimum approach to minimize the total network and data centers server power consumption.

We consider the granular power consumption of various network elements that form the network engine in backbone networks as well as the power consumption in data centers. We develop a MILP model and a real-time heuristic to represent the EEVNE approach for clouds in IP over WDM networks with data centers. We study the energy efficiency considering two different power consumption profiles for servers in data centers: an energy inefficient power profile and an energy efficient power profile. Our work also investigates the impact of location and delay constraints in a practical enterprise solution of VNE in clouds. Furthermore, we show how VNE can impact the design problem of optimally locating data centers for minimal power consumption in cloud networks.

4.2 MODULES:

SERVER CLIENT MODULE:

VIRTUAL NETWORK EMBEDDING:

MILP MODEL FOR EEVNE:

ENERGY EFFICIENT NETWORKS:

4.3 MODULE DESCRIPTION:

SERVER CLIENT MODULE:

VIRTUAL NETWORK EMBEDDING:

MILP MODEL FOR EEVNE:

ENERGY EFFICIENT NETWORKS:

CHAPTER 5

5.0 SYSTEM STUDY:

5.1 FEASIBILITY STUDY:

Three key considerations involved in the feasibility analysis are:

ECONOMICAL FEASIBILITY
TECHNICAL FEASIBILITY
SOCIAL FEASIBILITY

5.1.1 ECONOMICAL FEASIBILITY:

5.1.2 TECHNICAL FEASIBILITY

5.1.3 SOCIAL FEASIBILITY:

5.2 SYSTEM TESTING:

5.2.1 UNIT TESTING:

Description	Expected result
Test for application window properties.	All the properties of the windows are to be properly aligned and displayed.
Test for mouse operations.	All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions.

5.2.2 FUNCTIONAL TESTING:

Description	Expected result
Test for all modules.	All peers should communicate in the group.
Test for various peers in a distributed network framework, as it displays all users available in the group.	The result after execution should be accurate.

5.2.3 NON-FUNCTIONAL TESTING:

Load testing
Performance testing
Usability testing
Reliability testing
Security testing

5.2.4 LOAD TESTING:

Description	Expected result
It is necessary to ascertain that the application behaves correctly under loads when ‘Server busy’ response is received.	Should designate another active node as a Server.

5.2.5 PERFORMANCE TESTING:

Description	Expected result
This is required to assure that an application performs adequately, having the capability to handle many peers, delivering its results in the expected time and using an acceptable level of resources; it is an aspect of operational management.	Should handle large input values and produce accurate results in the expected time.

5.2.6 RELIABILITY TESTING:

Description	Expected result
This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in providing the application.	In case of failure of the server, an alternate server should take over the job.

5.2.7 SECURITY TESTING:

Description	Expected result
Checking that the user identification is authenticated.	In case of failure, the user should not be connected to the framework.
Check whether group keys in a tree are shared by all peers.	All peers in the same group should know the group key.

5.2.8 WHITE BOX TESTING:

Description	Expected result
Exercise all logical decisions on their true and false sides.	All the logical decisions must be valid.
Execute all loops at their boundaries and within their operational bounds.	All the loops must be finite.
Exercise internal data structures to ensure their validity.	All the data structures must be valid.

5.2.9 BLACK BOX TESTING:

Description	Expected result
To check for incorrect or missing functions.	All the functions must be valid.
To check for interface errors.	The entire interface must function normally.
To check for errors in data structures or external database access.	Database updates and retrieval must be performed correctly.
To check for initialization and termination errors.	All the functions and data structures must be initialized properly and terminated normally.

All the above system testing strategies are carried out, as the development, documentation and institutionalization of the proposed goals and related policies are essential.

CHAPTER 6

6.0 SOFTWARE DESCRIPTION:

6.1 JAVA TECHNOLOGY:

Java technology is both a programming language and a platform.

The Java Programming Language

The Java programming language is a high-level language that can be characterized by all of the following buzzwords:

- Simple
- Architecture neutral
- Object oriented
- Portable
- Distributed
- High performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure

6.2 THE JAVA PLATFORM:

The Java platform has two components:

The Java Virtual Machine (Java VM)
The Java Application Programming Interface (Java API)

You’ve already been introduced to the Java VM. It’s the base for the Java platform and is ported onto various hardware-based platforms.

The following figure depicts a program that’s running on the Java platform. As the figure shows, the Java API and the virtual machine insulate the program from the hardware.

6.3 WHAT CAN JAVA TECHNOLOGY DO?

The essentials: Objects, strings, threads, numbers, input and output, data structures, system properties, date and time, and so on.
Applets: The set of conventions used by applets.
Networking: URLs, TCP (Transmission Control Protocol), UDP (User Datagram Protocol) sockets, and IP (Internet Protocol) addresses.
Internationalization: Help for writing programs that can be localized for users worldwide. Programs can automatically adapt to specific locales and be displayed in the appropriate language.
Security: Both low level and high level, including electronic signatures, public and private key management, access control, and certificates.
Software components: Known as JavaBeans™, can plug into existing component architectures.
Object serialization: Allows lightweight persistence and communication via Remote Method Invocation (RMI).
Java Database Connectivity (JDBC™): Provides uniform access to a wide range of relational databases.
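To make the object serialization entry in the list above concrete, here is a minimal sketch of a serialization round trip using only `java.io`. The `SerializationRoundTrip` and `PeerInfo` names are hypothetical and chosen for illustration; they are not taken from the project described in this report.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical example class, not part of the original project.
class SerializationRoundTrip {

    // A small value object; implementing Serializable opts it in to Java serialization.
    static class PeerInfo implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;
        final int port;

        PeerInfo(String name, int port) {
            this.name = name;
            this.port = port;
        }
    }

    // Writes the object into an in-memory byte array.
    static byte[] toBytes(PeerInfo peer) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(peer);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Rebuilds an equivalent object from the bytes produced by toBytes.
    static PeerInfo fromBytes(byte[] bytes) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (PeerInfo) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PeerInfo copy = fromBytes(toBytes(new PeerInfo("node-a", 9000)));
        System.out.println(copy.name + ":" + copy.port);
    }
}
```

The same byte stream could be written to a file or a socket instead of a byte array; the in-memory buffer only keeps the sketch self-contained.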

The Java platform also has APIs for 2D and 3D graphics, accessibility, servers, collaboration, telephony, speech, animation, and more. The following figure depicts what is included in the Java 2 SDK.

6.4 HOW WILL JAVA TECHNOLOGY CHANGE MY LIFE?

Get started quickly: Although the Java programming language is a powerful object-oriented language, it’s easy to learn, especially for programmers already familiar with C or C++.
Write less code: Comparisons of program metrics (class counts, method counts, and so on) suggest that a program written in the Java programming language can be four times smaller than the same program in C++.
Write better code: The Java programming language encourages good coding practices, and its garbage collection helps you avoid memory leaks. Its object orientation, its JavaBeans component architecture, and its wide-ranging, easily extendible API let you reuse other people’s tested code and introduce fewer bugs.
Develop programs more quickly: Your development time may be as much as twice as fast versus writing the same program in C++. Why? You write fewer lines of code and it is a simpler programming language than C++.
Avoid platform dependencies with 100% Pure Java: You can keep your program portable by avoiding the use of libraries written in other languages. The 100% Pure Java™ Product Certification Program has a repository of historical process manuals, white papers, brochures, and similar materials online.
Write once, run anywhere: Because 100% Pure Java programs are compiled into machine-independent byte codes, they run consistently on any Java platform.
Distribute software more easily: You can upgrade applets easily from a central server. Applets take advantage of the feature of allowing new classes to be loaded “on the fly,” without recompiling the entire program.

6.5 ODBC:

6.6 JDBC:

JDBC was announced in March of 1996. It was released for a 90 day public review that ended June 8, 1996. Because of user input, the final JDBC v1.0 specification was released soon after.

6.7 JDBC Goals:

The goals that were set for JDBC are important. They will give you some insight as to why certain classes and functionalities behave the way they do. The eight design goals for JDBC are as follows:

SQL Level API

SQL Conformance

JDBC must be implementable on top of common database interfaces

Provide a Java interface that is consistent with the rest of the Java system

Because of Java’s acceptance in the user community thus far, the designers feel that they should not stray from the current design of the core Java system.

Keep it simple

Use strong, static typing wherever possible

Strong typing allows for more error checking to be done at compile time; also, fewer errors appear at runtime.

Keep the common cases simple

Finally, we decided to proceed with the implementation using Java Networking.

For dynamically updating the cache table, we use an MS Access database.

Java is two things: a programming language and a platform.

Java is a high-level programming language that is all of the following:

- Simple
- Architecture-neutral
- Object-oriented
- Portable
- Distributed
- High-performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure

Compilation happens just once; interpretation occurs each time the program is executed. The figure illustrates how this works.

6.7 NETWORKING TCP/IP STACK:

The TCP/IP stack is shorter than the OSI one:

TCP is a connection-oriented protocol; UDP (User Datagram Protocol) is a connectionless protocol.

IP datagrams:

UDP:

UDP is also connectionless and unreliable. What it adds to IP is a checksum for the contents of the datagram and port numbers. These are used to give a client/server model – see later.
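As an illustration of the checksum mentioned above, the following sketch computes the 16-bit one's-complement Internet checksum (the algorithm described in RFC 1071) over a byte array. It is a simplification: a real UDP checksum also covers a pseudo-header and the UDP header, which are left out here. The class and method names are hypothetical.

```java
// Hypothetical example class, not part of the original project.
class InternetChecksum {

    // One's-complement sum of big-endian 16-bit words, then complemented.
    static int checksum(byte[] data) {
        long sum = 0;
        int i = 0;
        // Add the data two bytes at a time as unsigned 16-bit words.
        while (i + 1 < data.length) {
            sum += ((data[i] & 0xFF) << 8) | (data[i + 1] & 0xFF);
            i += 2;
        }
        // An odd trailing byte is treated as the high byte of a zero-padded word.
        if (i < data.length) {
            sum += (data[i] & 0xFF) << 8;
        }
        // Fold any carries above 16 bits back into the low 16 bits.
        while ((sum >> 16) != 0) {
            sum = (sum & 0xFFFF) + (sum >> 16);
        }
        return (int) (~sum & 0xFFFF);
    }

    public static void main(String[] args) {
        byte[] data = {0x00, 0x01, (byte) 0xF2, 0x03, (byte) 0xF4, (byte) 0xF5, (byte) 0xF6, (byte) 0xF7};
        System.out.printf("checksum = 0x%04x%n", checksum(data));
    }
}
```

A receiver can verify a block by running the same function over the data followed by the transmitted checksum; an undamaged block yields 0.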

TCP:

TCP supplies logic to give a reliable connection-oriented protocol above IP. It provides a virtual circuit that two processes can use to communicate.

Internet addresses

In order to use a service, you must be able to find it. The Internet uses an address scheme for machines so that they can be located. The address is a 32 bit integer which gives the IP address.

Network address:

Class A uses 8 bits for the network address with 24 bits left over for other addressing. Class B uses 16 bit network addressing. Class C uses 24 bit network addressing and class D uses all 32.
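The class of an address can be read from its first octet. The sketch below maps a first octet to the number of network bits given above and derives the size of the remaining host space. The octet ranges (0 to 127 for class A, 128 to 191 for class B, 192 to 223 for class C) are the standard classful convention and are not spelled out in the text; values of 224 and above are reported as 32 bits, following the text's description of class D. The names are hypothetical.

```java
// Hypothetical example class, not part of the original project.
class ClassfulAddress {

    // Returns the number of network bits implied by the first octet of an address.
    static int networkBits(int firstOctet) {
        if (firstOctet < 0 || firstOctet > 255) {
            throw new IllegalArgumentException("octet out of range: " + firstOctet);
        }
        if (firstOctet < 128) return 8;   // class A: leading bit 0
        if (firstOctet < 192) return 16;  // class B: leading bits 10
        if (firstOctet < 224) return 24;  // class C: leading bits 110
        return 32;                        // class D (and the range above it), per the text
    }

    // Number of distinct values in the bits left over after the network part.
    static long hostAddresses(int networkBits) {
        return 1L << (32 - networkBits);
    }

    public static void main(String[] args) {
        int bits = networkBits(192);
        System.out.println(bits + " network bits, " + hostAddresses(bits) + " host values");
    }
}
```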

Subnet address:

Internally, the UNIX network is divided into sub networks. Building 11 is currently on one sub network and uses 10-bit addressing, allowing 1024 different hosts.

Host address:

8 bits are finally used for host addresses within our subnet. This places a limit of 256 machines that can be on the subnet.

Total address:

The 32 bit address is usually written as 4 integers separated by dots.
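The dotted notation is just the four bytes of the 32-bit value written in decimal, most significant byte first. A minimal sketch of the conversion in both directions follows; the `DottedQuad` class is a hypothetical helper written for this illustration.

```java
// Hypothetical example class, not part of the original project.
class DottedQuad {

    // Formats a 32-bit address as four decimal octets separated by dots.
    static String format(int address) {
        return ((address >>> 24) & 0xFF) + "."
                + ((address >>> 16) & 0xFF) + "."
                + ((address >>> 8) & 0xFF) + "."
                + (address & 0xFF);
    }

    // Parses "a.b.c.d" back into the 32-bit value.
    static int parse(String dotted) {
        String[] parts = dotted.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("expected four octets: " + dotted);
        }
        int address = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("octet out of range: " + part);
            }
            // Shift the octets seen so far up and append the new one.
            address = (address << 8) | octet;
        }
        return address;
    }

    public static void main(String[] args) {
        System.out.println(format(parse("192.168.0.1")));
    }
}
```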

Port addresses

Sockets:

#include <sys/types.h>

#include <sys/socket.h>

int socket(int family, int type, int protocol);
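The `socket()` call above is the C entry point. Since this report implements its networking in Java, a rough Java counterpart may be useful: the sketch below opens a TCP server socket on the loopback interface, connects a client to it, and echoes one line back. It assumes loopback networking is available; the `LoopbackEcho` class and `echoOnce` method are hypothetical names, not code from the project.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical example class, not part of the original project.
class LoopbackEcho {

    // Line-based UTF-8 reader over a socket's input stream.
    private static BufferedReader reader(Socket s) throws IOException {
        return new BufferedReader(new InputStreamReader(s.getInputStream(), StandardCharsets.UTF_8));
    }

    // Line-based UTF-8 writer; autoFlush makes println() send the line immediately.
    private static PrintWriter writer(Socket s) throws IOException {
        return new PrintWriter(new OutputStreamWriter(s.getOutputStream(), StandardCharsets.UTF_8), true);
    }

    // Sends one line (without line breaks) to a local echo server and returns the reply.
    static String echoOnce(String line) {
        // Port 0 asks the operating system for any free port.
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            Thread serverThread = new Thread(() -> {
                // Server side: accept one connection and send back the line it receives.
                try (Socket peer = server.accept()) {
                    writer(peer).println(reader(peer).readLine());
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            serverThread.start();

            // Client side: connect, send the line, read the echo.
            try (Socket client = new Socket(InetAddress.getLoopbackAddress(), server.getLocalPort())) {
                writer(client).println(line);
                String reply = reader(client).readLine();
                serverThread.join();
                return reply;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(echoOnce("hello over TCP"));
    }
}
```

Here `ServerSocket` and `Socket` hide the family, type and protocol arguments of the C call: both classes create TCP stream sockets over IP.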

6.8 JFREE CHART:

JFreeChart is a free 100% Java chart library that makes it easy for developers to display professional quality charts in their applications. JFreeChart’s extensive feature set includes:

A consistent and well-documented API, supporting a wide range of chart types;

A flexible design that is easy to extend, and targets both server-side and client-side applications;

Support for many output types, including Swing components, image files (including PNG and JPEG), and vector graphics file formats (including PDF, EPS and SVG);

JFreeChart is “open source” or, more specifically, free software. It is distributed under the terms of the GNU Lesser General Public Licence (LGPL), which permits use in proprietary applications.

6.8.1. Map Visualizations:

6.8.2. Time Series Chart Interactivity

6.8.3. Dashboards

6.8.4. Property Editors

CHAPTER 7

7.0 APPENDIX

7.1 SAMPLE SCREEN SHOTS:

7.2 SAMPLE SOURCE CODE:

CHAPTER 8

8.1 CONCLUSION:

This paper has investigated the energy efficiency of virtual network embedding in IP over WDM networks. We developed a MILP model (EEVNE) and a heuristic (REOViNE) to optimize the use of wavelengths in the network in addition to consolidating the use of resources in data centers. The results show that the EEVNE model achieves a maximum power saving of 60% (average 20%) compared to the CostVNE model, which minimizes the bandwidth cost of embedding a VNR. The EEVNE model also has higher power savings compared to the virtual network embedding energy aware (VNE-EA) model from the literature. We have demonstrated that when it comes to energy savings in the network, it is not sufficient to develop models that just turn off links and nodes in the network; it is important to consider all the power consuming devices in the network and then minimize their power consumption as a whole. The REOViNE heuristic’s power savings and number of accepted requests approach those of the MILP model. We have also investigated the performance of the models under non-uniform load distributions, showing that the EEVNE model has superior power savings in most load conditions.

8.2 FUTURE WORK:

We have gone further to show the energy efficiency of VNE considering energy efficient data center power profile. The results show that the optimal VNE approach with the minimum power consumption is the one that only minimizes the use of network bandwidth, in this case, the CostVNE model. This however only applies when it is assumed that the network is not reconfigured when embedding new requests. We have also studied the power savings achieved by removing geographical redundancy constraints when embedding protection and load balancing virtual nodes and observed that the power savings obtained as a result can guide service providers in determining cost reductions offered to enterprise customers not requiring full geographical redundancy. We have shown how VNE can impact the optimal locations of data centers for minimal network power consumption in cloud networks. The results show that the selection of a location to host a data center is governed by two factors: the average hop count to other nodes and the client population of the candidate node and its neighbours (assuming a given average rate per user). Finally, we have developed a MILP model for VNE in O-OFDM based cloud networks and shown that they have improved power and spectral efficiency compared to conventional WDM based networks.

Enabling Fine-grained Multi-keyword Search Supporting Classified Sub-dictionaries over Encrypted Clou

05/08/201902/07/2019 by admin

Enabling Fine-grained Multi-keyword Search Supporting Classified Sub-dictionaries over Encrypted Cloud Data

Abstract: Using cloud computing, individuals can store their data on remote servers and allow data access to public users through the cloud servers. As the outsourced data are likely to contain sensitive privacy information, they are typically encrypted before uploaded to the cloud. This, however, significantly limits the usability of outsourced data due to the difficulty of searching over the encrypted data. In this paper, we address this issue by developing the fine-grained multi-keyword search schemes over encrypted cloud data. Our original contributions are three-fold. First, we introduce the relevance scores and preference factors upon keywords which enable the precise keyword search and personalized user experience. Second, we develop a practical and very efficient multi-keyword search scheme. The proposed scheme can support complicated logic search the mixed “AND”, “OR” and “NO” operations of keywords. Third, we further employ the classified sub-dictionaries technique to achieve better efficiency on index building, trapdoor generating and query. Lastly, we analyze the security of the proposed schemes in terms of confidentiality of documents, privacy protection of index and trapdoor, and unlinkability of trapdoor. Through extensive experiments using the real-world dataset, we validate the performance of the proposed schemes. Both the security analysis and experimental results demonstrate that the proposed schemes can achieve the same security level comparing to the existing ones and better performance in terms of functionality, query complexity and efficiency.

Index Terms: Searchable encryption, Multi-keyword, Fine-grained, Cloud computing.

1 INTRODUCTION

The cloud computing treats computing as a utility and leases out the computing and storage capacities to the public individuals [1], [2], [3].
In such a framework, the individual can remotely store her data on the cloud server, namely data outsourcing, and then make the cloud data open for public access through the cloud server. This represents a more scalable, low-cost and stable way for public data access because of the scalability and high efficiency of cloud servers, and therefore is favorable to small enterprises.

• H. Li and Y. Yang are with the School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan, China (e-mail: hongweili@uestc.edu.cn; yangyi.buku@gmail.com).
• H. Li is with State Key Laboratory of Information Security (Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093) (e-mail: hongweili@uestc.edu.cn).
• T. Luan is with the School of Information Technology, Deakin University, Melbourne, Australia (e-mail: tom.luan@deakin.edu.au).
• X. Liang is with the Department of Computer Science, Dartmouth College, Hanover, USA (e-mail: Xiaohui.Liang@dartmouth.edu).
• L. Zhou is with the National Key Laboratory of Science and Technology on Communication, University of Electronic Science and Technology of China, China (e-mail: lzhou@uestc.edu.cn).
• X. Shen is with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, Ontario, Canada (e-mail: sshen@uwaterloo.ca).

Note that the outsourced data may contain sensitive privacy information. It is often necessary to encrypt the private data before transmitting the data to the cloud servers [4], [5]. The data encryption, however, would significantly lower the usability of data due to the difficulty of searching over the encrypted data [6]. Simply encrypting the data may still cause other security concerns.
For instance, Google Search uses SSL (Secure Sockets Layer) to encrypt the connection between search user and Google server when private data, such as documents and emails, appear in the search results [7]. However, if the search user clicks into another website from the search results page, that website may be able to identify the search terms that the user has used.

To address the above issues, searchable encryption (e.g., [8], [9], [10]) has been recently developed as a fundamental approach to enable searching over encrypted cloud data, which proceeds with the following operations. Firstly, the data owner needs to generate several keywords according to the outsourced data. These keywords are then encrypted and stored at the cloud server. When a search user needs to access the outsourced data, it can select some relevant keywords and send the ciphertext of the selected keywords to the cloud server. The cloud server then uses the ciphertext to match the outsourced encrypted keywords, and lastly returns the matching results to the search user. To achieve search efficiency and precision over encrypted data similar to that of plaintext keyword search, an extensive body of research has been developed in the literature. Wang et al. [11] propose a ranked keyword search scheme which considers the relevance scores of keywords. Unfortunately, due to using order-preserving encryption (OPE) [12] to achieve the ranking property, the proposed scheme cannot achieve unlinkability of trapdoor. Later, Sun et al. [13] propose a multi-keyword text search scheme which considers the relevance scores of keywords and utilizes a multidimensional tree technique to achieve efficient search query. Yu et al. [14] propose a multi-keyword top-k retrieval scheme which uses fully homomorphic encryption to encrypt the index/trapdoor and guarantees high security. Cao et al. [6] propose a multi-keyword ranked search (MRSE), which applies coordinate matching as the keyword matching rule, i.e., return data with the most matching keywords.

1545-5971 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TDSC.2015.2406704, IEEE Transactions on Dependable and Secure Computing

Although many search functionalities have been developed in previous literature towards precise and efficient searchable encryption, it is still difficult for searchable encryption to achieve the same user experience as that of the plaintext search, like Google search. This is mainly attributed to the following two issues. Firstly, query with user preferences is very popular in the plaintext search [15], [16]. It enables personalized search and can more accurately represent user’s requirements, but has not been thoroughly studied and supported in the encrypted data domain. Secondly, to further improve the user’s experience on searching, an important and fundamental function is to enable the multi-keyword search with the comprehensive logic operations, i.e., the “AND”, “OR” and “NO” operations of keywords. This is fundamental for search users to prune the searching space and quickly identify the desired data. Cao et al. [6] propose the coordinate matching search scheme (MRSE) which can be regarded as a searchable encryption scheme with “OR” operation. Zhang et al. [17] propose a conjunctive keyword search scheme which can be regarded as a searchable encryption scheme with “AND” operation with the returned documents matching all keywords.
However, most existing proposals can only enable search with single logic operation, rather than the mixture of multiple logic operations on keywords, which motivates our work.

In this work, we address the above two issues by developing two Fine-grained Multi-keyword Search (FMS) schemes over encrypted cloud data. Our original contributions can be summarized in three aspects as follows:

• We introduce the relevance scores and the preference factors of keywords for searchable encryption. The relevance scores of keywords can enable more precise returned results, and the preference factors of keywords represent the importance of keywords in the search keyword set specified by search users and correspondingly enable personalized search to cater to specific user preferences. This further improves the search functionalities and user experience.
• We realize the “AND”, “OR” and “NO” operations in the multi-keyword search for searchable encryption. Compared with schemes in [6], [13] and [14], the proposed scheme can achieve more comprehensive functionality and lower query complexity.
• We employ the classified sub-dictionaries technique to enhance the efficiency of the above two schemes. Extensive experiments demonstrate that the enhanced schemes can achieve better efficiency in terms of index building, trapdoor generating and query in the comparison with schemes in [6], [13] and [14].

The remainder of this paper is organized as follows. In Section 2, we outline the system model, threat model, security requirements and design goals. In Section 3, we describe the preliminaries of the proposed schemes. We present the developed schemes and enhanced schemes in detail in Section 4 and Section 5, respectively. Then we carry out the security analysis and performance evaluation in Section 6 and Section 7, respectively. Section 8 provides a review of the related works and Section 9 concludes the paper.

2 SYSTEM MODEL, THREAT MODEL AND SECURITY REQUIREMENTS

2.1 System Model

As shown in Fig. 1, we consider a system that consists of three entities.

• Data owner: The data owner outsources her data to the cloud for convenient and reliable data access to the corresponding search users. To protect the data privacy, the data owner encrypts the original data through symmetric encryption. To improve the search efficiency, the data owner generates some keywords for each outsourced document. The corresponding index is then created according to the keywords and a secret key. After that, the data owner sends the encrypted documents and the corresponding indexes to the cloud, and sends the symmetric key and secret key to search users.
• Cloud server: The cloud server is an intermediate entity which stores the encrypted documents and corresponding indexes that are received from the data owner, and provides data access and search services to search users. When a search user sends a keyword trapdoor to the cloud server, it would return a collection of matching documents based on certain operations.
• Search user: A search user queries the outsourced documents from the cloud server with the following three steps. First, the search user receives both the secret key and symmetric key from the data owner. Second, according to the search keywords, the search user uses the secret key to generate a trapdoor and sends it to the cloud server. Last, she receives the matching document collection from the cloud server and decrypts them with the symmetric key.

2.2 Threat Model and Security Requirements

In our threat model, the cloud server is assumed to be “honest-but-curious”, which is the same as most related works on secure cloud data search [13], [14], [6]. Specifically, the cloud server honestly follows the designated protocol specification. However, the cloud server could be “curious” to infer and analyze data (including index) in its storage and message flows received during the protocol so as to learn additional information.
We consider two threat models depending on the information available to the cloud server, which are also used in [13], [6].

• Known Ciphertext Model: The cloud server can only know encrypted document collection C and index collection I, which are both outsourced from the data owner.
• Known Background Model: The cloud server can possess more knowledge than what can be accessed in the known ciphertext model, such as the correlation relationship of trapdoors and related statistics of other information, i.e., the cloud server can possess the statistical information from a known comparable dataset which bears a similar nature to the targeting dataset.

Fig. 1. System model

Similar to [13], [6], we assume search users are trusted entities, and they share the same symmetric key and secret key. Search users have pre-existing mutual trust with the data owner. For ease of illustration, we do not consider the secure distribution of the symmetric key and the secret key between the data owner and search users; it can be achieved through regular authentication and secure channel establishment protocols based on the prior security context shared between search users and the data owner [18]. In addition, to keep our presentation focused, we do not consider the following issues, which are separate: the access control problem of managing decryption capabilities given to users, and the data collection’s updating problem of inserting new documents, updating existing documents, and deleting existing documents.
The interested readers onabove issues may refer to [6], [5], [10], [19].Based on the above threat model, we define the securityrequirements as follows:• Confidentiality of documents: The outsourced documentsprovided by the data owner are stored in the cloud server.If they match the search keywords, they are sent to thesearch user. Due to the privacy of documents, they shouldnot be identifiable except by the data owner and theauthorized search users.• Privacy protection of index and trapdoor: As discussed inSection 2.1, the index and the trapdoor are created basedon the documents’ keywords and the search keywords,respectively. If the cloud server identifies the content ofindex or trapdoor, and further deduces any associationbetween keywords and encrypted documents, it may learnthe major subject of a document, even the content of ashort document [20]. Therefore, the content of index andtrapdoor cannot be identified by the cloud server.• Unlinkability of trapdoor: The documents stored in thecloud server may be searched many times. The cloudserver should not be able to learn any keyword informationaccording to the trapdoors, e.g., to determine twotrapdoors which are originated from the same keywords.Otherwise, the cloud server can deduce relationship oftrapdoors, and threaten to the privacy of keywords. Hencethe trapdoor generation function should be randomized,rather than deterministic. 
Even in case that two searchkeyword sets are the same, the trapdoors should bedifferent.3 PRELIMINARIESIn this section, we define the notation and review the securekNN computation and relevance score, which will serve as thebasis of the proposed schemes.3.1 Notation• F—the document collection to be outsourced, denoted asa set of N documents F = (F1; F2; · · · ; FN).• C—the encrypted document collection according to F,denoted as a set of N documents C = (C1;C2; · · · ;CN).• FID—the identity collection of encrypted documents C,denoted as FID = (FID1; FID2; · · · ; FIDN).• W—the keyword dictionary, including m keywords, denotedas W = (w1;w2; · · · ;wm).• I—the index stored in the cloud server, which is builtfrom the keywords of each document, denoted as I =(I1; I2; · · · ; IN).• fW—the query keyword set generated by a search user,which is a subset of W.• TfW—the trapdoor for keyword set fW.• ]FID—the identity collection of documents returned tothe search user.• FMS(CS)—the abbreviation of FMS and FMSCS.3.2 Secure kNN ComputationWe adopt the work of Wong et al. [21] as our foundation.Wong et al. propose a secure k-nearest neighbor (kNN) schemewhich can confidentially encrypt two vectors and computeEuclidean distance of them. Firstly, the secret key (S;M1;M2)should be generated. The binary vector S is a splitting indicatorto split plaintext vector into two random vectors, whichcan confuse the value of plaintext vector. And M1 and M2 areused to encrypt the split vectors. The correctness and securityof secure kNN computation scheme can be referred to [21].3.3 Relevance ScoreThe relevance score between a keyword and a documentrepresents the frequency that the keyword appears in thedocument. It can be used in searchable encryption for returningranked results. A prevalent metric for evaluating the relevancescore is TF × IDF, where TF (term frequency) representsthe frequency of a given keyword in a document and IDF1545-5971 (c) 2015 IEEE. 
(inverse document frequency) represents the importance of the keyword within the whole document collection. Without loss of generality, we select a widely used expression from [22] to evaluate the relevance score as

$$Score(\widetilde{W}, F_j) = \sum_{w \in \widetilde{W}} \frac{1}{|F_j|} \cdot (1 + \ln f_{j,w}) \cdot \ln\left(1 + \frac{N}{f_w}\right) \quad (1)$$

where $f_{j,w}$ denotes the TF of keyword $w$ in document $F_j$; $f_w$ denotes the number of documents that contain keyword $w$; $N$ denotes the number of documents in the collection; and $|F_j|$ denotes the length of $F_j$, obtained by counting the number of indexed keywords.

4 PROPOSED SCHEMES
In this section, we first propose a variant of the secure kNN computation scheme, which serves as the basic framework of our schemes. We then describe two variants of the basic framework and their corresponding functionalities in detail.

4.1 Basic Framework
The secure kNN computation scheme uses Euclidean distance to select the k nearest database records. In this section, we present a variant of the secure kNN computation scheme to achieve the searchable encryption property.

4.1.1 Initialization
The data owner randomly generates the secret key $K = (S, M_1, M_2)$, where $S$ is an $(m+1)$-dimensional binary vector, $M_1$ and $M_2$ are two $(m+1) \times (m+1)$ invertible matrices, and $m$ is the number of keywords in $W$.
Then the data owner sends $(K, sk)$ to search users through a secure channel, where $sk$ is the symmetric key used to encrypt documents outsourced to the cloud server.

4.1.2 Index building
The data owner first uses a symmetric encryption algorithm (e.g., AES) to encrypt the document collection $(F_1, F_2, \cdots, F_N)$ with the symmetric key $sk$ [23]; the encrypted documents are denoted as $C_j$ $(j = 1, 2, \cdots, N)$. Then the data owner generates an $m$-dimensional binary vector $P$ for each $C_j$ $(j = 1, 2, \cdots, N)$, where each bit $P[i]$ indicates whether the encrypted document contains the keyword $w_i$, i.e., $P[i] = 1$ indicates yes and $P[i] = 0$ indicates no. Then she extends $P$ to an $(m+1)$-dimensional vector $P'$, where $P'[m+1] = 1$. The data owner uses vector $S$ to split $P'$ into two $(m+1)$-dimensional vectors $(p_a, p_b)$, where $S$ functions as a splitting indicator. Namely, if $S[i] = 0$ $(i = 1, 2, \cdots, m+1)$, $p_a[i]$ and $p_b[i]$ are both set to $P'[i]$; if $S[i] = 1$ $(i = 1, 2, \cdots, m+1)$, the value of $P'[i]$ is randomly split into $p_a[i]$ and $p_b[i]$ such that $P'[i] = p_a[i] + p_b[i]$. Then the index of encrypted document $C_j$ is calculated as $I_j = (p_a M_1, p_b M_2)$. Finally, the data owner sends $C_j \| FID_j \| I_j$ $(j = 1, 2, \cdots, N)$ to the cloud server.

4.1.3 Trapdoor generating
The search user first generates the keyword set $\widetilde{W}$ for searching. Then she creates an $m$-dimensional binary vector $Q$ according to $\widetilde{W}$, where $Q[i]$ indicates whether the $i$-th dictionary keyword $w_i$ is in $\widetilde{W}$, i.e., $Q[i] = 1$ indicates yes and $Q[i] = 0$ indicates no. Further, the search user extends $Q$ to an $(m+1)$-dimensional vector $Q'$, where $Q'[m+1] = -s$ (the value of $s$ is defined in detail in the following schemes). Next, the search user chooses a random number $r > 0$ to generate $Q'' = r \cdot Q'$. Then she splits $Q''$ into two $(m+1)$-dimensional vectors $(q_a, q_b)$: if $S[i] = 0$ $(i = 1, 2, \cdots, m+1)$, the value of $Q''[i]$ is randomly split into $q_a[i]$ and $q_b[i]$; if $S[i] = 1$ $(i = 1, 2, \cdots, m+1)$, $q_a[i]$ and $q_b[i]$ are both set to $Q''[i]$.
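The two splitting rules are complementary: wherever $P'[i]$ is copied, $Q''[i]$ is split additively, and vice versa, so the split is invisible to the inner product. The sketch below checks that $p_a \cdot q_a + p_b \cdot q_b = P' \cdot Q''$ on a toy example. It is only an illustration under stated assumptions: the helper names `split_index`, `split_query` and `dot`, the toy vectors, and the range of the random shares are all hypothetical, and the matrices $M_1$, $M_2$ are deliberately left out, so this covers the splitting step alone, not the full encryption.

```python
import math
import random


def split_index(p_ext, s_bits, rng):
    """Split P': copy where S[i] == 0, random additive split where S[i] == 1."""
    pa, pb = [], []
    for value, bit in zip(p_ext, s_bits):
        if bit == 0:
            pa.append(value)
            pb.append(value)
        else:
            share = rng.uniform(-1.0, 1.0)  # assumed range for the random share
            pa.append(share)
            pb.append(value - share)
    return pa, pb


def split_query(q_scaled, s_bits, rng):
    """Split Q'' with the complementary rule: random split where S[i] == 0."""
    qa, qb = [], []
    for value, bit in zip(q_scaled, s_bits):
        if bit == 0:
            share = rng.uniform(-1.0, 1.0)
            qa.append(share)
            qb.append(value - share)
        else:
            qa.append(value)
            qb.append(value)
    return qa, qb


def dot(u, v):
    """Plain inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(u, v))


rng = random.Random(7)              # fixed seed, only to make the sketch repeatable
S = [0, 1, 1, 0, 1]                 # splitting indicator, m + 1 = 5
P_ext = [1, 0, 1, 1, 1]             # P' = (P, 1)
Q_ext = [1, 0, 1, 0, -1]            # Q' = (Q, -s) with s = 1
r = 3.0
Q_scaled = [r * q for q in Q_ext]   # Q'' = r * Q'

pa, pb = split_index(P_ext, S, rng)
qa, qb = split_query(Q_scaled, S, rng)
recovered = dot(pa, qa) + dot(pb, qb)
expected = dot(P_ext, Q_scaled)     # r * (P . Q - s)
```

Up to floating-point rounding, `recovered` and `expected` agree, which is the identity the Query phase relies on.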
Thus, the search trapdoor $T_{\widetilde{W}}$ can be generated as $(M_1^{-1} q_a, M_2^{-1} q_b)$. Then the search user sends $T_{\widetilde{W}}$ to the cloud server.

4.1.4 Query
With the index $I_j$ $(j = 1, 2, \cdots, N)$ and trapdoor $T_{\widetilde{W}}$, the cloud server calculates the query result as

$$
\begin{aligned}
R_j &= I_j \cdot T_{\widetilde{W}} = (p_a M_1, p_b M_2) \cdot (M_1^{-1} q_a, M_2^{-1} q_b) \\
&= p_a \cdot q_a + p_b \cdot q_b = P' \cdot Q'' \\
&= r P' \cdot Q' \\
&= r \cdot (P \cdot Q - s)
\end{aligned}
\quad (2)
$$

If $R_j > 0$, the corresponding document identity $FID_j$ will be returned.

Discussions: The Basic Framework defines the fundamental system structure of the developed schemes. Building on the secure kNN computation scheme [21], the additional random parameter $r$ further enhances the security. Different values for parameter $s$ and vectors $P$ and $Q$ lead to new variants of the Basic Framework, as elaborated in the following.

4.2 FMS I
In the Basic Framework, $P$ is an $m$-dimensional binary vector, and each bit $P[i]$ indicates whether the encrypted document contains the keyword $w_i$. In the FMS I, the data owner first calculates the relevance score between the keyword $w_i$ and document $F_j$. The relevance score can be calculated as follows:

$$Score(w_i, F_j) = \frac{1}{|F_j|} \cdot (1 + \ln f_{j,w_i}) \cdot \ln\left(1 + \frac{N}{f_{w_i}}\right) \quad (3)$$

where $f_{j,w_i}$ denotes the TF of keyword $w_i$ in document $F_j$; $f_{w_i}$ denotes the number of documents that contain keyword $w_i$; $N$ denotes the number of documents in the collection; and $|F_j|$ denotes the length of $F_j$, obtained by counting the number of indexed keywords.

Then the data owner replaces the value of $P[i]$ with the corresponding relevance score. In addition, we consider the preference factors of keywords. The preference factors indicate the importance of keywords in the search keyword set, as personally defined by the search user. A search user may pay more attention to the preference factors he defines than to the relevance scores of the keywords. Thus, our goal is that
if a document has a keyword with a larger preference factor than other documents, it should have a higher priority in the returned $\widetilde{FID}$; and for two documents, if their largest-preference-factor keywords are the same, the document with the higher relevance score for that keyword is the better match.

As shown in Fig. 2, we replace the values of $P[i]$ and $Q[i]$ by the relevance score and the preference factor of a keyword, respectively (thus $P$ and $Q$ are no longer binary). The search user can dynamically adjust the preference factors to achieve a more flexible search. For convenience, the score is rounded up, i.e., $Score(w_i, F_j) = \lceil 10 \cdot Score(w_i, F_j) \rceil$, and we assume the relevance score is bounded by $D$, i.e., $Score(w_i, F_j) < D$. For the search keyword set $\widetilde{W} = (w_{n_1}, w_{n_2}, \cdots, w_{n_l})$ $(1 \le n_1 < n_2 < \cdots < n_l \le m)$, which is ordered by ascending importance, the search user randomly chooses a super-increasing sequence $(d_1 > 0, d_2, \cdots, d_l)$ (i.e., $\sum_{i=1}^{j-1} d_i \cdot D < d_j$ $(j = 2, 3, \cdots, l)$), where $d_i$ is the preference factor of keyword $w_{n_i}$.
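The super-increasing condition can be made concrete with a short sketch. It is a hedged illustration rather than the authors' procedure: the function names `make_super_increasing` and `is_super_increasing`, the seed, and the small random increments are assumptions. The sequence (1, 10, 60, 500, 3000) with $D = 5$ is the one used in the paper's own FMS I example.

```python
import random


def make_super_increasing(length, bound, rng):
    """Return d_1..d_l with (d_1 + ... + d_{j-1}) * bound < d_j for every j >= 2."""
    seq = [rng.randint(1, 10)]                 # d_1 > 0
    for _ in range(length - 1):
        floor = sum(seq) * bound               # value the next term must exceed
        seq.append(floor + rng.randint(1, 10))
    return seq


def is_super_increasing(seq, bound):
    """Check d_1 > 0 and the super-increasing condition for the score bound D."""
    if not seq or seq[0] <= 0:
        return False
    running = seq[0]
    for d in seq[1:]:
        if running * bound >= d:
            return False
        running += d
    return True


rng = random.Random(2015)                      # fixed seed for repeatability
preference_factors = make_super_increasing(5, 5, rng)
paper_example_ok = is_super_increasing([1, 10, 60, 500, 3000], 5)
```

Because each term exceeds $D$ times the sum of all earlier terms, a single keyword with a larger preference factor outweighs any combination of lower-ranked keywords, whatever their relevance scores.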
Then the search result would be:

$$R_j = r \cdot (P \cdot Q - s) = r \cdot \Big(\sum_{i=1}^{l} Score(w_{n_i}, F_j) \cdot d_i - s\Big) \quad (4)$$

Theorem 1: (Correctness) For the search keyword set $\widetilde{W} = (w_{n_1}, w_{n_2}, \cdots, w_{n_l})$ $(1 \le n_1 < n_2 < \cdots < n_l \le m)$, which is ordered by ascending preference factors, if $F_1$ contains a keyword with a larger preference factor than any keyword $F_2$ contains, then $F_1$ has higher priority in the returned $\widetilde{FID}$.

Proof: For the search keyword set $\widetilde{W} = (w_{n_1}, w_{n_2}, \cdots, w_{n_l})$, assume the keywords of $\widetilde{W}$ contained in $F_1$ and $F_2$ are denoted as $\widetilde{W}_1 = (w_{n_i}, \cdots, w_{n_x})$ $(n_1 \le n_i < \cdots < n_x \le n_l)$ and $\widetilde{W}_2 = (w_{n_j}, \cdots, w_{n_y})$ $(n_1 \le n_j < \cdots < n_y \le n_l)$, respectively, where $\widetilde{W}_1$ and $\widetilde{W}_2$ are both ordered by ascending preference factors, and $n_x > n_y$. As stated above, $Score(w_{n_x}, F_1) \ge 1$ since the score is rounded up, and $\sum_{i=1}^{j-1} d_i \cdot D < d_j$ $(j = 2, 3, \cdots, l)$. Therefore,

$$
\begin{aligned}
R_2 &= r \cdot \Big(\sum_{w_{n_j} \in \widetilde{W}_2} Score(w_{n_j}, F_2) \cdot d_j - s\Big) \\
&\le r \cdot \Big(\sum_{j=1}^{y} Score(w_{n_j}, F_2) \cdot d_j - s\Big) \\
&< r \cdot \Big(\sum_{j=1}^{y} D \cdot d_j - s\Big) < r \cdot (d_x - s) \\
&\le r \cdot (Score(w_{n_x}, F_1) \cdot d_x - s) \\
&\le r \cdot \Big(\sum_{w_{n_i} \in \widetilde{W}_1} Score(w_{n_i}, F_1) \cdot d_i - s\Big) \\
&= R_1
\end{aligned}
\quad (5)
$$

Therefore, $F_1$ has higher priority in the returned $\widetilde{FID}$.

Theorem 2: (Correctness) For the search keyword set $\widetilde{W} = (w_{n_1}, w_{n_2}, \cdots, w_{n_l})$ $(1 \le n_1 < n_2 < \cdots < n_l \le m)$, which is ordered by ascending preference factors, if the largest-preference-factor keyword that $F_1$ contains is the same as that of $F_2$, and $F_1$ has the higher relevance score for that keyword, then $F_1$ has higher priority in the returned $\widetilde{FID}$.

Proof: For the search keyword set $\widetilde{W} = (w_{n_1}, w_{n_2}, \cdots, w_{n_l})$, assume the keywords contained in $F_1$ and $F_2$ are denoted as $\widetilde{W}_1 = (w_{n_i}, \cdots, w_{n_x})$ $(n_1 \le n_i < \cdots < n_x \le n_l)$ and $\widetilde{W}_2 = (w_{n_j}, \cdots, w_{n_x})$ $(n_1 \le n_j < \cdots < n_x \le n_l)$, respectively, where $\widetilde{W}_1$ and $\widetilde{W}_2$ are both ordered by ascending preference factors and $Score(w_{n_x}, F_1) - Score(w_{n_x}, F_2) \ge 1$.
Thus, we have

$$
\begin{aligned}
R_1 &= r \cdot \Big(\sum_{w_{n_i} \in \widetilde{W}_1} Score(w_{n_i}, F_1) \cdot d_i - s\Big) \\
&\ge r \cdot (Score(w_{n_x}, F_1) \cdot d_x - s)
\end{aligned}
\quad (7)
$$

$$
\begin{aligned}
R_2 &= r \cdot \Big(\sum_{w_{n_j} \in \widetilde{W}_2} Score(w_{n_j}, F_2) \cdot d_j - s\Big) \\
&= r \cdot \Big(Score(w_{n_x}, F_2) \cdot d_x + \sum_{w_{n_j} \in \widetilde{W}_2 - w_{n_x}} Score(w_{n_j}, F_2) \cdot d_j - s\Big) \\
&\le r \cdot \Big(Score(w_{n_x}, F_2) \cdot d_x + \sum_{w_{n_j} \in \widetilde{W}_2 - w_{n_x}} D \cdot d_j - s\Big) \\
&< r \cdot (Score(w_{n_x}, F_2) \cdot d_x + d_x - s)
\end{aligned}
\quad (8)
$$

$$
\begin{aligned}
R_1 - R_2 &> r \cdot \big((Score(w_{n_x}, F_1) - Score(w_{n_x}, F_2)) \cdot d_x - d_x\big) \\
&\ge r \cdot (d_x - d_x) \\
&= 0
\end{aligned}
\quad (9)
$$

Therefore, $F_1$ has higher priority in the returned $\widetilde{FID}$ than $F_2$.

Example. We present a concrete example to help understand Theorem 2. The example also illustrates the working process of FMS I. Specifically, we assume that the search keyword set is $\widetilde{W} = (w_{n_1}, w_{n_2}, \cdots, w_{n_5})$, and the largest-preference-factor keyword of $F_1$ and $F_2$ is the same, namely $w_{n_4}$. In addition, we assume the keyword sets of $F_1$ and $F_2$ are $\widetilde{W}_1 = (w_{n_2}, w_{n_3}, w_{n_4})$ and $\widetilde{W}_2 = (w_{n_1}, w_{n_3}, w_{n_4})$, respectively. Furthermore, we assume that the relevance score is bounded by $D = 5$, and in particular let $Score(w_{n_4}, F_1) = 4$ and $Score(w_{n_4}, F_2) = 2$, which satisfy $Score(w_{n_4}, F_1) - Score(w_{n_4}, F_2) = 2 \ge 1$. We randomly choose a super-increasing sequence $d_i = \{1, 10, 60, 500, 3000\}$ $(i = 1, \cdots, 5)$. For arbitrary $r > 0$, we have

$$
\begin{aligned}
R_1 &= r \cdot \Big(\sum_{w_{n_i} \in \widetilde{W}_1} Score(w_{n_i}, F_1) \cdot d_i - s\Big) \\
&\ge r \cdot (Score(w_{n_4}, F_1) \cdot d_4 - s) \\
&= r \cdot (4 \cdot 500 - s) \\
&= r \cdot (2000 - s)
\end{aligned}
\quad (11)
$$
$$
\begin{aligned}
R_2 &= r \cdot \Big(\sum_{w_{n_j} \in \widetilde{W}_2} Score(w_{n_j}, F_2) \cdot d_j - s\Big) \\
&= r \cdot \Big(Score(w_{n_4}, F_2) \cdot d_4 + \sum_{w_{n_j} \in \widetilde{W}_2 - w_{n_4}} Score(w_{n_j}, F_2) \cdot d_j - s\Big) \\
&< r \cdot \Big(Score(w_{n_4}, F_2) \cdot d_4 + \sum_{w_{n_j} \in \widetilde{W}_2 - w_{n_4}} D \cdot d_j - s\Big) \\
&< r \cdot (Score(w_{n_4}, F_2) \cdot d_4 + d_4 - s) \\
&= r \cdot (2 \cdot 500 + 500 - s) \\
&= r \cdot (1500 - s)
\end{aligned}
\quad (12)
$$

$$
\begin{aligned}
R_1 - R_2 &> r \cdot (2000 - s) - r \cdot (1500 - s) \\
&= r \cdot (2000 - 1500) \\
&= 500 \cdot r > 0
\end{aligned}
\quad (13)
$$

Fig. 2. Structure of the FMS I. The data owner builds the index $I_j$ from relevance scores, the search user builds the trapdoor $T_{\widetilde{W}}$ from the super-increasing sequence, and the cloud server computes $R_j$; a larger $R_j$ indicates a better result.

4.3 FMS II
In the FMS II, we do not change the vector $P$ of the Basic Framework, but replace the value of $Q[i]$ by the weight of the search keywords, as shown in Fig. 3. With the keyword weights, we can also bring operations such as “OR”, “AND” and “NO” from Google Search to searchable encryption.

Assume that the keyword sets corresponding to the “OR”, “AND” and “NO” operations are $(w'_1, w'_2, \cdots, w'_{l_1})$, $(w''_1, w''_2, \cdots, w''_{l_2})$ and $(w'''_1, w'''_2, \cdots, w'''_{l_3})$, respectively. Denote the “OR”, “AND” and “NO” operations by $\vee$, $\wedge$ and $\neg$, respectively. Thus the matching rule can be represented as $(w'_1 \vee w'_2 \vee \cdots \vee w'_{l_1}) \wedge (w''_1 \wedge w''_2 \wedge \cdots \wedge w''_{l_2}) \wedge (\neg w'''_1 \wedge \neg w'''_2 \wedge \cdots \wedge \neg w'''_{l_3})$.
For the “OR” operation, the search user chooses a super-increasing sequence $(a_1 > 0, a_2, \cdots, a_{l_1})$ (i.e., $\sum_{k=1}^{j-1} a_k < a_j$ $(j = 2, \cdots, l_1)$) to achieve searching with keyword weights. To enable searchable encryption with “AND” and “NO” operations, the search user chooses a sequence $(b_1, b_2, \cdots, b_{l_2}, c_1, c_2, \cdots, c_{l_3})$, where $\sum_{k=1}^{l_1} a_k < b_h$ $(h = 1, 2, \cdots, l_2)$ and $\sum_{k=1}^{l_1} a_k + \sum_{h=1}^{l_2} b_h < c_i$ $(i = 1, 2, \cdots, l_3)$.

Assume $(w'_1, w'_2, \cdots, w'_{l_1})$ are ordered by ascending importance; then for the search keyword set $(w'_1, w'_2, \cdots, w'_{l_1}, w''_1, w''_2, \cdots, w''_{l_2}, w'''_1, w'''_2, \cdots, w'''_{l_3})$, the corresponding values in $Q$ are set as $(a_1, a_2, \cdots, a_{l_1}, b_1, b_2, \cdots, b_{l_2}, -c_1, -c_2, \cdots, -c_{l_3})$. Other values in $Q$ are set to 0. Finally, the search user sets $s = \sum_{h=1}^{l_2} b_h$. In the Query phase, for a document $F_j$, if the corresponding $R_j > 0$, we claim that $F_j$ satisfies the above matching rule.

Theorem 3: (Correctness) $F_j$ satisfies the above matching rule with “OR”, “AND” and “NO” if and only if the corresponding $R_j > 0$.

Proof: First, we prove the completeness. Since the weight of $w'''_i$ $(i = 1, 2, \cdots, l_3)$ in the vector $Q$ is $-c_i$ and $c_i > \sum_{k=1}^{l_1} a_k + \sum_{h=1}^{l_2} b_h$, if any corresponding value of $w'''_i$ in $P$ of $F_j$ is 1, we can infer $P \cdot Q < 0$ and $R_j = r \cdot (P \cdot Q - s) < 0$. Therefore, if $R_j > 0$, none of the $w'''_i$ is in the keyword set of $F_j$, i.e., $F_j$ satisfies the “NO” operation. Moreover, if $R_j > 0$, then $r \cdot (P \cdot Q - s) = r \cdot (P \cdot Q - \sum_{h=1}^{l_2} b_h) > 0$. Since $b_h > \sum_{k=1}^{l_1} a_k$ $(h = 1, 2, \cdots, l_2)$, all corresponding values of $w''_h$ in $P$ have to be 1 and at least one corresponding value of $w'_k$ $(k = 1, 2, \cdots, l_1)$ in $P$ should be 1. Thus, $F_j$ satisfies the “AND” and “OR” operations. Therefore, if $R_j > 0$, the vector $P$ satisfies the operations of “OR”, “AND” and “NO”.

Next, we show the soundness.
Suppose the vector $P$ satisfies the operations of “OR”, “AND” and “NO”, i.e., at least one corresponding value of keyword $w'_k$ in $P$ is 1 (assume this keyword is $w'_t$ $(1 \le t \le l_1)$), all corresponding values of keywords $w''_h$ in $P$ are 1, and no corresponding value of keyword $w'''_i$ in $P$ is 1. Then $R_j = r \cdot (P \cdot Q - s) \ge r \cdot (a_t + b_1 + b_2 + \cdots + b_{l_2} - s) = r \cdot a_t > 0$.

Example. We present a concrete example to help understand Theorem 3. The example also illustrates the working process of FMS II. Specifically, we assume that the keyword sets corresponding to the “OR”, “AND” and “NO” operations are $(w'_1, w'_2, w'_3)$, $(w''_1, w''_2, w''_3)$ and $(w'''_1, w'''_2)$, respectively. Thus, the matching rule can be represented as $(w'_1 \vee w'_2 \vee w'_3) \wedge (w''_1 \wedge w''_2 \wedge w''_3) \wedge (\neg w'''_1 \wedge \neg w'''_2)$. We assume that the search weights $(a_1, a_2, a_3)$, $(b_1, b_2, b_3)$ and $(c_1, c_2)$ for “OR”, “AND” and “NO” are $(1, 5, 8)$, $(20, 24, 96)$ and $(500, 600)$, respectively, so that the “NO” entries of $Q$ are $-500$ and $-600$.

We first show that $R_j > 0$ when $F_j$ satisfies the matching rule. Specifically, assume that $F_j$ satisfies the matching rule $w'_2 \wedge (w''_1 \wedge w''_2 \wedge w''_3) \wedge (\neg w'''_1 \wedge \neg w'''_2)$. Thus the corresponding values of vector $P$ are $(0, 1, 0)$, $(1, 1, 1)$ and $(0, 0)$, respectively. Thus, $s = \sum_{h=1}^{3} b_h = 20 + 24 + 96 = 140$, and
for arbitrary $r > 0$, the result $R_j$ is

$$
\begin{aligned}
R_j &= r \cdot (P \cdot Q - s) \\
&= r \cdot (a_2 + b_1 + b_2 + b_3 - s) \\
&= r \cdot (5 + 20 + 24 + 96 - 140) \\
&= 5r > 0
\end{aligned}
\quad (14)
$$

From the above example, we can see that $R_j > 0$ when $F_j$ satisfies the matching rule. Next, we show that $R_j < 0$ when $F_j$ does not satisfy the matching rule. In particular, we assume that the “AND” operation is not satisfied: the first “AND” keyword does not match, so the values of $P$ corresponding to the “AND” keywords are $(0, 1, 1)$ instead of $(1, 1, 1)$. Thus, the result $R_j$ is

$$
\begin{aligned}
R_j &= r \cdot (P \cdot Q - s) \\
&= r \cdot (a_2 + b_2 + b_3 - s) \\
&= r \cdot (5 + 24 + 96 - 140) \\
&= -15r < 0
\end{aligned}
\quad (15)
$$

Clearly, $R_j < 0$ when $F_j$ does not satisfy the matching rule.

Fig. 3. Structure of the FMS II. $R_j > 0$ means the document satisfies “OR”, “AND” and “NO”; a larger $R_j$ corresponds to a larger weight of the matched “OR” keywords.

5 ENHANCED SCHEME
In practice, apart from some common keywords, the other keywords in a dictionary are generally professional terms, and this part of the dictionary grows rapidly as the dictionary becomes larger and more comprehensive.
At the same time, the data owner's index becomes longer, although many keyword dimensions will never appear in her documents. This causes redundant computation and communication overhead.

In this section, we further propose a Fine-grained Multi-keyword Search scheme supporting Classified Sub-dictionaries (FMSCS), which classifies the total dictionary into a common sub-dictionary and many professional sub-dictionaries. Our goal is to significantly reduce the computation and communication overhead. We studied a file set randomly chosen from the National Science Foundation (NSF) Research Awards Abstracts 1990-2003 [24]. As shown in Fig. 4, we classify the total dictionary into many sub-dictionaries, such as a common sub-dictionary, a computer science sub-dictionary, a mathematics sub-dictionary and a physics sub-dictionary. The search process requires only minor changes in Initialization.

Change of Initialization: Compared with the Basic Framework, in the enhanced scheme the data owner first chooses the corresponding sub-dictionaries. Then her own dictionary can be combined as $\{f_1 \| Subdic_1 \| f_2 \| Subdic_2 \| \cdots\}$, where $Subdic_i$ represents all keywords contained in the corresponding sub-dictionary and $f_i$ is a filling factor of random length, which is a string of 0s in the index. The filling factor is used to obscure the length of the data owner's own dictionary and the relative positions of the sub-dictionaries. Then, the data owner and search user use this dictionary to generate the index and trapdoor, respectively. Note that within a dictionary, two professional sub-dictionaries may contain the same keyword, but only its first occurrence is used to generate the index and trapdoor; the other occurrence is set to 0 in the vector. The secret key $K$ is formed as $(S, M_1, M_2, |f_1|, DID_1, |f_2|, DID_2, \cdots)$, where $DID_i$ represents the identity of the sub-dictionary and $|f_i|$ is the length of $f_i$.
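The layout of the data owner's dictionary described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `build_owner_dictionary` and `binary_vector`, the `max_fill` parameter, the toy sub-dictionaries and the choice of `None` to mark zeroed slots are all hypothetical.

```python
import random


def build_owner_dictionary(sub_dictionaries, rng, max_fill=3):
    """Lay out {f1 || Subdic1 || f2 || Subdic2 || ...} as a list of slots.

    A slot holds a keyword, or None when it belongs to a filling factor or
    repeats a keyword that already appeared (such slots stay 0 in vectors).
    """
    slots, fill_lengths, seen = [], [], set()
    for keywords in sub_dictionaries:
        fill = rng.randint(1, max_fill)        # |f_i|, chosen at random
        fill_lengths.append(fill)
        slots.extend([None] * fill)
        for word in keywords:
            slots.append(None if word in seen else word)
            seen.add(word)
    return slots, fill_lengths


def binary_vector(slots, document_keywords):
    """Binary P over the owner's dictionary; filler and duplicate slots are 0."""
    return [1 if s is not None and s in document_keywords else 0 for s in slots]


rng = random.Random(4)                         # fixed seed for repeatability
common = ["data", "network"]
mathematics = ["matrix", "vector"]
physics = ["vector", "quantum"]                # "vector" repeats and is zeroed here
slots, fills = build_owner_dictionary([common, mathematics, physics], rng)
P = binary_vector(slots, {"vector", "quantum"})
```

The vector length is the total number of keywords plus the filler lengths, so an observer who sees only the index cannot read off the dictionary size or where each sub-dictionary begins.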
Other than these changes, the remaining phases (i.e., Index building, Trapdoor generating and Query) are the same as in the Basic Framework.

Dictionary Updating: In searchable encryption schemes with a dictionary, dictionary update is a challenging problem because it may require updating a massive number of indexes outsourced to the cloud server. In general dictionary-based search schemes, e.g., [13] and [14], an update of the dictionary leads to re-generation of all indexes. In our FMSCS schemes, when sub-dictionaries need to be changed or new sub-dictionaries added, only the data owners who use the corresponding sub-dictionaries need to update their indexes; most other data owners do not need to perform any update operation. Such dictionary update operations are particularly lightweight. In addition, Li et al. [9] use the dimension expansion technique to implement efficient dictionary expansion. Such a method can also be included in our dictionary updating process. Our scheme can be even more efficient than [9]: although [9] does not need to re-generate all indexes, the corresponding extension operations on all indexes are still necessary. In comparison, our schemes only need to extend the indexes of some data owners.
Fig. 4. Classified sub-dictionaries. Example dictionaries: $f_1 \| \text{common} \| f_2 \| \text{computer science} \| f_3 \| \text{mathematics} \| f_4$ and $f'_1 \| \text{physics} \| f'_2 \| \text{mathematics} \| f'_3 \| \text{common} \| f'_4$.

6 SECURITY ANALYSIS
In this section, we analyze the main security properties of the proposed schemes. In particular, our analysis focuses on how the proposed schemes achieve confidentiality of documents, privacy protection of index and trapdoor, and unlinkability of trapdoor. Other security features are outside the scope of this analysis.

6.1 Confidentiality of Documents
In our schemes, the outsourced documents are encrypted with a traditional symmetric encryption algorithm (e.g., AES). In addition, the secret key $sk$ is generated by the data owner and sent to the search user through a secure channel. Since the AES encryption algorithm is secure [23], no entity can recover the encrypted documents without the secret key $sk$. Therefore, confidentiality of the encrypted documents is achieved.

6.2 Privacy Protection of Index and Trapdoor
As shown in Section 4.1, both the index $I_j = (p_a M_1, p_b M_2)$ and the trapdoor $T_{\widetilde{W}} = (M_1^{-1} q_a, M_2^{-1} q_b)$ are ciphertexts of the vectors $(P, Q)$. The secret key is $K = (S, M_1, M_2)$ in the FMS or $(S, M_1, M_2, |f_1|, DID_1, |f_2|, DID_2, \cdots)$ in the FMSCS, where $S$ functions as a splitting indicator which splits $P$ and $Q$ into $(p_a, p_b)$ and $(q_a, q_b)$, respectively, and the two invertible matrices $M_1$ and $M_2$ are used to encrypt $(p_a, p_b)$ and $(q_a, q_b)$. The security of this encryption algorithm has been proved in the known ciphertext model [21]. Thus, the content of the index and trapdoor cannot be identified. Therefore, privacy protection of index and trapdoor is achieved.

6.3 Unlinkability of Trapdoor
To protect the security of search, the unlinkability of trapdoor should be achieved.
Although the cloud server cannot directly recover the keywords, the linkability of trapdoors may cause leakage of privacy. For example, the same keyword set may be searched many times; if the trapdoor generation function is deterministic, then even though the cloud server cannot decrypt the trapdoors, it can deduce the relationship of keywords. We consider whether the trapdoor $T_{\widetilde{W}} = (M_1^{-1} q_a, M_2^{-1} q_b)$ can be linked to the keywords. We prove that our schemes achieve unlinkability of trapdoor in a strong threat model, i.e., the known background model [6].

Known Background Model: In this model, the cloud server possesses statistical information from a known comparable dataset which is similar in nature to the target dataset.

TABLE 1
Structure of Q′

              Q′[1] ... Q′[m]                          Q′[m + 1]
FMS(CS) I     ... 0 ... d_i ... 0 ... d_j ...          −s
FMS(CS) II    ... a_k ... b_h ... 0 ... c_i ...        −s

As shown in Table 1, in our FMS(CS) I, the trapdoor consists of two parts. The values of the dimensions $d_i$ $(i = 1, 2, \cdots, l)$ form the super-increasing sequence randomly chosen by the search user (assume there are $\alpha$ possible sequences). The $(m+1)$-th dimension is $-s$ defined by the search user, where $s$ is a positive random number. Assuming the size of $s$ is $\eta_s$ bits, there are $2^{\eta_s}$ possible values for $-s$. Further, to generate $Q'' = r \cdot Q'$, $Q'$ is multiplied by a positive random number $r$; there are $2^{\eta_r}$ possible values for $r$ (if the search user chooses an $\eta_r$-bit $r$). Finally, $Q''$ is split into $(q_a, q_b)$ according to the splitting indicator $S$. Specifically, if $S[i] = 0$ $(i = 1, 2, \cdots, m+1)$, the value of $Q''[i]$ is randomly split into $q_a[i]$ and $q_b[i]$; assume the number of ‘0’ entries in $S$ is $\mu$, and each dimension of $q_a$ and $q_b$ is $\eta_q$ bits. Note that $\eta_s$, $\eta_r$, $\mu$ and $\eta_q$ are independent of each other.
Then in our FMS(CS) I, we can compute the probability that two trapdoors are the same as follows:

$$P_1 = \frac{1}{\alpha \cdot 2^{\eta_s} \cdot 2^{\eta_r} \cdot (2^{\eta_q})^{\mu}} = \frac{1}{\alpha \cdot 2^{\eta_s + \eta_r + \mu \eta_q}} \quad (16)$$

Therefore, larger $\alpha$, $\eta_s$, $\eta_r$, $\mu$ and $\eta_q$ achieve stronger security; e.g., if we choose a 1024-bit $r$, then the probability $P_1 < 1/2^{1024}$. As a result, the probability that two trapdoors are the same is negligible.

In the FMS(CS) II, because $-s = -\sum_{h=1}^{l_2} b_h$, its value depends on the weight sequence $(a_1, a_2, \cdots, a_{l_1}, b_1, b_2, \cdots, b_{l_2}, c_1, c_2, \cdots, c_{l_3})$. Assume the number of different sequences is denoted as $\beta$; then we can compute:

$$P_2 = \frac{1}{\beta \cdot 2^{\eta_r} \cdot (2^{\eta_q})^{\mu}} = \frac{1}{\beta \cdot 2^{\eta_r + \mu \eta_q}} \quad (17)$$

Similarly, in the FMS(CS) II, the probability that two trapdoors are the same is negligible. Therefore, in our schemes, unlinkability of trapdoor is achieved.

In summary, we present the comparison results of security level in Table 2, where I and II represent FMS(CS) I and FMS(CS) II, respectively.
It can be seen that all schemes achieve confidentiality of documents and privacy protection of index and trapdoor, but the OPE schemes [11], [25] cannot achieve unlinkability of trapdoor very well because of the similarity relevance mentioned in [14].

TABLE 2
Comparison of Security Level

                      [11], [25]    [6], [13], [14]    I    II
Confidentiality       ✓             ✓                  ✓    ✓
Privacy protection    ✓             ✓                  ✓    ✓
Unlinkability         -             ✓                  ✓    ✓

Discussions: In MRSE [6], the values of $P \cdot Q$ are equal to the number of matching keywords, which makes the scheme vulnerable to the scale analysis attack when the cloud server is powerful and has some background knowledge. To solve this problem, MRSE extends the index and inserts a random number $\varepsilon_j$ which follows a normal distribution and obscures the values of $P \cdot Q$. Thus, enhanced MRSE can resist the scale analysis attack. However, the introduction of $\varepsilon_j$ decreases the precision of the returned results. There is a trade-off between precision and security in MRSE. In comparison, our schemes do not suffer from the scale analysis attack, because the values of $P \cdot Q$ in our schemes do not disclose any information, owing to the randomly selected sequences mentioned in Section 4.2 and Section 4.3. Therefore, our proposal achieves security without sacrificing precision.

7 PERFORMANCE EVALUATION
In this section, we evaluate the performance of the proposed schemes using simulations, and compare it with that of the existing proposals in [6], [13], [14]. We use a real-world dataset from the National Science Foundation Research Awards Abstracts 1990-2003 [24], from which we randomly select multiple documents, and conduct experiments on an Intel Core i5 3.2 GHz system.

7.1 Functionality
We compare functionalities between [6], [13], [14] and our schemes in Table 3, where I and II represent FMS(CS) I and FMS(CS) II, respectively.

MRSE [6] can achieve multi-keyword search and coordinate matching using the secure kNN computation scheme. [13] and [14] consider the relevance scores of keywords.
Compared with the other schemes, our FMS(CS) I considers both the relevance scores and the preference factors of keywords. Note that if the search user sets all relevance scores and preference factors of keywords to the same value, the FMS(CS) I degrades to MRSE and coordinate matching can be achieved. In the FMS(CS) II, if the search user sets all preference factors of “OR” operation keywords to the same value, the FMS(CS) II can also achieve coordinate matching of “OR” operation keywords. In particular, the FMS(CS) II achieves fine-grained keyword search operations, i.e., the “AND”, “OR” and “NO” operations of Google Search, which are practical and significantly enhance the functionality of encrypted keyword search.

TABLE 3
Comparison of Functionalities

                         [6]    [13]    [14]    I    II
Multi-keyword search     ✓      ✓       ✓       ✓    ✓
Coordinate matching      ✓      ✓       ✓       ✓    ✓
Relevance score          -      ✓       ✓       ✓    -
Preference factor        -      -       -       ✓    ✓
AND OR NO operations     -      -       -       -    ✓

7.2 Query Complexity
In the FMS(CS) II, we can implement “OR”, “AND” and “NO” operations by defining appropriate keyword weights; this scheme provides a more fine-grained search than [6], [13] and [14]. Suppose the keywords for the “OR”, “AND” and “NO” operations are $(w'_1, w'_2, \cdots, w'_{l_1})$, $(w''_1, w''_2, \cdots, w''_{l_2})$ and $(w'''_1, w'''_2, \cdots, w'''_{l_3})$, respectively. Our FMS(CS) II can complete the search with only one query, whereas [6], [13] and [14] would complete the search through the following steps:
• For the “OR” operation of $l_1$ keywords, they need only one query $Query(w'_1, w'_2, \cdots, w'_{l_1})$ to return a collection of documents with the most matching keywords (i.e., coordinate matching), which can be denoted as $X = Query(w'_1, w'_2, \cdots, w'_{l_1})$.
• For the “AND” operation of $l_2$ keywords, [6], [13] and [14] cannot generate a single query for multiple keywords to achieve the “AND” operation.
Therefore, after issuing $l_2$ queries $\mathrm{Query}(w''_i)$ $(i = 1, 2, \cdots, l_2)$, they can perform the “AND” operation, and the corresponding document set can be denoted as $Y = \mathrm{Query}(w''_1) \cap \mathrm{Query}(w''_2) \cap \cdots \cap \mathrm{Query}(w''_{l_2})$.
• For the “NO” operation of $l_3$ keywords, they first need $l_3$ queries $\mathrm{Query}(w'''_i)$ $(i = 1, 2, \cdots, l_3)$. Then, the document set of the “NO” operation can be denoted as $Z = \mathrm{Query}(w'''_1) \cap \mathrm{Query}(w'''_2) \cap \cdots \cap \mathrm{Query}(w'''_{l_3})$.
• Finally, the document collection that satisfies the “OR”, “AND” and “NO” operations can be represented as $X \cap Y \cap Z$.

As shown in Fig. 5a, 5b and 5c, to achieve these operations, the FMS(CS) II outperforms the existing proposals by generating fewer queries.

Fig. 5. Number of queries, comparing [6], [13], [14] with FMS(CS) II. (a) For different numbers of “AND” and “NO” keywords with the same number of “OR” keywords, $l_1 = 5$. (b) For different numbers of “OR” and “NO” keywords with the same number of “AND” keywords, $l_2 = 5$. (c) For different numbers of “AND” and “OR” keywords with the same number of “NO” keywords, $l_3 = 5$.

7.3 Efficiency

7.3.1 Computation overhead

To demonstrate the computation overhead of our schemes, we analyze each phase in turn.

Index building. Note that the Index building phase of [6] is the same as that of our FMS II scheme, which does not calculate the relevance score, and the Index building phase of the FMS I is the same as that of [13], which includes the relevance score computation. Compared with the cost of building the index, the cost of calculating the relevance score is negligible, so we do not distinguish between the two. Moreover, in our enhanced schemes (FMSCS), we divide the total dictionary into 1 common sub-dictionary and 20 professional sub-dictionaries (assuming that each data owner chooses, on average, 1 common sub-dictionary and 3 professional sub-dictionaries to generate the index). As shown in Fig. 6, the time for building the index is dominated by both the size of the dictionary and the number of documents. Compared with [6], [13], [14] and our FMS schemes, the FMSCS schemes largely reduce the computation overhead.

Fig. 6. Time for building index (time in s; curves: [6] & [13] & FMS, [14], FMSCS). (a) For different sizes of dictionary with the same number of documents, $N = 6000$. (b) For different numbers of documents with the same size of dictionary, $|W| = 4000$.

Trapdoor generating. In the Trapdoor generating phase, [6] and [13] first create a vector according to the search keyword set $\widetilde{W}$, then encrypt the vector with the secure kNN computation scheme. [14] also generates a vector and uses homomorphic encryption to encrypt each dimension. In comparison, our FMS I and FMS II schemes should first generate a super-increasing sequence and a weight sequence, respectively. In practice, however, we can pre-select a corresponding sequence for each scheme and still achieve the search process and privacy, because even if the vectors are the same for multiple queries, the trapdoors will not be the same due to the security of the kNN computation scheme. Therefore, the computation cost of [6], [13] and all FMS schemes in the Trapdoor generating phase is the same. As shown in Fig. 7, the time for generating a trapdoor is dominated by the size of the dictionary rather than the number of query keywords. Hence, our FMSCS schemes are also very efficient in the Trapdoor generating phase.

Fig. 7. Time for generating trapdoor (time in ms; curves: [6] & [13] & FMS, [14], FMSCS). (a) For different sizes of dictionary with the same number of query keywords, $|\widetilde{W}| = 20$. (b) For different numbers of query keywords with the same size of dictionary, $|W| = 4000$.

Query. As [6], [13] and the FMS all adopt the secure kNN computation scheme, their query time is the same. The computation overhead in the Query phase, as shown in Fig. 8, is greatly affected by the size of the dictionary and the number of documents, and has almost no relation to the number of query keywords. Furthermore, our FMSCS schemes significantly reduce the computation cost in the Query phase. As [14] needs to encrypt each dimension of the index/trapdoor using fully homomorphic encryption, its index/trapdoor size is enormous. Note that, in the Trapdoor generating and Query phases, the computation overhead is not affected by the number of query keywords.
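To illustrate why the query-side cost tracks the dictionary size rather than the number of query keywords, the following minimal Python sketch builds a plaintext query vector with one entry per dictionary keyword. It is only an illustration under simplifying assumptions: the binary 0/1 encoding, the toy dictionary and the function names `build_query_vector` and `inner_product` are hypothetical, and the encryption step with the secure kNN computation scheme, as well as the super-increasing and weight sequences of the FMS schemes, is omitted.

```python
def build_query_vector(dictionary, query_keywords):
    """Return a 0/1 vector with one entry per dictionary keyword (hypothetical helper)."""
    wanted = set(query_keywords)
    return [1 if word in wanted else 0 for word in dictionary]


def inner_product(index_vector, query_vector):
    """Plaintext matching score; the loop length equals the dictionary size."""
    return sum(p * q for p, q in zip(index_vector, query_vector))


# Toy dictionary (assumption for illustration only).
dictionary = ["cloud", "search", "privacy", "index", "keyword", "ranking"]

# A document index vector marking which dictionary keywords the document contains.
doc_index = build_query_vector(dictionary, ["cloud", "privacy", "ranking"])

few = build_query_vector(dictionary, ["cloud"])
many = build_query_vector(dictionary, ["cloud", "search", "privacy", "index"])

# Both query vectors have the dictionary's dimension, regardless of keyword count.
print(len(few), len(many))
print(inner_product(doc_index, few), inner_product(doc_index, many))
```

Both vectors have the same length as the dictionary, so any per-dimension processing costs the same for one keyword as for four. This matches the observation above that the dictionary size, not the number of query keywords, drives the cost.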
Thus our FMS and FMSCS schemes are more efficient than some multiple-keyword search schemes [26], [27], whose cost is linear in the number of query keywords.

Fig. 8. Time for query (time in s; curves: [6] & [13] & FMS, [14], FMSCS). (a) For different sizes of dictionary with the same number of documents and number of search keywords, $N = 6000$, $|\widetilde{W}| = 20$. (b) For different numbers of documents with the same size of dictionary and number of search keywords, $|W| = 4000$, $|\widetilde{W}| = 20$. (c) For different numbers of search keywords with the same size of dictionary and number of documents, $N = 6000$, $|W| = 4000$.

7.3.2 Storage overhead

As shown in Table 4, we compare the storage overhead of several schemes. Specifically, we evaluate the storage overhead of three parties: the data owner, the search user and the cloud server.

According to Table 4, in the FMS, the FMSCS and the schemes of [6] and [13], the storage overhead of the data owner is the same. In these schemes, the data owner keeps her secret key $K = (S, M_1, M_2)$ and symmetric key $sk$ locally, where $S$ is an $(m+1)$-dimensional vector, and $M_1$ and $M_2$ are $(m+1) \times (m+1)$ invertible matrices. All elements of $S$, $M_1$ and $M_2$ are floating-point numbers. Since the size of a floating-point number is 4 bytes, the size of $K$ is $4 \cdot (m+1) + 8 \cdot (m+1)^2$ bytes. We assume that the size of $sk$ is $S_{sk}$, which is a constant. Thus, the total storage overhead is $4 \cdot (m+1) + 8 \cdot (m+1)^2 + S_{sk}$ bytes. However, in [14], the storage overhead of the data owner is $\lambda^5/8$ bytes, where $\lambda$ is the security parameter. The storage overhead is 4GB when we choose $\lambda = 128$, which is popular in fully homomorphic encryption schemes. By contrast, the storage overhead of the FMS and the FMSCS is about 763MB when we choose $m = 10000$, which is large enough for a search scheme. Therefore, the FMS and the FMSCS are more efficient than the scheme in [14] in terms of the storage overhead of the data owner.

As shown in Table 4, a search user in the FMS, the FMSCS and the schemes of [6] and [13] keeps the secret key $K = (S, M_1, M_2)$ and the symmetric key $sk$ locally. Therefore, the total storage overhead is $4 \cdot (m+1) + 8 \cdot (m+1)^2 + S_{sk}$ bytes. However, in [14], the storage overhead is $\lambda^5/8 + \lambda^2/8$ bytes. The storage overhead is 4GB when we choose $\lambda = 128$, which is popular in fully homomorphic encryption schemes. By contrast, the storage overhead of the FMS and the FMSCS is about 763MB when we choose $m = 10000$, which is large enough for a search scheme. Therefore, the FMS and the FMSCS are more efficient than the scheme in [14] in terms of the storage overhead of the search user.

The cloud server stores the encrypted documents and the indexes. The size of the encrypted documents is the same in all schemes, i.e., $N \cdot D_s$. For the indexes, in the FMS and the schemes in [6] and [13], the storage overhead is $8 \cdot (m+1) \cdot N$ bytes. In the FMSCS, the storage overhead is $8 \cdot \theta \cdot (m+1) \cdot N$ bytes, where $0 < \theta < 1$. When $m = 1000$ and $N = 10000$, which are large enough for a search scheme, the storage overhead of the indexes is about 132MB in the FMSCS. In the schemes of [6] and [13] as well as the FMS, the size of the indexes is 760MB under the same conditions. In the scheme in [14], the storage overhead is $N \cdot D_s + m \cdot N \cdot \lambda^5/8$ bytes; it is 4GB when we choose $\lambda = 128$, which is popular in fully homomorphic encryption schemes.
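The figures quoted above can be checked with a short calculation. The sketch below evaluates the storage formulas from the text and Table 4, assuming 4-byte floats as stated and binary units (1MB = $2^{20}$ bytes, 1GB = $2^{30}$ bytes); the choice of binary units is an assumption of this sketch, and the function names are hypothetical.

```python
MIB = 2 ** 20  # bytes per MB, assuming binary units
GIB = 2 ** 30  # bytes per GB, assuming binary units


def knn_key_bytes(m):
    """Size of K = (S, M1, M2): 4(m+1) + 8(m+1)^2 bytes, with 4-byte floats."""
    return 4 * (m + 1) + 8 * (m + 1) ** 2


def fhe_key_bytes(lam):
    """Data-owner storage in [14] according to Table 4: lambda^5 / 8 bytes."""
    return lam ** 5 // 8


def index_bytes(m, n_docs, theta=1):
    """Index storage 8 * theta * (m+1) * N bytes (theta = 1 for FMS, [6], [13])."""
    return 8 * theta * (m + 1) * n_docs


print(round(knn_key_bytes(10000) / MIB, 1))       # 763.1
print(fhe_key_bytes(128) / GIB)                   # 4.0
print(round(index_bytes(1000, 10000) / MIB, 1))   # 76.4
print(round(index_bytes(10000, 10000) / MIB, 1))  # 763.0
```

Under these assumptions the key sizes agree with the quoted 763MB and 4GB. For the index, $8 \cdot (m+1) \cdot N$ with $N = 10000$ gives roughly 76MB at $m = 1000$ and roughly 763MB at $m = 10000$, so the quoted 760MB corresponds to the larger dictionary size.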
Therefore, the FMS and the FMSCS are more efficient than the scheme in [14] in terms of the storage overhead of the cloud server.

7.3.3 Communication overhead

As shown in Table 5, we compare the communication overhead of several schemes. Specifically, we consider the communication overhead in three parts: the communication between the data owner and the cloud server (abbreviated as D-C), the communication between the search user and the cloud server (abbreviated as C-S) and the communication between the data owner and the search user (abbreviated as D-S).

D-C. In the FMS as well as the schemes of [6] and [13], the data owner sends information to the cloud server in the form of $C_j \| FID_j \| I_j$ $(j = 1, 2, \cdots, N)$, where $C_j$ is the encrypted document, $FID_j$ is the identity of the document and $I_j$ is the index. We assume that the average size of a document is $D_s$; thus the total size of the documents is $N \cdot D_s$. We assume that the identity $FID$ of an encrypted document is a 10-byte string, so the total size of the identities is $10 \cdot N$ bytes. The index $I_j = (p_a M_1, p_b M_2)$ contains two $(m+1)$-dimensional vectors. Each dimension is a floating-point number of 4 bytes. Thus, the total size of the indexes is $8 \cdot (m+1) \cdot N$ bytes, and the total communication overhead is $8 \cdot (m+1) \cdot N + 10 \cdot N + N \cdot D_s$ bytes. In the FMSCS, the total communication overhead is $8 \cdot \theta \cdot (m+1) \cdot N + 10 \cdot N + N \cdot D_s$ bytes. If we choose $\theta = 0.2$, the size of the indexes is $1.6 \cdot (m+1) \cdot N$ bytes, and the total communication overhead of the FMSCS is $1.6 \cdot (m+1) \cdot N + 10 \cdot N + D_s \cdot N$ bytes. However, in [14], the communication overhead is $N \cdot D_s + m \cdot N \cdot \lambda^5/8$ bytes, where $\lambda$ is the security parameter. If we choose $\lambda = 128$, which is popular in fully homomorphic encryption schemes, and $m = 1000$ and $N = 10000$, which are large enough for a search scheme, the FMS and the FMSCS are more efficient than the scheme in [14] in terms of the communication overhead of D-C.

TABLE 4
Comparison of Storage Overhead (Bytes). ($m$ represents the size of dictionary; $N$ represents the number of documents; $D_s$ represents the average size of each encrypted document; $\lambda$ represents the security parameter; $\theta$ represents the decrease rate of dictionary by using our classified sub-dictionaries technology; $S_{sk}$ represents the size of symmetric key.)

             | [14]                                        | [6], [13] and FMS                            | FMSCS
Data Owner   | $\lambda^5/8$                               | $4 \cdot (m+1) + 8 \cdot (m+1)^2 + S_{sk}$   | $4 \cdot (m+1) + 8 \cdot (m+1)^2 + S_{sk}$
Search User  | $\lambda^5/8 + \lambda^2/8$                 | $4 \cdot (m+1) + 8 \cdot (m+1)^2 + S_{sk}$   | $4 \cdot (m+1) + 8 \cdot (m+1)^2 + S_{sk}$
Cloud Server | $N \cdot D_s + m \cdot N \cdot \lambda^5/8$ | $N \cdot D_s + 8 \cdot (m+1) \cdot N$        | $N \cdot D_s + 8 \cdot \theta \cdot (m+1) \cdot N$

TABLE 5
Comparison of Communication Overhead (Bytes). ($m$ represents the size of dictionary; $N$ represents the number of documents; $D_s$ represents the average size of each encrypted document; $T$ represents the number of returned documents; $\lambda$ represents the security parameter; $\theta$ represents the decrease rate of dictionary by using our classified sub-dictionaries technology; $S_{sk}$ represents the size of symmetric key.)

    | [14]                                        | [6], [13] and FMS                                   | FMSCS
D-C | $N \cdot D_s + m \cdot N \cdot \lambda^5/8$ | $8 \cdot (m+1) \cdot N + 10 \cdot N + N \cdot D_s$  | $8 \cdot \theta \cdot (m+1) \cdot N + 10 \cdot N + N \cdot D_s$
C-S | $m \cdot \lambda^5/8 + T \cdot D_s$         | $8 \cdot (m+1) + T \cdot D_s$                       | $8 \cdot \theta \cdot (m+1) + T \cdot D_s$
D-S | $\lambda^5/8 + \lambda^2/8$                 | $4 \cdot (m+1) + 8 \cdot (m+1)^2 + S_{sk}$          | $4 \cdot (m+1) + 8 \cdot (m+1)^2 + S_{sk}$

C-S. The C-S consists of two phases: Query and Results returning.
In the Query phase, a search user in the FMS as well as the schemes in [6] and [13] sends the trapdoor to the cloud server in the form of $T_{\widetilde{W}} = (M_1^{-1} q_a, M_2^{-1} q_b)$, which contains two $(m+1)$-dimensional vectors. Thus, the communication overhead is $8 \cdot (m+1)$ bytes. In the FMSCS, the communication overhead is $8 \cdot \theta \cdot (m+1)$ bytes $(0 < \theta < 1)$. In the Results returning phase, the cloud server sends the corresponding results to the search user. The communication overhead of C-S increases with the number of returned documents at this point. We assume that the number of returned documents is $T$; thus, the communication overhead from the cloud server to the search user is $T \cdot D_s$ bytes. Therefore, in the FMS as well as the schemes in [6] and [13], the total communication overhead of C-S is $8 \cdot (m+1) + T \cdot D_s$ bytes, and in the FMSCS it is $8 \cdot \theta \cdot (m+1) + T \cdot D_s$ bytes. In [14], the total communication overhead of C-S is $m \cdot \lambda^5/8 + T \cdot D_s$ bytes. If we choose $\lambda = 128$, which is popular in fully homomorphic encryption schemes, and $m = 1000$ and $N = 10000$, which are large enough for a search scheme, the FMS and the FMSCS are more efficient than the scheme in [14] in terms of the communication overhead of C-S.

D-S. From Table 5, we can see that the communication overhead of the FMS, the FMSCS and the schemes in [6] and [13] is the same. In the Initialization phase, the data owner sends the secret key $K = (S, M_1, M_2)$ and the symmetric key $sk$ to the search user, where $S$ is an $(m+1)$-dimensional vector, and $M_1$ and $M_2$ are $(m+1) \times (m+1)$ invertible matrices. Thus, the size of the secret key $K$ is $4 \cdot (m+1) + 8 \cdot (m+1)^2$ bytes. Therefore, the total communication overhead is $4 \cdot (m+1) + 8 \cdot (m+1)^2 + S_{sk}$ bytes, where $S_{sk}$ represents the size of the symmetric key. However, the communication overhead of the scheme in [14] is $\lambda^5/8 + \lambda^2/8$ bytes. The communication overhead is 4GB when we choose $\lambda = 128$, which is popular in fully homomorphic encryption schemes. By contrast, the communication overhead of the FMS and the FMSCS is about 763MB when we choose $m = 10000$, which is large enough for a search scheme. Therefore, the FMS and the FMSCS are more efficient than the scheme in [14] in terms of the communication overhead of D-S.

8 RELATED WORK

There are mainly two types of searchable encryption in the literature: Searchable Public-key Encryption (SPE) and Searchable Symmetric Encryption (SSE).

8.1 SPE

SPE was first proposed by Boneh et al. [28]; it supports single-keyword search on encrypted data, but its computation overhead is heavy. In the framework of SPE, Boneh et al. [27] propose conjunctive, subset, and range queries on encrypted data. Hwang et al. [29] propose a conjunctive keyword scheme which supports multi-keyword search. Zhang et al. [17] propose an efficient public key encryption with conjunctive-subset keywords search. However, these conjunctive keyword schemes can only return the results which match all the keywords simultaneously, and cannot rank the returned results. Liu et al. [30] propose a ranked query scheme which uses a mask matrix to achieve cost-effectiveness. Yu et al. [14] propose a multi-keyword top-k retrieval scheme with fully homomorphic encryption, which can return ranked results and achieve high security. In general, although SPE allows more expressive queries than SSE [13], it is less efficient, and therefore we adopt SSE in this work.

8.2 SSE

The concept of SSE was first developed by Song et al. [8]. Wang et al. [25] develop the ranked keyword search scheme, which considers the relevance score of a keyword. However, the above schemes cannot efficiently support multi-keyword search, which is widely used to provide a better experience to the search user. Later, Sun et al. [13] propose a multi-keyword search scheme which considers the relevance scores of keywords and achieves efficient queries by utilizing the multidimensional tree technique. A widely adopted multi-keyword search approach is multi-keyword ranked search (MRSE) [6]. This approach can return the ranked search results according to the number of matching keywords. Li et al. [10] utilize the relevance score and k-nearest neighbor techniques to develop an efficient multi-keyword search scheme that can return the ranked search results based on accuracy. Within this framework, they leverage an efficient index to further improve the search efficiency, and adopt the blind storage system to conceal the access pattern of the search user. Li et al. [19] also propose an authorized and ranked multi-keyword search scheme (ARMS) over encrypted cloud data by leveraging ciphertext-policy attribute-based encryption (CP-ABE) and SSE techniques.
Security analysis demonstrates that the proposed ARMS scheme can achieve collusion resistance. In this paper, we propose FMS(CS) schemes which not only support multi-keyword search over encrypted data, but also achieve fine-grained keyword search that takes into account the relevance scores and the preference factors of keywords and, more importantly, the logical rule of keywords. In addition, with the classified sub-dictionaries, our proposal is efficient in terms of index building, trapdoor generating and query.

9 CONCLUSION

In this paper, we have investigated the fine-grained multi-keyword search (FMS) problem over encrypted cloud data, and proposed two FMS schemes. The FMS I includes both the relevance scores and the preference factors of keywords to enable more precise search and a better user experience, respectively. The FMS II achieves secure and efficient search with practical functionality, i.e., “AND”, “OR” and “NO” operations of keywords. Furthermore, we have proposed enhanced schemes supporting classified sub-dictionaries (FMSCS) to improve efficiency.

For future work, we intend to extend the proposal to consider the extensibility of the file set and multi-user cloud environments. In this direction, we have obtained some preliminary results on extensibility [5] and multi-user cloud environments [19]. Another interesting topic is to develop highly scalable searchable encryption to enable efficient search on large practical databases.

ACKNOWLEDGMENT

This work is supported by the National Natural Science Foundation of China under Grants 61472065, 61350110238, 61103207, U1233108, U1333127, and 61272525, the International Science and Technology Cooperation and Exchange Program of Sichuan Province, China under Grant 2014HH0029, the China Postdoctoral Science Foundation funded project under Grant 2014M552336, and the State Key Laboratory of Information Security Open Foundation under Grant 2015-MS-02.

REFERENCES

[1] H. Liang, L. X. Cai, D. Huang, X. Shen, and D. Peng, “An SMDP-based service model for interdomain resource allocation in mobile cloud networks,” IEEE Transactions on Vehicular Technology, vol. 61, no. 5, pp. 2222–2232, 2012.
[2] M. M. Mahmoud and X. Shen, “A cloud-based scheme for protecting source-location privacy against hotspot-locating attack in wireless sensor networks,” IEEE Transactions on Parallel and Distributed Systems, vol. 23, no. 10, pp. 1805–1818, 2012.
[3] Q. Shen, X. Liang, X. Shen, X. Lin, and H. Luo, “Exploiting geo-distributed clouds for e-health monitoring system with minimum service delay and privacy preservation,” IEEE Journal of Biomedical and Health Informatics, vol. 18, no. 2, pp. 430–439, 2014.
[4] T. Jung, X. Mao, X. Li, S.-J. Tang, W. Gong, and L. Zhang, “Privacy-preserving data aggregation without secure channel: multivariate polynomial evaluation,” in Proceedings of INFOCOM. IEEE, 2013, pp. 2634–2642.
[5] Y. Yang, H. Li, W. Liu, H. Yang, and M. Wen, “Secure dynamic searchable symmetric encryption with constant document update cost,” in Proceedings of GLOBECOM. IEEE, 2014, to appear.
[6] N. Cao, C. Wang, M. Li, K. Ren, and W. Lou, “Privacy-preserving multi-keyword ranked search over encrypted cloud data,” IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 1, pp. 222–233, 2014.
[7] https://support.google.com/websearch/answer/173733?hl=en.
[8] D. X. Song, D. Wagner, and A. Perrig, “Practical techniques for searches on encrypted data,” in Proceedings of S&P. IEEE, 2000, pp. 44–55.
[9] R. Li, Z. Xu, W. Kang, K. C. Yow, and C.-Z. Xu, “Efficient multi-keyword ranked query over encrypted data in cloud computing,” Future Generation Computer Systems, vol. 30, pp. 179–190, 2014.
[10] H. Li, D. Liu, Y. Dai, T. H. Luan, and X. Shen, “Enabling efficient multi-keyword ranked search over encrypted cloud data through blind storage,” IEEE Transactions on Emerging Topics in Computing, 2014, DOI: 10.1109/TETC.2014.2371239.
[11] C. Wang, N. Cao, J. Li, K. Ren, and W. Lou, “Secure ranked keyword search over encrypted cloud data,” in Proceedings of ICDCS. IEEE, 2010, pp. 253–262.
[12] A. Boldyreva, N. Chenette, Y. Lee, and A. O’Neill, “Order-preserving symmetric encryption,” in Advances in Cryptology-EUROCRYPT. Springer, 2009, pp. 224–241.
[13] W. Sun, B. Wang, N. Cao, M. Li, W. Lou, Y. T. Hou, and H. Li, “Verifiable privacy-preserving multi-keyword text search in the cloud supporting similarity-based ranking,” IEEE Transactions on Parallel and Distributed Systems, DOI: 10.1109/TPDS.2013.282, 2013.
[14] J. Yu, P. Lu, Y. Zhu, G. Xue, and M. Li, “Towards secure multi-keyword top-k retrieval over encrypted cloud data,” IEEE Transactions on Dependable and Secure Computing, vol. 10, no. 4, pp. 239–250, 2013.
[15] A. Arvanitis and G. Koutrika, “Towards preference-aware relational databases,” in International Conference on Data Engineering (ICDE). IEEE, 2012, pp. 426–437.
[16] G. Koutrika, E. Pitoura, and K. Stefanidis, “Preference-based query personalization,” in Advanced Query Processing. Springer, 2013, pp. 57–81.
[17] B. Zhang and F. Zhang, “An efficient public key encryption with conjunctive-subset keywords search,” Journal of Network and Computer Applications, vol. 34, no. 1, pp. 262–267, 2011.
[18] D. Stinson, Cryptography: Theory and Practice. CRC Press, 2006.
[19] H. Li, D. Liu, K. Jia, and X. Lin, “Achieving authorized and ranked multi-keyword search over encrypted cloud data,” in Proceedings of ICC. IEEE, 2015, to appear.
[20] S. Zerr, E. Demidova, D. Olmedilla, W. Nejdl, M. Winslett, and S. Mitra, “Zerber: r-confidential indexing for distributed documents,” in Proceedings of the 11th International Conference on Extending Database Technology: Advances in Database Technology. ACM, 2008, pp. 287–298.
[21] W. K. Wong, D. W.-l. Cheung, B. Kao, and N. Mamoulis, “Secure kNN computation on encrypted databases,” in Proceedings of SIGMOD International Conference on Management of Data. ACM, 2009, pp. 139–152.
[22] J. Zobel and A. Moffat, “Exploring the similarity space,” in ACM SIGIR Forum, vol. 32, no. 1. ACM, 1998, pp. 18–34.
[23] N. Ferguson, R. Schroeppel, and D. Whiting, “A simple algebraic representation of Rijndael,” in Selected Areas in Cryptography. Springer, 2001, pp. 103–111.
[24] http://kdd.ics.uci.edu/databases/nsfabs/nsfawards.html.
[25] C. Wang, N. Cao, K. Ren, and W. Lou, “Enabling secure and efficient ranked keyword search over outsourced cloud data,” IEEE Transactions on Parallel and Distributed Systems, vol. 23, no. 8, pp. 1467–1479, 2012.
[26] P. Golle, J. Staddon, and B. Waters, “Secure conjunctive keyword search over encrypted data,” in Applied Cryptography and Network Security. Springer, 2004, pp. 31–45.
[27] D. Boneh and B. Waters, “Conjunctive, subset, and range queries on encrypted data,” in Theory of Cryptography. Springer, 2007, pp. 535–554.
[28] D. Boneh, G. Di Crescenzo, R. Ostrovsky, and G. Persiano, “Public key encryption with keyword search,” in Advances in Cryptology–Eurocrypt. Springer, 2004, pp. 506–522.
[29] Y. Hwang and P. Lee, “Public key encryption with conjunctive keyword search and its extension to a multi-user system,” in Proceedings of Pairing. Springer, 2007, pp. 2–22.
[30] Q. Liu, C. C. Tan, J. Wu, and G. Wang, “Efficient information retrieval for ranked queries in cost-effective cloud environments,” in Proceedings of INFOCOM. IEEE, 2012, pp. 2581–2585.

Hongwei Li is an Associate Professor with the School of Computer Science and Engineering, University of Electronic Science and Technology of China, China.
He received the PhD degree in computer software and theory from University of Electronic Science and Technology of China, China in 2008. He worked as a Post-Doctoral Fellow in the Department of Electrical and Computer Engineering at University of Waterloo for one year until Oct. 2012. His research interests include network security, applied cryptography, and trusted computing. Dr. Li serves as the Associate Editor of Peer-to-Peer Networking and Applications, and the Guest Editor for the Peer-to-Peer Networking and Applications Special Issue on Security and Privacy of P2P Networks in Emerging Smart City. He also serves on the technical program committees for many international conferences such as IEEE INFOCOM, IEEE ICC, IEEE GLOBECOM, IEEE WCNC, IEEE SmartGridComm, BODYNETS and IEEE DASC. He is a member of IEEE, a member of China Computer Federation and a member of China Association for Cryptologic Research.

Yi Yang received his B.S. degree in Network Engineering from Tianjin University of Science and Technology (TUST) in 2012. Currently, he is a master student at the School of Computer Science and Engineering, University of Electronic Science and Technology of China (UESTC), China. He serves as the reviewer of Peer-to-Peer Networking and Application, IEEE INFOCOM, IEEE ICC, IEEE GLOBECOM, IEEE ICCC, etc. His research interests include cryptography and the secure smart grid.

Tom H. Luan received the B.Sc. degree from Xian Jiaotong University, China, in 2004, the M.Phil. degree from Hong Kong University of Science and Technology, Hong Kong, China, in 2007, and the Ph.D. degree from the University of Waterloo, Canada, in 2012. Since December 2013, he has been the Lecturer in Mobile and Apps at the School of Information Technology, Deakin University, Melbourne Burwood, Australia. His research mainly focuses on vehicular networking, wireless content distribution, peer-to-peer networking and mobile cloud computing.

Xiaohui Liang received the B.Sc. degree in Computer Science and Engineering and the M.Sc. degree in Computer Software and Theory from Shanghai Jiao Tong University (SJTU), China, in 2006 and 2009, respectively. He is currently working toward a Ph.D. degree in the Department of Electrical and Computer Engineering, University of Waterloo, Canada. His research interests include applied cryptography, and security and privacy issues for e-healthcare system, cloud computing, mobile social networks, and smart grid.

Liang Zhou is a professor with the National Key Lab of Science and Technology on Communication at University of Electronic Science and Technology of China, China. His current research interests include error control coding, secure communication and cryptography.

Xuemin (Sherman) Shen is a Professor and University Research Chair, Department of Electrical and Computer Engineering, University of Waterloo, Canada. He was the Associate Chair for Graduate Studies from 2004 to 2008. Dr. Shen’s research focuses on resource management in interconnected wireless/wired networks, wireless network security, vehicular ad hoc and sensor networks. Dr. Shen served as the Technical Program Committee Chair for IEEE VTC’10 Fall and IEEE Globecom’07. He also serves/served as the Editor-in-Chief for IEEE Network, Peer-to-Peer Networking and Application, and IET Communications; a Founding Area Editor for IEEE Transactions on Wireless Communications; an Associate Editor for IEEE Transactions on Vehicular Technology, Computer Networks. Dr. Shen is a registered Professional Engineer of Ontario, Canada, an IEEE Fellow, an Engineering Institute of Canada Fellow, a Canadian Academy of Engineering Fellow, and a Distinguished Lecturer of IEEE Vehicular Technology Society and Communications Society.