Maximizing P2P File Access Availability in Mobile Ad Hoc Networks through Replication for Efficient File Sharing

File sharing applications in mobile ad hoc networks (MANETs) have attracted more and more attention in recent years. The efficiency of file querying suffers from the distinctive properties of such networks including node mobility and limited communication range and resource. An intuitive method to alleviate this problem is to create file replicas in the network. However, despite the efforts on file replication, no research has focused on the global optimal replica creation with minimum average querying delay.

Specifically, current file replication protocols in mobile ad hoc networks have two shortcomings. First, they lack a rule to allocate limited resources to different files in order to minimize the average querying delay. Second, they simply consider storage as the available resource for replicas, but neglect the fact that the file holders’ frequency of meeting other nodes also plays an important role in determining file availability. Actually, a node that has a higher meeting frequency with others provides higher availability to its files. This becomes even more evident in sparsely distributed MANETs, in which nodes meet each other only opportunistically.

In this paper, we introduce a new concept of resource for file replication, which considers both node storage and node meeting ability. We theoretically study the influence of resource allocation on the average querying delay and derive an optimal file replication rule (OFRR) that allocates resources to each file based on its popularity and size. We then propose a file replication protocol based on the rule, which approximates the minimum global querying delay in a fully distributed manner. Our experiment and simulation results show the superior performance of the proposed protocol in comparison with other representative replication protocols.

1.2 INTRODUCTION

With the increasing popularity of mobile devices, e.g., smartphones and laptops, we envision a future of MANETs consisting of these mobile devices. By MANETs, we refer to both normal MANETs and disconnected MANETs, also known as delay tolerant networks (DTNs). The former has a relatively dense node distribution in an area, while the latter has sparsely distributed nodes that meet each other opportunistically. On the other hand, the emergence of mobile file sharing applications motivates peer-to-peer (P2P) file sharing over such MANETs. The local P2P file sharing model provides three advantages. First, it enables file sharing when no base stations are available (e.g., in rural areas). Second, with the P2P architecture, the bottleneck of overloaded servers in current client-server based file sharing systems can be avoided. Third, it exploits otherwise wasted peer-to-peer communication opportunities among mobile nodes. As a result, nodes can freely and unobtrusively access and share files in the distributed MANET environment, which can support interesting applications.

For example, mobile nodes can share files based on users’ proximity in the same building or in a local community. Tourists can share their travel experiences or emergency information with other tourists through digital devices directly, even when no base station is available in remote areas. Drivers can share road information through vehicle-to-vehicle communication. However, the distinctive properties of MANETs, i.e., node mobility and limited communication range and resources, have rendered many difficulties in realizing such a P2P file sharing system. For example, file searching turns out to be difficult since nodes in MANETs move around freely and can exchange information only when they are within communication range. Broadcasting can quickly discover files, but it leads to the broadcast storm problem with high energy consumption.

Probabilistic routing and file discovery protocols avoid broadcasting by forwarding a query to a node with higher probability of meeting the destination. But the opportunistic encountering of nodes in MANETs makes file searching and retrieval non-deterministic. File replication is an effective way to enhance file availability and reduce file querying delay. It creates replicas for a file to improve its probability of being encountered by requests. Unfortunately, it is impractical and inefficient to enable every node to hold the replicas of all files in the system considering limited node resources. Also, file querying delay is always a main concern in a file sharing system. Users often desire to receive their requested files quickly no matter whether the files are popular or not. Thus, a critical issue is raised for further investigation: how to allocate the limited resource in the network to different files for replication so that the overall average file querying delay is minimized? Recently, a number of file replication protocols have been proposed for MANETs. In these protocols, each individual node replicates files it frequently queries or a group of nodes create one replica for each file they frequently query. In the former, redundant replicas are easily created in the system, thereby wasting resources.

In the latter, though redundant replicas are reduced by group-based cooperation, neighboring nodes may separate from each other due to node mobility, leading to large query delays. There are also some works addressing content caching in disconnected MANETs/DTNs for efficient data retrieval or message routing. They basically cache data that are frequently queried at places that are frequently visited by mobile nodes. Both categories of replication methods fail to thoroughly consider that a node’s mobility affects the availability of its files. In spite of these efforts, current file replication protocols lack a rule to allocate limited resources to files for replica creation in order to achieve the minimum average querying delay, i.e., global search efficiency optimization under limited resources. They simply consider storage as the resource for replicas, but neglect that a node’s frequency of meeting other nodes (meeting ability for short) also influences the availability of its files. Files on a node with higher meeting ability have higher availability.

1.3 LITERATURE SURVEY

CONTACT DURATION AWARE DATA REPLICATION IN DELAY TOLERANT NETWORKS

AUTHOR: X. Zhuo, Q. Li, W. Gao, G. Cao, and Y. Dai

PUBLISH: Proc. IEEE 19th Int’l Conf. Network Protocols (ICNP), 2011.

EXPLANATION:

The recent popularization of hand-held mobile devices, such as smartphones, enables the inter-connectivity among mobile users without the support of Internet infrastructure. When mobile users move and contact each other opportunistically, they form a Delay Tolerant Network (DTN), which can be exploited to share data among them. Data replication is one of the common techniques for such data sharing. However, the unstable network topology and limited contact duration in DTNs make it difficult to directly apply traditional data replication schemes. Although there are a few existing studies on data replication in DTNs, they generally ignore the contact duration limits. In this paper, we recognize the deficiency of existing data replication schemes which treat the complete data item as the replication unit, and propose to replicate data at the packet level. We analytically formulate the contact duration aware data replication problem and give a centralized solution to better utilize the limited storage buffers and the contact opportunities. We further propose a practical contact Duration Aware Replication Algorithm (DARA) which operates in a fully distributed manner and reduces the computational complexity. Extensive simulations on both synthetic and realistic traces show that our distributed scheme achieves close-to-optimal performance, and outperforms other existing replication schemes.
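As a rough, illustrative sketch of the packet-level idea summarized above (our own simplified Java code, not DARA itself; DARA's actual replication decision is utility-based and buffer-aware), a node replicates only as many packets of a data item as the expected contact duration allows, rather than insisting on the complete item:

    // Illustrative only: how many packets of a data item fit into one contact,
    // given the expected contact duration, link bandwidth, and packet size.
    class PacketLevelReplication {
        static int replicablePackets(int totalPackets, double expectedContactSeconds,
                                     double bandwidthBytesPerSec, int packetSizeBytes) {
            int fit = (int) (expectedContactSeconds * bandwidthBytesPerSec / packetSizeBytes);
            return Math.min(totalPackets, Math.max(0, fit));
        }
    }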

SOCIAL-BASED COOPERATIVE CACHING IN DTNS: A CONTACT DURATION AWARE APPROACH

AUTHOR: X. Zhuo, Q. Li, G. Cao, Y. Dai, B.K. Szymanski, and T.L. Porta,

PUBLISH: Proc. IEEE Eighth Int’l Conf. Mobile Adhoc and Sensor Systems (MASS), 2011.

EXPLANATION:

Data access is an important issue in Delay Tolerant Networks (DTNs), and a common technique to improve the performance of data access is cooperative caching. However, due to the unpredictable node mobility in DTNs, traditional caching schemes cannot be directly applied. In this paper, we propose DAC, a novel caching protocol adaptive to the challenging environment of DTNs. Specifically, we exploit the social community structure to combat the unstable network topology in DTNs. We propose a new centrality metric to evaluate the caching capability of each node within a community, and solutions based on this metric are proposed to determine where to cache. More importantly, we consider the impact of the contact duration limitation on cooperative caching, which has been ignored by the existing works. We prove that the marginal caching benefit that a node can provide diminishes when more data is cached. We derive an adaptive caching bound for each mobile node according to its specific contact patterns with others, to limit the amount of data it caches. In this way, both the storage space and the contact opportunities are better utilized. To mitigate the coupon collector’s problem, network coding techniques are used to further improve the caching efficiency. Extensive trace-driven simulations show that our cooperative caching protocol can significantly improve the performance of data access in DTNs.

SEDUM: EXPLOITING SOCIAL NETWORKS IN UTILITY-BASED DISTRIBUTED ROUTING FOR DTNS

AUTHOR: Z. Li and H. Shen

PUBLISH: IEEE Trans. Computers, vol. 62, no. 1, pp. 83-97, Jan. 2013.

EXPLANATION:

However, current probabilistic forwarding methods only consider node contact frequency in calculating the utility while neglecting the influence of contact duration on the throughput, though both contact frequency and contact duration reflect the node movement pattern in a social network. In this paper, we theoretically prove that considering both factors leads to higher throughput than considering only contact frequency. To fully exploit a social network for high throughput and low routing delay, we propose a Social network oriented and duration utility-based distributed multicopy routing protocol (SEDUM) for DTNs. SEDUM is distinguished by three features. First, it considers both contact frequency and duration in node movement patterns of social networks. Second, it uses multicopy routing and can discover the minimum number of copies of a message to achieve a desired routing delay. Third, it has an effective buffer management mechanism to increase throughput and decrease routing delay. Theoretical analysis and simulation results show that SEDUM provides high throughput and low routing delay compared to existing routing approaches. The results conform to our expectation that considering both contact frequency and duration for delivery utility in routing can achieve higher throughput than considering only contact frequency, especially in a highly dynamic environment with large routing messages.
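To illustrate how the two factors can be combined (a minimal sketch under our own assumptions; SEDUM's exact utility definition is not reproduced here), a duration-aware delivery utility can weight how often a pair of nodes meets by how long their contacts last, e.g., the expected contact time per unit time:

    // Illustrative utility: contact frequency (contacts per unit time)
    // multiplied by the mean contact duration, i.e., the expected contact
    // time per unit time between a pair of nodes.
    class DurationAwareUtility {
        static double utility(double contactsPerUnitTime, double meanContactDurationSeconds) {
            return contactsPerUnitTime * meanContactDurationSeconds;
        }
    }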

CHAPTER 2

2.0 SYSTEM ANALYSIS

2.1 EXISTING SYSTEM:

This work focuses on Delay Tolerant Networks (DTNs) in a social network environment. DTNs do not have a complete path from a source to a destination most of the time. Previous data routing approaches in DTNs are primarily based on either flooding or single-copy routing. However, these methods incur either high overhead due to excessive transmissions or long delays due to suboptimal choices for relay nodes. Probabilistic forwarding that forwards a message to a node with a higher delivery utility enhances single-copy routing.

File sharing applications in mobile ad hoc networks (MANETs) have attracted increasing attention, but the efficiency of file querying suffers from the distinctive properties of MANETs, including node mobility and limited communication range and resources. An intuitive method to alleviate this problem is to create file replicas in the network. However, despite the efforts on file replication, no existing work achieves globally optimal replica creation with minimum average querying delay. In addition, communication links between mobile nodes are transient, and network maintenance overhead is a major performance bottleneck for data transmission. Low node density makes it difficult to establish end-to-end connections, impeding a continuous end-to-end path between a source and a destination.

DTN technology was originally designed for communication in outer space but is now directly accessible from devices in our pockets. Considering both the characteristics of MANETs and the requirements of P2P file sharing, an application-layer overlay network is built that ports a DTN-style solution into an infrastructure-less environment such as MANETs and leverages peer mobility to reach data in other disconnected networks. This is done by implementing an asynchronous communication model, store-delegate-and-forward, like DTNs, in which a peer can delegate unaccomplished file download or query tasks to special peers. To improve data transmission performance while reducing communication overhead, these special peers are selected by the expectation of encountering them again in the future, and each is assigned a different download starting point in the file.
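A minimal sketch of the delegation step described above (our own illustration; peer selection by re-encounter expectation is abstracted away, and all names are hypothetical) assigns each chosen delegate a different starting offset in the file so their partial downloads cover different parts of it:

    import java.util.*;

    class DownloadDelegation {
        // Give each delegate peer a distinct starting offset so that the
        // delegated partial downloads do not overlap on the same file region.
        static Map<String, Long> assignStartingPoints(List<String> delegatePeers, long fileSizeBytes) {
            Map<String, Long> startingPoint = new LinkedHashMap<String, Long>();
            if (delegatePeers.isEmpty()) return startingPoint;
            long segment = fileSizeBytes / delegatePeers.size();
            for (int i = 0; i < delegatePeers.size(); i++) {
                startingPoint.put(delegatePeers.get(i), i * segment);
            }
            return startingPoint;
        }
    }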

2.1.1 DISADVANTAGES:

  • Limited communication range and resource have rendered many difficulties in realizing such a P2P file sharing system. For example, file searching turns out to be difficult since nodes in MANETs move around freely and can exchange information only when they are within the communication range.
  • It lacks transparency: a received URL explicitly points to a certain data replica, so the browser becomes aware of the switching between different machines.
  • Scalability is also limited: since every client must contact the same single service machine, that machine becomes a bottleneck as the number of clients increases.


2.2 PROPOSED SYSTEM:

We propose a distributed file replication protocol that can approximately realize the optimal file replication rule (OFRR) under the two mobility models in a distributed manner. Since the OFRR in the two mobility models (i.e., Equations (22) and (28)) has the same form, we present the protocol without indicating the specific mobility model. We first introduce the challenges to realizing the OFRR and our solutions. We then propose a replication protocol to realize OFRR and analyze the effect of the protocol.

We propose the priority competition and split file replication protocol (PCS). We first introduce how a node retrieves the parameters needed in PCS and then present the details of PCS; we also briefly prove the effectiveness of PCS. We refer to the process in which a node tries to copy a file to its neighbors as one round of replica distribution. Recall that when a replica is created for a file with priority P, the two copies will replicate the file with priority P/2 each in the next round. This means that the creation of replicas does not increase the overall P of the file. Also, after each round, the priority value of each file or replica is updated based on the received requests for the file.

Then, though some replicas may be deleted in the competition, the total amount of requests for the file remains stable, making the sum of the priorities of all replicas and the original file roughly equal to the overall priority value of the file. We can therefore regard the replicas of a file as an entity that competes for the available resources in the system with accumulated priority P in each round. Therefore, in each round of replica distribution, based on our design of PCS, the overall probability of creating a replica for an original file is determined by its accumulated priority P.
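The priority bookkeeping described above can be sketched as follows (a simplified Java illustration of the rule as stated here, not the authors' implementation; class, field, and parameter names are our own). A replication halves the priority carried by each copy, and each round refreshes a copy's priority from the requests it actually received, so the priorities of the original file and its replicas keep summing to roughly the file's overall priority:

    // Minimal sketch of per-copy priority bookkeeping in PCS (assumed simplification).
    class FileCopy {
        final String fileId;
        double priority;      // P: competition priority carried by this copy
        int requestsSeen;     // requests received for this copy since the last round

        FileCopy(String fileId, double priority) {
            this.fileId = fileId;
            this.priority = priority;
        }

        // Creating a replica leaves both copies with priority P/2, so replication
        // does not inflate the file's overall priority.
        FileCopy replicate() {
            this.priority /= 2.0;
            return new FileCopy(fileId, this.priority);
        }

        // After each round, refresh the priority from the requests actually received
        // (the weighting factor is a hypothetical knob, not defined in the text).
        void updateFromRequests(double weight) {
            this.priority = weight * requestsSeen;
            this.requestsSeen = 0;
        }
    }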

2.2.1 ADVANTAGES:

The community-based mobility model has been used in content dissemination or routing algorithms for disconnected MANETs/DTNs to depict node mobility. In this model, the entire test area is split into different sub-areas, denoted as caves. Each cave holds one community.
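For illustration only (a sketch under our own assumptions; the grid size and method names are not part of the model's definition), the division of the test area into caves can be expressed as mapping a node's position to a cave index on a grid, with each cave hosting one community:

    // Illustrative only: map a position in the test area to a cave/community
    // index on a rows x cols grid; each cave hosts one community.
    class CommunityModelSketch {
        static int caveIndex(double x, double y, double areaWidth, double areaHeight,
                             int rows, int cols) {
            int col = Math.min(cols - 1, (int) (x / (areaWidth / cols)));
            int row = Math.min(rows - 1, (int) (y / (areaHeight / rows)));
            return row * cols + col;
        }
    }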

Under the RWP model, we can assume that the inter-meeting time among nodes follows an exponential distribution. Then, the probability of meeting a node is independent of the previously encountered node. Therefore, we define the meeting ability of a node as the average number of nodes it meets in a unit time and use it to investigate optimal file replication.
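As a simple illustration of this definition (our own sketch, not the paper's code; names are assumed), a node can estimate its meeting ability from a log of encounter timestamps by counting encounters per unit time:

    import java.util.List;

    class MeetingAbilityEstimator {
        // Meeting ability: average number of nodes met per unit time, estimated
        // from the timestamps (in seconds) of past encounters.
        static double meetingAbility(List<Double> encounterTimesSeconds, double unitSeconds) {
            if (encounterTimesSeconds.isEmpty()) return 0.0;
            double first = encounterTimesSeconds.get(0);
            double last = encounterTimesSeconds.get(encounterTimesSeconds.size() - 1);
            double spanInUnits = Math.max(last - first, unitSeconds) / unitSeconds;
            return encounterTimesSeconds.size() / spanInUnits;
        }
    }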

To evaluate PCS, we used two routing protocols in the experiments. We first used the Static Wait protocol in the GENI experiment, in which each query stays on the source node waiting to meet the destination. We then used a probabilistic routing protocol (PROPHET), in which a node routes requests to the neighbor with the highest meeting ability.
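The forwarding decision in the PROPHET-based setting can be sketched as follows (a minimal illustration assuming each node knows its neighbors' advertised meeting abilities; this is not the full PROPHET protocol, which also maintains delivery predictabilities with aging and transitivity):

    import java.util.Map;

    class Forwarder {
        // Pick the neighbor with the highest meeting ability to carry a request;
        // returns null when no neighbor is better than the current node.
        static String pickNextHop(Map<String, Double> neighborMeetingAbility,
                                  double ownMeetingAbility) {
            String best = null;
            double bestAbility = ownMeetingAbility;
            for (Map.Entry<String, Double> e : neighborMeetingAbility.entrySet()) {
                if (e.getValue() > bestAbility) {
                    best = e.getKey();
                    bestAbility = e.getValue();
                }
            }
            return best;
        }
    }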

2.3 HARDWARE & SOFTWARE REQUIREMENTS:

2.3.1 HARDWARE REQUIREMENT:

  • Processor        –  Pentium IV
  • Speed            –  1.1 GHz
  • RAM              –  256 MB (min)
  • Hard Disk        –  20 GB
  • Floppy Drive     –  1.44 MB
  • Keyboard         –  Standard Windows Keyboard
  • Mouse            –  Two or Three Button Mouse
  • Monitor          –  SVGA


2.3.2 SOFTWARE REQUIREMENTS:

  • Operating System   :  Windows XP or Windows 7
  • Front End          :  Java (JDK 1.7)
  • Tools              :  NetBeans 7
  • Script             :  JavaScript
  • Document           :  MS Office 2007


CHAPTER 3

3.0 SYSTEM DESIGN:

Data Flow Diagram / Use Case Diagram / Flow Diagram:

  • The DFD is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on these data, and the output data generated by the system.
  • The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components. These components are the system process, the data used by the process, an external entity that interacts with the system and the information flows in the system.
  • DFD shows how the information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.
  • A DFD may be used to represent a system at any level of abstraction and may be partitioned into levels that represent increasing information flow and functional detail.

NOTATION:

SOURCE OR DESTINATION OF DATA:

External sources or destinations of data, which may be people, organizations, or other entities.

DATA STORE:

Here the data referenced by a process is stored and retrieved.

PROCESS:

People, procedures, or devices that produce data; the physical component is not identified.

DATA FLOW:

Data moves in a specific direction from an origin to a destination. The data flow is a “packet” of data.

MODELING RULES:

There are several common modeling rules when creating DFDs:

  1. All processes must have at least one data flow in and one data flow out.
  2. All processes should modify the incoming data, producing new forms of outgoing data.
  3. Each data store must be involved with at least one data flow.
  4. Each external entity must be involved with at least one data flow.
  5. A data flow must be attached to at least one process.


3.1 ARCHITECTURE DIAGRAM


3.2 DATAFLOW DIAGRAM

UML DIAGRAMS:

3.3 USE CASE DIAGRAM:

3.4 CLASS DIAGRAM:

3.5 SEQUENCE DIAGRAM:

3.6 ACTIVITY DIAGRAM:

CHAPTER 4

4.0 IMPLEMENTATION:

OFRR PROTOCOL:

4.1 ALGORITHM

PSEUDO-CODE FOR PCS ALGORITHM:
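A simplified Java-style sketch of one round of PCS replica distribution, reconstructed from the description in Section 2.2 (our own illustration; the data structures, the capacity check, and all names are assumptions, not the authors' pseudo-code):

    import java.util.*;

    class PcsRoundSketch {
        // Minimal file-copy record: file id plus competition priority P.
        static class Copy {
            final String fileId;
            double priority;
            Copy(String fileId, double priority) { this.fileId = fileId; this.priority = priority; }
        }

        // One round of replica distribution at a meeting between two nodes:
        // carried copies compete by priority for the neighbor's storage, and
        // each successful replication splits the priority (P/2 on each copy).
        static void distribute(List<Copy> localCopies, List<Copy> neighborStorage,
                               int neighborCapacity) {
            Collections.sort(localCopies, new Comparator<Copy>() {
                public int compare(Copy a, Copy b) { return Double.compare(b.priority, a.priority); }
            });
            for (Copy c : localCopies) {
                if (neighborStorage.size() >= neighborCapacity) break;
                c.priority /= 2.0;                                    // original keeps P/2
                neighborStorage.add(new Copy(c.fileId, c.priority));  // replica gets P/2
            }
        }
    }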

4.2 MODULES:

DELAY TOLERANT NETWORKS (DTNS):

P2P FILE SHARING IN MANETS:

MANETS WITH RWP MODEL:

DISTRIBUTED FILE REPLICATION:

EXPERIMENTAL RESULTS:

REPLICA COST:

REPLICA DISTRIBUTION:

AVERAGE DELAY:

4.3 MODULE DESCRIPTION:

DELAY TOLERANT NETWORKS (DTNS):

P2P FILE SHARING IN MANETS:

MANETS WITH RWP MODEL:

DISTRIBUTED FILE REPLICATION:

EXPERIMENTAL RESULTS:

REPLICA COST:

REPLICA DISTRIBUTION:

AVERAGE DELAY:

CHAPTER 8

8.1 CONCLUSION & FUTURE WORK:

In this paper, we investigated the problem of how to allocate limited resources for file replication so as to achieve globally optimal file searching efficiency in MANETs. Unlike previous protocols that only consider storage as a resource, we also consider a file holder’s ability to meet other nodes as an available resource, since it also affects the availability of the files on that node. We first theoretically analyzed the influence of replica distribution on the average querying delay under constrained available resources with two mobility models, and then derived an optimal replication rule that allocates resources to file replicas with minimal average querying delay.

Finally, we designed the priority competition and split replication protocol (PCS) that realizes the optimal replication rule in a fully distributed manner. Extensive experiments on the GENI testbed, NS-2, and an event-driven simulator with real traces and synthesized mobility confirm both the correctness of our theoretical analysis and the effectiveness of PCS in MANETs. In this study, we focused on a static set of files in the network. In our future work, we will theoretically analyze a more complex environment including file dynamics (file addition, deletion, and timeout) and dynamic node querying patterns.

k-Nearest Neighbor Classification over Semantically Secure Encrypted Relational Data

k-Nearest Neighbor Classification overSemantically Secure Encrypted Relational DataBharath K. Samanthula, Member, IEEE, Yousef Elmehdwi, and Wei Jiang, Member, IEEEAbstract—Data Mining has wide applications in many areas such as banking, medicine, scientific research and among governmentagencies. Classification is one of the commonly used tasks in data mining applications. For the past decade, due to the rise of variousprivacy issues, many theoretical and practical solutions to the classification problem have been proposed under different securitymodels. However, with the recent popularity of cloud computing, users now have the opportunity to outsource their data, in encryptedform, as well as the data mining tasks to the cloud. Since the data on the cloud is in encrypted form, existing privacy-preservingclassification techniques are not applicable. In this paper, we focus on solving the classification problem over encrypted data. Inparticular, we propose a secure k-NN classifier over encrypted data in the cloud. The proposed protocol protects the confidentiality ofdata, privacy of user’s input query, and hides the data access patterns. To the best of our knowledge, our work is the first to develop asecure k-NN classifier over encrypted data under the semi-honest model. Also, we empirically analyze the efficiency of our proposedprotocol using a real-world dataset under different parameter settings.Index Terms—Security, k-NN classifier, outsourced databases, encryptionÇ1 INTRODUCTIONRECENTLY, the cloud computing paradigm [1] is revolutionizingthe organizations’ way of operating their dataparticularly in the way they store, access and process data.As an emerging computing paradigm, cloud computingattracts many organizations to consider seriously regardingcloud potential in terms of its cost-efficiency, flexibility, andoffload of administrative overhead. Most often, organizationsdelegate their computational operations in addition totheir data to the cloud. Despite tremendous advantages thatthe cloud offers, privacy and security issues in the cloud arepreventing companies to utilize those advantages. Whendata are highly sensitive, the data need to be encryptedbefore outsourcing to the cloud. However, when data areencrypted, irrespective of the underlying encryption scheme,performing any data mining tasks becomes very challengingwithout ever decrypting the data. There are other privacyconcerns, demonstrated by the following example.Example 1. Suppose an insurance company outsourced itsencrypted customers database and relevant data miningtasks to a cloud. When an agent from the companywants to determine the risk level of a potential newcustomer, the agent can use a classification method todetermine the risk level of the customer. First, theagent needs to generate a data record q for thecustomer containing certain personal information ofthe customer, e.g., credit score, age, marital status, etc.Then this record can be sent to the cloud, and thecloud will compute the class label for q. Nevertheless,since q contains sensitive information, to protect thecustomer’s privacy, q should be encrypted before sendingit to the cloud.The above example shows that data mining overencrypted data (denoted by DMED) on a cloud also needsto protect a user’s record when the record is a part of a datamining process. Moreover, cloud can also derive useful andsensitive information about the actual data items by observingthe data access patterns even if the data are encrypted[2], [3]. 
Therefore, the privacy/security requirements of theDMED problem on a cloud are threefold: (1) confidentialityof the encrypted data, (2) confidentiality of a user’s queryrecord, and (3) hiding data access patterns.Existing work on privacy-preserving data mining(PPDM) (either perturbation or secure multi-party computation(SMC) based approach) cannot solve the DMED problem.Perturbed data do not possess semantic security, sodata perturbation techniques cannot be used to encrypthighly sensitive data. Also the perturbed data do not producevery accurate data mining results. Secure multi-partycomputation based approach assumes data are distributedand not encrypted at each participating party. In addition,many intermediate computations are performed based onnon-encrypted data. As a result, in this paper, we proposednovel methods to effectively solve the DMED problemassuming that the encrypted data are outsourced to a cloud.Specifically, we focus on the classification problem since itis one of the most common data mining tasks. Because eachclassification technique has their own advantage, to be concrete,this paper concentrates on executing the k-nearestneighbor classification method over encrypted data in thecloud computing environment._ B.K. Samanthula is with the Department of Computer Science, PurdueUniversity, 305 N. University Street, West Lafayette, IN 47907.E-mail: bsamanth@purdue.edu._ Y. Elmehdwi and W. Jiang are with the Department of Computer Science,Missouri University of Science and Technology, 310 CS Building,500 West 15th St., Rolla, MO 65409. E-mail: {ymez76, wjiang}@mst.edu.Manuscript received 23 Oct. 2013; revised 10 Sept. 2014; accepted 29 Sept.2014. Date of publication 19 Oct. 2014; date of current version 27 Mar. 2015.Recommended for acceptance by G. Miklau.For information on obtaining reprints of this article, please send e-mail to:reprints@ieee.org, and reference the Digital Object Identifier below.Digital Object Identifier no. 10.1109/TKDE.2014.2364027IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 27, NO. 5, MAY 2015 12611041-4347 _ 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.1.1 Problem DefinitionSuppose Alice owns a database D of n records t1; . . . ; tn andm þ 1 attributes. Let ti;j denote the jth attribute value ofrecord ti. Initially, Alice encrypts her database attributewise,that is, she computes Epkðti;jÞ, for 1 _ i _ n and1 _ j _ m þ 1, where column ðm þ 1Þ contains the classlabels. We assume that the underlying encryption scheme issemantically secure [4]. Let the encrypted database bedenoted by D0. We assume that Alice outsources D0 as wellas the future classification process to the cloud.Let Bob be an authorized user who wants to classify hisinput record q ¼ hq1; . . . ; qmi by applying the k-NN classificationmethod based on D0. We refer to such a process asprivacy-preserving k-NN (PPkNN) classification overencrypted data in the cloud. Formally, we define thePPkNN protocol as:PPkNNðD0; qÞ ! cq;where cq denotes the class label for q after applying k-NNclassification method on D0 and q.1.2 Our ContributionsIn this paper, we propose a novel PPkNN protocol, a securek-NN classifier over semantically secure encrypted data. Inour protocol, once the encrypted data are outsourced to thecloud, Alice does not participate in any computations.Therefore, no information is revealed to Alice. 
In addition,our protocol meets the following privacy requirements:_ Contents of D or any intermediate results should notbe revealed to the cloud._ Bob’s query q should not be revealed to the cloud._ cq should be revealed only to Bob. Also, no otherinformation should be revealed to Bob._ Data access patterns, such as the records correspondingto the k-nearest neighbors of q, should not berevealed to Bob and the cloud (to prevent any inferenceattacks).We emphasize that the intermediate results seen by the cloudin our protocol are either newly generated randomizedencryptions or random numbers. Thus, which data recordscorrespond to the k-nearest neighbors and the output classlabel are not known to the cloud. In addition, after sendinghis encrypted query record to the cloud, Bob does notinvolve in any computations. Hence, data access patterns arefurther protected from Bob (see Section 5 for more details).The rest of the paper is organized as follows. We discussthe existing related work and some concepts as a backgroundin Section 2. A set of privacy-preserving protocolsand their possible implementations are provided in Section3. The formal security proofs for the mentioned privacy-preservingprimitives are provided in Section 4. The proposedPPkNN protocol is explained in detail in Section 5. Section 6discusses the performance of the proposed protocol underdifferent parameter settings. We conclude the paper alongwith future work in Section 7.2 RELATED WORK AND BACKGROUNDDue to space limitations, here we briefly review the existingrelated work and provide some definitions as a background.Please refer to our technical report [5] for a more elaboratedrelated work and background.At first, it seems fully homomorphic cryptosystems (e.g.,[6]) can solve the DMED problem since it allows a thirdparty(that hosts the encrypted data) to execute arbitraryfunctions over encrypted data without ever decryptingthem. However, we stress that such techniques are veryexpensive and their usage in practical applications have yetto be explored. For example, it was shown in [7] that evenfor weak security parameters one “bootstrapping” operationof the homomorphic operation would take at least30 seconds on a high performance machine.It is possible to use the existing secret sharing techniquesin SMC, such as Shamir’s scheme [8], to develop a PPkNNprotocol. However, our work is different from the secretsharing based solution in the following aspect. Solutionsbased on the secret sharing schemes require at least threeparties whereas our work require only two parties. Forexample, the constructions based on Sharemind [9], a wellknownSMC framework which is based on the secret sharingscheme, assumes that the number of participating partiesis three. Thus, our work is orthogonal to Sharemind andother secret sharing based schemes.2.1 Privacy-Preserving Data MiningAgrawal and Srikant [10], Lindell and Pinkas [11] werethe first to introduce the notion of privacy-preservingunder data mining applications. The existing PPDM techniquescan broadly be classified into two categories: (i)data perturbation and (ii) data distribution. Agrawal andSrikant [10] proposed the first data perturbation techniqueto build a decision-tree classifier, and many othermethods were proposed later (e.g., [12], [13], [14]). However,as mentioned earlier in Section 1, data perturbationtechniques cannot be applicable for semantically secureencrypted data. Also, they do not produce accurate datamining results due to the addition of statistical noises tothe data. 
On the other hand, Lindell and Pinkas [11] proposedthe first decision tree classifier under the two-partysetting assuming the data were distributed between them.Since then much work has been published using SMCtechniques (e.g., [15], [16], [17]). We claim that the PPkNNproblem cannot be solved using the data distributiontechniques since the data in our case is encrypted and notdistributed in plaintext among multiple parties. For thesame reasons, we also do not consider secure k-NN methodsin which the data are distributed between two parties(e.g., [18]).2.2 Query Processing over Encrypted DataVarious techniques related to query processing overencrypted data have been proposed, e.g., [19], [20], [21].However, we observe that PPkNN is a more complex problemthan the execution of simple kNN queries overencrypted data [22], [23]. For one, the intermediate k-nearestneighbors in the classification process, should not be disclosedto the cloud or any users. We emphasize that therecent method in [23] reveals the k-nearest neighbors to theuser. Second, even if we know the k-nearest neighbors, it isstill very difficult to find the majority class label amongthese neighbors since they are encrypted at the first place to1262 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 27, NO. 5, MAY 2015prevent the cloud from learning sensitive information.Third, the existing work did not addressed the access patternissue which is a crucial privacy requirement from theuser’s perspective.In our most recent work [24], we proposed a novelsecure k-nearest neighbor query protocol over encrypteddata that protects data confidentiality, user’s query privacy,and hides data access patterns. However, as mentionedabove, PPkNN is a more complex problem and itcannot be solved directly using the existing secure k-nearestneighbor techniques over encrypted data. Therefore,in this paper, we extend our previous work in [24] andprovide a new solution to the PPkNN classifier problemover encrypted data.More specifically, this paper is different from our preliminarywork [24] in the following four aspects. First, inthis paper, we introduced new security primitives,namely secure minimum (SMIN), secure minimum out ofn numbers (SMINn), secure frequency (SF), and proposednew solutions for them. Second, the work in [24] did notprovide any formal security analysis of the underlyingsub-protocols. On the other hand, this paper provides formalsecurity proofs of the underlying sub-protocols aswell as the PPkNN protocol under the semi-honest model.Additionally, we discuss various techniques throughwhich the proposed PPkNN protocol can possibly beextended to a protocol that is secure under the malicioussetting. Third, our preliminary work in [24] addressesonly secure kNN query which is similar to Stage 1 ofPPkNN. However, Stage 2 in PPkNN is entirely new.Finally, our empirical analyses in Section 6 are based on areal dataset whereas the results in [24] are based on asimulated dataset. Furthermore, new experimental resultsare included in this paper.2.3 Threat ModelWe adopt the security definitions in the literature of securemulti-party computation [25], [26], and there are three commonadversarial models under SMC: semi-honest, covertand malicious. In this paper, to develop secure and efficientprotocols, we assume that parties are semi-honest. Briefly,the following definition captures the properties of a secureprotocol under the semi-honest model [27], [28].Definition 1. 
Let ai be the input of party Pi, PiðpÞ be Pi’s executionimage of the protocol p and bi be the output for party Picomputed from p. Then, p is secure if PiðpÞ can be simulatedfrom ai and bi such that distribution of the simulated image iscomputationally indistinguishable from PiðpÞ.In the above definition, an execution image generallyincludes the input, the output and the messages communicatedduring an execution of a protocol. To prove a protocolis secure under semi-honest model, we generally need toshow that the execution image of a protocol does not leakany information regarding the private inputs of participatingparties [28].2.4 Paillier CryptosystemThe Paillier cryptosystem is an additive homomorphic andprobabilistic public-key encryption scheme whose securityis based on the Decisional Composite Residuosity Assumption[4]. Let Epk be the encryption function with public keypk given by (N; g), where N is a product of two large primesof similar bit length and g is a generator in Z_N2 . Also, let Dskbe the decryption function with secret key sk. For any giventwo plaintexts a; b 2 ZN, the Paillier encryption schemeexhibits the following properties:(1) Homomorphic addition.DskðEpkða þ bÞÞ ¼ DskðEpkðaÞ _ EpkðbÞmodN2Þ:(2) Homomorphic multiplication.DskðEpkða _ bÞÞ ¼ DskðEpkðaÞbmodN2Þ:(3) Semantic security. The encryption scheme is semanticallysecure[28], [29]. Briefly, given a set of ciphertexts,an adversary cannot deduce any additionalinformation about the plaintext(s).For succinctness, we drop the modN2 term during homomorphicoperations in the rest of this paper.3 PRIVACY-PRESERVING PRIMITIVESHere we present a set of generic sub-protocols that willbe used in constructing our proposed k-NN protocol inSection 5. All of the below protocols are considered undertwo-party semi-honest setting. In particular, we assumethe existence of two semi-honest parties P1 and P2 suchthat the Paillier’s secret key sk is known only to P2whereas pk is public._ Secure multiplication (SM). This protocol considers P1with input ðEpkðaÞ; EpkðbÞÞ and outputs Epkða _ bÞ toP1, where a and b are not known to P1 and P2. Duringthis process, no information regarding a and b isrevealed to P1 and P2._ Secure squared euclidean distance (SSED). In this protocol,P1 with input ðEpkðXÞ; EpkðY ÞÞ and P2 with sksecurely compute the encryption of squared euclideandistance between vectors X and Y . Here X andY are m dimensional vectors where EpkðXÞ ¼hEpkðx1Þ; . . . ; EpkðxmÞi and EpkðYÞ ¼ hEpkðy1Þ; . . . ;EpkðymÞi. The output EpkðjX _ Y j2Þ will be knownonly to P1._ Secure bit-decomposition (SBD). Here P1 with inputEpkðzÞ and P2 securely compute the encryptions ofthe individual bits of z, where 0 _ z < 2l. The output½z_ ¼ hEpkðz1Þ; . . . ; EpkðzlÞi is known only to P1. Herez1 and zl are the most and least significant bits ofinteger z, respectively._ Secure minimum. In this protocol, P1 holds privateinput ðu0; v0Þ and P2 holds sk, where u0 ¼ ð½u_;EpkðsuÞÞ and v0 ¼ ð½v_; EpkðsvÞÞ. Here su (resp., sv)denotes the secret associated with u (resp., v). Thegoal of SMIN is for P1 and P2 to jointly compute theencryptions of the individual bits of minimum numberbetween u and v. In addition, they computeEpkðsminðu;vÞÞ. That is, the output is ð½minðu; vÞ_;Epkðsminðu;vÞÞÞ which will be known only to P1.SAMANTHULA ET AL.: K-NEAREST NEIGHBOR CLASSIFICATION OVER SEMANTICALLY SECURE ENCRYPTED RELATIONAL DATA 1263During this protocol, no information regarding thecontents of u; v; su; and sv is revealed to P1 and P2._ Secure minimum out of n numbers. 
In this protocol, weconsider P1 with n encrypted vectors ð½d1_; . . . ; ½dn_Þalong with their respective encrypted secrets and P2with sk. Here ½di_ ¼ hEpkðdi;1Þ; . . . ; Epkðdi;lÞi wheredi;1 and di;l are the most and least significant bitsof integer di respectively, for 1 _ i _ n. The secretof di is given by sdi . P1 and P2 jointly compute½minðd1; . . . ; dnÞ_. In addition, they computeEpkðsminðd1;…;dnÞÞ. At the end of this protocol, the outputð½minðd1; . . . ; dnÞ_; Epkðsminðd1;…;dnÞÞÞ is knownonly to P1. During SMINn, no information regardingany of di’s and their secrets is revealed to P1 and P2._ Secure Bit-OR (SBOR). P1 with input ðEpkðo1Þ;Epkðo2ÞÞ and P2 securely compute Epkðo1 _ o2Þ, whereo1 and o2 are 2 bits. The output Epkðo1 _ o2Þ is knownonly to P1._ Secure frequency. Here P1 with private inputðhEpkðc1Þ; . . .EpkðcwÞi; hEpkðc01Þ; . . . ; Epkðc0kÞiÞ and P2securely compute the encryption of the frequency ofcj, denoted by fðcjÞ, in the list hc01; . . . ; c0ki, for1 _ j _ w. Here we explicitly assume that cj’s areunique and c0i 2 fc1; . . . ; cwg, for 1 _ i _ k. The outputhEpkðfðc1ÞÞ; . . .; EpkðfðcwÞÞi will be known onlyto P1. During the SF protocol, no information regardingc0i, cj, and fðcjÞ is revealed to P1 and P2, for1 _ i _ k and 1 _ j _ w.Now we either propose a new solution or refer to themost efficient known implementation to each of theabove protocols. First of all, efficient solutions to SM,SSED, SBD and SBOR were discussed in [24]. Therefore,in this paper, we discuss SMIN, SMINn, and SF problemsin detail and propose new solutions to each one ofthem.Secure minimum. In this protocol, we assume that P1holds private input ðu0; v0Þ and P2 holds sk, whereu0 ¼ ð½u_; EpkðsuÞÞ and v0 ¼ ð½v_; EpkðsvÞÞ. Here su and svdenote the secrets corresponding to u and v, respectively.The main goal of SMIN is to securely compute theencryptions of the individual bits of minðu; vÞ, denotedby ½minðu; vÞ_. Here ½u_ ¼ hEpkðu1Þ; . . . ; EpkðulÞi and ½v_ ¼hEpkðv1Þ; . . . ; EpkðvlÞi, where u1 (resp., v1) and ul (resp., vl)are the most and least significant bits of u (resp., v), respectively.In addition, they compute Epkðsminðu;vÞÞ, the encryptionof the secret corresponding to the minimum valuebetween u and v. At the end of SMIN, the outputð½minðu; vÞ_; Epkðsminðu;vÞÞÞ is known only to P1.We assume that 0 _ u; v < 2l and propose a novelSMIN protocol. Our solution to SMIN is mainly motivatedfrom the work of [24]. Precisely, the basic idea ofthe proposed SMIN protocol is for P1 to randomly choosethe functionality F (by flipping a coin), where F is eitheru > v or v > u, and to obliviously execute F with P2.Since F is randomly chosen and known only to P1, theresult of the functionality F is oblivious to P2. Based onthe comparison result and chosen F, P1 computes½minðu; vÞ_ and Epkðsminðu;vÞÞ locally using homomorphicproperties.Algorithm 1. SMINðu0; v0Þ ! ½minðu; vÞ_; Epkðsminðu;vÞÞRequire: P1 has u0 ¼ ð½u_;EpkðsuÞÞ and v0 ¼ ð½v_; EpkðsvÞÞ, where0 _ u; v < 2l; P2 has sk1: P1:(a). Randomly choose the functionality F(b). for i ¼ 1 to l do:_ Epkðui _ viÞ SMðEpkðuiÞ; EpkðviÞÞ_ Ti Epkðui _ viÞ_ Hi Hrii_1 _ Ti; ri 2R ZN and H0 ¼ Epkð0Þ_ Fi Epkð_1Þ _ Hi_ if F : u > v then:_ Wi EpkðuiÞ _ Epkðui _ viÞN_1_ Gi Epkðvi _ uiÞ _ Epkð^riÞ; ^ri 2R ZNelse_ Wi EpkðviÞ _ Epkðui _ viÞN_1_ Gi Epkðui _ viÞ _ Epkð^riÞ; ^ri 2R ZN_ Li Wi _ Fr0ii ; r0i 2R ZN(c). if F :u > v then: d Epkðsv _ suÞ _ EpkðrÞelse d Epkðsu _ svÞ _ EpkðrÞ, where r 2R ZN(d). G0 p1ðGÞ and L0 p2ðLÞ(e). Send d; G0 and L0 to P22: P2:(a). 
Receive d; G0 and L0 from P1(b). Decryption:Mi DskðL0iÞ, for 1 _ i _ l(c). if 9 j such that Mj ¼ 1 then a 1else a 0(d). if a ¼ 0 then:_ M0i Epkð0Þ, for 1 _ i _ l_ d0 Epkð0Þelse_ M0i G0i _ rN, where r 2R ZN and is different for1 _ i _ l_ d0 d _ rNd, where rd 2R ZN(e). Send M0;EpkðaÞ and d0 to P13: P1:(a). ReceiveM0;EpkðaÞ and d0 from P2(b).eMp_11 ðM0Þ and u d0 _ EpkðaÞN_r(c). _i eMi _ EpkðaÞN_^ri , for 1 _ i _ l(d). if F : u > v then:_ Epkðsminðu;vÞÞ EpkðsuÞ _ u_ Epkðminðu; vÞiÞ EpkðuiÞ _ _i, for 1 _ i _ lelse_ Epkðsminðu;vÞÞ EpkðsvÞ _ u_ Epkðminðu; vÞiÞ EpkðviÞ _ _i, for 1 _ i _ lThe overall steps involved in the SMIN protocol areshown in Algorithm 1. To start with, P1 initially chooses thefunctionality F as either u > v or v > u randomly. Then,using the SM protocol, P1 computes Epkðui _ viÞ with thehelp of P2, for 1 _ i _ l. After this, the protocol has the followingkey steps, performed by P1 locally, for 1 _ i _ l:_ Compute the encrypted bit-wise XOR between thebits ui and vi using the following formulation1Ti ¼ EpkðuiÞ _ EpkðviÞ _ Epkðui _ viÞN_2_ Compute an encrypted vector H by preserving thefirst occurrence of Epkð1Þ (if there exists one) in T byinitializing H0 ¼ Epkð0Þ. The rest of the entries of Hare computed as Hi ¼ Hrii_1 _ Ti. We emphasize that1. In general, for any two given bits o1 and o2, the propertyo1 _ o2 ¼ o1 þ o2 _ 2ðo1 _ o2Þ always holds.1264 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 27, NO. 5, MAY 2015at most one of the entry in H is Epkð1Þ and theremaining entries are encryptions of either 0 or a randomnumber._ Then, P1 computes Fi ¼ Epkð_1Þ _ Hi. Note that“_1” is equivalent to “N _ 1” under ZN. From theabove discussions, it is clear that Fi ¼ Epkð0Þ at mostonce since Hi is equal to Epkð1Þ at most once. Also, ifFj ¼ Epkð0Þ, then index j is the position at which thebits of u and v differ first (starting from the most significantbit position).Now, depending on F, P1 creates two encrypted vectors Wand G as follows, for 1 _ i _ l:_ If F : u > v, computeWi ¼ Epkðui _ ð1 _ viÞÞ;Gi ¼ Epkðvi _ uiÞ _ Epkð^riÞ ¼ Epkðvi _ ui þ ^riÞ:_ If F : v > u, computeWi ¼ Epkðvi _ ð1 _ uiÞÞ;Gi ¼ Epkðui _ viÞ _ Epkð^riÞ ¼ Epkðui _ vi þ ^riÞ;where ^ri is a random number (hereafter denoted by 2R) inZN. The observation is that if F : u > v, then Wi ¼ Epkð1Þ iffui > vi, and Wi ¼ Epkð0Þ otherwise. Similarly, whenF : v > u, we have Wi ¼ Epkð1Þ iff vi > ui, and Wi ¼ Epkð0Þotherwise. Also, depending of F, Gi stores the encryption ofthe randomized difference between ui and vi which will beused in later computations.After this, P1 computes L by combining F and W. Moreprecisely, P1 computes Li ¼ Wi _ Fr0ii , where r0i is a randomnumber in ZN. The observation here is if 9 an index j suchthat Fj ¼ Epkð0Þ, denoting the first flip in the bits of u and v,then Wj stores the corresponding desired information, i.e.,whether uj > vj or vj > uj in encrypted form. In addition,depending on F, P1 computes the encryption of randomizeddifference between su and sv and stores it in d. Specifically,if F : u > v, then d ¼ Epkðsv _ su þ rÞ. Otherwise,d ¼ Epkðsu _ sv þ rÞ, where r 2R ZN.After this, P1 permutes the encrypted vectors G and Lusing two random permutation functions p1 and p2. Specifically,P1 computes G0 ¼ p1ðGÞ and L0 ¼ p2ðLÞ, and sendsthem along with d to P2. Upon receiving, P2 decrypts L0component-wise to get Mi ¼ DskðL0iÞ, for 1 _ i _ l, andchecks for index j. That is, if Mj ¼ 1, then P2 sets a to 1, otherwisesets it to 0. 
In addition, P2 computes a new encryptedvector M0 depending on the value of a. Precisely, if a ¼ 0,then M0i ¼ Epkð0Þ, for 1 _ i _ l. Here Epkð0Þ is different foreach i. On the other hand, when a ¼ 1, P2 sets M0i to the rerandomizedvalue of G0i. That is, M0i ¼ G0i _ rN, where theterm rN comes from re-randomization and r 2R ZN shouldbe different for each i. Furthermore, P2 computesd0 ¼ Epkð0Þ if a ¼ 0. However, when a ¼ 1, P2 sets d0 tod _ rNd, where rd is a random number in ZN. Then, P2 sendsM0; EpkðaÞ and d0 to P1. After receiving M0; EpkðaÞ and d0, P1computes the inverse permutation of M0 aseM¼ p_11 ðM0Þ.Then, P1 performs the following homomorphic operationsto compute the encryption of ith bit of minðu; vÞ, i.e.,Epkðminðu; vÞiÞ, for 1 _ i _ l:_ Remove the randomness fromeMi by computing_i ¼ eMi _ EpkðaÞN_^ri_ If F : u>v, compute Epkðminðu; vÞiÞ ¼ EpkðuiÞ__i ¼ Epkðui þ a _ ðvi _ uiÞÞ. Otherwise, computeEpkðminðu; vÞiÞ¼EpkðviÞ _ _i ¼ Epkðviþ a _ ðui _ viÞÞ.Also, depending on F, P1 computes Epkðsminðu;vÞÞ as follows.If F : u > v, P1 computes Epkðsminðu;vÞÞ ¼ EpkðsuÞ _ u,where u¼d0 _ EpkðaÞN_r. Otherwise, he/she computesEpkðsminðu;vÞÞ¼ EpkðsvÞ _ u.In the SMIN protocol, one main observation (upon whichwe can also justify the correctness of the final output) is thatif F : u > v, then minðu; vÞi ¼ ð1 _ aÞ _ ui þ a _ vi alwaysholds, for 1 _ i _ l. On the other hand, if F : v > u, thenminðu; vÞi ¼ a _ ui þ ð1 _ aÞ _ vi always holds. Similar conclusionscan be drawn for sminðu;vÞ. We emphasize that usingsimilar formulations one can also design a SMAX protocolto compute ½maxðu; vÞ_ and Epkðsmaxðu;vÞÞ. Also, we stressthat there can be multiple secrets of u and v that can be fedas input (in encrypted form) to SMIN and SMAX. For example,let s1u and s2u (resp., s1vand s2v) be two secrets associatedwith u (resp., v). Then the SMIN protocol takesð½u_; Epkðs1uÞ; Epkðs2uÞÞ and ð½v_; Epkðs1vÞ; Epkðs2vÞÞ as P1’s inputand outputs ½minðu; vÞ_; Epkðs1minðu;vÞÞ and Epkðs2minðu;vÞÞ to P1.Example 2. For simplicity, consider that u ¼ 55, v ¼ 58, andl ¼ 6. Suppose su and sv be the secrets associated with uand v, respectively. Assume that P1 holds ð½55_; EpkðsuÞÞð½58_; EpkðsvÞÞ. In addition, we assume that P1’s randompermutation functions are as given below. Without lossof generality, suppose P1 chooses the functionalityF : v > u. Then, various intermediate results based onthe SMIN protocol are as shown in Table 1. Followingfrom Table 1, we observe that:_ At most one of the entry in H is Epkð1Þ, namelyH3, and the remaining entries are encryptions ofeither 0 or a random number in ZN._ Index j ¼ 3 is the first position at which the correspondingbits of u and v differ.TABLE 1P1 Chooses F Asv > uWhere u ¼ 55 and v ¼ 58½u_ ½v_ Wi Gi Gi Hi Fi Li Gi’ L0i Mi _i mini1 1 0 r 0 0 _1 r 1 þr r r 0 11 1 0 r 0 0 _1 r r r r 0 10 1 1 _1 þ r 1 1 0 1 1þr r r _1 01 0 0 1 þ r 1 r r r _1 þr r r 1 11 1 0 r 0 r r r r 1 1 0 11 0 0 1 þ r 1 r r r r r r 1 1All column values are in encrypted form exceptMi column. Also, r 2R ZN isdifferent for each row and column.i = 1 2 3 4 5 6# # # # # #p1ðiÞ = 6 5 4 3 2 1p2ðiÞ = 2 1 5 6 3 4SAMANTHULA ET AL.: K-NEAREST NEIGHBOR CLASSIFICATION OVER SEMANTICALLY SECURE ENCRYPTED RELATIONAL DATA 1265_ F3 ¼ Epkð0Þ since H3 is equal to Epkð1Þ. Also, sinceM5 ¼ 1, P2 sets a to 1._ Epkðsminðu;vÞÞ ¼ Epkða _ su þ ð1 _ aÞ _ svÞ¼ EpkðsuÞ.At the end, only P1 knows ½minðu; vÞ_ ¼ ½u_ ¼ ½55_ andEpkðsminðu;vÞÞ ¼ EpkðsuÞ.Secure minimum out of n numbers. Consider P1 with privateinput ð½d1_; . . . 
; ½dn_Þ along with their encrypted secretsand P2 with sk, where 0 _ di < 2l and ½di_ ¼ hEpkðdi;1Þ;. . . ; Epkðdi;lÞi, for 1 _ i _ n. Here the secret of di is denotedby Epkðsdi Þ, for 1 _ i _ n. The main goal of the SMINn protocolis to compute ½minðd1; . . . ; dnÞ_ ¼ ½dmin_ without revealingany information about di’s to P1 and P2. In addition, theycompute the encryption of the secret corresponding to theglobal minimum, denoted by Epkðsdmin Þ. Here we constructa new SMINn protocol by utilizing SMIN as the buildingblock. The proposed SMINn protocol is an iterativeapproach and it computes the desired output in an hierarchicalfashion. In each iteration, minimum between a pair ofvalues and the secret corresponding to the minimum valueare computed (in encrypted form) and fed as input to thenext iteration, thus, generating a binary execution tree in abottom-up fashion. At the end, only P1 knows the finalresult ½dmin_ and Epkðsdmin Þ.Algorithm 2. SMINnðð½d1_; Epkðsd1 ÞÞ; . . . ; ð½dn_; Epkðsdn ÞÞÞ! ð½dmin_; Epkðsdmin ÞÞRequire: P1 has ðð½d1_; Epkðsd1 ÞÞ; . . . ; ð½dn_;Epkðsdn ÞÞÞ; P2 has sk1: P1:(a). ½d0i_ ½di_ and s0i Epkðsdi Þ, for 1 _ i _ n(b). num n2: for i ¼ 1 to dlog2 ne:(a). for 1 _ j _ num2_ _:_ if i ¼ 1 then:_ ð½d02j_1_; s02j_1Þ SMINðx; yÞ, wherex ¼ ð½d02j_1_; s02j_1Þ and y ¼ ð½d02j_; s02jÞ_ ½d02j_ 0 and s02j 0else_ ð½d02iðj_1Þþ1_; s02iðj_1Þþ1Þ SMINðx; yÞ, wherex ¼ ð½d02iðj_1Þþ1_; s02iðj_1Þþ1Þ and y ¼ ð½d02ij_1_; s02ij_1Þ_ ½d02ij_1_ 0 and s02ij_1 0(b). num num2_ _3: P1: ½dmin_ ½d01_ and EpkðsdminÞ s01The overall steps involved in the proposed SMINn protocolare highlighted in Algorithm 2. Initially, P1 assigns ½di_and Epkðsdi Þ to a temporary vector ½d0i_ and variable s0i, for1 _ i _ n, respectively. Also, he/she creates a global variablenum and initializes it to n, where num represents thenumber of (non-zero) vectors involved in each iteration.Since the SMINn protocol executes in a binary tree hierarchy(bottom-up fashion), we have dlog2 ne iterations, and in eachiteration, the number of vectors involved varies. In the firstiteration (i.e., i ¼ 1), P1 with private inputðð½d02j_1_; s02j_1Þ; ð½d02j_; s02jÞÞ and P2 with sk involve in theSMIN protocol, for 1 _ j _ num2_ _. At the end of the first iteration,only P1 knows ½minðd02j_1; d02jÞ_ and s0minðd02j_1;d02jÞ, andnothing is revealed to P2, for 1 _ j _ num2_ _. Also, P1 storesthe result ½minðd02j_1; d02jÞ_ and s0minðd02j_1;d02jÞ in ½d02j_1_ ands02j_1, respectively. In addition, P1 updates the values of½d02j_, s02j to 0 and num to num2_ _, respectively.During the ith iteration, only the non-zero vectors (alongwith the corresponding encrypted secrets) are involved inSMIN, for 2 _ i _ dlog2 ne. For example, during the seconditeration (i.e., i ¼ 2), only ð½d01_; s01Þ; ð½d03_; s03Þ, and so on areinvolved. Note that in each iteration, the output is revealedonly to P1 and num is updated to num2_ _. At the end ofSMINn, P1 assigns the final encrypted binary vector ofglobal minimum value, i.e., ½minðd1; . . . ; dnÞ_ which is storedin ½d01_, to ½dmin_. Also, P1 assigns s01 to Epkðsdmin Þ.Example 3. Suppose P1 holds h½d1_; . . . ; ½d6_i (i.e., n ¼ 6). Forsimplicity, here we are assuming that there are no secretsassociated with di’s. Then, based on the SMINn protocol,the binary execution tree (in a bottom-up fashion) tocompute ½minðd1; . . . ; d6Þ_ is shown in Fig. 1. Note that,initially ½d0i_ ¼ ½di_.Secure frequency. Let us consider a situation where P1holds private input ðhEpkðc1Þ; . . . ; EpkðcwÞi; hEpkðc01Þ;. . . 
; Epkðc0kÞiÞ and P2 holds the secret key sk. The goal of theSF protocol is to securely compute EpkðfðcjÞÞ, for 1 _ j _ w.Here fðcjÞ denotes the number of times element cj occurs(i.e., frequency) in the list hc01; . . . ; c0ki. We explicitly assumethat c0i 2 fc1; . . . ; cwg, for 1 _ i _ k.The output hEpkðfðc1ÞÞ; . . . ; EpkðfðcwÞÞi is revealed onlyto P1. During the SF protocol, neither c0i nor cj is revealed toP1 and P2. Also, fðcjÞ is kept private from both P1 and P2,for 1 _ i _ k and 1 _ j _ w.The overall steps involved in the proposed SF protocolare shown in Algorithm 3. To start with, P1 initially computesan encrypted vector Si such that Si;j ¼ Epkðcj _ c0iÞ, for1 _ j _ w. Then, P1 randomizes Si component-wise to getS0i;j ¼ Epkðri;j _ ðcj _ c0iÞÞ, where ri;j is a random number inZN. After this, for 1 _ i _ k, P1 randomly permutes S0icomponent-wise using a random permutation function pi(known only to P1). The output Zi piðS0iÞ is sent to P2.Upon receiving, P2 decrypts Zi component-wise, computesa vector ui and proceeds as follows:_ If DskðZi;jÞ ¼ 0, then ui;j is set to 1. Otherwise, ui;j isset to 0._ The observation is, since c0i 2 fc1; . . . ; cwg, thatexactly one of the entries in vector Zi is an encryptionof 0 and the rest are encryptions of randomnumbers. This further implies that exactly one of thedecrypted values of Zi is 0 and the rest are randomnumbers. Precisely, if ui;j ¼ 1, then c0i ¼ cp_1ðjÞ.Fig. 1. Binary execution tree for n ¼ 6 based on SMINn.1266 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 27, NO. 5, MAY 2015_ Compute Ui;j ¼ Epkðui;jÞ and send it to P1, for1 _ i _ k and 1 _ j _ w.Then, P1 performs row-wise inverse permutation on it to getVi ¼ p_1i ðUiÞ, for 1 _ i _ k. Finally, P1 computesEpkðcjÞ ¼Qki¼1 Vi;j locally, for 1 _ j _ w.Algorithm 3. SFðL;L0Þ ! hEpkðfðc1ÞÞ; . . . ; EpkðfðcwÞÞiRequire: P1 has L ¼ hEpkðc1Þ; . . .;EpkðcwÞi, L0 ¼ hEpkðc01Þ; . . . ;Epkðc0kÞi and hp1; . . . ; pki; P2 has sk1: P1:(a). for i ¼ 1 to k do:_ Ti Epkðc0iÞN_1_ for j ¼ 1 to w do:_ Si;j EpkðcjÞ _ Ti_ S0i;j Si;jri;j , where ri;j 2R ZN_ Zi piðS0iÞ(b). Send Z to P22: P2:(a). Receive Z from P1(b). for i ¼ 1 to k do_ for j ¼ 1 to w do:_ if DskðZi;jÞ ¼ 0 then ui;j 1else ui;j 0_ Ui;j Epkðui;jÞ(c). Send U to P13: P1:(a). Receive U from P2(b). Vi p_1i ðUiÞ, for 1 _ i _ k(c). EpkðfðcjÞÞQki¼1 Vi;j, for 1 _ j _ w4 SECURITY ANALYSIS OF PRIVACY-PRESERVINGPRIMITIVES UNDER THE SEMI-HONEST MODELFirst of all, we emphasize that the outputs in the above mentionedprotocols are always in encrypted format, and areknown only to P1. Also, all the intermediate results revealedto P2 are either random or pseudo-random.Since the proposed SMIN protocol (which is used as asub-routine in SMINn) is more complex than other protocolsmentioned above and due to space limitations, we are motivatedto provide its security proof rather than providingproofs for each protocol. Therefore, here we only include aformal security proof for the SMIN protocol based on thestandard simulation argument [28]. Nevertheless, we stressthat similar proof strategies can be used to show that otherprotocols are secure under the semi-honest model. 
For completeness,we provided the security proofs for the other protocolsin our technical report [5].4.1 Proof of Security for SMINAs mentioned in Section 2.3, to formally prove that SMIN issecure [28] under the semi-honest model, we need to showthat the simulated image of SMIN is computationally indistinguishablefrom the actual execution image of SMIN.An execution image generally includes the messagesexchanged and the information computed from these messages.Therefore, according to Algorithm 1, let the executionimage of P2 be denoted by PP2 ðSMINÞ, given byfhd; s þ r modNi; hG0i;mi þ ^ri mod Ni; hL0i; aig:Observe that s þ r modN and mi þ ^ri mod N are derivedupon decrypting d and G0i, for 1 _ i _ l, respectively. Notethat the modulo operator is implicit in the decryption function.Also, P2 receives L0 from P1 and let a denote the (oblivious)comparison result computed from L0. Without loss ofgenerality, suppose the simulated image of P2 bePSP2ðSMINÞ, given byfhd_; r_i; hs01;i; s02;ii; hs03;i; a0i j for 1 _ i _ lg:Here d_; s01;i and s03;i are randomly generated from ZN2whereas r_ and s02;i are randomly generated from ZN. Inaddition, a0 denotes a random bit. Since Epk is a semanticallysecure encryption scheme with resulting ciphertextsize less than N2, d is computationally indistinguishablefrom d_. Similarly, G0i and L0i are computationally indistinguishablefrom s01;i and s03;i, respectively. Also, as r and ^riare randomly generated from ZN, s þ r mod N andmi þ ^ri modN are computationally indistinguishable fromr_ and s02;i, respectively. Furthermore, because the functionalityis randomly chosen by P1 (at step 1(a) of Algorithm 1),a is either 0 or 1 with equal probability. Thus, a is computationallyindistinguishable from a0. Combining all theseresults together, we can conclude that PP2 ðSMINÞ is computationallyindistinguishable from PSP2ðSMINÞ based on Definition1. This implies that during the execution of SMIN, P2does not learn any information regarding u; v; su; sv and theactual comparison result. Intuitively speaking, the informationP2 has during an execution of SMIN is either randomor pseudo-random, so this information does not discloseanything regarding u; v; su and sv. Additionally, as F isknown only to P1, the actual comparison result is obliviousto P2.On the other hand, the execution image of P1, denoted byPP1 ðSMINÞ, is given byPP1 ðSMINÞ ¼ fM0i; EpkðaÞ; d0 j for 1 _ i _ lg:M0i and d0 are encrypted values, which are random in ZN2 ,received from P2 (at step 3(a) of Algorithm 1). Let the simulatedimage of P1 be PSP1ðSMINÞ, wherePSP1ðSMINÞ ¼ fs04;i; b0; b00 j for 1 _ i _ lg:The values s04;i; b0 and b00 are randomly generated from ZN2 .Since Epk is a semantically secure encryption scheme withresulting ciphertext size less than N2, it implies thatM0i; EpkðaÞ and d0 are computationally indistinguishablefrom s04;i; b0 and b00, respectively. Therefore, PP1 ðSMINÞ iscomputationally indistinguishable from PSP1ðSMINÞ basedon Definition 1. As a result, P1 cannot learn any informationregarding u; v; su; sv and the comparison result during theexecution of SMIN. 
Putting everything together, we claim that the proposed SMIN protocol is secure under the semi-honest model (according to Definition 1).

5 THE PROPOSED PPkNN PROTOCOL

In this section, we propose a novel privacy-preserving k-NN classification protocol, denoted by PPkNN, which is constructed using the protocols discussed in Section 3 as building blocks. As mentioned earlier, we assume that Alice's database consists of n records, denoted by D = ⟨t_1, ..., t_n⟩, and m + 1 attributes, where t_{i,j} denotes the jth attribute value of record t_i. Initially, Alice encrypts her database attribute-wise, that is, she computes E_pk(t_{i,j}), for 1 ≤ i ≤ n and 1 ≤ j ≤ m + 1, where column (m + 1) contains the class labels. Let the encrypted database be denoted by D'. We assume that Alice outsources D' as well as the future classification process to the cloud. Without loss of generality, we assume that all attribute values and their Euclidean distances lie in [0, 2^l). In addition, let w denote the number of unique class labels in D.

In our problem setting, we assume the existence of two non-colluding semi-honest cloud service providers, denoted by C1 and C2, which together form a federated cloud. Under this setting, Alice outsources her encrypted database D' to C1 and the secret key sk to C2. Here it is possible for the data owner Alice to replace C2 with her own private server. However, if Alice has a private server, one can argue that there is no need for data outsourcing from Alice's point of view. The main purpose of using C2 can be motivated by the following two reasons. (i) With limited computing resources and technical expertise, it is in Alice's best interest to completely outsource her data management and operational tasks to a cloud. For example, Alice may want to access her data and analytical results using a smartphone or any device with very limited computing capability. (ii) Suppose Bob wants to keep his input query and access patterns private from Alice. In this case, if Alice uses a private server, then she has to perform the computations assumed by C2, under which the very purpose of outsourcing the encrypted data to C1 is negated.

In general, whether Alice uses a private server or the cloud service provider C2 depends on her resources. For our problem setting, we prefer to use C2, as this avoids the above-mentioned disadvantages (i.e., those arising when Alice uses a private server) altogether. In our solution, after outsourcing the encrypted data to the cloud, Alice does not participate in any future computations.

The goal of the PPkNN protocol is to classify users' query records using D' in a privacy-preserving manner. Consider an authorized user Bob who wants to classify his query record q = ⟨q_1, ..., q_m⟩ based on D' in C1. The proposed PPkNN protocol consists of the following two stages:

• Stage 1—Secure Retrieval of k-Nearest Neighbors (SRkNN). In this stage, Bob initially sends his query q (in encrypted form) to C1. After this, C1 and C2 engage in a set of sub-protocols to securely retrieve (in encrypted form) the class labels corresponding to the k-nearest neighbors of the input query q. At the end of this stage, the encrypted class labels of the k-nearest neighbors are known only to C1.

• Stage 2—Secure Computation of Majority Class (SCMC_k). Following Stage 1, C1 and C2 jointly compute the class label with a majority voting among the k-nearest neighbors of q.
At the end of this stage, only Bob knows the class label corresponding to his input query record q.

The main steps involved in the proposed PPkNN protocol are shown in Algorithm 4. We now explain each of the two stages of PPkNN in detail.

Algorithm 4. PPkNN(D', q) → c_q
Require: C1 has D' and π; C2 has sk; Bob has q
1: Bob:
   (a). Compute E_pk(q_j), for 1 ≤ j ≤ m
   (b). Send E_pk(q) = ⟨E_pk(q_1), ..., E_pk(q_m)⟩ to C1
2: C1 and C2:
   (a). C1 receives E_pk(q) from Bob
   (b). for i = 1 to n do:
        • E_pk(d_i) ← SSED(E_pk(q), E_pk(t_i))
        • [d_i] ← SBD(E_pk(d_i))
3: for s = 1 to k do:
   (a). C1 and C2:
        • ([d_min], E_pk(I), E_pk(c')) ← SMIN_n(u_1, ..., u_n), where u_i = ([d_i], E_pk(I_{t_i}), E_pk(t_{i,m+1}))
        • E_pk(c'_s) ← E_pk(c')
   (b). C1:
        • Δ ← E_pk(I)^{N−1}
        • for i = 1 to n do:
          • τ_i ← E_pk(i) · Δ
          • τ'_i ← τ_i^{r_i}, where r_i ∈_R Z_N
        • β ← π(τ'); send β to C2
   (c). C2:
        • β'_i ← D_sk(β_i), for 1 ≤ i ≤ n
        • Compute U', for 1 ≤ i ≤ n:
          • if β'_i = 0, then U'_i = E_pk(1)
          • otherwise, U'_i = E_pk(0)
        • Send U' to C1
   (d). C1: V ← π^{-1}(U')
   (e). C1 and C2, for 1 ≤ i ≤ n and 1 ≤ γ ≤ l:
        • E_pk(d_{i,γ}) ← SBOR(V_i, E_pk(d_{i,γ}))
4: SCMC_k(E_pk(c'_1), ..., E_pk(c'_k))

5.1 Stage 1: Secure Retrieval of k-Nearest Neighbors

During Stage 1, Bob initially encrypts his query q attribute-wise, that is, he computes E_pk(q) = ⟨E_pk(q_1), ..., E_pk(q_m)⟩ and sends it to C1. The main steps involved in Stage 1 are shown as steps 1 to 3 in Algorithm 4. Upon receiving E_pk(q), C1 with private input (E_pk(q), E_pk(t_i)) and C2 with the secret key sk jointly engage in the SSED protocol. Here E_pk(t_i) = ⟨E_pk(t_{i,1}), ..., E_pk(t_{i,m})⟩, for 1 ≤ i ≤ n. The output of this step, denoted by E_pk(d_i), is the encryption of the squared Euclidean distance between q and t_i, i.e., d_i = |q − t_i|^2. As mentioned earlier, E_pk(d_i) is known only to C1, for 1 ≤ i ≤ n. We emphasize that computing the exact Euclidean distance between encrypted vectors is hard to achieve, as it involves a square root. However, in our problem it suffices to compare the squared Euclidean distances, since squaring preserves the relative ordering. Then, C1 with input E_pk(d_i) and C2 securely compute the encryptions of the individual bits of d_i using the SBD protocol. Note that the output [d_i] = ⟨E_pk(d_{i,1}), ..., E_pk(d_{i,l})⟩ is known only to C1, where d_{i,1} and d_{i,l} are the most and least significant bits of d_i, respectively, for 1 ≤ i ≤ n.
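As a concrete illustration of the inputs to this stage, the following Python sketch (again assuming python-paillier) shows attribute-wise encryption of a few toy records and of Bob's query; the record values and helper names are ours, not from the paper's dataset. The SSED and SBD sub-protocols themselves are interactive and are not reproduced here; we only check in plaintext the property the protocol relies on, namely that squared Euclidean distances induce the same ordering as the true distances.

from phe import paillier

public_key, secret_key = paillier.generate_paillier_keypair(n_length=1024)

def encrypt_record(values):
    # Attribute-wise Paillier encryption: one ciphertext per attribute value.
    return [public_key.encrypt(v) for v in values]

toy_db = [[3, 7, 1], [8, 2, 0], [4, 6, 1]]      # last column plays the role of the class label
enc_db = [encrypt_record(t) for t in toy_db]     # Alice's outsourced encrypted database D'
enc_q = encrypt_record([5, 5])                   # Bob's attribute-wise encrypted query (m = 2)

# SSED lets C1 obtain E(|q - t_i|^2) interactively with C2. Plaintext sanity check:
def sq_dist(q, t):
    return sum((a - b) ** 2 for a, b in zip(q, t))

dists = [sq_dist([5, 5], t[:2]) for t in toy_db]
print(dists, "-> nearest record index:", dists.index(min(dists)))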
After this, C1 and C2 compute the encryptions of the class labels corresponding to the k-nearest neighbors of q in an iterative manner. More specifically, they compute E_pk(c'_1) in the first iteration, E_pk(c'_2) in the second iteration, and so on. Here c'_s denotes the class label of the sth nearest neighbor to q, for 1 ≤ s ≤ k. At the end of k iterations, only C1 knows ⟨E_pk(c'_1), ..., E_pk(c'_k)⟩. To start with, consider the first iteration. C1 and C2 jointly compute the encryptions of the individual bits of the minimum value among d_1, ..., d_n, and the encryptions of the location and class label corresponding to d_min, using the SMIN_n protocol. That is, C1 with input (u_1, ..., u_n) and C2 with sk compute ([d_min], E_pk(I), E_pk(c')), where u_i = ([d_i], E_pk(I_{t_i}), E_pk(t_{i,m+1})), for 1 ≤ i ≤ n. Here d_min denotes the minimum value among d_1, ..., d_n; I_{t_i} and t_{i,m+1} denote the unique identifier and class label of the data record t_i, respectively. Specifically, (I_{t_i}, t_{i,m+1}) is the secret information associated with t_i. For simplicity, this paper assumes I_{t_i} = i. In the output, I and c' denote the index and class label corresponding to d_min. The output ([d_min], E_pk(I), E_pk(c')) is known only to C1. Now, C1 performs the following operations locally:

• Assign E_pk(c') to E_pk(c'_1). Remember that, according to the SMIN_n protocol, c' is equivalent to the class label of the data record that corresponds to d_min; thus, it is the same as the class label of the nearest neighbor to q.

• Compute the encryption of the difference between I and i, where 1 ≤ i ≤ n. That is, C1 computes τ_i = E_pk(i) · E_pk(I)^{N−1} = E_pk(i − I), for 1 ≤ i ≤ n.

• Randomize τ_i to get τ'_i = τ_i^{r_i} = E_pk(r_i · (i − I)), where r_i is a random number in Z_N. Note that τ'_i is an encryption of either 0 or a random number, for 1 ≤ i ≤ n. Also, exactly one of the entries in τ' is an encryption of 0 (which happens iff i = I) and the rest are encryptions of random numbers. Permute τ' using a random permutation function π (known only to C1) to get β = π(τ') and send it to C2.

Upon receiving β, C2 decrypts it component-wise to get β'_i = D_sk(β_i), for 1 ≤ i ≤ n. After this, he/she computes an encrypted vector U' of length n such that U'_i = E_pk(1) if β'_i = 0, and E_pk(0) otherwise. Since exactly one of the entries in τ' is an encryption of 0, exactly one of the entries in U' is an encryption of 1 and the rest are encryptions of 0's. It is important to note that if β'_k = 0, then π^{-1}(k) is the index of the data record that corresponds to d_min. Then, C2 sends U' to C1. After receiving U', C1 performs the inverse permutation on it to get V = π^{-1}(U'). Note that exactly one of the entries in V is E_pk(1) and the remaining ones are encryptions of 0's. In addition, if V_i = E_pk(1), then t_i is the nearest tuple to q. However, C1 and C2 do not know which entry in V corresponds to E_pk(1).

Finally, C1 updates the distance vectors [d_i] for the following reason:

• The nearest tuple to q should be obliviously excluded from further computations. However, since C1 does not know which record corresponds to E_pk(c'_1), we need to obliviously eliminate the possibility of choosing this record again in later iterations. For this, C1 obliviously updates the distance corresponding to E_pk(c'_1) to the maximum value, i.e., 2^l − 1. More specifically, C1 updates the distance vectors with the help of C2 using the SBOR protocol as below, for 1 ≤ i ≤ n and 1 ≤ γ ≤ l:

E_pk(d_{i,γ}) = SBOR(V_i, E_pk(d_{i,γ})).

Note that when V_i = E_pk(1), the corresponding distance vector d_i is set to the maximum value; that is, in this case, [d_i] = ⟨E_pk(1), ..., E_pk(1)⟩. On the other hand, when V_i = E_pk(0), the OR operation has no effect on the corresponding encrypted distance vector.

The above process is repeated for k iterations, and in each iteration the [d_i] corresponding to the currently chosen label is set to the maximum value. However, C1 and C2 do not know which [d_i] is updated. In iteration s, E_pk(c'_s) is known only to C1. At the end of Stage 1, C1 has ⟨E_pk(c'_1), ..., E_pk(c'_k)⟩, the list of encrypted class labels of the k-nearest neighbors of the query q.

5.2 Stage 2: Secure Computation of Majority Class

Without loss of generality, let us assume that Alice's dataset D consists of w unique class labels denoted by c = ⟨c_1, ..., c_w⟩. We assume that Alice outsources her list of encrypted classes to C1. That is, Alice outsources ⟨E_pk(c_1), ..., E_pk(c_w)⟩ to C1 along with her encrypted database D' during the data outsourcing step. Note that, for security reasons, Alice may add dummy categories to the list to hide the number of class labels, i.e., w, from C1 and C2.
However, for simplicity, we assume that Alice does not add any dummy categories to c.

During Stage 2, C1 with private inputs L = ⟨E_pk(c_1), ..., E_pk(c_w)⟩ and L' = ⟨E_pk(c'_1), ..., E_pk(c'_k)⟩, and C2 with sk, securely compute E_pk(c_q). Here c_q denotes the majority class label among c'_1, ..., c'_k. At the end of Stage 2, only Bob knows the class label c_q.

Algorithm 5. SCMC_k(E_pk(c'_1), ..., E_pk(c'_k)) → c_q
Require: ⟨E_pk(c_1), ..., E_pk(c_w)⟩ and ⟨E_pk(c'_1), ..., E_pk(c'_k)⟩ are known only to C1; sk is known only to C2
1: C1 and C2:
   (a). ⟨E_pk(f(c_1)), ..., E_pk(f(c_w))⟩ ← SF(L, L'), where L = ⟨E_pk(c_1), ..., E_pk(c_w)⟩ and L' = ⟨E_pk(c'_1), ..., E_pk(c'_k)⟩
   (b). for i = 1 to w do:
        • [f(c_i)] ← SBD(E_pk(f(c_i)))
   (c). ([f_max], E_pk(c_q)) ← SMAX_w(χ_1, ..., χ_w), where χ_i = ([f(c_i)], E_pk(c_i)), for 1 ≤ i ≤ w
2: C1:
   (a). γ_q ← E_pk(c_q) · E_pk(r_q), where r_q ∈_R Z_N
   (b). Send γ_q to C2 and r_q to Bob
3: C2:
   (a). Receive γ_q from C1
   (b). γ'_q ← D_sk(γ_q); send γ'_q to Bob
4: Bob:
   (a). Receive r_q from C1 and γ'_q from C2
   (b). c_q ← γ'_q − r_q mod N

The overall steps involved in Stage 2 are shown in Algorithm 5. To start with, C1 and C2 jointly compute the encrypted frequency of each class label using the k-nearest set as input. That is, they compute E_pk(f(c_i)) using (L, L') as C1's input to the secure frequency (SF) protocol, for 1 ≤ i ≤ w. The output ⟨E_pk(f(c_1)), ..., E_pk(f(c_w))⟩ is known only to C1. Then C1 with E_pk(f(c_i)) and C2 with sk engage in the secure bit-decomposition protocol to compute [f(c_i)], that is, the vector of encryptions of the individual bits of f(c_i), for 1 ≤ i ≤ w. After this, C1 and C2 jointly run the SMAX_w protocol. Briefly, SMAX_w uses the sub-routine SMAX to compute ([f_max], E_pk(c_q)) in an iterative fashion, where [f_max] = [max(f(c_1), ..., f(c_w))] and c_q denotes the majority class in L'. At the end, the output ([f_max], E_pk(c_q)) is known only to C1. After this, C1 computes γ_q = E_pk(c_q + r_q), where r_q is a random number in Z_N known only to C1. Then C1 sends γ_q to C2 and r_q to Bob. Upon receiving γ_q, C2 decrypts it to get the randomized majority class label γ'_q = D_sk(γ_q) and sends it to Bob. Finally, upon receiving r_q from C1 and γ'_q from C2, Bob computes the class label for q as c_q = γ'_q − r_q mod N.
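This final blinding and unblinding exchange can be illustrated with a few lines of Python, again assuming python-paillier; the variable names and the small blinding range are ours (in the protocol, r_q is drawn from Z_N and Bob reduces modulo N).

import random
from phe import paillier

public_key, secret_key = paillier.generate_paillier_keypair(n_length=1024)

enc_cq = public_key.encrypt(2)               # E(c_q): encrypted majority label held by C1
r_q = random.randrange(1, 10**6)             # C1's blinding value (toy range for this sketch)
gamma_q = enc_cq + r_q                       # homomorphic addition: E(c_q + r_q), sent to C2
gamma_q_prime = secret_key.decrypt(gamma_q)  # C2 learns only the blinded value c_q + r_q
c_q = gamma_q_prime - r_q                    # Bob unblinds with the r_q received from C1
print(c_q)                                   # prints 2; neither cloud ever saw the label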
5.3 Security Analysis of PPkNN under the Semi-Honest Model

First of all, we stress that, due to the encryption of q and the semantic security of the Paillier cryptosystem, Bob's input query q is protected from Alice, C1 and C2 in our PPkNN protocol. Apart from guaranteeing query privacy, the goal of PPkNN is to protect data confidentiality and hide data access patterns.

In this paper, to prove a protocol's security under the semi-honest model, we adopt the well-known security definitions from the SMC literature. More specifically, as mentioned in Section 2.3, we adopt security proofs based on the standard simulation paradigm [28]. For presentation purposes, we provide formal security proofs (under the semi-honest model) for Stages 1 and 2 of PPkNN separately. Note that the outputs returned by each sub-protocol are in encrypted form and known only to C1.

5.3.1 Proof of Security for Stage 1

As mentioned earlier, the computations involved in Stage 1 of PPkNN are given as steps 1 to 3 in Algorithm 4. For simplicity, we consider the messages exchanged between C1 and C2 in a single iteration (a similar analysis applies to the other iterations).

According to Algorithm 4, the execution image of C2 is given by

Π_{C2}(PPkNN) = { ⟨β_i, β'_i⟩ | for 1 ≤ i ≤ n },

where β_i is an encrypted value, random in Z_{N^2}, and β'_i is derived upon decrypting β_i by C2. Remember that exactly one of the entries in β' is 0 and the rest are random numbers in Z_N. Without loss of generality, let the simulated image of C2 be given by

Π_{C2}^S(PPkNN) = { ⟨a'_{1,i}, a'_{2,i}⟩ | for 1 ≤ i ≤ n }.

Here a'_{1,i} is randomly generated from Z_{N^2}, and the vector a'_2 is randomly generated such that exactly one of its entries is 0 and the rest are random numbers in Z_N. Since E_pk is a semantically secure encryption scheme with resulting ciphertexts in Z_{N^2}, β_i is computationally indistinguishable from a'_{1,i}. In addition, since the random permutation function π is known only to C1, β' is a random vector with exactly one 0 and the rest random numbers in Z_N. Thus, β' is computationally indistinguishable from a'_2. Combining the above results, we conclude that Π_{C2}(PPkNN) is computationally indistinguishable from Π_{C2}^S(PPkNN). This implies that C2 does not learn anything during the execution of Stage 1.

On the other hand, the execution image of C1 is given by Π_{C1}(PPkNN) = {U'}, where U' is an encrypted vector sent by C2 (at step 3(c) of Algorithm 4). Let the simulated image of C1 in Stage 1 be Π_{C1}^S(PPkNN) = {a'}, where a' is randomly generated from Z_{N^2}. Since E_pk is a semantically secure encryption scheme with resulting ciphertexts in Z_{N^2}, U' is computationally indistinguishable from a'. This implies that Π_{C1}(PPkNN) is computationally indistinguishable from Π_{C1}^S(PPkNN). Hence, C1 cannot learn anything during the execution of Stage 1 of PPkNN. Combining all these results, it is clear that Stage 1 is secure under the semi-honest model.

It is worth pointing out that, in each iteration, C1 and C2 do not know which data record corresponds to the current global minimum. Thus, data access patterns are protected from both C1 and C2. Informally speaking, at step 3(c) of Algorithm 4, a component-wise decryption of β reveals to C2 which entry satisfies the current global minimum distance. However, due to the random permutation by C1, C2 cannot trace it back to the corresponding data record. Also, note that the decryption of vector β by C2 results in exactly one 0, with the remaining results being random numbers in Z_N. Similarly, since U' is an encrypted vector, C1 cannot know which tuple corresponds to the current global minimum distance.

5.3.2 Security Proof for Stage 2

In a similar fashion, we can formally prove that Stage 2 of PPkNN is secure under the semi-honest model. Briefly, since the sub-protocols SF, SBD, and SMAX_w are secure, no information is revealed to C2. Also, the operations performed by C1 are entirely on encrypted data, and thus no information is revealed to C1.

Furthermore, the outputs of Stage 1, which are passed as input to Stage 2, are in encrypted form. Therefore, the sequential composition of the two stages leads to our PPkNN protocol, and we claim it to be secure under the semi-honest model according to the Composition Theorem [28]. In particular, based on the above discussion, it is clear that the proposed PPkNN protocol protects the confidentiality of the data and the user's input query, and also hides data access patterns from Alice, C1, and C2.
Note that Alice does not participate in any computations of PPkNN.

5.4 Security under the Malicious Model

The next step is to extend our PPkNN protocol into a secure protocol under the malicious model. Under the malicious model, an adversary (i.e., either C1 or C2) can arbitrarily deviate from the protocol to gain some advantage (e.g., learning additional information about the inputs) over the other party. Such deviations include, for example, C1 (acting as a malicious adversary) instantiating the PPkNN protocol with modified inputs (say E_pk(q') and E_pk(t'_i)) and aborting the protocol after gaining partial information. However, in PPkNN, it is worth pointing out that neither C1 nor C2 knows the results of Stages 1 and 2. In addition, all the intermediate results are either random or pseudo-random values. Thus, even when an adversary modifies the intermediate computations, he/she cannot gain any additional information. Nevertheless, as mentioned above, the adversary can change the intermediate data or perform computations incorrectly before sending them to the honest party, which may eventually result in a wrong output. Therefore, we need to ensure that all the computations performed and messages sent by each party are correct.

Remember that the main goal of SMC is to ensure that the honest parties obtain the correct result and that their private input data are protected from the malicious parties. Therefore, under the two-party SMC scenario, if both parties are malicious, there is no point in developing or adopting an SMC protocol in the first place. In the SMC literature [30], it is the norm that at most one party can be malicious under the two-party scenario. When only one of the parties is malicious, the standard way of preventing the malicious party from misbehaving is to let the honest party validate the other party's work using zero-knowledge proofs [31]. However, checking the validity of the operations at each step of PPkNN can significantly increase the cost.

An alternative approach, as proposed in [32], is to instantiate two independent executions of the PPkNN protocol by swapping the roles of the two parties in each execution. At the end of the individual executions, each party receives the output in encrypted form. This is followed by an equality test on their outputs. More specifically, let E_{pk1}(c_{q,1}) and E_{pk2}(c_{q,2}) be the outputs received by C1 and C2, respectively, where pk1 and pk2 are their respective public keys. Note that the outputs in our case are in encrypted form, and the corresponding ciphertexts (resulting from the two executions) are under two different public-key domains. Therefore, we stress that the equality test based on additive homomorphic encryption properties, which was used in [32], is not applicable to our problem. Nevertheless, C1 and C2 can perform the equality test based on the traditional garbled-circuit technique [33].

5.5 Complexity Analysis

The total computation complexity of Stage 1 is bounded by O(n · (l + m + k · l · log2 n)) encryptions and exponentiations. On the other hand, the total computation complexity of Stage 2 is bounded by O(w · (l + k + l · log2 w)) encryptions and exponentiations. Due to space limitations, we refer the reader to [5] for a detailed complexity analysis of PPkNN. In general, as w ≪ n, the computation cost of Stage 1 should be significantly higher than that of Stage 2.
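As a rough illustration of this claim, the two bounds can be evaluated for the dataset used in our experiments (n = 1,728, m = 6, w = 4). The values of k and l below are assumptions chosen for illustration (k = 10 matches one tested setting; the bit-length l is not fixed here), so the result is only an order-of-magnitude comparison.

from math import log2

n, m, w, k, l = 1728, 6, 4, 10, 12          # l = 12 bits is an assumed domain size

stage1 = n * (l + m + k * l * log2(n))      # O(n * (l + m + k * l * log2 n))
stage2 = w * (l + k + l * log2(w))          # O(w * (l + k + l * log2 w))

print(f"Stage 1 bound ~ {stage1:,.0f} expensive operations")
print(f"Stage 2 bound ~ {stage2:,.0f} expensive operations")
print(f"ratio ~ {stage1 / stage2:,.0f}x")   # several orders of magnitude apart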
This observation is further justified by our empirical results given in the next section.

6 EMPIRICAL RESULTS

In this section, we discuss experiments demonstrating the performance of our PPkNN protocol under different parameter settings. We used the Paillier cryptosystem [4] as the underlying additive homomorphic encryption scheme and implemented the proposed PPkNN protocol in C. The experiments were conducted on a Linux machine with an Intel Xeon six-core 3.07 GHz CPU and 12 GB RAM running Ubuntu 12.04 LTS. To the best of our knowledge, our work is the first effort to develop a secure k-NN classifier under the semi-honest model, so there is no existing work to compare with our approach. Hence, we evaluate the performance of our PPkNN protocol under different parameter settings.

6.1 Dataset and Experimental Setup

For our experiments, we used the Car Evaluation dataset from the UCI KDD archive [34]. It consists of 1,728 records (i.e., n = 1,728) and six attributes (i.e., m = 6). There is also a separate class attribute, and the dataset is categorized into four different classes (i.e., w = 4). We encrypted this dataset attribute-wise, using Paillier encryption whose key size is varied in our experiments, and stored the encrypted data on our machine. Based on our PPkNN protocol, we then executed a random query over this encrypted data. For the rest of this section, we do not discuss the performance of Alice, since it is a one-time cost. Instead, we evaluate and analyze the performance of the two stages of PPkNN separately.

6.2 Performance of PPkNN

We first evaluated the computation costs of Stage 1 of PPkNN for a varying number of nearest neighbors k, with the Paillier encryption key size K set to either 512 or 1,024 bits. The results are shown in Fig. 2a.

(Fig. 2. Computation costs of PPkNN for varying numbers of nearest neighbors k and encryption key sizes K.)

For K = 512 bits, the computation cost of Stage 1 varies from 9.98 to 46.16 minutes as k is increased from 5 to 25. When K = 1,024 bits, the computation cost of Stage 1 varies from 66.97 to 309.98 minutes as k is increased from 5 to 25. In either case, we observed that the cost of Stage 1 grows almost linearly with k. For any given k, we found that the cost of Stage 1 increases by almost a factor of 7 whenever K is doubled. E.g., when k = 10, Stage 1 took 19.06 and 127.72 minutes to generate the encrypted class labels of the 10 nearest neighbors under K = 512 and 1,024 bits, respectively. Moreover, when k = 5, around 66.29 percent of the cost of Stage 1 is due to SMIN_n, which is invoked k times in PPkNN (once per iteration). The share of the cost due to SMIN_n increases from 66.29 to 71.66 percent as k is increased from 5 to 25.

We now evaluate the computation costs of Stage 2 for varying k and K. As shown in Fig. 2b, for K = 512 bits, the computation time for Stage 2 to generate the final class label corresponding to the input query varies from 0.118 to 0.285 seconds as k is increased from 5 to 25. For K = 1,024 bits, Stage 2 took 0.789 and 1.89 seconds when k = 5 and 25, respectively. The low computation costs of Stage 2 are due to SMAX_w, which incurs significantly fewer computations than SMIN_n in Stage 1. This further justifies our theoretical analysis in Section 5.5. Note that, in our dataset, w = 4 and n = 1,728.
As in Stage 1, for any given k, the computation time of Stage 2 increases by almost a factor of 7 whenever K is doubled. E.g., when k = 10, the computation time of Stage 2 increases from 0.175 to 1.158 seconds when the encryption key size K is changed from 512 to 1,024 bits. As shown in Fig. 2b, a similar trend can be observed for other values of k and K.

It is clear that the computation cost of Stage 1 is significantly higher than that of Stage 2 in PPkNN. Specifically, we observed that the computation time of Stage 1 accounts for at least 99 percent of the total time of PPkNN. For example, when k = 10 and K = 512 bits, the computation costs of Stages 1 and 2 are 19.06 minutes and 0.175 seconds, respectively; in this scenario, the cost of Stage 1 is 99.98 percent of the total cost of PPkNN. We also observed that the total computation time of PPkNN grows almost linearly with n and k.

6.3 Performance Improvement of PPkNN

We now discuss two different ways to boost the efficiency of Stage 1 (since the performance of PPkNN depends primarily on Stage 1) and empirically analyze their efficiency gains. First, we observe that some of the computations in Stage 1 can be pre-computed. For example, encryptions of random numbers, 0's and 1's can be pre-computed (by the corresponding parties) in an offline phase. As a result, the online computation cost of Stage 1 (denoted by SRkNN_o) is expected to improve. To see the actual efficiency gains of this strategy, we computed the costs of SRkNN_o and compared them with the costs of Stage 1 without an offline phase (simply denoted by SRkNN); the results for K = 1,024 bits are shown in Fig. 2c. Irrespective of the value of k, we observed that SRkNN_o is around 33 percent faster than SRkNN. E.g., when k = 10, the computation costs of SRkNN_o and SRkNN are 84.47 and 127.72 minutes, respectively (boosting the online running time of Stage 1 by 33.86 percent).

Our second approach to improving the performance of Stage 1 is parallelism. Since operations on data records are independent of one another, most computations in Stage 1 can be parallelized. To empirically evaluate this claim, we implemented a parallel version of Stage 1 (denoted by SRkNN_p) using OpenMP and compared its cost with the cost of SRkNN (i.e., the serial version of Stage 1). The results for K = 1,024 bits are shown in Fig. 2c. The computation cost of SRkNN_p varies from 12.02 to 55.5 minutes as k is increased from 5 to 25. We observe that SRkNN_p is almost six times more efficient than SRkNN. This is because our machine has six cores, so computations can run in parallel on six separate threads. Based on the above discussion, it is clear that the efficiency of Stage 1 can indeed be improved significantly using parallelism, as sketched below.
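The parallel variant SRkNN_p described above is written in C with OpenMP; the following is only a Python analogue of the same per-record fan-out pattern, with a dummy placeholder standing in for the per-record cryptographic work, so the sketch stays self-contained and runnable.

from multiprocessing import Pool

def heavy_record_op(record_id: int) -> int:
    # Placeholder for the per-record work of Stage 1 (e.g., SSED followed by SBD for
    # record t_i); here just a dummy CPU-bound loop.
    acc = 0
    for x in range(200_000):
        acc = (acc + record_id * x) % 1_000_003
    return acc

if __name__ == "__main__":
    n = 1728                                  # number of records, as in the Car Evaluation dataset
    with Pool(processes=6) as pool:           # six workers, mirroring the six-core test machine
        results = pool.map(heavy_record_op, range(n))
    print(len(results), "records processed in parallel")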
On the other hand, Bob's computation cost in PPkNN is mainly due to the encryption of his input query. In our dataset, Bob's computation cost is 4 and 17 milliseconds when K is 512 and 1,024 bits, respectively. It is apparent that PPkNN is very efficient from Bob's computational perspective, which is especially beneficial when he issues queries from a resource-constrained device (such as a mobile phone or PDA).

6.3.1 A Note on Practicality

Our PPkNN protocol is not very efficient without parallelization. However, ours is the first work to propose a PPkNN solution that is secure under the semi-honest model. Due to the rising demand for data mining as a service in the cloud, we believe that our work will be helpful to the cloud community and will stimulate further research in this direction. Hopefully, more practical solutions to PPkNN will be developed (either by optimizing our protocol or by investigating alternative approaches) in the near future.

7 CONCLUSIONS AND FUTURE WORK

To protect user privacy, various privacy-preserving classification techniques have been proposed over the past decade. The existing techniques are not applicable to outsourced database environments, where the data reside in encrypted form on a third-party server. This paper proposed a novel privacy-preserving k-NN classification protocol over encrypted data in the cloud. Our protocol protects the confidentiality of the data and the user's input query, and hides the data access patterns. We also evaluated the performance of our protocol under different parameter settings.

Since improving the efficiency of SMIN_n is an important first step toward improving the performance of our PPkNN protocol, we plan to investigate alternative and more efficient solutions to the SMIN_n problem in future work. We will also investigate extending our research to other classification algorithms.

ACKNOWLEDGMENTS

The authors wish to thank the anonymous reviewers for their invaluable feedback and suggestions. This work has been partially supported by the US National Science Foundation under grant CNS-1011984.

Innovative Schemes for Resource Allocation in the Cloud for Media Streaming Applications

Abstract—Media streaming applications have recently attracted a large number of users in the Internet. With the advent of these bandwidth-intensive applications, it is economically inefficient to provide streaming distribution with guaranteed QoS relying only on central resources at a media content provider. Cloud computing offers an elastic infrastructure that media content providers (e.g., Video on Demand (VoD) providers) can use to obtain streaming resources that match the demand. Media content providers are charged for the amount of resources allocated (reserved) in the cloud. Most existing cloud providers employ a pricing model for the reserved resources that is based on non-linear time-discount tariffs (e.g., Amazon CloudFront and Amazon EC2). Such a pricing scheme offers discount rates depending non-linearly on the period of time during which the resources are reserved in the cloud. In this case, an open problem is to decide on both the right amount of resources reserved in the cloud and their reservation time such that the financial cost on the media content provider is minimized. We propose a simple—easy to implement—algorithm for resource reservation that maximally exploits the discounted rates offered in the tariffs, while ensuring that sufficient resources are reserved in the cloud. Based on the prediction of demand for streaming capacity, our algorithm is carefully designed to reduce the risk of making wrong resource allocation decisions. The results of our numerical evaluations and simulations show that the proposed algorithm significantly reduces the monetary cost of resource allocations in the cloud as compared to other conventional schemes.

Index Terms—Media streaming, cloud computing, non-linear pricing models, network economics

1 INTRODUCTION

Media streaming applications have recently attracted a large number of users in the Internet. In 2010, the number of video streams served increased 38.8 percent to 24.92 billion as compared to 2009 [1]. This huge demand creates a burden on centralized data centers at media content providers, such as Video-on-Demand (VoD) providers, to sustain the required QoS guarantees [2]. The problem becomes more critical with the increasing demand for the higher bit rates required by the growing number of higher-definition video qualities desired by consumers. In this paper, we explore new approaches that mitigate the cost of streaming distribution on media content providers using cloud computing.

A media content provider needs to equip its data center with an over-provisioned (excessive) amount of resources in order to meet the strict QoS requirements of streaming traffic. Since it is possible to anticipate the size of usage peaks for streaming capacity on a daily, weekly, monthly, and yearly basis, a media content provider can make long-term investments in infrastructure (e.g., bandwidth and computing capacities) to target the expected usage peak. However, this causes economic inefficiency in view of flash-crowd events. Since the data centers of a media content provider are equipped with resources that target the peak expected demand, most servers in a typical data center of a media content provider are only used at about 30 percent of their capacity [3].
Hence, a huge amount of capacity at the servers will be idle most of the time, which is highly wasteful and inefficient.

Cloud computing creates the possibility for media content providers to convert the upfront infrastructure investment into operating expenses charged by cloud providers (e.g., Netflix moved its streaming servers to Amazon Web Services (AWS) [4], [5]). Instead of buying over-provisioned servers and building private data centers, media content providers can use the computing and bandwidth resources of cloud service providers. Hence, a media content provider can be viewed as a re-seller of cloud resources, where it pays the cloud service provider for the streaming resources (bandwidth) served from the cloud directly to the clients of the media content provider. This paradigm reduces the expenses of media content providers in terms of purchase and maintenance of over-provisioned resources at their data centers.

In the cloud, the amount of allocated resources can be changed adaptively at a fine granularity, which is commonly referred to as auto-scaling. The auto-scaling ability of the cloud enhances resource utilization by matching the supply with the demand. So far, CPU and memory are the common resources offered by cloud providers (e.g., Amazon EC2 [6]). Recently, however, streaming resources (bandwidth) have also become a feature offered by many cloud providers to users with intensive bandwidth demand (e.g., Amazon CloudFront and Octoshape) [5], [7], [8], [9].

The delay-sensitive nature of media streaming traffic poses unique challenges due to the need for guaranteed throughput (i.e., a download rate no smaller than the video playback rate) in order to enable users to smoothly watch video content online. Hence, the media content provider needs to allocate streaming resources in the cloud such that the demand for streaming capacity can be sustained at any instant of time.

The common type of resource provisioning plan offered by cloud providers is referred to as the on-demand plan. This plan allows the media content provider to purchase resources when needed. The pricing model that cloud providers employ for the on-demand plan is pay-per-use. Another type of streaming resource provisioning plan offered by many cloud providers is based on resource reservation.
With the reservation plan, the media content provider allocates (reserves) resources in advance, and pricing is charged before the resources are utilized (upon receiving the request by the cloud provider, i.e., prepaid resources). The reserved streaming resources are basically the bandwidth (streaming data-rate) at which the cloud provider guarantees delivery to the clients of the media content provider (content viewers) according to the required QoS. In general, the prices (tariffs) of the reservation plan are cheaper than those of the on-demand plan (i.e., time-discount rates are only offered for the reserved (prepaid) resources). We consider a pricing model for resource reservation in the cloud that is based on non-linear time-discount tariffs. In such a pricing scheme, the cloud service provider offers higher discount rates for resources reserved in the cloud for longer times. This enables a cloud service provider to better utilize its abundantly available resources, because it encourages consumers to reserve resources in the cloud for longer times. This pricing scheme is currently used by many cloud providers [10]; see, for example, the pricing of virtual machines (VM) in the reservation phase defined by Amazon EC2 in February 2010. In this case, an open problem is to decide on both the optimum amount of resources reserved in the cloud (i.e., the prepaid allocated resources) and the optimum period of time during which those resources are reserved, such that the monetary cost on the media content provider is minimized. In order for a media content provider to address this problem, prediction of the future demand for streaming capacity is required to help with the resource reservation planning. Many methods have been proposed in prior work to predict the demand for streaming capacity [11], [12], [13], [14].

Our main contribution in this paper is a practical—easy to implement—Prediction-Based Resource Allocation algorithm (PBRA) that minimizes the monetary cost of resource reservation in the cloud by maximally exploiting the discounted rates offered in the tariffs, while ensuring that sufficient resources are reserved in the cloud with some level of confidence in a probabilistic sense. We first describe the system model. We formulate the problem based on the prediction of future demand for streaming capacity (Section 3). We then describe the design of our proposed algorithm for solving the problem (Section 4). The results of our numerical evaluations and simulations show that the proposed algorithm significantly reduces the monetary cost of resource allocations in the cloud as compared to other conventional schemes.

2 RELATED WORK

The prediction of CPU utilization and user access demand for web-based applications has been extensively studied in the literature. A prediction method has been proposed for upcoming CPU utilization pattern demands based on neural networks and linear regression, which is of interest in e-commerce applications [15]. Y. Lee et al. proposed a prediction method based on radial basis function (RBF) networks to predict user access demand for web-type services in web-based applications [16]. Although demand prediction for CPU utilization and web applications has been studied for a relatively long period of time, the prediction of demand for media streaming has gained popularity more recently [11], [12], [13], [14]. The access behaviour of users in peer-to-peer (P2P) streaming was predicted with time-series analysis techniques using non-stationary time-series models in [11].
The method of time-series prediction based on wavelet analysis was studied in [12]. In [13], principal component analysis is employed to extract the access pattern of streaming users. Although most of the above studies predict the average streaming capacity demand, a few papers have also studied the volatility of the capacity demand, i.e., the demand variance at any future point in time, which yields more accurate risk factors [14]. The prediction of streaming bandwidth demand is outside the scope of this paper. In this work, we formulate the problem considering a given probability distribution function for the prediction of the future demand for streaming bandwidth.

In addition to demand prediction for resource reservation, other relevant studies have addressed the appropriate joint reservation of bandwidth resources across multiple cloud service providers with the purpose of maximizing bandwidth utilization [12], [14]. In [17], an adaptive resource provisioning scheme is presented that optimizes bandwidth utilization while satisfying the required levels of QoS. Maximizing bandwidth utilization in turn helps cloud service providers reduce their expenses and maximize their revenues. In [18], an optimization framework for making dynamic resource allocation decisions under risky and uncertain operating environments was developed to maximize revenue while reducing operating costs. This framework considered multiple client QoS classes under workload uncertainty.

Recently, streaming resources (e.g., bandwidth) have become a feature offered by many cloud providers to content providers with intensive bandwidth demand. The streaming of media content to content viewers located in different geographical regions at a guaranteed data-rate is part of the service offered by the cloud provider. The common way of implementing this service in the cloud is by having multiple data centers inside the networks of the access connection providers (e.g., Internet Service Providers, ISPs) located at appropriate geographical locations (Fig. 1) [5], [19], [20]. Cloud service providers may need to negotiate contracts with a number of ISPs to co-locate their servers in the networks of those ISPs. In this regard, another group of papers has focused on studying different types of contracts between cloud service providers and ISPs with the purpose of minimizing the expenses of cloud providers [21]. However, an interesting design approach is to look at the resource reservation problem from the viewpoint of content providers. Obviously, content providers are more interested in minimizing their own costs, i.e., the amount of money that they are charged directly by cloud providers.

To the best of our knowledge, very few studies have investigated the problem of optimizing resource reservation with the objective of minimizing the monetary costs for content providers. A good example is presented in [22], wherein a resource reservation optimization problem was formulated to minimize the costs of content providers, so-called cloud consumers, using a stochastic programming model. In that problem formulation, uncertain demand and uncertain cloud providers' resource prices are considered. In contrast, the optimization problem formulated in our work takes into account a given probability distribution function, obtained from the aforementioned studies, for the prediction of media streaming demands.
Furthermore, the problem of cost minimization is addressed by utilizing the discounted rates offered in the non-linear tariffs. To the best of our knowledge, none of the previous papers has investigated the problem of cost minimization for media content providers in terms of monetary expenses by taking into account both the penalties caused by over-provisioned or under-provisioned reserved resources and the advance purchase of resources at cloud providers for just the right period of time.

3 SYSTEM MODEL AND PROBLEM FORMULATION

The system model that we advocate in this paper for media streaming using cloud computing consists of the following components (Fig. 1).

• Demand forecasting module, which predicts the demand for streaming capacity of every video channel during a future period of time.

• Cloud broker, which is responsible, on behalf of the media content provider, for both allocating the appropriate amount of resources in the cloud and reserving the time over which the required resources are allocated. Given the demand prediction, the broker implements our proposed algorithm to make decisions on resource allocations in the cloud. Both the demand forecasting module and the cloud broker are located at the media content provider site.

• Cloud provider, which provides the streaming resources and delivers the streaming traffic directly to media viewers.

In this paper, we consider the case wherein the cloud provider charges media content providers for the reserved resources according to the period of time during which the resources are reserved in the cloud. In this case, the cloud provider offers higher discount rates for resources reserved in the cloud for longer times.

Non-linear time-discount is a very popular pricing model. Non-linear tariffs are those with marginal rates varying with the quantity purchased and the time rented. Time-discount rates are available when purchasing most types of goods. Products or services with time usage (e.g., rental cars, rental real estate, loans, long-distance telephone cards, photocopiers) are typically offered with a variety of plans (pricing schemes) depending on the period of time the product is consumed (reserved). It has been shown that such pricing schemes enable sellers to increase their revenues [23]. Many cloud providers also use such a pricing scheme [10]; see, for example, the pricing of virtual machines in the reservation phase defined by Amazon EC2 in February 2010. An example of tariffs using such a pricing scheme is shown in Fig. 2. We can see that the tariff is a function of both the units of allocated resources and the reservation time.

(Fig. 1. System model.)
(Fig. 2. An example of tariffs as a function of allocated resources and reservation time.)

We observe the following dilemma: how can the media content provider reserve sufficient resources in the cloud—based on the prediction of future streaming demand—such that no resource wastage is incurred, while QoS for the actual (real) streaming traffic is maintained with some level of confidence (η) in a probabilistic sense? Moreover, how can the media content provider utilize the non-linear tariffs (time-discount rates offered for the reserved (prepaid) resources) to minimize its monetary cost?

Consider a video channel offered by a media content provider. Let D(t) be the actual demand for streaming capacity of the video channel at an instant of time t, measured as the number of users streaming the channel at instant t multiplied by the data rate required for every downloading user to meet the QoS guarantees. It has been shown that D(t) is a random process that follows a log-normal
distribution with mean E[D(t)] and variance (σ) characterized in [11] and [14], respectively.

We denote the amount of streaming bandwidth that the media content provider allocates in the cloud at any time instant t by Alloc(t). Since D(t) is a random process, the media content provider needs to maintain reserved resources Alloc(t) in the cloud such that, at any instant of time,

Probability(D(t) ≤ Alloc(t)) ≥ η,   (1)

where η is a pre-determined threshold (level of confidence). Note that a higher η means a higher degree of confidence, in a probabilistic sense, that the reserved resources Alloc(t) in the cloud meet the QoS guarantees for the actual streaming traffic at any future time instant t. However, increasing η also increases the probability of wasting reserved bandwidth (i.e., the over-subscribed cost). Hence, a proper selection of η is necessary; we propose an algorithm that determines the best value of η in Section 5. In this section, our objective is to find the right amount of reserved resources and their corresponding reservation time such that the monetary cost required for streaming a video content (channel) is minimized given the constraint in Eq. (1). For a log-normal demand, the smallest allocation satisfying this constraint is simply the η-quantile of the demand distribution, as sketched below.
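The following small Python sketch illustrates that step, assuming SciPy's log-normal distribution; the numeric parameters and confidence level are arbitrary example values, not taken from the paper.

from math import exp
from scipy.stats import lognorm

mu, sigma = 2.0, 0.5      # parameters of ln(D(t)); illustrative values only
eta = 0.75                # required level of confidence from Eq. (1)

# P(D(t) <= alloc) >= eta  <=>  alloc is at least the eta-quantile of the demand.
# SciPy parameterizes the log-normal with shape s = sigma and scale = exp(mu).
alloc = lognorm.ppf(eta, s=sigma, scale=exp(mu))
print(f"reserve at least {alloc:.2f} bandwidth units")

# Sanity check: the CDF evaluated at the chosen allocation gives back eta.
assert abs(lognorm.cdf(alloc, s=sigma, scale=exp(mu)) - eta) < 1e-9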
4 ALGORITHM DESIGN

We summarize the assumptions used in our analysis as follows:

1) We assume that, upon receiving the resource allocation request from the media content provider, the cloud provider immediately allocates the required resources, i.e., updating the cloud configuration and launching instances in cloud data centers incurs no delay.

2) Since the only resource that we consider in this work is bandwidth, it would be important to delve into the relation between the cloud provider and content delivery networks (CDNs). However, we assume that the provisioning of media content to media viewers (clients of the media content provider) located in different geographical regions at a guaranteed data-rate is part of the service offered by the cloud provider. The common way of implementing this service in the cloud is by having multiple data centers inside the networks of the access connection providers (e.g., ISPs) located at appropriate geographical locations (Fig. 1) [5], [19], [20].

3) We assume that the media content provider is charged for the reserved resources in the cloud upon making the request for resource reservation (i.e., prepaid resources); therefore, the media content provider cannot revoke, cancel, or change a request for resource reservation previously submitted to the cloud.

4) In clouds, tariffs (prices of different amounts of reserved resources in $ per unit of reservation time) are often given in tabular form. Therefore, the cloud service provider requires a minimum reservation time for any allocated resources, and only allows discrete levels (categories) of the amount of allocated resources in the cloud. See, for example, the reservation phase in the Amazon CloudFront resource provisioning plans [7].

We take the aforementioned constraints into account and propose a practical—easy to implement—algorithm for resource reservation in the cloud, such that the financial cost on the media content provider is minimized.

Suppose that the media content provider can predict the demand for streaming capacity of a video channel (i.e., the statistical expected value of the demand E[D(t)] is known) over a future period of time L using one of the methods in [11], [12], [13], [14]. The content provider reserves resources in the cloud according to the predicted demand. The proposed algorithm is based on time-slots of varying durations (sizes). In every time-slot, the media content provider makes a decision to reserve an amount of resources in the cloud. Both the amount of resources to be reserved and the period of time over which the reservation is made (the duration of the time-slot) vary from one time-slot to another, and are determined by our algorithm to yield the minimum overall monetary cost (Fig. 3).

We alternatively call a time-slot a window, and denote the window size (duration of the time-slot) by w. Since the actual demand varies during a window, while the allocated resources in the cloud remain the same for the entire window (according to the third assumption above), the algorithm needs to reserve resources in every window j that are sufficient to handle the maximum predicted demand for streaming capacity during that window, with some probabilistic level of confidence η.

We denote the amount of reserved resources in window j by Alloc_j. Since the decision on the amount of reserved resources is affected by wrong predictions of the future streaming demand, our online algorithm is carefully designed to obtain accurate demand predictions (by enabling a mechanism that continuously updates the demand forecast module according to the actual demand received at the media content provider over time) in order to reduce the risk of making wrong resource reservation decisions (Fig. 1).

We denote the monetary cost of the reserved resources during window j by Cost(w_j, Alloc_j), which can be computed as

Cost(w_j, Alloc_j) = tariff(w_j, Alloc_j) × w_j,   (2)

where tariff(w_j, Alloc_j) represents the price (in $ per time unit) charged by the cloud provider for the amount of resources Alloc_j reserved for a period of time (window size) w_j. Note that the values of tariff and Cost in any window j depend on both the amount of allocated resources (Alloc_j) and the period of time over which the resources are reserved (w_j). Also note that the algorithm runs on-the-fly. More specifically, the demand forecast module predicts the streaming capacity demand in the upcoming period of time L and feeds this information to our algorithm. Upon receiving the demand prediction, the algorithm computes the right size of window j (i.e., w*_j) and the right amount of reserved resources in window j (i.e., Alloc*_j), such that the cost of the reserved resources during window j (i.e., Cost(w_j, Alloc_j) in (2)) is minimized; or, equivalently, the discounted rates offered in the tariffs are maximally utilized.

(Fig. 3. PBRA algorithm design.)

Hence, the objective of our algorithm is to minimize Cost(w_j, Alloc_j) for all j, subject to

Probability(D(t) ≤ Alloc(t)) ≥ η, for all t ∈ L.

In other words, our objective is to minimize the monetary cost of reserved resources such that the amount of reserved resources at any instant of time is guaranteed to meet the actual demand with probabilistic confidence equal to η. As discussed earlier, D(t) is a random process that follows a log-normal distribution with mean E[D(t)] and variance (σ) characterized in [11] and [14], respectively.
Thus, using the constraint above, and for any window size w_j, we can compute the minimum amount of required reserved resources during window j (Alloc_j) by solving the following equation for Alloc_j:

$$\int_{0}^{Alloc_j} \frac{1}{x\,\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{\ln(x)\,-\,m_{max}}{\sigma}\right)^{2}} dx \;=\; \eta, \qquad (3)$$

where m_max is the maximum value of the predicted streaming demand during window j (i.e., m_max = max_{t ∈ w_j} E[D(t)]). Note that Equation (3) follows from the log-normal probability distribution of the demand for streaming capacity.

As discussed earlier, the cloud service provider often requires a minimum reservation time for any allocated resources (w_min), and only allows discrete levels (categories) of reservation times for any amount of allocated resources in the cloud. We therefore assume that any reservation time required at the cloud has to be a multiple of w_min (i.e., w_j = k · w_min, where k is a positive integer). Thus, the algorithm employs a trial window (w_h) to assist in making the optimum decision on the size of every window j. In particular, for every window j, the algorithm starts an iteration process with a trial window of size w_h = w_min and computes the cost rate X_h = tariff(w_h, Alloc_h), where h is the iteration index and Alloc_h is obtained by solving Eq. (3) for Alloc.

Recall that, due to the time-discount rates offered in the tariffs, increasing the time during which the allocated resources are reserved may lead to a lower monetary cost (higher discounted rate) for the media content provider (Fig. 2). However, increasing the window size (time-slot) significantly may also result in a high over-provisioning (over-subscribed) cost, as the media content provider has to allocate resources in the cloud that meet the highest demand during the window period. Thus, in order to
Assume that the amount ofreserved resources in the cloud can only take integer numbersof unit of resources (i.e., cloud provider applies certainlevels (categories) on the amount of allowed reservedresources, AllocðtÞ 2 f1; 2; 3; . . .g.For the given predicted demand, our algorithm findsthe optimum size of every window j and optimumamount of reserved resources in window j as follows.The algorithm starts iterations to determine the size ofthe first window (i.e., wj¼1). In the first iteration (h ¼ 1),wh¼1 ¼ 1, we can see that the maximum predicteddemand when wh¼1 ¼ 1 is 0:63 (Fig. 4). Thus, we havemmaxh ¼ 0:63. Using Eq. (3), we have Alloch¼1 ¼ 0:81.Since the cloud allows only discrete levels for reservedresources in the cloud, then Alloch¼1 must be rounded tothe nearest upper value allowed in the cloud. Thus,Alloch¼1 ¼ 1. Using tariff functions shown in Fig. 2, wehave the cost rate Xh ¼ tariffðwh¼1 ¼ 1; Alloch¼1 ¼ 1Þ ¼ 11.The iterations continue until wh ¼ L.We summarize the results of all iterations h performedfor window j ¼ 1 using our proposed algorithm in Table 1.From the table, we can see that the minimum value of costrate Xh is when h_ ¼ 10. Hence, the optimum window sizeis w_j ¼1 ¼ wh¼10 ¼ 10, and the optimum amount of reservedresources during window j ¼ 1 is Alloc_j¼1 ¼ Alloch¼10 ¼ 2.Similarly, we can find the optimum window size and optimumamount of resources in the next window (j ¼ 2) givenan updated prediction of the demand in another period offuture time L.5 HYBRID APPROACH FOR RESOURCEPROVISIONINGIn this section, we consider the case, wherein the cloud provideroffers two different types of streaming resource provisioningplans: the reservation plan and the on-demandplan. With the reservation plan, the media content providerreserves resources in advance and pricing is charged beforethe resources are utilized (upon receiving the request at thecloud provider, i.e., prepaid resources). With the ondemandplan, the media content provider allocates streamingresources upon needed. Pricing in the on-demand planis charged by pay-per-use basis. In general, the prices (tariffs)of the reservation plan are cheaper than those of the ondemandplan (i.e., time discount rates are only offered tothe reserved (prepaid) resources). Amazon CloudFront [7],Amazon EC2 [6], GoGrid [24], MS Azure, Op-Source, andTerre-mark are examples of cloud providers which offerInfrastructure-as-a-Service (IaaS) with both plans [10].When the media content provider only uses theresource reservation plan, the under-provisioning problemcan occur if the reserved (prepaid) resources are unable tofully meet the actual demand due to high fluctuatingdemand or prediction mismatch. Also, over-provisioningproblem can occur if the reserved (prepaid) resources aremore than the actual demand, in which parts of thereserved resources are wasted. However, when the cloudprovider offers both the reservation plan and the ondemandplan, the media content provider can allocateresources in the cloud more efficiently. In particular, themedia content provider can use reservation plan to benefitfrom the time-discounted rate, while use the on-demandplan to dynamically allocate streaming resources to its clientsat the moment when the reserved resources allocatedusing the reservation plan are unable to meet the actualdemand and extra resources are needed to fit the fluctuatedand unpredictable demands (e.g., flash crowd). Wecall this approach hybrid resource provisioning. 
This hybrid approach eliminates both the over-provisioning (over-subscription) cost and the under-provisioning problem that may occur when using the reservation plan only.

In this hybrid resource provisioning approach, the tradeoff between the amount of resources allocated using the on-demand plan and the amount of resources allocated using the reservation plan needs to be adjusted so that the hybrid approach can perform optimally. In this section, we propose an algorithm for this hybrid resource provisioning approach that maximally benefits from the time-discounted rate offered in the resource reservation plan, while eliminating any over-provisioning cost of reserved resources, such that the overall monetary cost of resource allocations in the cloud (including both the reserved resources and the on-demand resources) is minimized.

Fig. 4. An example of predicted demand over a period of future time L = 12.
Table 1. Example: summary of results for iterations executed for window j = 1.

As we have described in the previous section (Section 4), the cost of allocated resources using the reservation plan depends on the parameter θ. We referred to θ as the level of confidence. We have shown that using a higher value of θ results in a higher amount of reserved resources in the cloud, and vice versa. However, increasing the value of θ for the reserved resources may lead to the over-provisioning problem, while decreasing the value of θ may lead to the under-provisioning problem. Since the pricing of resource allocation in the on-demand plan is higher than that of the reservation plan, one may erroneously believe that increasing the value of θ would always reduce the overall monetary cost, since the portion of reserved (discounted) resources in the cloud is increased. However, reserving too many resources (i.e., using a high value of θ for the reserved (prepaid) resources) may be far from optimal, because it may significantly increase the over-provisioning (over-subscription) cost. Hence, this hybrid approach requires that the content provider select the right value of θ for the reserved resources. Our proposed algorithm in this section computes the optimum value of θ (θ*) that yields the minimum overall monetary cost of resource allocations in the cloud (both reserved and on-demand resources) when the media content provider uses this hybrid resource provisioning approach.

Let us again assume that the media content provider can predict the demand for a future period of time L. Let C_hybrid be the price that the media content provider expects to pay to the cloud provider for all streaming resources allocated in the cloud using the hybrid approach (i.e., C_hybrid is the statistical mean of the cost). We can see that C_hybrid is the summation of two terms: the price charged for the reserved resources in every window j using the reservation plan (denoted by C^RSV_j), and the expected cost of resources allocated in the cloud during every window j using the on-demand plan (denoted by C^OD_j). Hence,

C_{hybrid} = \sum_{j} \left( C^{RSV}_j + C^{OD}_j \right).    (4)

Let Alloc^RSV_j be the amount of reserved resources in window j, and Alloc^OD_j the amount of on-demand resources allocated in window j. Let tariff(w^RSV_j, Alloc^RSV_j) be the tariff charged for the reserved resources in window j, and tariff(Alloc^OD_j) the tariff charged for the on-demand resources in window j.
Note that the cost rate of the resource reservation plan, tariff(w^RSV_j, Alloc^RSV_j), depends on both w^RSV_j and Alloc^RSV_j, while tariff(Alloc^OD_j) depends only on the amount of allocated resources Alloc^OD_j. This is because no time discount rate is offered to the on-demand resources.

Let x be a random variable representing the demand for streaming capacity at any instant of time during window j, and let f(x) be the probability density function of x. Note that when the amount of reserved resources in window j (Alloc^RSV_j) is known, C^OD_j can be computed by considering the event Alloc^RSV_j < x < ∞. This is because when x < Alloc^RSV_j, the amount of reserved resources in the cloud is sufficient to handle the actual streaming demand and there is no need to allocate extra resources using the on-demand plan. Thus, we can compute the cost of reserved resources in window j (in $) as

C^{RSV}_j = w_j \cdot tariff\left(w^{RSV}_j, Alloc^{RSV}_j\right),    (5)

and consequently the expected (statistical mean) cost of the on-demand resources in window j can be computed as

C^{OD}_j = w_j \cdot \int_{Alloc^{RSV}_j}^{\infty} f(x) \cdot tariff\left(x - Alloc^{RSV}_j\right) dx.    (6)

We shall consider a log-normal statistical probability distribution f(x), as discussed earlier [11], [14]. Thus, f(x) in Eq. (6) can be written as

f(x) = \frac{1}{x \sigma \sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{\ln(x) - \mu_{max}}{\sigma}\right)^{2}}.

As we have described in Section 4, the right amount of reserved resources in window j (Alloc^RSV_j) can be determined given the parameter θ. Thus, C_hybrid in Eq. (4) is a function of the parameter θ only. Our objective is to minimize C_hybrid in Eq. (4), or equivalently to determine the value of θ that minimizes the overall cost of allocated resources using the hybrid approach. It is straightforward to show that C_hybrid is convex with respect to θ. Thus, in order to minimize C_hybrid, we need to find the optimum value of θ (θ*) using Equations (5) and (6).

We can see that θ* can easily be solved for numerically for every window j if tariff functions are given (i.e., tariff(t, Alloc^RSV(t)) and tariff(Alloc^OD(t)) for any duration of resource allocation). However, as we have discussed earlier, tariffs are often given in tabular form. Moreover, the cloud service provider often requires a minimum reservation time for any allocated resources, and only allows discrete levels (categories) of allocated resources in the cloud. We take those constraints into account and propose an efficient heuristic algorithm for this hybrid resource provisioning approach. The pseudo code of the proposed algorithm is shown in Algorithm 2.

The algorithm works as follows. Suppose that θ takes discrete values, and the total number of possible values of θ is S. For every window j, the iteration process described in Algorithm 1 is performed for every value of θ in order to compute both the right amount of reserved resources (Alloc^RSV_j) and the right time over which these resources are reserved (w^RSV_j). When the amount of reserved resources in window j is determined, the amount of extra resources that must be allocated using the on-demand plan in order to fulfil the predicted streaming demand can easily be computed as Alloc^OD_j = μ_max - Alloc^RSV_j, where μ_max is the maximum value of the predicted streaming demand during window j. Thus, the total corresponding cost rate of allocated resources in window j is computed as X_h = tariff(w^RSV_j, Alloc^RSV_j) + tariff(Alloc^OD_j), where h is the iteration index. The iteration process continues, and out of all values of X_h computed for different values of θ, the algorithm finds the θ* corresponding to the minimum value.
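A minimal sketch of this discrete search over θ, reusing the window-sizing sketch given in Section 4, is shown below; the on-demand tariff interface, the θ grid, and all names are our own assumptions rather than the paper's Algorithm 2 as written.

class HybridThetaSearch {

    /** On-demand cost rate; no time discount applies to the on-demand plan. */
    interface OnDemandTariff {
        double rate(double onDemandUnits);
    }

    /** Returns the theta value with the lowest total (reserved + on-demand) cost rate. */
    static double bestTheta(double[] thetaGrid, int wMin, int L, double[] predictedDemand,
                            WindowSizer.ReservedTariff reservedTariff, OnDemandTariff onDemandTariff) {
        double bestCost = Double.MAX_VALUE, bestTheta = thetaGrid[0];
        for (double theta : thetaGrid) {
            // Reuse the Algorithm 1 sketch for this candidate theta.
            int[] wAndAlloc = WindowSizer.sizeWindow(wMin, L, theta, predictedDemand, reservedTariff);
            int w = wAndAlloc[0], allocRsv = wAndAlloc[1];
            double muMax = 0.0;
            for (int t = 0; t < w && t < predictedDemand.length; t++)
                muMax = Math.max(muMax, predictedDemand[t]);
            double allocOd = Math.max(0.0, muMax - allocRsv);   // on-demand top-up, Alloc_OD = mu_max - Alloc_RSV
            double costRate = reservedTariff.rate(w, allocRsv) + onDemandTariff.rate(allocOd);
            if (costRate < bestCost) { bestCost = costRate; bestTheta = theta; }
        }
        return bestTheta;
    }
}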
The algorithm is repeated for every window. We can see that the complexity of the proposed algorithm (measured in terms of the number of iterations required for every window) is O((L / w_min) · S). Thus, increasing the size of S increases the complexity of the algorithm, but also increases its accuracy. However, the complexity of our algorithm scales linearly with the size of the input (S), which means that our algorithm executes efficiently.

6 PERFORMANCE EVALUATION

We first analytically derive a demand prediction function that we shall use in our performance evaluations (Section 6.1). We then investigate the performance of our simple "on-line" Prediction-Based Resource Allocation algorithm proposed for reserving resources in the cloud, in terms of both the monetary cost of reserved resources in the cloud and complexity (CPU time) (Section 6.2). We then compare the performance of PBRA against two other schemes: a fixed window size resource reservation scheme and a pay-as-you-go resource allocation scheme (Section 6.2.2). Finally, we evaluate the performance of our hybrid resource allocation algorithm proposed for the case when the cloud provider offers two streaming resource provisioning plans, reservation and on-demand, and show that our algorithm significantly reduces the overall cost of resource allocation (Section 6.3).

6.1 Demand Model

As we have discussed so far, prediction of the future demand for streaming capacity is required in order for the media content provider (e.g., VoD) to optimally reserve resources in the cloud. In this section, we use a special case of the demand in which the function of expected (mean) future streaming demand for a video channel (i.e., E[D(t)]) can easily be formulated analytically. Specifically, we assume that all media streaming demand for a video channel available at a local VoD provider is generated by users located in a single private network (e.g., users on a college or office campus).

What distinguishes the evolution of interest in a media content among users of a private network from the Internet at large is that users in a private network are often socially connected (e.g., friends or colleagues in a social network). Those users form a community and share similar interests. Thus, the demand for a media content grows quickly in the private network as interested users contact others (by either broadcasting knowledge about the existence of the media content to their friends in the social network, e.g., Facebook, or using email-group broadcast) and make them interested. However, the interest (demand) tapers off when a certain cumulative level of interest among users of the private network is reached. For example, a student in a class of 100 students can spread the knowledge about a video content to his classmates. If the popularity of this content among students in the class is 0.2, the demand increases quickly over time as interested users contact others, but tapers off when all potentially interested students in the class (20 students) have become interested in the content and viewed it.
When all 20 students finish viewing the video content, the lifetime of that content in this community network expires.

We analytically characterize this viral evolution of interest in a media content among users of a private network. Let us assume that the number of friends to whom a user is connected in a social network (node degree) at any instant of time is, on average, N. Let us further assume that a user who receives the notification about the existence of the content becomes interested with probability p and re-broadcasts the notification, in turn, to his friends on the social network, where p is the expected popularity of the content among users of the private network. We further assume that users who receive multiple notifications for the same content do not rebroadcast the message.

If the social network graph is fully connected (i.e., a notification about the existence of the content reaches all users in the private network), we can then use the fluid-flow model to write the evolution of interest in a media content as

\frac{dI(t)}{dt} = I(t)\left[p\left(N - g(t) \cdot N\right)\right],

where I(t) is the total number of users interested in the content at time t (cumulative interest), and (g(t) · N) accounts for the fraction of the N users who have received multiple notifications by time instant t, with g(t) := I(t) / N_T, where N_T is the potential number of users in the network who will ultimately become interested in the content (N_T = 100 in Fig. 5), i.e., N_T is the maximum expected level of the content's cumulative interest in the private network.

The above formula is a second-order Bernoulli differential equation and can be solved as

I(t) = \frac{N_T \cdot I(0)}{I(0) + \left(N_T - I(0)\right) e^{-p \cdot N \cdot t}},    (7)

where I(0) is the number of interested users at time t = 0. We note that I(t) has an S-shape (Fig. 5). It shows that the number of interested users increases quickly when the content becomes available, and the growth then gradually slows and tapers off once the level of interest reaches N_T. This is similar to the demand function obtained using word-of-mouth spread of information by interested users (the Bass model). Similar interest evolution was also observed when measuring user interest in a video file on a YouTube server [25], and when measuring user interest in popular video hosted on a university infrastructure (CoralCDN) [26].

Given the evolution of interest in a media content I(t) in Eq. (7), we can now use the fluid-flow model to write the rate at which downloading users are completely served (finish downloading the media content) as

\frac{dS(t)}{dt} = \mu_Q \cdot \left[I(t) - S(t)\right],

where μ_Q is the required QoS streaming rate for every downloading user (measured in bits/second), and S(t) is the number of completely served users at time instant t. The above differential equation can easily be solved for S(t). Hence, the expected value of the demand for streaming capacity of the content at any time t (measured in bits/second) is

E[D(t)] = \frac{dS(t)}{dt} = \mu_Q \cdot \left[I(t) - S(t)\right].    (8)
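The demand model above can be reproduced in a few lines of code. The sketch below evaluates I(t) from Eq. (7) and steps dS/dt with a simple Euler scheme to obtain E[D(t)] from Eq. (8); the parameter values are illustrative only and are not those used in the paper's figures.

class DemandModel {

    // Cumulative interest I(t) from Eq. (7).
    static double interest(double t, double nT, double i0, double p, double n) {
        return (nT * i0) / (i0 + (nT - i0) * Math.exp(-p * n * t));
    }

    public static void main(String[] args) {
        double nT = 100, i0 = 1, p = 0.2, n = 5, muQ = 1.0;   // assumed values
        double dt = 0.01, s = 0.0;                            // S(0) = 0
        for (double t = 0; t <= 20; t += dt) {
            double i = interest(t, nT, i0, p, n);
            double demand = muQ * (i - s);                    // E[D(t)], Eq. (8)
            s += demand * dt;                                 // Euler step for S(t)
            if (Math.abs(t - Math.round(t)) < dt / 2)         // print roughly once per time unit
                System.out.printf("t=%.0f  I=%.1f  E[D]=%.2f%n", t, i, demand);
        }
    }
}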
6.2 Evaluation of the Algorithm (PBRA) Proposed for Reserving Resources in the Cloud

The algorithm that we evaluate in this section is the first algorithm, proposed in Section 4 for resource reservation in the cloud. We used time-discount rates similar to those used in the pricing model employed by Amazon EC2 [6] in order to derive the tariff functions used in our evaluations. Those tariffs are non-linear functions of both the amount of reserved resources and the reservation time. An example of a tariff function that we used in our evaluations, for a number of reserved resource units equal to 3, is depicted in Fig. 6. Note that time discounts are given to the reserved resources. For example, if the media content provider wants to reserve (prepaid purchase) 3 units of streaming resources for 6 time units, the tariff is $13 per unit of reserved time, whereas the tariff is $14.25 if the same amount of resources is reserved for only 1 time unit. We consider a log-normal probability distribution of the demand for streaming capacity with mean (i.e., predicted demand E[D(t)]) computed by Eq. (8) for the I(t) given in Fig. 5, μ_Q = 1, and a variance of 3.

6.2.1 Performance versus Complexity

As we have discussed in Section 4, our proposed algorithm (PBRA) employs a trial window w_try whose size takes values in multiples of w_min, where w_min can be defined as the granularity of resource allocation in the cloud (i.e., it is the minimum reservation time that the cloud provider requires for any amount of resources reserved in the cloud), measured in units of time. To investigate the impact of the value of w_min on the performance of our algorithm, we compared the financial cost of media streaming when using our algorithm for various sizes of w_min at θ = 0.75. To plot the comparison figure, we computed the ratio of the overall cost of resource reservation for every value of w_min to the overall cost when using w_min = 1 (i.e., the normalized cost) (Fig. 7).

Fig. 5. The evolution of interest in the video channel.
Fig. 6. A tariff function for units of reserved resources equal to 3.
Fig. 7. Performance versus complexity of the PBRA algorithm for resource reservation in the cloud.

The results show that the algorithm provides the lowest cost of resource allocation in the cloud when w_min = 1. Hence, the finer the granularity of resource allocation in the cloud (i.e., the smaller the value of w_min), the better the performance of our algorithm. The better performance, however, comes at higher algorithm complexity, where complexity is measured in terms of the total number of iterations (h). We can see that h is higher for smaller w_min (Fig. 7). However, even for the highest number of iterations (when w_min = 1), the total CPU time was only 1.02 seconds using an Intel(R) Core(TM)2 Quad CPU @ 2.82 GHz. If we compare this execution time with the period of time over which the algorithm is operating, 0 <= t(sec) <= 1,000 (Fig. 5), we can see that our algorithm executes very efficiently.

6.2.2 Comparison with Other Resource Provisioning Algorithms

Recall that our proposed algorithm for resource reservation in the cloud (PBRA) is based on windows with variable sizes (i.e., variable time slots, as shown in Fig. 3). The size of every window and the amount of reserved resources in every window are determined so as to minimize the financial cost on the media content provider. We evaluate the performance of our PBRA algorithm against two other resource provisioning schemes: a fixed window size scheme (denoted by fixed-reserve-time), and the pay-as-you-go resource allocation scheme which is widely used in the clouds (denoted by pay-as-you-go). The fixed window size scheme is based on resource reservation wherein all time slots (windows) are of the same size (i.e., w_j is the same for all j). The pay-as-you-go scheme is based on on-demand resource allocation wherein resources are allocated when needed.
The price of reserved resources is lower than that of on-demand resources, since time-discounted rates are only given to the reserved resources.

We computed the overall financial cost when using each of the above schemes for resource allocation in the cloud. To plot the comparison figure, we computed the ratio of the overall cost for every value of w_min to the cost when using our PBRA algorithm with w_min = 1 (normalized cost) (Fig. 8). In the case of fixed-reserve-time, we set w_j always fixed (w_j = w_min for all j), with w_j = 10. We can see that PBRA outperforms the fixed-reserve-time scheme for all values of w_min. This is because PBRA selects window sizes according to the predicted demand such that the right amount of resources is reserved in the cloud, which maximally benefits from the time-discount rates in the tariffs and ensures that reserved resources meet the actual demand without incurring wastage. PBRA also outperforms the pay-as-you-go scheme because it maximally benefits from the time-discounted rates given to the reserved resources, while no discount is given to resources allocated using the on-demand scheme.

Fig. 8. Performance comparisons.

6.2.3 Impact of Different Probability Distributions of the Demand

In the next set of evaluations, we considered three log-normal probability distribution functions for the demand with the same mean but different variances. The mean of all log-normal distributions, E[D(t)], is given by Eq. (8), where I(t) is given in Fig. 5 and μ_Q = 1, while the variances of the log-normal distributions were set to 3, 6, and 8.

The stochastic effect of demand on the cost of reserved resources using PBRA is shown in Table 2 for θ = 0.75. We observe that the overall resource reservation cost increases as the variance of the log-normal distribution increases. This is because a larger variance means a higher likelihood that the reserved resources in the cloud do not meet the actual demand. Consequently, more reserved resources are required in the cloud to meet the actual demand given a certain probabilistic confidence θ, which results in a higher cost for resource reservation in the cloud.

Table 2. Media streaming cost given different probability distributions of the demand (in $).

6.3 Evaluation of the Hybrid Approach for Resource Allocation in the Cloud

In this section, we evaluate the performance of our hybrid resource allocation algorithm proposed in Section 5. Our hybrid approach enables the media content provider to efficiently allocate resources in the cloud using both the reservation resource provisioning plan and the on-demand resource provisioning plan offered by the cloud provider. As we have discussed in Section 5, the right value of the parameter θ has to be determined for this hybrid approach to perform optimally. To investigate the impact of different values of θ on the performance of the hybrid approach, we considered continuous non-linear tariffs that are functions of both the allocated resources and the reservation time. We used time-discount rates similar to those used in the pricing model employed by Amazon EC2 [6] in order to derive the tariff functions used in our evaluations. Time discount rates are only offered to reserved resources, while no time discount rates are offered to resources allocated using the on-demand plan. An example of a tariff function that we used in our evaluations, for a number of allocated resource units equal to 3, is depicted in Fig. 6.
Referring to Fig. 6, if the average number of resource units allocated in the cloud for 6 time units using the on-demand plan is 3, then the cost is 15 × 6 = $90, whereas if the media content provider reserves (prepaid purchase) the same amount of resources for 6 time units using the reservation plan, the price charged is only 13 × 6 = $78.

In the next set of simulations, we consider a demand with mean E[D(t)] given by Eq. (8), where I(t) is given in Fig. 5, μ_Q = 1, and the variance is 3. Recall that our hybrid approach selects the right value of θ in every window. In every window j, different values of θ are tested to select the one that yields the least overall cost. Table 3 shows the cost of resources allocated using both the resource reservation plan and the on-demand plan for j = 7 (corresponding to t = 650), which results from using our hybrid algorithm. We observe that when θ increases, the cost of the resources allocated using the reservation plan increases, while the cost of resources allocated using the on-demand plan decreases. This is because a higher amount of reserved resources is required in the cloud for higher θ and, consequently, a smaller amount of on-demand resources is needed. We also observe that when θ increases from 0.75 to 0.8, the overall cost (i.e., the cost of both reserved and on-demand resources) decreases, whereas when θ increases beyond 0.8 the overall cost increases. This is because the over-subscription (over-provisioning) cost of the reserved resources becomes very high when θ > 0.8. We can see that the optimum value of θ (i.e., the value of θ that yields the least overall cost) for j = 7 is about 0.8.

To get a sense of how the optimal selection of the value of θ can significantly reduce the overall monetary cost on the media content provider when using this hybrid streaming resource provisioning approach, let us compare the total cost when using our hybrid resource allocation algorithm at j = 7 against two cases: the case when the media content provider uses the on-demand plan only (pay-as-you-go), and the case when the media content provider uses the reservation plan only (fixed-reserve-time). We observed that the cost of our hybrid approach when θ* = 0.8 is $45,833, while the cost of allocated resources in the case of pay-as-you-go is fixed at about $52,000 (it does not depend on the value of θ), and the cost of allocated resources in the case of fixed-reserve-time when θ = 0.8 is about $48,000 (Fig. 9). Hence, our algorithm reduces the cost by about $6,200 compared to pay-as-you-go (i.e., about 12 percent cost saving), and by about $2,200 compared to fixed-reserve-time (i.e., about 4.5 percent cost saving). We note here that the cost was computed for only one video channel. However, a media content provider generally offers hundreds of video channels to its clients. Therefore, the overall cost saving from using our proposed algorithm can be significantly high for the large number of video channels offered by the media content provider.

7 CONCLUSION AND FUTURE WORK

This paper studies the problem of resource allocation in the cloud for media streaming applications. We have considered non-linear time-discount tariffs that a cloud provider charges for resources reserved in the cloud. We have proposed algorithms that optimally determine both the amount of reserved resources in the cloud and their reservation time, based on prediction of the future demand for streaming capacity, such that the financial cost on the media content provider is minimized.
The proposed algorithms exploit the time-discounted rates in the tariffs, while ensuring that sufficient resources are reserved in the cloud without incurring wastage. We have evaluated the performance of our algorithms numerically and using simulations. The results show that our algorithms adjust the tradeoff between resources reserved in the cloud and resources allocated on demand. In future work, we shall perform experimental measurements to characterize the streaming demand in the Internet and develop our own demand forecasting module. We shall also investigate the case of multiple cloud providers and consider market competition when allocating resources in the clouds.

ACKNOWLEDGMENTS

This work was supported by the National Center of Electronics, Communication, and Photonics at King Abdulaziz City for Science and Technology (Saudi Arabia). This paper was based in part on a paper that appeared in the proceedings of IEEE Globecom 2012.

Table 3. Media streaming cost using the two resource allocation plans provided by the cloud (hybrid resource provisioning approach) (in $).
Fig. 9. Hybrid approach performance comparisons.

Improving Web Navigation Usability by Comparing Actual and Anticipated Usage

We present a new method to identify navigation related Web usability problems based on comparing actual and anticipated usage patterns. The actual usage patterns can be extracted from Web server logs routinely recorded for operational websites by first processing the log data to identify users, user sessions, and user task-oriented transactions, and then applying a usage mining algorithm to discover patterns among actual usage paths. The anticipated usage, including information about both the path and time required for user-oriented tasks, is captured by our ideal user interactive path models constructed by cognitive experts based on their cognition of user behavior.

The comparison is performed via a MySQL-backed checking mechanism that verifies results and identifies user navigation difficulties. The deviation data produced from this comparison can help us discover usability issues and suggest corrective actions to improve usability. A software tool was developed to automate a significant part of the activities involved. In an experiment on a small service-oriented website, we identified usability problems, which were cross-validated by domain experts, and quantified usability improvement through a higher task success rate and lower time and effort for given tasks after the suggested corrections were implemented. This case study provides an initial validation of the applicability and effectiveness of our method.

1.2 INTRODUCTION

As the World Wide Web becomes prevalent today, building and ensuring easy-to-use Web systems is becoming a core competency for business survival. Usability is defined as the effectiveness, efficiency, and satisfaction with which specific users can complete specific tasks in a particular environment. Three basic Web design principles, i.e., structural firmness, functional convenience, and presentational delight, were identified to help improve users’ online experience. Structural firmness relates primarily to the characteristics that influence the website security and performance. Functional convenience refers to the availability of convenient characteristics, such as a site’s ease of use and ease of navigation, that help users’ interaction with the interface. Presentational delight refers to the website characteristics that stimulate users’ senses. Usability engineering provides methods for measuring usability and for addressing usability issues. Heuristic evaluation by experts and user-centered testing are typically used to identify usability issues and to ensure satisfactory usability.

However, significant challenges exist, including 1) inaccurate problem identification due to the false alarms common in expert evaluation, 2) unrealistic evaluation of usability due to differences between the testing environment and the actual usage environment, and 3) increased cost due to the prolonged evolution and maintenance cycles typical of many Web applications. On the other hand, log data routinely kept at Web servers represent actual usage. Such data have been used for usage-based testing and quality assurance and also for understanding user behavior and guiding user interface design.

Server-side logs can be automatically generated by Web servers, with each entry corresponding to a user request. By analyzing these logs, Web workload was characterized and used to suggest performance enhancements for Internet Web servers. Because of the vastly uneven Web traffic, massive user population, and diverse usage environment, coverage-based testing is insufficient to ensure the quality of Web applications. Therefore, server-side logs have been used to construct Web usage models for usage-based Web testing or to automatically generate test cases accordingly to improve test efficiency.

1.3 LITERATURE SURVEY

WEB USABILITY PROBE: A TOOL FOR SUPPORTING REMOTE USABILITY EVALUATION OF WEB SITES

PUBLICATION: Human-Computer Interaction—INTERACT 2011. New York, NY, USA: Springer, 2011, pp. 349–357.

AUTHORS: T. Carta, F. Paternò, and V. F. D. Santana

EXPLANATION:

Usability evaluation of Web sites is still a difficult and time-consuming task, often performed manually. This paper presents a tool that supports remote usability evaluation of Web sites. The tool considers client-side data on user interactions and JavaScript events. In addition, it allows the definition of custom events, giving evaluators the flexibility to add specific events to be detected and considered in the evaluation. The tool supports evaluation of any Web site by exploiting a proxy-based architecture and enables the evaluator to perform a comparison between actual user behavior and an optimal sequence of actions.

SUPPORTING ACTIVITY MODELLING FROM ACTIVITY TRACES

PUBLICATION: Expert Syst., vol. 29, no. 3, pp. 261–275, 2012.

AUTHORS: O. L. Georgeon, A. Mille, T. Bellet, B. Mathern, and F. E. Ritter

EXPLANATION:

We present a new method and tool for activity modelling through qualitative sequential data analysis. In particular, we address the question of constructing a symbolic abstract representation of an activity from an activity trace. We use knowledge engineering techniques to help the analyst build an ontology of the activity, that is, a set of symbols and hierarchical semantics that supports the construction of activity models. The ontology construction is pragmatic, evolutionist, and driven by the analyst in accordance with their modelling goals and research questions. Our tool helps the analyst define transformation rules to process the raw trace into abstract traces based on the ontology. The analyst visualizes the abstract traces and iteratively tests the ontology, the transformation rules, and the visualization format to confirm the models of activity. With this tool and this method, we found innovative ways to represent a car-driving activity at different levels of abstraction from activity traces collected from an instrumented vehicle. As examples, we report two new strategies of lane changing on motorways that we have found and modelled with this approach.

TOOLS FOR REMOTE USABILITY EVALUATION OF WEB APPLICATIONS THROUGH BROWSER LOGS AND TASK MODELS

PUBLICATION: Behavior Res. Methods, Instrum., Comput., vol. 35, no. 3, pp. 369–378, 2003

AUTHORS: L. Paganelli and F. Paternò

EXPLANATION:

The dissemination of Web applications is extensive and still growing. The great penetration of Web sites raises a number of challenges for usability evaluators. Video-based analysis can be rather expensive and may provide limited results. In this article, we discuss what information can be provided by automatic tools able to process the information contained in browser logs and task models. To this end, we present a tool that can be used to compare log files of user behavior with the task model representing the actual Web site design, in order to identify where users’ interactions deviate from those envisioned by the system design.

CHAPTER 2

2.0 SYSTEM ANALYSIS

2.1 EXISTING SYSTEM:

Usability has long been addressed and discussed in previous studies; when people navigate the Web, they often encounter a number of usability issues. This is also due to the fact that Web surfers often decide on the spur of the moment what to do and whether to continue to navigate in a Web site. Usability evaluation is thus an important phase in the deployment of Web applications. For this purpose, automatic tools are very useful to gather a larger amount of usability data and support its analysis.

Remote evaluation implies that users and evaluators are separated in time and/or space. This is important in order to analyse users in their daily environments, and it decreases the costs of the evaluation by not requiring specific laboratories or asking users to travel. In addition, tools for remote Web usability evaluation should be sufficiently general that they can be used to analyse user behaviour even when various browsers, or applications developed using different toolkits, are used. We prefer logging on the client side in order to be able to capture any user-generated events, which can provide useful hints regarding possible usability problems.

Existing approaches have been used to support usability evaluation. An example was WebRemUsine, a tool for remote usability evaluation of Web applications through browser logs and task models. Propp and Forbrig have used task models to support usability evaluation of a different type of application: cooperative behaviour of people interacting in smart environments. In a different use of models, the authors discuss how task models can enhance visualization of the usability test log. In our case, we do not require the effort of developing models to apply our tool. We only require that the designer provide an example of optimal use associated with each of the relevant tasks. The tool will then compare the logs of actual use with the optimal log in order to identify deviations, which may indicate potential usability problems.

2.1.1 DISADVANTAGES:

An earlier Web navigation study used a logger to collect data from a user session test on a Web interface prototype running on a PDA simulator, in order to evaluate different types of Web navigation tools and identify the best one for small-display devices.

Users were asked to find the answer to specific questions using different types of navigation tools to move from one page to another. A database was used to store users’ actions, but only the answer given by the user to each specific question was logged. Moreover, every term searched for by the user by means of the internal search tool was stored separately.

Client-side data collection encounters various challenges: identifying the elements that users are interacting with, managing element identification when the page changes dynamically, and managing data logging when users move from one page to another, amongst others. The following are some of the solutions we adopted in order to deal with these issues.

2.2 PROPOSED SYSTEM:

We propose a new method to identify navigation related usability problems by comparing Web usage patterns extracted from server logs against anticipated usage represented in some cognitive user models (RQ2). Fig. 1 shows the architecture of our method. It includes three major modules: Usage Pattern Extraction, IUIP Modeling, and Usability Problem Identification. First, we extract actual navigation paths from server logs and discover patterns for some typical events. In parallel, we construct IUIP models for the same events. IUIP models are based on the cognition of user behavior and can represent anticipated paths for specific user-oriented tasks.

Our IUIP models are based on the cognitive models surveyed in Section II, particularly the ACT-R model. Due to the complexity of ACT-R model development and the low-level rule-based programming language it relies on, we constructed our own cognitive architecture and supporting tool based on the ideas from ACT-R. In general, user behavior patterns can be traced with a sequence of states and transitions. Each IUIP model consists of a number of states and transitions. For a particular goal, a sequence of related operation rules can be specified for a series of transitions. Our IUIP model specifies both the path and the benchmark interactive time (no more than a maximum time) for some specific states (pages). The benchmark time can first be specified based on general rules for common types of Web pages. Humans usually try to complete their tasks in the most efficient manner by attempting to maximize their returns while minimizing the cost.

Typically, experts and novices will have different task performance. Novices need to learn task-specific knowledge while performing the task, but experts can complete the task in the most efficient manner. Based on this cognitive mechanism, IUIP models need to be constructed separately for novices and experts. Our method is cost-effective; it is particularly valuable in two common situations, where an adequate number of actual users cannot be involved in testing and where cognitive experts are in short supply. Server logs in our method represent real users’ operations in natural working conditions, and our IUIP models, injected with human behavior cognition, represent part of the cognitive experts’ work. We are currently integrating these modeling and analysis tools into a tool suite that supports measurement, analysis, and overall quality improvement for Web applications.

2.2.1 ADVANTAGES:

1) Logical deviation calculation:

a) When the path choice anticipated by the IUIP model is available but not selected, a single deviation is counted.

b) Sum up all the above deviations over all the selected user transactions for each page.

2) Temporal deviation calculation:

a) When a user spends more time at a specific page than the benchmark specified for the corresponding state in the IUIP model, a single deviation is counted.

b) Sum up all the above deviations over all the selected user transactions for each page.

The successive pages related to furniture categories are grouped into a dashed box. The pages with deviations and the unanticipated follow-up pages below them are marked with solid rectangular boxes. Those unanticipated follow-up pages will not themselves be used for deviation calculations, to avoid double counting.

2.3 HARDWARE & SOFTWARE REQUIREMENTS:

2.3.1 HARDWARE REQUIREMENT:

  • Processor                               –    Pentium IV

  • Speed                                      –    1.1 GHz
    • RAM                                       –    256 MB (min)
    • Hard Disk                               –   20 GB
    • Floppy Drive                           –    1.44 MB
    • Key Board                              –    Standard Windows Keyboard
    • Mouse                                     –    Two or Three Button Mouse
    • Monitor                                   –    SVGA

 

2.3.2 SOFTWARE REQUIREMENTS:

  • Operating System                   :           Windows XP or Win7
  • Front End                                :           JAVA JDK 1.7
  • Back End                                :           MYSQL Server
  • Server                                      :           Apache Tomcat Server
  • Script                                       :           JSP Script
  • Document                               :           MS-Office 2007

CHAPTER 3

3.0 SYSTEM DESIGN:

Data Flow Diagram / Use Case Diagram / Flow Diagram:

  • The DFD is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on these data, and the output data generated by the system.
  • The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components. These components are the system process, the data used by the process, an external entity that interacts with the system and the information flows in the system.
  • DFD shows how the information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.
  • DFD is also known as bubble chart. A DFD may be used to represent a system at any level of abstraction. DFD may be partitioned into levels that represent increasing information flow and functional detail.

NOTATION:

SOURCE OR DESTINATION OF DATA:

External sources or destinations, which may be people or organizations or other entities

DATA SOURCE:

Here the data referenced by a process is stored and retrieved.

PROCESS:

People, procedures, or devices that produce data; the physical component is not identified.

DATA FLOW:

Data moves in a specific direction from an origin to a destination. The data flow is a “packet” of data.

MODELING RULES:

There are several common modeling rules when creating DFDs:

  1. All processes must have at least one data flow in and one data flow out.
  2. All processes should modify the incoming data, producing new forms of outgoing data.
  3. Each data store must be involved with at least one data flow.
  4. Each external entity must be involved with at least one data flow.
  5. A data flow must be attached to at least one process.


3.1 ARCHITECTURE DIAGRAM

3.2 DATAFLOW DIAGRAM

UML DIAGRAMS:

3.3 USE CASE DIAGRAM:

3.4 CLASS DIAGRAM:

3.5 SEQUENCE DIAGRAM:

3.6 ACTIVITY DIAGRAM:

CHAPTER 4

4.0 IMPLEMENTATION:

IUIP MODELS:

Our IUIP model specifies both the path and the benchmark interactive time (no more than a maximum time) for some specific states (pages). The benchmark time can first be specified based on general rules for common types of Web pages. For example, human factors guidelines specify an upper bound for the response time to mitigate the risk that users will lose interest in a website. Humans usually try to complete their tasks in the most efficient manner by attempting to maximize their returns while minimizing the cost. Typically, experts and novices will have different task performance: novices need to learn task-specific knowledge while performing the task, whereas experts can complete the task in the most efficient manner. Based on this cognitive mechanism, IUIP models need to be constructed individually for novices and experts by cognitive experts, utilizing their domain expertise and their knowledge of different users’ interactive behavior.

We can adapt the durations by performing iterative tests with different users. Diagrammatic notation methods and tools are often used to support interaction modeling and task performance evaluation. To support IUIP model construction and reuse, we used C++ and XML to develop our IUIP modeling tool based on the open-source visual diagramming software DIA. DIA allows users to draw customized diagrams, such as UML, data flow, and other diagrams. Existing shapes and lines in DIA form part of the graphic notations in our IUIP models; new ones can easily be added by writing simple XML files. The operations, operation rules, and computation rules can be embedded into the graphic notations with the XML schema we defined, to form our IUIP symbols. Currently, about 20 IUIP symbols have been created to represent typical Web interactions. The IUIP symbols used in subsequent examples are explained at the bottom of the corresponding figures. Cognitive experts can use our IUIP modeling tool to develop various IUIP models for different Web applications.

The actual users’ navigation trails we extracted from the aggregated trail tree are automatically compared against the corresponding IUIP models. This comparison yields a set of deviations between the two. We can identify common problems in actual users’ interaction with the Web application by focusing on deviations that occur frequently. Combined with expertise in product internals and contextual information, our results can also help identify the root causes of some usability problems in the Web design. Based on the logical choices made and the time spent by users at each page, the calculation of deviations between actual users’ usage patterns and the IUIP models can be divided into two parts, as illustrated by the code sketch following the list below:

1) Logical deviation calculation:

a) When the path choice anticipated by the IUIP model is available but not selected, a single deviation is counted.

b) Sum up all the above deviations over all the selected user transactions for each page.

2) Temporal deviation calculation:

a) When a user spends more time at a specific page than the benchmark specified for the corresponding state in the IUIP model, a single deviation is counted.

b) Sum up all the above deviations over all the selected user transactions for each page.
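As a rough illustration of these two counts, the Java sketch below walks one extracted user transaction against a simplified IUIP representation (one anticipated next page and one benchmark time per state). The data model and all names are our own simplification, not the actual tool's internal structures.

import java.util.List;
import java.util.Map;

class DeviationCounter {

    /** One page visit in a user transaction extracted from the server log. */
    static class PageVisit {
        final String page, nextPage;
        final long secondsOnPage;
        PageVisit(String page, String nextPage, long secondsOnPage) {
            this.page = page; this.nextPage = nextPage; this.secondsOnPage = secondsOnPage;
        }
    }

    /** Anticipated behaviour for one IUIP state (page). */
    static class IuipState {
        final String anticipatedNextPage;
        final long benchmarkSeconds;
        IuipState(String anticipatedNextPage, long benchmarkSeconds) {
            this.anticipatedNextPage = anticipatedNextPage; this.benchmarkSeconds = benchmarkSeconds;
        }
    }

    /** Counts pages where the anticipated path choice was available but not selected. */
    static int logicalDeviations(List<PageVisit> transaction, Map<String, IuipState> iuip) {
        int count = 0;
        for (PageVisit v : transaction) {
            IuipState s = iuip.get(v.page);
            if (s != null && !s.anticipatedNextPage.equals(v.nextPage)) count++;
        }
        return count;
    }

    /** Counts pages where the user spent more time than the IUIP benchmark. */
    static int temporalDeviations(List<PageVisit> transaction, Map<String, IuipState> iuip) {
        int count = 0;
        for (PageVisit v : transaction) {
            IuipState s = iuip.get(v.page);
            if (s != null && v.secondsOnPage > s.benchmarkSeconds) count++;
        }
        return count;
    }
}

Summing these two counts over all selected user transactions for each page gives the per-page deviation totals described above.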

The IUIP model for the task “First Selection” is shown on the top. The corresponding user Trail 7, a part of a trail tree extracted from log data, is presented under it. Each node in the tree is annotated with the number of users having reached that node across the same trail prefix. The successive pages related to furniture categories are grouped into a dashed box. The pages with deviations and the unanticipated follow-up pages below them are marked with solid rectangular boxes. Those unanticipated follow-up pages will not themselves be used for deviation calculations, to avoid double counting.

4.1 ALGORITHM

TRAIL TREE USAGE MINING ALGORITHM

The transactions identified from each user session form a collection of paths. We use the trie data structure to merge the paths along common prefixes. A trie, or prefix tree, is an ordered tree used to store an associative array where the keys are usually strings. All the descendants of a node have a common prefix of the string associated with that node, and the root is associated with the empty string. We adapted the trie algorithm to construct a tree structure that also captures user visit frequencies, which we call a trail tree. In a trail tree, a complete path from the root to a leaf node is called a trail.

The leaf nodes of the trail tree are also annotated with the trail names. The transaction paths extracted from the Web server log are shown in the table to its left, together with path occurrence frequencies. Paths 1, 4, and 5 have the common first node a; therefore, they were merged together. For the second node of this subtree, Paths 1 and 4 both accessed Page b; therefore, the two paths were combined at Node b. Finally, Paths 1 and 4 were merged into a single trail, Trail 1, although Path 1 terminates at Node e. By the same method, the other paths can be integrated into the trail tree. The number at each edge indicates the number of users reaching the next node across the same trail prefix.
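A minimal Java sketch of this trail-tree construction is given below, assuming transaction paths are provided as lists of page identifiers with their occurrence frequencies; the class and method names are ours.

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TrailTree {

    static class Node {
        final Map<String, Node> children = new LinkedHashMap<String, Node>();
        int visits = 0;   // number of users who reached this node along the same prefix
    }

    private final Node root = new Node();   // root corresponds to the empty prefix

    /** Merge one transaction path (e.g., ["a", "b", "e"]) into the tree. */
    void addPath(List<String> path, int frequency) {
        Node current = root;
        for (String page : path) {
            Node child = current.children.get(page);
            if (child == null) {
                child = new Node();
                current.children.put(page, child);
            }
            child.visits += frequency;       // paths sharing a prefix accumulate here
            current = child;
        }
    }

    /** Enumerate complete root-to-leaf trails for later pattern mining. */
    void collectTrails(Node node, List<String> prefix, List<List<String>> trails) {
        if (node.children.isEmpty()) { trails.add(new ArrayList<String>(prefix)); return; }
        for (Map.Entry<String, Node> e : node.children.entrySet()) {
            prefix.add(e.getKey());
            collectTrails(e.getValue(), prefix, trails);
            prefix.remove(prefix.size() - 1);
        }
    }

    Node root() { return root; }
}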

Based on the aggregated trail tree, further mining can be performed to discover “interesting” patterns. Typically, good mining results require close interaction with human experts to specify the characteristics that make navigation patterns interesting. In our method, we focus on the paths which are used by a sufficient number of users to finish a specific task. The paths can initially be prioritized by their usage frequencies and selected using a threshold specified by the experts. Application-domain knowledge and contextual information, such as the criticality of specific tasks, user privileges, etc., can also be used to identify “interesting” patterns. For the FG 2009 website, we extracted 30 trails each for Tasks 1, 2, and 3, and 5 trails for Task 4.
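The threshold-based selection of interesting trails could look like the following small sketch; the Trail holder and the threshold value are our assumptions.

import java.util.ArrayList;
import java.util.List;

class TrailSelector {

    static class Trail {
        final List<String> pages;
        final int frequency;          // number of users who completed this trail
        Trail(List<String> pages, int frequency) { this.pages = pages; this.frequency = frequency; }
    }

    /** Keeps only trails whose usage frequency reaches the expert-specified threshold. */
    static List<Trail> interestingTrails(List<Trail> trails, int minFrequency) {
        List<Trail> selected = new ArrayList<Trail>();
        for (Trail t : trails) {
            if (t.frequency >= minFrequency) selected.add(t);
        }
        return selected;
    }
}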

4.2 MODULES:

COGNITIVE USER MODEL:

WEB SERVER USER LOG:

USAGE PATTERN EXTRACTION:

USABILITY MEASURING:

4.3 MODULE DESCRIPTION:

COGNITIVE USER MODEL:

There is a growing need to incorporate insights from cognitive science about the mechanisms, strengths, and limits of human perception and cognition in order to understand the human factors involved in user interface design. Understanding the various constraints on cognition (e.g., system complexity) and the mechanisms and patterns of strategy selection can help human factors engineers develop solutions and apply technologies that are better suited to human abilities.

Commonly used cognitive models include GOMS, EPIC, and ACT-R. The GOMS model consists of Goals, Operators, Methods, and Selection rules. As a high-level architecture, GOMS describes behavior and defines interactions as a static sequence of human actions. As low-level cognitive architectures, EPIC (Executive-Process/Interactive Control) and ACT-R (Adaptive Control of Thought-Rational) can be taken as specific implementations of the high-level architecture.

They provide detailed information about how to simulate human processing and cognition. An important feature of these low-level cognitive architectures is that they are all implemented as computer programming systems, so that cognitive models may be specified, executed, and their outputs (e.g., error rates and response latencies) compared with human performance data.

WEB SERVER USER LOG:

Server logs have also been used by organizations to learn about the usability of their products. For example, search queries can be extracted from server logs to discover user information needs for usability task analysis. There are many advantages to using server logs for usability studies. Logs can provide insight into real users performing actual tasks in natural working conditions versus in an artificial setting of a lab. Logs also represent the activities of many users over a long period of time versus the small sample of users in a short time span in typical lab testing. Data preparation techniques and algorithms can be used to process the raw Web server logs, and then mining can be performed to discover users’ visitation patterns for further usability analysis.

For example, organizations can mine server-side logs to predict users’ behavior and context in order to better satisfy users. Revisitation patterns can be discovered by mining server logs to develop guidelines for browser history mechanisms that reduce users’ cognitive and physical effort. Client-side logs can capture accurate and comprehensive usage data for usability analysis, because they allow low-level user interaction events, such as keystrokes and mouse movements, to be recorded.

For example, using these client-side data, the evaluator can accurately measure the time spent on particular tasks or pages, as well as study the use of the “back” button and user click streams. Such data are often used with task-based approaches and models for usability analysis by comparing discrepancies between the designer’s anticipation and a user’s actual behavior. However, the evaluator must program the UI, modify Web pages, or use an instrumented browser with plug-in tools or a special proxy server to collect such data.

USAGE PATTERN EXTRACTION:

Web server logs are our data source. Each entry in a log contains the IP address of the originating host, the timestamp, the requested Web page, the referrer, the user agent and other data. Typically, the raw data need to be preprocessed and converted into user sessions and transactions to extract usage patterns.

The data preparation and preprocessing include the following domain-dependent tasks.

1) Data cleaning: This task is usually site-specific and involves removing extraneous references to style files, graphics, or sound files that may not be important for the purpose of our analysis.

2) User identification: The remaining entries are grouped by individual users. Because no user authentication and cookie information is available in most server logs, we used the combination of IP, user agent, and referrer fields to identify unique users.

3) User session identification: The activity record of each user is segmented into sessions, with each representing a single visit to a site. Without additional authentication information from users and without mechanisms such as embedded session IDs, one must rely on heuristics for session identification. For example, we set an elapsed time of 15 min between two successive page accesses as a threshold to partition a user activity record into different sessions.

4) Path completion: Client or proxy side caching can often result in missing access references to some pages that have been cached. These missing references can often be heuristically inferred from the knowledge of site topology and referrer information, along with temporal information from server logs.

These tasks are time consuming and computationally intensive, but essential to the successful discovery of usage patterns.

We developed a tool to automate all these tasks except part of path completion. For path completion, the designers or developers first need to manually discover the rules for missing references based on site structure, referrer, and other heuristic information. Once the repeated patterns are identified, this work can be carried out automatically. Our tool can work with server logs of different Web applications by modifying the related parameters in the configuration file. The processed log data are stored in a database for further use.
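A minimal Java sketch of the user- and session-identification steps described above is given below; the (IP, user agent) key and the 15-minute gap follow the heuristics stated in the text, while the entry format and all field names are our own assumptions.

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class Sessionizer {

    static final long SESSION_GAP_MS = 15 * 60 * 1000L;   // 15-minute threshold

    static class LogEntry {
        final String ip, userAgent, page;
        final long timestampMs;
        LogEntry(String ip, String userAgent, String page, long timestampMs) {
            this.ip = ip; this.userAgent = userAgent; this.page = page; this.timestampMs = timestampMs;
        }
    }

    /** Groups cleaned log entries (assumed sorted by time) into per-user sessions. */
    static Map<String, List<List<LogEntry>>> sessionize(List<LogEntry> entries) {
        Map<String, List<List<LogEntry>>> sessionsByUser = new LinkedHashMap<String, List<List<LogEntry>>>();
        Map<String, LogEntry> lastSeen = new LinkedHashMap<String, LogEntry>();
        for (LogEntry e : entries) {
            String userKey = e.ip + "|" + e.userAgent;     // heuristic user identity
            List<List<LogEntry>> sessions = sessionsByUser.get(userKey);
            if (sessions == null) {
                sessions = new ArrayList<List<LogEntry>>();
                sessionsByUser.put(userKey, sessions);
            }
            LogEntry last = lastSeen.get(userKey);
            if (last == null || e.timestampMs - last.timestampMs > SESSION_GAP_MS) {
                sessions.add(new ArrayList<LogEntry>());   // start a new session
            }
            sessions.get(sessions.size() - 1).add(e);
            lastSeen.put(userKey, e);
        }
        return sessionsByUser;
    }
}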

USABILITY MEASURING:

To obtain specific results from applying our method to the FG 2009 website, we collected Web server access log data for the first three days after its deployment. The server log includes over 500 entries. After preprocessing the raw log data using our tool, we identified 58 unique users and 81 sessions. Then, we constructed four event models for four typical tasks. We extracted 95 trails for these tasks. Meanwhile, a designer with three years of GUI design experience and an expert with five years of experience in human factors practice for the Web constructed four IUIP models for the same tasks based on their cognition of users’ interactive behavior. By checking the extracted usage patterns against the four IUIP models, we obtained the logical and temporal deviations shown in Tables I and II and identified 17 usability issues or potential usability problems. Some usability issues were identified by both logical and temporal deviation analyses. Next, we further analyze these deviations for usability problem identification and improvement.

In Table I, 16 deviations took place on the page “index.php.” The unanticipated follow-up page is the page “login.php,” followed by the page “index.php?f=t” (login failure). Further reviewing the index page, we found that the page design is too simplistic: no instruction was provided to help users log in or register. We inferred that some users with limited online shopping experience were trying to use their regular email addresses and passwords to log in to the FG 2009 website.

We also found some structure design issues. For example, we observed that some users repeatedly visited the page “Selection Rules.” It is likely that when the users were not permitted to select any furniture in some categories (the FG website limited each user to select one piece of furniture under each category), they had to go to the page “Selection Rules” to find the reasons. To reduce these redundant operations and improve user experience, the help function for selection rules should be redesigned to make it more convenient for users to consult.

CHAPTER 5

5.0 SYSTEM STUDY:

5.1 FEASIBILITY STUDY:

The feasibility of the project is analyzed in this phase and business proposal is put forth with a very general plan for the project and some cost estimates. During system analysis the feasibility study of the proposed system is to be carried out. This is to ensure that the proposed system is not a burden to the company.  For feasibility analysis, some understanding of the major requirements for the system is essential.

Three key considerations involved in the feasibility analysis are 

  • ECONOMICAL FEASIBILITY
  • TECHNICAL FEASIBILITY
  • SOCIAL FEASIBILITY

5.1.1 ECONOMICAL FEASIBILITY:     

This study is carried out to check the economic impact that the system will have on the organization. The amount of funds that the company can pour into the research and development of the system is limited. The expenditures must be justified. Thus, the developed system is well within the budget, and this was achieved because most of the technologies used are freely available. Only the customized products had to be purchased.

 

5.1.2 TECHNICAL FEASIBILITY:

This study is carried out to check the technical feasibility, that is, the technical requirements of the system. Any system developed must not place a high demand on the available technical resources; otherwise, high demands would be placed on the client. The developed system must have modest requirements, as only minimal or no changes are required for implementing this system.

5.1.3 SOCIAL FEASIBILITY:  

This aspect of the study checks the level of acceptance of the system by the user. This includes the process of training the user to use the system efficiently. The user must not feel threatened by the system, but must instead accept it as a necessity. The level of acceptance by the users depends on the methods that are employed to educate users about the system and to make them familiar with it. Their confidence must be raised so that they can also offer constructive criticism, which is welcomed, as they are the final users of the system.

5.2 SYSTEM TESTING:

Testing is the process of checking whether the developed system works according to the original objectives and requirements. It is a set of activities that can be planned in advance and conducted systematically. Testing is vital to the success of the system. System testing makes the logical assumption that if all the parts of the system are correct, the overall goal will be successfully achieved. Inadequate testing, or no testing at all, leads to errors that may not appear until many months later.

This creates two problems: the time lag between the cause and the appearance of the problem, and the effect of system errors on the files and records within the system. A small system error can conceivably explode into a much larger problem. Effective testing early in the process translates directly into long-term cost savings from a reduced number of errors. Another reason for system testing is its utility as a user-oriented vehicle before implementation. The best programs are worthless if they do not produce the correct outputs.

5.2.1 UNIT TESTING:

Description: Test for application window properties.
Expected result: All the properties of the windows are to be properly aligned and displayed.

Description: Test for mouse operations.
Expected result: All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions.

A program represents the logical elements of a system. For a program to run satisfactorily, it must compile and process test data correctly and tie in properly with other programs. Achieving an error-free program is the responsibility of the programmer. Program testing checks for two types of errors: syntax and logical. A syntax error is a program statement that violates one or more rules of the language in which it is written. An improperly defined field dimension or omitted keywords are common syntax errors; these errors are shown through error messages generated by the computer. For logic errors, the programmer must examine the output carefully.
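For instance, a simple unit test compares the program's actual output with a desk-calculated expected value and exposes a logic error that the compiler cannot detect. The sketch below is illustrative only; it assumes JUnit 4 is available, and the class under test and its values are invented for the example.

import org.junit.Test;
import static org.junit.Assert.assertEquals;

// Hypothetical class under test: applies a 10% discount to a price.
class DiscountCalculator {
    static double apply(double price) {
        // A logic error (e.g., returning price * 0.1 instead of price * 0.9)
        // would compile without any syntax error but fail the test below.
        return price * 0.9;
    }
}

public class DiscountCalculatorTest {
    @Test
    public void tenPercentDiscountIsApplied() {
        // Compare actual output against the expected (desk-calculated) value.
        assertEquals(90.0, DiscountCalculator.apply(100.0), 0.0001);
    }
}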

5.2.2 FUNCTIONAL TESTING:

Functional testing of an application is used to prove that the application delivers correct results, using enough inputs to give an adequate level of confidence that it will work correctly for all sets of inputs. The functional testing will need to prove that the application works for each client type and that the personalization functions work correctly. When a program is tested, the actual output is compared with the expected output. When there is a discrepancy, the sequence of instructions must be traced to determine the problem. The process is facilitated by breaking the program into self-contained portions, each of which can be checked at certain key points. The idea is to compare program values against desk-calculated values to isolate the problems.

Description: Test for all modules.
Expected result: All peers should communicate in the group.

Description: Test for various peers in a distributed network framework as it displays all users available in the group.
Expected result: The result after execution should give the accurate result.


5.2.3 NON-FUNCTIONAL TESTING:

Non-functional software testing encompasses a rich spectrum of testing strategies, describing the expected results for every test case. It uses symbolic analysis techniques. This testing is used to check that an application will work in the operational environment. Non-functional testing includes:

  • Load testing
  • Performance testing
  • Usability testing
  • Reliability testing
  • Security testing

5.2.4 LOAD TESTING:

An important tool for implementing system tests is a load generator. A load generator is essential for testing quality requirements such as performance and stress. A load can be a real load; that is, the system can be subjected to real usage by having actual telephone users connected to it, who generate test input data for the system test.

Description: It is necessary to ascertain that the application behaves correctly under load when a 'Server busy' response is received.
Expected result: Another active node should be designated as the server.


5.2.5 PERFORMANCE TESTING:

Performance tests are utilized to determine the widely defined performance of the software system, such as the execution time associated with various parts of the code, response time, and device utilization. The intent of this testing is to identify weak points of the software system and quantify its shortcomings.

Description: This is required to assure that the application performs adequately, having the capability to handle many peers, delivering its results in the expected time and using an acceptable level of resources; it is an aspect of operational management.
Expected result: Should handle large input values and produce accurate results in the expected time.


5.2.6 RELIABILITY TESTING:

Software reliability is the ability of a system or component to perform its required functions under stated conditions for a specified period of time, and it is ensured in this testing. Reliability can be expressed as the ability of the software to reveal defects under testing conditions, according to the specified requirements. It is the probability that a software system will operate without failure under given conditions for a given time interval, and it focuses on the behavior of the software element. It forms a part of software quality control.

Description: This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in providing the application.
Expected result: In case of failure of the server, an alternate server should take over the job.


5.2.7 SECURITY TESTING:

Security testing evaluates system characteristics that relate to the availability, integrity and confidentiality of the system data and services. Users/Clients should be encouraged to make sure their security needs are very clearly known at requirements time, so that the security issues can be addressed by the designers and testers.

Description: Check that the user identification is authenticated.
Expected result: In case of failure, it should not be connected in the framework.

Description: Check whether group keys in a tree are shared by all peers.
Expected result: The peers should know the group key in the same group.


5.2.8 WHITE BOX TESTING:

White box testing, sometimes called glass-box testing, is a test case design method that uses the control structure of the procedural design to derive test cases. Using the white box testing method, the software engineer can derive test cases that exercise the program's internal logic. White box testing focuses on the inner structure of the software to be tested.

Description: Exercise all logical decisions on their true and false sides.
Expected result: All the logical decisions must be valid.

Description: Execute all loops at their boundaries and within their operational bounds.
Expected result: All the loops must be finite.

Description: Exercise internal data structures to ensure their validity.
Expected result: All the data structures must be valid.


5.2.9 BLACK BOX TESTING:

Black box testing, also called behavioral testing, focuses on the functional requirements of the software. That is, black box testing enables the software engineer to derive sets of input conditions that will fully exercise all functional requirements for a program. Black box testing is not an alternative to white box techniques. Rather, it is a complementary approach that is likely to uncover a different class of errors than white box methods. Black box testing attempts to find errors by focusing on the inputs, outputs, and principal functions of a software module. The starting point of black box testing is either a specification or the code. The contents of the box are hidden, and the stimulated software should produce the desired results.

Description: Check for incorrect or missing functions.
Expected result: All the functions must be valid.

Description: Check for interface errors.
Expected result: The entire interface must function normally.

Description: Check for errors in data structures or external database access.
Expected result: Database update and retrieval must be performed correctly.

Description: Check for initialization and termination errors.
Expected result: All the functions and data structures must be initialized properly and terminated normally.

All the above system testing strategies are carried out during development, as the documentation and institutionalization of the proposed goals and related policies are essential.

CHAPTER 6

6.0 SOFTWARE DESCRIPTION:

 

6.1 JAVA TECHNOLOGY:

Java technology is both a programming language and a platform.

 

The Java Programming Language

 

The Java programming language is a high-level language that can be characterized by all of the following buzzwords:

  • Simple
    • Architecture neutral
    • Object oriented
    • Portable
    • Distributed     
    • High performance
    • Interpreted     
    • Multithreaded
    • Robust
    • Dynamic
    • Secure     

With most programming languages, you either compile or interpret a program so that you can run it on your computer. The Java programming language is unusual in that a program is both compiled and interpreted. With the compiler, first you translate a program into an intermediate language called Java byte codes —the platform-independent codes interpreted by the interpreter on the Java platform. The interpreter parses and runs each Java byte code instruction on the computer. Compilation happens just once; interpretation occurs each time the program is executed. The following figure illustrates how this works.

You can think of Java byte codes as the machine code instructions for the Java Virtual Machine (Java VM). Every Java interpreter, whether it’s a development tool or a Web browser that can run applets, is an implementation of the Java VM. Java byte codes help make “write once, run anywhere” possible. You can compile your program into byte codes on any platform that has a Java compiler. The byte codes can then be run on any implementation of the Java VM. That means that as long as a computer has a Java VM, the same program written in the Java programming language can run on Windows 2000, a Solaris workstation, or on an iMac.
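As a minimal illustration of this compile-once, run-anywhere cycle, the small program below is compiled with "javac HelloWorld.java" into byte codes and then executed on any Java VM with "java HelloWorld"; the class name and message are arbitrary examples.

// HelloWorld.java -- compiled once into platform-independent byte codes
// (HelloWorld.class), then interpreted by the Java VM on any platform.
public class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello, write once, run anywhere!");
    }
}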

6.2 THE JAVA PLATFORM:

A platform is the hardware or software environment in which a program runs. We’ve already mentioned some of the most popular platforms like Windows 2000, Linux, Solaris, and MacOS. Most platforms can be described as a combination of the operating system and hardware. The Java platform differs from most other platforms in that it’s a software-only platform that runs on top of other hardware-based platforms.

The Java platform has two components:

  • The Java Virtual Machine (Java VM)
  • The Java Application Programming Interface (Java API)

You’ve already been introduced to the Java VM. It’s the base for the Java platform and is ported onto various hardware-based platforms.

The Java API is a large collection of ready-made software components that provide many useful capabilities, such as graphical user interface (GUI) widgets. The Java API is grouped into libraries of related classes and interfaces; these libraries are known as packages. The next section, "What Can Java Technology Do?", highlights what functionality some of the packages in the Java API provide.

The following figure depicts a program that’s running on the Java platform. As the figure shows, the Java API and the virtual machine insulate the program from the hardware.

Native code is code that, after compilation, runs on a specific hardware platform. As a platform-independent environment, the Java platform can be a bit slower than native code. However, smart compilers, well-tuned interpreters, and just-in-time byte code compilers can bring performance close to that of native code without threatening portability.

6.3 WHAT CAN JAVA TECHNOLOGY DO?

The most common types of programs written in the Java programming language are applets and applications. If you’ve surfed the Web, you’re probably already familiar with applets. An applet is a program that adheres to certain conventions that allow it to run within a Java-enabled browser.

However, the Java programming language is not just for writing cute, entertaining applets for the Web. The general-purpose, high-level Java programming language is also a powerful software platform. Using the generous API, you can write many types of programs.

An application is a standalone program that runs directly on the Java platform. A special kind of application known as a server serves and supports clients on a network. Examples of servers are Web servers, proxy servers, mail servers, and print servers. Another specialized program is a servlet.

A servlet can almost be thought of as an applet that runs on the server side. Java Servlets are a popular choice for building interactive web applications, replacing the use of CGI scripts. Servlets are similar to applets in that they are runtime extensions of applications. Instead of working in browsers, though, servlets run within Java Web servers, configuring or tailoring the server.
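As a rough sketch of the idea (assuming the standard javax.servlet API provided by a Java Web server; the class name and message are illustrative), a minimal servlet answers each HTTP GET request on the server side:

import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Runs inside a Java Web server rather than a browser and generates
// a small HTML response for every incoming GET request.
public class HelloServlet extends HttpServlet {
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws IOException {
        response.setContentType("text/html");
        response.getWriter().println("<h1>Hello from the server side</h1>");
    }
}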

How does the API support all these kinds of programs? It does so with packages of software components that provide a wide range of functionality. Every full implementation of the Java platform gives you the following features:

  • The essentials: Objects, strings, threads, numbers, input and output, data structures, system properties, date and time, and so on.
  • Applets: The set of conventions used by applets.
  • Networking: URLs, TCP (Transmission Control Protocol), UDP (User Datagram Protocol) sockets, and IP (Internet Protocol) addresses.
  • Internationalization: Help for writing programs that can be localized for users worldwide. Programs can automatically adapt to specific locales and be displayed in the appropriate language.
  • Security: Both low level and high level, including electronic signatures, public and private key management, access control, and certificates.
  • Software components: Known as JavaBeansTM, can plug into existing component architectures.
  • Object serialization: Allows lightweight persistence and communication via Remote Method Invocation (RMI).
  • Java Database Connectivity (JDBCTM): Provides uniform access to a wide range of relational databases.

The Java platform also has APIs for 2D and 3D graphics, accessibility, servers, collaboration, telephony, speech, animation, and more. The following figure depicts what is included in the Java 2 SDK.

 

6.4 HOW WILL JAVA TECHNOLOGY CHANGE MY LIFE?

We can’t promise you fame, fortune, or even a job if you learn the Java programming language. Still, it is likely to make your programs better and require less effort than other languages. We believe that Java technology will help you do the following:

  • Get started quickly: Although the Java programming language is a powerful object-oriented language, it’s easy to learn, especially for programmers already familiar with C or C++.
  • Write less code: Comparisons of program metrics (class counts, method counts, and so on) suggest that a program written in the Java programming language can be four times smaller than the same program in C++.
  • Write better code: The Java programming language encourages good coding practices, and its garbage collection helps you avoid memory leaks. Its object orientation, its JavaBeans component architecture, and its wide-ranging, easily extendible API let you reuse other people’s tested code and introduce fewer bugs.
  • Develop programs more quickly: Your development time may be as much as twice as fast as writing the same program in C++. Why? You write fewer lines of code, and Java is a simpler programming language than C++.
  • Avoid platform dependencies with 100% Pure Java: You can keep your program portable by avoiding the use of libraries written in other languages. The 100% Pure JavaTM Product Certification Program has a repository of historical process manuals, white papers, brochures, and similar materials online.
  • Write once, run anywhere: Because 100% Pure Java programs are compiled into machine-independent byte codes, they run consistently on any Java platform.
  • Distribute software more easily: You can upgrade applets easily from a central server. Applets take advantage of the feature of allowing new classes to be loaded “on the fly,” without recompiling the entire program.

 

6.5 ODBC:

 

Microsoft Open Database Connectivity (ODBC) is a standard programming interface for application developers and database systems providers. Before ODBC became a de facto standard for Windows programs to interface with database systems, programmers had to use proprietary languages for each database they wanted to connect to. Now, ODBC has made the choice of the database system almost irrelevant from a coding perspective, which is as it should be. Application developers have much more important things to worry about than the syntax that is needed to port their program from one database to another when business needs suddenly change.

Through the ODBC Administrator in Control Panel, you can specify the particular database that is associated with a data source that an ODBC application program is written to use. Think of an ODBC data source as a door with a name on it. Each door will lead you to a particular database. For example, the data source named Sales Figures might be a SQL Server database, whereas the Accounts Payable data source could refer to an Access database. The physical database referred to by a data source can reside anywhere on the LAN.

The ODBC system files are not installed on your system by Windows 95. Rather, they are installed when you set up a separate database application, such as SQL Server Client or Visual Basic 4.0. When the ODBC icon is installed in Control Panel, it uses a file called ODBCINST.DLL. It is also possible to administer your ODBC data sources through a stand-alone program called ODBCADM.EXE. There is a 16-bit and a 32-bit version of this program, and each maintains a separate list of ODBC data sources.

From a programming perspective, the beauty of ODBC is that the application can be written to use the same set of function calls to interface with any data source, regardless of the database vendor. The source code of the application doesn’t change whether it talks to Oracle or SQL Server. We only mention these two as an example. There are ODBC drivers available for several dozen popular database systems. Even Excel spreadsheets and plain text files can be turned into data sources. The operating system uses the Registry information written by ODBC Administrator to determine which low-level ODBC drivers are needed to talk to the data source (such as the interface to Oracle or SQL Server). The loading of the ODBC drivers is transparent to the ODBC application program. In a client/server environment, the ODBC API even handles many of the network issues for the application programmer.

The advantages of this scheme are so numerous that you are probably thinking there must be some catch. The only disadvantage of ODBC is that it isn’t as efficient as talking directly to the native database interface. ODBC has had many detractors make the charge that it is too slow. Microsoft has always claimed that the critical factor in performance is the quality of the driver software that is used. In our humble opinion, this is true. The availability of good ODBC drivers has improved a great deal recently. And anyway, the criticism about performance is somewhat analogous to those who said that compilers would never match the speed of pure assembly language. Maybe not, but the compiler (or ODBC) gives you the opportunity to write cleaner programs, which means you finish sooner. Meanwhile, computers get faster every year.

6.6 JDBC:

In an effort to set an independent database standard API for Java; Sun Microsystems developed Java Database Connectivity, or JDBC. JDBC offers a generic SQL database access mechanism that provides a consistent interface to a variety of RDBMSs. This consistent interface is achieved through the use of “plug-in” database connectivity modules, or drivers. If a database vendor wishes to have JDBC support, he or she must provide the driver for each platform that the database and Java run on.

To gain a wider acceptance of JDBC, Sun based JDBC’s framework on ODBC. As you discovered earlier in this chapter, ODBC has widespread support on a variety of platforms. Basing JDBC on ODBC will allow vendors to bring JDBC drivers to market much faster than developing a completely new connectivity solution.

JDBC was announced in March of 1996. It was released for a 90 day public review that ended June 8, 1996. Because of user input, the final JDBC v1.0 specification was released soon after.

The remainder of this section will cover enough information about JDBC for you to know what it is about and how to use it effectively. This is by no means a complete overview of JDBC. That would fill an entire book.
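To give a flavor of how it is used, the sketch below opens a connection through a registered driver, runs a query, and walks the result set. It is illustrative only: the data source name "cacheDSN" and the table and column names are assumptions, and the legacy JDBC-ODBC bridge driver is shown simply because this project keeps its cache table in an MS Access database reached through ODBC.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class JdbcDemo {
    public static void main(String[] args) throws Exception {
        // Load the (legacy) JDBC-ODBC bridge driver; any vendor's JDBC driver is used the same way.
        Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");

        // "cacheDSN" is an assumed ODBC data source name configured in the ODBC Administrator.
        try (Connection con = DriverManager.getConnection("jdbc:odbc:cacheDSN");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT file_id, holder FROM cache_table")) {
            while (rs.next()) {
                System.out.println(rs.getString("file_id") + " -> " + rs.getString("holder"));
            }
        }
    }
}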

 

6.7 JDBC Goals:

Few software packages are designed without goals in mind. JDBC is no exception: its many design goals drove the development of the API. These goals, in conjunction with early reviewer feedback, have finalized the JDBC class library into a solid framework for building database applications in Java.

The goals that were set for JDBC are important. They will give you some insight as to why certain classes and functionalities behave the way they do. The eight design goals for JDBC are as follows:

SQL Level API

The designers felt that their main goal was to define a SQL interface for Java. Although not the lowest database interface level possible, it is at a low enough level for higher-level tools and APIs to be created. Conversely, it is at a high enough level for application programmers to use it confidently. Attaining this goal allows for future tool vendors to “generate” JDBC code and to hide many of JDBC’s complexities from the end user.

SQL Conformance

SQL syntax varies as you move from database vendor to database vendor. In an effort to support a wide variety of vendors, JDBC will allow any query statement to be passed through it to the underlying database driver. This allows the connectivity module to handle non-standard functionality in a manner that is suitable for its users.

JDBC must be implementable on top of common database interfaces

The JDBC SQL API must “sit” on top of other common SQL level APIs. This goal allows JDBC to use existing ODBC level drivers by the use of a software interface. This interface would translate JDBC calls to ODBC and vice versa.

Provide a Java interface that is consistent with the rest of the Java system

Because of Java’s acceptance in the user community thus far, the designers feel that they should not stray from the current design of the core Java system.

Keep it simple

This goal probably appears in all software design goal listings. JDBC is no exception. Sun felt that the design of JDBC should be very simple, allowing for only one method of completing a task per mechanism. Allowing duplicate functionality only serves to confuse the users of the API.

Use strong, static typing wherever possible

Strong typing allows more error checking to be done at compile time; also, fewer errors appear at runtime.

Keep the common cases simple

Because more often than not, the usual SQL calls used by the programmer are simple SELECT’s, INSERT’s, DELETE’s and UPDATE’s, these queries should be simple to perform with JDBC. However, more complex SQL statements should also be possible.
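As a brief, hedged illustration of keeping the common cases simple, the usual INSERT and SELECT statements reduce to a few PreparedStatement calls; the table "files" and its columns are invented for the example.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class CommonQueries {
    // Insert one row; the table and column names are assumptions.
    static void addFile(Connection con, String name, int size) throws Exception {
        try (PreparedStatement ps =
                 con.prepareStatement("INSERT INTO files (name, size) VALUES (?, ?)")) {
            ps.setString(1, name);
            ps.setInt(2, size);
            ps.executeUpdate();
        }
    }

    // Parameterized SELECT returning the size of a named file, or -1 if absent.
    static int fileSize(Connection con, String name) throws Exception {
        try (PreparedStatement ps =
                 con.prepareStatement("SELECT size FROM files WHERE name = ?")) {
            ps.setString(1, name);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getInt("size") : -1;
            }
        }
    }
}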

Finally, we decided to proceed with the implementation using Java Networking.

For dynamically updating the cache table, we use an MS Access database.

Java has two things: a programming language and a platform.

Java is a high-level programming language that is all of the following:

  • Simple
  • Architecture-neutral
  • Object-oriented
  • Portable
  • Distributed
  • High-performance
  • Interpreted
  • Multithreaded
  • Robust
  • Dynamic
  • Secure

Java is also unusual in that each Java program is both compiled and interpreted. With a compiler, you translate a Java program into an intermediate language called Java byte codes; this platform-independent code is then parsed and run by the interpreter on the computer.

Compilation happens just once; interpretation occurs each time the program is executed. The figure illustrates how this works.

[Figure: Java Program → Compiler → Interpreter → My Program]

6.8 NETWORKING TCP/IP STACK:

The TCP/IP stack is shorter than the OSI one:

TCP is a connection-oriented protocol; UDP (User Datagram Protocol) is a connectionless protocol.

IP datagram’s:

The IP layer provides a connectionless and unreliable delivery system. It considers each datagram independently of the others. Any association between datagrams must be supplied by the higher layers. The IP layer supplies a checksum that includes its own header. The header includes the source and destination addresses. The IP layer handles routing through an Internet. It is also responsible for breaking up large datagrams into smaller ones for transmission and reassembling them at the other end.

UDP:

UDP is also connectionless and unreliable. What it adds to IP is a checksum for the contents of the datagram and port numbers. These are used to give a client/server model – see later.
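As a small sketch in Java (the implementation language used in this project; the destination host, port 4000, and message are arbitrary examples), sending a UDP datagram amounts to wrapping bytes in a packet addressed to a host and port:

import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;

public class UdpSend {
    public static void main(String[] args) throws Exception {
        byte[] data = "hello".getBytes();
        // Address the datagram to an example host and port.
        DatagramPacket packet =
            new DatagramPacket(data, data.length, InetAddress.getByName("localhost"), 4000);
        DatagramSocket socket = new DatagramSocket();
        socket.send(packet);   // connectionless and unreliable: no delivery guarantee
        socket.close();
    }
}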

TCP:

TCP supplies logic to give a reliable connection-oriented protocol above IP. It provides a virtual circuit that two processes can use to communicate.

Internet addresses

In order to use a service, you must be able to find it. The Internet uses an address scheme for machines so that they can be located. The address is a 32 bit integer which gives the IP address.

Network address:

Class A uses 8 bits for the network address with 24 bits left over for other addressing. Class B uses 16 bit network addressing. Class C uses 24 bit network addressing and class D uses all 32.

Subnet address:

Internally, the UNIX network is divided into sub networks. Building 11 is currently on one sub network and uses 10-bit addressing, allowing 1024 different hosts.

Host address:

8 bits are finally used for host addresses within our subnet. This places a limit of 256 machines that can be on the subnet.

Total address:

The 32 bit address is usually written as 4 integers separated by dots.
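For example (a minimal sketch; the host name is arbitrary), Java's InetAddress class resolves a name and prints the 32-bit address in exactly this dotted form:

import java.net.InetAddress;

public class AddressDemo {
    public static void main(String[] args) throws Exception {
        // Resolve a host name to its IPv4 address; "localhost" is an arbitrary example.
        InetAddress addr = InetAddress.getByName("localhost");
        // Prints the address as four integers separated by dots, e.g. 127.0.0.1.
        System.out.println(addr.getHostAddress());
    }
}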

Port addresses

A service exists on a host, and is identified by its port. This is a 16 bit number. To send a message to a server, you send it to the port for that service of the host that it is running on. This is not location transparency! Certain of these ports are “well known”.

Sockets:

A socket is a data structure maintained by the system to handle network connections. A socket is created using the call socket. It returns an integer that is like a file descriptor. In fact, under Windows, this handle can be used with Read File and Write File functions.

#include <sys/types.h>
#include <sys/socket.h>
int socket(int family, int type, int protocol);

Here “family” will be AF_INET for IP communications, protocol will be zero, and type will depend on whether TCP or UDP is used. Two processes wishing to communicate over a network create a socket each. These are similar to two ends of a pipe – but the actual pipe does not yet exist.
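In Java, which this project uses for its networking, the raw socket() call shown above is hidden behind the Socket and ServerSocket classes. The sketch below is illustrative only; the port number 5000 is an arbitrary example.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

public class EchoServer {
    public static void main(String[] args) throws Exception {
        // Accept one TCP connection on an example port and echo a line back
        // over the reliable, connection-oriented virtual circuit TCP provides.
        try (ServerSocket server = new ServerSocket(5000);
             Socket client = server.accept();
             BufferedReader in = new BufferedReader(new InputStreamReader(client.getInputStream()));
             PrintWriter out = new PrintWriter(client.getOutputStream(), true)) {
            out.println("echo: " + in.readLine());
        }
    }
}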

6.9 JFREE CHART:

JFreeChart is a free 100% Java chart library that makes it easy for developers to display professional quality charts in their applications. JFreeChart’s extensive feature set includes:

A consistent and well-documented API, supporting a wide range of chart types;

A flexible design that is easy to extend, and targets both server-side and client-side applications;

Support for many output types, including Swing components, image files (including PNG and JPEG), and vector graphics file formats (including PDF, EPS and SVG);

JFreeChart is “open source” or, more specifically, free software. It is distributed under the terms of the GNU Lesser General Public Licence (LGPL), which permits use in proprietary applications.
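As a brief sketch (assuming the JFreeChart 1.0.x API summarized above; the categories, values, and file name are invented), a chart is built from a dataset and can be written out as a PNG image:

import java.io.File;
import org.jfree.chart.ChartFactory;
import org.jfree.chart.ChartUtilities;
import org.jfree.chart.JFreeChart;
import org.jfree.data.general.DefaultPieDataset;

public class PieChartDemo {
    public static void main(String[] args) throws Exception {
        // Build a small dataset; the categories and values are illustrative only.
        DefaultPieDataset dataset = new DefaultPieDataset();
        dataset.setValue("Chairs", 45);
        dataset.setValue("Tables", 30);
        dataset.setValue("Sofas", 25);

        // Create the chart (title, dataset, legend, tooltips, URLs).
        JFreeChart chart = ChartFactory.createPieChart("Selections", dataset, true, true, false);

        // Save it as a PNG file, one of the supported output types.
        ChartUtilities.saveChartAsPNG(new File("selections.png"), chart, 500, 300);
    }
}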

 

6.9.1. Map Visualizations:

Charts showing values that relate to geographical areas. Some examples include: (a) population density in each state of the United States, (b) income per capita for each country in Europe, (c) life expectancy in each country of the world. The tasks in this project include: Sourcing freely redistributable vector outlines for the countries of the world, states/provinces in particular countries (USA in particular, but also other areas);

Creating an appropriate dataset interface (plus default implementation), a renderer, and integrating this with the existing XYPlot class in JFreeChart; testing, documenting, testing some more, documenting some more.

6.9.2. Time Series Chart Interactivity

Implement a new (to JFreeChart) feature for interactive time series charts — to display a separate control that shows a small version of ALL the time series data, with a sliding “view” rectangle that allows you to select the subset of the time series data to display in the main chart.

6.9.3. Dashboards

There is currently a lot of interest in dashboard displays. Create a flexible dashboard mechanism that supports a subset of JFreeChart chart types (dials, pies, thermometers, bars, and lines/time series) that can be delivered easily via both Java Web Start and an applet.

 

6.9.4. Property Editors

The property editor mechanism in JFreeChart only handles a small subset of the properties that can be set for charts. Extend (or reimplement) this mechanism to provide greater end-user control over the appearance of the charts.

CHAPTER 8

8.1 CONCLUSION:

We have developed a new method for the identification and improvement of navigation-related Web usability problems by checking extracted usage patterns against cognitive user models. As demonstrated by our case study, our method can identify areas with usability issues to help improve the usability of Web systems. Once a website is operational, our method can be continuously applied and drive ongoing refinements. In contrast with traditional software products and systems, Web based applications have shortened development cycles and prolonged maintenance cycles. Our method can contribute significantly to continuous usability improvement over these prolonged maintenance cycles. The usability improvement in successive iterations can be quantified by the progressively better effectiveness (higher task completion rate) and efficiency (less time for given tasks).

Our method is not intended to and cannot replace heuristic usability evaluation by experts and user-centered usability testing. It complements these traditional usability practices and can be incorporated into an integrated strategy for Web usability assurance. With automated tool support for a significant part of the activities involved, our method is cost-effective. It would be particularly valuable in two common situations: where an adequate number of actual users cannot be involved in testing, and where cognitive experts are in short supply. Server logs in our method represent real users' operations in natural working conditions, and our IUIP models injected with human behavior cognition represent part of cognitive experts' work. We are currently integrating these modeling and analysis tools into a tool suite that supports measurement, analysis, and overall quality improvement for Web applications.

8.2 FUTURE ENHANCEMENT: In the future, we plan to carry out validation studies with large-scale Web applications. We also plan to explore additional approaches to discovering Web usage patterns and related usability problems generalizable to other interesting domains. For example, we have already started exploring deviation calculation and analysis at the trail level instead of at the individual page level. Such analyses might be more meaningful and yield more interesting results for Web applications with complex structures and operation sequences. Our IUIP modeling architecture and supporting tools also need to be further enhanced and optimized for more complex tasks. We will also further expand our usability research to cover more usability aspects to improve Web users' overall satisfaction.

Improving Physical-Layer Security in Wireless Communications Using Diversity Techniques

Yulong Zou, Jia Zhu, Xianbin Wang, and Victor C. M. Leung
(Yulong Zou and Jia Zhu are with the Nanjing University of Posts and Telecommunications; Xianbin Wang is with the University of Western Ontario; Victor C. M. Leung is with the University of British Columbia.)

Abstract: Due to the broadcast nature of radio propagation, wireless transmission can be readily overheard by unauthorized users for interception purposes and is thus highly vulnerable to eavesdropping attacks. To this end, physical-layer security is emerging as a promising paradigm to protect the wireless communications against eavesdropping attacks by exploiting the physical characteristics of wireless channels. This article is focused on the investigation of diversity techniques to improve physical-layer security differently from the conventional artificial noise generation and beamforming techniques, which typically consume additional power for generating artificial noise and exhibit high implementation complexity for beamformer design. We present several diversity approaches to improve wireless physical-layer security, including multiple-input multiple-output (MIMO), multiuser diversity, and cooperative diversity. To illustrate the security improvement through diversity, we propose a case study of exploiting cooperative relays to assist the signal transmission from source to destination while defending against eavesdropping attacks. We evaluate the security performance of cooperative relay transmission in Rayleigh fading environments in terms of secrecy capacity and intercept probability. It is shown that as the number of relays increases, both the secrecy capacity and intercept probability of cooperative relay transmission improve significantly, implying there is an advantage in exploiting cooperative diversity to improve physical-layer security against eavesdropping attacks.

In wireless networks, transmission between legitimate users can easily be overheard by an eavesdropper for interception due to the broadcast nature of the wireless medium, making wireless transmission highly vulnerable to eavesdropping attacks. In order to achieve confidential transmission, existing communications systems typically adopt cryptographic techniques to prevent an eavesdropper from tapping data transmission between legitimate users [1, 2]. Considering symmetric key encryption as an example, the original data (called plaintext) is first encrypted at the source node by using an encryption algorithm along with a secret key that is shared only with the destination node. Then the encrypted plaintext (also known as ciphertext) is transmitted to the destination, which will decrypt its received ciphertext with the preshared secret key. In this way, even if an eavesdropper overhears the ciphertext transmission, it is still difficult for the eavesdropper to interpret the plaintext from its intercepted ciphertext without the secret key. It is pointed out that ciphertext transmission is not perfectly secure, since the ciphertext can still be decrypted by an eavesdropper through an exhaustive key search, which is also known as a brute-force attack. To this end, physical-layer security is emerging as an alternative paradigm to protect wireless communications against eavesdropping attacks, including brute-force attacks. Physical-layer security work was pioneered by Wyner in [3], where a discrete memoryless wiretap channel was examined for secure communications in the presence of an eavesdropper. It was proved in [3] that perfectly secure data transmission can be achieved if the channel capacity of the main link (from source to destination) is higher than that of the wiretap link (from source to eavesdropper). Later on, in [4], Wyner's results were extended from the discrete memoryless wiretap channel to the Gaussian wiretap channel, where a so-called secrecy capacity was developed and shown as the difference between the channel capacity of the main link and that of the wiretap link. If the secrecy capacity falls below zero, the transmission from source to destination becomes insecure, and the eavesdropper can succeed in intercepting the source transmission (i.e., an intercept event occurs). In order to improve transmission security against eavesdropping attacks, it is of importance to reduce the probability of occurrence of an intercept event (called intercept probability) through enlarging the secrecy capacity. However, in wireless communications, secrecy capacity is severely degraded due to the fading effect.

As a consequence, there are extensive works aimed at increasing the secrecy capacity of wireless communications by exploiting multiple antennas [5] and cooperative relays [6]. Specifically, the multiple-input multiple-output (MIMO) wiretap channel was studied in [7] to enhance the wireless secrecy capacity in fading environments. In [8], cooperative relays were examined for improving the physical-layer security in terms of the secrecy rate performance. A hybrid cooperative beamforming and jamming approach was investigated in [9] to enhance the wireless secrecy capacity, where partial relay nodes are allowed to assist the source transmission to the legitimate destination with the aid of distributed beamforming, while the remaining relay nodes are used to transmit artificial noise to confuse the eavesdropper. More recently, a joint physical-application layer security framework was proposed in [10] for improving the security of wireless multimedia delivery by simultaneously exploiting physical-layer signal processing techniques as well as upper-layer authentication and watermarking methods. In [11], error control coding for secrecy was discussed for achieving physical-layer security. Additionally, in [12, 13], physical-layer security was further investigated in emerging cognitive radio networks. At the time of writing, most research efforts are devoted to examining the artificial noise and beamforming techniques to combat eavesdropping attacks, but they consume additional power resources for generating artificial noise and increase the computational complexity in performing beamformer design. Therefore, this article is motivated to enhance physical-layer security through diversity techniques without additional power costs, including MIMO, multiuser diversity, and cooperative diversity, aimed at increasing the capacity of the main channel while degrading the wiretap channel. For illustration purposes, we present a case study of exploiting cooperative relays to improve the physical-layer security against eavesdropping attacks, where the best relay is selected and used to participate in forwarding the signal transmission from source to destination. We evaluate the secrecy capacity and intercept probability of the proposed cooperative relay transmission in Rayleigh fading environments.
It is shown that with an increasing number of relays, the security performance of cooperative relay transmission significantly improves in terms of secrecy capacity and intercept probability. This confirms the advantage of using cooperative relays to protect wireless communications against eavesdropping attacks. The remainder of this article is organized as follows. The next section presents the system model of physical-layer security in wireless communications. After that, we focus on the physical-layer security enhancement through diversity techniques, including MIMO, multiuser diversity, and cooperative diversity. For the purpose of illustrating the security improvement through diversity, we present a case study of exploiting cooperative relays to assist signal transmission from source to destination against eavesdropping attacks. Finally, we provide some concluding remarks.

Physical-Layer Security in Wireless Communications

Figure 1 shows a wireless communications scenario with one source and one destination in the presence of an eavesdropper, where the solid and dashed lines represent the main channel (from source to destination) and the wiretap channel (from source to eavesdropper), respectively. When the source node transmits its signal to the destination, an eavesdropper may overhear such transmission due to the broadcast nature of the wireless medium. Considering the fact that today's wireless systems are highly standardized, the eavesdropper can readily obtain the transmission parameters, including the signal waveform, coding and modulation scheme, encryption algorithm, and so on. Also, the secret key may be figured out at the eavesdropper (e.g., through an exhaustive search). Thus, the source signal could be interpreted at the eavesdropper by decoding its overheard signal, leading to insecurity of the legitimate transmission.

As a result, physical-layer security emerges as an alternative means to achieve perfect transmission secrecy from source to destination. In the physical-layer security literature [3, 4], a so-called secrecy capacity is developed and shown as the difference between the capacities of the main link and the wiretap link. It has been proven that perfect secrecy is achieved if the secrecy capacity is positive, meaning that when the main channel capacity is larger than the wiretap channel capacity, the transmission from source to destination can be perfectly secure. This can be explained by using the Shannon coding theorem, from which it is impossible for a receiver to recover the source signal if the channel capacity (from source to receiver) is smaller than the data rate. Thus, given a positive secrecy capacity, the data rate can be adjusted between the capacities of the main and wiretap channels so that the destination node successfully decodes the source signal and the eavesdropper fails to decode it. However, if the secrecy capacity is negative (i.e., the main channel capacity falls below the wiretap channel capacity), the eavesdropper is more likely than the destination to succeed in decoding the source signal. In an information-theoretic sense, when the main channel capacity becomes smaller than the wiretap channel capacity, it is impossible to guarantee that the destination succeeds and the eavesdropper fails to decode the source signal.
Therefore, an intercept event is seen to occur when the secrecy capacity falls below zero, and the probability of occurrence of an intercept event is called intercept probability throughout this article.

At present, most existing work is focused on improving physical-layer security by generating artificial noise to confuse an eavesdropping attack, where the artificial noise is sophisticatedly produced such that only the eavesdropper experiences interference, and the desired destination can easily cancel out such noise without performance degradation. More specifically, given a main channel matrix Hm, the artificial noise (denoted by wn) is designed in the null space of matrix Hm such that Hm wn = 0, making the desired destination unaffected by the noise. Since the wiretap channel is independent of the main channel, the null space of the wiretap channel is in general different from that of the main channel; thus, the eavesdropper cannot null out the artificial noise, which results in the performance degradation at the eavesdropper. Notice that the above-mentioned null space based noise generation approach needs the knowledge of main channel Hm only, which can be further optimized if the wiretap channel information is also known. It needs to be pointed out that additional power resources are required for generating artificial noise to confuse the eavesdropper. For a fair comparison, the total transmit power of artificial noise and desired signal should be constrained. Also, the power allocation between the artificial noise and desired signal is important, and should be adapted to the main and wiretap channels to optimize the physical-layer security performance, for example, in terms of secrecy capacity. Different from the artificial noise generation approach, this article is mainly focused on the investigation of diversity techniques for enhancing physical-layer security.

[Figure 1. A wireless communications scenario consisting of one source and one destination in the presence of an eavesdropping attack.]

Diversity for Physical-Layer Security

In this section, we present several diversity techniques to improve physical-layer security against eavesdropping attacks. Traditionally, diversity techniques are exploited to increase transmission reliability, which also have great potential to enhance wireless security. In the following, we discuss the physical-layer security improvement through the use of MIMO, multiuser diversity, and cooperative diversity, respectively. Notice that the MIMO and multiuser diversity mechanisms are generally applicable to various cellular and WiFi networks, since the cellular and WiFi networks typically consist of multiple users, and, moreover, today's cellular and WiFi devices are equipped with multiple antennas. In contrast, the cooperative diversity mechanism is only applicable to some advanced cellular and WiFi networks that have adopted the relay architecture, such as the Long Term Evolution (LTE)-Advanced system and IEEE 802.16j/m, where relay stations are introduced to assist wireless data transmission.

MIMO Diversity

This subsection presents MIMO diversity for physical-layer security of wireless transmission against eavesdropping attacks. As shown in Fig. 2, all the network nodes are equipped with multiple antennas, where M, Nd, and Ne represent the number of antennas at the source, destination, and eavesdropper, respectively.
As is known, MIMO has been shown as an effective means to combat wireless fading and increase the capacity of the wireless channel. However, the eavesdropper can also exploit the MIMO structure to enlarge the capacity of a wiretap channel from the source to the eavesdropper. Thus, without proper design, increasing the secrecy capacity of wireless transmission with MIMO may fail. For example, if conventional open-loop space-time block coding is considered, the destination should first estimate the main channel matrix Hm and then perform the space-time decoding process with an estimated H^m, leading diversity gain to be achieved for the main channel. Similarly, the eavesdropper can also estimate the wiretap channel matrix Hw and then conduct the corresponding space-time decoding algorithm to obtain diversity gain for the wiretap channel. Hence, the conventional space-time block coding is not effective to improve physical-layer security against eavesdropping attacks.

Generally speaking, if the source node transmits its signal to the desired destination with M antennas, the eavesdropper also receives M signal copies for interception purposes. In order to defend against eavesdropping attacks, the source node should adopt a preprocess that needs to be adapted to the main and wiretap channels Hm and Hw such that diversity gain can be achieved only at the destination, whereas the eavesdropper benefits nothing from the multiple transmit antennas at the source. This means that an adaptive transmit process should be included at the source node to increase the main channel capacity while decreasing the wiretap channel capacity. Ideally, the objective of such an adaptive transmit process is to maximize the secrecy capacity of MIMO transmission, which, however, requires the channel state information (CSI) of both the main and wiretap links (i.e., Hm and Hw). In practice, the wiretap channel information Hw may be unavailable, since the eavesdropper is usually passive and stays silent. If only the main channel information Hm is known, the adaptive transmit process can be designed to maximize the main channel capacity, which does not require knowledge of wiretap channel Hw. Since the adaptive transmit process is optimized based on the main channel information Hm, and the wiretap channel is typically independent of the main channel, the main channel capacity is significantly increased with MIMO, and no improvement is achieved for the wiretap channel capacity.

As for the aforementioned adaptive transmit process, we here present three main concrete approaches: transmit beamforming, power allocation, and transmit antenna selection. Transmit beamforming is a signal processing technique combining multiple transmit antennas at the source node in such a way that desired signals transmit in a particular direction to the destination. Considering that the eavesdropper and destination generally lie in different directions relative to the source node, the desired signals (with transmit beamforming) received at the eavesdropper experience destructive interference and become very weak. Thus, transmit beamforming is effective in defending against eavesdropping attacks when the destination and eavesdropper are spatially separated.

[Figure 2. A MIMO wireless system consisting of one source and one destination in the presence of an eavesdropping attack.]
The power allocation maximizes the main channel capacity (or secrecy capacity if both Hm and Hw are known) by allocating the transmit power among M antennas at the source. In this way, the secrecy capacity of MIMO transmission is significantly increased, showing the security benefits of using power allocation against eavesdropping attacks. In addition, the transmit antenna selection is also able to improve the physical-layer security of MIMO wireless systems. Depending on whether the global CSI of the main and wiretap channels (i.e., Hm and Hw) is available, an optimal transmit antenna at the source node is selected and used to transmit source signals. More specifically, if both Hm and Hw are available, the transmit antenna with the highest secrecy capacity is chosen. Studying the case of the globally available CSI provides a theoretical upper bound on the security performance of wireless systems. Notice that the CSI of wiretap channels may be estimated and obtained by monitoring the eavesdroppers' transmissions as discussed in [8] and [14]. If only Hm is known, the transmit antenna selection is to maximize the main channel capacity. One can observe that the above-mentioned three approaches (i.e., transmit beamforming, power allocation, and transmit antenna selection) all have great potential to improve the physical-layer security of MIMO wireless systems against eavesdropping attacks.

Multiuser Diversity

This subsection discusses the multiuser diversity for improving physical-layer security. Figure 3 shows that a base station (BS) serves multiple users, where M users are denoted by U = {Ui | i = 1, 2, ..., M}. In cellular networks, M users typically communicate with a BS through an orthogonal multiple access mechanism such as orthogonal frequency-division multiple access (OFDMA) and time-division multiple access (TDMA). Taking OFDMA as an example, orthogonal frequency-division multiplexing (OFDM) subcarriers are allocated to different users. In other words, given an OFDM subcarrier, we need to determine which user should be assigned to access and use the subcarrier for data transmission. Traditionally, the user with the highest throughput is selected to access the given OFDM subcarrier, aiming to maximize the transmission capacity. This relies on knowledge of main channel information Hm only and can provide significant multiuser diversity gain for performance improvement. However, if a user is far away from a BS and experiences severe propagation loss and deep fading, it may have no chance of being selected as the "best" user for channel access. To this end, user fairness should be further considered in multiuser scheduling, where two competing interests need to be balanced: maximizing the main channel capacity while at the same time guaranteeing each user certain opportunities to access the channel.

With multiuser scheduling, a user is first selected to access a channel (i.e., an OFDM subcarrier in OFDMA or a time slot in TDMA) and then starts transmitting its signal to a BS. Meanwhile, due to the broadcast nature of wireless transmission, an eavesdropper overhears such transmission and attempts to interpret the source signal. In order to effectively defend against the eavesdropping attack, multiuser scheduling should be performed to minimize the wiretap channel capacity while maximizing the main channel capacity, which requires the CSI of both the main and wiretap links.
If only the main channel information Hm is available, we may consider the use of conventional multiuser scheduling, where the wiretap channel information Hw is not taken into account. It needs to be pointed out that conventional multiuser scheduling still has great potential to enhance physical-layer security, since the main channel capacity is significantly improved with conventional multiuser scheduling while the wiretap channel capacity remains the same.

[Figure 3. A multiuser wireless communications system consisting of one base station (BS) and multiple users in the presence of an eavesdropper.]

Cooperative Diversity

In this subsection, we focus mainly on cooperative diversity for wireless security against eavesdropping attacks. Figure 4 shows a cooperative wireless network including one source, M relays, and one destination in the presence of an eavesdropper, where M relays are exploited to assist the signal transmission from source to destination. To be specific, the source node first transmits its signal to M relays, which then forward their received source signals to the destination. At present, there are two basic relay protocols: amplify-and-forward (AF) and decode-and-forward (DF). In the AF protocol, a relay node simply amplifies and retransmits its received noisy version of the source signal to the destination. In contrast, the DF protocol requires the relay node to decode its received signal and forward its decoded outcome to the destination node. It is concluded that multiple-relay-assisted source signal transmission consists of two steps:

  1. The source node broadcasts its signal.
  2. Relay nodes retransmit their received signals.

Each of the two transmission steps is vulnerable to eavesdropping attack and needs to be carefully designed to prevent an eavesdropper from intercepting the source signal.

Typically, the main channel capacity with multiple relays can be significantly increased by using cooperative beamforming. More specifically, multiple relays can form a virtual antenna array and cooperate with each other to perform transmit beamforming such that the signals received at the intended destination experience constructive interference, while the others (received at the eavesdropper) experience destructive interference. One can observe that with cooperative beamforming, the received signal strength of the destination is much higher than that of the eavesdropper, implying physical-layer security improvement. In addition to the aforementioned cooperative beamforming, the best relay selection is another approach to improve wireless transmission security against eavesdropping attacks. In the best relay selection, a relay node with the highest secrecy capacity (or highest main channel capacity if only the main channel information is available) is chosen to participate in assisting the signal transmission from source to destination. In this way, cooperative diversity gain is achieved for physical-layer security enhancement.

Case Study: Security Evaluation of Cooperative Relay Transmission

In this section, we present a case study to show the physical-layer security improvement by exploiting cooperative relays, where only a single best relay is selected to assist the signal transmission from source to destination.
This differs from existing research efforts in [8], where multiple cooperative relays participate in forwarding the source signal to the destination. For comparison purposes, we first consider conventional direct transmission as a benchmark scheme, where the source node directly transmits its signal to the destination without a relay. Meanwhile, an eavesdropper is present and attempts to intercept the signal transmission from source to destination. As discussed in [3, 4], the secrecy capacity of conventional direct transmission is shown as the difference between the capacities of the main channel (from source to destination) and the wiretap channel (from source to eavesdropper), which is written as

C_s = \log_2\left(1 + \frac{P|h_{sd}|^2}{N_0}\right) - \log_2\left(1 + \frac{P|h_{se}|^2}{N_0}\right)    (1)

where P is the transmit power at the source, N_0 is the variance of the additive white Gaussian noise (AWGN), \gamma_s = P/N_0 is regarded as the signal-to-noise ratio (SNR), and h_{sd} and h_{se} represent the fading coefficients of the channel from source to destination and from source to eavesdropper, respectively. Presently, there are three commonly used fading models (i.e., Rayleigh, Rician, and Nakagami), and we consider the use of the Rayleigh fading model to characterize the main and wiretap channels. Thus, |h_{sd}|^2 and |h_{se}|^2 are independent and exponentially distributed random variables with means \sigma_{sd}^2 and \sigma_{se}^2, respectively. Also, an ergodic secrecy capacity of the direct transmission can be obtained by averaging the instantaneous secrecy capacity C_s^+ over the fading coefficients h_{sd} and h_{se}, where C_s^+ = \max(C_s, 0). In addition, if the secrecy capacity C_s falls below zero, the source transmission becomes insecure, and the eavesdropper will succeed in intercepting the source signal. Thus, using Eq. 1 and denoting x = |h_{sd}|^2 and y = |h_{se}|^2, an intercept probability of the direct transmission can be given by

P_{\text{intercept}} = \Pr(C_s < 0) = \Pr\left(|h_{sd}|^2 < |h_{se}|^2\right) = \iint_{x<y} \frac{1}{\sigma_{sd}^2\,\sigma_{se}^2}\exp\left(-\frac{x}{\sigma_{sd}^2} - \frac{y}{\sigma_{se}^2}\right) dx\,dy = \frac{\sigma_{se}^2}{\sigma_{sd}^2 + \sigma_{se}^2}    (2)

where the third equality arises from the fact that the random variables |h_{sd}|^2 and |h_{se}|^2 are independent and exponentially distributed, and \sigma_{sd}^2 and \sigma_{se}^2 are the expected values of |h_{sd}|^2 and |h_{se}|^2, respectively. As can be observed from Eq. 2, the intercept probability of conventional direct transmission is independent of the transmit power P, meaning that increasing the transmit power cannot improve physical-layer security in terms of intercept probability. This motivates us to explore the use of cooperative relays to decrease the intercept probability. For notational convenience, let \lambda_{me} represent the ratio of the average main channel gain \sigma_{sd}^2 to an eavesdropper's average channel gain \sigma_{se}^2, that is, \lambda_{me} = \sigma_{sd}^2/\sigma_{se}^2, which is referred to as the main-to-eavesdropper ratio (MER) throughout this article.

In the following, we present the cooperative relay transmission scheme where multiple relays are used to assist the signal transmission from source to destination. Here, the AF relaying protocol is considered, and only the best relay will be selected to participate in forwarding the source signal to the destination. To be specific, the source node first broadcasts its signal to M relays. Then the best relay node is chosen to forward a scaled version of its received signal to the destination [15]. Note that during the above-mentioned cooperative relay transmission process, the total amount of transmit power at the source and relay should be constrained to P to make a fair comparison with the conventional direct transmission scheme. We here consider equal power allocation; thus, the transmit power at the source and relay is given by P/2. Now, given M relays, it is crucial to determine which relay should be selected as the best one to assist the source signal transmission.
Ideally, the best relay selection should aim to maximize the secrecy capacity, which, however, requires the CSI of both the main and wiretap channels. Since the eavesdropper is passive and the wiretap channel information is difficult to obtain in practice, we consider the main channel capacity as the objective of best relay selection, which relies on knowledge of the main channel only. Accordingly, the best relay selection criterion with the AF protocol is expressed as

\text{Best Relay} = \arg\max_{i \in \mathcal{R}} \frac{|h_{si}|^2 |h_{id}|^2}{|h_{si}|^2 + |h_{id}|^2},    (3)

[Figure 4. A cooperative diversity system consisting of one source, M relays, and one destination in the presence of an eavesdropper.]

where \mathcal{R} denotes the set of M relays, and |h_{si}|^2 and |h_{id}|^2 represent the fading coefficients of the channel from the source to relay R_i and from relay R_i to the destination, respectively. One can see from Eq. 3 that the proposed best relay selection criterion only requires the main channel information, |h_{si}|^2 and |h_{id}|^2, with which the main channel capacity is maximized. Since the main and wiretap channels are independent of each other, the wiretap channel capacity benefits nothing from the proposed best relay selection. Similar to Eq. 1, the secrecy capacity of best relay selection can be obtained by subtracting the wiretap channel capacity from the corresponding main channel capacity. Also, the intercept probability of best relay selection is easily determined by computing the probability that the secrecy capacity falls below zero.

In Fig. 5, we provide the ergodic secrecy capacity comparison between the conventional direct transmission and proposed best relay selection schemes for different numbers of relays M, with \gamma_s = 12 dB, \sigma_{sd}^2 = 0.5, and \sigma_{sr}^2 = \sigma_{rd}^2 = 2. It is shown in Fig. 5 that for the cases of M = 2, M = 4, and M = 8, the ergodic secrecy capacity of the best relay selection scheme is always higher than that of direct transmission, showing the wireless security benefits of using cooperative relays. Also, as the number of relays increases from M = 2 to M = 8, the ergodic secrecy capacity of best relay selection increases significantly. This means that increasing the number of cooperative relays can improve the physical-layer security of wireless transmission against eavesdropping attacks.

Figure 6 shows the intercept probability versus MER of the conventional direct transmission and proposed best relay selection schemes for different numbers of relays M, with \gamma_s = 12 dB, \sigma_{sd}^2 = 0.5, and \sigma_{sr}^2 = \sigma_{rd}^2 = 2. Note that the intercept probability is obtained by calculating the rate of occurrence of an intercept event, i.e., the capacity of the main channel falling below that of the wiretap channel. Observe from Fig. 6 that the best relay selection scheme outperforms conventional direct transmission in terms of intercept probability. Moreover, as the number of cooperative relays M increases from M = 2 to M = 8, the intercept probability improvement of best relay selection over direct transmission becomes much more significant. It is also shown in Fig. 6 that the slope of the intercept probability curve of the best relay selection scheme in high MER regions becomes steeper with an increasing number of relays.
In other words, as the number of relays increases, the intercept probability of best relay selection decreases at a much higher speed with an increasing MER. This further confirms that a diversity gain is achieved by the proposed relay selection scheme for physical-layer security improvement.

[Figure 5. Ergodic secrecy capacity vs. MER of the direct transmission and best relay selection schemes with \gamma_s = 12 dB, \sigma_{sd}^2 = 0.5, and \sigma_{sr}^2 = \sigma_{rd}^2 = 2.]

[Figure 6. Intercept probability vs. MER of the direct transmission and best relay selection schemes with \gamma_s = 12 dB, \sigma_{sd}^2 = 0.5, and \sigma_{sr}^2 = \sigma_{rd}^2 = 2.]

Conclusion

This article studies physical-layer security of wireless communications and presents several diversity techniques for improving wireless security against eavesdropping attacks. We discuss the use of MIMO, multiuser diversity, and cooperative diversity for the sake of increasing the secrecy capacity of wireless transmission. To illustrate the security benefits of diversity, we propose a case study of physical-layer security in cooperative wireless networks with multiple relays, where the best relay is selected to participate in forwarding the signal transmission from source to destination. The secrecy capacity and intercept probability of the conventional direct transmission and proposed best relay selection schemes are evaluated in Rayleigh fading environments. It is shown that the best relay selection scheme outperforms direct transmission in terms of both secrecy capacity and intercept probability. Moreover, as the number of cooperative relays increases, the security improvement of the best relay selection scheme over direct transmission becomes much more significant.

Although extensive research efforts have been devoted to wireless physical-layer security, many challenging but interesting issues remain open for future work. Specifically, most of the existing works on this subject focus on enhancing the wireless secrecy capacity against the eavesdropping attack only, but neglect the joint consideration of different types of wireless physical-layer attacks, including both eavesdropping and denial of service (DoS) attacks. It is of great importance to explore new techniques for jointly defending against multiple different wireless attacks. Furthermore, security, reliability, and throughput are the main driving factors for the research and development of next-generation wireless networks, and they are typically coupled and affect each other. For example, the security of the wireless physical layer may be improved by generating artificial noise to confuse an eavesdropping attack, which, however, comes at the expense of degrading wireless reliability and throughput performance, since artificial noise generation consumes some power resources, and less transmit power becomes available for the desired information transmission. Thus, it is of interest to investigate the joint optimization of security, reliability, and throughput for the wireless physical layer, which is a challenging issue to be solved in the future.

Acknowledgment

This work was supported by the "1000 Young Talents Program" of China, the National Natural Science Foundation of China (Grant No. 61302104), and the Scientific Research Foundation of Nanjing University of Posts and Telecommunications (Grant No. NY213014).

Identity-Based Encryption with Outsourced Revocation in Cloud Computing

Identity-Based Encryption (IBE), which simplifies public key and certificate management compared with a Public Key Infrastructure (PKI), is an important alternative to public key encryption. However, one of the main efficiency drawbacks of IBE is the computation overhead at the Private Key Generator (PKG) during user revocation. Efficient revocation has been well studied in the traditional PKI setting, but the cumbersome management of certificates is precisely the burden that IBE strives to alleviate. In this paper, aiming at tackling the critical issue of identity revocation, we introduce outsourcing computation into IBE for the first time and propose a revocable IBE scheme in the server-aided setting.

Our scheme offloads most of the key-generation-related operations during the key-issuing and key-update processes to a Key Update Cloud Service Provider, leaving only a constant number of simple operations for the PKG and users to perform locally. This goal is achieved by utilizing a novel collusion-resistant technique: we employ a hybrid private key for each user, in which an AND gate is involved to connect and bind the identity component and the time component. Furthermore, we propose another construction that is provably secure under the recently formalized Refereed Delegation of Computation model. Finally, we provide extensive experimental results to demonstrate the efficiency of our proposed construction.

INTRODUCTION:

Identity-Based Encryption (IBE) is an interesting alternative to public key encryption, proposed to simplify key management in a certificate-based Public Key Infrastructure (PKI) by using human-intelligible identities (e.g., a unique name, email address, or IP address) as public keys. Therefore, a sender using IBE does not need to look up the receiver's public key and certificate, but directly encrypts the message with the receiver's identity.

Accordingly, a receiver who obtains the private key associated with the corresponding identity from the Private Key Generator (PKG) is able to decrypt such a ciphertext. Though IBE allows an arbitrary string as a public key, which is considered an appealing advantage over PKI, it demands an efficient revocation mechanism. Specifically, if the private keys of some users are compromised, we must provide a means to revoke such users from the system. In the PKI setting, revocation is realized by appending validity periods to certificates or by using involved combinations of techniques.

Nevertheless, the cumbersome management of certificates is precisely the burden that IBE strives to alleviate. As far as we know, though revocation has been thoroughly studied in PKI, few revocation mechanisms are known in the IBE setting. Boneh and Franklin suggested that users renew their private keys periodically and that senders use the receivers' identities concatenated with the current time period. But this mechanism results in an overhead load at the PKG. In other words, all users, regardless of whether their keys have been revoked, have to contact the PKG periodically to prove their identities and update their private keys. It requires the PKG to be online and a secure channel to be maintained for all transactions, which becomes a bottleneck for the IBE system as the number of users grows.

A subsequent work presented a revocable IBE scheme. Their scheme is built on the fuzzy IBE primitive but utilizes a binary tree data structure to record users' identities at leaf nodes. Therefore, the key-update efficiency at the PKG can be significantly reduced from linear to the height of the binary tree (i.e., logarithmic in the number of users). Nevertheless, we point out that although the binary tree achieves relatively high performance, it results in other problems:

1) The PKG has to generate a key pair for all the nodes on the path from the identity leaf node to the root node, which results in complexity logarithmic in the number of users in the system for issuing a single private key.

2) The size of the private key grows logarithmically in the number of users in the system, which makes private key storage difficult for users.

3) As the number of users in the system grows, the PKG has to maintain a binary tree with a large number of nodes, which introduces another bottleneck for the global system.

In tandem with the development of cloud computing, there has emerged the ability for users to buy on-demand computing from cloud-based services such as Amazon's EC2 and Microsoft's Windows Azure. Thus, a new working paradigm is desired for introducing such cloud services into IBE revocation to fix the efficiency and storage overhead issues described above. A naive approach would be to simply hand over the PKG's master key to the Cloud Service Providers (CSPs).

The CSPs could then simply update all the private keys by using the traditional key update technique [4] and transmit the private keys back to unrevoked users. However, this naive approach is based on the unrealistic assumption that the CSPs are fully trusted and allowed to access the master key of the IBE system. On the contrary, in practice public clouds are likely outside the users' trusted domain and are curious about users' individual privacy. This raises the challenge of how to design a secure revocable IBE scheme that reduces the computation overhead at the PKG while relying on an untrusted CSP.

In this paper, we introduce outsourcing computation into IBE revocation and, to the best of our knowledge, formalize the security definition of outsourced revocable IBE for the first time. We propose a scheme to offload all the key-generation-related operations during key-issuing and key-update, leaving only a constant number of simple operations for the PKG and eligible users to perform locally. In our scheme, as with the suggestion above, we realize revocation by updating the private keys of the unrevoked users. But unlike that work, which trivially concatenates the time period with the identity for key generation/update and requires re-issuing the whole private key for unrevoked users, we avoid re-issuing entire keys.

Instead, we propose a novel collusion-resistant key issuing technique: we employ a hybrid private key for each user, in which an AND gate connects and binds two sub-components, namely the identity component and the time component. At first, a user obtains the identity component and a default time component (i.e., for the current time period) from the PKG as his/her private key during key-issuing. Afterwards, in order to maintain decryptability, an unrevoked user needs to periodically request a key update for the time component from a newly introduced entity named the Key Update Cloud Service Provider (KU-CSP).

Our scheme does not re-issue whole private keys; it only updates a lightweight component of the key at a specialized entity, the KU-CSP. We also specify that: 1) with the aid of the KU-CSP, a user does not need to contact the PKG during key-update; in other words, the PKG is allowed to be offline after sending the revocation list to the KU-CSP; 2) no secure channel or user authentication is required during key-update between the user and the KU-CSP. Furthermore, we consider realizing revocable IBE with a semi-honest KU-CSP. To achieve this goal, we present a security-enhanced construction under the recently formalized Refereed Delegation of Computation (RDoC) model. Finally, we provide extensive experimental results to demonstrate the efficiency of our proposed construction.
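To make the key-update workflow concrete, the following is a minimal structural sketch (not the paper's actual cryptographic construction) of a hybrid private key split into an identity component and a time component, with a hypothetical KU-CSP that refreshes only the time component for unrevoked users. All class and method names are illustrative assumptions, and byte arrays stand in for real key material.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Structural sketch only: the real scheme binds the two components with
 *  cryptographic key material; here byte arrays stand in for that material. */
class HybridPrivateKey {
    byte[] identityComponent;  // issued once by the PKG during key-issuing
    byte[] timeComponent;      // refreshed each time period by the KU-CSP
}

/** Hypothetical KU-CSP: holds the revocation list forwarded by the PKG and
 *  updates only the lightweight time component for unrevoked users. */
class KeyUpdateCloudServiceProvider {
    private final Set<String> revokedUsers = new HashSet<>();
    private final Map<String, byte[]> timeKeys = new HashMap<>();

    void receiveRevocationList(Set<String> revoked) {
        revokedUsers.addAll(revoked);   // the PKG can go offline afterwards
    }

    /** Returns a fresh time component for the given period, or null if revoked. */
    byte[] keyUpdate(String identity, int timePeriod) {
        if (revokedUsers.contains(identity)) {
            return null;                // revoked users cannot maintain decryptability
        }
        byte[] component = deriveTimeComponent(identity, timePeriod);
        timeKeys.put(identity, component);
        return component;
    }

    private byte[] deriveTimeComponent(String identity, int timePeriod) {
        // Placeholder for the outsourced key-update computation.
        return (identity + "|" + timePeriod).getBytes();
    }
}
```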

EXISTING SYSTEM:

  • Identity-Based Encryption (IBE) is an interesting alternative to public key encryption, which is proposed to simplify key management in a certificate-based Public Key Infrastructure (PKI) by using human-intelligible identities (e.g., unique name, email address, IP address, etc) as public keys.
  • Boneh and Franklin suggested that users renew their private keys periodically and senders use the receivers’ identities concatenated with current time period.
  • Hanaoka et al. proposed a way for users to periodically renew their private keys without interacting with PKG.
  • Lin et al. proposed a space-efficient revocable IBE mechanism from non-monotonic Attribute-Based Encryption (ABE), but their construction requires a number of bilinear pairing operations proportional to the number of revoked users for a single decryption.


DISADVANTAGES:

The Boneh and Franklin mechanism results in an overhead load at the PKG. In other words, all users, regardless of whether their keys have been revoked, have to contact the PKG periodically to prove their identities and update their private keys. It requires the PKG to be online and a secure channel to be maintained for all transactions, which becomes a bottleneck for the IBE system as the number of users grows.

  • Boneh and Franklin’s suggestion is viable in principle but impractical at scale.
  • In the Hanaoka et al. system, however, each user is required to possess a tamper-resistant hardware device.
  • If an identity is revoked, the mediator is instructed to stop helping the user. Obviously, this is impractical, since all users are unable to decrypt on their own and need to communicate with the mediator for each decryption.


PROPOSED SYSTEM:

  • In this paper, we introduce outsourcing computation into IBE revocation and, to the best of our knowledge, formalize the security definition of outsourced revocable IBE for the first time. We propose a scheme to offload all the key-generation-related operations during key-issuing and key-update, leaving only a constant number of simple operations for the PKG and eligible users to perform locally.
  • In our scheme, as with the suggestion above, we realize revocation by updating the private keys of the unrevoked users. But unlike that work, which trivially concatenates the time period with the identity for key generation/update and requires re-issuing the whole private key for unrevoked users, we propose a novel collusion-resistant key issuing technique: we employ a hybrid private key for each user, in which an AND gate connects and binds two sub-components, namely the identity component and the time component.
  • At first, a user obtains the identity component and a default time component (i.e., for the current time period) from the PKG as his/her private key during key-issuing. Afterwards, in order to maintain decryptability, an unrevoked user needs to periodically request a key update for the time component from a newly introduced entity named the Key Update Cloud Service Provider (KU-CSP).


ADVANTAGES:

  • Compared with previous work, our scheme does not re-issue whole private keys; it only updates a lightweight component of the key at a specialized entity, the KU-CSP.
  • We also specify that, with the aid of the KU-CSP, a user does not need to contact the PKG during key-update; in other words, the PKG is allowed to be offline after sending the revocation list to the KU-CSP.
  • No secure channel or user authentication is required during key-update between the user and the KU-CSP.
  • Furthermore, we consider realizing revocable IBE with a semi-honest KU-CSP. To achieve this goal, we present a security-enhanced construction under the recently formalized Refereed Delegation of Computation (RDoC) model.
  • Finally, we provide extensive experimental results to demonstrate the efficiency of our proposed construction.


HARDWARE REQUIREMENT:

  • Processor                             –    Pentium IV
  • Speed                                 –    1.1 GHz
  • RAM                                   –    256 MB (min)
  • Hard Disk                             –    20 GB
  • Floppy Drive                          –    1.44 MB
  • Key Board                             –    Standard Windows Keyboard
  • Mouse                                 –    Two or Three Button Mouse
  • Monitor                               –    SVGA

 

SOFTWARE REQUIREMENTS:

  • Operating System                   :           Windows XP or Windows 7
  • Front End                           :           Java JDK 1.7
  • Back End                            :           MySQL Server
  • Server                              :           Apache Tomcat Server
  • Script                              :           JSP Script
  • Document                            :           MS-Office 2007


ARCHITECTURE DIAGRAM:

IMPLEMENTATION:

IBE SCHEME (IDENTITY-BASED ENCRYPTION)

ALGORITHM USED:

KEYCOMBINE ALGORITHM:

MODULES:

USER MODULES:

  • ADMIN:
  • OWNER:
  • USERS:

PKG (PRIVATE KEY GENERATOR):

KU-CSPS MODELS:

USERS REVOCATION:

PERFORMANCE EVALUATION:

CONCLUSION:

In this paper, focusing on the critical issue of identity revocation, we introduce outsourcing computation into IBE and propose a revocable scheme in which the revocation operations are delegated to a CSP. With the aid of the KU-CSP, the proposed scheme is full-featured: 1) it achieves constant efficiency for both the computation at the PKG and the private key size at the user; 2) the user does not need to contact the PKG during key-update; in other words, the PKG is allowed to be offline after sending the revocation list to the KU-CSP; 3) no secure channel or user authentication is required during key-update between the user and the KU-CSP. Furthermore, we consider realizing revocable IBE under a stronger adversary model. We present an advanced construction and show that it is secure under the RDoC model, in which at least one of the KU-CSPs is assumed to be honest. Therefore, even if a revoked user and either of the KU-CSPs collude, the collusion is unable to help such a user re-obtain his/her decryptability. Finally, we provide extensive experimental results to demonstrate the efficiency of our proposed construction.

Identity-Based Distributed Provable Data Possession in Multicloud Storage

3.1 DATAFLOW DIAGRAM

PUBLISHER:


SUBSCRIBER:

UML DIAGRAMS:

3.2 USE CASE DIAGRAM:

PUBLISHER:

SUBSCRIBER:


3.3 CLASS DIAGRAM:

PUBLISHER:


SUBSCRIBER:

3.4 SEQUENCE DIAGRAM:

PUBLISHER:

SUBSCRIBER:


3.5 ACTIVITY DIAGRAM:

PUBLISHER:


SUBSCRIBER:


CHAPTER 4

4.0 IMPLEMENTATION:

4.1 ALGORITHM

4.2 MODULES:

4.3 MODULE DESCRIPTION:

CHAPTER 5

5.0 SYSTEM STUDY:

5.1 FEASIBILITY STUDY:

The feasibility of the project is analyzed in this phase, and a business proposal is put forth with a very general plan for the project and some cost estimates. During system analysis, the feasibility study of the proposed system is carried out. This is to ensure that the proposed system is not a burden to the company. For feasibility analysis, some understanding of the major requirements for the system is essential.

Three key considerations involved in the feasibility analysis are      

  • ECONOMICAL FEASIBILITY
  • TECHNICAL FEASIBILITY
  • SOCIAL FEASIBILITY

5.1.1 ECONOMICAL FEASIBILITY:                  

This study is carried out to check the economic impact that the system will have on the organization. The amount of funds that the company can pour into the research and development of the system is limited. The expenditures must be justified. The developed system is well within the budget, and this was achieved because most of the technologies used are freely available. Only the customized products had to be purchased.

5.1.2 TECHNICAL FEASIBILITY:       

This study is carried out to check the technical feasibility, that is, the technical requirements of the system. Any system developed must not place a high demand on the available technical resources, as this would lead to high demands being placed on the client. The developed system must have modest requirements, as only minimal or no changes are required for implementing this system.

5.1.3 SOCIAL FEASIBILITY:  

This aspect of the study is to check the level of acceptance of the system by the user. This includes the process of training the user to use the system efficiently. The user must not feel threatened by the system, but must instead accept it as a necessity. The level of acceptance by the users solely depends on the methods that are employed to educate the user about the system and to make him familiar with it. His level of confidence must be raised so that he is also able to make constructive criticism, which is welcomed, as he is the final user of the system.

5.2 SYSTEM TESTING:

Testing is the process of checking whether the developed system works according to the original objectives and requirements. It is a set of activities that can be planned in advance and conducted systematically. Testing is vital to the success of the system. System testing makes the logical assumption that if all parts of the system are correct, the goal will be successfully achieved. Inadequate testing, or no testing at all, leads to errors that may not appear until many months later. This creates two problems: the time lag between the cause and the appearance of the problem, and the effect of system errors on the files and records within the system. A small system error can conceivably explode into a much larger problem. Effective testing early in the process translates directly into long-term cost savings from a reduced number of errors. Another reason for system testing is its utility as a user-oriented vehicle before implementation. The best program is worthless if it does not produce correct outputs.

5.2.1 UNIT TESTING:

A program represents the logical elements of a system. For a program to run satisfactorily, it must compile, process test data correctly, and tie in properly with other programs. Achieving an error-free program is the responsibility of the programmer. Program testing checks for two types of errors: syntax and logical. A syntax error is a program statement that violates one or more rules of the language in which it is written. An improperly defined field dimension or omitted keywords are common syntax errors. These errors are shown through error messages generated by the computer. For logic errors, the programmer must examine the output carefully.

UNIT TESTING:

Description: Test for application window properties.
Expected result: All the properties of the windows are to be properly aligned and displayed.

Description: Test for mouse operations.
Expected result: All mouse operations such as click and drag must perform the necessary operations without any exceptions.

5.2.2 FUNCTIONAL TESTING:

Functional testing of an application is used to prove that the application delivers correct results, using enough inputs to give an adequate level of confidence that it will work correctly for all sets of inputs. The functional testing will need to prove that the application works for each client type and that the personalization functions work correctly. When a program is tested, the actual output is compared with the expected output. When there is a discrepancy, the sequence of instructions must be traced to determine the problem. The process is facilitated by breaking the program into self-contained portions, each of which can be checked at certain key points. The idea is to compare program values against desk-calculated values to isolate the problems.

FUNCTIONAL TESTING:

Description: Test for all modules.
Expected result: All peers should communicate in the group.

Description: Test for various peers in a distributed network framework, as it displays all users available in the group.
Expected result: The result after execution should give the accurate result.

5.2.3 NON-FUNCTIONAL TESTING:

Non-functional software testing encompasses a rich spectrum of testing strategies, describing the expected results for every test case. It uses symbolic analysis techniques. This testing is used to check that an application will work in the operational environment. Non-functional testing includes:

  • Load testing
  • Performance testing
  • Usability testing
  • Reliability testing
  • Security testing

5.2.4 LOAD TESTING:

An important tool for implementing system tests is a load generator. A load generator is essential for testing quality requirements such as performance and stress. A load can be a real load; that is, the system can be put under test with real usage, for example by having actual telephone users connected to it. They will generate test input data for the system test.

Load Testing

Description: It is necessary to ascertain that the application behaves correctly under load when a ‘Server busy’ response is received.
Expected result: Should designate another active node as the server.

5.2.5 PERFORMANCE TESTING:

Performance tests are utilized in order to determine the widely defined performance of the software system such as execution time associated with various parts of the code, response time and device utilization. The intent of this testing is to identify weak points of the software system and quantify its shortcomings.

PERFORMANCE TESTING:

Description: This is required to assure that the application performs adequately, having the capability to handle many peers, delivering its results in the expected time, and using an acceptable level of resources; it is an aspect of operational management.
Expected result: Should handle large input values and produce accurate results in the expected time.

5.2.6 RELIABILITY TESTING:

Software reliability is the ability of a system or component to perform its required functions under stated conditions for a specified period of time, and this is what is ensured in this testing. Reliability can be expressed as the ability of the software to reveal defects under testing conditions, according to the specified requirements. It is the probability that a software system will operate without failure under given conditions for a given time interval, and it focuses on the behavior of the software element. It forms a part of the software quality control team's work.

RELIABILITY TESTING:

Description: This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in providing the application.
Expected result: In case of failure of the server, an alternate server should take over the job.

5.2.7 SECURITY TESTING:

Security testing evaluates system characteristics that relate to the availability, integrity and confidentiality of the system data and services. Users/Clients should be encouraged to make sure their security needs are very clearly known at requirements time, so that the security issues can be addressed by the designers and testers.

SECURITY TESTING:

Description: Check that the user identification is authenticated.
Expected result: In case of failure, it should not be connected to the framework.

Description: Check whether group keys in a tree are shared by all peers.
Expected result: The peers should know the group key of the same group.

5.2.8 WHITE BOX TESTING:

White box testing, sometimes called glass-box testing, is a test case design method that uses the control structure of the procedural design to derive test cases. Using the white box testing method, the software engineer can derive test cases that exercise the program's internal logic. White box testing focuses on the inner structure of the software to be tested.

WHITE BOX TESTING:

Description: Exercise all logical decisions on their true and false sides.
Expected result: All the logical decisions must be valid.

Description: Execute all loops at their boundaries and within their operational bounds.
Expected result: All the loops must be finite.

Description: Exercise internal data structures to ensure their validity.
Expected result: All the data structures must be valid.

5.2.9 BLACK BOX TESTING:

Black box testing, also called behavioral testing, focuses on the functional requirements of the software. That is, black box testing enables the software engineer to derive sets of input conditions that will fully exercise all functional requirements for a program. Black box testing is not an alternative to white box techniques. Rather, it is a complementary approach that is likely to uncover a different class of errors than white box methods. Black box testing attempts to find errors related to inputs, outputs, and the principal functions of a software module. The starting point of black box testing is either a specification or code. The contents of the box are hidden, and the stimulated software should produce the desired results.

BLACK BOX TESTING:

Description: Check for incorrect or missing functions.
Expected result: All the functions must be valid.

Description: Check for interface errors.
Expected result: The entire interface must function normally.

Description: Check for errors in data structures or external database access.
Expected result: Database update and retrieval must be done correctly.

Description: Check for initialization and termination errors.
Expected result: All the functions and data structures must be initialized properly and terminated normally.

All of the above system testing strategies were carried out during development, as the documentation and institutionalization of the proposed goals and related policies are essential.

CHAPTER 6

6.0 SOFTWARE DESCRIPTION:

 

6.1 JAVA TECHNOLOGY:

Java technology is both a programming language and a platform.

The Java Programming Language

The Java programming language is a high-level language that can be characterized by all of the following buzzwords:

  • Simple
    • Architecture neutral
    • Object oriented
    • Portable
    • Distributed     
    • High performance
    • Interpreted     
    • Multithreaded
    • Robust
    • Dynamic
    • Secure     

With most programming languages, you either compile or interpret a program so that you can run it on your computer. The Java programming language is unusual in that a program is both compiled and interpreted. With the compiler, first you translate a program into an intermediate language called Java byte codes —the platform-independent codes interpreted by the interpreter on the Java platform. The interpreter parses and runs each Java byte code instruction on the computer. Compilation happens just once; interpretation occurs each time the program is executed. The following figure illustrates how this works.

You can think of Java byte codes as the machine code instructions for the Java Virtual Machine (Java VM). Every Java interpreter, whether it’s a development tool or a Web browser that can run applets, is an implementation of the Java VM. Java byte codes help make “write once, run anywhere” possible. You can compile your program into byte codes on any platform that has a Java compiler. The byte codes can then be run on any implementation of the Java VM. That means that as long as a computer has a Java VM, the same program written in the Java programming language can run on Windows 2000, a Solaris workstation, or on an iMac.
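As a concrete illustration of the compile-then-interpret workflow described above, here is a minimal example; the class name is an illustrative choice, and the javac/java commands shown in the comments are the standard JDK tools.

```java
// HelloJava.java -- compiled once to platform-independent bytecode, then
// interpreted/executed by any Java VM.
//
//   javac HelloJava.java   ->  produces HelloJava.class (bytecode)
//   java HelloJava         ->  the Java VM loads and runs the bytecode
public class HelloJava {
    public static void main(String[] args) {
        System.out.println("Hello from the Java platform!");
    }
}
```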

6.2 THE JAVA PLATFORM:

A platform is the hardware or software environment in which a program runs. We’ve already mentioned some of the most popular platforms like Windows 2000, Linux, Solaris, and MacOS. Most platforms can be described as a combination of the operating system and hardware. The Java platform differs from most other platforms in that it’s a software-only platform that runs on top of other hardware-based platforms.

The Java platform has two components:

  • The Java Virtual Machine (Java VM)
  • The Java Application Programming Interface (Java API)

You’ve already been introduced to the Java VM. It’s the base for the Java platform and is ported onto various hardware-based platforms.

The Java API is a large collection of ready-made software components that provide many useful capabilities, such as graphical user interface (GUI) widgets. The Java API is grouped into libraries of related classes and interfaces; these libraries are known as packages. The next section, What Can Java Technology Do?, highlights the functionality that some of the packages in the Java API provide.

The following figure depicts a program that’s running on the Java platform. As the figure shows, the Java API and the virtual machine insulate the program from the hardware.

Native code is code that, after compilation, runs on a specific hardware platform. As a platform-independent environment, the Java platform can be a bit slower than native code. However, smart compilers, well-tuned interpreters, and just-in-time bytecode compilers can bring performance close to that of native code without threatening portability.

6.3 WHAT CAN JAVA TECHNOLOGY DO?

The most common types of programs written in the Java programming language are applets and applications. If you’ve surfed the Web, you’re probably already familiar with applets. An applet is a program that adheres to certain conventions that allow it to run within a Java-enabled browser.

However, the Java programming language is not just for writing cute, entertaining applets for the Web. The general-purpose, high-level Java programming language is also a powerful software platform. Using the generous API, you can write many types of programs.

An application is a standalone program that runs directly on the Java platform. A special kind of application known as a server serves and supports clients on a network. Examples of servers are Web servers, proxy servers, mail servers, and print servers. Another specialized program is a servlet.

A servlet can almost be thought of as an applet that runs on the server side. Java Servlets are a popular choice for building interactive web applications, replacing the use of CGI scripts. Servlets are similar to applets in that they are runtime extensions of applications. Instead of working in browsers, though, servlets run within Java Web servers, configuring or tailoring the server.
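Since the project's front end runs as JSP/servlets on Tomcat (per the software requirements above), a minimal servlet sketch may help; the class name and output are illustrative assumptions, and the API shown is the standard javax.servlet one.

```java
import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Minimal servlet: runs inside a Java web server such as Apache Tomcat and
// answers HTTP GET requests. Map it to a URL (e.g., /hello) in web.xml.
public class HelloServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        response.setContentType("text/html");
        PrintWriter out = response.getWriter();
        out.println("<html><body><h1>Hello from a servlet</h1></body></html>");
    }
}
```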

How does the API support all these kinds of programs? It does so with packages of software components that provide a wide range of functionality. Every full implementation of the Java platform gives you the following features:

  • The essentials: Objects, strings, threads, numbers, input and output, data structures, system properties, date and time, and so on.
  • Applets: The set of conventions used by applets.
  • Networking: URLs, TCP (Transmission Control Protocol) sockets, UDP (User Datagram Protocol) sockets, and IP (Internet Protocol) addresses.
  • Internationalization: Help for writing programs that can be localized for users worldwide. Programs can automatically adapt to specific locales and be displayed in the appropriate language.
  • Security: Both low level and high level, including electronic signatures, public and private key management, access control, and certificates.
  • Software components: Known as JavaBeansTM, can plug into existing component architectures.
  • Object serialization: Allows lightweight persistence and communication via Remote Method Invocation (RMI).
  • Java Database Connectivity (JDBCTM): Provides uniform access to a wide range of relational databases.

The Java platform also has APIs for 2D and 3D graphics, accessibility, servers, collaboration, telephony, speech, animation, and more. The following figure depicts what is included in the Java 2 SDK.

 

6.4 WILL JAVA TECHNOLOGY CHANGE MY LIFE?

We can’t promise you fame, fortune, or even a job if you learn the Java programming language. Still, it is likely to make your programs better and requires less effort than other languages. We believe that Java technology will help you do the following:

  • Get started quickly: Although the Java programming language is a powerful object-oriented language, it’s easy to learn, especially for programmers already familiar with C or C++.
  • Write less code: Comparisons of program metrics (class counts, method counts, and so on) suggest that a program written in the Java programming language can be four times smaller than the same program in C++.
  • Write better code: The Java programming language encourages good coding practices, and its garbage collection helps you avoid memory leaks. Its object orientation, its JavaBeans component architecture, and its wide-ranging, easily extendible API let you reuse other people’s tested code and introduce fewer bugs.
  • Develop programs more quickly: Your development time may be as much as twice as fast versus writing the same program in C++. Why? You write fewer lines of code and it is a simpler programming language than C++.
  • Avoid platform dependencies with 100% Pure Java: You can keep your program portable by avoiding the use of libraries written in other languages. The 100% Pure JavaTM Product Certification Program has a repository of historical process manuals, white papers, brochures, and similar materials online.
  • Write once, run anywhere: Because 100% Pure Java programs are compiled into machine-independent byte codes, they run consistently on any Java platform.
  • Distribute software more easily: You can upgrade applets easily from a central server. Applets take advantage of the feature of allowing new classes to be loaded “on the fly,” without recompiling the entire program.

 

6.5 ODBC:

Microsoft Open Database Connectivity (ODBC) is a standard programming interface for application developers and database systems providers. Before ODBC became a de facto standard for Windows programs to interface with database systems, programmers had to use proprietary languages for each database they wanted to connect to. Now, ODBC has made the choice of the database system almost irrelevant from a coding perspective, which is as it should be. Application developers have much more important things to worry about than the syntax that is needed to port their program from one database to another when business needs suddenly change.

Through the ODBC Administrator in Control Panel, you can specify the particular database that is associated with a data source that an ODBC application program is written to use. Think of an ODBC data source as a door with a name on it. Each door will lead you to a particular database. For example, the data source named Sales Figures might be a SQL Server database, whereas the Accounts Payable data source could refer to an Access database. The physical database referred to by a data source can reside anywhere on the LAN.

The ODBC system files are not installed on your system by Windows 95. Rather, they are installed when you setup a separate database application, such as SQL Server Client or Visual Basic 4.0. When the ODBC icon is installed in Control Panel, it uses a file called ODBCINST.DLL. It is also possible to administer your ODBC data sources through a stand-alone program called ODBCADM.EXE. There is a 16-bit and a 32-bit version of this program and each maintains a separate list of ODBC data sources.

From a programming perspective, the beauty of ODBC is that the application can be written to use the same set of function calls to interface with any data source, regardless of the database vendor. The source code of the application doesn’t change whether it talks to Oracle or SQL Server. We only mention these two as an example. There are ODBC drivers available for several dozen popular database systems. Even Excel spreadsheets and plain text files can be turned into data sources. The operating system uses the Registry information written by ODBC Administrator to determine which low-level ODBC drivers are needed to talk to the data source (such as the interface to Oracle or SQL Server). The loading of the ODBC drivers is transparent to the ODBC application program. In a client/server environment, the ODBC API even handles many of the network issues for the application programmer.
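As a small illustration of the "same function calls regardless of vendor" point, the sketch below opens an ODBC data source from Java through the JDBC-ODBC bridge driver that shipped with JDK 1.7 (it was removed in Java 8). The data source name "SalesFigures" and the table and column names are hypothetical examples.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Connects to an ODBC data source configured in the ODBC Administrator.
// The DSN "SalesFigures" and the table/column names are illustrative only.
public class OdbcBridgeExample {
    public static void main(String[] args) throws Exception {
        Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");   // bridge driver (JDK 1.7 and earlier)
        try (Connection con = DriverManager.getConnection("jdbc:odbc:SalesFigures");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT region, total FROM sales")) {
            while (rs.next()) {
                System.out.println(rs.getString("region") + " = " + rs.getDouble("total"));
            }
        }
    }
}
```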

The advantages of this scheme are so numerous that you are probably thinking there must be some catch. The only disadvantage of ODBC is that it isn’t as efficient as talking directly to the native database interface. ODBC has had many detractors make the charge that it is too slow. Microsoft has always claimed that the critical factor in performance is the quality of the driver software that is used. In our humble opinion, this is true. The availability of good ODBC drivers has improved a great deal recently. And anyway, the criticism about performance is somewhat analogous to those who said that compilers would never match the speed of pure assembly language. Maybe not, but the compiler (or ODBC) gives you the opportunity to write cleaner programs, which means you finish sooner. Meanwhile, computers get faster every year.

6.6 JDBC:

In an effort to set an independent database standard API for Java; Sun Microsystems developed Java Database Connectivity, or JDBC. JDBC offers a generic SQL database access mechanism that provides a consistent interface to a variety of RDBMSs. This consistent interface is achieved through the use of “plug-in” database connectivity modules, or drivers. If a database vendor wishes to have JDBC support, he or she must provide the driver for each platform that the database and Java run on.

To gain a wider acceptance of JDBC, Sun based JDBC’s framework on ODBC. As you discovered earlier in this chapter, ODBC has widespread support on a variety of platforms. Basing JDBC on ODBC will allow vendors to bring JDBC drivers to market much faster than developing a completely new connectivity solution.

JDBC was announced in March of 1996. It was released for a 90 day public review that ended June 8, 1996. Because of user input, the final JDBC v1.0 specification was released soon after.

The remainder of this section will cover enough information about JDBC for you to know what it is about and how to use it effectively. This is by no means a complete overview of JDBC. That would fill an entire book.

 

6.7 JDBC Goals:

Few software packages are designed without goals in mind. JDBC is one that, because of its many goals, drove the development of the API. These goals, in conjunction with early reviewer feedback, have finalized the JDBC class library into a solid framework for building database applications in Java.

The goals that were set for JDBC are important. They will give you some insight as to why certain classes and functionalities behave the way they do. The eight design goals for JDBC are as follows:

SQL Level API

The designers felt that their main goal was to define a SQL interface for Java. Although not the lowest database interface level possible, it is at a low enough level for higher-level tools and APIs to be created. Conversely, it is at a high enough level for application programmers to use it confidently. Attaining this goal allows for future tool vendors to “generate” JDBC code and to hide many of JDBC’s complexities from the end user.

SQL Conformance

SQL syntax varies as you move from database vendor to database vendor. In an effort to support a wide variety of vendors, JDBC will allow any query statement to be passed through it to the underlying database driver. This allows the connectivity module to handle non-standard functionality in a manner that is suitable for its users.

JDBC must be implemental on top of common database interfaces

The JDBC SQL API must “sit” on top of other common SQL level APIs. This goal allows JDBC to use existing ODBC level drivers by the use of a software interface. This interface would translate JDBC calls to ODBC and vice versa.

  1. Provide a Java interface that is consistent with the rest of the Java system

Because of Java’s acceptance in the user community thus far, the designers feel that they should not stray from the current design of the core Java system.

  • Keep it simple

This goal probably appears in all software design goal listings. JDBC is no exception. Sun felt that the design of JDBC should be very simple, allowing for only one method of completing a task per mechanism. Allowing duplicate functionality only serves to confuse the users of the API.

  • Use strong, static typing wherever possible

Strong typing allows more error checking to be done at compile time; also, fewer errors appear at runtime.

  • Keep the common cases simple

Because more often than not, the usual SQL calls used by the programmer are simple SELECT’s, INSERT’s, DELETE’s and UPDATE’s, these queries should be simple to perform with JDBC. However, more complex SQL statements should also be possible.
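To show how simple the common cases are, here is a minimal JDBC sketch against the MySQL back end named in the software requirements; the JDBC URL, credentials, database, and table are hypothetical placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Minimal JDBC usage: one INSERT and one SELECT with parameterized queries.
// Database name, table, and credentials below are illustrative assumptions.
public class JdbcExample {
    public static void main(String[] args) throws Exception {
        Class.forName("com.mysql.jdbc.Driver");  // MySQL Connector/J driver (JDK 1.7 era)
        String url = "jdbc:mysql://localhost:3306/projectdb";
        try (Connection con = DriverManager.getConnection(url, "user", "password")) {
            try (PreparedStatement insert =
                         con.prepareStatement("INSERT INTO users(name, email) VALUES (?, ?)")) {
                insert.setString(1, "alice");
                insert.setString(2, "alice@example.com");
                insert.executeUpdate();
            }
            try (PreparedStatement query =
                         con.prepareStatement("SELECT name, email FROM users WHERE name = ?")) {
                query.setString(1, "alice");
                try (ResultSet rs = query.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString("name") + " <" + rs.getString("email") + ">");
                    }
                }
            }
        }
    }
}
```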

Finally, we decided to proceed with the implementation using Java Networking.

And for dynamically updating the cache table we go for MS Access database.

Java has two things: a programming language and a platform.

Java is a high-level programming language that is all of the following

  • Simple
  • Architecture-neutral
  • Object-oriented
  • Portable
  • Distributed
  • High-performance
  • Interpreted
  • Multithreaded
  • Robust
  • Dynamic
  • Secure

Java is also unusual in that each Java program is both compiled and interpreted. With a compiler, you translate a Java program into an intermediate language called Java bytecode; the platform-independent code instructions are then passed to and run by the interpreter on the computer.

Compilation happens just once; interpretation occurs each time the program is executed. The figure illustrates how this works.

6.7 NETWORKING TCP/IP STACK:

The TCP/IP stack is shorter than the OSI one:

TCP is a connection-oriented protocol; UDP (User Datagram Protocol) is a connectionless protocol.

IP datagram’s:

The IP layer provides a connectionless and unreliable delivery system. It considers each datagram independently of the others. Any association between datagrams must be supplied by the higher layers. The IP layer supplies a checksum that includes its own header. The header includes the source and destination addresses. The IP layer handles routing through an Internet. It is also responsible for breaking up large datagrams into smaller ones for transmission and reassembling them at the other end.

UDP:

UDP is also connectionless and unreliable. What it adds to IP is a checksum for the contents of the datagram and port numbers. These are used to give a client/server model – see later.

TCP:

TCP supplies logic to give a reliable connection-oriented protocol above IP. It provides a virtual circuit that two processes can use to communicate.

Internet addresses

In order to use a service, you must be able to find it. The Internet uses an address scheme for machines so that they can be located. The address is a 32 bit integer which gives the IP address.

Network address:

Class A uses 8 bits for the network address with 24 bits left over for other addressing. Class B uses 16 bit network addressing. Class C uses 24 bit network addressing and class D uses all 32.

Subnet address:

Internally, the UNIX network is divided into sub networks. Building 11 is currently on one sub network and uses 10-bit addressing, allowing 1024 different hosts.

Host address:

8 bits are finally used for host addresses within our subnet. This places a limit of 256 machines that can be on the subnet.

Total address:

The 32 bit address is usually written as 4 integers separated by dots.

Port addresses

A service exists on a host, and is identified by its port. This is a 16 bit number. To send a message to a server, you send it to the port for that service of the host that it is running on. This is not location transparency! Certain of these ports are “well known”.

Sockets:

A socket is a data structure maintained by the system to handle network connections. A socket is created using the call socket. It returns an integer that is like a file descriptor. In fact, under Windows, this handle can be used with Read File and Write File functions.

#include <sys/types.h>
#include <sys/socket.h>
int socket(int family, int type, int protocol);

Here “family” will be AF_INET for IP communications, protocol will be zero, and type will depend on whether TCP or UDP is used. Two processes wishing to communicate over a network create a socket each. These are similar to two ends of a pipe – but the actual pipe does not yet exist.
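Since the implementation uses Java Networking (as noted earlier), the equivalent of this socket pair in Java is a ServerSocket/Socket pair. Below is a minimal sketch of a one-line TCP echo exchange; the port number and class name are illustrative choices.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

// Minimal TCP example in Java: a server that echoes one line back to a client.
// A client would connect with: new Socket("localhost", 5000) and use the streams.
public class EchoServer {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(5000)) {   // port 5000 is arbitrary
            System.out.println("Listening on port 5000 ...");
            try (Socket client = server.accept();
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(client.getInputStream()));
                 PrintWriter out = new PrintWriter(client.getOutputStream(), true)) {
                String line = in.readLine();    // read one line from the client
                out.println("echo: " + line);   // send it back
            }
        }
    }
}
```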

6.8 JFREE CHART:

JFreeChart is a free 100% Java chart library that makes it easy for developers to display professional quality charts in their applications. JFreeChart’s extensive feature set includes:

A consistent and well-documented API, supporting a wide range of chart types;

A flexible design that is easy to extend, and targets both server-side and client-side applications;

Support for many output types, including Swing components, image files (including PNG and JPEG), and vector graphics file formats (including PDF, EPS and SVG);

JFreeChart is “open source” or, more specifically, free software. It is distributed under the terms of the GNU Lesser General Public Licence (LGPL), which permits use in proprietary applications.
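As a small usage sketch (not taken from this document), the following builds a pie chart with the JFreeChart API and saves it to a PNG file; the dataset categories and values are made up for illustration.

```java
import java.io.File;
import org.jfree.chart.ChartFactory;
import org.jfree.chart.ChartUtilities;
import org.jfree.chart.JFreeChart;
import org.jfree.data.general.DefaultPieDataset;

// Builds a simple pie chart and writes it to chart.png.
// Category names and values are illustrative only.
public class PieChartExample {
    public static void main(String[] args) throws Exception {
        DefaultPieDataset dataset = new DefaultPieDataset();
        dataset.setValue("Query hits", 45);
        dataset.setValue("Cache hits", 30);
        dataset.setValue("Misses", 25);

        JFreeChart chart = ChartFactory.createPieChart(
                "Request breakdown",  // chart title
                dataset,              // data
                true,                 // include legend
                true,                 // tooltips
                false);               // URLs
        ChartUtilities.saveChartAsPNG(new File("chart.png"), chart, 640, 480);
    }
}
```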

 

6.8.1. Map Visualizations:

Charts showing values that relate to geographical areas. Some examples include: (a) population density in each state of the United States, (b) income per capita for each country in Europe, (c) life expectancy in each country of the world. The tasks in this project include: Sourcing freely redistributable vector outlines for the countries of the world, states/provinces in particular countries (USA in particular, but also other areas);

Creating an appropriate dataset interface (plus a default implementation) and a renderer, and integrating these with the existing XYPlot class in JFreeChart; testing, documenting, testing some more, documenting some more.

6.8.2. Time Series Chart Interactivity

Implement a new (to JFreeChart) feature for interactive time series charts — to display a separate control that shows a small version of ALL the time series data, with a sliding “view” rectangle that allows you to select the subset of the time series data to display in the main chart.

6.8.3. Dashboards

There is currently a lot of interest in dashboard displays. Create a flexible dashboard mechanism that supports a subset of JFreeChart chart types (dials, pies, thermometers, bars, and lines/time series) that can be delivered easily via both Java Web Start and an applet.

 

6.8.4. Property Editors

The property editor mechanism in JFreeChart only handles a small subset of the properties that can be set for charts. Extend (or reimplement) this mechanism to provide greater end-user control over the appearance of the charts.

CHAPTER 7

APPENDIX

7.1 SAMPLE SOURCE CODE

7.2 SAMPLE OUTPUT

CHAPTER 8

8.0 CONCLUSION

Generating Searchable Public-Key Ciphertexts with Hidden Structures for Fast Keyword Search

This paper proposes Searchable Public-Key Ciphertexts with Hidden Structures (SPCHS) for keyword search that is as fast as possible without sacrificing the semantic security of the encrypted keywords. In SPCHS, all keyword-searchable ciphertexts are structured by hidden relations, and with the search trapdoor corresponding to a keyword, the minimum information about the relations is disclosed to the search algorithm as guidance to find all matching ciphertexts efficiently.

We construct a SPCHS scheme from scratch in which the ciphertexts have a hidden star-like structure. We prove our scheme to be semantically secure in the Random Oracle (RO) model. The search complexity of our scheme is dependent on the actual number of the ciphertexts containing the queried keyword, rather than the number of all ciphertexts.

Finally, we present a generic SPCHS construction from anonymous identity-based encryption and collision-free full-identity malleable Identity-Based Key Encapsulation Mechanism (IBKEM) with anonymity. We illustrate two collision-free full-identity malleable IBKEM instances, which are semantically secure and anonymous, respectively, in the RO and standard models. The latter instance enables us to construct an SPCHS scheme with semantic security in the standard model.

1.2 INTRODUCTION:

We start by formally defining the concept of Searchable Public-key Ciphertexts with Hidden Structures (SPCHS) and its semantic security. In this new concept, keyword-searchable ciphertexts with their hidden structures can be generated in the public key setting; with a keyword search trapdoor, partial relations can be disclosed to guide the discovery of all matching ciphertexts. Semantic security is defined for both the keywords and the hidden structures. It is worth noting that this new concept and its semantic security are suitable for keyword-searchable ciphertexts with any kind of hidden structures. In contrast, the concept of traditional PEKS does not contain any hidden structure among the PEKS ciphertexts; correspondingly, its semantic security is only defined for the keywords. Following the SPCHS definition, we construct a simple SPCHS from scratch in the random oracle (RO) model. The scheme generates keyword-searchable ciphertexts with a hidden star-like structure. The search performance mainly depends on the actual number of the ciphertexts containing the queried keyword. For security, the scheme is proven semantically secure based on the Decisional Bilinear Diffie-Hellman (DBDH) assumption in the RO model.

We build a generic SPCHS construction with Identity-Based Encryption (IBE) and collision-free full-identity malleable IBKEM. The resulting SPCHS can generate keyword-searchable ciphertexts with a hidden star-like structure. Moreover, if both the underlying IBKEM and IBE have semantic security and anonymity (i.e., the privacy of receivers' identities), the resulting SPCHS is semantically secure. As there are known IBE schemes [4], [5], [6], [7] in both the RO model and the standard model, an SPCHS construction is reduced to collision-free full-identity malleable IBKEM with anonymity. We proposed several IBKEM schemes to construct Verifiable Random Functions (VRF). We show that one of these IBKEM schemes is anonymous and collision-free full-identity malleable in the RO model. We transform this IBE scheme into a collision-free full-identity malleable IBKEM scheme with semantic security and anonymity in the standard model. Hence, this new IBKEM scheme allows us to build SPCHS schemes secure in the standard model with the same search performance as the previous SPCHS construction from scratch in the RO model.
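As a purely structural illustration (not the cryptographic construction, which hides these relations under encryption), the sketch below conveys the star-like idea: ciphertexts containing the same keyword are chained, and a keyword trapdoor reveals only the head of that chain, so search cost tracks the number of matching ciphertexts rather than all ciphertexts. All class, field, and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plaintext stand-in for the hidden star-like structure in SPCHS.
// In the real scheme the per-keyword "head" and "next" pointers are hidden
// inside the ciphertexts and are only revealed by a keyword trapdoor.
public class StarStructureSketch {
    static class CiphertextNode {
        String payload;          // stands in for a keyword-searchable ciphertext
        CiphertextNode next;     // hidden relation to the next ciphertext with the same keyword
        CiphertextNode(String payload) { this.payload = payload; }
    }

    private final Map<String, CiphertextNode> heads = new HashMap<>(); // keyword -> chain head

    /** "Encrypt": prepend the new ciphertext to the chain of its keyword. */
    void add(String keyword, String payload) {
        CiphertextNode node = new CiphertextNode(payload);
        node.next = heads.get(keyword);
        heads.put(keyword, node);
    }

    /** "Search with trapdoor": walk only the matching chain, not all ciphertexts. */
    List<String> search(String keywordTrapdoor) {
        List<String> matches = new ArrayList<>();
        for (CiphertextNode n = heads.get(keywordTrapdoor); n != null; n = n.next) {
            matches.add(n.payload);
        }
        return matches;
    }
}
```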

1.3 LITERATURE SURVEY

TITLE: FUZZY KEYWORD SEARCH OVER ENCRYPTED DATA IN CLOUD COMPUTING

AUTHOR: Li J., Wang Q., Wang C., Cao N., Ren K., Lou W

PUBLISH:  IEEE INFOCOM 2010, pp. 1-5. (2010)

EXPLANATION:

As Cloud Computing becomes prevalent, more and more sensitive information is being centralized into the cloud. For the protection of data privacy, sensitive data usually have to be encrypted before outsourcing, which makes effective data utilization a very challenging task. Although traditional searchable encryption schemes allow a user to securely search over encrypted data through keywords and selectively retrieve files of interest, these techniques support only exact keyword search. That is, there is no tolerance of minor typos and format inconsistencies which, on the other hand, are typical user searching behavior and happen very frequently. This significant drawback makes existing techniques unsuitable in Cloud Computing as it greatly affects system usability, rendering user searching experiences very frustrating and system efficacy very low. In this paper, for the first time we formalize and solve the problem of effective fuzzy keyword search over encrypted cloud data while maintaining keyword privacy. Fuzzy keyword search greatly enhances system usability by returning the matching files when users' searching inputs exactly match the predefined keywords, or the closest possible matching files based on keyword similarity semantics when the exact match fails. In our solution, we exploit edit distance to quantify keyword similarity and develop an advanced technique for constructing fuzzy keyword sets, which greatly reduces the storage and representation overheads. Through rigorous security analysis, we show that our proposed solution is secure and privacy-preserving, while correctly realizing the goal of fuzzy keyword search.
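
To make the edit-distance metric used above concrete, the following is a minimal Java sketch of the standard Levenshtein distance; it is an illustration only (the class name and the sample keywords are our own), not code from the cited paper.

public final class EditDistance {

    // Minimum number of single-character insertions, deletions and
    // substitutions needed to turn keyword a into keyword b.
    public static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        // The keyword "castle" and the typo "casle" differ by one edit, so a
        // fuzzy search with tolerance 1 would still return the matching files.
        System.out.println(levenshtein("castle", "casle"));
    }
}

With such a metric, a fuzzy keyword set for a keyword w with tolerance d simply collects all strings within edit distance d of w.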

TITLE: ANONYMOUS FUZZY IDENTITY-BASED ENCRYPTION FOR SIMILARITY SEARCH

AUTHOR: Cheung D. W., Mamoulis N., Wong W. K., Yiu S. M., Zhang

PUBLISH: ISAAC 2010. LNCS, vol. 6505, pp. 61-72. Springer, Heidelberg (2010)

EXPLANATION:

The predicate that was studied in the very beginning is "exact keyword matching", that is, whether the value hidden by the token is equal to the attribute value hidden in the ciphertext. Schemes that only provide data item security are basically "Identity-Based Encryption". Schemes protecting both the data item and the attributes were initiated first in the private-key setting and later in the public-key setting; the relationship between such schemes and "Anonymous Identity-Based Encryption" was then revisited. Range query as the predicate was also considered. Boneh et al. devised an Augmented Broadcast Encryption which allows checking if the attribute value falls within a range on encrypted data. Their scheme also provides attribute protection. Then, Boneh and Waters extended it to multi-dimensional range query.

However, there was no practical scheme supporting this predicate with attribute protection in the public-key setting. Prior work investigated this problem in the private-key setting and is IND2-CKA secure; a related scheme is in a public-key setting, but it requires the threshold value t to be fixed at setup time. Our work uses as a framework the schemes provided for handling predicates represented as inner products. Their formulation of using inner products with bounded disjunction is powerful. We show how to reduce inner products to the Hamming distance similarity comparison predicate, and then derive a slightly different encryption scheme for better performance when considering the inequality case. In our work, we consider the problem of attribute protection in the public-key setting. In some applications, people may also want to provide protection to the predicate ("the token"), which is inherently unachievable in the public-key setting. Note that a predicate encryption scheme supporting inner products in the private-key setting has been devised which can provide predicate privacy.

TITLE: TRAPDOOR PRIVACY IN ASYMMETRIC SEARCHABLE ENCRYPTION SCHEMES

AUTHOR: Arriaga A., Tang Q., Ryan P

PUBLISH: AFRICACRYPT 2014. LNCS, vol. 8469, pp. 31-50. Springer, Heidelberg (2014)

EXPLANATION:

Asymmetric searchable encryption allows searches to be carried over ciphertexts, through delegation, and by means of trapdoors issued by the owner of the data. Public Key Encryption with Keyword Search (PEKS) is a primitive with such functionality that provides delegation of exact-match searches. As it is important that ciphertexts preserve data privacy, it is also important that trapdoors do not expose the user’s search criteria. The difficulty of formalizing a security model for trapdoor privacy lies in the verification functionality, which gives the adversary the power of verifying if a trapdoor encodes a particular keyword. In this paper, we provide a broader view on what can be achieved regarding trapdoor privacy in asymmetric searchable encryption schemes, and bridge the gap between previous definitions, which give limited privacy guarantees in practice against search patterns. We propose the notion of Strong Search Pattern Privacy for PEKS and construct a scheme that achieves this security notion.

CHAPTER 2

2.0 SYSTEM ANALYSIS

2.1 EXISTING SYSTEM:

Existing semantically secure PEKS schemes take search time linear with the total number of all ciphertexts. This makes retrieval from large-scale databases prohibitive. Therefore, more efficient search performance is crucial for practically deploying PEKS schemes. One of the prominent works to accelerate the search over encrypted keywords in the public-key setting is deterministic encryption, which enables search over encrypted keywords to be as efficient as the search for unencrypted keywords, such that a ciphertext containing a given keyword can be retrieved in time complexity logarithmic in the total number of all ciphertexts.

This is reasonable because the encrypted keywords can form a tree-like structure when stored according to their binary values. However, deterministic encryption has two inherent limitations. First, keyword privacy can be guaranteed only for keywords that are a priori hard-to-guess by the adversary (i.e., keywords with high min-entropy to the adversary); second, certain information of a message leaks inevitably via the ciphertext of the keywords since the encryption is deterministic. Hence, deterministic encryption is only applicable in special scenarios.
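
The tree-like search enabled by deterministic encryption can be illustrated with a small sketch. Here an HMAC over the keyword stands in for a deterministic keyword tag (an assumption made only for illustration; it is not the encryption used by the schemes discussed above), and a sorted map gives the logarithmic lookup time.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public final class DeterministicIndex {

    private final byte[] key;
    // Sorted index: every occurrence of a keyword yields the same tag, so the
    // tags can be kept in a tree and searched in logarithmic time.
    private final TreeMap<String, List<String>> index = new TreeMap<String, List<String>>();

    public DeterministicIndex(byte[] key) { this.key = key; }

    // Deterministic keyword tag (HMAC-SHA256, hex-encoded) used only as an
    // illustration of a deterministic, searchable value.
    private String tag(String keyword) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(key, "HmacSHA256"));
        return new java.math.BigInteger(1, mac.doFinal(keyword.getBytes("UTF-8"))).toString(16);
    }

    public void add(String keyword, String ciphertextId) throws Exception {
        String t = tag(keyword);
        List<String> bucket = index.get(t);
        if (bucket == null) { bucket = new ArrayList<String>(); index.put(t, bucket); }
        bucket.add(ciphertextId);
    }

    // Lookup is O(log n) in the number of distinct tags, unlike the linear
    // scan needed for semantically secure (randomized) ciphertexts.
    public List<String> search(String keyword) throws Exception {
        List<String> bucket = index.get(tag(keyword));
        return bucket != null ? bucket : Collections.<String>emptyList();
    }
}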

Observe that the keyword space usually has low min-entropy in many scenarios. Semantic security is crucial to guarantee keyword privacy in such applications. Thus the linear search complexity of existing schemes is the major obstacle to their adoption. Unfortunately, the linear complexity seems to be inevitable because the server has to scan and test each ciphertext, due to the fact that these ciphertexts (corresponding to the same keyword or not) are indistinguishable to the server.

2.1.1 DISADVANTAGES:

Each sender should be able to generate the keyword-searchable ciphertexts with the hidden star-like structure by the receiver’s public-key; the server having a keyword search trapdoor should be able to disclose partial relations, which is related to all matching ciphertexts. Semantic security is preserved 1) if no keyword search trapdoor is known, all ciphertexts are indistinguishable, and no information is leaked about the structure, and 2) given a keyword search trapdoor, only the corresponding relations can be disclosed, and the matching ciphertexts leak no information about the rest of ciphertexts, except the fact that the rest do not contain the queried keyword.

  • The integrity of data is not ensured in the existing system
  • In the existing system, the public verifier does not check the data across multiple clouds


2.2 PROPOSED SYSTEM:

We propose Searchable Public-key Ciphertexts with Hidden Structures (SPCHS) and define its semantic security. In this new concept, keyword-searchable ciphertexts with their hidden structures can be generated in the public key setting; with a keyword search trapdoor, partial relations can be disclosed to guide the discovery of all matching ciphertexts. Semantic security is defined for both the keywords and the hidden structures. Following the SPCHS definition, we construct a simple SPCHS from scratch in the random oracle (RO) model. The scheme generates keyword-searchable ciphertexts with a hidden star-like structure. The search performance mainly depends on the actual number of the ciphertexts containing the queried keyword.

We are also interested in providing a generic SPCHS construction to generate keyword-searchable ciphertexts with a hidden star-like structure. Our generic SPCHS is inspired by several interesting observations on the Identity-Based Key Encapsulation Mechanism (IBKEM). We build a generic SPCHS construction with Identity-Based Encryption (IBE) and collision-free full-identity malleable IBKEM. The resulting SPCHS can generate keyword-searchable ciphertexts with a hidden star-like structure. Moreover, if both the underlying IBKEM and IBE have semantic security and anonymity (i.e., the privacy of receivers' identities), the resulting SPCHS is semantically secure. As there are known IBE schemes in both the RO model and the standard model, an SPCHS construction is reduced to collision-free full-identity malleable IBKEM.

2.2.1 ADVANTAGES:

Several IBKEM schemes have been proposed to construct Verifiable Random Functions (VRFs) [8]. We show that one of these IBKEM schemes is anonymous and collision-free full-identity malleable in the RO model. Another work utilized the "approximation" of multilinear maps to construct a standard-model version of the Boneh-Franklin (BF) IBE scheme.

We transform this IBE scheme into a collision-free full-identity malleable IBKEM scheme with semantic security and anonymity in the standard model. Hence, this new IBKEM scheme allows us to build SPCHS schemes secure in the standard model with the same search performance as the previous SPCHS construction from scratch in the RO model.

  • In our proposed system, each client has a private key corresponding to his identity (i.e., name, id or any…)
  • The public verifier allows the user access corresponding to his identity (i.e., private key)

2.3 HARDWARE & SOFTWARE REQUIREMENTS:

2.3.1 HARDWARE REQUIREMENT:

  • Processor      –  Pentium IV
  • Speed          –  1.1 GHz
  • RAM            –  256 MB (min)
  • Hard Disk      –  20 GB
  • Floppy Drive   –  1.44 MB
  • Keyboard       –  Standard Windows Keyboard
  • Mouse          –  Two or Three Button Mouse
  • Monitor        –  SVGA

 

2.3.2 SOFTWARE REQUIREMENTS:

  • Operating System   :  Windows XP or Windows 7
  • Front End          :  JAVA JDK 1.7
  • Back End           :  MySQL Server
  • Server             :  Apache Tomcat Server
  • Script             :  JSP Script
  • Document           :  MS-Office 2007

CHAPTER 3

3.0 SYSTEM DESIGN:

Data Flow Diagram / Use Case Diagram / Flow Diagram:

  • The DFD is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on these data, and the output data generated by the system.
  • The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components. These components are the system process, the data used by the process, the external entities that interact with the system, and the information flows in the system.
  • The DFD shows how the information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.
  • A DFD may be used to represent a system at any level of abstraction. It may be partitioned into levels that represent increasing information flow and functional detail.

NOTATION:

SOURCE OR DESTINATION OF DATA:

External sources or destinations, which may be people or organizations or other entities

DATA SOURCE:

Here the data referenced by a process is stored and retrieved.

PROCESS:

People, procedures or devices that produce data. The physical component is not identified.

DATA FLOW:

Data moves in a specific direction from an origin to a destination. The data flow is a “packet” of data.

MODELING RULES:

There are several common modeling rules when creating DFDs:

  1. All processes must have at least one data flow in and one data flow out.
  2. All processes should modify the incoming data, producing new forms of outgoing data.
  3. Each data store must be involved with at least one data flow.
  4. Each external entity must be involved with at least one data flow.
  5. A data flow must be attached to at least one process.


3.1 ARCHITECTURE DIAGRAM


3.2 DATAFLOW DIAGRAM

LEVEL I:


LEVEL II:

UML DIAGRAMS:

3.3 USE CASE DIAGRAM:


3.4 CLASS DIAGRAM:


3.5 SEQUENCE DIAGRAM:

3.6 ACTIVITY DIAGRAM:


CHAPTER 4

4.0 IMPLEMENTATION:

SPCHS SCHEME:

We first explain the intuitions behind SPCHS. We describe a hidden structure formed by ciphertexts as (C, Pri, Pub), where C denotes the set of all ciphertexts, Pri denotes the hidden relations among C, and Pub denotes the public parts. In case there is more than one hidden structure formed by the ciphertexts, the description of the multiple hidden structures can be written as (C, (Pri1, Pub1), …, (PriN, PubN)).

In SPCHS, the encryption algorithm has two functionalities. One is to encrypt a keyword, and the other is to generate a hidden relation, which associates the generated ciphertext to the hidden structure. Let (Pri, Pub) be the hidden structure. The encryption algorithm must take Pri as input; otherwise the hidden relation cannot be generated, since Pub does not contain anything about the hidden relations. At the end of the encryption procedure, Pri should be updated since a hidden relation is newly generated (the specific method to update Pri relies on the specific instance of SPCHS). In addition, SPCHS needs an algorithm to initialize (Pri, Pub) by taking the master public key as input, and this algorithm is run before the first ciphertext is generated. With a keyword search trapdoor, the search algorithm of SPCHS can disclose partial relations to guide the discovery of the ciphertexts containing the queried keyword within the hidden structure.
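
The algorithms described above can be summarized in a small, illustrative Java interface. The method names and placeholder types below are our own assumptions for readability; they are a sketch of the five roles (setup, structure initialization, structured encryption, trapdoor generation and structured search), not the authors' reference implementation.

import java.util.List;

public interface Spchs {

    // SystemSetup: outputs the master public key for senders and the master
    // secret key for the receiver.
    KeyPair systemSetup(int securityParameter);

    // StructureInitialization: initializes a hidden structure (Pri, Pub) under
    // the master public key; run once before the first ciphertext is generated.
    HiddenStructure structureInitialization(MasterPublicKey mpk);

    // StructuredEncryption: encrypts a keyword and, using Pri, inserts a hidden
    // relation linking the new ciphertext into the structure; Pri is updated.
    Ciphertext structuredEncryption(MasterPublicKey mpk, String keyword,
                                    HiddenStructure structure);

    // Trapdoor: the receiver derives a keyword search trapdoor from the master
    // secret key.
    Trapdoor trapdoor(MasterSecretKey msk, String keyword);

    // StructuredSearch: with the trapdoor and Pub, discloses just enough of the
    // relations to walk from one matching ciphertext to the next, so the cost
    // depends on the number of matches rather than on the number of all ciphertexts.
    List<Ciphertext> structuredSearch(MasterPublicKey mpk, Object pub,
                                      List<Ciphertext> allCiphertexts, Trapdoor trapdoor);

    // Opaque placeholder types for this sketch.
    interface MasterPublicKey {}
    interface MasterSecretKey {}
    interface KeyPair { MasterPublicKey publicKey(); MasterSecretKey secretKey(); }
    interface HiddenStructure {}
    interface Ciphertext {}
    interface Trapdoor {}
}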

4.1 ALGORITHM

IBKEM ALGORITHM:

In this section, we formalize collision-free full-identity malleable IBKEM and a generic SPCHS construction from IBKEM. Our generic construction also relies on a notion of collision-free full-identity malleable IBKEM. The following IBKEM definition is derived from [47]. A difference only appears in algorithm EncapsIBKEM. In order to highlight that the generator of an IBKEM encapsulation knows the chosen random value used in algorithm EncapsIBKEM, we take the random value as an input of the algorithm.

The collision-free full-identity malleable IBKEM implies the following characteristics: all identities’ decryption keys can decapsulate the same encapsulation; all decapsulated keys are collision-free; the generator of the encapsulation can also compute these decapsulated keys; the decapsulated keys of different encapsulations are also collision-free.
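
The four characteristics listed above can be pictured with an illustrative interface. All names below are assumptions made for this sketch; the point is only that the same encapsulation decapsulates to a distinct, collision-free key under every identity, and that the sender who knows the randomness can recompute each of those keys.

public interface Ibkem {

    // EncapsIBKEM: the sender encapsulates under identity id; the random value
    // r is an explicit input, so the generator of the encapsulation knows it.
    Encapsulation encaps(PublicParams pp, String id, byte[] r);

    // DecapsIBKEM: any identity's decryption key can decapsulate the SAME
    // encapsulation, but each identity obtains its own (collision-free) key.
    byte[] decaps(PublicParams pp, DecryptionKey skForSomeId, Encapsulation c);

    // Full-identity malleability: from (r, id') alone the sender can compute
    // the key that identity id' would decapsulate from that encapsulation.
    byte[] keyForIdentity(PublicParams pp, byte[] r, String otherId);

    // Opaque placeholder types for this sketch.
    interface PublicParams {}
    interface DecryptionKey {}
    interface Encapsulation {}
}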

A collision-free full-identity malleable IBKEM scheme may preserve semantic security and anonymity. We incorporate the semantic security and anonymity into Anon-SS-ID-CPA secure IBKEM. But this security is different from the traditional version [47] of the Anon-SS-ID-CPA security due to the full-identity malleability of IBKEM.

4.2 MODULES:

  • USER MODULE
  • IDENTITY BASED ENCRYPTION
  • FAST SEARCHABLE ENCRYPTION
  • SEMANTIC DATA SECURITY

4.3 MODULE DESCRIPTION:

USER MODULE:

  • ADMIN:

This module helps the server to view details and upload files securely. The admin uploads data to the database, views the subscriber details and user details, and tracks the redistribution details, including who sends and who receives the data. The data owner stores large amounts of data in the cloud and accesses it using a secure key provided by the admin after the data is encrypted. The data is encrypted using SECY. The user stores data after the auditor views and verifies it, including any changed data; when the user views the data again, the admin notifies the user of only the changed data.

  • PROVIDER:

In this module, the subscriber chooses a document and downloads the data from the service provider. The subscriber pays the amount to the service provider, and the service provider provides the data key to the subscriber, so the subscriber can download the data using the data key. A cloud computing service provider serves users' service requests by using a server system, which is constructed and maintained by an infrastructure vendor and rented by the service provider.

  • USER:

In this module, users have authentication and security to access the details presented in the ontology system. Before accessing or searching the details, a user should have an account; otherwise, the user should register first by providing details such as user name, password, email and mobile number. We develop this module so that the cloud storage can be made secure.

IDENTITY BASED ENCRYPTION:

Batch identity-based key distribution: A direct application of collision-free full-identity malleable IBKEM is to achieve batch identity-based key distribution. In such an application, a sender would like to distribute different secret session keys to multiple receivers so that each receiver can only know the session key intended for himself/herself. With collision-free full-identity malleable IBKEM, a sender just needs to broadcast an IBKEM encapsulation in the identity-based cryptography setting, e.g., encapsulating a session key K to a single user ID. According to the collision-freeness of IBKEM, each receiver ID′ can decapsulate and obtain a different key K′ with his/her secret key in the identity-based crypto-system. Due to the full-identity malleability, the sender knows the decapsulated keys of all the receivers.
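
Under the assumptions of the Ibkem sketch given in Section 4.1 above, the batch distribution just described reduces to one broadcast encapsulation plus a local loop on the sender's side; the class and method names are illustrative.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public final class BatchKeyDistribution {

    // The sender broadcasts a single encapsulation; thanks to full-identity
    // malleability it can also compute, without any interaction, the distinct
    // session key each listed receiver will obtain by decapsulating it.
    public static Map<String, byte[]> senderSessionKeys(Ibkem ibkem,
                                                        Ibkem.PublicParams pp,
                                                        byte[] randomness,
                                                        List<String> receiverIds) {
        Map<String, byte[]> keys = new HashMap<String, byte[]>();
        for (String id : receiverIds) {
            keys.put(id, ibkem.keyForIdentity(pp, randomness, id));
        }
        return keys;
    }
}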

Anonymous identity-based broadcast encryption: A slightly more complicated application is anonymous identity-based broadcast encryption with efficient decryption. An analogous application has been proposed before; such an application will work if the IBKEM is collision-free full-identity malleable, and it preserves the anonymity of receivers if the IBKEM is anonymous. Note that trivial anonymous broadcast encryption suffers a decryption cost linear with the number of the receivers. In contrast, our anonymous identity-based broadcast encryption enjoys constant decryption cost, plus logarithmic complexity to search for the matching index in a set (K1, …, KN) organized by a certain partial order, e.g., a dictionary order according to their binary representations.
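
The logarithmic index lookup mentioned above amounts to a binary search over the sorted key fingerprints attached to the broadcast ciphertext. The sketch below assumes the fingerprints are already sorted in dictionary order of their binary representation; all names are illustrative.

public final class BroadcastIndexLookup {

    // Returns the position of the receiver's own key fingerprint, or -1 if the
    // receiver is not among the intended recipients.
    public static int payloadIndexOf(byte[][] sortedFingerprints, byte[] myFingerprint) {
        int lo = 0, hi = sortedFingerprints.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = compare(sortedFingerprints[mid], myFingerprint);
            if (cmp == 0) return mid;
            if (cmp < 0) lo = mid + 1; else hi = mid - 1;
        }
        return -1;
    }

    // Dictionary (lexicographic) order on the binary representations.
    private static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }
}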

FAST SEARCHABLE ENCRYPTION:

This module provides as-fast-as-possible search in PEKS with semantic security. We proposed the concept of SPCHS as a variant of PEKS. The new concept allows keyword-searchable ciphertexts to be generated with a hidden structure. Given a keyword search trapdoor, the search algorithm of SPCHS can disclose part of this hidden structure for guidance on finding the ciphertexts of the queried keyword. Semantic security of SPCHS captures the privacy of the keywords and the invisibility of the hidden structures. We proposed an SPCHS scheme from scratch with semantic security in the RO model. The scheme generates keyword-searchable ciphertexts with a hidden star-like structure. It has search complexity mainly linear with the exact number of the ciphertexts containing the queried keyword. It outperforms existing PEKS schemes with semantic security, whose search complexity is linear with the number of all ciphertexts.

We identified several interesting properties, i.e., collision-freeness and full-identity malleability in some IBKEM instances, and formalized these properties to build a generic SPCHS construction. We illustrated two collision-free full-identity malleable IBKEM instances, which are respectively secure in the RO and standard models. SPCHS seems a promising tool to solve some challenging problems in public-key searchable encryption. One application may be to achieve retrieval completeness verification which, to the best of our knowledge, has not been achieved in existing PEKS schemes. Specifically, by forming a hidden ring-like structure, i.e., letting the last hidden pointer always point to the head, one can obtain PEKS allowing to check the completeness of the retrieved ciphertexts by checking whether the pointers of the returned ciphertexts form a ring.
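
The ring-based completeness check suggested above can be sketched as follows. The pointer field stands for the relation disclosed by the search trapdoor; identifiers and field names are illustrative assumptions, not part of the scheme's specification.

import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public final class RingCompletenessCheck {

    public static final class ReturnedCiphertext {
        final String id;      // identifier of this returned ciphertext
        final String nextId;  // pointer disclosed alongside it
        public ReturnedCiphertext(String id, String nextId) {
            this.id = id;
            this.nextId = nextId;
        }
    }

    // True iff following the disclosed pointers from the head visits every
    // returned ciphertext exactly once and arrives back at the head, i.e. the
    // server returned the complete ring of matching ciphertexts.
    public static boolean formsRing(List<ReturnedCiphertext> results, String headId) {
        Map<String, String> next = new HashMap<String, String>();
        for (ReturnedCiphertext c : results) next.put(c.id, c.nextId);
        Set<String> visited = new HashSet<String>();
        String current = headId;
        for (int step = 0; step < results.size(); step++) {
            if (current == null || !next.containsKey(current) || !visited.add(current)) {
                return false; // chain broken or closed too early: results incomplete
            }
            current = next.get(current);
        }
        return headId.equals(current);
    }
}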

SEMANTIC DATA SECURITY:

The SS-CKSA security of the above SPCHS scheme relies on the DBDH assumption in the RO model. Even in the case that a sender gets his local privacy Pri compromised, SPCHS still offers forward security. This means that the existing hidden structure of ciphertexts stays confidential, since the local privacy only contains the relationships of the newly generated ciphertexts. To offer backward security with SPCHS, the sender can initialize a new structure with the algorithm Structure Initialization for the newly generated ciphertexts. A collision-free full-identity malleable IBKEM scheme may preserve semantic security and anonymity.

We incorporate the semantic security and anonymity into Anon-SS-ID-CPA secure IBKEM. But this security is different from the traditional version of the Anon-SS-ID-CPA security due to the full-identity malleability of IBKEM. The difference will be introduced after defining that security. In that security, a PPT adversary is allowed to query the decryption keys for adaptively chosen identities, and to adaptively choose two challenge identities. The Anon-SS-ID-CPA security of IBKEM means that, for a challenge key-and-encapsulation pair, the adversary cannot determine the correctness of this pair or the challenge identity of this pair, given that the adversary does not know the two challenge identities' decryption keys.

The SS-sK-CKSA security of the above generic SPCHS construction relies on the Anon-SS-sID-CPA security of the underlying IBKEM and the Anon-SS-ID-CPA security of the underlying IBE. In the security proof, we prove that if there is an adversary who can break the SS-sK-CKSA security of the above generic SPCHS construction, then there is another adversary who can break the Anon-SS-sID-CPA security of the underlying IBKEM or the Anon-SS-ID-CPA security of the underlying IBE.

CHAPTER 5

5.0 SYSTEM STUDY:

5.1 FEASIBILITY STUDY:

The feasibility of the project is analyzed in this phase and a business proposal is put forth with a very general plan for the project and some cost estimates. During system analysis, the feasibility study of the proposed system is carried out to ensure that the proposed system is not a burden to the company. For feasibility analysis, some understanding of the major requirements for the system is essential.

Three key considerations involved in the feasibility analysis are 

  • ECONOMICAL FEASIBILITY
  • TECHNICAL FEASIBILITY
  • SOCIAL FEASIBILITY

5.1.1 ECONOMICAL FEASIBILITY:     

This study is carried out to check the economic impact that the system will have on the organization. The amount of funds that the company can pour into the research and development of the system is limited. The expenditures must be justified. Thus the developed system is well within the budget, and this was achieved because most of the technologies used are freely available. Only the customized products had to be purchased.

 

5.1.2 TECHNICAL FEASIBILITY   

This study is carried out to check the technical feasibility, that is, the technical requirements of the system. Any system developed must not place a high demand on the available technical resources, as this would lead to high demands being placed on the client. The developed system must have modest requirements, as only minimal or no changes are required for implementing this system.

5.1.3 SOCIAL FEASIBILITY:  

This aspect of the study checks the level of acceptance of the system by the user. This includes the process of training the user to use the system efficiently. The user must not feel threatened by the system; instead, he must accept it as a necessity. The level of acceptance by the users solely depends on the methods that are employed to educate the user about the system and to make him familiar with it. His level of confidence must be raised so that he is also able to make some constructive criticism, which is welcomed, as he is the final user of the system.

5.2 SYSTEM TESTING:

Testing is a process of checking whether the developed system works according to the original objectives and requirements. It is a set of activities that can be planned in advance and conducted systematically. Testing is vital to the success of the system. System testing makes a logical assumption that if all the parts of the system are correct, the goal will be successfully achieved. Inadequate or absent testing leads to errors that may not appear until many months later.

This creates two problems: the time lag between the cause and the appearance of the problem, and the effect of system errors on the files and records within the system. A small system error can conceivably explode into a much larger problem. Effective testing early in the process translates directly into long-term cost savings from a reduced number of errors. Another reason for system testing is its utility as a user-oriented vehicle before implementation. The best program is worthless if it does not produce correct outputs.

5.2.1 UNIT TESTING:

Description: Test for application window properties.
Expected result: All the properties of the windows are to be properly aligned and displayed.

Description: Test for mouse operations.
Expected result: All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions.

A program represents the logical elements of a system. For a program to run satisfactorily, it must compile and test data correctly and tie in properly with other programs. Achieving an error-free program is the responsibility of the programmer. Program testing checks for two types of errors: syntax and logical. A syntax error is a program statement that violates one or more rules of the language in which it is written. An improperly defined field dimension or omitted keywords are common syntax errors. These errors are shown through error messages generated by the computer. For logic errors, the programmer must examine the output carefully.

5.2.2 FUNCTIONAL TESTING:

Functional testing of an application is used to prove the application delivers correct results, using enough inputs to give an adequate level of confidence that it will work correctly for all sets of inputs. The functional testing will need to prove that the application works for each client type and that the personalization functions work correctly. When a program is tested, the actual output is compared with the expected output. When there is a discrepancy, the sequence of instructions must be traced to determine the problem. The process is facilitated by breaking the program into self-contained portions, each of which can be checked at certain key points. The idea is to compare program values against desk-calculated values to isolate the problems.

Description: Test for all modules.
Expected result: All peers should communicate in the group.

Description: Test for various peers in a distributed network framework as it displays all users available in the group.
Expected result: The result after execution should give the accurate result.


5.2.3 NON-FUNCTIONAL TESTING:

Non-functional software testing encompasses a rich spectrum of testing strategies, describing the expected results for every test case. It uses symbolic analysis techniques. This testing is used to check that an application will work in the operational environment. Non-functional testing includes:

  • Load testing
  • Performance testing
  • Usability testing
  • Reliability testing
  • Security testing

5.2.4 LOAD TESTING:

An important tool for implementing system tests is a load generator. A load generator is essential for testing quality requirements such as performance and stress. A load can be a real load; that is, the system can be put under test to real usage by having actual telephone users connected to it. They will generate test input data for the system test.

Description: It is necessary to ascertain that the application behaves correctly under load when a ‘Server busy’ response is received.
Expected result: Should designate another active node as a Server.


5.2.5 PERFORMANCE TESTING:

Performance tests are utilized in order to determine the widely defined performance of the software system such as execution time associated with various parts of the code, response time and device utilization. The intent of this testing is to identify weak points of the software system and quantify its shortcomings.

Description: This is required to assure that the application performs adequately, having the capability to handle many peers, delivering its results in the expected time and using an acceptable level of resources; it is an aspect of operational management.
Expected result: Should handle large input values and produce accurate results in the expected time.


5.2.6 RELIABILITY TESTING:

Software reliability is the ability of a system or component to perform its required functions under stated conditions for a specified period of time, and this is ensured in this testing. Reliability can be expressed as the ability of the software to reveal defects under testing conditions, according to the specified requirements. It is the probability that a software system will operate without failure under given conditions for a given time interval, and it focuses on the behavior of the software element. It forms a part of the work of the software quality control team.

Description: This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in providing the application.
Expected result: In case of failure of the server, an alternate server should take over the job.


5.2.7 SECURITY TESTING:

Security testing evaluates system characteristics that relate to the availability, integrity and confidentiality of the system data and services. Users/Clients should be encouraged to make sure their security needs are very clearly known at requirements time, so that the security issues can be addressed by the designers and testers.

Description: Checking that the user identification is authenticated.
Expected result: In case of failure it should not be connected in the framework.

Description: Check whether group keys in a tree are shared by all peers.
Expected result: The peers should know the group key in the same group.


5.2.8 WHITE BOX TESTING:

White box testing, sometimes called glass-box testing, is a test case design method that uses the control structure of the procedural design to derive test cases. Using the white box testing method, the software engineer can derive test cases. White box testing focuses on the inner structure of the software to be tested.

Description: Exercise all logical decisions on their true and false sides.
Expected result: All the logical decisions must be valid.

Description: Execute all loops at their boundaries and within their operational bounds.
Expected result: All the loops must be finite.

Description: Exercise internal data structures to ensure their validity.
Expected result: All the data structures must be valid.


5.2.9 BLACK BOX TESTING:

Black box testing, also called behavioral testing, focuses on the functional requirements of the software. That is, black box testing enables the software engineer to derive sets of input conditions that will fully exercise all functional requirements for a program. Black box testing is not an alternative to white box techniques. Rather, it is a complementary approach that is likely to uncover a different class of errors than white box methods. Black box testing attempts to find errors by focusing on the inputs, outputs, and principal functions of a software module. The starting point of the black box testing is either a specification or code. The contents of the box are hidden and the stimulated software should produce the desired results.

Description: To check for incorrect or missing functions.
Expected result: All the functions must be valid.

Description: To check for interface errors.
Expected result: The entire interface must function normally.

Description: To check for errors in data structures or external database access.
Expected result: The database update and retrieval must be done correctly.

Description: To check for initialization and termination errors.
Expected result: All the functions and data structures must be initialized properly and terminated normally.

All of the above system testing strategies were carried out during development, as the development, documentation and institutionalization of the proposed goals and related policies are essential.

CHAPTER 6

6.0 SOFTWARE DESCRIPTION:

 

6.1 JAVA TECHNOLOGY:

Java technology is both a programming language and a platform.

 

The Java Programming Language

 

The Java programming language is a high-level language that can be characterized by all of the following buzzwords:

  • Simple
  • Architecture neutral
  • Object oriented
  • Portable
  • Distributed
  • High performance
  • Interpreted
  • Multithreaded
  • Robust
  • Dynamic
  • Secure

With most programming languages, you either compile or interpret a program so that you can run it on your computer. The Java programming language is unusual in that a program is both compiled and interpreted. With the compiler, first you translate a program into an intermediate language called Java byte codes —the platform-independent codes interpreted by the interpreter on the Java platform. The interpreter parses and runs each Java byte code instruction on the computer. Compilation happens just once; interpretation occurs each time the program is executed. The following figure illustrates how this works.

You can think of Java byte codes as the machine code instructions for the Java Virtual Machine (Java VM). Every Java interpreter, whether it’s a development tool or a Web browser that can run applets, is an implementation of the Java VM. Java byte codes help make “write once, run anywhere” possible. You can compile your program into byte codes on any platform that has a Java compiler. The byte codes can then be run on any implementation of the Java VM. That means that as long as a computer has a Java VM, the same program written in the Java programming language can run on Windows 2000, a Solaris workstation, or on an iMac.
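
As a minimal illustration of this compile-once, run-anywhere cycle, the class below is compiled to byte codes with javac HelloWorld.java and then executed by any Java VM with java HelloWorld, regardless of the underlying operating system. The class name and message are illustrative.

// HelloWorld.java
public class HelloWorld {
    public static void main(String[] args) {
        // The same .class file produced by javac runs unchanged on Windows,
        // Solaris, Linux or macOS, because it targets the Java VM.
        System.out.println("Hello from the Java platform");
    }
}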

6.2 THE JAVA PLATFORM:

A platform is the hardware or software environment in which a program runs. We’ve already mentioned some of the most popular platforms like Windows 2000, Linux, Solaris, and MacOS. Most platforms can be described as a combination of the operating system and hardware. The Java platform differs from most other platforms in that it’s a software-only platform that runs on top of other hardware-based platforms.

The Java platform has two components:

  • The Java Virtual Machine (Java VM)
  • The Java Application Programming Interface (Java API)

You’ve already been introduced to the Java VM. It’s the base for the Java platform and is ported onto various hardware-based platforms.

The Java API is a large collection of ready-made software components that provide many useful capabilities, such as graphical user interface (GUI) widgets. The Java API is grouped into libraries of related classes and interfaces; these libraries are known as packages. The next section, What Can Java Technology Do?, highlights what functionality some of the packages in the Java API provide.

The following figure depicts a program that’s running on the Java platform. As the figure shows, the Java API and the virtual machine insulate the program from the hardware.

Native code is code that after you compile it, the compiled code runs on a specific hardware platform. As a platform-independent environment, the Java platform can be a bit slower than native code. However, smart compilers, well-tuned interpreters, and just-in-time byte code compilers can bring performance close to that of native code without threatening portability.

6.3 WHAT CAN JAVA TECHNOLOGY DO?

The most common types of programs written in the Java programming language are applets and applications. If you’ve surfed the Web, you’re probably already familiar with applets. An applet is a program that adheres to certain conventions that allow it to run within a Java-enabled browser.

However, the Java programming language is not just for writing cute, entertaining applets for the Web. The general-purpose, high-level Java programming language is also a powerful software platform. Using the generous API, you can write many types of programs.

An application is a standalone program that runs directly on the Java platform. A special kind of application known as a server serves and supports clients on a network. Examples of servers are Web servers, proxy servers, mail servers, and print servers. Another specialized program is a servlet.

A servlet can almost be thought of as an applet that runs on the server side. Java Servlets are a popular choice for building interactive web applications, replacing the use of CGI scripts. Servlets are similar to applets in that they are runtime extensions of applications. Instead of working in browsers, though, servlets run within Java Web servers, configuring or tailoring the server.
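
The following is a minimal servlet of the kind described above; it runs inside a Java web server such as the Apache Tomcat server listed in the software requirements and answers HTTP GET requests. The class name and the response text are illustrative.

import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class HelloServlet extends HttpServlet {

    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws IOException {
        // Build a small HTML page as the reply to the browser's GET request.
        response.setContentType("text/html");
        PrintWriter out = response.getWriter();
        out.println("<html><body><h1>Hello from a servlet</h1></body></html>");
    }
}

Before the server routes requests to such a class, the servlet still has to be mapped to a URL, for example in web.xml.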

How does the API support all these kinds of programs? It does so with packages of software components that provide a wide range of functionality. Every full implementation of the Java platform gives you the following features:

  • The essentials: Objects, strings, threads, numbers, input and output, data structures, system properties, date and time, and so on.
  • Applets: The set of conventions used by applets.
  • Networking: URLs, TCP (Transmission Control Protocol), UDP (User Datagram Protocol) sockets, and IP (Internet Protocol) addresses.
  • Internationalization: Help for writing programs that can be localized for users worldwide. Programs can automatically adapt to specific locales and be displayed in the appropriate language.
  • Security: Both low level and high level, including electronic signatures, public and private key management, access control, and certificates.
  • Software components: Known as JavaBeansTM, can plug into existing component architectures.
  • Object serialization: Allows lightweight persistence and communication via Remote Method Invocation (RMI).
  • Java Database Connectivity (JDBCTM): Provides uniform access to a wide range of relational databases.

The Java platform also has APIs for 2D and 3D graphics, accessibility, servers, collaboration, telephony, speech, animation, and more. The following figure depicts what is included in the Java 2 SDK.

 

6.4 HOW WILL JAVA TECHNOLOGY CHANGE MY LIFE?

We can’t promise you fame, fortune, or even a job if you learn the Java programming language. Still, it is likely to make your programs better and require less effort than other languages. We believe that Java technology will help you do the following:

  • Get started quickly: Although the Java programming language is a powerful object-oriented language, it’s easy to learn, especially for programmers already familiar with C or C++.
  • Write less code: Comparisons of program metrics (class counts, method counts, and so on) suggest that a program written in the Java programming language can be four times smaller than the same program in C++.
  • Write better code: The Java programming language encourages good coding practices, and its garbage collection helps you avoid memory leaks. Its object orientation, its JavaBeans component architecture, and its wide-ranging, easily extendible API let you reuse other people’s tested code and introduce fewer bugs.
  • Develop programs more quickly: Your development time may be as much as twice as fast versus writing the same program in C++. Why? You write fewer lines of code and it is a simpler programming language than C++.
  • Avoid platform dependencies with 100% Pure Java: You can keep your program portable by avoiding the use of libraries written in other languages. The 100% Pure JavaTM Product Certification Program has a repository of historical process manuals, white papers, brochures, and similar materials online.
  • Write once, run anywhere: Because 100% Pure Java programs are compiled into machine-independent byte codes, they run consistently on any Java platform.
  • Distribute software more easily: You can upgrade applets easily from a central server. Applets take advantage of the feature of allowing new classes to be loaded “on the fly,” without recompiling the entire program.

 

6.5 ODBC:

 

Microsoft Open Database Connectivity (ODBC) is a standard programming interface for application developers and database systems providers. Before ODBC became a de facto standard for Windows programs to interface with database systems, programmers had to use proprietary languages for each database they wanted to connect to. Now, ODBC has made the choice of the database system almost irrelevant from a coding perspective, which is as it should be. Application developers have much more important things to worry about than the syntax that is needed to port their program from one database to another when business needs suddenly change.

Through the ODBC Administrator in Control Panel, you can specify the particular database that is associated with a data source that an ODBC application program is written to use. Think of an ODBC data source as a door with a name on it. Each door will lead you to a particular database. For example, the data source named Sales Figures might be a SQL Server database, whereas the Accounts Payable data source could refer to an Access database. The physical database referred to by a data source can reside anywhere on the LAN.

The ODBC system files are not installed on your system by Windows 95. Rather, they are installed when you setup a separate database application, such as SQL Server Client or Visual Basic 4.0. When the ODBC icon is installed in Control Panel, it uses a file called ODBCINST.DLL. It is also possible to administer your ODBC data sources through a stand-alone program called ODBCADM.EXE. There is a 16-bit and a 32-bit version of this program and each maintains a separate list of ODBC data sources.

From a programming perspective, the beauty of ODBC is that the application can be written to use the same set of function calls to interface with any data source, regardless of the database vendor. The source code of the application doesn’t change whether it talks to Oracle or SQL Server. We only mention these two as an example. There are ODBC drivers available for several dozen popular database systems. Even Excel spreadsheets and plain text files can be turned into data sources. The operating system uses the Registry information written by ODBC Administrator to determine which low-level ODBC drivers are needed to talk to the data source (such as the interface to Oracle or SQL Server). The loading of the ODBC drivers is transparent to the ODBC application program. In a client/server environment, the ODBC API even handles many of the network issues for the application programmer.

The advantages of this scheme are so numerous that you are probably thinking there must be some catch. The only disadvantage of ODBC is that it isn’t as efficient as talking directly to the native database interface. ODBC has had many detractors make the charge that it is too slow. Microsoft has always claimed that the critical factor in performance is the quality of the driver software that is used. In our humble opinion, this is true. The availability of good ODBC drivers has improved a great deal recently. And anyway, the criticism about performance is somewhat analogous to those who said that compilers would never match the speed of pure assembly language. Maybe not, but the compiler (or ODBC) gives you the opportunity to write cleaner programs, which means you finish sooner. Meanwhile, computers get faster every year.

6.6 JDBC:

In an effort to set an independent database standard API for Java; Sun Microsystems developed Java Database Connectivity, or JDBC. JDBC offers a generic SQL database access mechanism that provides a consistent interface to a variety of RDBMSs. This consistent interface is achieved through the use of “plug-in” database connectivity modules, or drivers. If a database vendor wishes to have JDBC support, he or she must provide the driver for each platform that the database and Java run on.
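
As a concrete sketch of the JDBC call pattern, the example below queries the MySQL back end named in the software requirements. The connection URL, credentials, table and column names are illustrative assumptions, not part of the original design.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class JdbcExample {
    public static void main(String[] args) throws Exception {
        // Assumed local MySQL database; the vendor-specific driver is loaded
        // from the classpath by DriverManager.
        String url = "jdbc:mysql://localhost:3306/spchs_demo";
        try (Connection con = DriverManager.getConnection(url, "user", "password");
             PreparedStatement ps = con.prepareStatement(
                     "SELECT file_name FROM uploads WHERE owner = ?")) {
            ps.setString(1, "admin");
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("file_name"));
                }
            }
        }
    }
}

The same code runs against any database for which a JDBC driver is available; only the connection URL changes.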

To gain a wider acceptance of JDBC, Sun based JDBC’s framework on ODBC. As you discovered earlier in this chapter, ODBC has widespread support on a variety of platforms. Basing JDBC on ODBC will allow vendors to bring JDBC drivers to market much faster than developing a completely new connectivity solution.

JDBC was announced in March of 1996. It was released for a 90 day public review that ended June 8, 1996. Because of user input, the final JDBC v1.0 specification was released soon after.

The remainder of this section will cover enough information about JDBC for you to know what it is about and how to use it effectively. This is by no means a complete overview of JDBC. That would fill an entire book.

 

6.7 JDBC Goals:

Few software packages are designed without goals in mind. JDBC is no exception; its many goals drove the development of the API. These goals, in conjunction with early reviewer feedback, have finalized the JDBC class library into a solid framework for building database applications in Java.

The goals that were set for JDBC are important. They will give you some insight as to why certain classes and functionalities behave the way they do. The eight design goals for JDBC are as follows:

SQL Level API

The designers felt that their main goal was to define a SQL interface for Java. Although not the lowest database interface level possible, it is at a low enough level for higher-level tools and APIs to be created. Conversely, it is at a high enough level for application programmers to use it confidently. Attaining this goal allows for future tool vendors to “generate” JDBC code and to hide many of JDBC’s complexities from the end user.

SQL Conformance

SQL syntax varies as you move from database vendor to database vendor. In an effort to support a wide variety of vendors, JDBC will allow any query statement to be passed through it to the underlying database driver. This allows the connectivity module to handle non-standard functionality in a manner that is suitable for its users.

JDBC must be implementable on top of common database interfaces

The JDBC SQL API must “sit” on top of other common SQL level APIs. This goal allows JDBC to use existing ODBC level drivers by the use of a software interface. This interface would translate JDBC calls to ODBC and vice versa.

Provide a Java interface that is consistent with the rest of the Java system

Because of Java’s acceptance in the user community thus far, the designers feel that they should not stray from the current design of the core Java system.

Keep it simple

This goal probably appears in all software design goal listings. JDBC is no exception. Sun felt that the design of JDBC should be very simple, allowing for only one method of completing a task per mechanism. Allowing duplicate functionality only serves to confuse the users of the API.

Use strong, static typing wherever possible

Strong typing allows more error checking to be done at compile time; also, fewer errors appear at runtime.

Keep the common cases simple

Because more often than not, the usual SQL calls used by the programmer are simple SELECT’s, INSERT’s, DELETE’s and UPDATE’s, these queries should be simple to perform with JDBC. However, more complex SQL statements should also be possible.

Finally, we decided to proceed with the implementation using Java Networking.

And for dynamically updating the cache table we go for the MS Access database.

Java has two things: a programming language and a platform.

Java is a high-level programming language that is all of the following

  • Simple
  • Object-oriented
  • Distributed
  • Interpreted
  • Robust
  • Secure
  • Architecture-neutral
  • Portable
  • High-performance
  • Multithreaded
  • Dynamic

Java is also unusual in that each Java program is both compiled and interpreted. With a compiler, you translate a Java program into an intermediate language called Java byte codes, the platform-independent code instructions that are parsed and run on the computer.

Compilation happens just once; interpretation occurs each time the program is executed. The figure illustrates how this works.

6.8 NETWORKING TCP/IP STACK:

The TCP/IP stack is shorter than the OSI one:

TCP is a connection-oriented protocol; UDP (User Datagram Protocol) is a connectionless protocol.

IP datagram’s:

The IP layer provides a connectionless and unreliable delivery system. It considers each datagram independently of the others. Any association between datagrams must be supplied by the higher layers. The IP layer supplies a checksum that includes its own header. The header includes the source and destination addresses. The IP layer handles routing through an Internet. It is also responsible for breaking up large datagrams into smaller ones for transmission and reassembling them at the other end.

UDP:

UDP is also connectionless and unreliable. What it adds to IP is a checksum for the contents of the datagram and port numbers. These are used to give a client/server model – see later.

TCP:

TCP supplies logic to give a reliable connection-oriented protocol above IP. It provides a virtual circuit that two processes can use to communicate.

Internet addresses

In order to use a service, you must be able to find it. The Internet uses an address scheme for machines so that they can be located. The address is a 32 bit integer which gives the IP address.

Network address:

Class A uses 8 bits for the network address with 24 bits left over for other addressing. Class B uses 16 bit network addressing. Class C uses 24 bit network addressing and class D uses all 32.

Subnet address:

Internally, the UNIX network is divided into sub networks. Building 11 is currently on one sub network and uses 10-bit addressing, allowing 1024 different hosts.

Host address:

8 bits are finally used for host addresses within our subnet. This places a limit of 256 machines that can be on the subnet.

Total address:

The 32 bit address is usually written as 4 integers separated by dots.

Port addresses

A service exists on a host, and is identified by its port. This is a 16 bit number. To send a message to a server, you send it to the port for that service of the host that it is running on. This is not location transparency! Certain of these ports are “well known”.

Sockets:

A socket is a data structure maintained by the system to handle network connections. A socket is created using the call socket. It returns an integer that is like a file descriptor. In fact, under Windows, this handle can be used with the ReadFile and WriteFile functions.

#include <sys/types.h>
#include <sys/socket.h>
int socket(int family, int type, int protocol);

Here “family” will be AF_INET for IP communications, protocol will be zero, and type will depend on whether TCP or UDP is used. Two processes wishing to communicate over a network create a socket each. These are similar to two ends of a pipe – but the actual pipe does not yet exist.
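
Since the implementation above is stated to use Java Networking, the Java counterpart of the C socket call is shown below: a TCP client that connects to a host and port, sends one request line and prints the reply. The host, port and protocol line are illustrative.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

public class TcpClientSketch {
    public static void main(String[] args) throws Exception {
        // Connect to an assumed server on localhost:5000 over TCP.
        try (Socket socket = new Socket("localhost", 5000);
             PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream()))) {
            out.println("HELLO");               // send one request line
            System.out.println(in.readLine());  // read and print the reply
        }
    }
}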

6.9 JFREE CHART:

JFreeChart is a free 100% Java chart library that makes it easy for developers to display professional quality charts in their applications. JFreeChart’s extensive feature set includes:

A consistent and well-documented API, supporting a wide range of chart types;

A flexible design that is easy to extend, and targets both server-side and client-side applications;

Support for many output types, including Swing components, image files (including PNG and JPEG), and vector graphics file formats (including PDF, EPS and SVG);

JFreeChart is “open source” or, more specifically, free software. It is distributed under the terms of the GNU Lesser General Public Licence (LGPL), which permits use in proprietary applications.
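
A small example of the API described above (written against the JFreeChart 1.0.x class names): build a dataset, create a pie chart and save it to a PNG file. The dataset values and the output file name are illustrative.

import java.io.File;
import org.jfree.chart.ChartFactory;
import org.jfree.chart.ChartUtilities;
import org.jfree.chart.JFreeChart;
import org.jfree.data.general.DefaultPieDataset;

public class PieChartExample {
    public static void main(String[] args) throws Exception {
        // Build the data to plot.
        DefaultPieDataset dataset = new DefaultPieDataset();
        dataset.setValue("Matching ciphertexts", 25);
        dataset.setValue("Other ciphertexts", 75);

        // Create the chart (title, dataset, legend, tooltips, no URLs).
        JFreeChart chart = ChartFactory.createPieChart(
                "Search result distribution", dataset, true, true, false);

        // Render the chart to a 500 x 300 PNG image file.
        ChartUtilities.saveChartAsPNG(new File("pie-chart.png"), chart, 500, 300);
    }
}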

 

6.9.1 Map Visualizations:

Charts showing values that relate to geographical areas. Some examples include: (a) population density in each state of the United States, (b) income per capita for each country in Europe, (c) life expectancy in each country of the world. The tasks in this project include: Sourcing freely redistributable vector outlines for the countries of the world, states/provinces in particular countries (USA in particular, but also other areas);

Creating an appropriate dataset interface (plus a default implementation) and a renderer, and integrating these with the existing XYPlot class in JFreeChart; testing, documenting, testing some more, documenting some more.

6.9.2 Time Series Chart Interactivity

Implement a new (to JFreeChart) feature for interactive time series charts — to display a separate control that shows a small version of ALL the time series data, with a sliding “view” rectangle that allows you to select the subset of the time series data to display in the main chart.

6.9.3 Dashboards

There is currently a lot of interest in dashboard displays. Create a flexible dashboard mechanism that supports a subset of JFreeChart chart types (dials, pies, thermometers, bars, and lines/time series) that can be delivered easily via both Java Web Start and an applet.

 

6.9.4 Property Editors

The property editor mechanism in JFreeChart only handles a small subset of the properties that can be set for charts. Extend (or reimplement) this mechanism to provide greater end-user control over the appearance of the charts.

CHAPTER 7

7.0 APPENDIX

7.1 SAMPLE SCREEN SHOTS:

7.2 SAMPLE SOURCE CODE:

CHAPTER 8

8.1 CONCLUSION

This paper investigated as-fast-as-possible search in PEKS with semantic security. We proposed the concept of SPCHS as a variant of PEKS. The new concept allows keyword-searchable ciphertexts to be generated with a hidden structure. Given a keyword search trapdoor, the search algorithm of SPCHS can disclose part of this hidden structure for guidance on finding out the ciphertexts of the queried keyword. Semantic security of SPCHS captures the privacy of the keywords and the invisibility of the hidden structures. We proposed an SPCHS scheme from scratch with semantic security in the RO model. The scheme generates keyword-searchable ciphertexts with a hidden star-like structure. It has search complexity mainly linear with the exact number of the ciphertexts containing the queried keyword. It outperforms existing PEKS schemes with semantic security, whose search complexity is linear with the number of all ciphertexts.

We identified several interesting properties, i.e., collision-freeness and full-identity malleability in some IBKEM instances, and formalized these properties to build a generic SPCHS construction. We illustrated two collision-free full-identity malleable IBKEM instances, which are respectively secure in the RO and standard models. SPCHS seems a promising tool to solve some challenging problems in public-key searchable encryption. One application may be to achieve retrieval completeness verification which, to the best of our knowledge, has not been achieved in existing PEKS schemes. Specifically, by forming a hidden ring-like structure, i.e., letting the last hidden pointer always point to the head, one can obtain PEKS allowing to check the completeness of the retrieved ciphertexts by checking whether the pointers of the returned ciphertexts form a ring.

Friendbook A Semantic-Based Friend Recommendation System for Social Networks

Existing social networking services recommend friends to users based on their social graphs, which may not be the most appropriate to reflect a user’s preferences on friend selection in real life. In this paper, we present Friendbook, a novel semantic-based friend recommendation system for social networks, which recommends friends to users based on their life styles instead of social graphs. By taking advantage of sensor-rich smartphones, Friendbook discovers life styles of users from user-centric sensor data, measures the similarity of life styles between users, and recommends friends to users if their life styles have high similarity. Inspired by text mining, we model a user’s daily life as life documents, from which his/her life styles are extracted by using the Latent Dirichlet Allocation algorithm.

We further propose a similarity metric to measure the similarity of life styles between users, and calculate users’ impact in terms of life styles with a friend-matching graph. Upon receiving a request, Friendbook returns a list of people with the highest recommendation scores to the query user. Finally, Friendbook integrates a feedback mechanism to further improve the recommendation accuracy. We have implemented Friendbook on Android-based smartphones, and evaluated its performance in both small-scale experiments and large-scale simulations. The results show that the recommendations accurately reflect the preferences of users in choosing friends.

1.2 INTRODUCTION:

 

What Is A Social Network?

Wikipedia defines a social network service as a service which “focuses on the building and verifying of online social networks for communities of people who share interests and activities, or who are interested in exploring the interests and activities of others, and which necessitates the use of software.”

A report published by OCLC provides the following definition of social networking sites: “Web sites primarily designed to facilitate interaction between users who share interests, attitudes and activities, such as Facebook, Mixi and MySpace.”

What Can Social Networks Be Used For?

Social networks can provide a range of benefits to members of an organization:

Support for learning: Social networks can enhance informal learning and support social connections within groups of learners and with those involved in the support of learning.

Support for members of an organisation:  Social networks can potentially be used by all members of an organisation, and not just those involved in working with students. Social networks can help the development of communities of practice.

Engaging with others: Passive use of social networks can provide valuable business intelligence and feedback on institutional services (although this may give rise to ethical concerns).

Ease of access to information and applications: The ease of use of many social networking services can provide benefits to users by simplifying access to other tools and applications. The Facebook Platform provides an example of how a social networking service can be used as an environment for other tools.

Common interface: A possible benefit of social networks may be the common interface which spans work / social boundaries. Since such services are often used in a personal capacity the interface and the way the service works may be familiar, thus minimising training and support needed to exploit the services in a professional context.  This can, however, also be a barrier to those who wish to have strict boundaries between work and social activities.

Examples of popular social networking services include:

Facebook: Facebook is a social networking Web site that allows people to communicate with their friends and exchange information. In May 2007 Facebook launched the Facebook Platform, which provides a framework for developers to create applications that interact with core Facebook features.

MySpace: MySpace is a social networking Web site offering an interactive, user-submitted network of friends, personal profiles, blogs and groups, commonly used for sharing photos, music and videos.

Ning: An online platform for creating social Web sites and social networks aimed at users who want to create networks around specific interests or have limited technical skills.

Twitter: Twitter is an example of a micro-blogging service. Twitter can be used in a variety of ways including sharing brief information with users and providing support for one’s peers.

Note that this brief list of popular social networking services omits popular social sharing services such as Flickr and YouTube.

Opportunities and Challenges

The popularity and ease of use of social networking services have excited institutions with their potential in a variety of areas. However, effective use of social networking services poses a number of challenges for institutions, including the long-term sustainability of the services; user concerns over the use of social tools in a work or study context; a variety of technical issues; and legal issues such as copyright, privacy and accessibility.

Institutions would be advised to consider carefully the implications before promoting significant use of such services.

Twenty years ago, people typically made friends with others who lived or worked close to themselves, such as neighbors or colleagues. We call friends made in this traditional fashion G-friends, which stands for geographical location-based friends, because they are influenced by the geographical distances between each other. With the rapid advances in social networks, services such as Facebook, Twitter and Google+ have provided us with revolutionary ways of making friends. According to Facebook statistics, a user has an average of 130 friends, perhaps more than at any other time in history. One challenge with existing social networking services is how to recommend a good friend to a user. Most of them rely on pre-existing user relationships to pick friend candidates.

For example, Facebook relies on a social link analysis among those who already share common friends and recommends symmetrical users as potential friends. Unfortunately, this approach may not be the most appropriate based on recent sociology findings. According to these studies, the rules to group people together include: 1) habits or life style; 2) attitudes; 3) tastes; 4) moral standards; 5) economic level; and 6) people they already know. Among these, life styles are usually closely correlated with daily routines and activities. Therefore, if we could gather information on users’ daily routines and activities, we could exploit rule #1 and recommend friends to people based on their similar life styles. This recommendation mechanism can be deployed as a standalone app on smartphones or as an add-on to existing social network frameworks. In both cases, Friendbook can help mobile phone users find friends either among strangers or within a certain group as long as they share similar life styles.

1.3 LITERATURE SURVEY:

1) “Probabilistic mining of socio geographic routines from mobile phone data”

AUTHORS:  K. Farrahi and D. Gatica-Perez

There is relatively little work on the investigation of large-scale human data in terms of multimodality for human activity discovery. In this paper, we suggest that human interaction data, or human proximity, obtained by mobile phone Bluetooth sensor data, can be integrated with human location data, obtained by mobile cell tower connections, to mine meaningful details about human activities from large and noisy datasets. We propose a model, called the bag of multimodal behavior, that integrates the modeling of variations of location over multiple time-scales and the modeling of interaction types from proximity. Our representation is simple yet robust in characterizing real-life human behavior sensed from mobile phones, which are devices capable of capturing large-scale data known to be noisy and incomplete. We use an unsupervised approach, based on probabilistic topic models, to discover latent human activities in terms of the joint interaction and location behaviors of 97 individuals over approximately a 10-month period, using data from MIT’s Reality Mining project. Some of the human activities discovered with our multimodal data representation include “going out from 7 pm-midnight alone” and “working from 11 am-5 pm with 3-5 other people,” further finding that this activity dominantly occurs on specific days of the week. Our methodology also finds dominant work patterns occurring on other days of the week. We further demonstrate the feasibility of the topic modeling framework for human routine discovery by predicting missing multimodal phone data at specific times of the day.

2. Collaborative and structural recommendation of friends using weblog-based social network analysis

AUTHORS:  W. H. Hsu, A. King, M. Paradesi, T. Pydimarri, and T. Weninger

In this paper, we address the problem of link recommendation in weblogs and similar social networks. First, we present an approach based on collaborative recommendation using the link structure of a social network and content-based recommendation using mutual declared interests. Next, we describe the application of this approach to a small representative subset of a large real-world social network: the user/community network of the blog service Live Journal. We then discuss the ground features available in Live Journal’s public user information pages and describe some graph algorithms for analysis of the social network. These are used to identify candidates, provide ground truth for recommendations, and construct features for learning the concept of a recommended link. Finally, we compare the performance of this machine learning approach to that of the rudimentary recommender system provided by Live Journal.

3. Understanding Transportation Modes Based on GPS Data for Web Applications.

AUTHORS:  Y. Zheng, Y. Chen, Q. Li, X. Xie, and W.-Y. Ma.

User mobility has given rise to a variety of Web applications, in which the global positioning system (GPS) plays many important roles in bridging between these applications and end users. As a kind of human behavior, people’s transportation modes, such as walking and driving, can provide pervasive computing systems with more contextual information and enrich a user’s mobility with informative knowledge. In this article, we report on an approach based on supervised learning to automatically infer users’ transportation modes, including driving, walking, taking a bus and riding a bike, from raw GPS logs. Our approach consists of three parts: a change point-based segmentation method, an inference model and a graph-based post-processing algorithm. First, we propose a change point-based segmentation method to partition each GPS trajectory into separate segments of different transportation modes. Second, from each segment, we identify a set of sophisticated features, which are not affected by differing traffic conditions (e.g., a person’s direction when in a car is constrained more by the road than any change in traffic conditions). Later, these features are fed to a generative inference model to classify the segments of different modes. Third, we conduct graph-based post-processing to further improve the inference performance. This post-processing algorithm considers both the commonsense constraints of the real world and typical user behaviors based on locations in a probabilistic manner. The advantages of our method over the related works include three aspects. 1) Our approach can effectively segment trajectories containing multiple transportation modes. 2) Our work mined the location constraints from user-generated GPS logs, while being independent of additional sensor data and map information like road networks and bus stops. 3) The model learned from the dataset of some users can be applied to infer GPS data from others. Using the GPS logs collected by 65 people over a period of 10 months, we evaluated our approach via a set of experiments. As a result, based on the change-point-based segmentation method and Decision Tree-based inference model, we achieved prediction accuracy greater than 71 percent. Further, using the graph-based post-processing algorithm, the performance attained a 4-percent enhancement.

4. Online friend recommendation through personality matching and collaborative filtering

AUTHORS: L. Bian and H. Holtzman

Most social network websites rely on people’s proximity on the social graph for friend recommendation. In this paper, we present Matchmaker, a collaborative filtering friend recommendation system based on personality matching. The goal of Matchmaker is to leverage the social information and mutual understanding among people in existing social network connections, and produce friend recommendations based on rich contextual data from people’s physical world interactions. Matchmaker allows users’ networks to match them with similar TV characters, and uses relationships in the TV programs as a parallel comparison matrix to suggest to the users friends that have been voted to suit their personality the best. The system’s ranking schema allows progressive improvement of the personality matching consensus and more diverse branching of users’ social network connections. Lastly, our user study shows that the application can also induce more TV content consumption by driving users’ curiosity in the ranking process.

CHAPTER 2

2.0 SYSTEM ANALYSIS:

2.1 EXISTING SYSTEM:

Most friend suggestion mechanisms rely on pre-existing user relationships to pick friend candidates. For example, Facebook relies on a social link analysis among those who already share common friends and recommends symmetrical users as potential friends. The rules to group people together include:

  1. Habits or life style
  2. Attitudes
  3. Tastes
  4. Moral standards
  5. Economic level; and
  6. People they already know.

Apparently, rule #3 and rule #6 are the mainstream factors considered by existing recommendation systems.

2.1.1 DISADVANTAGES:

  • Existing social networking services recommend friends to users based on their social graphs, which may not be the most appropriate to reflect a user’s preferences on friend selection in real life.


2.2 PROPOSED SYSTEM:

  • A novel semantic-based friend recommendation system for social networks, which recommends friends to users based on their life styles instead of social graphs.
  • By taking advantage of sensor-rich smartphones, Friendbook discovers life styles of users from user-centric sensor data, measures the similarity of life styles between users, and recommends friends to users if their life styles have high similarity.
  • We model a user’s daily life as life documents, from which his/her life styles are extracted by using the Latent Dirichlet Allocation algorithm.
  • A similarity metric to measure the similarity of life styles between users, and a friend-matching graph to calculate users’ impact in terms of life styles.
  • We integrate a linear feedback mechanism that exploits the user’s feedback to improve recommendation accuracy.


2.2.1 ADVANTAGES:

  • Recommend potential friends to users if they share similar life styles.
  • The feedback mechanism allows us to measure the satisfaction of users by providing a user interface that allows the user to rate the friend list.

2.3 HARDWARE & SOFTWARE REQUIREMENTS:

2.3.1 HARDWARE REQUIREMENT:

  • Processor                                –    Pentium IV
  • Speed                                      –    1.1 GHz
  • RAM                                       –    256 MB (min)
  • Hard Disk                               –    20 GB
  • Floppy Drive                           –    1.44 MB
  • Key Board                              –    Standard Windows Keyboard
  • Mouse                                     –    Two or Three Button Mouse
  • Monitor                                   –    SVGA

 

2.3.2 SOFTWARE REQUIREMENTS:

JAVA

  • Operating System                   :           Windows XP or Win7
  • Front End                                :           JAVA JDK 1.7
  • Back End                                :           MYSQL Server
  • Server                                      :           Apache Tomcat Server
  • Script                                       :           JSP Script
  • Document                               :           MS-Office 2007

CHAPTER 3

3.0 SYSTEM DESIGN:

Data Flow Diagram / Use Case Diagram / Flow Diagram:

  • The DFD is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on these data, and the output data generated by the system.
  • The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components. These components are the system process, the data used by the process, an external entity that interacts with the system and the information flows in the system.
  • DFD shows how the information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.
  • DFD is also known as bubble chart. A DFD may be used to represent a system at any level of abstraction. DFD may be partitioned into levels that represent increasing information flow and functional detail.

NOTATION:

SOURCE OR DESTINATION OF DATA:

External sources or destinations, which may be people or organizations or other entities

DATA SOURCE:

Here the data referenced by a process is stored and retrieved.

PROCESS:

People, procedures, or devices that produce data. The physical component is not identified.

DATA FLOW:

Data moves in a specific direction from an origin to a destination. The data flow is a “packet” of data.

MODELING RULES:

There are several common modeling rules when creating DFDs:

  1. All processes must have at least one data flow in and one data flow out.
  2. All processes should modify the incoming data, producing new forms of outgoing data.
  3. Each data store must be involved with at least one data flow.
  4. Each external entity must be involved with at least one data flow.
  5. A data flow must be attached to at least one process.


3.1 ARCHITECTURE DIAGRAM:


3.2 DATAFLOW DIAGRAM:

UML DIAGRAMS:

3.2 USE CASE DIAGRAM:


3.3 CLASS DIAGRAM:


3.4 SEQUENCE DIAGRAM:


3.5 ACTIVITY DIAGRAM:

CHAPTER 4

4.0 IMPLEMENTATION:

4.1 ALGORITHM

4.2 MODULES:

  • LIFE STYLE MODELING
  • ACTIVITY RECOGNITION
  • FRIEND-MATCHING GRAPH CONSTRUCTION
  • USER IMPACT RANKING


4.3 MODULE DESCRIPTION:

LIFE STYLE MODELING:

Life styles and activities are reflections of daily lives at two different levels, where daily lives can be treated as a mixture of life styles and life styles as a mixture of activities. This is analogous to the treatment of documents as ensembles of topics and topics as ensembles of words. By taking advantage of recent developments in the field of text mining, we model the daily lives of users as life documents, the life styles as topics, and the activities as words. Given the “documents”, the probabilistic topic model can discover the probabilities of the underlying “topics”. Therefore, we adopt the probabilistic topic model to discover the probabilities of hidden “life styles” from the “life documents”. Our objective is to discover the life style vector for each user given the life documents of all users.
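
To make the document analogy concrete, the sketch below (the class name and activity labels are illustrative assumptions, not part of Friendbook) shows how a day of recognized activities could be assembled into a bag-of-activities “life document”, i.e., the word-count input that a topic model such as LDA would consume.

import java.util.*;

// Minimal sketch: turn a day's sequence of recognized activities ("words")
// into a bag-of-activities count vector (a "life document"). The activity
// labels and class name here are illustrative assumptions, not Friendbook's.
public class LifeDocumentBuilder {

    // Count how often each activity occurs in the day's activity sequence.
    public static Map<String, Integer> buildLifeDocument(List<String> dailyActivities) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String activity : dailyActivities) {
            Integer c = counts.get(activity);
            counts.put(activity, c == null ? 1 : c + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> day = Arrays.asList(
                "walking", "sitting", "walking", "running", "sitting", "sitting");
        // The resulting counts would be fed to a topic model (e.g., LDA)
        // to infer the hidden "life style" (topic) distribution for the user.
        System.out.println(buildLifeDocument(day)); // walking=2, sitting=3, running=1
    }
}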

ACTIVITY RECOGNITION:

We need to first classify or recognize the activities of users. Life styles are usually reflected as a mixture of motion activities with different occurrence probability. Generally speaking, there are two mainstream approaches: supervised learning and unsupervised learning. For both approaches, mature techniques have been developed and tested. In practice, the number of activities involved in the analysis is unpredictable and it is difficult to collect a large set of ground truth data for each activity, which makes supervised learning algorithms unsuitable for our system. Therefore, we use unsupervised learning approaches to recognize activities.
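
As one possible instance of such an unsupervised approach (not necessarily the clustering used by Friendbook), the sketch below groups accelerometer feature vectors with a plain k-means loop; the feature choice, the number of clusters and the iteration count are illustrative assumptions.

import java.util.*;

// Sketch of unsupervised activity recognition: cluster accelerometer feature
// vectors (e.g., mean and variance of magnitude over a window) with k-means.
public class ActivityClustering {

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return s;
    }

    // Returns the cluster index ("activity id") assigned to each feature vector.
    static int[] kMeans(double[][] points, int k, int iterations) {
        Random rnd = new Random(42);
        double[][] centers = new double[k][];
        // Random initial centers (may coincide in this tiny sketch).
        for (int j = 0; j < k; j++) centers[j] = points[rnd.nextInt(points.length)].clone();
        int[] assign = new int[points.length];
        for (int it = 0; it < iterations; it++) {
            // Assignment step: each point goes to its nearest center.
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                for (int j = 1; j < k; j++)
                    if (dist(points[i], centers[j]) < dist(points[i], centers[best])) best = j;
                assign[i] = best;
            }
            // Update step: recompute each center as the mean of its cluster.
            double[][] sum = new double[k][points[0].length];
            int[] cnt = new int[k];
            for (int i = 0; i < points.length; i++) {
                cnt[assign[i]]++;
                for (int d = 0; d < points[i].length; d++) sum[assign[i]][d] += points[i][d];
            }
            for (int j = 0; j < k; j++)
                if (cnt[j] > 0)
                    for (int d = 0; d < sum[j].length; d++) centers[j][d] = sum[j][d] / cnt[j];
        }
        return assign;
    }

    public static void main(String[] args) {
        // Each row: {mean acceleration magnitude, variance} for one time window.
        double[][] features = {{1.0, 0.1}, {1.1, 0.2}, {9.5, 4.0}, {9.8, 3.8}, {5.0, 1.0}};
        System.out.println(Arrays.toString(kMeans(features, 2, 10)));
    }
}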

FRIEND-MATCHING GRAPH CONSTRUCTION:

To characterize relations among users, in this section, we propose the friend-matching graph to represent the similarity between their life styles and how they influence other people in the graph. In particular, we use the link weight between two users to represent the similarity of their life styles. Based on the friend-matching graph, we can obtain a user’s affinity reflecting how likely this user will be chosen as another user’s friend in the network. We define a new similarity metric to measure the similarity between two life style vectors.  Based on the similarity metric, we model the relations between users in real life as a friend-matching graph. The friend-matching graph has been constructed to reflect life style relations among users.
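
The paper defines its own similarity metric between life style vectors; as a stand-in for illustration, the sketch below uses cosine similarity and keeps an edge whenever the similarity exceeds a threshold (the 0.5 value is an assumption), producing the weighted friend-matching graph as an adjacency matrix.

// Sketch: build a weighted friend-matching graph from life style vectors.
// Cosine similarity is a stand-in for the paper's own metric, and the
// 0.5 threshold is an illustrative assumption.
public class FriendMatchingGraph {

    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Returns an adjacency matrix whose entries are edge weights (0 = no edge).
    static double[][] build(double[][] lifeStyleVectors, double threshold) {
        int n = lifeStyleVectors.length;
        double[][] graph = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++) {
                double sim = cosine(lifeStyleVectors[i], lifeStyleVectors[j]);
                if (sim >= threshold) { graph[i][j] = sim; graph[j][i] = sim; }
            }
        return graph;
    }

    public static void main(String[] args) {
        double[][] styles = {{0.7, 0.2, 0.1}, {0.6, 0.3, 0.1}, {0.1, 0.1, 0.8}};
        double[][] g = build(styles, 0.5);
        System.out.println("weight(0,1) = " + g[0][1]); // high: similar life styles
        System.out.println("weight(0,2) = " + g[0][2]); // 0: below threshold
    }
}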

USER IMPACT RANKING:

The impact ranking measures a user’s capability to establish friendships in the network. In other words, the higher the ranking, the easier it is to make friends with the user, because he/she shares broader life styles with others. Once the ranking of a user is obtained, it provides guidelines to those who receive the recommendation list on how to choose friends. The ranking itself, however, should be independent of the query user. In other words, the ranking depends only on the structure of the friend-matching graph, which contains two aspects: 1) how the edges are connected; 2) how much weight there is on every edge. Moreover, the ranking should be used together with the similarity scores between the query user and the potential friend candidates, so that the recommended friends not only share sufficient similarity with the query user but are also popular users through whom the query user can increase his/her own impact ranking.
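
Since the ranking depends only on the weighted graph structure and, as noted in the conclusion, can be computed by iterative matrix-vector multiplication, the PageRank-style sketch below illustrates the idea on the column-normalized adjacency matrix; the damping factor and iteration count are assumptions for illustration.

// Sketch: iterative matrix-vector multiplication over the column-normalized
// friend-matching graph, in the spirit of PageRank.
public class ImpactRanking {

    static double[] rank(double[][] graph, double damping, int iterations) {
        int n = graph.length;
        // Column-normalize edge weights so each node distributes its score.
        double[][] m = new double[n][n];
        for (int j = 0; j < n; j++) {
            double colSum = 0;
            for (int i = 0; i < n; i++) colSum += graph[i][j];
            for (int i = 0; i < n; i++) m[i][j] = colSum == 0 ? 1.0 / n : graph[i][j] / colSum;
        }
        double[] r = new double[n];
        java.util.Arrays.fill(r, 1.0 / n);
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            for (int i = 0; i < n; i++) {
                double s = 0;
                for (int j = 0; j < n; j++) s += m[i][j] * r[j];
                next[i] = (1 - damping) / n + damping * s;
            }
            r = next;
        }
        return r; // higher value = user shares broader life styles, easier to befriend
    }

    public static void main(String[] args) {
        double[][] g = {{0, 0.9, 0.6}, {0.9, 0, 0.2}, {0.6, 0.2, 0}};
        System.out.println(java.util.Arrays.toString(rank(g, 0.85, 50)));
    }
}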

CHAPTER 5

5.0 SYSTEM STUDY:

5.1 FEASIBILITY STUDY:

The feasibility of the project is analyzed in this phase and a business proposal is put forth with a very general plan for the project and some cost estimates. During system analysis the feasibility study of the proposed system is carried out. This is to ensure that the proposed system is not a burden to the company. For feasibility analysis, some understanding of the major requirements for the system is essential.

Three key considerations involved in the feasibility analysis are 

  • ECONOMICAL FEASIBILITY
  • TECHNICAL FEASIBILITY
  • SOCIAL FEASIBILITY

5.1.1 ECONOMICAL FEASIBILITY:     

This study is carried out to check the economic impact that the system will have on the organization. The amount of funds that the company can pour into the research and development of the system is limited. The expenditures must be justified. Thus the developed system is well within the budget, and this was achieved because most of the technologies used are freely available. Only the customized products had to be purchased.

 

5.1.2 TECHNICAL FEASIBILITY   

This study is carried out to check the technical feasibility, that is, the technical requirements of the system. Any system developed must not place a high demand on the available technical resources; otherwise, high demands would be placed on the client. The developed system must have modest requirements, as only minimal or no changes are required for implementing this system.

5.1.3 SOCIAL FEASIBILITY:  

This aspect of the study checks the level of acceptance of the system by the user. This includes the process of training the user to use the system efficiently. The user must not feel threatened by the system, but must instead accept it as a necessity. The level of acceptance by the users solely depends on the methods that are employed to educate the user about the system and to make him familiar with it. His level of confidence must be raised so that he is also able to make some constructive criticism, which is welcomed, as he is the final user of the system.

5.2 SYSTEM TESTING:

Testing is a process of checking whether the developed system is working according to the original objectives and requirements. It is a set of activities that can be planned in advance and conducted systematically. Testing is vital to the success of the system. System testing makes a logical assumption that if all the parts of the system are correct, the goal will be successfully achieved. Inadequate testing, or no testing at all, leads to errors that may not appear until many months later.

This creates two problems: the time lag between the cause and the appearance of the problem, and the effect of system errors on the files and records within the system. A small system error can conceivably explode into a much larger problem. Effective testing early in the process translates directly into long-term cost savings from a reduced number of errors. Another reason for system testing is its utility as a user-oriented vehicle before implementation. The best program is worthless if it does not produce correct outputs.

5.2.1 UNIT TESTING:

Description: Test for application window properties.
Expected result: All the properties of the windows are to be properly aligned and displayed.

Description: Test for mouse operations.
Expected result: All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions.

A program represents the logical elements of a system. For a program to run satisfactorily, it must compile and test data correctly and tie in properly with other programs. Achieving an error-free program is the responsibility of the programmer. Program testing checks for two types of errors: syntax and logical. A syntax error is a program statement that violates one or more rules of the language in which it is written. An improperly defined field dimension or omitted keywords are common syntax errors. These errors are shown through error messages generated by the computer. For logic errors, the programmer must examine the output carefully.

5.2.2 FUNCTIONAL TESTING:

Functional testing of an application is used to prove that the application delivers correct results, using enough inputs to give an adequate level of confidence that it will work correctly for all sets of inputs. The functional testing will need to prove that the application works for each client type and that the personalization functions work correctly. When a program is tested, the actual output is compared with the expected output. When there is a discrepancy, the sequence of instructions must be traced to determine the problem. The process is facilitated by breaking the program into self-contained portions, each of which can be checked at certain key points. The idea is to compare program values against desk-calculated values to isolate the problems.

Description: Test for all modules.
Expected result: All peers should communicate in the group.

Description: Test for various peers in a distributed network framework, as it displays all users available in the group.
Expected result: The result after execution should give the accurate result.


5.2.3 NON-FUNCTIONAL TESTING:

Non-functional software testing encompasses a rich spectrum of testing strategies, describing the expected results for every test case. It uses symbolic analysis techniques. This testing is used to check that an application will work in the operational environment. Non-functional testing includes:

  • Load testing
  • Performance testing
  • Usability testing
  • Reliability testing
  • Security testing

5.2.4 LOAD TESTING:

An important tool for implementing system tests is a load generator. A load generator is essential for testing quality requirements such as performance and stress. A load can be a real load; that is, the system can be put under real usage by having actual telephone users connected to it. They will generate test input data for the system test.

Description: It is necessary to ascertain that the application behaves correctly under load when a ‘Server busy’ response is received.
Expected result: Should designate another active node as a Server.


5.2.5 PERFORMANCE TESTING:

Performance tests are utilized in order to determine the widely defined performance of the software system such as execution time associated with various parts of the code, response time and device utilization. The intent of this testing is to identify weak points of the software system and quantify its shortcomings.

Description: This is required to assure that the application performs adequately, having the capability to handle many peers, delivering its results in the expected time and using an acceptable level of resources; this is an aspect of operational management.
Expected result: Should handle large input values and produce accurate results in the expected time.


5.2.6 RELIABILITY TESTING:

Software reliability is the ability of a system or component to perform its required functions under stated conditions for a specified period of time, and this is ensured in this testing. Reliability can be expressed as the ability of the software to reveal defects under testing conditions, according to the specified requirements. It is the probability that a software system will operate without failure under given conditions for a given time interval, and it focuses on the behavior of the software element. It forms a part of the software quality control activities.

Description: This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in providing the application.
Expected result: In case of failure of the server, an alternate server should take over the job.


5.2.7 SECURITY TESTING:

Security testing evaluates system characteristics that relate to the availability, integrity and confidentiality of the system data and services. Users/Clients should be encouraged to make sure their security needs are very clearly known at requirements time, so that the security issues can be addressed by the designers and testers.

Description: Checking that the user identification is authenticated.
Expected result: In case of failure, it should not be connected to the framework.

Description: Check whether group keys in a tree are shared by all peers.
Expected result: The peers should know the group key in the same group.


5.2.8 WHITE BOX TESTING:

White box testing, sometimes called glass-box testing, is a test case design method that uses the control structure of the procedural design to derive test cases. Using the white box testing method, the software engineer can derive test cases. White box testing focuses on the inner structure of the software to be tested.

Description: Exercise all logical decisions on their true and false sides.
Expected result: All the logical decisions must be valid.

Description: Execute all loops at their boundaries and within their operational bounds.
Expected result: All the loops must be finite.

Description: Exercise internal data structures to ensure their validity.
Expected result: All the data structures must be valid.


5.2.9 BLACK BOX TESTING:

Black box testing, also called behavioral testing, focuses on the functional requirements of the software. That is, black box testing enables the software engineer to derive sets of input conditions that will fully exercise all functional requirements for a program. Black box testing is not an alternative to white box techniques. Rather, it is a complementary approach that is likely to uncover a different class of errors than white box methods. Black box testing attempts to find errors with a focus on the inputs, outputs, and principal functions of a software module. The starting point of black box testing is either a specification or code. The contents of the box are hidden, and the stimulated software should produce the desired results.

Description: To check for incorrect or missing functions.
Expected result: All the functions must be valid.

Description: To check for interface errors.
Expected result: The entire interface must function normally.

Description: To check for errors in data structures or external database access.
Expected result: Database update and retrieval must be performed correctly.

Description: To check for initialization and termination errors.
Expected result: All the functions and data structures must be initialized properly and terminated normally.

All the above system testing strategies were carried out during development, as the documentation and institutionalization of the proposed goals and related policies are essential.

CHAPTER 6

6.0 SOFTWARE DESCRIPTION:

 

6.1 JAVA TECHNOLOGY:

Java technology is both a programming language and a platform.

 

The Java Programming Language

 

The Java programming language is a high-level language that can be characterized by all of the following buzzwords:

  • Simple
  • Architecture neutral
  • Object oriented
  • Portable
  • Distributed
  • High performance
  • Interpreted
  • Multithreaded
  • Robust
  • Dynamic
  • Secure

With most programming languages, you either compile or interpret a program so that you can run it on your computer. The Java programming language is unusual in that a program is both compiled and interpreted. With the compiler, first you translate a program into an intermediate language called Java byte codes —the platform-independent codes interpreted by the interpreter on the Java platform. The interpreter parses and runs each Java byte code instruction on the computer. Compilation happens just once; interpretation occurs each time the program is executed. The following figure illustrates how this works.

You can think of Java byte codes as the machine code instructions for the Java Virtual Machine (Java VM). Every Java interpreter, whether it’s a development tool or a Web browser that can run applets, is an implementation of the Java VM. Java byte codes help make “write once, run anywhere” possible. You can compile your program into byte codes on any platform that has a Java compiler. The byte codes can then be run on any implementation of the Java VM. That means that as long as a computer has a Java VM, the same program written in the Java programming language can run on Windows 2000, a Solaris workstation, or on an iMac.
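
For example, a minimal program is compiled once into byte codes with javac and can then be run on any Java VM with the java launcher:

// HelloWorld.java -- compiled once with `javac HelloWorld.java` into
// HelloWorld.class (byte codes), then run anywhere with `java HelloWorld`.
public class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello from the Java VM");
    }
}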

6.2 THE JAVA PLATFORM:

A platform is the hardware or software environment in which a program runs. We’ve already mentioned some of the most popular platforms like Windows 2000, Linux, Solaris, and MacOS. Most platforms can be described as a combination of the operating system and hardware. The Java platform differs from most other platforms in that it’s a software-only platform that runs on top of other hardware-based platforms.

The Java platform has two components:

  • The Java Virtual Machine (Java VM)
  • The Java Application Programming Interface (Java API)

You’ve already been introduced to the Java VM. It’s the base for the Java platform and is ported onto various hardware-based platforms.

The Java API is a large collection of ready-made software components that provide many useful capabilities, such as graphical user interface (GUI) widgets. The Java API is grouped into libraries of related classes and interfaces; these libraries are known as packages. The next section, What Can Java Technology Do?, highlights what functionality some of the packages in the Java API provide.

The following figure depicts a program that’s running on the Java platform. As the figure shows, the Java API and the virtual machine insulate the program from the hardware.

Native code is code that, after compilation, runs on a specific hardware platform. As a platform-independent environment, the Java platform can be a bit slower than native code. However, smart compilers, well-tuned interpreters, and just-in-time byte code compilers can bring performance close to that of native code without threatening portability.

6.3 WHAT CAN JAVA TECHNOLOGY DO?

The most common types of programs written in the Java programming language are applets and applications. If you’ve surfed the Web, you’re probably already familiar with applets. An applet is a program that adheres to certain conventions that allow it to run within a Java-enabled browser.

However, the Java programming language is not just for writing cute, entertaining applets for the Web. The general-purpose, high-level Java programming language is also a powerful software platform. Using the generous API, you can write many types of programs.

An application is a standalone program that runs directly on the Java platform. A special kind of application known as a server serves and supports clients on a network. Examples of servers are Web servers, proxy servers, mail servers, and print servers. Another specialized program is a servlet.

A servlet can almost be thought of as an applet that runs on the server side. Java Servlets are a popular choice for building interactive web applications, replacing the use of CGI scripts. Servlets are similar to applets in that they are runtime extensions of applications. Instead of working in browsers, though, servlets run within Java Web servers, configuring or tailoring the server.

How does the API support all these kinds of programs? It does so with packages of software components that provide a wide range of functionality. Every full implementation of the Java platform gives you the following features:

  • The essentials: Objects, strings, threads, numbers, input and output, data structures, system properties, date and time, and so on.
  • Applets: The set of conventions used by applets.
  • Networking: URLs, TCP (Transmission Control Protocol), UDP (User Datagram Protocol) sockets, and IP (Internet Protocol) addresses.
  • Internationalization: Help for writing programs that can be localized for users worldwide. Programs can automatically adapt to specific locales and be displayed in the appropriate language.
  • Security: Both low level and high level, including electronic signatures, public and private key management, access control, and certificates.
  • Software components: Known as JavaBeans™, these can plug into existing component architectures.
  • Object serialization: Allows lightweight persistence and communication via Remote Method Invocation (RMI).
  • Java Database Connectivity (JDBC™): Provides uniform access to a wide range of relational databases.

The Java platform also has APIs for 2D and 3D graphics, accessibility, servers, collaboration, telephony, speech, animation, and more. The following figure depicts what is included in the Java 2 SDK.

 

6.4 HOW WILL JAVA TECHNOLOGY CHANGE MY LIFE?

We can’t promise you fame, fortune, or even a job if you learn the Java programming language. Still, it is likely to make your programs better and requires less effort than other languages. We believe that Java technology will help you do the following:

  • Get started quickly: Although the Java programming language is a powerful object-oriented language, it’s easy to learn, especially for programmers already familiar with C or C++.
  • Write less code: Comparisons of program metrics (class counts, method counts, and so on) suggest that a program written in the Java programming language can be four times smaller than the same program in C++.
  • Write better code: The Java programming language encourages good coding practices, and its garbage collection helps you avoid memory leaks. Its object orientation, its JavaBeans component architecture, and its wide-ranging, easily extendible API let you reuse other people’s tested code and introduce fewer bugs.
  • Develop programs more quickly: Your development time may be as much as twice as fast versus writing the same program in C++. Why? You write fewer lines of code and it is a simpler programming language than C++.
  • Avoid platform dependencies with 100% Pure Java: You can keep your program portable by avoiding the use of libraries written in other languages. The 100% Pure JavaTM Product Certification Program has a repository of historical process manuals, white papers, brochures, and similar materials online.
  • Write once, run anywhere: Because 100% Pure Java programs are compiled into machine-independent byte codes, they run consistently on any Java platform.
  • Distribute software more easily: You can upgrade applets easily from a central server. Applets take advantage of the feature of allowing new classes to be loaded “on the fly,” without recompiling the entire program.

 

6.5 ODBC:

 

Microsoft Open Database Connectivity (ODBC) is a standard programming interface for application developers and database systems providers. Before ODBC became a de facto standard for Windows programs to interface with database systems, programmers had to use proprietary languages for each database they wanted to connect to. Now, ODBC has made the choice of the database system almost irrelevant from a coding perspective, which is as it should be. Application developers have much more important things to worry about than the syntax that is needed to port their program from one database to another when business needs suddenly change.

Through the ODBC Administrator in Control Panel, you can specify the particular database that is associated with a data source that an ODBC application program is written to use. Think of an ODBC data source as a door with a name on it. Each door will lead you to a particular database. For example, the data source named Sales Figures might be a SQL Server database, whereas the Accounts Payable data source could refer to an Access database. The physical database referred to by a data source can reside anywhere on the LAN.

The ODBC system files are not installed on your system by Windows 95. Rather, they are installed when you setup a separate database application, such as SQL Server Client or Visual Basic 4.0. When the ODBC icon is installed in Control Panel, it uses a file called ODBCINST.DLL. It is also possible to administer your ODBC data sources through a stand-alone program called ODBCADM.EXE. There is a 16-bit and a 32-bit version of this program and each maintains a separate list of ODBC data sources.

From a programming perspective, the beauty of ODBC is that the application can be written to use the same set of function calls to interface with any data source, regardless of the database vendor. The source code of the application doesn’t change whether it talks to Oracle or SQL Server. We only mention these two as an example. There are ODBC drivers available for several dozen popular database systems. Even Excel spreadsheets and plain text files can be turned into data sources. The operating system uses the Registry information written by ODBC Administrator to determine which low-level ODBC drivers are needed to talk to the data source (such as the interface to Oracle or SQL Server). The loading of the ODBC drivers is transparent to the ODBC application program. In a client/server environment, the ODBC API even handles many of the network issues for the application programmer.

The advantages of this scheme are so numerous that you are probably thinking there must be some catch. The only disadvantage of ODBC is that it isn’t as efficient as talking directly to the native database interface. ODBC has had many detractors make the charge that it is too slow. Microsoft has always claimed that the critical factor in performance is the quality of the driver software that is used. In our humble opinion, this is true. The availability of good ODBC drivers has improved a great deal recently. And anyway, the criticism about performance is somewhat analogous to those who said that compilers would never match the speed of pure assembly language. Maybe not, but the compiler (or ODBC) gives you the opportunity to write cleaner programs, which means you finish sooner. Meanwhile, computers get faster every year.

6.6 JDBC:

In an effort to set an independent database standard API for Java; Sun Microsystems developed Java Database Connectivity, or JDBC. JDBC offers a generic SQL database access mechanism that provides a consistent interface to a variety of RDBMSs. This consistent interface is achieved through the use of “plug-in” database connectivity modules, or drivers. If a database vendor wishes to have JDBC support, he or she must provide the driver for each platform that the database and Java run on.

To gain a wider acceptance of JDBC, Sun based JDBC’s framework on ODBC. As you discovered earlier in this chapter, ODBC has widespread support on a variety of platforms. Basing JDBC on ODBC will allow vendors to bring JDBC drivers to market much faster than developing a completely new connectivity solution.

JDBC was announced in March of 1996. It was released for a 90 day public review that ended June 8, 1996. Because of user input, the final JDBC v1.0 specification was released soon after.

The remainder of this section will cover enough information about JDBC for you to know what it is about and how to use it effectively. This is by no means a complete overview of JDBC. That would fill an entire book.
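
As a brief illustration of the kind of usage covered in this section, the sketch below opens a connection to a MySQL database (the back end listed in the software requirements) and runs a simple query; the connection URL, credentials and the users table are placeholders, not the project’s actual schema.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Minimal JDBC sketch: connect, query, iterate. URL, credentials and the
// "users" table are placeholder assumptions.
public class JdbcExample {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:mysql://localhost:3306/friendbook";
        try (Connection con = DriverManager.getConnection(url, "user", "password");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT id, name FROM users")) {
            while (rs.next()) {
                System.out.println(rs.getInt("id") + " : " + rs.getString("name"));
            }
        }
    }
}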

 

6.7 JDBC Goals:

Few software packages are designed without goals in mind. JDBC is no exception; its many goals drove the development of the API. These goals, in conjunction with early reviewer feedback, have finalized the JDBC class library into a solid framework for building database applications in Java.

The goals that were set for JDBC are important. They will give you some insight as to why certain classes and functionalities behave the way they do. The eight design goals for JDBC are as follows:

SQL Level API

The designers felt that their main goal was to define a SQL interface for Java. Although not the lowest database interface level possible, it is at a low enough level for higher-level tools and APIs to be created. Conversely, it is at a high enough level for application programmers to use it confidently. Attaining this goal allows for future tool vendors to “generate” JDBC code and to hide many of JDBC’s complexities from the end user.

SQL Conformance

SQL syntax varies as you move from database vendor to database vendor. In an effort to support a wide variety of vendors, JDBC will allow any query statement to be passed through it to the underlying database driver. This allows the connectivity module to handle non-standard functionality in a manner that is suitable for its users.

JDBC must be implementable on top of common database interfaces

The JDBC SQL API must “sit” on top of other common SQL level APIs. This goal allows JDBC to use existing ODBC level drivers by the use of a software interface. This interface would translate JDBC calls to ODBC and vice versa.

Provide a Java interface that is consistent with the rest of the Java system

Because of Java’s acceptance in the user community thus far, the designers feel that they should not stray from the current design of the core Java system.

Keep it simple

This goal probably appears in all software design goal listings. JDBC is no exception. Sun felt that the design of JDBC should be very simple, allowing for only one method of completing a task per mechanism. Allowing duplicate functionality only serves to confuse the users of the API.

Use strong, static typing wherever possible

Strong typing allows for more error checking to be done at compile time; also, fewer errors appear at runtime.

Keep the common cases simple

Because, more often than not, the usual SQL calls used by the programmer are simple SELECTs, INSERTs, DELETEs and UPDATEs, these queries should be simple to perform with JDBC. However, more complex SQL statements should also be possible.
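
For instance, such a common case stays simple with a parameterized statement; the friends table, column names and values below are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// Sketch of a "common case" kept simple by JDBC: a parameterized INSERT.
// Connection details and the "friends" table are placeholder assumptions.
public class SimpleInsert {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:mysql://localhost:3306/friendbook";
        try (Connection con = DriverManager.getConnection(url, "user", "password");
             PreparedStatement ps = con.prepareStatement(
                     "INSERT INTO friends (user_id, friend_id) VALUES (?, ?)")) {
            ps.setInt(1, 101);
            ps.setInt(2, 202);
            ps.executeUpdate();
        }
    }
}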

Finally, we decided to proceed with the implementation using Java networking.

For dynamically updating the cache table, we use an MS Access database.

Java has two things: a programming language and a platform.

Java is a high-level programming language that is all of the following:

  • Simple
  • Architecture-neutral
  • Object-oriented
  • Portable
  • Distributed
  • High-performance
  • Interpreted
  • Multithreaded
  • Robust
  • Dynamic
  • Secure

Java is also unusual in that each Java program is both compiled and interpreted. With a compiler, you translate a Java program into an intermediate language called Java byte codes, the platform-independent code instructions that are passed to and run on the computer.

Compilation happens just once; interpretation occurs each time the program is executed. The figure illustrates how this works.

6.7 NETWORKING TCP/IP STACK:

The TCP/IP stack is shorter than the OSI one:

TCP is a connection-oriented protocol; UDP (User Datagram Protocol) is a connectionless protocol.

IP datagram’s:

The IP layer provides a connectionless and unreliable delivery system. It considers each datagram independently of the others. Any association between datagrams must be supplied by the higher layers. The IP layer supplies a checksum that includes its own header. The header includes the source and destination addresses. The IP layer handles routing through an Internet. It is also responsible for breaking up large datagrams into smaller ones for transmission and reassembling them at the other end.

UDP:

UDP is also connectionless and unreliable. What it adds to IP is a checksum for the contents of the datagram and port numbers. These are used to give a client/server model – see later.

TCP:

TCP supplies logic to give a reliable connection-oriented protocol above IP. It provides a virtual circuit that two processes can use to communicate.

Internet addresses

In order to use a service, you must be able to find it. The Internet uses an address scheme for machines so that they can be located. The address is a 32 bit integer which gives the IP address.

Network address:

Class A uses 8 bits for the network address with 24 bits left over for other addressing. Class B uses 16 bit network addressing. Class C uses 24 bit network addressing and class D uses all 32.

Subnet address:

Internally, the UNIX network is divided into sub networks. Building 11 is currently on one sub network and uses 10-bit addressing, allowing 1024 different hosts.

Host address:

8 bits are finally used for host addresses within our subnet. This places a limit of 256 machines that can be on the subnet.

Total address:

The 32 bit address is usually written as 4 integers separated by dots.
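
For example, a 32-bit address can be split into its four octets and printed in the familiar dotted notation (the address value below is arbitrary):

// Sketch: print a 32-bit IP address as four dot-separated integers.
public class DottedQuad {
    public static void main(String[] args) {
        long address = 0xC0A8010AL; // example value, equals 192.168.1.10
        System.out.println(((address >> 24) & 0xFF) + "." +
                           ((address >> 16) & 0xFF) + "." +
                           ((address >> 8) & 0xFF) + "." +
                           (address & 0xFF));
    }
}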

Port addresses

A service exists on a host, and is identified by its port. This is a 16 bit number. To send a message to a server, you send it to the port for that service of the host that it is running on. This is not location transparency! Certain of these ports are “well known”.

Sockets:

A socket is a data structure maintained by the system to handle network connections. A socket is created using the socket call. It returns an integer that is like a file descriptor. In fact, under Windows, this handle can be used with the ReadFile and WriteFile functions.

#include <sys/types.h>
#include <sys/socket.h>

/* Creates a socket of the given family, type and protocol and returns a
   descriptor-like handle, or -1 on error. */
int socket(int family, int type, int protocol);

Here “family” will be AF_INET for IP communications, protocol will be zero, and type will depend on whether TCP or UDP is used. Two processes wishing to communicate over a network create a socket each. These are similar to two ends of a pipe – but the actual pipe does not yet exist.
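
The prototype above is the C system call; since this project is implemented in Java, the equivalent client-side step can be sketched with java.net.Socket, where the host name and port below are placeholders:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

// Sketch: a TCP client socket in Java, corresponding to socket()/connect()
// with family AF_INET and a stream (TCP) type. Host and port are placeholders.
public class TcpClientSketch {
    public static void main(String[] args) throws Exception {
        try (Socket socket = new Socket("localhost", 9000);
             PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream()))) {
            out.println("hello");              // send one line to the server
            System.out.println(in.readLine()); // read the server's reply
        }
    }
}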

6.8 JFREE CHART:

JFreeChart is a free 100% Java chart library that makes it easy for developers to display professional quality charts in their applications. JFreeChart’s extensive feature set includes:

A consistent and well-documented API, supporting a wide range of chart types;

A flexible design that is easy to extend, and targets both server-side and client-side applications;

Support for many output types, including Swing components, image files (including PNG and JPEG), and vector graphics file formats (including PDF, EPS and SVG);

JFreeChart is “open source” or, more specifically, free software. It is distributed under the terms of the GNU Lesser General Public Licence (LGPL), which permits use in proprietary applications.
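
As a small example of the API described above (class names from the JFreeChart 1.0.x release series; the dataset values are made up), the sketch below builds a time series dataset, creates a chart with ChartFactory.createTimeSeriesChart, and writes it to a PNG file.

import java.io.File;
import org.jfree.chart.ChartFactory;
import org.jfree.chart.ChartUtilities;
import org.jfree.chart.JFreeChart;
import org.jfree.data.time.Day;
import org.jfree.data.time.TimeSeries;
import org.jfree.data.time.TimeSeriesCollection;

// Sketch (JFreeChart 1.0.x API assumed): build a small time series,
// create a chart, and save it as a PNG image file.
public class TimeSeriesChartSketch {
    public static void main(String[] args) throws Exception {
        TimeSeries series = new TimeSeries("Daily activity score");
        series.add(new Day(1, 1, 2015), 3.2);
        series.add(new Day(2, 1, 2015), 4.8);
        series.add(new Day(3, 1, 2015), 4.1);
        TimeSeriesCollection dataset = new TimeSeriesCollection(series);

        JFreeChart chart = ChartFactory.createTimeSeriesChart(
                "Sample Time Series", "Day", "Score", dataset, true, false, false);
        ChartUtilities.saveChartAsPNG(new File("chart.png"), chart, 600, 400);
    }
}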

 

6.8.1. Map Visualizations:

Charts showing values that relate to geographical areas. Some examples include: (a) population density in each state of the United States, (b) income per capita for each country in Europe, (c) life expectancy in each country of the world. The tasks in this project include: Sourcing freely redistributable vector outlines for the countries of the world, states/provinces in particular countries (USA in particular, but also other areas);

Creating an appropriate dataset interface (plus default implementation), a renderer, and integrating this with the existing XYPlot class in JFreeChart; testing, documenting, testing some more, documenting some more.

6.8.2. Time Series Chart Interactivity

Implement a new (to JFreeChart) feature for interactive time series charts — to display a separate control that shows a small version of ALL the time series data, with a sliding “view” rectangle that allows you to select the subset of the time series data to display in the main chart.

6.8.3. Dashboards

There is currently a lot of interest in dashboard displays. Create a flexible dashboard mechanism that supports a subset of JFreeChart chart types (dials, pies, thermometers, bars, and lines/time series) that can be delivered easily via both Java Web Start and an applet.

 

6.8.4. Property Editors

The property editor mechanism in JFreeChart only handles a small subset of the properties that can be set for charts. Extend (or reimplement) this mechanism to provide greater end-user control over the appearance of the charts.

CHAPTER 7

7.0 APPENDIX

7.1 SAMPLE SCREEN SHOTS:

INDEX PAGE:

ADMIN LOGIN:

ADMIN HOME PAGE:

USER LIST:

NEW USER REGISTRATION:

USER LOGIN:

USER HOME PAGE:

ADDING FRIENDS:

MY FRIENDS LIST:

RECOMMEND SITES FROM FRIENDS:

INDEX PAGE:


7.2 SAMPLE SOURCE CODE:

CHAPTER 8

8.1 CONCLUSION & FUTURE ENHANCEMENT:

In this paper, we presented the design and implementation of Friendbook, a semantic-based friend recommendation system for social networks. Different from the friend recommendation mechanisms relying on social graphs in existing social networking services, Friendbook extracted life styles from user-centric data collected from sensors on the smartphone and recommended potential friends to users if they share similar life styles. We implemented Friendbook on Android-based smartphones, and evaluated its performance in both small-scale experiments and large-scale simulations. The results showed that the recommendations accurately reflect the preferences of users in choosing friends. Beyond the current prototype, the future work can be four-fold. First, we would like to evaluate our system on large-scale field experiments. Second, we intend to implement the life style extraction using LDA and the iterative matrix-vector multiplication method in user impact ranking incrementally, so that Friendbook would be scalable to large-scale systems. Third, the similarity threshold used for the friend-matching graph is fixed in our current prototype of Friendbook.

We will explore adapting the threshold for each edge and see whether it can better represent the similarity relationships on the friend-matching graph. Lastly, we plan to incorporate more sensors on the mobile phones into the system and also utilize information from wearable devices (e.g., Fitbit, iWatch, Google Glass, Nike+, and Galaxy Gear) to discover more interesting and meaningful life styles. For example, we can incorporate the sensor data source from Fitbit, which extracts the user’s daily fitness infograph, and the user’s places of interest from GPS traces to generate an infograph of the user as a “document”. From the infograph, one can easily visualize a user’s life style, which will make the recommendation more meaningful. Actually, we expect to incorporate Friendbook into existing social services (e.g., Facebook, Twitter, LinkedIn) so that Friendbook can utilize more information for life style discovery, which should improve the recommendation experience in the future.

Energy Efficient Virtual Network Embedding for Cloud Networks

In this paper, we propose an energy efficient virtual network embedding (EEVNE) approach for cloud computing networks, where power savings are introduced by consolidating resources in the network and data centers. We model our approach in an IP over WDM network using mixed integer linear programming (MILP). The performance of the EEVNE approach is compared with two approaches from the literature: the bandwidth cost approach (CostVNE) and the energy aware approach (VNE-EA). The CostVNE approach optimizes the use of available bandwidth, while the VNE-EA approach minimizes the power consumption by reducing the number of activated nodes and links without taking into account the granular power consumption of the data centers and the different network devices.

The results show that the EEVNE model achieves a maximum power saving of 60% (average 20%) compared to the CostVNE model under an energy inefficient data center power profile. We develop a heuristic, real-time energy optimized VNE (REOViNE), with power savings approaching those of the EEVNE model. We also compare the different approaches under an energy efficient data center power profile. Furthermore, we study the impact of delay and node location constraints on the energy efficiency of virtual network embedding. We also show how VNE can impact the design of optimally located data centers for minimal power consumption in cloud networks. Finally, we examine the power savings and spectral efficiency benefits that VNE offers in optical orthogonal frequency division multiplexing (OFDM) networks.

  1. INTRODUCTION:

The ever growing uptake of cloud computing as a widely accepted computing paradigm calls for novel architectures to support QoS and energy efficiency in networks and data centers. Estimates indicate that, in the long term, if current trends continue, the annual energy bill paid by data center operators will exceed the cost of equipment. Given the ecological and economic impact, both academia and industry are focusing efforts on developing energy efficient paradigms for cloud computing. It has been argued that the success of future cloud networks, where clients are expected to be able to specify the data rate and processing requirements for hosted applications and services, will greatly depend on network virtualization. The form of cloud computing service offering under study here is Infrastructure as a Service (IaaS). IaaS is the delivery of virtualized and dynamically scalable computing power, storage and networking on demand to clients on a pay as you go basis.

Network virtualization allows multiple heterogeneous virtual network architectures (comprising virtual nodes and links) to coexist on a shared physical platform, known as the substrate network which is owned and operated by an infrastructure provider (InP) or cloud service provider whose aim is to earn a profit from leasing network resources to its customers (Service Providers (SPs)). It provides scalability, customised and on demand allocation of resources and the promise of efficient use of network resources. Network virtualization is therefore a strong proponent for the realization of an efficient IaaS framework in cloud networks. InPs should have a resource allocation framework that reserves and allocates physical resources to elements such as virtual nodes and virtual links. Resource allocation is done using a class of algorithms commonly known as “virtual network embedding (VNE)” algorithms. The dynamic mapping of virtual resources onto the physical hardware maximizes the benefits gained from existing hardware. The VNE problem can be either Offline or Online. In offline problems all the virtual network requests (VNRs) are known and scheduled in advance while for the online problem, VNRs arrive dynamically and can stay in the network for an arbitrary duration.

Both online and offline problems are known to be NP-hard. With constraints on virtual nodes and links, the offline VNE problem can be reduced to the NP-hard multiway separator problem; as a result, most of the work done in this area has focused on the design of heuristic algorithms and the use of networks with minimal complexity when solving mixed integer linear programming (MILP) models. Network virtualization has been proposed as an enabler of energy savings by means of resource consolidation. In all these proposals, the VNE models and/or algorithms do not address the link embedding problem as a multi-layer problem spanning from the virtualization layer through the IP layer and all the way to the optical layer. With one exception, prior proposals also do not consider the power consumption of network ports/links as being related to the actual traffic passing through them.

On the contrary, we take a very generic, detailed and accurate approach towards energy efficient VNE (EEVNE) where we allow the model to decide the optimum approach to minimize the total network and data center server power consumption. We consider the granular power consumption of the various network elements that form the network engine in backbone networks as well as the power consumption in data centers. We develop a MILP model and a real-time heuristic to represent the EEVNE approach for clouds in IP over WDM networks with data centers. We study the energy efficiency considering two different power consumption profiles for servers in data centers: an energy inefficient power profile and an energy efficient power profile. Our work also investigates the impact of location and delay constraints in a practical enterprise solution of VNE in clouds. Furthermore, we show how VNE can impact the design problem of optimally locating data centers for minimal power consumption in cloud networks.

  1. LITERATURE SURVEY:

RESOURCE ALLOCATION IN A NETWORK-BASED CLOUD COMPUTING ENVIRONMENT: DESIGN CHALLENGES

AUTHOR: M. A. Sharkh, M. Jammal, A. Shami, and A. Ouda

PUBLISH: IEEE Commun. Mag., vol. 51, no. 11, pp. 46–52, 2013.

EXPLANATION:

Cloud computing is a utility computing paradigm that has become a solid base for a wide array of enterprise and end-user applications. Providers offer varying service portfolios that differ in resource configurations and provided services. A comprehensive solution for resource allocation is fundamental to any cloud computing service provider. Any resource allocation model has to consider computational resources as well as network resources to accurately reflect practical demands. Another aspect that should be considered while provisioning resources is energy consumption. This aspect is getting more attention from industrial and government parties. Calls for the support of green clouds are gaining momentum. With that in mind, resource allocation algorithms aim to accomplish the task of scheduling virtual machines on the servers residing in data centers and consequently scheduling network resources while complying with the problem constraints. Several external and internal factors that affect the performance of resource allocation models are introduced in this article. These factors are discussed in detail, and research gaps are pointed out. Design challenges are discussed with the aim of providing a reference to be used when designing a comprehensive energy-aware resource allocation model for cloud computing data centers.

DISTRIBUTED ENERGY EFFICIENT CLOUDS OVER CORE NETWORKS

AUTHOR: A. Q. Lawey, T. E. H. El-Gorashi, and J. M. H. Elmirghani

PUBLISH: IEEE J. Lightw. Technol., vol. 32, no. 7, pp. 1261–1281, Jan. 2014.

EXPLANATION:

In this paper, we introduce a framework for designing energy efficient cloud computing services over non-bypass IP/WDM core networks. We investigate network related factors including the centralization versus distribution of clouds and the impact of demand, content popularity and access frequency on the clouds placement, and cloud capability factors including the number of servers, switches and routers and amount of storage required in each cloud. We study the optimization of three cloud services: cloud content delivery, storage as a service (StaaS), and virtual machines (VMS) placement for processing applications. First, we develop a mixed integer linear programming (MILP) model to optimize cloud content delivery services. Our results indicate that replicating content into multiple clouds based on content popularity yields 43% total saving in power consumption compared to power un-aware centralized content delivery. Based on the model insights, we develop an energy efficient cloud content delivery heuristic, DEER-CD, with comparable power efficiency to the MILP results. Second, we extend the content delivery model to optimize StaaS applications. The results show that migrating content according to its access frequency yields up to 48% network power savings compared to serving content from a single central location. Third, we optimize the placement of VMs to minimize the total power consumption. Our results show that slicing the VMs into smaller VMs and placing them in proximity to their users saves 25% of the total power compared to a single virtualized cloud scenario. We also develop a heuristic for real time VM placement (DEER-VM) that achieves comparable power savings.

Reducing power consumption in embedding virtual infrastructures

AUTHOR: B. Wang, X. Chang, J. Liu, and J. K. Muppala

PUBLISH: Proc. IEEE Globecom Workshops, Dec. 3–7, 2012, pp. 714–718.

EXPLANATION:

Network virtualization is considered to be not only an enabler to overcome the inflexibility of the current Internet infrastructure but also an enabler to achieve an energy-efficient Future Internet. Virtual network embedding (VNE) is a critical issue in network virtualization technology. This paper explores a joint power-aware node and link resource allocation approach to handle the VNE problem with the objective of minimizing energy consumption. We first present a generalized power consumption model of embedding a VN. Then we formulate the problem as a mixed integer program and propose embedding algorithms. Simulation results demonstrate that the proposed algorithms perform better than the existing algorithms in terms of the power consumption in the overprovisioned scenarios.

CHAPTER 2

2.0 SYSTEM ANALYSIS

2.1 EXISTING SYSTEM:

Existing methods for disaster-resilient optical data center networks use integer linear programming (ILP) and heuristics to address content placement, routing, and protection of the network and content for geographically distributed cloud services delivered over optical networks. Related models and heuristics have been developed to minimize the delay and power consumption of clouds over IP/WDM networks. Other authors exploited anycast routing by intelligently selecting destinations and routes for user traffic served by clouds over optical networks, as opposed to unicast traffic, while switching off unused network elements. A unified, online, and weighted routing and scheduling algorithm has also been presented for a typical optical cloud infrastructure, considering the energy consumption of the network and IT resources.

Other authors provided an optimization-based framework, where the objective functions range from minimizing the energy and bandwidth cost to minimizing the total carbon footprint subject to QoS constraints. Their model decides where to build a data center, how many servers are needed in each data center and how to route requests. In earlier work, we built a MILP model to study the energy efficiency of a public cloud for content delivery over non-bypass IP/WDM core networks. The model optimizes the cloud's external factors, including the location of the cloud in the IP/WDM network and whether the cloud should be centralized or distributed, and the cloud's internal capability factors, including the number of servers, internal LAN switches and routers, and the amount of storage required in each cloud.

2.1.1 DISADVANTAGES:

(i) Studying the impact of small content (storage) size on the energy efficiency of cloud content delivery

(ii) Developing a real time heuristic for energy aware content delivery based on the content delivery model insights,

(iii) Extending the content delivery model to study the Storage as a Service (StaaS) application,

(iv) ILP model for energy aware cloud VM placement and designing a heuristic to mimic the model behaviour in real time.

2.2 PROPOSED SYSTEM:

We developed a MILP model which attempts to minimize the bandwidth cost of embedding a VNR. In contrast, the virtual network embedding energy aware (VNE-EA) model minimizes the energy consumption by imposing the notion that power is saved by switching off substrate links and nodes. The authors of that model also assume that the power saved by switching off a substrate link is the same as the power saved by switching off a substrate node.

Other authors assumed that the power consumption in the network is insensitive to the number of ports used; they also seek to minimize the number of active working nodes and links. Botero and Hesselbach have proposed a model for energy efficiency using load balancing and have also developed a dynamic heuristic that reconfigures the embedding for energy efficiency once it is performed. They have implemented and evaluated their MILP models and heuristic algorithms using the ALEVIN framework. The ALEVIN framework is a good tool for developing, comparing and analyzing VNE algorithms.

The performance of the EEVNE approach is compared with two approaches from the literature: the bandwidth cost approach (CostVNE) and the energy aware approach (VNE-EA). The CostVNE approach optimizes the use of available bandwidth, while the VNE-EA approach minimizes the power consumption by reducing the number of activated nodes and links without taking into account the granular power consumption of the data centers and the different network devices.

The results show that the EEVNE model achieves a maximum power saving of 60% (average 20%) compared to the CostVNE model under an energy inefficient data center power profile. We develop a heuristic, real-time energy optimized VNE (REOViNE), with power savings approaching those of the EEVNE model.

2.2.1 ADVANTAGES:

We are however unable to compare our model and heuristic to the implemented algorithms on the platform for the following reasons:

1. Our input parameters are not compatible with the existing models and algorithms on the platform. Extensive extensions to the algorithms and models would be needed for them to include the optical layer. Our parameters include, among others, the distance in km of each link (used to determine the number of EDFAs or regenerators needed on the link; see the sketch after this list), the wavelength rate, the number of wavelengths in a fiber, and the power consumption of EDFAs, transponders, regenerators, router ports, optical cross connects, multiplexers, de-multiplexers, etc.

2. The assumptions made in the calculation of power in our model and in the models on the platform are different. We define the power consumption at a fine granularity to include the power consumed due to traffic on each element that forms the network engine. One of our main contributions in this work is the inclusion of the optical layer in link embedding, which is currently not supported by any of the algorithms on the ALEVIN platform.
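As an illustration of how the link distance parameter feeds into the power model, the sketch below estimates the number of EDFAs on a fiber link from its length. The 80 km amplifier span and the floor-based formula are common approximations in IP over WDM studies and are stated here only as assumptions; the exact values and expression used in the EEVNE model may differ.

// Illustrative sketch only: estimates the number of EDFAs on a fiber link
// from its length, assuming a fixed amplifier span (80 km is a typical value);
// the exact formula and parameters used in the EEVNE model may differ.
public class EdfaEstimate {

    static final double SPAN_KM = 80.0; // assumed amplifier span in km

    // Common approximation: one in-line EDFA per span minus one, plus one at each end.
    static int edfasOnLink(double linkLengthKm) {
        int inline = (int) Math.floor(linkLengthKm / SPAN_KM - 1);
        return Math.max(inline, 0) + 2;
    }

    public static void main(String[] args) {
        // Example: a 1000 km link needs roughly 13 EDFAs under these assumptions.
        System.out.println(edfasOnLink(1000.0));
    }
}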

Earlier work developed a generalized power consumption model of embedding a VNR and formulated it as a MILP model; however, the authors also assumed that the power consumption of the network ports is independent of traffic. Other authors propose a trade-off between maximizing the number of VNRs that can be accommodated by the InP and minimizing the energy cost of the whole system; they propose embedding requests in regions with the lowest electricity cost.

2.3 HARDWARE & SOFTWARE REQUIREMENTS:

2.3.1 HARDWARE REQUIREMENT:

  • Processor        –  Pentium IV
  • Speed            –  1.1 GHz
  • RAM              –  256 MB (min)
  • Hard Disk        –  20 GB
  • Floppy Drive     –  1.44 MB
  • Keyboard         –  Standard Windows Keyboard
  • Mouse            –  Two or Three Button Mouse
  • Monitor          –  SVGA

 

2.3.2 SOFTWARE REQUIREMENTS:

JAVA

  • Operating System                   :           Windows XP or Win7
  • Front End                                :           JAVA JDK 1.7
  • Document                               :           MS-Office 2007


CHAPTER 3

3.0 SYSTEM DESIGN:

Data Flow Diagram / Use Case Diagram / Flow Diagram:

  • The DFD is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on these data, and the output data generated by the system.
  • The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components: the system processes, the data used by the processes, the external entities that interact with the system, and the information flows in the system.
  • The DFD shows how information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.
  • A DFD may be used to represent a system at any level of abstraction and may be partitioned into levels that represent increasing information flow and functional detail.

NOTATION:

SOURCE OR DESTINATION OF DATA:

External sources or destinations, which may be people or organizations or other entities

DATA SOURCE:

Here the data referenced by a process is stored and retrieved.

PROCESS:

People, procedures or devices that produce data; the physical component is not identified.

DATA FLOW:

Data moves in a specific direction from an origin to a destination. The data flow is a “packet” of data.

MODELING RULES:

There are several common modeling rules when creating DFDs:

  1. All processes must have at least one data flow in and one data flow out.
  2. All processes should modify the incoming data, producing new forms of outgoing data.
  3. Each data store must be involved with at least one data flow.
  4. Each external entity must be involved with at least one data flow.
  5. A data flow must be attached to at least one process.


3.1 ARCHITECTURE DIAGRAM:


3.2 DATAFLOW DIAGRAM:

SENSOR NODE:


MOBILE RELAY NODE:

SINK:


UML DIAGRAMS:

3.2 USE CASE DIAGRAM:


3.3 CLASS DIAGRAM:


3.4 SEQUENCE DIAGRAM:


3.5 ACTIVITY DIAGRAM:

CHAPTER 4

4.0 IMPLEMENTATION:

NSFNET NETWORK:

To evaluate the performance of the proposed model and heuristic, the NSFNET network is used as the substrate network. NSFNET comprises 14 nodes and 21 links as shown in Fig. 6. We consider a scenario in which each node in NSFNET hosts a small data center of 500 servers to offer cloud services. Table I shows the parameters used. The power consumption values of the network devices are consistent with those used in our previous work, from which they are derived. The IP router ports are the most energy consuming devices in the network. We have adopted the Dell PowerEdge R720 [26] server power specifications. We adapted the CostVNE model and the VNE-EA model for the IP over WDM network architecture.

We compare our EEVNE model and the REOViNE heuristic with the other approaches in terms of power consumption and the number of accepted requests. Owing to the objective functions of the two models, the CostVNE model results in the minimum network power consumption, as it optimizes the use of bandwidth of the substrate network by consolidating wavelengths regardless of the number of data centers activated (see Fig. 7(a)). Compared to the EEVNE model, the CostVNE model saves a maximum of 5% (average 3%) of the network power consumption. The EEVNE model, where the energy consumption is minimized by jointly optimizing the use of network resources and consolidating resources in data centers, results in better power savings compared to the VNE-EA model, where the power consumption is minimized by switching off substrate links and nodes. This is because the network power consumption is mainly a function of the number of wavelengths rather than the number of active links, as the number of wavelengths used determines the power consumption of router ports and transponders, the most power consuming devices in the network (see Table I). The REOViNE heuristic approaches the EEVNE model in terms of the network power consumption.

We also examine the power consumption of data centers under the different models and the heuristic. As mentioned above, the CostVNE model does not take into account the number of activated data centers, and therefore it performs very poorly as far as the power consumption in data centers is concerned. However, as the network gets fully loaded and all the data centers are activated, the EEVNE model loses its merit over the CostVNE model. For a limited number of requests, the VNE-EA model performs just as well as the EEVNE model. However, as the number of requests increases, the VNE-EA model tends to route the virtual links through multiple hops to minimize the number of activated links and data centers and therefore consumes more power. The REOViNE heuristic also approaches the EEVNE performance in terms of the data center power consumption.

4.1 ALGORITHM:

VIRTUAL NETWORK EMBEDDING (VNE):

Resource allocation is done using a class of algorithms commonly known as “virtual network embedding (VNE)” algorithms. The dynamic mapping of virtual resources onto the physical hardware maximizes the benefits gained from existing hardware. The VNE problem can be either offline or online. In offline problems all the virtual network requests (VNRs) are known and scheduled in advance, while in the online problem, VNRs arrive dynamically and can stay in the network for an arbitrary duration. Both online and offline problems are known to be NP-hard. With constraints on virtual nodes and links, the offline VNE problem can be reduced to the NP-hard multiway separator problem; as a result, most of the work done in this area has focused on the design of heuristic algorithms and the use of networks with minimal complexity when solving mixed integer linear programming (MILP) models.

Network virtualization has been proposed as an enabler of energy savings by means of resource consolidation. In all these proposals, the VNE models and/or algorithms do not address the link embedding problem as a multi-layer problem spanning from the virtualization layer through the IP layer and all the way to the optical layer. Except for the authors in [14], the others do not consider the power consumption of network ports/links as being related to the actual traffic passing through them. On the contrary, we take a very generic, detailed and accurate approach towards energy efficient VNE (EEVNE) where we allow the model to decide the optimum approach to minimize the total network and data center server power consumption.

We consider the granular power consumption of the various network elements that form the network engine in backbone networks as well as the power consumption in data centers. We develop a MILP model and a real-time heuristic to represent the EEVNE approach for clouds in IP over WDM networks with data centers. We study the energy efficiency considering two different power consumption profiles for servers in data centers: an energy inefficient power profile and an energy efficient power profile. Our work also investigates the impact of location and delay constraints in a practical enterprise solution of VNE in clouds. Furthermore, we show how VNE can impact the design problem of optimally locating data centers for minimal power consumption in cloud networks.
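To make the flow of a VNE heuristic concrete, the sketch below shows a deliberately simplified greedy node-mapping step in Java. It is an illustrative example only, not the REOViNE heuristic or the MILP model: the data structures, the greedy ordering and the omission of link mapping are all simplifying assumptions made for this sketch.

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified greedy node mapping for a virtual network request (VNR):
// each virtual node is placed on the substrate node with the most residual
// CPU that still satisfies the demand. Link mapping (e.g., a shortest path
// with enough residual bandwidth) is omitted for brevity.
class SubstrateNode {
    final int id;
    double residualCpu;
    SubstrateNode(int id, double cpu) { this.id = id; this.residualCpu = cpu; }
}

class VirtualNode {
    final int id;
    final double cpuDemand;
    VirtualNode(int id, double cpuDemand) { this.id = id; this.cpuDemand = cpuDemand; }
}

public class GreedyNodeMapping {

    // Returns a map from virtual node id to substrate node id,
    // or null if the request must be rejected.
    static Map<Integer, Integer> mapNodes(List<VirtualNode> vnr, List<SubstrateNode> substrate) {
        List<VirtualNode> ordered = new ArrayList<VirtualNode>(vnr);
        // Place the most demanding virtual nodes first.
        Collections.sort(ordered, new Comparator<VirtualNode>() {
            public int compare(VirtualNode a, VirtualNode b) {
                return Double.compare(b.cpuDemand, a.cpuDemand);
            }
        });
        Map<Integer, Integer> mapping = new HashMap<Integer, Integer>();
        for (VirtualNode v : ordered) {
            SubstrateNode best = null;
            for (SubstrateNode s : substrate) {
                boolean free = !mapping.containsValue(s.id);
                if (free && s.residualCpu >= v.cpuDemand
                        && (best == null || s.residualCpu > best.residualCpu)) {
                    best = s;
                }
            }
            if (best == null) {
                return null;                     // request rejected
            }
            best.residualCpu -= v.cpuDemand;     // reserve the capacity
            mapping.put(v.id, best.id);
        }
        return mapping;
    }
}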

4.2 MODULES:

SERVER CLIENT MODULE:

VIRTUAL NETWORK EMBEDDING:

MILP MODEL FOR EEVNE:

ENERGY EFFICIENT NETWORKS:

4.3 MODULE DESCRIPTION:

SERVER CLIENT MODULE:

VIRTUAL NETWORK EMBEDDING:

MILP MODEL FOR EEVNE:

ENERGY EFFICIENT NETWORKS:

CHAPTER 5

5.0 SYSTEM STUDY:

5.1 FEASIBILITY STUDY:

The feasibility of the project is analyzed in this phase and a business proposal is put forth with a very general plan for the project and some cost estimates. During system analysis the feasibility study of the proposed system is carried out. This is to ensure that the proposed system is not a burden to the company. For feasibility analysis, some understanding of the major requirements for the system is essential.

Three key considerations involved in the feasibility analysis are 

  • ECONOMICAL FEASIBILITY
  • TECHNICAL FEASIBILITY
  • SOCIAL FEASIBILITY

5.1.1 ECONOMICAL FEASIBILITY:     

This study is carried out to check the economic impact that the system will have on the organization. The amount of funds that the company can pour into the research and development of the system is limited. The expenditures must be justified. The developed system is well within the budget; this was achieved because most of the technologies used are freely available. Only the customized products had to be purchased.

 

5.1.2 TECHNICAL FEASIBILITY   

This study is carried out to check the technical feasibility, that is, the technical requirements of the system. Any system developed must not place a high demand on the available technical resources, as this would in turn place high demands on the client. The developed system must have modest requirements, as only minimal or no changes are required for implementing this system.

5.1.3 SOCIAL FEASIBILITY:  

This aspect of the study checks the level of acceptance of the system by the user. This includes the process of training the user to use the system efficiently. The user must not feel threatened by the system, but must instead accept it as a necessity. The level of acceptance by the users solely depends on the methods that are employed to educate the user about the system and to make him familiar with it. His level of confidence must be raised so that he is also able to make some constructive criticism, which is welcomed, as he is the final user of the system.

5.2 SYSTEM TESTING:

Testing is a process of checking whether the developed system works according to the original objectives and requirements. It is a set of activities that can be planned in advance and conducted systematically. Testing is vital to the success of the system. System testing makes a logical assumption that if all the parts of the system are correct, the goal will be successfully achieved. Inadequate testing, or no testing at all, leads to errors that may not appear until many months later.

This creates two problems: the time lag between the cause and the appearance of the problem, and the effect of the system errors on the files and records within the system. A small system error can conceivably explode into a much larger problem. Effective testing early in the process translates directly into long term cost savings from a reduced number of errors. Another reason for system testing is its utility as a user-oriented vehicle before implementation. The best program is worthless if it does not produce the correct outputs.

5.2.1 UNIT TESTING:

Description: Test for application window properties.
Expected result: All the properties of the windows are to be properly aligned and displayed.

Description: Test for mouse operations.
Expected result: All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions.

A program represents the logical elements of a system. For a program to run satisfactorily, it must compile and test data correctly and tie in properly with other programs. Achieving an error free program is the responsibility of the programmer. Program testing checks for two types of errors: syntax and logical. A syntax error is a program statement that violates one or more rules of the language in which it is written. An improperly defined field dimension or omitted keywords are common syntax errors. These errors are shown through error messages generated by the computer. For logic errors, the programmer must examine the output carefully.

5.1.2 FUNCTIONAL TESTING:

Functional testing of an application is used to prove that the application delivers correct results, using enough inputs to give an adequate level of confidence that it will work correctly for all sets of inputs. The functional testing will need to prove that the application works for each client type and that the personalization functions work correctly. When a program is tested, the actual output is compared with the expected output. When there is a discrepancy, the sequence of instructions must be traced to determine the problem. The process is facilitated by breaking the program into self-contained portions, each of which can be checked at certain key points. The idea is to compare program values against desk-calculated values to isolate the problems.

Description: Test for all modules.
Expected result: All peers should communicate in the group.

Description: Test for various peers in a distributed network framework as it displays all users available in the group.
Expected result: The result after execution should give the accurate result.


5.1. 3 NON-FUNCTIONAL TESTING:

Non-functional software testing encompasses a rich spectrum of testing strategies, describing the expected results for every test case. It uses symbolic analysis techniques. This testing is used to check that an application will work in the operational environment. Non-functional testing includes:

  • Load testing
  • Performance testing
  • Usability testing
  • Reliability testing
  • Security testing

5.1.4 LOAD TESTING:

An important tool for implementing system tests is a load generator. A load generator is essential for testing quality requirements such as performance and stress. A load can be a real load, that is, the system can be put under real usage by having actual telephone users connected to it. They will generate test input data for the system test.

Description: It is necessary to ascertain that the application behaves correctly under loads when a ‘Server busy’ response is received.
Expected result: Should designate another active node as a Server.


5.1.5 PERFORMANCE TESTING:

Performance tests are utilized in order to determine the widely defined performance of the software system such as execution time associated with various parts of the code, response time and device utilization. The intent of this testing is to identify weak points of the software system and quantify its shortcomings.

Description: This is required to assure that the application performs adequately, having the capability to handle many peers, delivering its results in the expected time and using an acceptable level of resources; it is an aspect of operational management.
Expected result: Should handle large input values and produce accurate results in the expected time.


5.1.6 RELIABILITY TESTING:

Software reliability is the ability of a system or component to perform its required functions under stated conditions for a specified period of time, and it is ensured in this testing. Reliability can be expressed as the ability of the software to reveal defects under testing conditions, according to the specified requirements. It is the probability that a software system will operate without failure under given conditions for a given time interval, and it focuses on the behavior of the software element. It forms a part of software quality control.

Description: This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in providing the application.
Expected result: In case of failure of the server, an alternate server should take over the job.


5.1.7 SECURITY TESTING:

Security testing evaluates system characteristics that relate to the availability, integrity and confidentiality of the system data and services. Users/Clients should be encouraged to make sure their security needs are very clearly known at requirements time, so that the security issues can be addressed by the designers and testers.

Description: Checking that the user identification is authenticated.
Expected result: In case of failure it should not be connected in the framework.

Description: Check whether group keys in a tree are shared by all peers.
Expected result: The peers should know the group key in the same group.


5.1.8 WHITE BOX TESTING:

White box testing, sometimes called glass-box testing, is a test case design method that uses the control structure of the procedural design to derive test cases. White box testing focuses on the inner structure of the software to be tested.

Description: Exercise all logical decisions on their true and false sides.
Expected result: All the logical decisions must be valid.

Description: Execute all loops at their boundaries and within their operational bounds.
Expected result: All the loops must be finite.

Description: Exercise internal data structures to ensure their validity.
Expected result: All the data structures must be valid.


5.1.9 BLACK BOX TESTING:

Black box testing, also called behavioral testing, focuses on the functional requirements of the software. That is, black box testing enables the software engineer to derive sets of input conditions that will fully exercise all functional requirements for a program. Black box testing is not an alternative to white box techniques. Rather, it is a complementary approach that is likely to uncover a different class of errors than white box methods. Black box testing attempts to find errors with a focus on the inputs, outputs, and principal functions of a software module. The starting point of black box testing is either a specification or code. The contents of the box are hidden, and the stimulated software should produce the desired results.

Description: To check for incorrect or missing functions.
Expected result: All the functions must be valid.

Description: To check for interface errors.
Expected result: The entire interface must function normally.

Description: To check for errors in data structures or external database access.
Expected result: Database update and retrieval must be done correctly.

Description: To check for initialization and termination errors.
Expected result: All the functions and data structures must be initialized properly and terminated normally.

All the above system testing strategies are carried out, as the development, documentation and institutionalization of the proposed goals and related policies are essential.

CHAPTER 6

6.0 SOFTWARE DESCRIPTION:

 

6.1 JAVA TECHNOLOGY:

Java technology is both a programming language and a platform.

 

The Java Programming Language

 

The Java programming language is a high-level language that can be characterized by all of the following buzzwords:

  • Simple
    • Architecture neutral
    • Object oriented
    • Portable
    • Distributed     
    • High performance
    • Interpreted     
    • Multithreaded
    • Robust
    • Dynamic
    • Secure     

With most programming languages, you either compile or interpret a program so that you can run it on your computer. The Java programming language is unusual in that a program is both compiled and interpreted. With the compiler, first you translate a program into an intermediate language called Java byte codes —the platform-independent codes interpreted by the interpreter on the Java platform. The interpreter parses and runs each Java byte code instruction on the computer. Compilation happens just once; interpretation occurs each time the program is executed. The following figure illustrates how this works.

You can think of Java byte codes as the machine code instructions for the Java Virtual Machine (Java VM). Every Java interpreter, whether it’s a development tool or a Web browser that can run applets, is an implementation of the Java VM. Java byte codes help make “write once, run anywhere” possible. You can compile your program into byte codes on any platform that has a Java compiler. The byte codes can then be run on any implementation of the Java VM. That means that as long as a computer has a Java VM, the same program written in the Java programming language can run on Windows 2000, a Solaris workstation, or on an iMac.
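As a minimal illustration of the compile-then-interpret workflow described above, the following class is compiled once into platform-independent byte codes and then interpreted by any Java VM; the class and file names are arbitrary.

// Compile once:  javac HelloWorld.java   (produces HelloWorld.class byte codes)
// Run anywhere:  java HelloWorld         (the Java VM interprets the byte codes)
public class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello from the Java platform");
    }
}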

6.2 THE JAVA PLATFORM:

A platform is the hardware or software environment in which a program runs. We’ve already mentioned some of the most popular platforms like Windows 2000, Linux, Solaris, and MacOS. Most platforms can be described as a combination of the operating system and hardware. The Java platform differs from most other platforms in that it’s a software-only platform that runs on top of other hardware-based platforms.

The Java platform has two components:

  • The Java Virtual Machine (Java VM)
  • The Java Application Programming Interface (Java API)

You’ve already been introduced to the Java VM. It’s the base for the Java platform and is ported onto various hardware-based platforms.

The Java API is a large collection of ready-made software components that provide many useful capabilities, such as graphical user interface (GUI) widgets. The Java API is grouped into libraries of related classes and interfaces; these libraries are known as packages. The next section, What Can Java Technology Do?, highlights the functionality provided by some of the packages in the Java API.

The following figure depicts a program that’s running on the Java platform. As the figure shows, the Java API and the virtual machine insulate the program from the hardware.

Native code is code that, after compilation, runs on a specific hardware platform. As a platform-independent environment, the Java platform can be a bit slower than native code. However, smart compilers, well-tuned interpreters, and just-in-time byte code compilers can bring performance close to that of native code without threatening portability.

6.3 WHAT CAN JAVA TECHNOLOGY DO?

The most common types of programs written in the Java programming language are applets and applications. If you’ve surfed the Web, you’re probably already familiar with applets. An applet is a program that adheres to certain conventions that allow it to run within a Java-enabled browser.

However, the Java programming language is not just for writing cute, entertaining applets for the Web. The general-purpose, high-level Java programming language is also a powerful software platform. Using the generous API, you can write many types of programs.

An application is a standalone program that runs directly on the Java platform. A special kind of application known as a server serves and supports clients on a network. Examples of servers are Web servers, proxy servers, mail servers, and print servers. Another specialized program is a servlet.

A servlet can almost be thought of as an applet that runs on the server side. Java Servlets are a popular choice for building interactive web applications, replacing the use of CGI scripts. Servlets are similar to applets in that they are runtime extensions of applications. Instead of working in browsers, though, servlets run within Java Web servers, configuring or tailoring the server.
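A minimal servlet, using the standard javax.servlet API described above, might look like the sketch below; the class name and the response content are arbitrary examples, and the servlet still has to be deployed in and mapped by a Java web server.

import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// The web server instantiates the servlet and calls doGet() for each
// matching HTTP GET request.
public class HelloServlet extends HttpServlet {
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        response.setContentType("text/html");
        PrintWriter out = response.getWriter();
        out.println("<html><body>Hello from a servlet</body></html>");
    }
}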

How does the API support all these kinds of programs? It does so with packages of software components that provide a wide range of functionality. Every full implementation of the Java platform gives you the following features:

  • The essentials: Objects, strings, threads, numbers, input and output, data structures, system properties, date and time, and so on.
  • Applets: The set of conventions used by applets.
  • Networking: URLs, TCP (Transmission Control Protocol), UDP (User Data gram Protocol) sockets, and IP (Internet Protocol) addresses.
  • Internationalization: Help for writing programs that can be localized for users worldwide. Programs can automatically adapt to specific locales and be displayed in the appropriate language.
  • Security: Both low level and high level, including electronic signatures, public and private key management, access control, and certificates.
  • Software components: Known as JavaBeansTM, can plug into existing component architectures.
  • Object serialization: Allows lightweight persistence and communication via Remote Method Invocation (RMI).
  • Java Database Connectivity (JDBCTM): Provides uniform access to a wide range of relational databases.

The Java platform also has APIs for 2D and 3D graphics, accessibility, servers, collaboration, telephony, speech, animation, and more. The following figure depicts what is included in the Java 2 SDK.

 

6.4 HOW WILL JAVA TECHNOLOGY CHANGE MY LIFE?

We can’t promise you fame, fortune, or even a job if you learn the Java programming language. Still, it is likely to make your programs better and requires less effort than other languages. We believe that Java technology will help you do the following:

  • Get started quickly: Although the Java programming language is a powerful object-oriented language, it’s easy to learn, especially for programmers already familiar with C or C++.
  • Write less code: Comparisons of program metrics (class counts, method counts, and so on) suggest that a program written in the Java programming language can be four times smaller than the same program in C++.
  • Write better code: The Java programming language encourages good coding practices, and its garbage collection helps you avoid memory leaks. Its object orientation, its JavaBeans component architecture, and its wide-ranging, easily extendible API let you reuse other people’s tested code and introduce fewer bugs.
  • Develop programs more quickly: Your development time may be as much as twice as fast versus writing the same program in C++. Why? You write fewer lines of code and it is a simpler programming language than C++.
  • Avoid platform dependencies with 100% Pure Java: You can keep your program portable by avoiding the use of libraries written in other languages. The 100% Pure JavaTM Product Certification Program has a repository of historical process manuals, white papers, brochures, and similar materials online.
  • Write once, run anywhere: Because 100% Pure Java programs are compiled into machine-independent byte codes, they run consistently on any Java platform.
  • Distribute software more easily: You can upgrade applets easily from a central server. Applets take advantage of the feature of allowing new classes to be loaded “on the fly,” without recompiling the entire program.

 

6.5 ODBC:

 

Microsoft Open Database Connectivity (ODBC) is a standard programming interface for application developers and database systems providers. Before ODBC became a de facto standard for Windows programs to interface with database systems, programmers had to use proprietary languages for each database they wanted to connect to. Now, ODBC has made the choice of the database system almost irrelevant from a coding perspective, which is as it should be. Application developers have much more important things to worry about than the syntax that is needed to port their program from one database to another when business needs suddenly change.

Through the ODBC Administrator in Control Panel, you can specify the particular database that is associated with a data source that an ODBC application program is written to use. Think of an ODBC data source as a door with a name on it. Each door will lead you to a particular database. For example, the data source named Sales Figures might be a SQL Server database, whereas the Accounts Payable data source could refer to an Access database. The physical database referred to by a data source can reside anywhere on the LAN.

The ODBC system files are not installed on your system by Windows 95. Rather, they are installed when you set up a separate database application, such as SQL Server Client or Visual Basic 4.0. When the ODBC icon is installed in Control Panel, it uses a file called ODBCINST.DLL. It is also possible to administer your ODBC data sources through a stand-alone program called ODBCADM.EXE. There is a 16-bit and a 32-bit version of this program, and each maintains a separate list of ODBC data sources.

From a programming perspective, the beauty of ODBC is that the application can be written to use the same set of function calls to interface with any data source, regardless of the database vendor. The source code of the application doesn’t change whether it talks to Oracle or SQL Server. We only mention these two as an example. There are ODBC drivers available for several dozen popular database systems. Even Excel spreadsheets and plain text files can be turned into data sources. The operating system uses the Registry information written by ODBC Administrator to determine which low-level ODBC drivers are needed to talk to the data source (such as the interface to Oracle or SQL Server). The loading of the ODBC drivers is transparent to the ODBC application program. In a client/server environment, the ODBC API even handles many of the network issues for the application programmer.

The advantages of this scheme are so numerous that you are probably thinking there must be some catch. The only disadvantage of ODBC is that it isn’t as efficient as talking directly to the native database interface. ODBC has had many detractors make the charge that it is too slow. Microsoft has always claimed that the critical factor in performance is the quality of the driver software that is used. In our humble opinion, this is true. The availability of good ODBC drivers has improved a great deal recently. And anyway, the criticism about performance is somewhat analogous to those who said that compilers would never match the speed of pure assembly language. Maybe not, but the compiler (or ODBC) gives you the opportunity to write cleaner programs, which means you finish sooner. Meanwhile, computers get faster every year.

6.6 JDBC:

In an effort to set an independent database standard API for Java; Sun Microsystems developed Java Database Connectivity, or JDBC. JDBC offers a generic SQL database access mechanism that provides a consistent interface to a variety of RDBMSs. This consistent interface is achieved through the use of “plug-in” database connectivity modules, or drivers. If a database vendor wishes to have JDBC support, he or she must provide the driver for each platform that the database and Java run on.

To gain a wider acceptance of JDBC, Sun based JDBC’s framework on ODBC. As you discovered earlier in this chapter, ODBC has widespread support on a variety of platforms. Basing JDBC on ODBC will allow vendors to bring JDBC drivers to market much faster than developing a completely new connectivity solution.

JDBC was announced in March of 1996. It was released for a 90 day public review that ended June 8, 1996. Because of user input, the final JDBC v1.0 specification was released soon after.

The remainder of this section will cover enough information about JDBC for you to know what it is about and how to use it effectively. This is by no means a complete overview of JDBC. That would fill an entire book.

 

6.7 JDBC Goals:

Few software packages are designed without goals in mind. JDBC is one that, because of its many goals, drove the development of the API. These goals, in conjunction with early reviewer feedback, have finalized the JDBC class library into a solid framework for building database applications in Java.

The goals that were set for JDBC are important. They will give you some insight as to why certain classes and functionalities behave the way they do. The eight design goals for JDBC are as follows:

SQL Level API

The designers felt that their main goal was to define a SQL interface for Java. Although not the lowest database interface level possible, it is at a low enough level for higher-level tools and APIs to be created. Conversely, it is at a high enough level for application programmers to use it confidently. Attaining this goal allows for future tool vendors to “generate” JDBC code and to hide many of JDBC’s complexities from the end user.

SQL Conformance

SQL syntax varies as you move from database vendor to database vendor. In an effort to support a wide variety of vendors, JDBC will allow any query statement to be passed through it to the underlying database driver. This allows the connectivity module to handle non-standard functionality in a manner that is suitable for its users.

JDBC must be implemental on top of common database interfaces

The JDBC SQL API must “sit” on top of other common SQL level APIs. This goal allows JDBC to use existing ODBC level drivers by the use of a software interface. This interface would translate JDBC calls to ODBC and vice versa.

  1. Provide a Java interface that is consistent with the rest of the Java system

Because of Java’s acceptance in the user community thus far, the designers feel that they should not stray from the current design of the core Java system.

  • Keep it simple

This goal probably appears in all software design goal listings. JDBC is no exception. Sun felt that the design of JDBC should be very simple, allowing for only one method of completing a task per mechanism. Allowing duplicate functionality only serves to confuse the users of the API.

  • Use strong, static typing wherever possible

Strong typing allows more error checking to be done at compile time; also, fewer errors appear at runtime.

  • Keep the common cases simple

Because more often than not, the usual SQL calls used by the programmer are simple SELECT’s, INSERT’s, DELETE’s and UPDATE’s, these queries should be simple to perform with JDBC. However, more complex SQL statements should also be possible.

Finally, we decided to proceed with the implementation using Java Networking.

And for dynamically updating the cache table we go for MS Access database.
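A minimal JDBC query against such a table could look like the sketch below. The JDBC URL, data source name, table and column names are placeholders invented for illustration; the actual driver and connection string depend on how the MS Access database is exposed (for example, through an ODBC data source).

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Minimal JDBC SELECT; "jdbc:odbc:cacheDSN", cache_table and its columns are
// placeholder names, and the matching ODBC data source must already exist.
public class SimpleQuery {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:odbc:cacheDSN";
        try (Connection con = DriverManager.getConnection(url);
             PreparedStatement ps = con.prepareStatement(
                     "SELECT name FROM cache_table WHERE id = ?")) {
            ps.setInt(1, 1);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("name"));
                }
            }
        }
    }
}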

Java has two things: a programming language and a platform.

Java is a high-level programming language that is all of the following

  • Simple
  • Architecture-neutral
  • Object-oriented
  • Portable
  • Distributed
  • High-performance
  • Interpreted
  • Multithreaded
  • Robust
  • Dynamic
  • Secure

Java is also unusual in that each Java program is both compiled and interpreted. With the compiler, you translate a Java program into an intermediate language called Java byte codes, the platform-independent instructions that are passed to and run on the computer.

Compilation happens just once; interpretation occurs each time the program is executed. The figure illustrates how this works.

6.7 NETWORKING TCP/IP STACK:

The TCP/IP stack is shorter than the OSI one:

TCP is a connection-oriented protocol; UDP (User Datagram Protocol) is a connectionless protocol.

IP datagram’s:

The IP layer provides a connectionless and unreliable delivery system. It considers each datagram independently of the others. Any association between datagrams must be supplied by the higher layers. The IP layer supplies a checksum that includes its own header. The header includes the source and destination addresses. The IP layer handles routing through an Internet. It is also responsible for breaking up large datagrams into smaller ones for transmission and reassembling them at the other end.

UDP:

UDP is also connectionless and unreliable. What it adds to IP is a checksum for the contents of the datagram and port numbers. These are used to give a client/server model – see later.

TCP:

TCP supplies logic to give a reliable connection-oriented protocol above IP. It provides a virtual circuit that two processes can use to communicate.

Internet addresses

In order to use a service, you must be able to find it. The Internet uses an address scheme for machines so that they can be located. The address is a 32 bit integer which gives the IP address.

Network address:

Class A uses 8 bits for the network address with 24 bits left over for other addressing. Class B uses 16 bit network addressing. Class C uses 24 bit network addressing and class D uses all 32.

Subnet address:

Internally, the UNIX network is divided into sub networks. Building 11 is currently on one sub network and uses 10-bit addressing, allowing 1024 different hosts.

Host address:

8 bits are finally used for host addresses within our subnet. This places a limit of 256 machines that can be on the subnet.

Total address:

The 32 bit address is usually written as 4 integers separated by dots.
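The dotted notation above can be produced with a few bit shifts; the sketch below converts a 32-bit address into its four dot-separated integers (the loopback address is used only as an example).

// Writes a 32-bit IPv4 address as four dot-separated integers.
public class DottedQuad {
    static String toDotted(long address) {
        return ((address >> 24) & 0xFF) + "." + ((address >> 16) & 0xFF) + "."
                + ((address >> 8) & 0xFF) + "." + (address & 0xFF);
    }

    public static void main(String[] args) {
        // 0x7F000001 is the loopback address 127.0.0.1
        System.out.println(toDotted(0x7F000001L));
    }
}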

Port addresses

A service exists on a host, and is identified by its port. This is a 16 bit number. To send a message to a server, you send it to the port for that service of the host that it is running on. This is not location transparency! Certain of these ports are “well known”.

Sockets:

A socket is a data structure maintained by the system to handle network connections. A socket is created using the call socket. It returns an integer that is like a file descriptor. In fact, under Windows, this handle can be used with Read File and Write File functions.

#include <sys/types.h>
#include <sys/socket.h>
int socket(int family, int type, int protocol);

Here “family” will be AF_INET for IP communications, protocol will be zero, and type will depend on whether TCP or UDP is used. Two processes wishing to communicate over a network create a socket each. These are similar to two ends of a pipe – but the actual pipe does not yet exist.
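The call above is the C interface; in Java, which this project uses, the equivalent functionality is provided by the java.net classes. The sketch below is a simple TCP client; the host name and port are placeholder values and assume a server is already listening there.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

// TCP client using java.net.Socket: connect, send one line, read one line back.
public class SimpleTcpClient {
    public static void main(String[] args) throws Exception {
        try (Socket socket = new Socket("localhost", 5000)) {   // placeholder host/port
            PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(socket.getInputStream()));
            out.println("hello over TCP");
            System.out.println(in.readLine());
        }
    }
}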

6.8 JFREE CHART:

JFreeChart is a free 100% Java chart library that makes it easy for developers to display professional quality charts in their applications. JFreeChart’s extensive feature set includes:

A consistent and well-documented API, supporting a wide range of chart types;

A flexible design that is easy to extend, and targets both server-side and client-side applications;

Support for many output types, including Swing components, image files (including PNG and JPEG), and vector graphics file formats (including PDF, EPS and SVG);

JFreeChart is “open source” or, more specifically, free software. It is distributed under the terms of the GNU Lesser General Public Licence (LGPL), which permits use in proprietary applications.
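A minimal use of the library, assuming the standard JFreeChart 1.0 API, builds a dataset, creates a chart and writes it to a PNG file; the chart title, category labels and values below are arbitrary example data.

import java.io.File;
import org.jfree.chart.ChartFactory;
import org.jfree.chart.ChartUtilities;
import org.jfree.chart.JFreeChart;
import org.jfree.data.general.DefaultPieDataset;

// Build a pie dataset, create the chart, and save it as a 600x400 PNG.
public class PieChartDemo {
    public static void main(String[] args) throws Exception {
        DefaultPieDataset dataset = new DefaultPieDataset();
        dataset.setValue("Network power", 60);
        dataset.setValue("Data center power", 40);

        JFreeChart chart = ChartFactory.createPieChart(
                "Example power breakdown", dataset, true, true, false);
        ChartUtilities.saveChartAsPNG(new File("chart.png"), chart, 600, 400);
    }
}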

 

6.8.1. Map Visualizations:

Charts showing values that relate to geographical areas. Some examples include: (a) population density in each state of the United States, (b) income per capita for each country in Europe, (c) life expectancy in each country of the world. The tasks in this project include: Sourcing freely redistributable vector outlines for the countries of the world, states/provinces in particular countries (USA in particular, but also other areas);

Creating an appropriate dataset interface (plus default implementation), a renderer, and integrating this with the existing XYPlot class in JFreeChart; testing, documenting, testing some more, documenting some more.

6.8.2. Time Series Chart Interactivity

Implement a new (to JFreeChart) feature for interactive time series charts — to display a separate control that shows a small version of ALL the time series data, with a sliding “view” rectangle that allows you to select the subset of the time series data to display in the main chart.

6.8.3. Dashboards

There is currently a lot of interest in dashboard displays. Create a flexible dashboard mechanism that supports a subset of JFreeChart chart types (dials, pies, thermometers, bars, and lines/time series) that can be delivered easily via both Java Web Start and an applet.

 

6.8.4. Property Editors

The property editor mechanism in JFreeChart only handles a small subset of the properties that can be set for charts. Extend (or reimplement) this mechanism to provide greater end-user control over the appearance of the charts.

CHAPTER 7

7.0 APPENDIX

7.1 SAMPLE SCREEN SHOTS:

7.2 SAMPLE SOURCE CODE:

CHAPTER 8

8.1 CONCLUSION:

This paper has investigated the energy efficiency of virtual network embedding in IP over WDM networks. We developed a MILP model (EEVNE) and a heuristic (REOViNE) to optimize the use of wavelengths in the network in addition to consolidating the use of resources in data centers. The results show that the EEVNE model achieves a maximum power saving of 60% (average 20%) compared to the CostVNE model, which minimizes the bandwidth cost of embedding a VNR. The EEVNE model also achieves higher power savings compared to the virtual network embedding energy aware (VNE-EA) model from the literature. We have demonstrated that when it comes to energy savings in the network, it is not sufficient to develop models that just turn off links and nodes in the network; it is important to consider all the power consuming devices in the network and then minimize their power consumption as a whole. The REOViNE heuristic's power savings and number of accepted requests approach those of the MILP model. We have also investigated the performance of the models under non uniform load distributions, showing that the EEVNE model has superior power savings under most load conditions.

8.2 FUTURE WORK:

We have gone further to show the energy efficiency of VNE considering an energy efficient data center power profile. The results show that the optimal VNE approach with the minimum power consumption is the one that only minimizes the use of network bandwidth, in this case the CostVNE model. This, however, only applies when it is assumed that the network is not reconfigured when embedding new requests. We have also studied the power savings achieved by removing geographical redundancy constraints when embedding protection and load balancing virtual nodes, and observed that the resulting power savings can guide service providers in determining the cost reductions offered to enterprise customers not requiring full geographical redundancy. We have shown how VNE can impact the optimal locations of data centers for minimal network power consumption in cloud networks. The results show that the selection of a location to host a data center is governed by two factors: the average hop count to other nodes and the client population of the candidate node and its neighbours (assuming a given average rate per user). Finally, we have developed a MILP model for VNE in O-OFDM based cloud networks and shown that it offers improved power and spectral efficiency compared to conventional WDM based networks.

Enabling Fine-grained Multi-keyword Search Supporting Classified Sub-dictionaries over Encrypted Cloud Data

Abstract—Using cloud computing, individuals can store their data on remote servers and allow data access to public users through the cloud servers. As the outsourced data are likely to contain sensitive privacy information, they are typically encrypted before being uploaded to the cloud. This, however, significantly limits the usability of outsourced data due to the difficulty of searching over the encrypted data. In this paper, we address this issue by developing fine-grained multi-keyword search schemes over encrypted cloud data. Our original contributions are three-fold. First, we introduce relevance scores and preference factors upon keywords, which enable precise keyword search and a personalized user experience. Second, we develop a practical and very efficient multi-keyword search scheme. The proposed scheme can support complicated logic search, i.e., the mixed “AND”, “OR” and “NO” operations of keywords. Third, we further employ the classified sub-dictionaries technique to achieve better efficiency in index building, trapdoor generating and query. Lastly, we analyze the security of the proposed schemes in terms of confidentiality of documents, privacy protection of index and trapdoor, and unlinkability of trapdoor. Through extensive experiments using a real-world dataset, we validate the performance of the proposed schemes. Both the security analysis and experimental results demonstrate that the proposed schemes can achieve the same security level as the existing ones and better performance in terms of functionality, query complexity and efficiency.

Index Terms—Searchable encryption, multi-keyword, fine-grained, cloud computing.

1 INTRODUCTION

Cloud computing treats computing as a utility and leases out the computing and storage capacities to the public [1], [2], [3]. In such a framework, an individual can remotely store her data on the cloud server, namely data outsourcing, and then make the cloud data open for public access through the cloud server. This represents a more scalable, low-cost and stable way for public data access because of the scalability and high efficiency of cloud servers, and is therefore favorable to small enterprises.

Note that the outsourced data may contain sensitive privacy information.
It is often necessary to encrypt the private data before transmitting them to the cloud servers [4], [5]. The data encryption, however, would significantly lower the usability of data due to the difficulty of searching over the encrypted data [6]. Simply encrypting the data may still cause other security concerns. For instance, Google Search uses SSL (Secure Sockets Layer) to encrypt the connection between the search user and the Google server when private data, such as documents and emails, appear in the search results [7]. However, if the search user clicks into another website from the search results page, that website may be able to identify the search terms that the user has used.

To address the above issues, searchable encryption (e.g., [8], [9], [10]) has recently been developed as a fundamental approach to enable searching over encrypted cloud data, which proceeds as follows. Firstly, the data owner generates several keywords according to the outsourced data. These keywords are then encrypted and stored at the cloud server. When a search user needs to access the outsourced data, it can select some relevant keywords and send the ciphertext of the selected keywords to the cloud server. The cloud server then uses the ciphertext to match the outsourced encrypted keywords, and lastly returns the matching results to the search user. To achieve search efficiency and precision over encrypted data similar to that of plaintext keyword search, an extensive body of research has been developed in the literature. Wang et al. [11] propose a ranked keyword search scheme which considers the relevance scores of keywords. Unfortunately, due to using order-preserving encryption (OPE) [12] to achieve the ranking property, the proposed scheme cannot achieve unlinkability of trapdoor. Later, Sun et al. [13] propose a multi-keyword text search scheme which considers the relevance scores of keywords and utilizes a multidimensional tree technique to achieve efficient search queries. Yu et al. [14] propose a multi-keyword top-k retrieval scheme which uses fully homomorphic encryption to encrypt the index/trapdoor and guarantees high security. Cao et al. [6] propose a multi-keyword ranked search scheme (MRSE), which applies coordinate matching as the keyword matching rule, i.e., it returns the data with the most matching keywords.

Although many search functionalities have been developed in the previous literature towards precise and efficient searchable encryption, it is still difficult for searchable encryption to achieve the same user experience as that of plaintext search, like Google Search. This is mainly attributed to the following two issues. Firstly, query with user preferences is very popular in plaintext search [15], [16]. It enables personalized search and can more accurately represent the user's requirements, but it has not been thoroughly studied and supported in the encrypted data domain.
Secondly, to further improve the user's search experience, an important and fundamental function is to enable multi-keyword search with comprehensive logic operations, i.e., the “AND”, “OR” and “NO” operations of keywords. This is fundamental for search users to prune the searching space and quickly identify the desired data. Cao et al. [6] propose the coordinate matching search scheme (MRSE), which can be regarded as a searchable encryption scheme with the “OR” operation. Zhang et al. [17] propose a conjunctive keyword search scheme which can be regarded as a searchable encryption scheme with the “AND” operation, with the returned documents matching all keywords. However, most existing proposals can only enable search with a single logic operation, rather than a mixture of multiple logic operations on keywords, which motivates our work.

In this work, we address the above two issues by developing two Fine-grained Multi-keyword Search (FMS) schemes over encrypted cloud data. Our original contributions can be summarized in three aspects as follows.

We introduce the relevance scores and the preference factors of keywords for searchable encryption. The relevance scores of keywords enable more precise returned results, and the preference factors of keywords represent the importance of keywords in the search keyword set specified by search users and correspondingly enable personalized search to cater to specific user preferences. This further improves the search functionality and user experience.

We realize the “AND”, “OR” and “NO” operations in multi-keyword search for searchable encryption. Compared with the schemes in [6], [13] and [14], the proposed scheme achieves more comprehensive functionality and lower query complexity.

We employ the classified sub-dictionaries technique to enhance the efficiency of the above two schemes. Extensive experiments demonstrate that the enhanced schemes can achieve better efficiency in terms of index building, trapdoor generating and query in comparison with the schemes in [6], [13] and [14].

The remainder of this paper is organized as follows. In Section 2, we outline the system model, threat model, security requirements and design goals. In Section 3, we describe the preliminaries of the proposed schemes. We present the developed schemes and the enhanced schemes in detail in Section 4 and Section 5, respectively. Then we carry out the security analysis and performance evaluation in Section 6 and Section 7, respectively. Section 8 reviews the related works and Section 9 concludes the paper.

2 SYSTEM MODEL, THREAT MODEL AND SECURITY REQUIREMENTS

2.1 System Model

As shown in Fig. 1, we consider a system consisting of three entities.

Data owner: The data owner outsources her data to the cloud for convenient and reliable data access by the corresponding search users. To protect the data privacy, the data owner encrypts the original data through symmetric encryption. To improve the search efficiency, the data owner generates some keywords for each outsourced document. The corresponding index is then created according to the keywords and a secret key.
After that, the data owner sends the encrypted documents and the corresponding indexes to the cloud, and sends the symmetric key and secret key to search users.

Cloud server: The cloud server is an intermediate entity which stores the encrypted documents and corresponding indexes received from the data owner, and provides data access and search services to search users. When a search user sends a keyword trapdoor to the cloud server, it returns a collection of matching documents based on certain operations.

Search user: A search user queries the outsourced documents from the cloud server with the following three steps. First, the search user receives both the secret key and the symmetric key from the data owner. Second, according to the search keywords, the search user uses the secret key to generate a trapdoor and sends it to the cloud server. Last, she receives the matching document collection from the cloud server and decrypts the documents with the symmetric key.

Fig. 1. System model

2.2 Threat Model and Security Requirements

In our threat model, the cloud server is assumed to be “honest-but-curious”, which is the same as in most related works on secure cloud data search [13], [14], [6]. Specifically, the cloud server honestly follows the designated protocol specification. However, the cloud server could be “curious” to infer and analyze the data (including the index) in its storage and the message flows received during the protocol so as to learn additional information. We consider two threat models depending on the information available to the cloud server, which are also used in [13], [6].

Known Ciphertext Model: The cloud server can only know the encrypted document collection C and the index collection I, which are both outsourced from the data owner.

Known Background Model: The cloud server can possess more knowledge than what can be accessed in the known ciphertext model, such as the correlation relationship of trapdoors and related statistical information, i.e., the cloud server can possess statistical information from a known comparable dataset which bears a similar nature to the targeted dataset.

Similar to [13], [6], we assume search users are trusted entities, and they share the same symmetric key and secret key. Search users have pre-existing mutual trust with the data owner. For ease of illustration, we do not consider the secure distribution of the symmetric key and the secret key between the data owner and search users; it can be achieved through regular authentication and secure channel establishment protocols based on the prior security context shared between search users and the data owner [18]. In addition, to keep our presentation focused, we do not consider the access control problem of managing the decryption capabilities given to users or the data collection updating problem of inserting new documents, updating existing documents, and deleting existing documents; these are separate issues.
Interested readers may refer to [6], [5], [10], [19] for these issues.

Based on the above threat model, we define the security requirements as follows.

Confidentiality of documents: The outsourced documents provided by the data owner are stored in the cloud server. If they match the search keywords, they are sent to the search user. Due to the privacy of the documents, they should not be identifiable except by the data owner and the authorized search users.

Privacy protection of index and trapdoor: As discussed in Section 2.1, the index and the trapdoor are created based on the documents' keywords and the search keywords, respectively. If the cloud server identifies the content of the index or trapdoor, and further deduces any association between keywords and encrypted documents, it may learn the major subject of a document, even the content of a short document [20]. Therefore, the content of index and trapdoor must not be identifiable by the cloud server.

Unlinkability of trapdoor: The documents stored in the cloud server may be searched many times. The cloud server should not be able to learn any keyword information from the trapdoors, e.g., to determine that two trapdoors originate from the same keywords. Otherwise, the cloud server can deduce the relationship of trapdoors and threaten the privacy of keywords. Hence the trapdoor generation function should be randomized rather than deterministic: even if two search keyword sets are the same, the trapdoors should be different.

3 PRELIMINARIES

In this section, we define the notation and review the secure kNN computation and the relevance score, which serve as the basis of the proposed schemes.

3.1 Notation

F — the document collection to be outsourced, denoted as a set of N documents F = (F_1, F_2, ..., F_N).
C — the encrypted document collection according to F, denoted as C = (C_1, C_2, ..., C_N).
FID — the identity collection of the encrypted documents C, denoted as FID = (FID_1, FID_2, ..., FID_N).
W — the keyword dictionary, including m keywords, denoted as W = (w_1, w_2, ..., w_m).
I — the index stored in the cloud server, built from the keywords of each document, denoted as I = (I_1, I_2, ..., I_N).
W~ — the query keyword set generated by a search user, which is a subset of W.
T_{W~} — the trapdoor for the keyword set W~.
FID~ — the identity collection of documents returned to the search user.
FMS(CS) — the abbreviation of FMS and FMSCS.

3.2 Secure kNN Computation

We adopt the work of Wong et al. [21] as our foundation. Wong et al. propose a secure k-nearest neighbor (kNN) scheme which can confidentially encrypt two vectors and compute the Euclidean distance between them. Firstly, the secret key (S, M_1, M_2) is generated. The binary vector S is a splitting indicator used to split a plaintext vector into two random vectors, which conceals the values of the plaintext vector, and M_1 and M_2 are used to encrypt the split vectors. The correctness and security of the secure kNN computation scheme can be found in [21].
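To make the splitting idea concrete, here is a minimal numerical sketch of the secure-kNN-style encrypt-and-match step in Python with numpy. The toy dictionary size, the example vectors and the helper names are our own illustrative choices, not part of the scheme description; the additional randomization used by the proposed schemes is discussed in Section 4.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 6                                     # toy dictionary size (illustrative)

# Secret key: splitting indicator S and two random invertible matrices M1, M2
S  = rng.integers(0, 2, size=m + 1)
M1 = rng.random((m + 1, m + 1))
M2 = rng.random((m + 1, m + 1))

def split(vec, split_where_S_is_one):
    """Split vec into (va, vb): copy the entry on one side of S, and split it
    randomly (va[i] + vb[i] = vec[i]) on the other, as in the kNN scheme."""
    va = vec.astype(float)
    vb = vec.astype(float)
    for i, bit in enumerate(S):
        do_split = (bit == 1) if split_where_S_is_one else (bit == 0)
        if do_split:
            va[i] = rng.random()
            vb[i] = vec[i] - va[i]
    return va, vb

# Index side: binary keyword vector extended with a constant 1 as the last entry
P = np.array([1, 0, 1, 0, 0, 1, 1.0])
pa, pb = split(P, split_where_S_is_one=True)
index = (pa @ M1, pb @ M2)

# Query side: query vector extended with -s as the last entry, then scaled by r
s, r = 0.5, rng.random() + 0.1
Q = np.array([1, 0, 1, 0, 0, 0, -s])
qa, qb = split(r * Q, split_where_S_is_one=False)
trapdoor = (np.linalg.inv(M1) @ qa, np.linalg.inv(M2) @ qb)

# Matching: the matrices cancel, so the result equals r * (P . Q) = r * (#matches - s)
R = index[0] @ trapdoor[0] + index[1] @ trapdoor[1]
print(R, r * (P @ Q))                     # the two values agree (up to float error)
```

The two printed values agree because the matrix factors cancel inside the inner product; this is exactly the property on which the Basic Framework of Section 4 is built.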
3.3 Relevance Score

The relevance score between a keyword and a document represents the frequency with which the keyword appears in the document. It can be used in searchable encryption for returning ranked results. A prevalent metric for evaluating the relevance score is TF × IDF, where TF (term frequency) represents the frequency of a given keyword in a document and IDF (inverse document frequency) represents the importance of the keyword within the whole document collection. Without loss of generality, we select a widely used expression from [22] to evaluate the relevance score as

Score(W~, F_j) = Σ_{w ∈ W~} (1/|F_j|) · (1 + ln f_{j,w}) · ln(1 + N/f_w)   (1)

where f_{j,w} denotes the TF of keyword w in document F_j; f_w denotes the number of documents containing keyword w; N denotes the number of documents in the collection; and |F_j| denotes the length of F_j, obtained by counting the number of indexed keywords.
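As a concrete illustration of Eq. (1), the sketch below computes the score of one document from raw term counts. It is plain Python with illustrative names, approximates |F_j| by the document's token count, and is not code from the paper.

```python
import math
from collections import Counter

def relevance_score(query_keywords, doc_tokens, doc_freq, n_docs):
    """Evaluate Eq. (1): sum over the query keywords w of
    (1/|F_j|) * (1 + ln f_{j,w}) * ln(1 + N / f_w), skipping absent keywords."""
    tf = Counter(doc_tokens)          # f_{j,w}: term frequency in F_j
    length = len(doc_tokens)          # |F_j|, approximated by the token count
    score = 0.0
    for w in query_keywords:
        if tf[w] == 0 or doc_freq.get(w, 0) == 0:
            continue
        score += (1.0 / length) * (1 + math.log(tf[w])) \
                 * math.log(1 + n_docs / doc_freq[w])
    return score

# toy usage: doc_freq counts how many of the N documents contain each keyword
print(relevance_score(["cloud", "search"],
                      ["cloud", "cloud", "search", "data"],
                      {"cloud": 2, "search": 1, "data": 3},
                      n_docs=3))
```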
4 PROPOSED SCHEMES

In this section, we first propose a variant of the secure kNN computation scheme, which serves as the basic framework of our schemes. We then describe two variants of this basic framework and their corresponding functionalities in detail.

4.1 Basic Framework

The secure kNN computation scheme uses the Euclidean distance to select the k nearest database records. In this section, we present a variant of the secure kNN computation scheme to achieve the searchable encryption property.

4.1.1 Initialization

The data owner randomly generates the secret key K = (S, M_1, M_2), where S is an (m+1)-dimensional binary vector, M_1 and M_2 are two (m+1) × (m+1) invertible matrices, and m is the number of keywords in W. Then the data owner sends (K, sk) to search users through a secure channel, where sk is the symmetric key used to encrypt the documents outsourced to the cloud server.

4.1.2 Index building

The data owner first utilizes a symmetric encryption algorithm (e.g., AES) to encrypt the document collection (F_1, F_2, ..., F_N) with the symmetric key sk [23]; the encrypted documents are denoted as C_j (j = 1, 2, ..., N). Then the data owner generates an m-dimensional binary vector P for each C_j (j = 1, 2, ..., N), where each bit P[i] indicates whether the encrypted document contains the keyword w_i, i.e., P[i] = 1 indicates yes and P[i] = 0 indicates no. She then extends P to an (m+1)-dimensional vector P', where P'[m+1] = 1. The data owner uses the vector S to split P' into two (m+1)-dimensional vectors (p_a, p_b), where S functions as a splitting indicator. Namely, if S[i] = 0 (i = 1, 2, ..., m+1), p_a[i] and p_b[i] are both set to P'[i]; if S[i] = 1 (i = 1, 2, ..., m+1), the value of P'[i] is randomly split into p_a[i] and p_b[i] (P'[i] = p_a[i] + p_b[i]). Then the index of the encrypted document C_j is calculated as I_j = (p_a M_1, p_b M_2). Finally, the data owner sends C_j || FID_j || I_j (j = 1, 2, ..., N) to the cloud server.

4.1.3 Trapdoor generating

The search user first generates the keyword set W~ for searching. Then she creates an m-dimensional binary vector Q according to W~, where Q[i] indicates whether the i-th dictionary keyword w_i is in W~, i.e., Q[i] = 1 indicates yes and Q[i] = 0 indicates no. Further, the search user extends Q to an (m+1)-dimensional vector Q', where Q'[m+1] = -s (the value of s will be defined in the following schemes in detail). Next, the search user chooses a random number r > 0 to generate Q'' = r · Q'. She then splits Q'' into two (m+1)-dimensional vectors (q_a, q_b): if S[i] = 0 (i = 1, 2, ..., m+1), the value of Q''[i] is randomly split into q_a[i] and q_b[i]; if S[i] = 1 (i = 1, 2, ..., m+1), q_a[i] and q_b[i] are both set to Q''[i]. Thus the search trapdoor T_{W~} is generated as (M_1^{-1} q_a, M_2^{-1} q_b). The search user then sends T_{W~} to the cloud server.

4.1.4 Query

With the index I_j (j = 1, 2, ..., N) and the trapdoor T_{W~}, the cloud server calculates the query result as

R_j = I_j · T_{W~} = (p_a M_1, p_b M_2) · (M_1^{-1} q_a, M_2^{-1} q_b) = p_a · q_a + p_b · q_b = P' · Q'' = r P' · Q' = r · (P · Q - s)   (2)

If R_j > 0, the corresponding document identity FID_j is returned.

Discussions: The Basic Framework defines the fundamental system structure of the developed schemes. Based on the secure kNN computation scheme [21], the complementary random parameter r further enhances the security. Different values for the parameter s and the vectors P and Q lead to new variants of the Basic Framework, as elaborated in the following.

4.2 FMS I

In the Basic Framework, P is an m-dimensional binary vector, and each bit P[i] indicates whether the encrypted document contains the keyword w_i. In FMS I, the data owner first calculates the relevance score between the keyword w_i and the document F_j as

Score(w_i, F_j) = (1/|F_j|) · (1 + ln f_{j,w_i}) · ln(1 + N/f_{w_i})   (3)

where f_{j,w_i} denotes the TF of keyword w_i in document F_j; f_{w_i} denotes the number of documents containing keyword w_i; N denotes the number of documents in the collection; and |F_j| denotes the length of F_j, obtained by counting the number of indexed keywords.

Then the data owner replaces the value of P[i] with the corresponding relevance score. On the other hand, we also consider the preference factors of keywords. The preference factors of keywords indicate the importance of keywords in the search keyword set as personally defined by the search user. A search user may pay more attention to the preference factors of keywords defined by himself than to the relevance scores of the keywords. Thus, our goal is that if a document has a keyword with a larger preference factor than other documents, it should have a higher priority in the returned FID~; and for two documents whose largest-preference-factor keywords are the same, the document with the higher relevance score of that keyword is the better matching result.

As shown in Fig. 2, we replace the values of P[i] and Q[i] by the relevance score and the preference factor of a keyword, respectively (thus P and Q are no longer binary). The search user can dynamically adjust the preference factors to achieve a more flexible search. For convenience, the score is rounded up, i.e., Score(w_i, F_j) = ⌈10 · Score(w_i, F_j)⌉, and we assume the relevance score is not more than D, i.e., Score(w_i, F_j) ≤ D. For the search keyword set W~ = (w_{n_1}, w_{n_2}, ..., w_{n_l}) (1 ≤ n_1 < n_2 < ... < n_l ≤ m), which is ordered by ascending importance, the search user randomly chooses a super-increasing sequence (d_1 > 0, d_2, ..., d_l) (i.e., Σ_{i=1}^{j-1} d_i · D < d_j for j = 2, 3, ..., l), where d_i is the preference factor of keyword w_{n_i}.

Fig. 2. Structure of the FMS I

Then the search result is

R_j = r · (P · Q - s) = r · (Σ_{i=1}^{l} Score(w_{n_i}, F_j) · d_i - s)   (4)
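The super-increasing preference factors are what make Eq. (4) rank a document containing a higher-preference keyword above any document that matches only lower-preference keywords. A small Python sketch of one way to generate such a sequence and of the resulting behaviour follows; the construction is an illustrative assumption, not the paper's prescribed generator.

```python
def super_increasing(length, D):
    """Preference factors d_1 < d_2 < ... with D * (d_1 + ... + d_{j-1}) < d_j,
    so a higher-preference match always outweighs any lower-preference matches."""
    seq = [1]
    for _ in range(length - 1):
        seq.append(D * sum(seq) + 1)
    return seq

D = 5                                   # assumed upper bound on the rounded scores
d = super_increasing(5, D)              # -> [1, 6, 36, 216, 1296]

def fms1_value(matched):
    """Sum of Score(w_{n_i}, F_j) * d_i over the matched keywords
    (matched maps preference index i -> rounded relevance score)."""
    return sum(score * d[i] for i, score in matched.items())

# F1 matches only the top-preference keyword with the minimum score 1;
# F2 matches every lower-preference keyword with the maximum score D.
print(fms1_value({4: 1}) > fms1_value({0: D, 1: D, 2: D, 3: D}))   # True
```

With D = 5 the sequence is (1, 6, 36, 216, 1296), and the final comparison prints True because 1 · 1296 exceeds 5 · (1 + 6 + 36 + 216) = 1295, which is the intuition formalized in Theorems 1 and 2 below.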
Theorem 1: (Correctness) For the search keyword set W~ = (w_{n_1}, w_{n_2}, ..., w_{n_l}) (1 ≤ n_1 < n_2 < ... < n_l ≤ m), which is ordered by ascending preference factors, if F_1 contains a keyword with a larger preference factor than any keyword of W~ contained in F_2, then F_1 has higher priority in the returned FID~.

Proof: For the search keyword set W~ = (w_{n_1}, w_{n_2}, ..., w_{n_l}), assume the keyword sets of W~ that F_1 and F_2 contain are denoted as W~_1 = (w_{n_i}, ..., w_{n_x}) (n_1 ≤ n_i < ... < n_x ≤ n_l) and W~_2 = (w_{n_j}, ..., w_{n_y}) (n_1 ≤ n_j < ... < n_y ≤ n_l), respectively, where W~_1 and W~_2 are both ordered by ascending preference factors and n_x > n_y. As stated above, Score(w_{n_x}, F_1) ≥ 1 since the score is rounded up, and Σ_{i=1}^{j-1} d_i · D < d_j (j = 2, 3, ..., l). Therefore,

R_2 = r · (Σ_{w_{n_j} ∈ W~_2} Score(w_{n_j}, F_2) · d_j - s)
    ≤ r · (Σ_{j=1}^{y} Score(w_{n_j}, F_2) · d_j - s)
    ≤ r · (Σ_{j=1}^{y} D · d_j - s)
    < r · (d_x - s)
    ≤ r · (Score(w_{n_x}, F_1) · d_x - s)
    ≤ r · (Σ_{w_{n_i} ∈ W~_1} Score(w_{n_i}, F_1) · d_i - s)
    = R_1   (5)

Therefore, F_1 has higher priority in the returned FID~.

Theorem 2: (Correctness) For the search keyword set W~ = (w_{n_1}, w_{n_2}, ..., w_{n_l}) (1 ≤ n_1 < n_2 < ... < n_l ≤ m), which is ordered by ascending preference factors, if the largest-preference-factor keyword that F_1 contains is the same as that of F_2, and F_1 has the higher relevance score for that keyword, then F_1 has higher priority in the returned FID~.

Proof: For the search keyword set W~ = (w_{n_1}, w_{n_2}, ..., w_{n_l}), assume the keyword sets that F_1 and F_2 contain are denoted as W~_1 = (w_{n_i}, ..., w_{n_x}) (n_1 ≤ n_i < ... < n_x ≤ n_l) and W~_2 = (w_{n_j}, ..., w_{n_x}) (n_1 ≤ n_j < ... < n_x ≤ n_l), respectively, where W~_1 and W~_2 are both ordered by ascending preference factors and Score(w_{n_x}, F_1) - Score(w_{n_x}, F_2) ≥ 1. Thus,

R_1 = r · (Σ_{w_{n_i} ∈ W~_1} Score(w_{n_i}, F_1) · d_i - s)
    ≥ r · (Score(w_{n_x}, F_1) · d_x - s)   (7)

R_2 = r · (Σ_{w_{n_j} ∈ W~_2} Score(w_{n_j}, F_2) · d_j - s)   (8)
    = r · (Score(w_{n_x}, F_2) · d_x + Σ_{w_{n_j} ∈ W~_2 \ {w_{n_x}}} Score(w_{n_j}, F_2) · d_j - s)
    ≤ r · (Score(w_{n_x}, F_2) · d_x + Σ_{w_{n_j} ∈ W~_2 \ {w_{n_x}}} D · d_j - s)
    < r · (Score(w_{n_x}, F_2) · d_x + d_x - s)

R_1 - R_2 > r · ((Score(w_{n_x}, F_1) - Score(w_{n_x}, F_2)) · d_x - d_x) ≥ r · (d_x - d_x) = 0   (9)

Therefore, F_1 has higher priority in the returned FID~ than F_2.
Example. We present a concrete example to help understand Theorem 2; it also illustrates the working process of FMS I. Specifically, we assume that the search keyword set is W~ = (w_{n_1}, w_{n_2}, ..., w_{n_5}), and the largest-preference-factor keyword of both F_1 and F_2 is w_{n_4}. In addition, we assume the keyword sets of F_1 and F_2 are W~_1 = (w_{n_2}, w_{n_3}, w_{n_4}) and W~_2 = (w_{n_1}, w_{n_3}, w_{n_4}), respectively. Furthermore, we assume that the relevance score is not more than D = 5 and, specifically, let Score(w_{n_4}, F_1) = 4 and Score(w_{n_4}, F_2) = 2, which satisfy Score(w_{n_4}, F_1) - Score(w_{n_4}, F_2) = 2 ≥ 1. We randomly choose the super-increasing sequence d_i = {1, 10, 60, 500, 3000} (i = 1, ..., 5); then for arbitrary r > 0,

R_1 = r · (Σ_{w_{n_i} ∈ W~_1} Score(w_{n_i}, F_1) · d_i - s)   (11)
    ≥ r · (Score(w_{n_4}, F_1) · d_4 - s)
    = r · (4 · 500 - s)
    = r · (2000 - s)

R_2 = r · (Σ_{w_{n_j} ∈ W~_2} Score(w_{n_j}, F_2) · d_j - s)   (12)
    = r · (Score(w_{n_4}, F_2) · d_4 + Σ_{w_{n_j} ∈ W~_2 \ {w_{n_4}}} Score(w_{n_j}, F_2) · d_j - s)
    < r · (Score(w_{n_4}, F_2) · d_4 + d_4 - s)
    = r · (2 · 500 + 500 - s)
    = r · (1500 - s)

R_1 - R_2 > r · (2000 - s) - r · (1500 - s) = 500 · r > 0   (13)

4.3 FMS II

In FMS II, we do not change the vector P of the Basic Framework, but replace the value of Q[i] by the weight of the search keyword, as shown in Fig. 3. With the weights of keywords, we can implement operations like the “OR”, “AND” and “NO” of Google Search in searchable encryption.

Fig. 3. Structure of the FMS II

Assume that the keyword sets corresponding to the “OR”, “AND” and “NO” operations are (w'_1, w'_2, ..., w'_{l_1}), (w''_1, w''_2, ..., w''_{l_2}) and (w'''_1, w'''_2, ..., w'''_{l_3}), respectively. Denote the “OR”, “AND” and “NO” operations by ∨, ∧ and ¬, respectively. Thus the matching rule can be represented as (w'_1 ∨ w'_2 ∨ ... ∨ w'_{l_1}) ∧ (w''_1 ∧ w''_2 ∧ ... ∧ w''_{l_2}) ∧ (¬w'''_1 ∧ ¬w'''_2 ∧ ... ∧ ¬w'''_{l_3}). For the “OR” operation, the search user chooses a super-increasing sequence (a_1 > 0, a_2, ..., a_{l_1}) (Σ_{k=1}^{j-1} a_k < a_j for j = 2, ..., l_1) to achieve searching with keyword weights. To enable searchable encryption with the “AND” and “NO” operations, the search user chooses a sequence (b_1, b_2, ..., b_{l_2}, c_1, c_2, ..., c_{l_3}), where Σ_{k=1}^{l_1} a_k < b_h (h = 1, 2, ..., l_2) and Σ_{k=1}^{l_1} a_k + Σ_{h=1}^{l_2} b_h < c_i (i = 1, 2, ..., l_3).

Assume (w'_1, w'_2, ..., w'_{l_1}) are ordered by ascending importance. Then, according to the search keyword set (w'_1, ..., w'_{l_1}, w''_1, ..., w''_{l_2}, w'''_1, ..., w'''_{l_3}), the corresponding values in Q are set to (a_1, a_2, ..., a_{l_1}, b_1, b_2, ..., b_{l_2}, -c_1, -c_2, ..., -c_{l_3}). Other values in Q are set to 0. Finally, the search user sets s = Σ_{h=1}^{l_2} b_h. In the Query phase, for a document F_j, if the corresponding R_j > 0, we claim that F_j satisfies the above matching rule.
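Before the formal correctness argument, the following plaintext-domain sketch shows how such weights can be chosen and how the sign of R decides the mixed-logic match. The weight-generation rule, the keyword names and the function names are illustrative assumptions rather than the paper's concrete parameters.

```python
def fms2_weights(n_or, n_and, n_no):
    """Illustrative weights a (OR), b (AND), c (NO) with
    sum(a) < every b_h  and  sum(a) + sum(b) < every c_i."""
    a = [2 ** i for i in range(n_or)]               # super-increasing OR weights
    b = [sum(a) + 1 + h for h in range(n_and)]
    c = [sum(a) + sum(b) + 1 + i for i in range(n_no)]
    return a, b, c

def fms2_value(doc_keywords, or_kw, and_kw, no_kw):
    """Plaintext analogue of R_j = r * (P . Q - s) with r = 1: positive iff all
    AND keywords, at least one OR keyword, and no NO keyword are matched."""
    a, b, c = fms2_weights(len(or_kw), len(and_kw), len(no_kw))
    s = sum(b)
    val  = sum(a[k] for k, w in enumerate(or_kw) if w in doc_keywords)
    val += sum(b[h] for h, w in enumerate(and_kw) if w in doc_keywords)
    val -= sum(c[i] for i, w in enumerate(no_kw) if w in doc_keywords)
    return val - s

print(fms2_value({"w2", "x1", "x2"}, ["w1", "w2"], ["x1", "x2"], ["z1"]) > 0)  # True
print(fms2_value({"x1", "x2"},       ["w1", "w2"], ["x1", "x2"], ["z1"]) > 0)  # False
```

Theorem 3 below shows that this sign test is exact for any weights satisfying the two inequalities.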
Theorem 3: (Correctness) F_j satisfies the above matching rule with “OR”, “AND” and “NO” if and only if the corresponding R_j > 0.

Proof: Firstly, we prove the completeness. Since the weight of w'''_i (i = 1, 2, ..., l_3) in the vector Q is -c_i and c_i > Σ_{k=1}^{l_1} a_k + Σ_{h=1}^{l_2} b_h, if any corresponding value of a w'''_i in the P of F_j is 1, we can infer P · Q < 0 and R_j = r · (P · Q - s) < 0. Therefore, if R_j > 0, none of the w'''_i is in the keyword set of F_j, i.e., F_j satisfies the “NO” operation. Moreover, if R_j > 0, then r · (P · Q - s) = r · (P · Q - Σ_{h=1}^{l_2} b_h) > 0. Since b_h > Σ_{k=1}^{l_1} a_k (h = 1, 2, ..., l_2), all corresponding values of the w''_h in P have to be 1 and at least one corresponding value of a w'_k (k = 1, 2, ..., l_1) in P must be 1. Thus F_j satisfies the “AND” and “OR” operations. Therefore, if R_j > 0, the vector P satisfies the “OR”, “AND” and “NO” operations.

Next, we show the soundness. If the vector P satisfies the “OR”, “AND” and “NO” operations, i.e., at least one corresponding value of a keyword w'_k in P is 1 (assume this keyword is w'_λ, 1 ≤ λ ≤ l_1), all corresponding values of the keywords w''_h in P are 1, and no corresponding value of a keyword w'''_i in P is 1, then R_j = r · (P · Q - s) ≥ r · (a_λ + b_1 + b_2 + ... + b_{l_2} - s) = r · a_λ > 0.

Example. We present a concrete example to help understand Theorem 3; it also illustrates the working process of FMS II. Specifically, we assume that the keyword sets corresponding to the “OR”, “AND” and “NO” operations are (w'_1, w'_2, w'_3), (w''_1, w''_2, w''_3) and (w'''_1, w'''_2), respectively. Thus, the matching rule can be represented as (w'_1 ∨ w'_2 ∨ w'_3) ∧ (w''_1 ∧ w''_2 ∧ w''_3) ∧ (¬w'''_1 ∧ ¬w'''_2). We assume that the search weights (a_1, a_2, a_3), (b_1, b_2, b_3) and for “NO” for “OR”, “AND” and “NO” are (1, 5, 8), (20, 24, 96) and (-500, -600) (i.e., c_1 = 500, c_2 = 600), respectively.

We first verify that R_j > 0 when F_j satisfies the matching rule. Specifically, assume that F_j satisfies the matching rule as w'_2 ∧ (w''_1 ∧ w''_2 ∧ w''_3) ∧ (¬w'''_1 ∧ ¬w'''_2). Thus the corresponding values in the vector P are (0, 1, 0), (1, 1, 1) and (0, 0), respectively, and s = Σ_{h=1}^{3} b_h = 20 + 24 + 96 = 140. For arbitrary r > 0,

R_j = r · (P · Q - s) = r · (a_2 + b_1 + b_2 + b_3 - s) = r · (5 + 20 + 24 + 96 - 140) = 5r > 0   (14)

From the above example, we can easily see that R_j > 0 when F_j satisfies the matching rule. Next, we show that R_j < 0 when F_j does not satisfy the matching rule. In particular, we assume that the “AND” operation is not satisfied: the first “AND” keyword does not match, so the corresponding values in P are (0, 1, 1) instead of (1, 1, 1). Thus,

R_j = r · (P · Q - s) = r · (a_2 + b_2 + b_3 - s) = r · (5 + 24 + 96 - 140) = -15r < 0   (15)

Obviously, R_j < 0 when F_j does not satisfy the matching rule.
5 ENHANCED SCHEME

In practice, apart from some common keywords, the other keywords in the dictionary are generally professional terms, and this part of the dictionary grows rapidly as the dictionary becomes larger and more comprehensive. At the same time, the data owner's index becomes longer, although many of the keyword dimensions will never appear in her documents. That causes redundant computation and communication overhead.

In this section, we further propose a Fine-grained Multi-keyword Search scheme supporting Classified Sub-dictionaries (FMSCS), which classifies the total dictionary into a common sub-dictionary and many professional sub-dictionaries. Our goal is to significantly reduce the computation and communication overhead. We have examined a file set randomly chosen from the National Science Foundation (NSF) Research Awards Abstracts 1990-2003 [24]. As shown in Fig. 4, we classify the total dictionary into many sub-dictionaries, such as a common sub-dictionary, a computer science sub-dictionary, a mathematics sub-dictionary, a physics sub-dictionary, etc. The search process then only requires minor changes in Initialization.

Fig. 4. Classified sub-dictionaries

Change of Initialization: Compared with the Basic Framework, in the enhanced scheme the data owner first chooses the corresponding sub-dictionaries. Her own dictionary can then be composed as {f_1 || Subdic_1 || f_2 || Subdic_2 || ...}, where Subdic_i represents all keywords contained in the corresponding sub-dictionary and f_i is a filling factor of random length, which will be a 0 string in the index; the filling factor is used to hide the length of the data owner's own dictionary and the relative positions of the sub-dictionaries. The data owner and the search user then use this dictionary to generate the index and the trapdoor, respectively. Note that in a dictionary, two professional sub-dictionaries may even contain the same keyword, but only the first occurrence of the keyword is used to generate the index and trapdoor; the other is set to 0 in the vector. The secret key K is formed as (S, M_1, M_2, |f_1|, D_{ID_1}, |f_2|, D_{ID_2}, ...), where D_{ID_i} represents the identity of a sub-dictionary and |f_i| is the length of f_i. Other than these changes, the remaining phases (i.e., Index building, Trapdoor generating and Query) are the same as in the Basic Framework.

Dictionary Updating: In searchable encryption schemes with a dictionary, dictionary update is a challenging problem because it may require updating massive numbers of indexes outsourced to the cloud server. In general dictionary-based search schemes, e.g., [13] and [14], an update of the dictionary leads to the re-generation of all indexes. In our FMSCS schemes, when the sub-dictionaries need to be changed or new sub-dictionaries added, only the data owners who use the corresponding sub-dictionaries need to update their indexes; most other data owners do not need to perform any update operations. Such dictionary update operations are particularly lightweight. In addition, Li et al. [9] utilize the dimension expansion technique to implement efficient dictionary expansion. Such a method can also be included in our dictionary updating process. Our scheme can even be more efficient than [9]: although [9] does not need to re-generate all indexes, the corresponding extension operations on all indexes are still necessary, whereas our schemes only need to extend the indexes of part of the data owners.
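A small sketch of the dictionary-composition step just described, in Python; the sub-dictionary contents, the padding lengths and the function name are illustrative assumptions.

```python
import random

SUB_DICTS = {                                  # illustrative sub-dictionaries
    "common": ["data", "cloud", "search"],
    "cs":     ["encryption", "index", "trapdoor"],
    "math":   ["matrix", "vector"],
}

def build_owner_dictionary(chosen, rng=random.Random(42)):
    """Concatenate the chosen sub-dictionaries, separated by random-length filling
    factors f_i (kept as all-zero positions in the index) that hide the layout."""
    layout, slots = [], []
    for name in chosen:
        pad = rng.randint(1, 4)                  # |f_i|: random filling length
        layout.append((pad, name))               # (|f_i|, sub-dictionary id), part of K
        slots += [None] * pad + SUB_DICTS[name]  # None marks a filling position
    return layout, slots

layout, slots = build_owner_dictionary(["common", "cs"])
# The index and trapdoor vectors then have one dimension per entry of `slots`.
```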
6 SECURITY ANALYSIS

In this section, we analyze the main security properties of the proposed schemes. In particular, our analysis focuses on how the proposed schemes achieve confidentiality of documents, privacy protection of index and trapdoor, and unlinkability of trapdoor. Other security features are not the focus of our concern.

6.1 Confidentiality of Documents

In our schemes, the outsourced documents are encrypted by a traditional symmetric encryption algorithm (e.g., AES). In addition, the secret key sk is generated by the data owner and sent to the search user through a secure channel. Since the AES encryption algorithm is secure [23], no entity can recover the encrypted documents without the secret key sk. Therefore, the confidentiality of the encrypted documents is achieved.

6.2 Privacy Protection of Index and Trapdoor

As shown in Section 4.1, both the index I_j = (p_a M_1, p_b M_2) and the trapdoor T_{W~} = (M_1^{-1} q_a, M_2^{-1} q_b) are ciphertexts of the vectors (P, Q). The secret key is K = (S, M_1, M_2) in the FMS or (S, M_1, M_2, |f_1|, D_{ID_1}, |f_2|, D_{ID_2}, ...) in the FMSCS, where S functions as a splitting indicator which splits P and Q into (p_a, p_b) and (q_a, q_b), respectively, and the two invertible matrices M_1 and M_2 are used to encrypt (p_a, p_b) and (q_a, q_b). The security of this encryption algorithm has been proved in the known ciphertext model [21]. Thus, the content of index and trapdoor cannot be identified, and privacy protection of index and trapdoor is achieved.

6.3 Unlinkability of Trapdoor

To protect the security of search, the unlinkability of trapdoor should be achieved. Although the cloud server cannot directly recover the keywords, the linkability of trapdoors may cause a leakage of privacy: the same keyword set may be searched many times, and if the trapdoor generation function is deterministic, the cloud server can deduce the relationship of keywords even though it cannot decrypt the trapdoors. We consider whether the trapdoor T_{W~} = (M_1^{-1} q_a, M_2^{-1} q_b) can be linked to the keywords. We prove that our schemes achieve the unlinkability of trapdoor in a strong threat model, i.e., the known background model [6].

Known Background Model: In this model, the cloud server can possess statistical information from a known comparable dataset which bears a similar nature to the targeted dataset.

TABLE 1. Structure of the query vector Q'

              Q'[1] ... Q'[m]                          Q'[m+1]
FMS(CS) I     ... 0 ... d_i ... 0 ... d_j ...          -s
FMS(CS) II    ... a_k ... b_h ... 0 ... -c_i ...       -s

As shown in Table 1, in our FMS(CS) I the trapdoor is constituted of two parts. The values of the dimensions d_i (i = 1, 2, ..., l) form the super-increasing sequence randomly chosen by the search user (assume there are μ possible sequences), and the (m+1)-th dimension is -s, defined by the search user, where s is a positive random number. Assume the size of s is θ_s bits; then there are 2^{θ_s} possible values for s.
Further, to generate Q'' = r · Q', the vector Q' is multiplied by a positive random number r, so there are 2^{θ_r} possible values for r (if the search user chooses a θ_r-bit r). Finally, Q'' is split into (q_a, q_b) according to the splitting indicator S: if S[i] = 0 (i = 1, 2, ..., m+1), the value of Q''[i] is randomly split into q_a[i] and q_b[i]. Assume the number of 0s in S is γ, and each dimension of q_a and q_b is θ_q bits. Note that θ_s, θ_r, γ and θ_q are independent of each other. Then, in our FMS(CS) I, the probability that two trapdoors are the same can be computed as

P_1 = 1 / (μ · 2^{θ_s} · 2^{θ_r} · (2^{θ_q})^γ) = 1 / (μ · 2^{θ_s + θ_r + γ·θ_q})   (16)

Therefore, larger μ, θ_s, θ_r, γ and θ_q achieve stronger security; e.g., if we choose a 1024-bit r, then the probability P_1 < 1/2^{1024}. As a result, the probability that two trapdoors are the same is negligible.

In the FMS(CS) II, because s = Σ_{h=1}^{l_2} b_h, its value depends on the weight sequence (a_1, a_2, ..., a_{l_1}, b_1, b_2, ..., b_{l_2}, c_1, c_2, ..., c_{l_3}). Assume the number of different sequences is μ'; then we can compute

P_2 = 1 / (μ' · 2^{θ_r} · (2^{θ_q})^γ) = 1 / (μ' · 2^{θ_r + γ·θ_q})   (17)

Similarly, in the FMS(CS) II the probability that two trapdoors are the same is negligible. Therefore, in our schemes, the unlinkability of trapdoor is achieved.

In summary, we present the comparison of security levels in Table 2, where I and II represent FMS(CS) I and FMS(CS) II, respectively. All schemes achieve confidentiality of documents and privacy protection of index and trapdoor, but the OPE schemes [11], [25] cannot achieve the unlinkability of trapdoor very well because of the similarity relevance mentioned in [14].

TABLE 2. Comparison of Security Level

                      [11], [25]   [6], [13], [14]   I    II
Confidentiality           ✓              ✓           ✓    ✓
Privacy protection        ✓              ✓           ✓    ✓
Unlinkability                            ✓           ✓    ✓

Discussions: In MRSE [6], the values of P · Q are equal to the number of matching keywords, which suffers from a scale analysis attack when the cloud server is powerful and has knowledge of some background information. To solve this problem, MRSE extends the index and inserts a random number that follows a normal distribution and confuses the values of P · Q; the enhanced MRSE can thus resist the scale analysis attack. However, the introduction of this random number causes a precision decrease of the returned results, so there is a trade-off between precision and security in MRSE. In comparison, our schemes do not suffer from the scale analysis attack, because the values of P · Q in our schemes do not disclose any information, thanks to the randomly selected sequences mentioned in Sections 4.2 and 4.3. Therefore, our proposal achieves this security without sacrificing precision.

7 PERFORMANCE EVALUATION

In this section, we evaluate the performance of the proposed schemes using simulations and compare it with that of the existing proposals in [6], [13], [14].
We apply a real-world dataset from the National Science Foundation Research Awards Abstracts 1990-2003 [24], from which we randomly select multiple documents, and conduct real-world experiments on an Intel Core i5 3.2 GHz system.

7.1 Functionality

We compare the functionalities of [6], [13], [14] and our schemes in Table 3, where I and II represent FMS(CS) I and FMS(CS) II, respectively.

MRSE [6] achieves multi-keyword search and coordinate matching using the secure kNN computation scheme, and [13] and [14] consider the relevance scores of keywords. Compared with the other schemes, our FMS(CS) I considers both the relevance scores and the preference factors of keywords. Note that if the search user sets all relevance scores and preference factors of keywords to the same value, FMS(CS) I degrades to MRSE and coordinate matching is achieved. In FMS(CS) II, if the search user sets all preference factors of the “OR” operation keywords to the same value, FMS(CS) II also achieves coordinate matching over the “OR” operation keywords. In particular, FMS(CS) II achieves fine-grained keyword search operations, i.e., the “AND”, “OR” and “NO” operations of Google Search, which are definitely practical and significantly enhance the functionality of encrypted keyword search.

TABLE 3. Comparison of Functionalities

                           [6]   [13]   [14]   I    II
Multi-keyword search        ✓     ✓      ✓     ✓    ✓
Coordinate matching         ✓     ✓      ✓     ✓    ✓
Relevance score                   ✓      ✓     ✓
Preference factor                              ✓    ✓
AND, OR, NO operations                              ✓

7.2 Query Complexity

In FMS(CS) II, we can implement the “OR”, “AND” and “NO” operations by defining appropriate keyword weights; this scheme provides a more fine-grained search than [6], [13] and [14]. Suppose the keywords for the “OR”, “AND” and “NO” operations are (w'_1, w'_2, ..., w'_{l_1}), (w''_1, w''_2, ..., w''_{l_2}) and (w'''_1, w'''_2, ..., w'''_{l_3}), respectively. Our FMS(CS) II completes the search with only one query, whereas [6], [13] and [14] would complete the search through the following steps.

For the “OR” operation on l_1 keywords, they need only one query Query(w'_1, w'_2, ..., w'_{l_1}) to return a collection of documents with the most matching keywords (i.e., coordinate matching), which can be denoted as X = Query(w'_1, w'_2, ..., w'_{l_1}).

For the “AND” operation on l_2 keywords, [6], [13] and [14] cannot generate a single query over multiple keywords that achieves the “AND” operation. Therefore, after issuing l_2 queries Query(w''_i) (i = 1, 2, ..., l_2), they can perform the “AND” operation, and the corresponding document set can be denoted as Y = Query(w''_1) ∩ Query(w''_2) ∩ ... ∩ Query(w''_{l_2}).

For the “NO” operation on l_3 keywords, they first need l_3 queries Query(w'''_i) (i = 1, 2, ..., l_3). Then the document set of the “NO” operation can be denoted as Z = Query(w'''_1) ∪ Query(w'''_2) ∪ ... ∪ Query(w'''_{l_3}).

Finally, the document collection achieving the “OR”, “AND” and “NO” operations can be represented as (X ∩ Y) \ Z.

As shown in Figs. 5a, 5b and 5c, to achieve these operations, FMS(CS) II outperforms the existing proposals by generating fewer queries.
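The query counts behind Figs. 5a-5c reduce to a simple formula; the following sketch (with assumed function names) simply evaluates it for one setting.

```python
def queries_existing(l1, l2, l3):
    """[6], [13], [14]: one OR query, l2 single-keyword AND queries, l3 NO queries."""
    return 1 + l2 + l3

def queries_fms2(l1, l2, l3):
    return 1          # one trapdoor encodes the whole mixed-logic query

print(queries_existing(5, 3, 2), queries_fms2(5, 3, 2))   # 6 versus 1
```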
Fig. 5. Number of queries. (a) For different numbers of “AND” and “NO” keywords with the same number of “OR” keywords, l_1 = 5. (b) For different numbers of “OR” and “NO” keywords with the same number of “AND” keywords, l_2 = 5. (c) For different numbers of “AND” and “OR” keywords with the same number of “NO” keywords, l_3 = 5.

7.3 Efficiency

7.3.1 Computation overhead

To demonstrate the computation overhead of our schemes, we analyze each phase separately.

Index building. Note that the Index building phase of [6] is the same as that of our FMS II scheme, without calculating the relevance score, and the Index building phase of FMS I is the same as that of [13], containing the relevance score computation. Compared with FMS I, FMS II does not need to calculate the relevance score; however, compared with the cost of building the index, the cost of calculating the relevance score is negligible, so we do not distinguish them. Moreover, in our enhanced schemes (FMSCS), we divide the total dictionary into 1 common sub-dictionary and 20 professional sub-dictionaries (and assume each data owner on average chooses 1 common sub-dictionary and 3 professional sub-dictionaries to generate the index). As shown in Fig. 6, the time for building the index is dominated by both the size of the dictionary and the number of documents. Compared with [6], [13], [14] and our FMS schemes, the FMSCS schemes largely reduce the computation overhead.

Fig. 6. Time for building index. (a) For different sizes of dictionary with the same number of documents, N = 6000. (b) For different numbers of documents with the same size of dictionary, |W| = 4000.

Trapdoor generating. In the Trapdoor generating phase, [6] and [13] first create a vector according to the search keyword set W~ and then encrypt the vector with the secure kNN computation scheme, and [14] also generates a vector and uses homomorphic encryption to encrypt each dimension. In comparison, our FMS I and FMS II schemes should first generate a super-increasing sequence and a weight sequence, respectively. In practice, however, we can pre-select a corresponding sequence for each scheme, which still achieves the search process and privacy: even if the vectors are the same for multiple queries, the trapdoors will not be the same, due to the security of the kNN computation scheme. Therefore, the computation costs of [6], [13] and all FMS schemes in the Trapdoor generating phase are the same.

Fig. 7. Time for generating trapdoor. (a) For different sizes of dictionary with the same number of query keywords, |W~| = 20. (b) For different numbers of query keywords with the same size of dictionary, |W| = 4000.
As shown in Fig. 7, the time for generating a trapdoor is dominated by the size of the dictionary rather than the number of query keywords. Hence, our FMSCS schemes are also very efficient in the Trapdoor generating phase.

Query. As [6], [13] and the FMS all adopt the secure kNN computation scheme, their query times are the same. The computation overhead in the Query phase, as shown in Fig. 8, is greatly affected by the size of the dictionary and the number of documents, and has almost no relation to the number of query keywords. Furthermore, our FMSCS schemes significantly reduce the computation cost in the Query phase. Since [14] needs to encrypt each dimension of the index/trapdoor using fully homomorphic encryption, its index/trapdoor size is enormous. Note that, in the Trapdoor generating and Query phases, the computation overheads are not affected by the number of query keywords. Thus our FMS and FMSCS schemes are more efficient than some multiple-keyword search schemes [26], [27], whose cost is linear in the number of query keywords.

Fig. 8. Time for query. (a) For different sizes of dictionary with the same number of documents and number of search keywords, N = 6000, |W~| = 20. (b) For different numbers of documents with the same size of dictionary and number of search keywords, |W| = 4000, |W~| = 20. (c) For different numbers of search keywords with the same size of dictionary and number of documents, N = 6000, |W| = 4000.

7.3.2 Storage overhead

As shown in Table 4, we provide a comparison of the storage overhead among several schemes. Specifically, we evaluate the storage overhead for three parties: the data owner, the search user and the cloud server.

According to Table 4, in the FMS and the FMSCS, as well as the schemes of [6] and [13], the storage overhead of the data owner is the same. In these schemes, the data owner keeps her secret key K = (S, M_1, M_2) and symmetric key sk locally, where S is an (m+1)-dimensional vector and M_1 and M_2 are (m+1) × (m+1) invertible matrices. All elements of S, M_1 and M_2 are float numbers. Since the size of a float number is 4 bytes, the size of K is 4·(m+1) + 8·(m+1)^2 bytes. We assume that the size of sk is a constant S_sk. Thus, the total storage overhead is 4·(m+1) + 8·(m+1)^2 + S_sk bytes. In [14], however, the storage overhead of the data owner is λ^5/8 bytes, where λ is the security parameter. That storage overhead is 4 GB when we choose λ = 128, which is a popular choice for a fully homomorphic encryption scheme, whereas the storage overhead of the FMS and the FMSCS is about 763 MB when we choose m = 10000, which is large enough for a search scheme. Therefore, the FMS and the FMSCS are more efficient than the scheme in [14] in terms of the storage overhead of the data owner.
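The storage figures quoted above follow directly from the formulas in Table 4; the following two helper functions (the names are ours) simply evaluate them.

```python
def fms_key_bytes(m):
    """K = (S, M1, M2): a 4-byte float per entry of one (m+1)-vector
    and two (m+1) x (m+1) matrices."""
    return 4 * (m + 1) + 8 * (m + 1) ** 2

def fhe_key_bytes(lam):
    """Key material of the fully homomorphic scheme in [14]: lambda^5 / 8 bytes."""
    return lam ** 5 // 8

print(fms_key_bytes(10000) / 2 ** 20)   # ~763 MB for m = 10000
print(fhe_key_bytes(128) / 2 ** 30)     # 4 GB for lambda = 128
```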
As shown in Table 4, a search user in the FMS and the FMSCS, as well as in the schemes of [6] and [13], keeps the secret key K = (S, M_1, M_2) and the symmetric key sk locally, so the total storage overhead is 4·(m+1) + 8·(m+1)^2 + S_sk bytes. In [14], the storage overhead is λ^5/8 + λ^2/8 bytes, which is 4 GB when we choose λ = 128, a popular choice for a fully homomorphic encryption scheme, whereas the storage overhead of the FMS and the FMSCS is about 763 MB when we choose m = 10000, which is large enough for a search scheme. Therefore, the FMS and the FMSCS are more efficient than the scheme in [14] in terms of the storage overhead of the search user.

The cloud server stores the encrypted documents and the indexes. The size of the encrypted documents is the same in all schemes, i.e., N·D_s. For the indexes, in the FMS and in the schemes of [6] and [13], the storage overhead is 8·(m+1)·N bytes. In the FMSCS, the storage overhead is 8·ε·(m+1)·N bytes, where 0 < ε < 1. When m = 10000 and N = 10000, which are large enough for a search scheme, the storage overhead of the indexes is about 132 MB in the FMSCS, while in the schemes of [6] and [13] as well as the FMS the size of the indexes is about 760 MB under the same conditions. In the scheme of [14], the storage overhead of the indexes is N·D_s + m·N·λ^5/8 bytes, which exceeds 4 GB when we choose λ = 128. Therefore, the FMS and the FMSCS are more efficient than the scheme in [14] in terms of the storage overhead of the cloud server.

7.3.3 Communication overhead

As shown in Table 5, we provide a comparison of the communication overhead among several schemes. Specifically, we consider the communication overhead of three links: between the data owner and the cloud server (abbreviated D-C), between the search user and the cloud server (abbreviated C-S) and between the data owner and the search user (abbreviated D-S).

D-C. In the FMS, as well as in the schemes of [6] and [13], the data owner sends information to the cloud server in the form C_j || FID_j || I_j (j = 1, 2, ..., N), where C_j is the encrypted document, FID_j is the identity of the document and I_j is the index. We assume that the average size of a document is D_s, so the size of the documents is N·D_s. We assume the encrypted document identity FID is a 10-byte string, so the total size of the identities is 10·N bytes. The index I_j = (p_a M_1, p_b M_2) contains two (m+1)-dimensional vectors, and each dimension is a float number (4 bytes), so the total size of the indexes is 8·(m+1)·N bytes. Therefore, the total communication overhead is 8·(m+1)·N + 10·N + N·D_s bytes. In the FMSCS, the total communication overhead is 8·ε·(m+1)·N + 10·N + N·D_s bytes; if we choose ε = 0.2, the size of the indexes is 1.6·(m+1)·N bytes, and the total communication overhead of the FMSCS is 1.6·(m+1)·N + 10·N + D_s·N bytes. In [14], however, the communication overhead is N·D_s + m·N·λ^5/8 bytes, where λ is the security parameter. If we choose λ = 128, which is popular for a fully homomorphic encryption scheme, and m = 1000 and N = 10000, which are large enough for a search scheme, the FMS and the FMSCS are more efficient than the scheme in [14] in terms of the communication overhead of D-C.
TABLE 4
Comparison of Storage Overhead (Bytes). (m: the size of the dictionary; N: the number of documents; Ds: the average size of each encrypted document; λ: the security parameter; γ: the decrease rate of the dictionary obtained with our classified sub-dictionaries technology; Ssk: the size of the symmetric key.)

               [14]                     [6], [13] and FMS              FMSCS
Data Owner     λ^5/8                    4·(m+1) + 8·(m+1)^2 + Ssk      4·(m+1) + 8·(m+1)^2 + Ssk
Search User    λ^5/8 + λ^2/8            4·(m+1) + 8·(m+1)^2 + Ssk      4·(m+1) + 8·(m+1)^2 + Ssk
Cloud Server   N·Ds + m·N·λ^5/8         N·Ds + 8·(m+1)·N               N·Ds + 8·γ·(m+1)·N

TABLE 5
Comparison of Communication Overhead (Bytes). (m: the size of the dictionary; N: the number of documents; Ds: the average size of each encrypted document; T: the number of returned documents; λ: the security parameter; γ: the decrease rate of the dictionary obtained with our classified sub-dictionaries technology; Ssk: the size of the symmetric key.)

               [14]                     [6], [13] and FMS              FMSCS
D-C            N·Ds + m·N·λ^5/8         8·(m+1)·N + 10·N + N·Ds        8·γ·(m+1)·N + 10·N + N·Ds
C-S            m·λ^5/8 + T·Ds           8·(m+1) + T·Ds                 8·γ·(m+1) + T·Ds
D-S            λ^5/8 + λ^2/8            4·(m+1) + 8·(m+1)^2 + Ssk      4·(m+1) + 8·(m+1)^2 + Ssk

C-S. The C-S communication consists of two phases: Query and Results returning. In the Query phase, a search user in the FMS as well as the schemes in [6] and [13] sends the trapdoor (M1^-1·qa, M2^-1·qb) to the cloud server, which contains two (m+1)-dimensional vectors. Thus, the communication overhead is 8·(m+1) bytes. In the FMSCS, the communication overhead is 8·γ·(m+1) bytes, with 0 < γ < 1. In the Results returning phase, the cloud server sends the corresponding results to the search user, so the communication overhead grows with the number of returned documents. Assuming that the number of returned documents is T, the communication overhead from the cloud server to the search user is T·Ds bytes. Therefore, in the FMS as well as the schemes in [6] and [13], the total communication overhead of C-S is 8·(m+1) + T·Ds bytes, and in the FMSCS it is 8·γ·(m+1) + T·Ds bytes. In [14], the total communication overhead of C-S is m·λ^5/8 + T·Ds bytes. If we choose λ = 128, which is a common choice for a fully homomorphic encryption scheme, and m = 10000 and N = 10000, which are large enough for a search scheme, the FMS and the FMSCS are more efficient than the scheme in [14] in terms of the communication overhead of C-S.

D-S. From Table 5, we can see that the communication overhead of the FMS and the FMSCS, as well as of the schemes in [6] and [13], is the same. In the Initialization phase, the data owner sends the secret key K = (S, M1, M2) and the symmetric key sk to the search user, where S is an (m+1)-dimensional vector, and M1 and M2 are (m+1) x (m+1) invertible matrices. Thus, the size of the secret key K is 4·(m+1) + 8·(m+1)^2 bytes, and the total communication overhead is 4·(m+1) + 8·(m+1)^2 + Ssk bytes, where Ssk is the size of the symmetric key. However, the communication overhead of the scheme in [14] is λ^5/8 + λ^2/8 bytes.
That communication overhead is 4 GB when we choose λ = 128, which is a common choice for a fully homomorphic encryption scheme, whereas the communication overhead of the FMS and the FMSCS is only about 763 MB when we choose m = 10000, which is large enough for a search scheme. Therefore, the FMS and the FMSCS are more efficient than the scheme in [14] in terms of the communication overhead of D-S.

8 RELATED WORK

There are mainly two types of searchable encryption in the literature: Searchable Public-key Encryption (SPE) and Searchable Symmetric Encryption (SSE).

8.1 SPE

SPE was first proposed by Boneh et al. [28]; it supports single keyword search on encrypted data, but its computation overhead is heavy. Within the framework of SPE, Boneh et al. [27] propose conjunctive, subset, and range queries on encrypted data. Hwang et al. [29] propose a conjunctive keyword scheme which supports multi-keyword search. Zhang et al. [17] propose an efficient public key encryption with conjunctive-subset keywords search. However, these conjunctive keyword schemes can only return results that match all the keywords simultaneously, and cannot rank the returned results. Liu et al. [30] propose a ranked query scheme which uses a mask matrix to achieve cost-effectiveness. Yu et al. [14] propose a multi-keyword top-k retrieval scheme with fully homomorphic encryption, which can return ranked results and achieve high security. In general, although SPE allows more expressive queries than SSE [13], it is less efficient, and we therefore adopt SSE in this work.

8.2 SSE

The concept of SSE was first developed by Song et al. [8]. Wang et al. [25] develop a ranked keyword search scheme which considers the relevance score of a keyword. However, the above schemes cannot efficiently support multi-keyword search, which is widely used to provide a better experience to the search user. Later, Sun et al. [13] propose a multi-keyword search scheme which considers the relevance scores of keywords and achieves efficient query by utilizing a multidimensional tree technique. A widely adopted multi-keyword search approach is multi-keyword ranked search (MRSE) [6]. This approach can return the ranked results of searching according to the number of matching keywords. Li et al. [10] utilize the relevance score and k-nearest neighbor techniques to develop an efficient multi-keyword search scheme that can return ranked search results based on their accuracy. Within this framework, they leverage an efficient index to further improve the search efficiency, and adopt the blind storage system to conceal the access pattern of the search user. Li et al. [19] also propose an authorized and ranked multi-keyword search scheme (ARMS) over encrypted cloud data by leveraging the ciphertext policy attribute-based encryption (CP-ABE) and SSE techniques.
Security analysis demonstrates that the proposed ARMS scheme can achieve collusion resistance. In this paper, we propose the FMS(CS) schemes, which not only support multi-keyword search over encrypted data, but also achieve fine-grained keyword search with the ability to take into account the relevance scores and the preference factors of keywords and, more importantly, the logical rules of keywords. In addition, with the classified sub-dictionaries, our proposal is efficient in terms of index building, trapdoor generating and query.

9 CONCLUSION

In this paper, we have investigated the fine-grained multi-keyword search (FMS) problem over encrypted cloud data and proposed two FMS schemes. The FMS I incorporates both the relevance scores and the preference factors of keywords to enable more precise search and a better user experience, respectively. The FMS II achieves secure and efficient search with practical functionality, i.e., "AND", "OR" and "NO" operations on keywords. Furthermore, we have proposed enhanced schemes supporting classified sub-dictionaries (FMSCS) to improve efficiency.

For future work, we intend to extend the proposal to consider the extensibility of the file set and multi-user cloud environments. Towards this direction, we have obtained some preliminary results on the extensibility [5] and on multi-user cloud environments [19]. Another interesting topic is to develop highly scalable searchable encryption that enables efficient search on large practical databases.

ACKNOWLEDGMENT

This work is supported by the National Natural Science Foundation of China under Grants 61472065, 61350110238, 61103207, U1233108, U1333127, and 61272525, the International Science and Technology Cooperation and Exchange Program of Sichuan Province, China under Grant 2014HH0029, the China Postdoctoral Science Foundation funded project under Grant 2014M552336, and the State Key Laboratory of Information Security Open Foundation under Grant 2015-MS-02.

REFERENCES

[1] H. Liang, L. X. Cai, D. Huang, X. Shen, and D. Peng, "An SMDP-based service model for interdomain resource allocation in mobile cloud networks," IEEE Transactions on Vehicular Technology, vol. 61, no. 5, pp. 2222-2232, 2012.
[2] M. M. Mahmoud and X. Shen, "A cloud-based scheme for protecting source-location privacy against hotspot-locating attack in wireless sensor networks," IEEE Transactions on Parallel and Distributed Systems, vol. 23, no. 10, pp. 1805-1818, 2012.
[3] Q. Shen, X. Liang, X. Shen, X. Lin, and H. Luo, "Exploiting geo-distributed clouds for e-health monitoring system with minimum service delay and privacy preservation," IEEE Journal of Biomedical and Health Informatics, vol. 18, no. 2, pp. 430-439, 2014.
[4] T. Jung, X. Mao, X. Li, S.-J. Tang, W. Gong, and L. Zhang, "Privacy-preserving data aggregation without secure channel: multivariate polynomial evaluation," in Proceedings of INFOCOM. IEEE, 2013, pp. 2634-2642.
[5] Y. Yang, H. Li, W. Liu, H. Yang, and M. Wen, "Secure dynamic searchable symmetric encryption with constant document update cost," in Proceedings of GLOBECOM. IEEE, 2014, to appear.
[6] N. Cao, C. Wang, M. Li, K. Ren, and W. Lou, "Privacy-preserving multi-keyword ranked search over encrypted cloud data," IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 1, pp. 222-233, 2014.
[7] https://support.google.com/websearch/answer/173733?hl=en.
[8] D. X. Song, D. Wagner, and A. Perrig, "Practical techniques for searches on encrypted data," in Proceedings of S&P. IEEE, 2000, pp. 44-55.
[9] R. Li, Z. Xu, W. Kang, K. C. Yow, and C.-Z. Xu, "Efficient multi-keyword ranked query over encrypted data in cloud computing," Future Generation Computer Systems, vol. 30, pp. 179-190, 2014.
[10] H. Li, D. Liu, Y. Dai, T. H. Luan, and X. Shen, "Enabling efficient multi-keyword ranked search over encrypted cloud data through blind storage," IEEE Transactions on Emerging Topics in Computing, 2014, DOI: 10.1109/TETC.2014.2371239.
[11] C. Wang, N. Cao, J. Li, K. Ren, and W. Lou, "Secure ranked keyword search over encrypted cloud data," in Proceedings of ICDCS. IEEE, 2010, pp. 253-262.
[12] A. Boldyreva, N. Chenette, Y. Lee, and A. O'Neill, "Order-preserving symmetric encryption," in Advances in Cryptology-EUROCRYPT. Springer, 2009, pp. 224-241.
[13] W. Sun, B. Wang, N. Cao, M. Li, W. Lou, Y. T. Hou, and H. Li, "Verifiable privacy-preserving multi-keyword text search in the cloud supporting similarity-based ranking," IEEE Transactions on Parallel and Distributed Systems, DOI: 10.1109/TPDS.2013.282, 2013.
[14] J. Yu, P. Lu, Y. Zhu, G. Xue, and M. Li, "Towards secure multi-keyword top-k retrieval over encrypted cloud data," IEEE Transactions on Dependable and Secure Computing, vol. 10, no. 4, pp. 239-250, 2013.
[15] A. Arvanitis and G. Koutrika, "Towards preference-aware relational databases," in International Conference on Data Engineering (ICDE). IEEE, 2012, pp. 426-437.
[16] G. Koutrika, E. Pitoura, and K. Stefanidis, "Preference-based query personalization," in Advanced Query Processing. Springer, 2013, pp. 57-81.
[17] B. Zhang and F. Zhang, "An efficient public key encryption with conjunctive-subset keywords search," Journal of Network and Computer Applications, vol. 34, no. 1, pp. 262-267, 2011.
[18] D. Stinson, Cryptography: Theory and Practice. CRC Press, 2006.
[19] H. Li, D. Liu, K. Jia, and X. Lin, "Achieving authorized and ranked multi-keyword search over encrypted cloud data," in Proceedings of ICC. IEEE, 2015, to appear.
[20] S. Zerr, E. Demidova, D. Olmedilla, W. Nejdl, M. Winslett, and S. Mitra, "Zerber: r-confidential indexing for distributed documents," in Proceedings of the 11th International Conference on Extending Database Technology (EDBT). ACM, 2008, pp. 287-298.
[21] W. K. Wong, D. W.-L. Cheung, B. Kao, and N. Mamoulis, "Secure kNN computation on encrypted databases," in Proceedings of the SIGMOD International Conference on Management of Data. ACM, 2009, pp. 139-152.
[22] J. Zobel and A. Moffat, "Exploring the similarity space," in ACM SIGIR Forum, vol. 32, no. 1. ACM, 1998, pp. 18-34.
[23] N. Ferguson, R. Schroeppel, and D. Whiting, "A simple algebraic representation of Rijndael," in Selected Areas in Cryptography. Springer, 2001, pp. 103-111.
[24] http://kdd.ics.uci.edu/databases/nsfabs/nsfawards.html.
[25] C. Wang, N. Cao, K. Ren, and W. Lou, "Enabling secure and efficient ranked keyword search over outsourced cloud data," IEEE Transactions on Parallel and Distributed Systems, vol. 23, no. 8, pp. 1467-1479, 2012.
[26] P. Golle, J. Staddon, and B. Waters, "Secure conjunctive keyword search over encrypted data," in Applied Cryptography and Network Security. Springer, 2004, pp. 31-45.
[27] D. Boneh and B. Waters, "Conjunctive, subset, and range queries on encrypted data," in Theory of Cryptography. Springer, 2007, pp. 535-554.
[28] D. Boneh, G. Di Crescenzo, R. Ostrovsky, and G. Persiano, "Public key encryption with keyword search," in Advances in Cryptology-EUROCRYPT. Springer, 2004, pp. 506-522.
[29] Y. Hwang and P. Lee, "Public key encryption with conjunctive keyword search and its extension to a multi-user system," in Proceedings of Pairing. Springer, 2007, pp. 2-22.
[30] Q. Liu, C. C. Tan, J. Wu, and G. Wang, "Efficient information retrieval for ranked queries in cost-effective cloud environments," in Proceedings of INFOCOM. IEEE, 2012, pp. 2581-2585.

Hongwei Li is an Associate Professor with the School of Computer Science and Engineering, University of Electronic Science and Technology of China, China. He received the PhD degree in computer software and theory from the University of Electronic Science and Technology of China, China, in 2008. He worked as a Post-Doctoral Fellow in the Department of Electrical and Computer Engineering at the University of Waterloo for one year, until Oct. 2012. His research interests include network security, applied cryptography, and trusted computing. Dr. Li serves as an Associate Editor of Peer-to-Peer Networking and Applications and as a Guest Editor for the Peer-to-Peer Networking and Applications Special Issue on Security and Privacy of P2P Networks in Emerging Smart City. He also serves on the technical program committees of many international conferences such as IEEE INFOCOM, IEEE ICC, IEEE GLOBECOM, IEEE WCNC, IEEE SmartGridComm, BODYNETS and IEEE DASC. He is a member of IEEE, a member of the China Computer Federation and a member of the China Association for Cryptologic Research.

Yi Yang received his B.S. degree in Network Engineering from Tianjin University of Science and Technology (TUST) in 2012. Currently, he is a master student at the School of Computer Science and Engineering, University of Electronic Science and Technology of China (UESTC), China. He serves as a reviewer for Peer-to-Peer Networking and Applications, IEEE INFOCOM, IEEE ICC, IEEE GLOBECOM, IEEE ICCC, etc. His research interests include cryptography and the secure smart grid.

Tom H. Luan received the B.Sc. degree from Xi'an Jiaotong University, China, in 2004, the M.Phil. degree from the Hong Kong University of Science and Technology, Hong Kong, China, in 2007, and the Ph.D. degree from the University of Waterloo, Canada, in 2012. Since December 2013, he has been the Lecturer in Mobile and Apps at the School of Information Technology, Deakin University, Melbourne Burwood, Australia. His research mainly focuses on vehicular networking, wireless content distribution, peer-to-peer networking and mobile cloud computing.

Xiaohui Liang received the B.Sc. degree in Computer Science and Engineering and the M.Sc. degree in Computer Software and Theory from Shanghai Jiao Tong University (SJTU), China, in 2006 and 2009, respectively. He is currently working toward a Ph.D. degree in the Department of Electrical and Computer Engineering, University of Waterloo, Canada. His research interests include applied cryptography, and security and privacy issues for e-healthcare systems, cloud computing, mobile social networks, and smart grid.

Liang Zhou is a professor with the National Key Lab of Science and Technology on Communication at the University of Electronic Science and Technology of China, China.
His current research interests include error control coding, secure communication and cryptography.

Xuemin (Sherman) Shen is a Professor and University Research Chair, Department of Electrical and Computer Engineering, University of Waterloo, Canada. He was the Associate Chair for Graduate Studies from 2004 to 2008. Dr. Shen's research focuses on resource management in interconnected wireless/wired networks, wireless network security, and vehicular ad hoc and sensor networks. Dr. Shen served as the Technical Program Committee Chair for IEEE VTC'10 Fall and IEEE Globecom'07. He also serves/served as the Editor-in-Chief of IEEE Network, Peer-to-Peer Networking and Applications, and IET Communications; a Founding Area Editor of IEEE Transactions on Wireless Communications; and an Associate Editor of IEEE Transactions on Vehicular Technology and Computer Networks. Dr. Shen is a registered Professional Engineer of Ontario, Canada, an IEEE Fellow, an Engineering Institute of Canada Fellow, a Canadian Academy of Engineering Fellow, and a Distinguished Lecturer of the IEEE Vehicular Technology Society and Communications Society.

Enabling Efficient Multi-Keyword Ranked Search Over Encrypted Mobile Cloud Data Through Blind Storage

In mobile cloud computing, a fundamental application is to outsource the mobile data to external cloud servers for scalable data storage. The outsourced data, however, need to be encrypted due to the privacy and confidentiality concerns of their owner. This results in significant difficulty in performing accurate search over the encrypted mobile cloud data.

In this paper, we develop a searchable encryption scheme for multi-keyword ranked search over the stored data. Specifically, by considering the large number of outsourced documents (data) in the cloud, we utilize the relevance score and k-nearest neighbor techniques to develop an efficient multi-keyword search scheme that can return the ranked search results based on the accuracy.

Within this framework, we leverage an efficient index to further improve the search efficiency, and adopt the blind storage system to conceal the access pattern of the search user. Security analysis demonstrates that our scheme can achieve confidentiality of documents and index, trapdoor privacy, trapdoor unlinkability, and concealment of the access pattern of the search user. Finally, using extensive simulations, we show that our proposal can achieve much improved efficiency in terms of search functionality and search time compared with the existing proposals.

1.1 GOAL OF THE PROJECT:

As an efficient and privacy-preserving multi-keyword ranked search scheme over encrypted mobile cloud data built on a blind storage system, the EMRS has the following design goals:

• Multi-Keyword Ranked Search: To meet the requirements for practical uses and provide better user experience, the EMRS should not only support multi-keyword search over encrypted mobile cloud data, but also achieve relevance-based result ranking.

• Search Efficiency: Since the number of the total documents may be very large in a practical situation, the EMRS should achieve sublinear search with better search efficiency.

• Confidentiality and Privacy Preservation: To prevent the cloud server from learning any additional information about the documents and the index, and to keep search users’ trapdoors secret, the EMRS should cover all the security requirements that we introduced above.

1.2 INTRODUCTION

Mobile cloud computing gets rid of the hardware limitations of mobile devices by exploiting scalable and virtualized cloud storage and computing resources, and accordingly is able to provide much more powerful and scalable mobile services to users. In mobile cloud computing, mobile users typically outsource their data to external cloud servers, e.g., iCloud, to enjoy a stable, low-cost and scalable way of storing and accessing data. However, outsourced data typically contain sensitive private information, such as personal photos and emails, which could lead to severe confidentiality and privacy violations without effective protection. It is therefore necessary to encrypt the sensitive data before outsourcing them to the cloud. The data encryption, however, results in significant difficulties when other users need to access data of interest through search, because searching over encrypted data is hard.

This fundamental issue in mobile cloud computing accordingly motivates an extensive body of research in recent years on searchable encryption techniques to achieve efficient searching over outsourced encrypted data. A collection of research works has recently been developed on the topic of multi-keyword search over encrypted data: a symmetric searchable encryption scheme that achieves high efficiency for large databases with a modest sacrifice in security guarantees; a multi-keyword search scheme supporting result ranking by adopting the k-nearest neighbors (kNN) technique; and a dynamic searchable encryption scheme built on blind storage to conceal the access pattern of the search user.

In order to meet the practical search requirements, search over encrypted data should support the following three functions.

First, the searchable encryption schemes should support multi-keyword search and provide a user experience similar to searching with multiple keywords in Google; single-keyword search is far from satisfactory, as it returns only very limited and often inaccurate results. Second, to quickly identify the most relevant results, the search user would typically prefer cloud servers to sort the returned search results in a relevance-based order, ranked by the relevance of the search request to the documents. In addition, returning ranked results to users can also eliminate unnecessary network traffic, by sending back only the most relevant results from the cloud to the search users.

Third, as for the search efficiency, since the number of the documents contained in a database could be extraordinarily large, searchable encryption schemes should be efficient to quickly respond to the search requests with minimum delays.

In contrast to the theoretical benefits, most of the existing proposals fail to offer sufficient insights towards the construction of fully functional searchable encryption as described above. As an effort to address this issue, in this paper, we propose an efficient multi-keyword ranked search (EMRS) scheme over encrypted mobile cloud data through blind storage.

Our main contributions can be summarized as follows:

• We introduce a relevance score in searchable encryption to achieve multi-keyword ranked search over the encrypted mobile cloud data. In addition to that, we construct an efficient index to improve the search efficiency.

• By modifying the blind storage system in the EMRS, we solve the trapdoor unlinkability problem and conceal access pattern of the search user from the cloud server.

• We give a thorough security analysis to demonstrate that the EMRS can reach a high security level, including confidentiality of documents and index, trapdoor privacy, trapdoor unlinkability, and concealing the access pattern of the search user. Moreover, we implement extensive experiments, which show that the EMRS can achieve enhanced efficiency in terms of functionality and search efficiency compared with existing proposals.

1.3 LITERATURE SURVEY

CHAPTER 2

2.0 SYSTEM ANALYSIS

2.1 EXISTING SYSTEM:

Existing works built various types of secure indexes and corresponding index-based keyword matching algorithms to improve search efficiency. All these works support only single-keyword search. Subsequent works extended the search capability to multiple (conjunctive or disjunctive) keywords, but they support only exact keyword matching: misspelled keywords in the query result in wrong or no matches. Very recently, a few works extended the search capability to approximate keyword matching (also known as fuzzy search). These are all for single-keyword search, with a common approach of expanding the index file to cover possible combinations of keyword misspellings so that a certain degree of spelling error, measured by edit distance, can be tolerated. Although a wildcard approach is adopted to minimize the expansion of the resulting index file, for an l-letter keyword to tolerate errors up to an edit distance of d, the index still has to be expanded considerably.

Thus, it is not scalable, as the storage complexity increases exponentially with the increase of the error tolerance. To support multi-keyword search, the search algorithm would also have to run multiple rounds. To date, efficient multi-keyword fuzzy search over encrypted data remains a challenging problem. We want to point out that the efforts on search over encrypted data involve not only information retrieval techniques, such as advanced data structures used to represent the searchable index and efficient search algorithms that run over the corresponding data structures, but also the proper design of cryptographic protocols to ensure the security and privacy of the overall system. Although single-keyword search and fuzzy search have been implemented separately, a simple combination of the two does not lead to a secure and efficient multi-keyword fuzzy search scheme.
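To make the expansion described above concrete, the sketch below builds a wildcard-based fuzzy keyword set for edit distance 1; it illustrates the general idea behind such schemes rather than the exact construction of any cited work, and the class and method names are illustrative only.

import java.util.LinkedHashSet;
import java.util.Set;

/**
 * Illustration of the wildcard-based index expansion discussed above: for edit
 * distance 1, a keyword is indexed under variants in which '*' stands for one
 * inserted, deleted or substituted character. This is a generic sketch of the
 * idea, not the exact construction of any cited scheme.
 */
public class WildcardFuzzySet {

    static Set<String> fuzzySetDistance1(String w) {
        Set<String> variants = new LinkedHashSet<String>();
        variants.add(w);                                        // the keyword itself
        for (int i = 0; i <= w.length(); i++) {                 // '*' inserted at every gap
            variants.add(w.substring(0, i) + "*" + w.substring(i));
        }
        for (int i = 0; i < w.length(); i++) {                  // '*' replacing every character
            variants.add(w.substring(0, i) + "*" + w.substring(i + 1));
        }
        return variants;
    }

    public static void main(String[] args) {
        // "cloud" yields 2*5 + 2 = 12 variants; the set grows much faster for larger
        // edit distances, which is the scalability problem noted in the text.
        System.out.println(fuzzySetDistance1("cloud"));
    }
}

For a single l-letter keyword and d = 1, the index already needs 2l + 2 entries instead of one, and the count grows much faster as d increases.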

2.1.1 DISADVANTAGES:

Given the large number of data users and documents in the cloud, it is crucial for the search service to allow multi-keyword queries and provide result similarity ranking to meet the effective data retrieval need. Existing searchable encryption schemes focus on single-keyword or Boolean keyword search, and rarely differentiate the search results.

  • Single-keyword search without ranking
  • Boolean keyword search without ranking
  • Single-keyword similarity search with ranking


2.2 PROPOSED SYSTEM:

Prior works propose a symmetric searchable encryption scheme which achieves high efficiency for large databases with a modest sacrifice in security guarantees, a multi-keyword search scheme supporting result ranking by adopting the k-nearest neighbors (kNN) technique, and a dynamic searchable encryption scheme through blind storage to conceal the access pattern of the search user.

We propose the detailed EMRS. Since the encrypted documents and the index z are both stored in the blind storage system, we first provide the general construction of the blind storage system. Moreover, the EMRS aims to eliminate the risk of sharing the key used to encrypt the documents with all search users, and to solve the trapdoor unlinkability problem in Naveed's scheme.

To these ends, we modify the construction of blind storage and leverage the ciphertext policy attribute-based encryption (CP-ABE) technique in the EMRS. However, the specific construction of CP-ABE is out of the scope of this paper and we only give a simple indication here. The notations of this paper are shown in Table 1. The EMRS consists of the following phases: System Setup, Construction of Blind Storage, Encrypted Database Setup, Trapdoor Generation, Efficient and Secure Search, and Retrieve Documents from Blind Storage.

2.2.1 ADVANTAGES:

In this paper, we propose an efficient multi-keyword ranked search (EMRS) scheme over encrypted mobile cloud data through blind storage.

Our main contributions can be summarized as follows:

• We introduce a relevance score in searchable encryption to achieve multi-keyword ranked search over the encrypted mobile cloud data. In addition to that, we construct an efficient index to improve the search efficiency.

• By modifying the blind storage system in the EMRS, we solve the trapdoor unlinkability problem and conceal access pattern of the search user from the cloud server.

• We give a thorough security analysis to demonstrate that the EMRS can reach a high security level, including confidentiality of documents and index, trapdoor privacy, trapdoor unlinkability, and concealing the access pattern of the search user. Moreover, we implement extensive experiments, which show that the EMRS can achieve enhanced efficiency in terms of functionality and search efficiency compared with existing proposals.

2.3 HARDWARE & SOFTWARE REQUIREMENTS:

2.3.1 HARDWARE REQUIREMENT:

  • Processor                               –    Pentium IV

  • Speed                                      –    1.1 GHz
    • RAM                                       –    256 MB (min)
    • Hard Disk                               –   20 GB
    • Floppy Drive                           –    1.44 MB
    • Key Board                              –    Standard Windows Keyboard
    • Mouse                                     –    Two or Three Button Mouse
    • Monitor                                   –    SVGA

 

2.3.2 SOFTWARE REQUIREMENTS:

  • Operating System                   :           Windows XP or Win7
  • Front End                                :           JAVA JDK 1.7
  • Back End                                :           MYSQL Server
  • Server                                      :           Apache Tomcat Server
  • Script                                       :           JSP Script
  • Document                               :           MS-Office 2007

CHAPTER 3

3.0 SYSTEM DESIGN:

Data Flow Diagram / Use Case Diagram / Flow Diagram:

  • The DFD is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on these data, and the output data generated by the system.
  • The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components. These components are the system process, the data used by the process, an external entity that interacts with the system and the information flows in the system.
  • DFD shows how the information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.
  • DFD is also known as bubble chart. A DFD may be used to represent a system at any level of abstraction. DFD may be partitioned into levels that represent increasing information flow and functional detail.

NOTATION:

SOURCE OR DESTINATION OF DATA:

External sources or destinations, which may be people or organizations or other entities

DATA SOURCE:

Here the data referenced by a process is stored and retrieved.

PROCESS:

People, procedures or devices that produce data. The physical component is not identified.

DATA FLOW:

Data moves in a specific direction from an origin to a destination. The data flow is a “packet” of data.

MODELING RULES:

There are several common modeling rules when creating DFDs:

  1. All processes must have at least one data flow in and one data flow out.
  2. All processes should modify the incoming data, producing new forms of outgoing data.
  3. Each data store must be involved with at least one data flow.
  4. Each external entity must be involved with at least one data flow.
  5. A data flow must be attached to at least one process.


3.1 ARCHITECTURE DIAGRAM

3.2 DATAFLOW DIAGRAM

UML DIAGRAMS:

3.2 USE CASE DIAGRAM:

3.3 CLASS DIAGRAM:

3.4 SEQUENCE DIAGRAM:

3.5 ACTIVITY DIAGRAM:

CHAPTER 4

4.0 IMPLEMENTATION:

EMRS SCHEME (EFFICIENT MULTI-KEYWORD RANKED SEARCH):

4.1 ALGORITHM

CP-ABE ENCRYPTION ALGORITHM:

The data owner builds the encrypted database as follows:

Step 1: The data owner computes the d-dimensional relevance vector p = (p1, p2, ..., pd) for each document using the TF-IDF weighting technique, where pj for j ∈ (1, 2, ..., d) represents the weighting of keyword ωj in document di. Then, the data owner extends p to a (d+2)-dimensional vector p*. The (d+1)-th entry of p* is set to a random number ε and the (d+2)-th entry is set to 1. We let ε follow a normal distribution N(µ, σ^2) [11]. For each document di, to compute the encrypted relevance vector, the data owner encrypts the associated extended relevance vector p* using the secret key M1, M2 and S. First, the data owner chooses a random number r and splits the extended relevance vector p* into two (d+2)-dimensional vectors p' and p'' using the vector S, following the standard secure kNN splitting rule for each j-th entry of p*: if S[j] = 1, p'[j] and p''[j] are set to two random values whose sum equals p*[j]; if S[j] = 0, both are set to p*[j].
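The following is a minimal sketch, in Java, of how Step 1 could be realized with the standard secure kNN construction: extend the relevance vector, split it with the bit vector S, and hide the two shares with the secret matrices M1 and M2. The class and method names, and the exact index form (M1^T·p', M2^T·p''), are illustrative assumptions rather than the scheme's literal definition.

import java.security.SecureRandom;

/**
 * Minimal sketch (not the authors' code) of the secure kNN encryption outlined
 * in Step 1: extend the relevance vector, split it with the bit vector S, and
 * hide the two shares with the secret matrices M1 and M2. The shapes follow the
 * common MRSE-style construction; method names and the exact index form
 * (M1^T * p', M2^T * p'') are illustrative assumptions.
 */
public class KnnIndexSketch {
    private static final SecureRandom RNG = new SecureRandom();

    // p (length d) -> p* (length d+2): p* = (p1, ..., pd, epsilon, 1).
    static double[] extend(double[] p, double epsilon) {
        double[] pStar = new double[p.length + 2];
        System.arraycopy(p, 0, pStar, 0, p.length);
        pStar[p.length] = epsilon;        // caller draws epsilon from N(mu, sigma^2)
        pStar[p.length + 1] = 1.0;
        return pStar;
    }

    // Split p* into p' and p'' with the bit vector S (data-vector splitting rule).
    static double[][] split(double[] pStar, int[] S) {
        double[] p1 = new double[pStar.length];
        double[] p2 = new double[pStar.length];
        for (int j = 0; j < pStar.length; j++) {
            if (S[j] == 1) {              // two random shares summing to p*[j]
                p1[j] = RNG.nextDouble();
                p2[j] = pStar[j] - p1[j];
            } else {                      // both shares equal p*[j]
                p1[j] = p2[j] = pStar[j];
            }
        }
        return new double[][] { p1, p2 };
    }

    // Encrypted relevance vector I = (M1^T * p', M2^T * p'').
    static double[][] encryptIndex(double[] pStar, int[] S, double[][] M1, double[][] M2) {
        double[][] shares = split(pStar, S);
        return new double[][] { mulTranspose(M1, shares[0]), mulTranspose(M2, shares[1]) };
    }

    // Computes M^T * v for a square matrix M.
    static double[] mulTranspose(double[][] M, double[] v) {
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            for (int j = 0; j < v.length; j++) {
                out[i] += M[j][i] * v[j];
            }
        }
        return out;
    }
}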

Step 2: For each document di in D, divide the document into blocks of mb bits each. Each block carries a header H(idi) indicating that the block belongs to document di, and the size sizei of the document is contained in the header of the first block of di. Then, for each document di, the data owner chooses a 192-bit key Ki for the algorithm Enc(). More precisely, for each block B[j] of the document di, where j is the index number of this block, the data owner computes Ki ⊕ ψ(j) as the key for the encryption of this block, where ψ(·) is a function of the block index. Since each block has a unique index number, the blocks of the same document are encrypted with different keys. The document di thus consists of sizei encrypted blocks, each encrypted under the key derived from its own index number.
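A minimal sketch of the per-block key derivation and encryption in Step 2 is given below. It assumes ψ(j) is a hash-based 192-bit encoding of the block index j and that Enc() is AES; the report only fixes the 192-bit key length, so these choices and the names are illustrative.

import java.nio.ByteBuffer;
import java.security.MessageDigest;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

/**
 * Sketch of the per-block key derivation and encryption in Step 2. It assumes
 * psi(j) is a hash-based 192-bit encoding of the block index j and that Enc()
 * is AES; both choices and the names are illustrative. (AES-192 may require the
 * unlimited-strength JCE policy on older JREs.)
 */
public class BlockEncryptSketch {

    // psi(j): expand the block index j into 24 bytes (192 bits) -- an assumed encoding.
    static byte[] psi(long j) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(ByteBuffer.allocate(8).putLong(j).array());
        byte[] out = new byte[24];
        System.arraycopy(digest, 0, out, 0, 24);
        return out;
    }

    // Per-block key Ki XOR psi(j): every block of a document gets a different key.
    static byte[] blockKey(byte[] Ki, long j) throws Exception {
        byte[] p = psi(j);
        byte[] k = new byte[24];
        for (int t = 0; t < 24; t++) {
            k[t] = (byte) (Ki[t] ^ p[t]);
        }
        return k;
    }

    // Enc(): AES over one block (header || payload); the 16-byte IV is supplied by the caller.
    static byte[] encryptBlock(byte[] Ki, long j, byte[] block, byte[] iv) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(blockKey(Ki, j), "AES"),
                new IvParameterSpec(iv));
        return cipher.doFinal(block);
    }
}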

Finally, the data owner encrypts all the documents and writes them to the blind storage system using the B.Build function.

Step 3: To enable efficient search over the encrypted documents, the data owner builds the index z. First, the data owner defines the access policy υi for each document di. We denote the result of attribute-based encryption using access policy υi as ABEυi(). The data owner initializes z to an empty array indexed by all keywords. Then, the index z can be constructed as shown in Algorithm 1.
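Algorithm 1 itself is not reproduced in this report; the sketch below only illustrates the shape of the index z described in Step 3: for every keyword, z keeps one entry per document containing that keyword, pairing the encrypted relevance information from Step 1 with the CP-ABE encrypted descriptor ABEυi(idi || Ki || x). The CP-ABE ciphertext is treated as an opaque byte array, and all names are illustrative.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Rough illustration of the index z described in Step 3. For every keyword, z
 * keeps one entry per document containing that keyword: the encrypted relevance
 * information from Step 1 together with the CP-ABE encrypted descriptor
 * ABE_vi(id_i || K_i || x), treated here as an opaque byte array.
 */
public class IndexSketch {

    /** One index entry: what the cloud server stores for a (keyword, document) pair. */
    static class Entry {
        double[][] encryptedRelevance;  // the pair (M1^T p', M2^T p'') from Step 1
        byte[] abeDescriptor;           // ABE_vi(id_i || K_i || x), produced by a CP-ABE library
    }

    // z: keyword -> entries of all documents containing that keyword.
    private final Map<String, List<Entry>> z = new HashMap<String, List<Entry>>();

    void addDocument(Iterable<String> keywords, double[][] encryptedRelevance, byte[] abeDescriptor) {
        Entry e = new Entry();
        e.encryptedRelevance = encryptedRelevance;
        e.abeDescriptor = abeDescriptor;
        for (String w : keywords) {
            List<Entry> entries = z.get(w);
            if (entries == null) {
                entries = new ArrayList<Entry>();
                z.put(w, entries);      // initialize the posting list for a new keyword
            }
            entries.add(e);
        }
    }

    List<Entry> lookup(String keyword) {
        return z.get(keyword);          // entries retrieved when this keyword is addressed in a search
    }
}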

4.2 MODULES:

SEARCHABLE ENCRYPTION CP-ABE:

MULTI-KEYWORD RANKED SEARCH:

BLIND STORAGE SYSTEM:

EMRS SECURITY REQUIREMENTS:

4.3 MODULE DESCRIPTION:

SEARCHABLE ENCRYPTION CP-ABE:

In ciphertext policy attribute-based encryption (CP-ABE), ciphertexts are created with an access structure (usually an access tree) which defines the access policy. A user can decrypt the data only if the attributes embedded in his attribute keys satisfy the access policy in the ciphertext. In CP-ABE, the encrypter holds the ultimate authority over the access policy. The documents are encrypted by the traditional symmetric cryptography technique before being outsourced to the cloud server. Without a correct key, the search user and the cloud server cannot decrypt the documents. As for index confidentiality, the relevance vector for each document is encrypted using the secret key M1, M2, and S, and the descriptors of the documents are encrypted using the CP-ABE technique. Thus, the cloud server can only use the index z to retrieve the encrypted relevance vectors without knowing any additional information, such as the associations between the documents and the keywords. Only the search user with correct attribute keys can decrypt the descriptor ABEυi(idi || Ki || x) to get the document id and the associated symmetric key. Thus, the confidentiality of documents and index can be well protected.

MULTI-KEYWORD RANKED SEARCH:

Multi-keyword ranked search over encrypted data should support the following three functions. First, the searchable encryption schemes should support multi-keyword search and provide a user experience similar to searching with multiple keywords in Google; single-keyword search is far from satisfactory, as it returns only very limited and often inaccurate results. Second, to quickly identify the most relevant results, the search user would typically prefer cloud servers to sort the returned search results in a relevance-based order, ranked by the relevance of the search request to the documents. In addition, returning ranked results to users can also eliminate unnecessary network traffic, by sending back only the most relevant results from the cloud to the search users. Third, as for the search efficiency, since the number of the documents contained in a database could be extraordinarily large, searchable encryption schemes should be efficient enough to quickly respond to search requests with minimum delays.

BLIND STORAGE SYSTEM:

A blind storage system is built on the cloud server to support adding, updating and deleting documents and concealing the access pattern of the search user from the cloud server. In the blind storage system, all documents are divided into fixed-size blocks. These blocks are indexed by a sequence of random integers generated by a document-related seed. In the view of a cloud server, it can only see the blocks of encrypted documents uploaded and downloaded. Thus, the blind storage system leaks little information to the cloud server. Specifically, the cloud server does not know which blocks are of the same document, even the total number of the documents and the size of each document. Moreover, all the documents and index can be stored in the blind storage system to achieve a searchable encryption scheme.
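The sketch below illustrates the addressing idea just described: the positions of a document's blocks are derived from a document-related seed, so the cloud server cannot tell which blocks belong together. The hash-based generator and the names used here are assumptions for illustration, not the exact pseudorandom sequence defined by the blind storage construction.

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

/**
 * Illustration of blind storage addressing: the positions of a document's
 * blocks are derived from a document-related seed, so the server cannot tell
 * which blocks belong to the same document. The hash-based generator and the
 * names are illustrative assumptions, not the exact pseudorandom sequence of
 * the blind storage construction.
 */
public class BlindStorageSketch {

    // i-th pseudorandom block position for the document identified by 'seed'.
    static int blockPosition(byte[] seed, int i, int totalBlocks) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        md.update(seed);
        md.update(ByteBuffer.allocate(4).putInt(i).array());
        byte[] h = md.digest();
        int v = ((h[0] & 0x7f) << 24) | ((h[1] & 0xff) << 16) | ((h[2] & 0xff) << 8) | (h[3] & 0xff);
        return v % totalBlocks;            // non-negative position in the block array
    }

    public static void main(String[] args) throws Exception {
        // A document-related seed (hypothetical); only its holder can regenerate the sequence.
        byte[] seed = MessageDigest.getInstance("SHA-256")
                .digest("document-42-secret-seed".getBytes(StandardCharsets.UTF_8));
        for (int i = 0; i < 5; i++) {      // first few positions in an array of 2^20 blocks
            System.out.println("block " + i + " -> position " + blockPosition(seed, i, 1 << 20));
        }
    }
}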

EMRS SECURITY REQUIREMENTS:

In the EMRS, we consider the cloud server to be curious but honest, which means that it correctly executes the tasks assigned by the data owner and the search user. However, it is curious about the data in its storage and the received trapdoors, and tries to obtain additional information from them. Moreover, we consider the known background model in the EMRS, which allows the cloud server to know more background information about the documents, such as statistical information about the keywords.

Specifically, the EMRS aims to provide the following four security requirements:

• Confidentiality of Documents and Index: Documents and index should be encrypted before being outsourced to a cloud server. The cloud server should be prevented from prying into the outsourced documents and cannot deduce any associations between the documents and keywords using the index.

• Trapdoor Privacy: Since the search user would like to keep her searches from being exposed to the cloud server, the cloud server should be prevented from knowing the exact keywords contained in the trapdoor of the search user.

• Trapdoor Unlinkability: The trapdoors should not be linkable, which means the trapdoors should be totally different even if they contain the same keywords. In other words, trapdoor generation should be randomized rather than deterministic, so that the cloud server cannot deduce any associations between two trapdoors (a sketch illustrating this randomization follows this list).

• Concealing Access Pattern of the Search User: Access pattern is the sequence of the searched results. In the EMRS, the access pattern should be totally concealed from the cloud server. Specifically, the cloud server cannot learn the total number of the documents stored on it nor the size of the searched document even when the search user retrieves this document from the cloud server.
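As referenced in the Trapdoor Unlinkability requirement above, the sketch below shows why two trapdoors for the same keywords can look unrelated in a kNN-style construction: each query is scaled by a fresh random factor and split with fresh randomness before being hidden with the inverse secret matrices. The splitting rule and names follow the common kNN-based approach and are assumptions, not the EMRS definition itself.

import java.security.SecureRandom;

/**
 * Illustration of why two trapdoors for the same keywords look unrelated in a
 * kNN-style construction: the query vector is scaled by a fresh random factor
 * and split with fresh randomness before being hidden with the inverses of the
 * secret matrices. Shapes and names are assumptions, not the EMRS definition.
 */
public class TrapdoorSketch {
    private static final SecureRandom RNG = new SecureRandom();

    // q: extended query vector; S: secret bit vector; M1inv, M2inv: inverses of M1 and M2.
    static double[][] trapdoor(double[] q, int[] S, double[][] M1inv, double[][] M2inv) {
        double r = 1.0 + RNG.nextDouble();            // fresh random scaling for every query
        double[] q1 = new double[q.length];
        double[] q2 = new double[q.length];
        for (int j = 0; j < q.length; j++) {
            double v = r * q[j];
            if (S[j] == 0) {                          // query vectors are split where S[j] = 0
                q1[j] = RNG.nextDouble();
                q2[j] = v - q1[j];
            } else {                                  // and copied where S[j] = 1
                q1[j] = q2[j] = v;
            }
        }
        return new double[][] { multiply(M1inv, q1), multiply(M2inv, q2) };
    }

    // Plain matrix-vector product M * v for a square matrix M.
    static double[] multiply(double[][] M, double[] v) {
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            for (int j = 0; j < v.length; j++) {
                out[i] += M[i][j] * v[j];
            }
        }
        return out;
    }
}

Because r and the random shares change on every call, two trapdoors generated for the same keyword set differ, which is exactly the unlinkability property required above.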

CHAPTER 5

5.0 SYSTEM STUDY:

5.1 FEASIBILITY STUDY:

The feasibility of the project is analyzed in this phase and business proposal is put forth with a very general plan for the project and some cost estimates. During system analysis the feasibility study of the proposed system is to be carried out. This is to ensure that the proposed system is not a burden to the company.  For feasibility analysis, some understanding of the major requirements for the system is essential.

Three key considerations involved in the feasibility analysis are 

  • ECONOMICAL FEASIBILITY
  • TECHNICAL FEASIBILITY
  • SOCIAL FEASIBILITY

5.1.1 ECONOMICAL FEASIBILITY:     

This study is carried out to check the economic impact that the system will have on the organization. The amount of funds that the company can pour into the research and development of the system is limited, and the expenditures must be justified. Thus the developed system is well within the budget, which was achieved because most of the technologies used are freely available; only the customized products had to be purchased.

 

5.1.2 TECHNICAL FEASIBILITY   

This study is carried out to check the technical feasibility, that is, the technical requirements of the system. Any system developed must not place a high demand on the available technical resources, as this would lead to high demands being placed on the client. The developed system must have modest requirements, since only minimal or no changes are required for implementing this system.

5.1.3 SOCIAL FEASIBILITY:  

The aspect of study is to check the level of acceptance of the system by the user. This includes the process of training the user to use the system efficiently. The user must not feel threatened by the system, instead must accept it as a necessity. The level of acceptance by the users solely depends on the methods that are employed to educate the user about the system and to make him familiar with it. His level of confidence must be raised so that he is also able to make some constructive criticism, which is welcomed, as he is the final user of the system.

5.2 SYSTEM TESTING:

Testing is a process of checking whether the developed system works according to the original objectives and requirements. It is a set of activities that can be planned in advance and conducted systematically. Testing is vital to the success of the system. System testing makes a logical assumption that if all the parts of the system are correct, the overall goal will be successfully achieved. Inadequate testing, or no testing at all, leads to errors that may not appear until many months later.

This creates two problems: the time lag between the cause and the appearance of the problem, and the effect of system errors on the files and records within the system. A small system error can conceivably explode into a much larger problem. Effective testing early in the process translates directly into long-term cost savings from a reduced number of errors. Another reason for system testing is its utility as a user-oriented vehicle before implementation. The best program is worthless if it does not produce the correct outputs.

5.2.1 UNIT TESTING:

Description: Test for application window properties.
Expected result: All the properties of the windows are to be properly aligned and displayed.

Description: Test for mouse operations.
Expected result: All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions.

A program represents the logical elements of a system. For a program to run satisfactorily, it must compile and test data correctly and tie in properly with other programs. Achieving an error-free program is the responsibility of the programmer. Program testing checks for two types of errors: syntax and logical. A syntax error is a program statement that violates one or more rules of the language in which it is written. An improperly defined field dimension or omitted keywords are common syntax errors. These errors are shown through error messages generated by the computer. For logic errors, the programmer must examine the output carefully.

5.1.2 FUNCTIONAL TESTING:

Functional testing of an application is used to prove that the application delivers correct results, using enough inputs to give an adequate level of confidence that it will work correctly for all sets of inputs. The functional testing will need to prove that the application works for each client type and that the personalization functions work correctly. When a program is tested, the actual output is compared with the expected output. When there is a discrepancy, the sequence of instructions must be traced to determine the problem. The process is facilitated by breaking the program into self-contained portions, each of which can be checked at certain key points. The idea is to compare program values against desk-calculated values to isolate the problems.

Description: Test for all modules.
Expected result: All peers should communicate in the group.

Description: Test for various peers in a distributed network framework, as it displays all users available in the group.
Expected result: The result after execution should give the accurate result.


5.1. 3 NON-FUNCTIONAL TESTING:

 The Non Functional software testing encompasses a rich spectrum of testing strategies, describing the expected results for every test case. It uses symbolic analysis techniques. This testing used to check that an application will work in the operational environment. Non-functional testing includes:

  • Load testing
  • Performance testing
  • Usability testing
  • Reliability testing
  • Security testing

5.1.4 LOAD TESTING:

An important tool for implementing system tests is a Load generator. A Load generator is essential for testing quality requirements such as performance and stress. A load can be a real load, that is, the system can be put under test to real usage by having actual telephone users connected to it. They will generate test input data for system test.

Description: It is necessary to ascertain that the application behaves correctly under load when a 'Server busy' response is received.
Expected result: Should designate another active node as a server.


5.1.5 PERFORMANCE TESTING:

Performance tests are utilized in order to determine the widely defined performance of the software system such as execution time associated with various parts of the code, response time and device utilization. The intent of this testing is to identify weak points of the software system and quantify its shortcomings.

Description: This is required to assure that the application performs adequately, having the capability to handle many peers, delivering its results in the expected time and using an acceptable level of resources; it is an aspect of operational management.
Expected result: Should handle large input values and produce accurate results in the expected time.


5.1.6 RELIABILITY TESTING:

The software reliability is the ability of a system or component to perform its required functions under stated conditions for a specified period of time, and it is ensured in this testing. Reliability can be expressed as the ability of the software to reveal defects under testing conditions, according to the specified requirements. It is the probability that a software system will operate without failure under given conditions for a given time interval, and it focuses on the behavior of the software element. It forms a part of the software quality control team.

Description: This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in providing the application.
Expected result: In case of failure of the server, an alternate server should take over the job.


5.1.7 SECURITY TESTING:

Security testing evaluates system characteristics that relate to the availability, integrity and confidentiality of the system data and services. Users/Clients should be encouraged to make sure their security needs are very clearly known at requirements time, so that the security issues can be addressed by the designers and testers.

Description: Checking that the user identification is authenticated.
Expected result: In case of failure, it should not be connected in the framework.

Description: Check whether group keys in a tree are shared by all peers.
Expected result: The peers should know the group key in the same group.


5.1.8 WHITE BOX TESTING:

White box testing, sometimes called glass-box testing, is a test case design method that uses the control structure of the procedural design to derive test cases. Using the white box testing method, the software engineer can derive test cases. White box testing focuses on the inner structure of the software to be tested.

Description: Exercise all logical decisions on their true and false sides.
Expected result: All the logical decisions must be valid.

Description: Execute all loops at their boundaries and within their operational bounds.
Expected result: All the loops must be finite.

Description: Exercise internal data structures to ensure their validity.
Expected result: All the data structures must be valid.


5.1.9 BLACK BOX TESTING:

Black box testing, also called behavioral testing, focuses on the functional requirements of the software. That is, black box testing enables the software engineer to derive sets of input conditions that will fully exercise all functional requirements for a program. Black box testing is not an alternative to white box techniques. Rather, it is a complementary approach that is likely to uncover a different class of errors than white box methods. Black box testing attempts to find errors and focuses on the inputs, outputs, and principal functions of a software module. The starting point of black box testing is either a specification or code. The contents of the box are hidden, and the stimulated software should produce the desired results.

Description: To check for incorrect or missing functions.
Expected result: All the functions must be valid.

Description: To check for interface errors.
Expected result: The entire interface must function normally.

Description: To check for errors in data structures or external database access.
Expected result: The database update and retrieval must be done correctly.

Description: To check for initialization and termination errors.
Expected result: All the functions and data structures must be initialized properly and terminated normally.

All the above system testing strategies are carried out during development, as the documentation and institutionalization of the proposed goals and related policies are essential.

CHAPTER 6

6.0 SOFTWARE DESCRIPTION:

 

6.1 JAVA TECHNOLOGY:

Java technology is both a programming language and a platform.

 

The Java Programming Language

 

The Java programming language is a high-level language that can be characterized by all of the following buzzwords:

  • Simple
    • Architecture neutral
    • Object oriented
    • Portable
    • Distributed     
    • High performance
    • Interpreted     
    • Multithreaded
    • Robust
    • Dynamic
    • Secure     

With most programming languages, you either compile or interpret a program so that you can run it on your computer. The Java programming language is unusual in that a program is both compiled and interpreted. With the compiler, first you translate a program into an intermediate language called Java byte codes —the platform-independent codes interpreted by the interpreter on the Java platform. The interpreter parses and runs each Java byte code instruction on the computer. Compilation happens just once; interpretation occurs each time the program is executed. The following figure illustrates how this works.

You can think of Java byte codes as the machine code instructions for the Java Virtual Machine (Java VM). Every Java interpreter, whether it’s a development tool or a Web browser that can run applets, is an implementation of the Java VM. Java byte codes help make “write once, run anywhere” possible. You can compile your program into byte codes on any platform that has a Java compiler. The byte codes can then be run on any implementation of the Java VM. That means that as long as a computer has a Java VM, the same program written in the Java programming language can run on Windows 2000, a Solaris workstation, or on an iMac.

6.2 THE JAVA PLATFORM:

A platform is the hardware or software environment in which a program runs. We’ve already mentioned some of the most popular platforms like Windows 2000, Linux, Solaris, and MacOS. Most platforms can be described as a combination of the operating system and hardware. The Java platform differs from most other platforms in that it’s a software-only platform that runs on top of other hardware-based platforms.

The Java platform has two components:

  • The Java Virtual Machine (Java VM)
  • The Java Application Programming Interface (Java API)

You’ve already been introduced to the Java VM. It’s the base for the Java platform and is ported onto various hardware-based platforms.

The Java API is a large collection of ready-made software components that provide many useful capabilities, such as graphical user interface (GUI) widgets. The Java API is grouped into libraries of related classes and interfaces; these libraries are known as packages. The next section, What Can Java Technology Do? Highlights what functionality some of the packages in the Java API provide.

The following figure depicts a program that’s running on the Java platform. As the figure shows, the Java API and the virtual machine insulate the program from the hardware.

Native code is code that after you compile it, the compiled code runs on a specific hardware platform. As a platform-independent environment, the Java platform can be a bit slower than native code. However, smart compilers, well-tuned interpreters, and just-in-time byte code compilers can bring performance close to that of native code without threatening portability.

6.3 WHAT CAN JAVA TECHNOLOGY DO?

The most common types of programs written in the Java programming language are applets and applications. If you’ve surfed the Web, you’re probably already familiar with applets. An applet is a program that adheres to certain conventions that allow it to run within a Java-enabled browser.

However, the Java programming language is not just for writing cute, entertaining applets for the Web. The general-purpose, high-level Java programming language is also a powerful software platform. Using the generous API, you can write many types of programs.

An application is a standalone program that runs directly on the Java platform. A special kind of application known as a server serves and supports clients on a network. Examples of servers are Web servers, proxy servers, mail servers, and print servers. Another specialized program is a servlet.

A servlet can almost be thought of as an applet that runs on the server side. Java Servlets are a popular choice for building interactive web applications, replacing the use of CGI scripts. Servlets are similar to applets in that they are runtime extensions of applications. Instead of working in browsers, though, servlets run within Java Web servers, configuring or tailoring the server.

How does the API support all these kinds of programs? It does so with packages of software components that provides a wide range of functionality. Every full implementation of the Java platform gives you the following features:

  • The essentials: Objects, strings, threads, numbers, input and output, data structures, system properties, date and time, and so on.
  • Applets: The set of conventions used by applets.
  • Networking: URLs, TCP (Transmission Control Protocol), UDP (User Datagram Protocol) sockets, and IP (Internet Protocol) addresses.
  • Internationalization: Help for writing programs that can be localized for users worldwide. Programs can automatically adapt to specific locales and be displayed in the appropriate language.
  • Security: Both low level and high level, including electronic signatures, public and private key management, access control, and certificates.
  • Software components: Known as JavaBeansTM, can plug into existing component architectures.
  • Object serialization: Allows lightweight persistence and communication via Remote Method Invocation (RMI).
  • Java Database Connectivity (JDBCTM): Provides uniform access to a wide range of relational databases.

The Java platform also has APIs for 2D and 3D graphics, accessibility, servers, collaboration, telephony, speech, animation, and more. The following figure depicts what is included in the Java 2 SDK.

 

6.4 HOW WILL JAVA TECHNOLOGY CHANGE MY LIFE?

We can’t promise you fame, fortune, or even a job if you learn the Java programming language. Still, it is likely to make your programs better and requires less effort than other languages. We believe that Java technology will help you do the following:

  • Get started quickly: Although the Java programming language is a powerful object-oriented language, it’s easy to learn, especially for programmers already familiar with C or C++.
  • Write less code: Comparisons of program metrics (class counts, method counts, and so on) suggest that a program written in the Java programming language can be four times smaller than the same program in C++.
  • Write better code: The Java programming language encourages good coding practices, and its garbage collection helps you avoid memory leaks. Its object orientation, its JavaBeans component architecture, and its wide-ranging, easily extendible API let you reuse other people’s tested code and introduce fewer bugs.
  • Develop programs more quickly: You may develop programs up to twice as fast as you would in C++, because you write fewer lines of code in a simpler programming language.
  • Avoid platform dependencies with 100% Pure Java: You can keep your program portable by avoiding the use of libraries written in other languages. The 100% Pure Java™ Product Certification Program has a repository of historical process manuals, white papers, brochures, and similar materials online.
  • Write once, run anywhere: Because 100% Pure Java programs are compiled into machine-independent byte codes, they run consistently on any Java platform.
  • Distribute software more easily: You can upgrade applets easily from a central server. Applets take advantage of the feature of allowing new classes to be loaded “on the fly,” without recompiling the entire program.

 

6.5 ODBC:

 

Microsoft Open Database Connectivity (ODBC) is a standard programming interface for application developers and database systems providers. Before ODBC became a de facto standard for Windows programs to interface with database systems, programmers had to use proprietary languages for each database they wanted to connect to. Now, ODBC has made the choice of the database system almost irrelevant from a coding perspective, which is as it should be. Application developers have much more important things to worry about than the syntax that is needed to port their program from one database to another when business needs suddenly change.

Through the ODBC Administrator in Control Panel, you can specify the particular database that is associated with a data source that an ODBC application program is written to use. Think of an ODBC data source as a door with a name on it. Each door will lead you to a particular database. For example, the data source named Sales Figures might be a SQL Server database, whereas the Accounts Payable data source could refer to an Access database. The physical database referred to by a data source can reside anywhere on the LAN.

The ODBC system files are not installed on your system by Windows 95. Rather, they are installed when you set up a separate database application, such as SQL Server Client or Visual Basic 4.0. When the ODBC icon is installed in Control Panel, it uses a file called ODBCINST.DLL. It is also possible to administer your ODBC data sources through a stand-alone program called ODBCADM.EXE. There is a 16-bit and a 32-bit version of this program, and each maintains a separate list of ODBC data sources.

From a programming perspective, the beauty of ODBC is that the application can be written to use the same set of function calls to interface with any data source, regardless of the database vendor. The source code of the application doesn’t change whether it talks to Oracle or SQL Server. We only mention these two as an example. There are ODBC drivers available for several dozen popular database systems. Even Excel spreadsheets and plain text files can be turned into data sources. The operating system uses the Registry information written by ODBC Administrator to determine which low-level ODBC drivers are needed to talk to the data source (such as the interface to Oracle or SQL Server). The loading of the ODBC drivers is transparent to the ODBC application program. In a client/server environment, the ODBC API even handles many of the network issues for the application programmer.

The advantages of this scheme are so numerous that you are probably thinking there must be some catch. The only disadvantage of ODBC is that it isn’t as efficient as talking directly to the native database interface. Many detractors have charged that ODBC is too slow. Microsoft has always claimed that the critical factor in performance is the quality of the driver software that is used. In our humble opinion, this is true. The availability of good ODBC drivers has improved a great deal recently. And anyway, the criticism about performance is somewhat analogous to the claim that compilers would never match the speed of pure assembly language. Maybe not, but the compiler (or ODBC) gives you the opportunity to write cleaner programs, which means you finish sooner. Meanwhile, computers get faster every year.

6.6 JDBC:

In an effort to establish an independent database standard API for Java, Sun Microsystems developed Java Database Connectivity, or JDBC. JDBC offers a generic SQL database access mechanism that provides a consistent interface to a variety of RDBMSs. This consistent interface is achieved through the use of “plug-in” database connectivity modules, or drivers. If a database vendor wishes to have JDBC support, he or she must provide the driver for each platform that the database and Java run on.

To gain a wider acceptance of JDBC, Sun based JDBC’s framework on ODBC. As you discovered earlier in this chapter, ODBC has widespread support on a variety of platforms. Basing JDBC on ODBC will allow vendors to bring JDBC drivers to market much faster than developing a completely new connectivity solution.

JDBC was announced in March of 1996. It was released for a 90 day public review that ended June 8, 1996. Because of user input, the final JDBC v1.0 specification was released soon after.

The remainder of this section will cover enough information about JDBC for you to know what it is about and how to use it effectively. This is by no means a complete overview of JDBC. That would fill an entire book.
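To give a flavour of how JDBC is used, the following sketch opens a connection, runs a query, and iterates over the result set. The JDBC URL, credentials, table, and column names are placeholders; substitute whatever your driver and database actually require (the sketch also assumes Java 7+ for try-with-resources).

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class JdbcQueryExample {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:odbc:SalesFigures";   // placeholder data source / URL
        try (Connection con = DriverManager.getConnection(url, "user", "password");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT id, total FROM sales")) {
            while (rs.next()) {
                // Column names are illustrative only.
                System.out.println(rs.getInt("id") + " -> " + rs.getDouble("total"));
            }
        }
    }
}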

 

6.7 JDBC Goals:

Few software packages are designed without goals in mind. JDBC is no exception: its many goals drove the development of the API. These goals, in conjunction with early reviewer feedback, have finalized the JDBC class library into a solid framework for building database applications in Java.

The goals that were set for JDBC are important. They will give you some insight as to why certain classes and functionalities behave the way they do. The design goals for JDBC are as follows:

SQL Level API

The designers felt that their main goal was to define a SQL interface for Java. Although not the lowest database interface level possible, it is at a low enough level for higher-level tools and APIs to be created. Conversely, it is at a high enough level for application programmers to use it confidently. Attaining this goal allows for future tool vendors to “generate” JDBC code and to hide many of JDBC’s complexities from the end user.

SQL Conformance

SQL syntax varies as you move from database vendor to database vendor. In an effort to support a wide variety of vendors, JDBC will allow any query statement to be passed through it to the underlying database driver. This allows the connectivity module to handle non-standard functionality in a manner that is suitable for its users.

JDBC must be implementable on top of common database interfaces

The JDBC SQL API must “sit” on top of other common SQL level APIs. This goal allows JDBC to use existing ODBC level drivers by the use of a software interface. This interface would translate JDBC calls to ODBC and vice versa.

Provide a Java interface that is consistent with the rest of the Java system

Because of Java’s acceptance in the user community thus far, the designers feel that they should not stray from the current design of the core Java system.

Keep it simple

This goal probably appears in all software design goal listings. JDBC is no exception. Sun felt that the design of JDBC should be very simple, allowing for only one method of completing a task per mechanism. Allowing duplicate functionality only serves to confuse the users of the API.

Use strong, static typing wherever possible

Strong typing allows more error checking to be done at compile time, so fewer errors appear at runtime.

Keep the common cases simple

Because, more often than not, the usual SQL calls used by the programmer are simple SELECTs, INSERTs, DELETEs and UPDATEs, these queries should be simple to perform with JDBC. However, more complex SQL statements should also be possible.
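As an illustration of how simple the common cases are, the sketch below performs a parameterized INSERT; the table and column names are hypothetical, and the connection is assumed to have been opened elsewhere.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class SimpleInsert {
    // Inserts one row using a parameterized statement.
    static void addItem(Connection con, int id, String name) throws SQLException {
        String sql = "INSERT INTO items (id, name) VALUES (?, ?)";  // illustrative table
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setInt(1, id);
            ps.setString(2, name);
            ps.executeUpdate();
        }
    }
}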

Finally, we decided to proceed with the implementation using Java networking.

For dynamically updating the cache table, we use an MS Access database.
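One way to reach an MS Access database from Java, consistent with the ODBC discussion above, is the JDBC-ODBC bridge. The sketch below is illustrative only: the data source name, table, and column are hypothetical, and the bridge driver shipped with older JDKs (it was removed in Java 8).

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CacheTableUpdate {
    public static void main(String[] args) throws Exception {
        // Load the JDBC-ODBC bridge driver (available in pre-Java 8 JDKs).
        Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
        // "CacheDSN" is a hypothetical ODBC data source pointing at the Access file.
        try (Connection con = DriverManager.getConnection("jdbc:odbc:CacheDSN");
             Statement stmt = con.createStatement()) {
            // Table and column names are illustrative only.
            stmt.executeUpdate("UPDATE cache SET hits = hits + 1 WHERE id = 1");
        }
    }
}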

Java has two components: a programming language and a platform.

Java is a high-level programming language that is all of the following:

  • Simple
  • Object-oriented
  • Distributed
  • Interpreted
  • Robust
  • Secure
  • Architecture-neutral
  • Portable
  • High-performance
  • Multithreaded
  • Dynamic

Java is also unusual in that each Java program is both compiled and interpreted. With a compiler, you translate a Java program into an intermediate language called Java bytecodes; this platform-independent code is then passed to an interpreter, which parses and runs each bytecode instruction on the computer.

Compilation happens just once; interpretation occurs each time the program is executed. The figure illustrates how this works.

(Figure: Java Program → Compiler → Interpreter → My Program)
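For example, assuming a source file named MyProgram.java, running javac MyProgram.java once produces the bytecode file MyProgram.class; thereafter, each invocation of java MyProgram interprets those same bytecodes on whichever platform’s virtual machine is installed.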

6.8 NETWORKING TCP/IP STACK:

The TCP/IP stack is shorter than the OSI one:

TCP is a connection-oriented protocol; UDP (User Datagram Protocol) is a connectionless protocol.

IP datagrams:

The IP layer provides a connectionless and unreliable delivery system. It considers each datagram independently of the others. Any association between datagrams must be supplied by the higher layers. The IP layer supplies a checksum that includes its own header. The header includes the source and destination addresses. The IP layer handles routing through an Internet. It is also responsible for breaking up large datagrams into smaller ones for transmission and reassembling them at the other end.

UDP:

UDP is also connectionless and unreliable. What it adds to IP is a checksum for the contents of the datagram and port numbers. These are used to give a client/server model – see later.
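For illustration, a minimal Java sketch of sending a UDP datagram follows; the destination host and port number are arbitrary placeholders, and there is no acknowledgement that the packet arrives.

import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;

public class UdpSendExample {
    public static void main(String[] args) throws Exception {
        byte[] data = "hello".getBytes();
        // Destination host and port are placeholders for this sketch.
        DatagramPacket packet = new DatagramPacket(
                data, data.length, InetAddress.getByName("localhost"), 4445);
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.send(packet);   // connectionless, unreliable: no delivery guarantee
        }
    }
}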

TCP:

TCP supplies logic to give a reliable connection-oriented protocol above IP. It provides a virtual circuit that two processes can use to communicate.

Internet addresses

In order to use a service, you must be able to find it. The Internet uses an address scheme for machines so that they can be located. The address is a 32 bit integer which gives the IP address.

Network address:

Class A uses 8 bits for the network address with 24 bits left over for other addressing. Class B uses 16 bit network addressing. Class C uses 24 bit network addressing and class D uses all 32.

Subnet address:

Internally, the UNIX network is divided into sub networks. Building 11 is currently on one sub network and uses 10-bit addressing, allowing 1024 different hosts.

Host address:

8 bits are finally used for host addresses within our subnet. This places a limit of 256 machines that can be on the subnet.

Total address:

The 32 bit address is usually written as 4 integers separated by dots.

Port addresses

A service exists on a host, and is identified by its port. This is a 16 bit number. To send a message to a server, you send it to the port for that service of the host that it is running on. This is not location transparency! Certain of these ports are “well known”.

Sockets:

A socket is a data structure maintained by the system to handle network connections. A socket is created using the call socket. It returns an integer that is like a file descriptor. In fact, under Windows, this handle can be used with the ReadFile and WriteFile functions.

#include <sys/types.h>
#include <sys/socket.h>
int socket(int family, int type, int protocol);

Here “family” will be AF_INET for IP communications, protocol will be zero, and type will depend on whether TCP or UDP is used. Two processes wishing to communicate over a network create a socket each. These are similar to two ends of a pipe – but the actual pipe does not yet exist.
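Since the rest of this document uses Java, the equivalent stream-socket (TCP) idea in Java might look like the sketch below; the host name and port are placeholders, and a corresponding server would be written with ServerSocket.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

public class TcpClientExample {
    public static void main(String[] args) throws Exception {
        // Connect to a hypothetical server; this is one end of the virtual circuit.
        try (Socket socket = new Socket("localhost", 7777);
             PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream()))) {
            out.println("ping");                  // send a line to the server
            System.out.println(in.readLine());    // read the server's reply, if any
        }
    }
}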

6.9 JFREE CHART:

JFreeChart is a free 100% Java chart library that makes it easy for developers to display professional quality charts in their applications. JFreeChart’s extensive feature set includes:

  • A consistent and well-documented API, supporting a wide range of chart types;
  • A flexible design that is easy to extend, and targets both server-side and client-side applications;
  • Support for many output types, including Swing components, image files (including PNG and JPEG), and vector graphics file formats (including PDF, EPS and SVG).

JFreeChart is “open source” or, more specifically, free software. It is distributed under the terms of the GNU Lesser General Public Licence (LGPL), which permits use in proprietary applications.
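As a small example of the API, the sketch below builds a pie chart and writes it to a PNG file; it assumes the JFreeChart 1.0.x API (with its jcommon dependency) is on the classpath, and the data values are invented purely for illustration.

import java.io.File;
import org.jfree.chart.ChartFactory;
import org.jfree.chart.ChartUtilities;
import org.jfree.chart.JFreeChart;
import org.jfree.data.general.DefaultPieDataset;

public class PieChartExample {
    public static void main(String[] args) throws Exception {
        DefaultPieDataset dataset = new DefaultPieDataset();
        dataset.setValue("Relevant", 70);       // sample values for illustration
        dataset.setValue("Irrelevant", 30);
        JFreeChart chart = ChartFactory.createPieChart(
                "Sample Results", dataset, true, true, false);
        // Render the chart to a 500 x 350 PNG image file.
        ChartUtilities.saveChartAsPNG(new File("chart.png"), chart, 500, 350);
    }
}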

 

6.9.1. Map Visualizations:

Charts showing values that relate to geographical areas. Some examples include: (a) population density in each state of the United States, (b) income per capita for each country in Europe, (c) life expectancy in each country of the world. The tasks in this project include: Sourcing freely redistributable vector outlines for the countries of the world, states/provinces in particular countries (USA in particular, but also other areas);

Creating an appropriate dataset interface (plus default implementation), a renderer, and integrating this with the existing XYPlot class in JFreeChart; testing, documenting, testing some more, documenting some more.

6.9.2. Time Series Chart Interactivity

Implement a new (to JFreeChart) feature for interactive time series charts — to display a separate control that shows a small version of ALL the time series data, with a sliding “view” rectangle that allows you to select the subset of the time series data to display in the main chart.

6.9.3. Dashboards

There is currently a lot of interest in dashboard displays. Create a flexible dashboard mechanism that supports a subset of JFreeChart chart types (dials, pies, thermometers, bars, and lines/time series) that can be delivered easily via both Java Web Start and an applet.

 

6.9.4. Property Editors

The property editor mechanism in JFreeChart only handles a small subset of the properties that can be set for charts. Extend (or reimplement) this mechanism to provide greater end-user control over the appearance of the charts.

CHAPTER 7

7.0 APPENDIX

7.1 SAMPLE SCREEN SHOTS:

7.2 SAMPLE SOURCE CODE:

CHAPTER 8

8.1 CONCLUSION

In this paper, we have proposed a multi-keyword ranked search scheme to enable accurate, efficient and secure search over encrypted mobile cloud data. Security analysis has demonstrated that the proposed scheme can effectively achieve confidentiality of documents and index, trapdoor privacy, trapdoor unlinkability, and concealment of the search user’s access pattern. Extensive performance evaluations have shown that the proposed scheme achieves better efficiency in terms of functionality and computation overhead compared with existing schemes. For future work, we will investigate the authentication and access control issues in searchable encryption techniques.

EMR: A Scalable Graph-based Ranking Model for Content-based Image Retrieval

Abstract—Graph-based ranking models have been widely applied in information retrieval area. In this paper, we focus on a wellknown graph-based model – the Ranking on Data Manifold model, or Manifold Ranking (MR). Particularly, it has been successfullyapplied to content-based image retrieval, because of its outstanding ability to discover underlying geometrical structure of the givenimage database. However, manifold ranking is computationally very expensive, which significantly limits its applicability to largedatabases especially for the cases that the queries are out of the database (new samples). We propose a novel scalable graph-basedranking model called Efficient Manifold Ranking (EMR), trying to address the shortcomings of MR from two main perspectives:scalable graph construction and efficient ranking computation. Specifically, we build an anchor graph on the database instead of atraditional k-nearest neighbor graph, and design a new form of adjacency matrix utilized to speed up the ranking. An approximatemethod is adopted for efficient out-of-sample retrieval. Experimental results on some large scale image databases demonstrate thatEMR is a promising method for real world retrieval applications.Index Terms—Graph-based algorithm, ranking model, image retrieval, out-of-sample1 INTRODUCTIONGRAPH-BASED ranking models have been deeplystudied and widely applied in information retrievalarea. In this paper, we focus on the problem of applyinga novel and efficient graph-based model for contentbasedimage retrieval (CBIR), especially for out-of-sampleretrieval on large scale databases.Traditional image retrieval systems are based on keywordsearch, such as Google and Yahoo image search. Inthese systems, a user keyword (query) is matched withthe context around an image including the title, manualannotation, web document, etc. These systems don’tutilize information from images. However these systemssuffer many problems, such as shortage of the text informationand inconsistency of the meaning of the text andimage. Content-based image retrieval is a considerablechoice to overcome these difficulties. CBIR has drawn agreat attention in the past two decades [1]–[3]. Differentfrom traditional keyword search systems, CBIR systems utilizethe low-level features, including global features (e.g.,color moment, edge histogram, LBP [4]) and local features(e.g., SIFT [5]), automatically extracted from images. A great• B. Xu, J. Bu, C. Chen, and C. Wang are with the Zhejiang ProvincialKey Laboratory of Service Robot, College of Computer Science, ZhejiangUniversity, Hangzhou 310027, China.E-mail: {xbzju, bjj, chenc, wcan}@zju.edu.cn.D. Cai and X. He are with the State Key Lab of CAD&CG, Collegeof Computer Science, Zhejiang University, Hangzhou 310027, China.E-mail: {dengcai, xiaofeihe}@cad.zju.edu.cn.Manuscript received 9 Oct. 2012; revised 7 Apr. 2013; accepted 22 Apr. 2013.Date of publication 1 May 2013; date of current version 1 Dec. 2014.Recommended for acceptance by H. Zha.For information on obtaining reprints of this article, please send e-mail to:reprints@ieee.org, and reference the Digital Object Identifier below.Digital Object Identifier 10.1109/TKDE.2013.70amount of researches have been performed for designingmore informative low-level features to represent images,or better metrics (e.g., DPF [6]) to measure the perceptualsimilarity, but their performance is restricted by many conditionsand is sensitive to the data. Relevance feedback [7]is a useful tool for interactive CBIR. 
User’s high level perceptionis captured by dynamically updated weights basedon the user’s feedback.Most traditional methods focus on the data features toomuch but they ignore the underlying structure information,which is of great importance for semantic discovery,especially when the label information is unknown. Manydatabases have underlying cluster or manifold structure.Under such circumstances, the assumption of label consistencyis reasonable [8], [9]. It means that those nearby datapoints, or points belong to the same cluster or manifold,are very likely to share the same semantic label. This phenomenonis extremely important to explore the semanticrelevance when the label information is unknown. In ouropinion, a good CBIR system should consider images’ lowlevelfeatures as well as the intrinsic structure of the imagedatabase.Manifold Ranking (MR) [9], [10], a famous graph-basedranking model, ranks data samples with respect to theintrinsic geometrical structure collectively revealed by alarge number of data. It is exactly in line with our consideration.MR has been widely applied in many applications,and shown to have excellent performance and feasibilityon a variety of data types, such as the text [11], image[12], [13], and video[14]. By taking the underlying structureinto account, manifold ranking assigns each data sample arelative ranking score, instead of an absolute pairwise similarityas traditional ways. The score is treated as a similarity1041-4347 c_ 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.XU ET AL.: EMR: A SCALABLE GRAPH-BASED RANKING MODEL FOR CONTENT-BASED IMAGE RETRIEVAL 103metric defined on the manifold, which is more meaningfulto capturing the semantic relevance degree. He et al. [12]firstly applied MR to CBIR, and significantly improvedimage retrieval performance compared with state-of-the-artalgorithms.However, manifold ranking has its own drawbacks tohandle large scale databases – it has expensive computationalcost, both in graph construction and ranking computationstages. Particularly, it is unknown how to handlean out-of-sample query (a new sample) efficiently underthe existing framework. It is unacceptable to recompute themodel for a new query. That means, original manifold rankingis inadequate for a real world CBIR system, in whichthe user provided query is always an out-of-sample.In this paper, we extend the original manifold rankingand propose a novel framework named Efficient ManifoldRanking (EMR). We try to address the shortcomings ofmanifold ranking from two perspectives: the first is scalablegraph construction; and the second is efficient computation,especially for out-of-sample retrieval. Specifically, webuild an anchor graph on the database instead of the traditionalk-nearest neighbor graph, and design a new form ofadjacency matrix utilized to speed up the ranking computation.The model has two separate stages: an offline stagefor building (or learning) the ranking model and an onlinestage for handling a new query. With EMR, we can handle adatabase with 1 million images and do the online retrievalin a short time. To the best of our knowledge, no previousmanifold ranking based algorithm has run out-of-sampleretrieval on a database in this scale.A preliminary version of this work previously appearedas [13]. 
In this paper, the new contributions are as follows:• We pay more attention to the out-of-sample retrieval(online stage) and propose an efficient approximatemethod to compute ranking scores for a new queryin Section 4.5. As a result, we can run out-ofsampleretrieval on a large scale database in a shorttime.• We have optimized the EMR code1 and re-run all theexperiments (Section 5). Three new databases includingtwo large scale databases with about 1 millionssamples are added for testing the efficiency of theproposed model. We offer more detailed analysis forexperimental result.• We formally define the formulation of local weightestimation problem (Section 4.1.1) for buildingthe anchor graph and two different methods arecompared to determine which method is better(Section 5.2.2).The rest of this paper is organized as follows. InSection 2, we briefly discuss some related work and inSection 3, we review the algorithm of MR and makean analysis. The proposed approach EMR is described inSection 4. In Section 5, we present the experiment resultson many real world image databases. Finally we provide aconclusions in Section 6.1. http://eagle.zju.edu.cn/∼binxu/2 RELATED WORKThe problem of ranking has recently gained great attentionsin both information retrieval and machine learning areas.Conventional ranking models can be content based models,like the Vector Space Model, BM25, and the language modeling[15]; or link structure based models, like the famousPageRank [16] and HITS [17]; or cross media models [18].Another important category is the learning to rank model,which aims to optimize a ranking function that incorporatesrelevance features and avoids tuning a large numberof parameters empirically [19], [20]. However, many conventionalmodels ignore the important issue of efficiency,which is crucial for a real-time systems, such as a web application.In [21], the authors present a unified framework forjointly optimizing effectiveness and efficiency.In this paper, we focus on a particular kind of rankingmodel – graph-based ranking. It has been successfullyapplied in link-structure analysis of the web [16], [17], [22]–[24], social networks research [25]–[27] and multimedia dataanalysis [28]. Generally, a graph [29] can be denoted asG = (V, E,W), where V is a set of vertices in which eachvertex represents a data point, E V × V is a set of edgesconnecting related vertices, and W is a adjacency matrixrecording the pairwise weights between vertices. The objectof a graph-based ranking model is to decide the importanceof a vertex, based on local or global information draw fromthe graph.Agarwal [30] proposed to model the data by a weightedgraph, and incorporated this graph structure into the rankingfunction as a regularizer. Guan et al. [26] proposed agraph-based ranking algorithm for interrelated multi-typeresources to generate personalized tag recommendation.Liu et al. [25] proposed an automatically tag ranking schemeby performing a random walk over a tag similarity graph.In [27], the authors made the music recommendation byranking on a unified hypergraph, combining with richsocial information and music content. Hypergraph is a newgraph-based model and has been studied in many works[31]. Recently, there have been some papers on speeding upmanifold ranking. 
In [32], the authors partitioned the datainto several parts and computed the ranking function by ablock-wise way.3 MANIFOLD RANKING REVIEWIn this section, we briefly review the manifold ranking algorithmand make a detailed analysis about its drawbacks.Westart form the description of notations.3.1 Notations and FormulationsGiven a set of data χ = {x1, x2, . . . , xn} ⊂ Rm and builda graph on the data (e.g., kNN graph). W Rn×n denotesthe adjacency matrix with element wij saving the weight ofthe edge between point i and j. Normally the weight canbe defined by the heat kernel wij = exp [ − d2(xi, xj)/2σ2)]if there is an edge linking xi and xj, otherwise wij = 0.Function d(xi, xj) is a distance metric of xi and xj definedon χ, such as the Euclidean distance. Let r:χ R be aranking function which assigns to each point xi a rankingscore ri. Finally, we define an initial vector y = [y1, . . . , yn]T,in which yi = 1 if xi is a query and yi = 0 otherwise.104 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 27, NO. 1, JANUARY 2015The cost function associated with r is defined to beO(r) = 12⎛⎝_ni,j=1wij_ 1 √Diiri − 1 _Djjrj_2 + μ_ni=1_ri yi_2⎞⎠,(1)where μ > 0 is the regularization parameter and D is adiagonal matrix with Dii =_nj=1 wij.The first term in the cost function is a smoothness constraint,which makes the nearby points in the space havingclose ranking scores. The second term is a fitting constraint,which means the ranking result should fit to theinitial label assignment. With more prior knowledge aboutthe relevance or confidence of each query, we can assigndifferent initial scores to the queries. Minimizing the costfunction respect to r results into the following closed formsolutionr∗ = (In αS)−1y, (2)where α = 11+μ, In is an identity matrix with n×n, and S isthe symmetrical normalization of W, S = D−1/2WD−1/2. Inlarge scale problems, we prefer to use the iteration scheme:r(t + 1) = αSr(t) + (1 − α)y. (3)During each iteration, each point receives informationfrom its neighbors (first term), and retains its initial information(second term). The iteration process is repeateduntil convergence. When manifold ranking is applied toretrieval (such as image retrieval), after specifying a queryby the user, we can use the closed form or iteration schemeto compute the ranking score of each point. The rankingscore can be viewed as a metric of the manifold distancewhich is more meaningful to measure the semanticrelevance.3.2 AnalysisAlthough manifold ranking has been widely used in manyapplications, it has its own drawbacks to handle large scaledatabased, which significantly limits its applicability.The first is its graph construction method. The kNNgraph is quite appropriate for manifold ranking becauseof its good ability to capture local structure of the data. Butthe construction cost for kNN graph is O(n2 log k), whichis expensive in large scale situations. Moreover, manifoldranking, as well as many other graph-based algorithmsdirectly use the adjacency matrix W in their computation.The storage cost of a sparse W is O(kn). Thus, we need tofind a way to build a graph in both low construction costand small storage space, as well as good ability to captureunderlying structure of the given database.The second, manifold ranking has very expensive computationalcost because of the matrix inversion operationin equation (2). This has been the main bottleneck to applymanifold ranking in large scale applications. 
Although wecan use the iteration algorithm in equation (3), it is stillinefficient in large scale cases and may arrive at a local convergence.Thus, original manifold ranking is inadequate fora real-time retrieval system.4 EFFICIENT MANIFOLD RANKINGWe address the shortcomings of original MR from twoperspectives: scalable graph construction and efficient rankingcomputation. Particularly, our method can handle theout-of-sample retrieval, which is important for a real-timeretrieval system.4.1 Scalable Graph ConstructionTo handle large databases, we want the graph constructioncost to be sub-linear with the graph size. That means, foreach data point, we can’t search the whole database, as kNNstrategy does. To achieve this requirement, we constructan anchor graph [33], [34] and propose a new design ofadjacency matrix W.The definitions of anchor points and anchor graph haveappeared in some other works. For instance, in [35], theauthors proposed that each data point on the manifoldcan be locally approximated by a linear combination of itsnearby anchor points, and the linear weights become itslocal coordinate coding. Liu et al. [33] designed the adjacencymatrix in a probabilistic measure and used it forscalable semi-supervised learning. This work inspires usmuch.4.1.1 Anchor Graph ConstructionNow we introduce how to use anchor graph to modelthe data [33], [34]. Suppose we have a data set χ ={x1, . . . , xn} ⊂ Rm with n samples in m dimensions, andU = {u1, . . . , ud} ⊂ Rm denotes a set of anchors sharingthe same space with the data set. Let f :χ R be a realvalue function which assigns each data point in χ a semanticlabel. We aim to find a weight matrix Z Rd×n thatmeasures the potential relationships between data pointsin χ and anchors in U. Then we estimate f (x) for each datapoint as a weighted average of the labels on anchors       f(xi) =_dk=1zkif (uk), i = 1, . . . , n, (4)with constraints_dk=1 zki = 1 and zki ≥ 0. Element zki representsthe weight between data point xi and anchor uk. Thekey point of the anchor graph construction is how to computethe weight vector zi for each data point xi. Two issuesneed to be considered: (1) the quality of the weight vectorand (2) the cost of the computation.Similar to the idea of LLE [8], a straightforward wayto measure the local weight is to optimize the followingconvex problem:minziε(zi) = 12_xi −_|N(xi)|s=1 usN(xi)zis_2s.t._s zis = 1, zi ≥ 0,(5)where N(xi) is the index set of xi’s nearest anchors. Wecall the above problem as the local weight estimation problem.A standard quadratic programming (QP) can solve thisproblem, but QP is very computational expensive. A projectedgradient based algorithm was proposed in [33] tocompute weight matrix and in our previous work [13], akernel regression method was adopted. In this paper, wecompare these two different methods to find the weightvector zi. Both of them are much faster than QP.XU ET AL.: EMR: A SCALABLE GRAPH-BASED RANKING MODEL FOR CONTENT-BASED IMAGE RETRIEVAL 105(1) Solving by Projected GradientThe first method is the projected gradient method, whichhas been used in the work of [33]. The updating rule in thismethod is expressed as the following iterative formula [33]:z(t+1)i= _s(z(t)iηtε(zti)), (6)where ηt denotes the step size of time t, ∇ε(z) denotes thegradient of ε at z, and _s(z) denotes the simplex projectionoperator on any z ∈ Rs. 
Detailed algorithm can be foundin Algorithm 1 of [33].(2) Solving by Kernel RegressionWe adopt the Nadaraya-Watson kernel regression toassign weights smoothly [13]zki =K|xiuk|λ__dl=1 K|xiul|λ_, (7)with the Epanechnikov quadratic kernelKλ(t) =_34(1 − t2) if |t| ≤ 1;0 otherwise.(8)The smoothing parameter λ determines the size of thelocal region in which anchors can affect the target point. Itis reasonable to consider that one data point has the samesemantic label with its nearby anchors in a high probability.There are many ways to determine the parameter λ. Forexample, it can be a constant selected by cross-validationfrom a set of training data. In this paper we use a morerobust way to get λ, which uses the nearest neighborhoodsize s to replace λ, that isλ(xi) = |xi u[s]|, (9)where u[s] is the sth closest anchor of xi. Later in the experimentpart, we’ll discuss the effectiveness and efficiency ofthe above two methods.Specifically, to build the anchor graph, we connect eachsample to its s nearest anchors and then assign the weights.So the construction has a total complexity O(nd log s), whered is the number of anchors and s is very small. Thus, thenumber of anchors determines the efficiency of the anchorgraph construction. If d  n, the construction is linear tothe database.How can we get the anchors? Active learning [36], [37] orclustering methods are considerable choices. In this paper,we use k-means algorithm and select the centers as anchors.Some fast k-means algorithms [38] can speed up the computation.Random selection is a competitive method which hasextremely low selection cost and acceptable performance.The main feature, also the main advantage of buildingan anchor graph is separating the graph construction intotwo parts – anchor selection and graph construction. Eachdata sample is independent to the other samples but relatedto the anchors only. The construction is always efficientsince it has linear complexity to the date size. Note that wedon’t have to update the anchors frequently, as informativeanchors for a large database are relatively stable (e.g., thecluster centers), even if a few new samples are added.4.1.2 Design of Adjacency MatrixWe present a new approach to design the adjacency matrixW and make an intuitive explanation for it. The weightmatrix Z Rd×n can be seen as a d dimensional representationof the data X Rm×n, d is the number of anchorpoints. That is to say, data points can be represented inthe new space, no matter what the original features are.This is a big advantage to handle some high dimensionaldata. Then, with the inner product as the metric to measurethe adjacent weight between data points, we designthe adjacency matrix to be a low-rank form [33], [39]W = ZTZ, (10)which means that if two data points are correlative (Wij >0), they share at least one common anchor point, otherwiseWij = 0. By sharing the same anchors, data pointshave similar semantic concepts in a high probability as ourconsideration. Thus, our design is helpful to explore thesemantic relationships in the data.This formula naturally preserves some good propertiesof W: sparseness and nonnegativeness. The highly sparsematrix Z makes W sparse, which is consistent with theobservation that most of the points in a graph have onlya small amount of edges with other points. The nonnegativeproperty makes the adjacent weight more meaningful:in real world data, the relationship between two items isalways positive or zero, but not negative. 
Moreover, nonnegativeW guarantees the positive semidefinite property ofthe graph Laplacian in many graph-based algorithms [33].4.2 Efficient Ranking ComputationAfter graph construction, the main computational cost formanifold ranking is the matrix inversion in equation (2),whose complexity is O(n3). So the data size n can not betoo large. Although we can use the iteration algorithm, itis still inefficient for large scale cases.One may argue that the matrix inversion can be done offline,then it is not a problem for on-line search. However,off-line calculation can only handle the case when the queryis already in the graph (an in-sample). If the query is notin the graph (an out-of-sample), for exact graph structure,we have to update the whole graph to add the new queryand compute the matrix inversion in equation (2) again.Thus, the off-line computation doesn’t work for an out-ofsamplequery. Actually, for a real CBIR system, user’s queryis always an out-of-sample.With the form of W = ZTZ , we can rewrite the equation(2), the main step of manifold ranking, by Woodburyformula as follows. Let H = ZD−12 , and S = HTH, then thefinal ranking function r can be directly computed byr∗ = (In αHTH)−1y =In HT_HHT − 1αId_−1H_y.(11)By equation (11), the inversion part (taking the mostcomputational cost) changes from a n×n matrix to a d×dmatrix. If d  n, this change can significantly speed upthe calculation of manifold ranking. Thus, applying ourproposed method to a real-time retrieval system is viable,which is a big shortage for original manifold ranking.During the computation process, we never use the adjacencymatrix W. So we don’t save the matrix W in memory,106 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 27, NO. 1, JANUARY 2015but save matrix Z instead. In equation (11), D is a diagonalmatrix with Dii =_nj=1 wij. When W = ZTZ,Dii =_nj=1zTizj = zTiv, (12)where zi is the ith column of Z and v =_nj=1 zj. Thus weget the matrix D without using W.A useful trick for computing r∗ in equation (11) is runningit from right to left. So every time we multiply a matrixby a vector, avoiding the matrix – matrix multiplication.As a result, to compute the ranking function, EMR has acomplexity O(dn + d3).4.3 Complexity AnalysisIn this subsection, we make a comprehensive complexityanalysis of MR and EMR, including the computation costand storage cost. As we have mentioned, both MR andEMR have two stages: the graph construction stage andthe ranking computation stage.For the model of MR:• MR builds a kNN graph, i.e., for each data sample,we need to calculate the relationships to its k-nearestneighbors. So the computation cost is O(n2 log k). Atthe same time, we save the adjacency matrix W Rn×n with a storage cost O(kn) since W is sparse.• In the ranking computation stage, the main stepis to compute the matrix inversion in 2, which isapproximately O(n3).For the model of EMR:• EMR builds an anchor graph, i.e., for each data sample,we calculate the relationships to its s-nearestanchors. The computation cost is O(nd log s). We usek-means to select the anchors, we need a cost ofO(Tdn), where T is the iteration number. But thisselection step can be done off-line and unnecessarilyupdated frequently. 
At the same time, wesave the sparse matrix Z Rd×n with a storagecost O(sn).• In the ranking computation stage, the main step isEq.(11), which has a computational complexity ofO(dn + d3).As a result, EMR has a computational cost of O(dn) +O(d3) (ignoring s, T) and a storage cost O(sn), while MR hasa computational cost of O(n2) + O(n3) and a storage costO(kn). Obviously, when d  n, EMR has a much lower costthan MR in computation.4.4 EMR for Content-Based Image RetrievalIn this part, we make a brief summary of EMR applied topure content-based image retrieval. To add more information,we just extend the data features.First of all, we extract the low-level features of imagesin the database, and use them as coordinates of data pointsin the graph. We will further discuss the low-level featuresin Section 5. Secondly, we select representative points asanchors and construct the weight matrix Z with a smallneighborhood size s. Anchors are selected off-line and doesFig. 1. Extend matrix W (MR) and Z (EMR) in the gray regions for anout-of-sample.not affect the on-line process. For a stable data set, we don’tfrequently update the anchors. At last, after the user specifyingor uploading an image as a query, we get or extract itslow-level features, update the weight matrix Z, and directlycompute the ranking scores by equation (11). Images withhighest ranking scores are considered as the most relevantand return to the user.4.5 Out-of-Sample RetrievalFor in-sample data retrieval, we can construct the graphand compute the matrix inversion part of equation (2) offline.But for out-of-sample data, the situation is totallydifferent. A big limitation of MR is that, it is hard to handlethe new sample query. A fast strategy for MR is leavingthe original graph unchanged and adding a new row anda new column to W (left picture of Fig. 1). Although thenew W is efficiently to compute, it is not helpful for theranking process (Eq.(2)). Computing Eq.(2) for each newquery in the online stage is unacceptable due to its highcomputational cost.In [40], the authors solve the out-of-sample problemby finding the nearest neighbors of the query and usingthe neighbors as query points. They don’t add the queryinto the graph, therefore their database is static. However,their method may change the query’s initial semantic meaning,and for a large database, the linear search for nearestneighbors is also costly.In contrast, our model EMR can efficiently handle thenew sample as a query for retrieval. In this subsection,we describe the light-weight computation of EMR for anew sample query. We want to emphasize that this is abig improvement over our previous conference version ofthis work, which makes EMR scalable for large-scale imagedatabases (e.g., 1 million samples). We show the algorithmas follows.For one instant retrieval, it is unwise to update the wholegraph or rebuild the anchors, especially on a large database.We believe one point has little effect to the stable anchorsin a large data set (e.g., cluster centers). For EMR, each datapoint (zi) is independently computed, so we assign weightsbetween the new query and its nearby anchors, forming anew column of Z (right picture of Fig. 1).We use zt to denote the new column. Then, Dt = zTtvand ht = ztD−12t , where ht is the new column of H. As wehave described, the main step of EMR is Eq.(11). Our goalis to further speedup the computation of Eq.(11) for a newquery. 
LetC =_HHT − 1αId_−1=_ni=1hihTi− 1αId_−1, (13)XU ET AL.: EMR: A SCALABLE GRAPH-BASED RANKING MODEL FOR CONTENT-BASED IMAGE RETRIEVAL 107Fig. 2. COREL image samples randomly selected from semantic conceptballoon, beach, and butterfly.and the new C_ with adding the column ht isC_ =_ni=1hihTi+ hthTt− 1αId_−1≈ C (14)when n is large and ht is highly sparse. We can see thematrix C as the inverse of a covariance matrix. The aboveequation says that one single point would not affect thecovariance matrix of a large database. That is to say, thecomputation of C can be done in the off-line stage.The initial query vector yt isyt =_0n1_, (15)where 0n is a n-length zero vector. We can rewrite Eq.(11)with the new query asr(n+1)×1 =_In+1 −_HTChTtC_[H ht]__0n1_. (16)Our focus is the top n elements of r, which is equal torn×1 = −HTCht = Eht. (17)The matrix En×d = −HTC can be computed offline, i.e., inthe online stage, we need to compute a multiplication of an × d matrix and a d × 1 vector only. As ht is sparse (e.g., snon-zero elements), the essential computation is to select scolumns of E according to ht and do a weighted summation.As a result, we need to do sn scalar multiplications and(s − 1)n scalar additions to get the ranking score (rn×1) foreach database sample; while for linear scan using Euclideandistance, we need to do mn scalar subtractions, mn scalarmultiplications and (m−1)n scalar additions. As s  m, ourmodel EMR is much faster than linear scan using Euclideandistance in the online stage.5 EXPERIMENTAL STUDYIn this section, we show several experimental results andcomparisons to evaluate the effectiveness and efficiency ofour proposed method EMR on four real world databases:two middle size databases COREL (5,000 images) andMNIST (70,000 images), and two large size databasesSIFT1M (1 million sift descriptors) and ImageNet (1.2 millionimages). We use COREL and MNIST to compare theranking performance and use SIFT1M and ImageNet toshow the efficiency of EMR for out-of-sample retrieval. OurTABLE 1Statistics of the Four Databasesexperiments are implemented in MATLAB and run on acomputer with 2.0 GHz(×2) CPU, 64GB RAM.5.1 Experiments SetupThe COREL image data set is a subset of COREL imagedatabase consisting of 5,000 images. COREL is widely usedin many CBIR works [2], [41], [42]. All of the images arefrom 50 different categories, with 100 images per category.Images in the same category belong to the same semanticconcept, such as beach, bird, elephant and so on. That isto say, images from the same category are judged relevantand otherwise irrelevant. We use each image as a queryfor testing the in-sample retrieval performance. In Fig. 2,we randomly select and show nine image samples fromthree different categories. In our experiments, we extractfour kinds of effective features for COREL database, includingGrid Color Moment, edge histogram, Gabor WaveletsTexture, Local Binary Pattern and GIST feature. As a result,a 809-dimensional vector is used for each image [43].The MNIST database2 of handwritten digits has a set of70,000 examples. The images were centered in a 28 × 28image by computing the center of mass of the pixels, andtranslating the image so as to position this point at the centerof the 28 × 28 field. We use the first 60,000 images asdatabase images and the rest 10,000 images as queries fortesting the out-of-sample retrieval performance. 
The normalizedgray-scale values for each pixel are used as imagefeatures.The SIFT1M database contains one million SIFT featuresand each feature is represented by a 128-dimensional vector.The ImageNet is an image database organized accordingto the WordNet nouns hierarchy, in which each node ofthe hierarchy is depicted by hundreds and thousands ofimages3. We downloaded about 1.2 million images’ BoWrepresentations. A visual vocabulary of 1,000 visual wordsis adopted, i.e., each image is represented by a 1,000-lengthvector. Due to the complex structure of the database andhigh diversity of images in each node, as well as the lowquality of simple BoW representation, the retrieval task isvery hard.We use SIFT1M and ImageNet databases to evaluatethe efficiency of EMR on large and high dimensional data.We randomly select 1,000 images as out-of-sample testqueries for each. Some basic statistics of the four databasesare listed in Table 1. For COREL, MNIST and SIFT1Mdatabases, the data samples have dense features, while forImageNet database, the data samples have sparse features.5.1.1 Evaluation Metric DiscussionThere are many measures to evaluate the retrieval resultssuch as precision, recall, F measure, MAP and NDCG [44].2. http://yann.lecun.com/exdb/mnist/3. http://www.image-net.org/index108 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 27, NO. 1, JANUARY 2015They are very useful for a real CBIR application, especiallyfor a web application in which only the top returned imagescan attract user interests. Generally, the image retrievalresults are displayed screen by screen. Too many imagesin a screen will confuse the user and drop the experienceevidently. Images in the top pages attract the most interestsand attentions from the user. So the precision at K metricis significant to evaluate the image retrieval performance.MAP (Mean Average Precision) provides a single-figuremeasure of quality across recall levels. MAP has beenshown to have especially good discriminative power andstability. For a single query, Average Precision is the averageof the precision value obtained for the set of top k itemsexisting after each relevant item is retrieved, and this valueis then averaged over all queries [44]. That is, if the set ofrelevant items for a query qj Q is {d1, . . . , dmj} and Rjk isthe set of ranked retrieval results from the top result untilyou get to item dk, thenMAP(Q) = 1|Q||Q|_j=11mj_mjk=1Precision(Rjk). (18)NDCG is a wildly used metric to evaluate a ranked list[44]. NDCG@K is defined as:NDCG@K = 1IDCG×_Ki=12ri−1log2(i + 1), (19)where ri is 1 if the item at position i is a relevant item and0 otherwise. IDCG is chosen so that the perfect ranking hasa NDCG value 1.5.2 Experiments on COREL DatabaseThe goal of EMR is to improve the speed of manifold rankingwith acceptable ranking accuracy loss. We first compareour model EMR with the original manifold ranking (MR)and fast manifold ranking (FMR [32]) algorithm on CORELdatabase. As both MR and FMR are designed for in-sampleimage retrieval, we use each image as a query and evaluatein-sample retrieval performance. More comparison toranking with SVM can be found in our previous conferenceversion [13]. In this paper, we pay more attention onthe trade-off of accuracy and speed for EMR respect to MR,so we ignore the other methods.We first compare the methods without relevance feedback.Relevance feedback asks users to label some retrievedsamples, making the retrieval procedure inconvenient. 
Soif possible, we prefer an algorithm having good performancewithout relevance feedback. In Section 5.2.4, weevaluate the performance of the methods after one round ofrelevance feedback. MR-like algorithms can handle the relevancefeedback very efficiently – revising the initial scorevector y.5.2.1 Baseline AlgorithmEud: the baseline method using Euclidean distance forranking.MR: the original manifold ranking algorithm, the mostimportant comparison method. Our goal is to improvethe speed of manifold ranking with acceptable rankingaccuracy loss.TABLE 2Precision and Time Comparisons of TwoWeight Estimation MethodsFMR: fast manifold ranking [32] firstly partitions the datainto several parts (clustering) and computes the matrixinversion by a block-wise way. It uses the SVD techniquewhich is time consuming. So its computational bottleneckis transformed to SVD. When SVD is accurately solved,FMR equals MR. But FMR uses the approximate solution tospeed up the computation. We use 10 clusters and calculatethe approximation of SVD with 10 singular values. Higheraccuracy requires much more computational time.5.2.2 Comparisons of Two Weight Estimation Methodsfor EMRBefore the main experiment of comparing our algorithmEMR to some other models, we use a single experimentto decide which weight estimation method described inSection 4.1.1 should be adopted. We records the averageretrieval precision (each image is used as a query) and thecomputational time (seconds) of EMR with the two weightestimation methods in Table 2.From the table, we see that the two methods havevery close retrieval results. However, the projected gradientis much slower than kernel regression. In the rest ofour experiments, we use the kernel regression method toestimate the local weight (computing Z).5.2.3 PerformanceAn important issue needs to be emphasized: although wehave the image labels (categories), we don’t use them inour algorithm, since in real world applications, labeling isvery expensive. The label information can only be used toevaluation and relevance feedback.Each image is used as a query and the retrieval performanceis averaged. Fig. 3 prints the average precision (at 20to 80) of each method and Table 3 records the average valuesof recall, F1 score, NDCG and MAP (MAP is evaluatedonly for the top-100 returns). For our method EMR, 1000anchors are used. Later in the model selection part, we findthat using 500 anchors achieves a close performance. It iseasy to find that the performance of MR and EMR are veryclose, while FMR lose a little precision due to its approximationby SVD. As EMR’s goal is to improve the speedof manifold ranking with acceptable ranking accuracy loss,the performance results are not to show which method isbetter but to show the ranking performance of EMR is closeto MR on COREL.We also record the offline building time for MR, FMRand EMR in Table 3. For in-sample retrieval, all the threeXU ET AL.: EMR: A SCALABLE GRAPH-BASED RANKING MODEL FOR CONTENT-BASED IMAGE RETRIEVAL 109Fig. 4. Precision at the top 10 returns of the three algorithms on each category of COREL database.methods have the same steps and cost, so we ignore it onCOREL. We find that for a database with 5,000 images, allthe three methods have acceptable building time, and EMRis the most efficient. However, according to the the analysisin Section 4.3, MR’s computational cost is cubic to thedatabase size while EMR is linear to the database size. 
Theresult can be found in our experiments on MNIST database.The anchor points are computed off-line and do notaffect the current on-line retrieval system. In the workof [13], we have tested different strategies for anchorpoints selection, including normal k-means, fast k-meansand random anchors. The conclusion is that the cost andperformance are trade-offs in many situations.To see the performance distribution in the whole dataset more concretely, we plot the retrieval precision at top10 returns for all 50 categories in Fig. 4. As can be seen, theperformance of each algorithm varies with different categories.We find that EMR is fairly close to MR in almostevery categories, but for FMR, the distribution is totallydifferent.5.2.4 Performance with Relevance FeedbackRelevance Feedback [7] is a powerful interactive techniqueused to improve the performance of image retrieval systems.With user provided relevant/irrelevant informationon the retrieved images, The system can capture the semanticconcept of the query more correctly and graduallyimprove the retrieval precision.Fig. 3. Retrieval precision at top 20 to 80 returns of Eud (left), MR, FMRand EMR (right).Applying relevance feedback to EMR (as well as MR andFMR)is extremely simple.We update the initial vector y andrecompute the ranking scores.We use an automatic labelingstrategy to simulate relevance feedback: for each query, thetop 20 returns’ ground truth labels (relevant or irrelevant tothe query) are used as relevance feedbacks. It is performedfor one round, since the users have no patience to do more.The retrieval performance are plotted in Fig. 5. By relevancefeedback, MR, FMR and EMR get higher retrieval precisionbut still remain close to each other.5.2.5 Model SelectionModel selection plays a key role to many machine learningmethods. In some cases, the performance of an algorithmmay drastically vary by different choices of the parameters,thus we have to estimate the quality of the parameters. Inthis subsection, we evaluate the performance of our methodEMR with different values of the parameters.There are three parameters in our method EMR: s, α,and d. Parameter s is the neighborhood size in the anchorgraph. Small value of s makes the weight matrix Z verysparse. Parameter α is the tradeoff parameter in EMR andMR. Parameter d is the number of anchor points. ForTABLE 3Recall, F1, NCDG and MAP Values, as well as the OfflineBuilding Time (Seconds) of MR, FMR and EMR110 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 27, NO. 1, JANUARY 2015Fig. 5. Retrieval precision at top 20 to 80 returns of Eud (left), MR, FMRand EMR (right) after one round of relevance feedback.convenience, the parameter α is fixed at 0.99, consistentwith the experiments performed in [9], [10], [12].Fig. 6 shows the performance of EMR (Precision at 60)by k-means anchors at different values of s. We find thatthe performance of EMR is not sensitive to the selection ofs when s > 3. With small s, we can guarantee the matrix Zhighly sparse, which is helpful to efficient computation. Inour experiments, we just select s = 5.Fig. 7 shows the performance of EMR versus differentnumber of anchors in the whole data set. We findthat the performance increases very slowly when thenumber of anchors is larger than 500 (approximately).In previous experiments, we fix the number of anchorsto 1000. Actually, a smaller number of anchors, like 800or 600 anchors, can achieve a close performance. Withfewer anchors, the graph construction cost will be furtherreduced. 
But as the size of COREL is not large, the savingis not important.5.3 Experiments on MNIST DatabaseWe also investigate the performance of our method EMR onthe MNIST database. The samples are all gray digit imagesin the size of 28 × 28. We just use the gray values on eachFig. 6. Retrieval precision versus different values of parameter s. Thedotted line represents MR performance.Fig. 7. Retrieval precision versus different number of anchorss. Thedotted line represents MR performance.pixel to represent the images, i.e., for each sample, we usea 784-dimensional vector to represent it. The database wasseparated into 60,000 training data and 10,000 testing data,and the goal is to evaluate the performance on the testingdata. Note that although it is called ’training data’, aretrieval system never uses the given labels. All the rankingmodels use the training data itself to build their modelsand rank the samples according to the queries. Similaridea can be found in many unsupervised hashing algorithms[45], [46] for approximate and fast nearest neighborsearch.With MNIST database, we want to evaluate the efficiencyand effectiveness of the model EMR. As we havementioned, MR’s cost is cubic to the database size, whileEMR is much faster. We record the training time (buildingthe model offline) of MR, FMR and EMR (1k anchors) inTable 4 with the database size increasing step by step. Therequired time for MR and FMR increases very fast and forthe last two sizes, their procedures are out of memory dueto inverse operation. The algorithm MR with the solutionof Eq.(2) is hard to handle the size of MNIST. FMR performseven worse than MR as it clusters the samples andcomputes a large SVD – it seems that FMR is only usefulfor small-size database. However, EMR is much fasterin this test. The time cost scales linearly – 6 seconds for10,000 samples and 35 seconds for 60,000 samples. We usek-means algorithm with maximum 5 iterations to generatethe anchor points. We find that running k-means with 5iterations is good enough for anchor point selection.TABLE 4Computational Time (s) for Offline Training of MR, FMR, andEMR (1k Anchors) on MNIST DatabaseXU ET AL.: EMR: A SCALABLE GRAPH-BASED RANKING MODEL FOR CONTENT-BASED IMAGE RETRIEVAL 111(a) (b) (c)Fig. 8. (a) MAP values with different number of anchors for EMR. (b) Offline training time of EMR with different number of anchors. (c) Online newquery retrieval time of EMR with different number of anchors on MNIST.5.3.1 Out-of-Sample Retrieval TestIn this section, we evaluate the response time of EMRwhen handling an out-of-sample (a new sample). As MR(as well as FMR)’s framework is hard to handle the outof-sample query and is too costly for training the modelon the size of MNIST (Table 4), from now on, we don’tuse MR and FMR as comparisons, but some other rankingscore (similarity or distance) generating methods should becompared. We use the following two methods as baselinemethods:Eud: linear scan by Euclidean distance. This maybe themost simple but meaningful baseline to compare the out-ofsampleretrieval performance. Many previous fast nearestneighbor search algorithms or hashing-based algorithmswere proposed to accelerate the linear scan speed withsome accuracy loss than Euclidean distance. Their goal isdifferent with ranking – the ranking model assigns eachsample a score but not only the neighbors.LSH: locality sensitive hashing [45], a famous hashing codegenerating method. 
We use LSH to generate binary codesfor the images for both training and testing samples andthen calculate the hamming distance of a query to alldatabase samples as ranking metric. We use 128 bits and256 bits as the code length of LSH.In Fig. 8(a), we draw the MAP (top 200) values for allthe testing data of our model EMR with different numberof anchor points. The performance of Eud and LSHare showed by three horizontal lines. We can see that,when more than 400 anchors are used, EMR outperformsEuclidean distance metric significantly. LSH is worse thanEud due to its binary representation. We also record EMR’soffline training time and online retrieval time in Fig. 8(b)and Fig. 8(c). The computational time for both offline andonline increases linearly to the number of anchors.Then, in Table 5, we record the computational time (inseconds) and out-of-sample retrieval performance of EMR(1000 anchors), Eud and LSH with 128 and 256 code length.The best performance of each line is in bold font. EMR andLSH-128 have close online retrieval time, which is greatlyfaster than linear scan Eud – about 30 times faster. LSHhas very small training cost as its hashing functions arerandomly selected, while EMR needs more time to buildthe model. With more offline building cost, EMR receiveshigher retrieval performance in metric of precision, NDCGat 100 and MAP. The offline cost is valuable. The numberwith ∗ means it is significant higher than Eud at the 0.001significance level.5.3.2 Case StudyFig. 9 is an out-of-sample retrieval case with Fig. 9(a) usingEuclidean distance to measure the similarity and Fig. 9(b)using EMR with 400 anchors and Fig. 9(c) with 600 anchors.Since the database structure is simple, we just need to usea small number of anchors to build our anchor graph.When we use 400 anchors, we have received a good result(Fig. 9(b)). Then, when we use more anchors, we can get abetter result. It is not hard to see that, the results of Fig. 9(b)and (c) are all correct, but the quality of Fig. 9(c) is a littlebetter – the digits are more similar with the query.5.4 Experiments on Large Scale DatabasesIn our consideration, the issue of performance shouldinclude both efficiency and effectiveness. Since our methodis designed to speedup the model ’manifold ranking’, theefficiency is the main point of this paper. The first severalexperiments are used to show that our model is muchfaster than MR in both offline training and online retrievalprocesses, with only a small accuracy loss. The originalMR model can not be directly applied to a large data set,e.g., a data set with 1 million samples. Thus, to show theperformance of our method for large data sets, we comparemany state-of-the-art hash-based fast nearest neighborsearch algorithms (our ranking model can naturally do theTABLE 5Out-of-Sample Retrieval Time (s) and Retrieval PerformanceComparisons of EMR (1k Anchors), Eud and LSH with128 and 256 Code Length on MNIST DatabaseThe best performance is in bold font. The number with means it is significanthigher than Eud at the 0.001 significance level.112 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 27, NO. 1, JANUARY 2015Fig. 9. Top retrieved MNIST digits via (a) Euclidean distance, (b) EMR with 400 anchor points, and (c) EMR with 600 anchor points. 
The digit in thefirst line is a new query and the rest digits are the top returns.work of nearest neighbor search) on SIFT1M and ImageNetdatabases.For these two sets, there is no exact labels, so we followthe criterion used in many previous fast nearest neighborsearch work [46]: the groundtruth neighbors are obtainedby brute force search. We use the top-1 percent nearestneighbors as groundtruth. We record the computationaltime (offline training and online retrieval) and rankingperformance in Tables 6 and 7. The offline time is for trainingand the online time is for a query retrieval (averaged).We randomly select 1,000 images from the database asout-of-sample queries and evaluate the performance.For comparison, some state-of-the-art hashing methodsincluding LSH, Spectral Hashing [46] and SphericalHashing (a very recent proposed method [47]) are used.For EMR, we select 10% of the database samples to run kmeansalgorithm with maximum 5 iterations,which is veryfast. In the online stage, the hamming distances betweenthe query sample and the database samples are calculatedfor LSH, Spectral hashing and Spherical Hashing and thenthe distances are sorted. While for our method, we directlycompute the scores via Eq.(17) and sort them. If we adoptany filtering strategy to reduce the number of candidatesamples, the computational cost for each method would bereduced equally. So we only compare the largest computationalcost (brute force search). We adopt 64-bit binarycodes for SIFT1M and 128-bit for ImageNet for all the hashmethods.From Tables 6 and 7, we find that EMR has a comparableonline query cost, and a high nearest neighborsearch accuracy, especially on the high dimensional dataset ImageNet, showing its good performance.TABLE 6Computational Time (s) and Retrieval PerformanceComparison of EMR (1k Anchors), and LSH andSpherical Hash on SIFT1M Database(1 Million-Sample, 128-Dim)5.5 Algorithm AnalysisFrom the comprehensive experimental results above, weget a conclusion that our algorithm EMR is effective andefficient. It is appropriate for CBIR since it is friendly tonew queries. A core point of the algorithm is the anchorpoints selection. Two issues should be further discussed: thequality and the number of anchors. Obviously, our goal isto select less anchors with higher quality. We discuss themas follows:• How to select good anchor points? This is an openquestion. In our method, we use k-means clusteringcenters as anchors. So any faster or better clusteringmethods do help to the selection. There is a tradeoffbetween the selection speed and precision. However,the k-means centers are not perfect – some clustersare very close while some clusters are very small.There is still much space for improvement.• How many anchor points we need? There isno standard answer but our experiments providesome clues: SIFT1M and ImageNet databasesare larger than COREL, but they need similarnumber of anchors to receive acceptable results,i.e., the required number of anchors is not proportionalto the database size. This is important,otherwise EMR is less useful. 
The numberof anchors is determined by the intrinsic clusterstructure.6 CONCLUSIONIn this paper, we propose the Efficient Manifold Rankingalgorithm which extends the original manifold ranking toTABLE 7Computational Time (s) and Retrieval PerformanceComparison of EMR (1k Anchors), and LSHand Spherical Hash on ImageNet Database(1.2 Million-Sample, 1k-Dim)XU ET AL.: EMR: A SCALABLE GRAPH-BASED RANKING MODEL FOR CONTENT-BASED IMAGE RETRIEVAL 113handle large scale databases. EMR tries to address theshortcomings of original manifold ranking from two perspectives:the first is scalable graph construction; and thesecond is efficient computation, especially for out-of-sampleretrieval. Experimental results demonstrate that EMR is feasibleto large scale image retrieval systems – it significantlyreduces the computational time.ACKNOWLEDGMENTSThis work was supported in part by National NaturalScience Foundation of China under Grant 61125203,91120302, 61173186, 61222207, and 61173185, and inpart by the National Basic Research Program of China(973 Program) under Grant 2012CB316400, FundamentalResearch Funds for the Central Universities, Programfor New Century Excellent Talents in University(NCET-09-0685), Zhejiang Provincial Natural ScienceFoundation under Grant Y1101043 and Foundation ofZhejiang Provincial Educational Department under GrantY201018240.

Effective Key Management in Dynamic Wireless Sensor Networks

Effective Key Management in DynamicWireless Sensor NetworksAbstract—Recently, wireless sensor networks (WSNs) havebeen deployed for a wide variety of applications, includingmilitary sensing and tracking, patient status monitoring, trafficflow monitoring, where sensory devices often move betweendifferent locations. Securing data and communications requiressuitable encryption key protocols. In this paper, we propose acertificateless-effective key management (CL-EKM) protocol forsecure communication in dynamic WSNs characterized by nodemobility. The CL-EKM supports efficient key updates when anode leaves or joins a cluster and ensures forward and backwardkey secrecy. The protocol also supports efficient key revocationfor compromised nodes and minimizes the impact of a nodecompromise on the security of other communication links.A security analysis of our scheme shows that our protocol is effectivein defending against various attacks.We implement CL-EKMin Contiki OS and simulate it using Cooja simulator to assess itstime, energy, communication, and memory performance.Index Terms—Wireless sensor networks, certificateless publickey cryptography, key management scheme.I. INTRODUCTIONDYNAMIC wireless sensor networks (WSNs), whichenable mobility of sensor nodes, facilitate wider networkcoverage and more accurate service than static WSNs. Therefore,dynamic WSNs are being rapidly adopted in monitoringapplications, such as target tracking in battlefield surveillance,healthcare systems, traffic flow and vehicle status monitoring,dairy cattle health monitoring [9]. However, sensor devicesare vulnerable to malicious attacks such as impersonation,interception, capture or physical destruction, due to theirunattended operative environments and lapses of connectivityin wireless communication [20]. Thus, security is one ofthe most important issues in many critical dynamic WSNapplications. DynamicWSNs thus need to address key securityrequirements, such as node authentication, data confidentialityand integrity, whenever and wherever the nodes move.To address security, encryption key management protocolsfor dynamic WSNs have been proposed in the past basedManuscript received August 6, 2014; revised October 17, 2014; acceptedNovember 18, 2014. Date of publication December 4, 2014; date of currentversion January 13, 2015. This work was supported in part by the BrainKorea 21 Plus Project. The associate editor coordinating the review of thismanuscript and approving it for publication was Prof. Kui Q. Ren.S.-H. Seo is with the Center for Information Security Technologies, KoreaUniversity, Seoul 136-701, Korea (e-mail: seosh77@gmail.com).J. Won, S. Sultana, and E. Bertino are with the Department ofComputer Science, Purdue University, West Lafayette, IN 47907 USA(e-mail: won12@purdue.edu; ssultana@purdue.edu; bertino@purdue.edu).Color versions of one or more of the figures in this paper are availableonline at http://ieeexplore.ieee.org.Digital Object Identifier 10.1109/TIFS.2014.2375555on symmetric key encryption [1]–[3]. Such type of encryptionis well-suited for sensor nodes because of their limitedenergy and processing capability. However, it suffers from highcommunication overhead and requires large memory space tostore shared pairwise keys. It is also not scalable and notresilient against compromises, and unable to support nodemobility. Therefore symmetric key encryption is not suitablefor dynamic WSNs. 
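To put the storage cost of naive full pairwise symmetric keying in perspective: a network of N nodes requires N − 1 keys per node and N(N − 1)/2 distinct keys overall. The small illustrative calculation below (an arbitrary 16-byte key size, not a figure from the cited schemes) shows that a few thousand nodes already push per-node storage to tens of kilobytes, on the order of the total RAM of many motes.

# Storage for naive full pairwise symmetric keying (illustrative numbers only).
def pairwise_key_storage(num_nodes, key_bytes=16):
    per_node = (num_nodes - 1) * key_bytes          # bytes each node must hold
    total_keys = num_nodes * (num_nodes - 1) // 2   # distinct keys network-wide
    return per_node, total_keys

for n in (100, 1000, 5000):
    per_node, total_keys = pairwise_key_storage(n)
    print(f"N={n}: {per_node} bytes per node, {total_keys} keys in the network")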
More recently, asymmetric key basedapproaches have been proposed for dynamic WSNs [4]–[7],[10], [15], [18], [25], [27]. These approaches take advantageof public key cryptography (PKC) such as elliptic curvecryptography (ECC) or identity-based public key cryptography(ID-PKC) in order to simplify key establishment anddata authentication between nodes. PKC is relatively moreexpensive than symmetric key encryption with respect tocomputational costs. However, recent improvements in theimplementation of ECC [11] have demonstrated the feasibilityof applying PKC to WSNs. For instance, the implementationof 160-bit ECC on an Atmel AT-mega 128, which has an8-bit 8 MHz CPU, shows that an ECC point multiplicationtakes less than one second [11]. Moreover, PKC is moreresilient to node compromise attacks and is more scalableand flexible. However, we found the security weaknessesof existing ECC-based schemes [5], [10], [25] that theseapproaches are vulnerable to message forgery, key compromiseand known-key attacks. Also, we analyzed the critical securityflaws of [15] that the static private key is exposed to the otherwhen both nodes establish the session key. Moreover, theseECC-based schemes with certificates when directly appliedto dynamic WSNs, suffer from the certificate managementoverhead of all the sensor nodes and so are not a practicalapplication for large scale WSNs. The pairing operationbasedID-PKC [4], [18] schemes are inefficient due to thecomputational overhead for pairing operations. To the best ofour knowledge, efficient and secure key management schemesfor dynamic WSNs have not yet been proposed.In this paper, we present a certificateless effective keymanagement (CL-EKM) scheme for dynamic WSNs. In certificatelesspublic key cryptography (CL-PKC) [12], the user’sfull private key is a combination of a partial private keygenerated by a key generation center (KGC) and the user’s ownsecret value. The special organization of the full private/publickey pair removes the need for certificates and also resolves thekey escrow problem by removing the responsibility for theuser’s full private key. We also take the benefit of ECC keysdefined on an additive group with a 160-bit length as secureas the RSA keys with 1024-bit length.1556-6013 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.372 IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 10, NO. 2, FEBRUARY 2015In order to dynamically provide both node authenticationand establish a pairwise key between nodes, we buildCL-EKM by utilizing a pairing-free certificateless hybridsigncryption scheme (CL-HSC) proposed by us in an earlierwork [13], [14]. Due to the properties of CL-HSC, thepairwise key of CL-EKM can be efficiently shared betweentwo nodes without requiring taxing pairing operations andthe exchange of certificates. To support node mobility, ourCL-EKM also supports lightweight processes for cluster keyupdates executed when a node moves, and key revocation isexecuted when a node is detected as malicious or leaves thecluster permanently. CL-EKM is scalable in case of additionsof new nodes after network deployment. CL-EKM is secureagainst node compromise, cloning and impersonation, andensures forward and backward secrecy. The security analysisof our scheme shows its effectiveness. 
Below we summarize the contributions of this paper:
• We show the security weaknesses of existing ECC-based key management schemes for dynamic WSNs [10], [15], [25].
• We propose the first certificateless effective key management scheme (CL-EKM) for dynamic WSNs. CL-EKM supports four types of keys, each of which is used for a different purpose, including secure pairwise node communication and group-oriented key communication within clusters. Efficient key management procedures are defined to support node movements across different clusters and a key revocation process for compromised nodes.
• We implement CL-EKM in Contiki OS and use a TI exp5438 emulator to measure the computation and communication overhead of CL-EKM. We also develop a simulator to measure the energy consumption of CL-EKM. Then, we conduct a simulation of node movement by adopting the Random Walk Mobility Model and the Manhattan Mobility Model within a grid. The experimental results show that our CL-EKM scheme is lightweight and hence suitable for dynamic WSNs.
The remainder of this paper is organized as follows. In Section 2, we briefly discuss related work and show the security weaknesses of the existing schemes. In Section 3, we present our network model and adversary model. In Section 4, we provide an overview of CL-EKM. In Section 5, we introduce the details of CL-EKM. In Section 6, we analyze the security of CL-EKM. In Section 7, we evaluate the performance of CL-EKM; we conduct the simulation of node movement in Section 8, and conclude in Section 9.
II. RELATED WORK
Symmetric key schemes are not viable for mobile sensor nodes, and thus past approaches have focused only on static WSNs. A few approaches based on PKC have been proposed to support dynamic WSNs. Thus, in this section, we review previous PKC-based key management schemes for dynamic WSNs and analyze their security weaknesses or disadvantages.
Chuang et al. [7] and Agrawal et al. [8] proposed a two-layered key management scheme and a dynamic key update protocol for dynamic WSNs based on Diffie-Hellman (DH), respectively. However, both schemes [7], [8] are not suited for sensors with limited resources, which are unable to perform expensive computations with large key sizes (e.g., at least 1024 bits). Since ECC is computationally more efficient and has a short key length (e.g., 160 bits), several certificate-based approaches [5], [10], [15], [25] have been proposed based on ECC. However, since each node must exchange certificates to establish the pairwise key and verify the other's certificate before use, the communication and computation overhead increase dramatically. Also, the BS suffers from the overhead of certificate management. Moreover, the existing schemes [5], [10], [15], [25] are not secure. Alagheband et al. [5] proposed a key management scheme using ECC-based signcryption, but this scheme is insecure against message forgery attacks [16]. Huang et al. [15] proposed an ECC-based key establishment scheme for self-organizing WSNs. However, we found security weaknesses in their scheme. In step 2 of their scheme, a sensor node U sends z = qU · H(MacKey) + dU (mod n) to the other node V for authentication, where qU is the static private key of U. But once V receives z, it can disclose qU, because V already obtained MacKey and dU in step 1. So, V can easily obtain qU by computing qU = (z − dU) · H(MacKey)^(−1) mod n. Thus, the sensor node's private key is exposed to the other node during the key establishment between the two nodes.
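To make the algebra of this weakness concrete, the toy sketch below replays the recovery with small stand-in values; the modulus, hash output, and key values are arbitrary illustrative numbers, not parameters of the scheme in [15]. The point is simply that z is an affine function of qU in which every other quantity is already known to V.

# Toy replay of the private-key recovery described above. The modulus, hash
# output and key values are arbitrary stand-ins, not parameters from [15].
n = 7919                      # stand-in for the modulus used in the scheme
H_mackey = 4242               # stand-in for H(MacKey), already known to V
q_U = 1234                    # U's static private key (meant to stay secret)
d_U = 567                     # value V already received from U in step 1

# Step 2 of the protocol: U sends z = q_U * H(MacKey) + d_U (mod n).
z = (q_U * H_mackey + d_U) % n

# V knows d_U and H(MacKey), so it simply inverts the affine relation
# (modular inverse via pow(..., -1, n) requires Python 3.8+).
recovered = ((z - d_U) * pow(H_mackey, -1, n)) % n
assert recovered == q_U       # U's static private key is fully exposed
print("recovered q_U =", recovered)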
Zhang et al. [10] proposed a distributed deterministic key management scheme based on ECC for dynamic WSNs. It uses a symmetric key approach to share the pairwise keys of existing nodes and an asymmetric key approach to share the pairwise keys of a new node after deployment. However, since the initial key KI is used to compute the individual keys and the pairwise keys of all nodes after deployment, an adversary that obtains KI is able to compute all individual keys and all pairwise keys. Thus, the scheme suffers from weak resilience to node compromise. Also, since the scheme uses a simple ECC-based DH key agreement based on each node's long-term public and private keys, the shared pairwise key is static and, as a result, is not secure against known-key attacks and cannot provide a re-key operation. Du et al. [25] use an ECDSA scheme to verify the identity of a cluster head and a static EC Diffie-Hellman key agreement scheme to share the pairwise key between cluster heads. Therefore, the scheme by Du et al. is not secure against known-key attacks, because the pairwise key between the cluster heads is static. On the other hand, Du et al. use a modular arithmetic-based symmetric key approach to share the pairwise key between a sensor node and a cluster head. Thus, a sensor node cannot directly establish a pairwise key with other sensor nodes and, instead, requires the support of the cluster head. In their scheme, in order to establish a pairwise key between two nodes in the same cluster, the cluster head randomly generates a pairwise key and encrypts it using the keys it shares with these two nodes. Then the cluster head transmits the encrypted pairwise key to each node. Thus, if the cluster head is compromised, the pairwise keys between non-compromised sensor nodes in the same cluster are also compromised. Therefore, their scheme is not compromise-resilient against cluster head capture, because the cluster head randomly generates a pairwise key between sensor nodes whenever it is requested by the nodes. Moreover, in their scheme, in order to share a pairwise key between two nodes in different clusters, these two nodes must communicate via their respective cluster heads. So, after one cluster head generates the pairwise key for the two nodes, it must securely transmit this key both to its own node and to the other cluster head. Thus, this pairwise key must be encrypted using the pairwise key shared with the other cluster head and the key shared with its node, respectively. Therefore, if the pairwise key between the cluster heads is exposed, all pairwise keys of nodes in different clusters are disclosed. The scheme by Du et al. supports forward and backward secrecy by using a key update process whenever a new node joins the cluster or a node is compromised. However, the scheme does not provide a process to protect against clone and impersonation attacks.
Fig. 1. Heterogeneous dynamic wireless sensor network.
Most recently, Rahman et al. [4] and Chatterjee et al. [18] have proposed ID-PKC based key management schemes supporting the mobility of nodes in dynamic WSNs, which removes the certificate management overhead. However, their schemes require expensive pairing operations. Although many approaches that enable pairing operations on sensor nodes have been proposed, the computational cost required for pairing is still considerably higher than that of standard operations such as ECC point multiplication.
For example,NanoECC, which uses the MIRACL library, takes around17.93s to compute one pairing operation and around 1.27s tocompute one ECC point multiplication on the MICA2(8MHz)mote [17].III. NETWORK AND ADVERSARY MODELSA. Network ModelWe consider a heterogeneous dynamic wireless sensornetwork (See Fig. 1). The network consists of a number ofstationary or mobile sensor nodes and a BS that manages thenetwork and collects data from the sensors. Sensor nodes canbe of two types: (i) nodes with high processing capabilities,referred to as H-sensors, and (ii) nodes with low processingcapabilities, referred to as L-sensors. We assume to haveN nodes in the network with a number N1 of H-sensorsand a number N2 of L-sensors, where N = N1 + N2, andN1 _ N2. Nodes may join and leave the network, and thusthe network size may dynamically change. The H-sensors actas cluster heads while L-sensors act as cluster members. Theyare connected to the BS directly or by a multi-hop path throughother H-sensors. H-sensors and L-sensors can be stationary ormobile. After the network deployment, each H-sensor formsa cluster by discovering the neighboring L-sensors throughbeacon message exchanges. The L-sensors can join a cluster,move to other clusters and also re-join the previous clusters.To maintain the updated list of neighbors and connectivity,the nodes in a cluster periodically exchange very lightweightbeacon messages. The H-sensors report any changes in theirclusters to the BS, for example, when a L-sensor leaves orjoins the cluster. The BS creates a list of legitimate nodes,M, and updates the status of the nodes when an anomalynode or node failure is detected. The BS assigns each nodea unique identifier. A L-sensor nLi is uniquely identified bynode ID Li whereas a H-sensor nHj is assigned a node ID Hj .A Key Generation Center (KGC), hosted at the BS, generatespublic system parameters used for key management by theBS and issues certificateless public/private key pairs for eachnode in the network. In our key management system, a uniqueindividual key, shared only between the node and the BS isassigned to each node. The certificateless public/private keyof a node is used to establish pairwise keys between any twonodes. A cluster key is shared among the nodes in a cluster.B. Adversary Model and Security RequirementsWe assume that the adversary can mount a physical attackon a sensor node after the node is deployed and retrieve secretinformation and data stored in the node. The adversary can alsopopulate the network with the clones of the captured node.Even without capturing a node, an adversary can conduct animpersonation attack by injecting an illegitimate node, whichattempts to impersonate a legitimate node. Adversaries canconduct passive attacks, such as, eavesdropping, replay attack,etc to compromise data confidentiality and integrity. Specificto our proposed key management scheme, the adversary canperform a known-key attack to learn pairwise master keys if itsomehow learns the short-term keys, e.g., pairwise encryptionkeys. As described in [26] and [8], in order to provide a securekey management scheme for WSNs supporting mobile nodes,the following security properties are critical:• Compromise-Resilience: A compromised node must notaffect the security of the keys of other legitimate nodes.In other words, the compromised node must not be ableto reveal pairwise keys of non-compromised nodes. 
Thecompromise-resilience definition does not mean that anode is resilient against capture attacks or that a capturednode is prevented from sending false data to other nodes,BS, or cluster heads.• Resistance Against Cloning and Impersonation: Thescheme must support node authentication to protectagainst node replication and impersonation attacks.• Forward and Backward Secrecy: The scheme must assureforward secrecy to prevent a node from using an oldkey to continue decrypting new messages. It must alsoassure backward secrecy to prevent a node with the newkey from going backwards in time to decrypt previouslyexchanged messages encrypted with prior keys. forwardand backward secrecy are used to protect against nodecapture attacks.374 IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 10, NO. 2, FEBRUARY 2015• Resilience Against Known-Key Attack: The scheme mustbe secure against the known-key attack.IV. OVERVIEW OF THE CERTIFICATELESS EFFECTIVEKEY MANAGEMENT SCHEMEIn this paper, we propose a Certificateless Key Managementscheme (CL-EKM) that supports the establishment of fourtypes of keys, namely: a certificateless public/private key pair,an individual key, a pairwise key, and a cluster key. Thisscheme also utilizes the main algorithms of the CL-HSCscheme [13] in deriving certificateless public/private keys andpairwise keys. We briefly describe the major notations usedin the paper (See Table I), the purpose of these keys and howthey are setup.A. Types of KeysCertificateless Public/Private Key: Before a node isdeployed, the KGC at the BS generates a uniquecertificateless private/public key pair and installs the keysin the node. This key pair is used to generate a mutuallyauthenticated pairwise key.• Individual Node Key: Each node shares a uniqueindividual key with BS. For example, a L-sensor can usethe individual key to encrypt an alert message sent tothe BS, or if it fails to communicate with the H-sensor.An H-sensor can use its individual key to encrypt themessage corresponding to changes in the cluster. TheBS can also use this key to encrypt any sensitive data,such as compromised node information or commands.Before a node is deployed, the BS assigns the node theindividual key.• Pairwise Key: Each node shares a different pairwise keywith each of its neighboring nodes for secure communicationsand authentication of these nodes. For example, inorder to join a cluster, a L-sensor should share a pairwisekey with the H-sensor. Then, the H-sensor can securelyencrypt and distribute its cluster key to the L-sensorby using the pairwise key. In an aggregation supportiveWSN, the L-sensor can use its pairwise key to securelytransmit the sensed data to the H-sensor. Each nodecan dynamically establish the pairwise key between itselfand another node using their respective certificatelesspublic/private key pairs.• Cluster Key: All nodes in a cluster share a key, named ascluster key. The cluster key is mainly used for securingbroadcast messages in a cluster, e.g., sensitive commandsor the change of member status in a cluster. Only thecluster head can update the cluster key when a L-sensorleaves or joins the cluster.V. THE DETAILS OF CL-EKMThe CL-EKM is comprised of 7 phases: system setup,pairwise key generation, cluster formation, key update, nodemovement, key revocation, and addition of a new node.TABLE ILIST OF NOTATIONSA. 
System Setup
Before the network deployment, the BS generates system parameters and registers each node by including it in a member list M.
1) Generation of System Parameters: The KGC at the BS runs the following steps, taking a security parameter k ∈ Z+ as input, and returns a list of system parameters Ω = {Fq, E/Fq, Gq, P, Ppub = xP, h0, h1, h2, h3} and x.
• Choose a k-bit prime q.
• Determine the tuple {Fq, E/Fq, Gq, P}.
• Choose the master private key x ∈R Z*q and compute the system public key Ppub = x · P.
• Choose cryptographic hash functions {h0, h1, h2, h3} so that h0 : {0,1}* × Gq^2 → {0,1}*, h1 : Gq^3 × {0,1}* × Gq → {0,1}^n, h2 : Gq × {0,1}* × Gq × {0,1}* × Gq × {0,1}* × Gq → Z*q, and h3 : Gq × {0,1}* × Gq × {0,1}* × Gq × {0,1}* × Gq → Z*q. Here, n is the length of a symmetric key.
The BS publishes Ω and keeps x secret.
2) Node Registration: The BS assigns a unique identifier, denoted by Li, to each L-sensor nLi and a unique identifier, denoted by Hj, to each H-sensor nHj, where 1 ≤ i ≤ N1, 1 ≤ j ≤ N2, and N = N1 + N2. Here we describe the certificateless public/private key and individual node key operations for Li; the same mechanisms apply to H-sensors. During initialization, each node nLi chooses a secret value xLi ∈R Z*q and computes PLi = xLi · P. Then, the BS requests the KGC for partial private/public keys of nLi with the input parameters Li and PLi. The KGC chooses rLi ∈R Z*q and then computes the pair of partial public/private keys (RLi, dLi) as follows:
RLi = rLi · P
dLi = rLi + x · h0(Li, RLi, PLi) mod q
Node Li can validate its partial private key by checking whether the condition dLi · P = RLi + h0(Li, RLi, PLi) · Ppub holds. Li then sets skLi = (dLi, xLi) as its full private key and pkLi = (PLi, RLi) as its full public key. The BS also chooses a uniform random number x0 ∈ Z*q to generate the node's individual key K0Li (K0Hj for nHj). The individual key is computed as an HMAC of x0 and Li as follows:
K0Li = HMAC(x0, Li)
After the key generation for all the nodes, the BS generates a member list M consisting of the identifiers and public keys of all these nodes. It also initializes a revocation list R that lists the revoked nodes. The public/private key, Ω, and the individual key are installed in the memory of each node.
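As a rough illustration of the registration algebra just described, the sketch below checks the validation condition dLi · P = RLi + h0(Li, RLi, PLi) · Ppub using ordinary modular arithmetic over Zq in place of the elliptic-curve group (scalar k "times" the generator P is simply k·P mod q here), with SHA-256 standing in for h0. It is therefore insecure and purely illustrative of why the verification equation holds by construction.

# Illustrative check of the node-registration algebra above. Integers modulo q
# stand in for elliptic-curve points, so this is NOT secure; it only shows why
# dLi*P = RLi + h0(Li, RLi, PLi)*Ppub must hold. SHA-256 stands in for h0.
import hashlib
import secrets

q = 2**127 - 1                                   # stand-in prime order
P = 5                                            # stand-in "generator"

def h0(identity, R, Pub):
    data = f"{identity}|{R}|{Pub}".encode()
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % q

x = secrets.randbelow(q - 1) + 1                 # KGC master private key
Ppub = x * P % q                                 # system public key

Li = "L-sensor-17"
x_Li = secrets.randbelow(q - 1) + 1              # node's own secret value
P_Li = x_Li * P % q

r_Li = secrets.randbelow(q - 1) + 1              # KGC issues the partial keys
R_Li = r_Li * P % q
d_Li = (r_Li + x * h0(Li, R_Li, P_Li)) % q

# Node-side validation of the partial private key.
assert d_Li * P % q == (R_Li + h0(Li, R_Li, P_Li) * Ppub) % q
print("partial key verified; full private key skLi =", (d_Li, x_Li))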
B. Pairwise Key Generation
After the network deployment, a node may broadcast an advertisement message to its neighborhood to trigger the pairwise key setup with its neighbors. The advertisement message contains its identifier and public key. At first, two nodes set up a long-term pairwise master key between them, which is then used to derive the pairwise encryption key. The pairwise encryption key is short-term and can be used as a session key to encrypt sensed data.
1) Pairwise Master Key Establishment: We describe the protocol for establishing a pairwise master key between any two nodes nA and nB with unique IDs A and B, respectively. We utilize the CL-HSC scheme [13] as a building block. When nA receives an advertisement message from nB, it executes the following encapsulation process to generate a long-term pairwise master key KAB and the encapsulated key information ϕA = (UA, WA).
• Choose lA ∈R Z*q and compute UA = lA · P.
• Compute
TA = lA · h0(B, RB, PB) · Ppub + lA · RB mod q
KAB = h1(UA, TA, lA · PB, B, PB)
• Compute
h = h2(UA, τA, TA, A, PA, B, PB)
h′ = h3(UA, τA, TA, A, PA, B, PB)
WA = dA + lA · h + xA · h′
where τA is a random string that provides freshness.
• Output KAB and ϕA = (UA, WA).
Then, nA sends A, pkA, τA and ϕA to nB. nB then performs decapsulation to obtain KAB.
• Compute TA = dB · UA. Note: because dB = rB + x · h0(B, RB, PB) and UA = lA · P, TA is computed as TA = (rB + x · h0(B, RB, PB)) · lA · P = lA · h0(B, RB, PB) · Ppub + lA · RB mod q.
• Compute h = h2(UA, τA, TA, A, PA, B, PB) and h′ = h3(UA, τA, TA, A, PA, B, PB).
• If WA · P = RA + h0(A, RA, PA) · Ppub + h · UA + h′ · PA, output KAB = h1(UA, TA, xB · UA, B, PB). Otherwise, output invalid.
2) Pairwise Encryption Key Establishment: Once nA and nB set the pairwise master key KAB, they generate an HMAC of KAB and a nonce r ∈R Z*q. The HMAC is then validated by both nA and nB. If the validation is successful, the HMAC value is established as the short-term pairwise encryption key kAB. The process is summarized below:
• nB chooses a random nonce r ∈R Z*q, computes kAB = HMAC(KAB, r) and C1 = EkAB(r, A, B). Then, nB sends r and C1 to nA.
• When nA receives r and C1, it computes kAB = HMAC(KAB, r) and decrypts C1. Then it validates r, A and B; if they are valid, it confirms that nB knows KAB and can compute kAB.
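The short-term key derivation above is just an HMAC of the master key and a fresh nonce, followed by a confirmation ciphertext. Below is a minimal sketch, assuming AES-GCM as the symmetric cipher (the paper fixes 128-bit AES but not a particular mode) and SHA-256 as the HMAC hash; both choices are assumptions made for illustration.

# Minimal sketch of the pairwise encryption-key establishment above: both nodes
# derive kAB = HMAC(KAB, r) from the shared master key and a fresh nonce, and
# nB proves knowledge of KAB with a confirmation ciphertext C1 = E_kAB(r, A, B).
import hashlib
import hmac
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def derive_k_ab(master, nonce):
    # kAB = HMAC(KAB, r), truncated to 128 bits to serve as the AES key.
    return hmac.new(master, nonce, hashlib.sha256).digest()[:16]

K_AB = os.urandom(32)                  # pairwise master key, already established
A, B = b"node-A", b"node-B"

# nB: choose nonce r, derive kAB, send r, iv and C1 to nA.
r = os.urandom(16)
k_ab_B = derive_k_ab(K_AB, r)
iv = os.urandom(12)
C1 = AESGCM(k_ab_B).encrypt(iv, r + A + B, None)

# nA: derive kAB from r, decrypt C1 and verify its contents.
k_ab_A = derive_k_ab(K_AB, r)
assert AESGCM(k_ab_A).decrypt(iv, C1, None) == r + A + B
print("pairwise encryption key confirmed:", k_ab_A.hex())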
C. Cluster Formation
Once the nodes are deployed, each H-sensor discovers neighboring L-sensors through beacon message exchanges and then proceeds to authenticate them. If the authentication is successful, the H-sensor forms a cluster with the authenticated L-sensors and they share a common cluster key. The H-sensor also establishes a pairwise key with each member of the cluster. To simplify the discussion, we focus on the operations within one cluster and consider the jth cluster. We also assume that the cluster head H-sensor is nHj with nLi (1 ≤ i ≤ n) as cluster members. nHj establishes a cluster key GKj for secure communication in the cluster. Table II shows the cluster formation process.
TABLE II. CLUSTER FORMATION PROCESS
1) Node Discovery and Authentication: For node discovery, nHj broadcasts an advertisement message containing Hj and pkHj. Once nLi within Hj's radio range receives the advertisement, it checks Hj and pkHj, and initiates the Pairwise Key Generation procedure. Note that nLi may receive multiple advertisement messages if it is within the range of more than one H-sensor. However, nLi must choose one H-sensor, for example by prioritizing proximity and signal strength. Additionally, nLi can record other H-sensor advertisements as backup cluster heads in the event that the primary cluster head is disabled. If nLi selects multiple cluster heads and sends a response to all of them, it is considered a compromised node. nLi and nHj perform the Pairwise Key Generation procedure to obtain a pairwise master key KLiHj and a pairwise encryption key kLiHj.
2) Cluster Key Generation: nHj chooses xj ∈R Z*q to generate a cluster key GKj as follows:
GKj = HMAC(xj, Hj)
Then, nHj computes C2 = EkLiHj(GKj, Hj, Li) to distribute GKj and sends Hj and C2 to nLi. nLi decrypts C2 to recover Hj, Li and GKj by using kLiHj. If nLi fails to verify Hj and Li, it discards the message and reports nHj to the BS as an illegitimate cluster head. Otherwise, nLi confirms that nHj is valid and can compute GKj. Then, nLi stores GKj as its cluster key. Next, nLi computes HMAC(kLiHj, GKj) and C3 = EkLiHj(Li, HMAC(kLiHj, GKj)). It transmits C3 and Li to nHj. After nHj receives the message from nLi, it decrypts C3 using kLiHj. Then it checks Li and the validity of HMAC(kLiHj, GKj). If the validity check fails, nHj discards the message. Otherwise, nHj can confirm that nLi shares the valid GKj and kHjLi. nHj adds Li and pkLi to the member list of the jth cluster, Mj.
3) Membership Validation: After discovering all the neighboring nodes nLi (1 ≤ i ≤ n) in the jth cluster, nHj computes C4 = EK0Hj(Hj, Mj) and transmits C4 and Hj to the BS. After receiving the message from nHj, the BS checks the validity of the nodes listed in Mj. If all nodes are legitimate, the BS sends an acknowledgement to nHj. Otherwise, the BS rejects Mj and investigates the identities of the invalid nodes (false or duplicate ID). Then, the BS adds the identities of the invalid nodes to the revocation list and reports it to nHj. Upon receiving the acknowledgement message, nHj computes C5 = EGKj(Hj, Mj) and broadcasts C5 to all the nodes in the jth cluster.
D. Key Update
In order to protect against cryptanalysis and to mitigate damage from compromised keys, frequent encryption key updates are commonly required. In this section we describe the pairwise key update and cluster key update operations.
1) Pairwise Key Update: To update a pairwise encryption key, the two nodes that share the pairwise key perform the Pairwise Encryption Key Establishment process. On the other hand, the pairwise master key does not require periodic updates, because it is not directly used to encrypt each session message. As long as the nodes are not compromised, the pairwise master keys cannot be exposed. However, if a pairwise master key is modified or needs to be updated according to the policy of the BS, the Pairwise Master Key Establishment process must be executed.
2) Cluster Key Update: Only cluster head H-sensors can update their cluster key. If a L-sensor attempts to change the cluster key, the node is considered a malicious node. The operation for any jth cluster is as follows: 1) nHj chooses x′j ∈R Z*q and computes a new cluster key GK′j = HMAC(x′j, Hj). nHj also generates an Update message including HMAC(GK′j, Update) and computes C6 = EGKj(GK′j, HMAC(GK′j, Update)). Then, nHj transmits Update and C6 to its cluster members. 2) Each member nLi decrypts C6 using GKj, verifies HMAC(GK′j, Update) and updates its cluster key to GK′j. Then, each nLi sends an acknowledgement message to nHj.
E. Node Movement
When a node moves between clusters, the H-sensors must properly manage the cluster keys to ensure forward/backward secrecy. Thus, the H-sensor updates the cluster key and notifies the BS of the changed node status. Through this report, the BS can immediately update the node status in M. We denote a moving node as nLm.
1) Node Leave: A node may leave a cluster due to node failure, location change or intermittent communication failure. There are both proactive and reactive ways for the cluster head to detect that a node has left the cluster. The proactive case occurs when the node nLm actively decides to leave the cluster and notifies the cluster head nHj, or when the cluster head decides to revoke the node. Since in this case nHj can confirm that the node has left, it transmits a report EK0Hj(NodeLeave, Lm) to inform the BS that nLm has left the cluster. After receiving the report, the BS updates the status of nLm in M and sends an acknowledgement to nHj. The reactive case occurs when the cluster head nHj fails to communicate with nLm. It may happen that a node runs out of battery power, fails to connect to nHj due to interference or obstacles, is captured by the attacker, or is moved unintentionally. Since the nodes in a cluster periodically exchange lightweight beacon messages, nHj can detect a disappeared node nLm when it does not receive the beacon message from nLm for a predetermined time period.
So, nHj reports the status of the node nLmto the BS by sending EK0Hj(NodeDisappear, Lm). Whenthe BS receives the report, it updates the status of nLm inthe M and acknowledges to nHj. Once nHj receives theacknowledgement from the BS, it changes its cluster keywith the following operations: 1) nHj chooses a new clusterkey GK_j and computes EkLi Hj(GK_j , NodeLeave, Lm) usingpairwise session keys with each node in its cluster, except nLm .2) Then, nHj sends EkLi Hj(GK_j , NodeLeave, Lm) to eachmember node except nLm . 3) Each nLi decrypts it using kLi Hjand updates the cluster key as GK_j .2) Node Join: Once the moving node nLm leaves a cluster,it may join other clusters or return to the previous cluster aftersome period. For the sake of simplicity, we assume that nLmwants to join the lth cluster or return to the j th cluster.(i) Join a New Cluster: nLm sends a join request whichcontains Ln+1 and pkLn+1 to join a lth cluster. After nHlreceives the join request, nLm and nHl perform PairwiseKey Generation procedure to generate KLm Hl and kLm Hl ,respectively. Next, nHl transmits EK0Hl(NodeJoin, Lm)to the BS. The BS decrypts the message and validateswhether nLm is a legitimate node or not and sends anacknowledgement to nHl if successful. The BS alsoupdates the node member list, M. In case of nodevalidation failure at the BS, nHl stops this processand revokes the pairwise key with nLm. Once nHlSEO et al.: EFFECTIVE KEY MANAGEMENT IN DYNAMIC WSNs 377receives the acknowledgement, it performs the ClusterKey Update process with all other nodes in the cluster.nHl also computes EkLm Hl(GK_l , Hl , Lm), and sends itto the newly joined node nLm .(ii) Return to the Previous Cluster: nLm sends a join requestwhich contains Ln+1 and pkLn+1 to join a j th cluster.Once nHj receives the join request, it checks a timerfor nLm which is initially set to the Thold . Thold indicatesthe waiting time before discarding the pairwise masterkey when a L-sensor leaves. If nLm returns to the j thcluster before the timer expires, nLm and nHj performonly the Pairwise Encryption Key Establishment procedureto create a new pairwise encryption key, k_LmHj.Otherwise, they perform the Pairwise Key Generationprocedure to generate a new K_LmHland k_LmHl, respectively.Then, the cluster head nHj also updates the clusterkey to protect backward key secrecy. Before updatingthe cluster key, nHj transmits EK0Hj(NodeReJoin, Lm)to the BS. Once the BS decrypts the message anddetermines that nLm is a valid node, the BS sends theacknowledgement to nHl . The BS then updates the memberlist M. Once nHl receives the acknowledgement,it performs the Cluster Key Update process with allother nodes in the cluster. Afterwards, nHj computesEk_Lm Hj(GK_j , Hj , Lm) and sends it to nLm .F. Key RevocationWe assume that the BS can detect compromisedL-sensors and H-sensors. The BS may have an intrusiondetection system or mechanism to detect malicious nodes oradversaries [19], [20]. Although we do not cover how the BScan discover a compromised node or cluster head in this paper,the BS can utilize the updated node status information of eachcluster to investigate an abnormal node. In our protocol, acluster head reports the change of its node status to the BS,such as whenever a node joins or leaves a cluster. Thus, the BScan promptly manage the node status in the member list, M.For instance, the BS can consider a node as compromisedif the node disappears for a certain period of time. 
In thatcase, the BS must investigate the suspicious node and itcan utilize the node fault detection mechanism introducedin [21] and [22]. In this procedure, we provide a key revocationprocess to be used when the BS discovers a compromised nodeor a compromised cluster head. We denote a compromisednode by nLc in the j th cluster for a compromise node caseand a compromised head by nHj for a compromise clusterhead case.1) Compromised Node: The BS generates a CompNodemessage and a EK0Hj(CompNode, Lc). Then it sendsEK0Hj(CompNode, Lc) to all nHj , (1 ≤ j N2). After allH-sensors decrypt the message, they update the revocationlist of their clusters. Then, if related keys with nLc exist, therelated keys are discarded. Other than nLc , nHj performs theNode leave operations to change the current cluster key withthe remaining member nodes.2) Compromised Cluster Head: After the BS generates aCompHeader message and a EK0Li(CompHeader, Hj), itsends the message to all nLi (1 ≤ i n) in the j th cluster. TheBS also computes EK0Hi(CompHeader, Hj), (1 ≤ i N2,i _= j ) and transmits it to all H-sensors except nHj. Onceall nodes decrypt the message, they discard the related keyswith nHj . Then, each nLi attempts to find other neighboringcluster heads and performs the Join other cluster steps of theNode join process with the neighboring cluster head. If somenode nLi is unable to find another cluster head node, it mustnotify the BS by sending EK0Li(FindNewClusteLi ). The BSproceeds to find the nearest cluster head nHn for nLi andconnect nHn with nLi . Then, they can perform the Join othercluster steps.G. Addition of a New NodeBefore adding a new node into an existing networks, theBS must ensure that the node is not compromised. Thenew node nLn+1 establishes a full private/public key throughthe node registration phase. Then, the public systemparameters, a full private/public key and individualkey K0Ln+1are stored into nLn+1 . The BS generatesEK0Hj(NewNode, Ln+1, pkLn+1) and sends it to all nHj ,(1 ≤ j N2). After nLn+1 is deployed in the network,it broadcasts an advertisement message which containsLn+1 and pkLn+1 to join a neighboring cluster. If multipleH-sensors receive nLn+1’s message, they will transmit aResponse message to nLn+1 . nLn+1 must choose one H-sensorfor a valid registration. If nLn+1 selects nHj according to thedistance and the strength of signal, it initiates the PairwiseKey Generation procedure. In order to provide backwardsecrecy, nHj performs Cluster Key Update procedure, wherethe Update message contains Ln+1 and pkLn+1. Then, nHjcomputes C7 = EkLn+1 Hj(GK_j , Hj , Ln+1), and sends C7and Hj to nLn+1. After nLn+1’s registration, nHj transmitsEK0Hj(NodeJoin, Ln+1) to the BS. Once the BS decrypts themessage, it updates the status of the node nLn+1 in memberlist, M.VI. SECURITY ANALYSISFirst, we briefly discuss the security of CL-HSC [13]which is utilized as a building block of CL-EKM. Later,we discuss how CL-EKM achieves our security goals. TheCL-HSC [13] provides both confidentiality and unforgeabilityfor signcrypted messages based on the intractability of theEC-CDH1 Moreover, it is not possible to forge or expose thefull private key of an entity based on the difficulty of EC-CDH,without the knowledge of both KGC’s master private key andan entity’s secret value. 
Here, the confidentiality is definedas indistinguishability against adaptive chosen ciphertext andidentity attacks (IND-CCA2) while unforgeability is defined1The Elliptic Curve Computational Diffie-Hellman problem (EC-CDH) isdefined as follows: Given a random instance (P,aP, bP) Gq for a,b R Z∗q, compute abP.378 IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 10, NO. 2, FEBRUARY 2015as existential unforgeability against adaptive chosen messagesand identity attacks (EUF-CMA). Further details on theCL-HSC scheme and its security proof are provided in [13].A. Compromise-Resilience of CL-EKMWe assume that an adversary captures a node nLi in thej th cluster. This adversary can then extract the keys of nLi ,such as the pairwise key shared with the cluster head nHj ,the public/private key pair, the cluster key GKj, and theindividual key. However, the pairwise master/encryption keygeneration between any two nodes are independent of others,and hence each pair of nodes has different pairwise keys.Therefore, even if the adversary manages to obtain nLi’s keys,it is unable to extract any information useful to compromisethe pairwise keys of other uncompromised nodes. Moreover,due to the intractability of EC-CDH problem, the adversarycannot obtain the KGC’s master private key x from nLi’spublic/private keys pkLi /skLi . As a result, the compromiseof a sensor does not affect the communication security amongother L-sensors or H-sensors. Even though the attacker canread the group communications within the cluster with thecluster key extracted from the compromised node, it cannotget any information about the cluster key of other clusters.B. Resistance Against Cloning and Impersonation AttackAn adversary can conduct the cloning attack if a node iscaptured; the key is then extracted and the node is replicatedin another neighborhood. However, since the cluster headvalidates each node with the BS in the node join process ofour CL-EKM, the BS is able to detect a cloned node when itis placed in an unintended cluster. After the BS investigatesthe cloned node, it revokes the node and notifies the noderevocation to all cluster heads. Thus, although the cloned nodemay try to join other clusters, the cluster head will abort eachattempt. Therefore, our scheme is resistant against the cloningattack.The adversary may also attempt an impersonation attackby inserting an illegitimate node nC. Assume that a node nCposes as nLi . The node ID Li and public key, pkLi=(PLi , RLi ) are publicly known within the network. Hence,nC can broadcast Li and pkLi. When nL j receives themessage, it will compute the pairwise master key KLi L j ,and the encapsulated key information ϕL j= (UL j ,WL j )towards establishing the pairwise Master key. As the next step,nL j sends     ϕL j , L j , pkL j to nC for decapsulation, whichrequires nC to compute TL j as (dLi· UL j ). However, nCfails to compute TL j since nC has no knowledge of nLi’spartial private key dLi . Moreover due to the intractability ofEC-CDH1, the adversary cannot forge dLi without the knowledgeof the KGC’s master private key. Thus, nC is unableto generate a legitimate pairwise master key, KLi L j. However,nC may try to establish the pairwise encryption with a randomkey K_, rather than generating a legitimate master key. 
To thisend, nC chooses a random nonce r , computes an encryptionkey k_ as HMAC(r, K_) and sends          r, E_k (r, Li , L j ) to nL j .However, nC cannot successfully pass the validation at nL j ,since nL j first computes the pairwise encryption key withnL j as kLi L j= HMAC(r, KLi L j ) and then tries to decryptE_k (r, Li , L j ) using kLi L j . Thus, nL j fails to decrypt andhence, it does not confirm the pairwise encryption key to nC,which is then reported to the BS. Thus, CL-EKM is resistantagainst impersonation attacks.C. Forward and Backward SecrecyIn CL-EKM, messages exchanged between nodes or withina cluster are encrypted with the pairwise encryption key orcluster key. CL-EKM provides the key update and revocationprocesses to ensure forward secrecy when a node leaves orcompromised node is detected. Using key update process,CL-EKM ensures backward secrecy when a new node joins.Once a node is revoked from the network, all its keys areinvalidated and the associated cluster key is updated. Thecluster head sends the new cluster key to each cluster node,except the revoked node, by encrypting the key with thepairwise encryption key between the cluster and each intendednode. Thus, the revoked node fails to decrypt any subsequentmessages using the old pairwise encryption key or cluster key.When a node joins a cluster, the cluster head generates a newcluster key by choosing a new random value. Since the joinednode receives the new cluster key, it cannot decrypt earliermessages encrypted using the older cluster keys.D. Resistance Against Known-Key AttackWe assume that an adversary obtains the current pairwiseencryption key kLi Hj= HMAC(KLi Hj , r ) betweennLi and nHj and conducts the known-key attack. The adversarymay attempt to extract the long term pairwise master keyKLi Hj using kLi Hj . However, due to the one-way featureof HMAC(.), the adversary fails to learn KLi Hj. Also,when nLi and nHj update the pairwise encryption key ask_Li Hj= HMAC(KLi Hj , r _), the adversary cannot computethe updated pairwise encryption key k_Li Hj, without the knowledgeof KLi Hj . Thus, CL-EKM is resistant against known-keyattack when the pairwise encryption key is compromised.VII. PERFORMANCE EVALUATIONWe implemented CL-EKM in Contiki OS [29] andused Contiki port [28] of TinyECC [24] for elliptic curvecryptography library. In order to evaluate our scheme, weuse the Contiki simulator COOJA. We run emulations on thestate-of-the-art sensor platform TI EXP5438 which has 16-bitCPU MSP430F5438A with 256KB flash and 16KB RAM.MSP430F5438A has 25MHz clock frequency and can belowered for power saving.A. Performance Analysis of CL-EKMWe measure the individual performance of the three stepsin the pairwise master/encryption key establishment process,namely, (i) encapsulation, (ii) decapsulation, and (iii) pairwiseencryption key generation. We evaluate each step in termsof (i) computation time, and (ii) energy consumption.In this experiment, we vary the processing power i.e. CPUclock rate of the sensors since we consider heterogeneousSEO et al.: EFFECTIVE KEY MANAGEMENT IN DYNAMIC WSNs 379Fig. 2. Computation overhead for pairwise master/encryption key establishment. (a) Encapsulating key information. (b) Decapsulating key information.(c) Pairwise encryption key establishment.Fig. 3. Energy consumption for pairwise master/encryption key establishment. (a) Encapsulating key information. (b) Decapsulating key information.(c) Pairwise encryption key establishment.WSNs with H-sensors being more powerful. 
Three differentelliptic curves recommended by SECG (Standards for EfficientCryptography Group) [30], i.e., (i) secp128r2 (128-bit ECC),(ii) secp160r1 (160-bit ECC), and (iii) secp192r1 (192-bitECC), are used for the experiment.Fig. 2 shows the time for the pairwise key generationprocess. As expected, the pairwise master key generationtakes most of the time due to the ECC operations(See Fig. 2(a), 2(b)). However, it is important to mention thatthe pairwise master key is used only to derive the short-termpairwise encryption key. Once two nodes establish the pairwisekeys, they do not require further ECC operations. Fig. 2(a)shows the computation times of the encapsulation process forvarious CPU clock rates of the sensor device. The computationtime increases with the ECC key bit length. secp192r1 needsalmost 1.5 times more time than secp160r1. secp128r2 takesapproximately 4% less time than secp160r1. If CPU clockrate is set to 25MHz and secp160r1 is adopted, 5.7 secondsare needed for encapsulation of key. Fig. 2(b) shows theprocessing time for the decapsulation. Decapsulation requiresabout 1.57 times more CPU computation time than encapsulation.This is because decapsulation has six ECC pointmultiplications, whereas encapsulation includes only four ECCpoint multiplications. Finally, the computation time for pairwiseencryption key establishment is shown in Fig. 2(c).At 25MHz CPU clock rate, it requires 5 ms, which is negligiblecompared to the first two steps. This is due to the factthat this step just needs one HMAC and one 128-bit AESoperation. Next, we measure the energy consumption. As wecan see from Fig. 3, the faster the processing power (i.e. CPUclock rate) is, the more energy is consumed. However, asshown in Fig. 3(a) and Fig. 3(b), there is no differencebetween 16MHz and 25MHz while 25MHz results in fastercomputation than 16MHz. In addition, secp160r1 might bea good choice for elliptic curve selection, since it is moresecure than secp128r2 and consumes reasonable CPU timeand energy for WSNs. In our subsequent experiments, weutilize secp160r1.B. Performance ComparisonsIn this section, we benchmark our scheme with three previousECC-based key management schemes for dynamic WSNs:HKEP [15], MAKM [25] and EDDK [10]. Due to the variabilityof every schemes, we chose to compare a performance ofthe pairwise master key generation step because it is the mosttime consuming portion in each of the schemes. We measuredthe total energy consumption of computation and communicationto establish a pairwise key between two L-sensors. Forthe experiment, we implemented four schemes on TI EXP5438at 25MHz using ECC with secp160r1 parameters andAES-128 symmetric key encryption. EC point is compressedto reduce the packet size and LPL (Low Power Listening)is utilized for power conservation. Thus, sensors wake upfor short durations to check for transmissions every second.If no transmission is detected, they revert to a sleep mode.To compute the energy consumption for communication, weutilize the energy consumption data of CC2420 from [27]and IEEE 802.15.4 protocol overhead data from [31]. We considertwo scenarios as shown in Fig. 4. In the first scenario,two L-sensors lie within a 1-hop range, but the distancebetween the H-sensor and the L-sensor varies from 1 to 8(see Fig. 4 (a)). In the second scenario, two L-sensors andthe H-sensor lie in a 1-hop range, but the wireless channelconditions are changed (see Fig. 4 (b)). 
When a wirelesschannel condition is poor, a sender may attempt to resend apacket to a destination multiple times. Expected TransmissionCount (ETX) is the expected number of packet transmissionto be received at the destination without error. Fig. 5shows the energy consumption of the four schemes for apairwise key establishment when the number of hops betweenL-sensors and H-sensor, n, increases. When n is one, HKEP380 IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 10, NO. 2, FEBRUARY 2015Fig. 4. Network topology.Fig. 5. Energy consumption comparison for pairwise key establishment inscenario (a).requires more energy than our scheme because it performs sixmessage exchanges to establish a pairwise key between twoL-sensors, while our scheme needs just two messageexchanges. When n is one, MAKM consumes the leastenergy because the L-sensor performs a single AESsymmetric encryption, but other schemes run expensive ECCoperations. However, as n increases, the energy of MAKMincreases because the H-sensor is always involved in thegeneration of a pairwise key between two L-sensors. As aresult, MAKM consumes more energy than our scheme whenn is larger than one and the gap also widens when n increases.A packet delivery in a wireless sensor network is unreliabledue to unexpected obstacles, time-varying wireless channelconditions, and a low-power transceiver. Fig. 6 shows theenergy consumption of the four schemes for a pairwisekey establishment when ETX varies from 1 to 4. As ETXincreases, the energy consumption of HKEP increases morerapidly, because it requires six message exchanges. Also,HKEP is insecure, because the static private key of a nodeis exposed to the other node while the two nodes establishthe session key. Although EDDK and MAKM may showbetter performance due to low computational overhead, thedifference between MAKM and our scheme is only 0.121 Jand the difference between EDDK and our scheme is 0.045 J.Both EDDK and MAKM are insecure against the knownkeyattack and do not provide a re-keying operation for thecompromised pairwise key. EDDK also suffers from weakresilience to node compromises. Therefore, this performanceevaluation demonstrates that overall, our scheme outperformsthe existing schemes in terms of a better trade-off between thedesired security properties and energy consumption includingcomputational and communication overhead.VIII. SIMULATION OF NODE MOVEMENTSA. SettingWe developed a simulator which counts the keymanagement-related events and yields total energy consumptionfor key-management-related computations using the datain Sec. 7.1. We focus on the effects of node movement andFig. 6. Energy consumption comparison for pairwise key establishment inscenario (b).Fig. 7. Network topology for simulation.do not consider the impact of lower network layer protocols.We consider a 400×400 m2 space with 25 H-sensors placedon the grid corners (see Fig. 7). In CL-EKM, an H-sensormaintains two timers: Tbackof f and Thold to efficiently managethe cluster when a node moves. Tbackof f denotes the clusterkey update frequency. If Tbackof f = 0, the cluster key isupdated whenever a node joins or leaves. Otherwise, theH-sensor waits a time equal to Tbackof f after a node joinsor leaves to update the cluster key. Thold denotes the waitingtime before discarding the pairwise master key when aL-sensor leaves. 
If Thold = 0, the pairwise master key with an L-sensor is revoked right after the node leaves the cluster; otherwise, the H-sensor stores the pairwise master key of the departed L-sensor for a time equal to Thold. For the movement of L-sensors, we adopt two well-known mobility models used for the simulation of mobile ad hoc networks: the Random Walk Mobility Model and the Manhattan Mobility Model [32]. H-sensors are set to be stationary, since they are usually part of the static infrastructure in real-world applications.

1) Random Walk Mobility Model: The Random Walk Mobility Model mimics the unpredictable movements of many objects in nature. In our simulation, 1,000 L-sensors are randomly distributed. Each L-sensor randomly selects an H-sensor among the four H-sensors in its vicinity and establishes the pairwise key and cluster key. After the simulation starts, the L-sensors randomly select a direction and move at a random speed uniformly selected from [0, 2VL] (i.e., the mean speed is VL). The new direction and speed are randomly selected every second. If an L-sensor crosses a line, it first checks whether it is still connected with its current H-sensor. If not, the node attempts to find an H-sensor to which it had previously connected and with which it still maintains a pairwise master key. In the case of a failure, the node randomly selects an H-sensor among the surrounding H-sensors.

Fig. 8. Node movement simulation results in the Random Walk Mobility Model. (a) Energy consumption of one H-sensor for cluster key update for one day (Thold = 100 sec). (b) Energy consumption of one H-sensor for pairwise key establishment for one day (Tbackoff = 6 sec). (c) Energy consumption of one L-sensor for pairwise key establishment for one day (Tbackoff = 6 sec).
Fig. 9. Node movement simulation results in the Manhattan Mobility Model. (a) Energy consumption of one H-sensor for cluster key update for one day (Thold = 100 sec). (b) Energy consumption of one H-sensor for pairwise key establishment for one day (Tbackoff = 6 sec). (c) Energy consumption of one L-sensor for pairwise key establishment for one day (Tbackoff = 6 sec).

2) Manhattan Mobility Model: The Manhattan Mobility Model mimics the movement patterns in an urban area organized according to streets and roads. In our simulation, 1,000 L-sensors are randomly distributed and move in a grid. They can communicate with two adjacent H-sensors. Each L-sensor randomly selects its direction and chooses an H-sensor within its path as its cluster head. After the simulation starts, the L-sensors move at a random speed uniformly selected from [0, 2VL]. At each intersection, an L-sensor has a 0.5 probability of moving straight and a 0.25 probability of turning left or right. If an L-sensor arrives at a new intersection, it first chooses a new direction and checks whether it is still connected with its current H-sensor. If not, it chooses an H-sensor within its new vector as its new cluster head.

B. The Effect of Tbackoff

Fig. 8(a) shows the energy consumption of an H-sensor for cluster key updates during the course of a day in the Random Walk Mobility Model. As Tbackoff increases, the energy consumption decreases, since the number of cluster key updates is reduced. The faster VL is, the more rapidly the energy consumption decreases as Tbackoff increases, since L-sensors frequently cross the border lines. This tendency also shows in the Manhattan Mobility Model (see Fig. 9(a)).
However, the H-sensors consume more energy at low speeds than in the Random Walk Mobility Model, since the L-sensors do not change directions until they reach an intersection. A larger Tbackoff means a lower security level; thus, there is a trade-off between the security level and the energy consumption of the H-sensor. At high speeds, i.e., 16 m/s, Tbackoff should be less than 1 second, since the number of cluster key updates is minimal when Tbackoff is greater than 1 second. At low speeds, 1, 2, or 3 seconds are a reasonable choice for the H-sensors.

C. The Effect of Thold

Fig. 8(b) and Fig. 8(c) show the energy consumption of one H-sensor and one L-sensor, respectively, for pairwise key establishment in the Random Walk Mobility Model over the course of a day. The effect of Thold increases as the node velocity increases. As Thold increases, the energy consumption decreases because, when L-sensors return to their old clusters before the timers expire, no new pairwise master key establishment is necessary. As shown in Table III, the energy differences caused by node velocity and Thold are due to the differences in the frequency of pairwise key establishment; this frequency grows roughly linearly with velocity. When Thold ranges from 0 to 500 seconds, the energy consumption decreases rapidly, because many moving nodes return to their previous clusters within 500 seconds. However, when Thold ranges from 500 to 1,500 seconds, the energy consumption decreases more slowly, since the probability of nodes returning to their previous clusters is dramatically reduced. In the Manhattan Mobility Model, when Thold is small, more energy is consumed during pairwise key establishment than in the Random Walk Mobility Model, since the L-sensors return to their previous clusters with low frequency (see Fig. 9(b) and Fig. 9(c)). However, when Thold is large, the energy consumed for pairwise key establishment decreases dramatically. For instance, as shown in Table IV, when the node speed is 16 m/s and Thold is 1,000 seconds, the number of pairwise key establishments is only 24,418, which is 5.4 times smaller than in the Random Walk Mobility Model with the same settings.

TABLE III. The frequency of pairwise key establishments for one day in the Random Walk Mobility Model.
TABLE IV. The frequency of pairwise key establishments for one day in the Manhattan Mobility Model.

Similarly to Tbackoff, a larger Thold means a lower security level. Thus, Thold should be selected according to VL, the energy consumption, and the desired security level. The results in Fig. 8(c) and Fig. 9(c) show that our scheme is practical for real-world monitoring applications such as animal tracking or traffic monitoring. For example, L-sensors moving at 1 m/s in the Random Walk Mobility Model use at most 0.67 J per day if Thold is greater than 100 seconds. Also, L-sensors moving at 16 m/s in the Manhattan Mobility Model use at most 1.16 J per day if Thold is greater than 1,000 seconds. Considering that the average energy of one C-size alkaline battery is 34,398 J [23], the energy consumption of an L-sensor for pairwise key establishment is relatively small.

IX. CONCLUSIONS AND FUTURE WORKS

In this paper, we propose the first certificateless effective key management protocol (CL-EKM) for secure communication in dynamic WSNs.
CL-EKM supports efficient communication for key updates and management when a node leaves or joins a cluster, and hence ensures forward and backward key secrecy. Our scheme is resilient against node compromise, cloning, and impersonation attacks, and protects the data confidentiality and integrity. The experimental results demonstrate the efficiency of CL-EKM in resource-constrained WSNs. As future work, we plan to formulate a mathematical model for energy consumption, based on CL-EKM with various parameters related to node movements. This mathematical model will be utilized to estimate the proper values for the Thold and Tbackoff parameters based on the velocity and the desired trade-off between the energy consumption and the security level.

Distortion-Aware Concurrent Multipath Transfer for Mobile Video Streaming in Heterogeneous Wireless Networks

The massive proliferation of wireless infrastructures with complementary characteristics prompts the bandwidth aggregation for Concurrent Multipath Transfer (CMT) over heterogeneous access networks. Stream Control Transmission Protocol (SCTP) is the standard transport-layer solution to enable CMT in multihomed communication environments. However, delivering high-quality streaming video with the existing CMT solutions still remains problematic due to the stringent quality of service (QoS) requirements and path asymmetry in heterogeneous wireless networks.

In this paper, we advance the state of the art by introducing video distortion into the decision process of multipath data transfer. The proposed distortion-aware concurrent multipath transfer (CMT-DA) solution includes three phases: 1) per-path status estimation and congestion control; 2) quality-optimal video flow rate allocation; 3) delay and loss controlled data retransmission. The term ‘flow rate allocation’ indicates dynamically picking appropriate access networks and assigning the transmission rates.

We analytically formulate the data distribution over multiple communication paths to minimize the end-to-end video distortion and derive the solution based on utility maximization theory. The performance of the proposed CMT-DA is evaluated through extensive semi-physical emulations in Exata involving H.264 video streaming. Experimental results show that CMT-DA outperforms the reference schemes in terms of video peak signal-to-noise ratio (PSNR), goodput, and inter-packet delay.

1.2 INTRODUCTION:

During the past few years, mobile multimedia services such as video streaming and online gaming have become "killer applications", and the video traffic headed for hand-held devices has experienced explosive growth. The latest market research conducted by Cisco indicates that video streaming accounts for 53 percent of mobile Internet traffic; in parallel, global mobile data traffic is expected to increase 11-fold in the next five years. Another ongoing trend feeding this tremendous growth is the popularity of powerful mobile terminals (e.g., smartphones and tablets such as the iPad), which makes it easy for individual users to access the Internet and watch videos from everywhere [4].

Despite the rapid advancements in network infrastructures, it is still challenging to deliver high-quality streaming video over wireless platforms. On one hand, Wi-Fi networks are limited in radio coverage and mobility support for individual users; on the other hand, cellular networks can well sustain user mobility, but their bandwidth is often inadequate to support throughput-demanding video applications. Although 4G LTE and WiMAX can provide higher peak data rates and extended coverage, the available capacity will still be insufficient compared to the ever-growing video data traffic.

The complementary characteristics of heterogeneous access networks prompt the bandwidth aggregation for concurrent multipath transfer (CMT) to enhance transmission throughput and reliability (see Fig. 1). With the emergence of multihomed/multinetwork terminals, CMT is considered a promising solution for supporting video streaming in future wireless networking. The key research issue in multihomed video delivery over heterogeneous wireless networks is the effective integration of the limited channel resources available to provide adequate quality of service (QoS). Stream Control Transmission Protocol (SCTP) is the standard transport-layer solution that exploits the multihoming feature to concurrently distribute data across multiple independent end-to-end paths.

Therefore, many CMT solutions have been proposed to optimize the delay, throughput, or reliability performance for efficient data delivery. However, due to the special characteristics of streaming video, these network-level criteria do not always improve the perceived media quality. For instance, a real-time video application encoded at a constant bit rate (CBR) may not effectively leverage the throughput gains, since its streaming rate is typically fixed or bounded by the encoding scheme. In addition, involving a communication path with available bandwidth but long delay in the multipath video delivery may degrade the streaming video quality, as the end-to-end distortion increases. Consequently, leveraging CMT for high-quality streaming video over heterogeneous wireless networks remains largely unexplored.

In this paper, we investigate the problem by introducing video distortion into the decision process of multipath data transfer over heterogeneous wireless networks. The proposed Distortion-Aware Concurrent Multipath Transfer (CMT-DA) solution is a transport-layer protocol and includes three phases: 1) per-path status estimation and congestion control to exploit the available channel resources; 2) data flow rate allocation to minimize the end-to-end video distortion; 3) delay and loss constrained data retransmission for bandwidth conservation. The detailed descriptions of the proposed solution will be presented in Section 4; an illustrative sketch of the retransmission decision in phase 3 is given after the contribution list below. Specifically, the contributions of this paper can be summarized as follows.

  • An effective CMT solution that uses path status estimation, flow rate allocation, and retransmission control to optimize real-time video quality in integrated heterogeneous wireless networks.

  • A mathematical formulation of video data distribution over parallel communication paths to minimize the end-to-end distortion, in which utility maximization theory is employed to derive the optimal transmission rate assignment.
  • Extensive semi-physical emulations in Exata involving real-time H.264 video streaming to evaluate the performance of the proposed solution.
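As one concrete example of the third phase listed above, a lost packet is only worth retransmitting if it can still reach the receiver before its playback deadline. The plain-Java sketch below illustrates this deadline check; the class name, the use of half the smoothed RTT as a one-way delay estimate, and the safety margin are assumptions made for this report, not the exact rule derived in the paper.

// Toy illustration of delay- and loss-controlled retransmission (phase 3).
public class RetransmissionControl {

    static final double PROCESSING_MARGIN_MS = 5.0;  // assumed safety margin

    // A lost packet is retransmitted only if it can still arrive before its
    // playback deadline on the chosen path; otherwise the bandwidth is saved.
    static boolean shouldRetransmit(double pathRttMs, double remainingDeadlineMs) {
        return pathRttMs / 2 + PROCESSING_MARGIN_MS < remainingDeadlineMs;
    }

    public static void main(String[] args) {
        System.out.println(shouldRetransmit(80, 120));  // true: worth retransmitting
        System.out.println(shouldRetransmit(80, 30));   // false: would miss the deadline
    }
}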

1.3 LITERATURE SURVEY:

CMT-QA: QUALITY-AWARE ADAPTIVE CONCURRENT MULTIPATH DATA TRANSFER IN HETEROGENEOUS WIRELESS NETWORKS

AUTHOR: C. Xu, T. Liu, J. Guan, H. Zhang, and G. M. Muntean,

PUBLICATION: IEEE Trans. Mobile Comput., vol. 12, no. 11, pp. 2193–2205, Nov. 2013.

EXPLANATION:

Mobile devices equipped with multiple network interfaces can increase their throughput by making use of parallel transmissions over multiple paths and bandwidth aggregation, enabled by the Stream Control Transmission Protocol (SCTP). However, the different bandwidth and delay of the multiple paths will cause data to be received out of order, and in the absence of mechanisms to correct this, serious application-level performance degradations will occur. This paper proposes a novel quality-aware adaptive concurrent multipath transfer solution (CMT-QA) that utilizes SCTP for FTP-like data transmission and real-time video delivery in wireless heterogeneous networks. CMT-QA regularly monitors and analyses each path's data handling capability and makes data delivery adaptation decisions to select the qualified paths for concurrent data transfer. CMT-QA includes a series of mechanisms to distribute data chunks over multiple paths intelligently and to control the data traffic rate of each path independently. CMT-QA's goal is to mitigate out-of-order data reception by reducing the reordering delay and unnecessary fast retransmissions. CMT-QA can effectively differentiate between different types of packet loss to avoid unreasonable congestion window adjustments for retransmissions. Simulations show how CMT-QA outperforms existing solutions in terms of performance and quality of service.

PERFORMANCE ANALYSIS OF PROBABILISTIC MULTIPATH TRANSMISSION OF VIDEO STREAMING TRAFFIC OVER MULTI-RADIO WIRELESS DEVICES

AUTHOR: W. Song and W. Zhuang

PUBLICATION: IEEE Trans. Wireless Commun., vol. 11, no. 4, pp. 1554–1564, 2012.

EXPLANATION:

Popular smart wireless devices have become equipped with multiple radio interfaces. Multihoming support can be enabled to allow for multiple simultaneous associations with heterogeneous networks. In this study, we focus on video streaming traffic and propose analytical approaches to evaluate the packet-level and call-level performance of a multipath transmission scheme, which sends video traffic bursts over multiple available channels in a probabilistic manner. A probability generating function (PGF) and z-transform method is applied to derive the PGF of packet delay and any arbitrary moment in general. In particular, we can obtain the average delay, delay jitter, and delay outage probability. The essential characteristics of video traffic are taken into account, such as deterministic burst intervals, highly dynamic burst length, and batch arrivals of transmission packets. The video substream traffic resulting from the probabilistic flow splitting is characterized by means of zero-inflated models. Further, the call-level performance, in terms of flow blocking probability and system throughput, is evaluated with a three-dimensional Markov process and compared with that of an always-best access selection. The numerical and simulation results demonstrate the effectiveness of our analysis framework and the performance gain of multipath transmission.
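To illustrate the probabilistic flow splitting analysed in this work, the short Java sketch below distributes the packets of one video burst over two radio channels according to a fixed splitting probability. This is only a toy model written for this report; the splitting probability, burst size, and class name are assumed values, and the analytical machinery (PGF, zero-inflated models) is of course not reproduced here.

// Toy illustration of probabilistic flow splitting over two radio interfaces.
import java.util.Random;

public class ProbabilisticSplitter {
    public static void main(String[] args) {
        double p = 0.6;         // probability of sending a packet on channel 1 (assumed)
        int burstSize = 20;     // packets per video burst (assumed)
        Random rng = new Random();

        int onChannel1 = 0, onChannel2 = 0;
        for (int i = 0; i < burstSize; i++) {
            if (rng.nextDouble() < p) onChannel1++; else onChannel2++;
        }
        System.out.println("burst of " + burstSize + " packets: "
                + onChannel1 + " on channel 1, " + onChannel2 + " on channel 2");
    }
}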

AN END-TO-END VIRTUAL PATH CONSTRUCTION SYSTEM FOR STABLE LIVE VIDEO STREAMING OVER HETEROGENEOUS WIRELESS NETWORKS

AUTHOR: S. Han, H. Joo, D. Lee, and H. Song

PUBLICATION: IEEE J. Sel. Areas Commun., vol. 29, no. 5, pp. 1032–1041, May 2011.

EXPLANATION:

In this paper, we propose an effective end-to-end virtual path construction system, which exploits path diversity over heterogeneous wireless networks. The goal of the proposed system is to provide a high quality live video streaming service over heterogeneous wireless networks. First, we propose a packetization-aware fountain code to integrate multiple physical paths efficiently and increase the fountain decoding probability over wireless packet switching networks. Second, we present a simple but effective physical path selection algorithm to maximize the effective video encoding rate while satisfying delay and fountain decoding failure rate constraints. The proposed system is fully implemented in software and examined over real WLAN and HSDPA networks.

CHAPTER 2

2.0 SYSTEM ANALYSIS

2.1 EXISTING SYSTEM:

Joint source-channel coding (JSCC) has been an effective approach for designing error-resilient wireless video broadcasting systems in recent years. It has attracted increasing interest in both the research community and industry because it gives better results for robust layered video transmission over error-prone channels, and a variety of such techniques have been developed over the years. However, there are still many open problems in terms of how to serve heterogeneous users with diverse screen features and variable reception performance in a wireless video broadcast system. One particularly challenging problem of this heterogeneous quality-of-service (QoS) video provision is that users would prefer flexible, lower-quality video that matches their screens while, at the same time, the video stream should be reliably received.

The main technical difficulties are as follows:

  • A distinctive characteristic of current wireless broadcast systems is that the receivers are highly heterogeneous in terms of their terminal processing capabilities and available bandwidths. On the source side, scalable video coding (SVC) has been proposed as an attractive solution to this problem.
  • However, in order to support flexible video broadcasting, the scalable video sources need to provide adaptation ability through a variety of schemes, such as scalable video stream extraction, layer generation with different priorities, and summarization, before they can be transmitted over the error-prone networks.


2.1.1 DISADVANTAGES:

  • Existing layered video data is very sensitive to transmission failures, so the transmission must be more reliable, have low overhead, and support large numbers of devices with heterogeneous characteristics. In broadcast and multicast networks, conventional schemes such as adaptive retransmission have their limitations; for example, retransmission may lead to the implosion problem.
  • Forward error correction (FEC) and unequal error protection (UEP) are employed to provide QoS support for video transmission. However, in order to keep the investment in broadcasting system deployment as low as possible, the server side must be designed to be more scalable, reliable, and independent, and to support a vast number of autonomous receivers. Suitable FEC approaches are expected that can eliminate retransmissions and lower the unnecessary reception overhead at each receiver side.
  • Conventionally, joint source and channel coding is designed with little consideration of heterogeneous characteristics, and most of the above challenges are ignored in practical video broadcasting systems. This leads to the need for heterogeneous QoS video provision in broadcasting networks. This motivates studying hybrid-scalable video from the viewpoint of a new quality metric, so as to support users' heterogeneous requirements.


2.2 PROPOSED SYSTEM:

We propose the Distortion-Aware Concurrent Multipath Transfer (CMT-DA) solution, a transport-layer protocol that includes three phases: 1) per-path status estimation and congestion control to exploit the available channel resources; 2) data flow rate allocation to minimize the end-to-end video distortion; 3) delay and loss constrained data retransmission for bandwidth conservation. Together, these form an effective CMT solution that uses path status estimation, flow rate allocation, and retransmission control to optimize real-time video quality in integrated heterogeneous wireless networks.

A related quality-aware adaptive concurrent multipath transfer (CMT-QA) scheme distributes the data based on estimated path quality. Although the path status is an important factor that affects the scheduling policy, the application requirements should also be considered to guarantee the QoS. Basically, the proposed CMT-DA differs from CMT-QA in that we take the video distortion as the benchmark; moreover, the solutions in CMT-DA (path status estimation, flow rate allocation, and retransmission control) are significantly different from those in CMT-QA. In other related research, a realistic evaluation tool-set has been proposed to analyze and optimize the performance of multimedia distribution when taking advantage of CMT-based multihoming SCTP solutions.

2.2.1 ADVANTAGES:

  • We propose a novel out-of-order scheduling approach for in-order arrival of the data chunks in CMT-DA, based on the progressive water-filling algorithm (a sketch combining this allocation step with the ACK-driven path status estimation is given after this list).
  • We propose an end-to-end virtual path construction system that exploits the path diversity in heterogeneous wireless networks based on fountain codes. The encoded multipath streaming model proposed by Chow et al. is a joint multipath and FEC approach for real-time live streaming applications; the authors provide an asymptotic analysis and derive a closed-form solution for the FEC packet allocation.
  • The major components at the sender side are the parameter control unit, the flow rate allocator, and the retransmission controller. The parameter control unit is responsible for processing the acknowledgements (ACKs) fed back from the receiver, estimating the path status, and adapting the congestion window size. The delay and loss requirements are imposed by the video applications to achieve the target video quality.
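The following self-contained Java sketch combines a simple ACK-driven path status estimator with a water-filling-style rate allocation: paths are ranked by their estimated quality and filled up to their available bandwidth until the video source rate is covered. The EWMA weights, the quality metric, the bandwidth figures, and all class names are assumptions made for this report, not the exact rules used by CMT-DA.

// Illustrative path status estimation and water-filling-style rate allocation.
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class RateAllocationSketch {

    static class Path {
        String name;
        double srttMs = 100.0;       // smoothed round-trip time
        double lossRate = 0.0;       // smoothed loss rate
        double bandwidthKbps;        // estimated available bandwidth
        double allocatedKbps = 0.0;  // result of the allocation step

        Path(String name, double bandwidthKbps) {
            this.name = name;
            this.bandwidthKbps = bandwidthKbps;
        }

        // Parameter control unit: update estimates from one ACK (EWMA smoothing assumed).
        void onAck(double rttSampleMs, boolean lost) {
            srttMs = 0.875 * srttMs + 0.125 * rttSampleMs;
            lossRate = 0.9 * lossRate + (lost ? 0.1 : 0.0);
        }

        // Lower value = better path (toy quality metric).
        double cost() { return srttMs * (1.0 + 10.0 * lossRate); }
    }

    // Flow rate allocator: fill the best paths first until the source rate is covered.
    static void allocate(List<Path> paths, double sourceRateKbps) {
        Collections.sort(paths, new Comparator<Path>() {
            public int compare(Path a, Path b) { return Double.compare(a.cost(), b.cost()); }
        });
        double remaining = sourceRateKbps;
        for (Path p : paths) {
            p.allocatedKbps = Math.min(remaining, p.bandwidthKbps);
            remaining -= p.allocatedKbps;
        }
    }

    public static void main(String[] args) {
        List<Path> paths = new ArrayList<Path>();
        paths.add(new Path("cellular", 800));
        paths.add(new Path("wlan", 2000));
        paths.get(0).onAck(120, false);   // sample feedback on the cellular path
        paths.get(1).onAck(40, false);    // sample feedback on the WLAN path
        allocate(paths, 1500);            // 1.5 Mbps video source
        for (Path p : paths) {
            System.out.println(p.name + ": " + p.allocatedKbps + " kbps");
        }
    }
}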

2.3 HARDWARE & SOFTWARE REQUIREMENTS:

2.3.1 HARDWARE REQUIREMENT:

  • Processor          –  Pentium IV
  • Speed              –  1.1 GHz
  • RAM                –  256 MB (minimum)
  • Hard Disk          –  20 GB
  • Floppy Drive       –  1.44 MB
  • Keyboard           –  Standard Windows keyboard
  • Mouse              –  Two- or three-button mouse
  • Monitor            –  SVGA

 

2.3.2 SOFTWARE REQUIREMENTS:

  • Operating System                   :           Windows XP or Win7
  • Front End                                :           JAVA JDK 1.7
  • Tools                                       :           Netbeans or Eclipse
  • Script                                       :           Java Script
  • Document                               :           MS-Office 2007

CHAPTER 3

3.0 SYSTEM DESIGN:

Data Flow Diagram / Use Case Diagram / Flow Diagram:

  • The data flow diagram (DFD) is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on these data, and the output data generated by the system.
  • The DFD is one of the most important modeling tools. It is used to model the system components: the system processes, the data used by the processes, the external entities that interact with the system, and the information flows in the system.
  • The DFD shows how information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.
  • A DFD may be used to represent a system at any level of abstraction and may be partitioned into levels that represent increasing information flow and functional detail.

NOTATION:

SOURCE OR DESTINATION OF DATA:

External sources or destinations, which may be people, organizations, or other entities.

DATA STORE:

Here the data referenced by a process is stored and retrieved.

PROCESS:

People, procedures, or devices that produce data; the physical component is not identified.

DATA FLOW:

Data moves in a specific direction from an origin to a destination. The data flow is a “packet” of data.

MODELING RULES:

There are several common modeling rules when creating DFDs:

  1. All processes must have at least one data flow in and one data flow out.
  2. All processes should modify the incoming data, producing new forms of outgoing data.
  3. Each data store must be involved with at least one data flow.
  4. Each external entity must be involved with at least one data flow.
  5. A data flow must be attached to at least one process.


3.1 ARCHITECTURE DIAGRAM

3.2 DATAFLOW DIAGRAM

UML DIAGRAMS:

3.2 USE CASE DIAGRAM:

3.3 CLASS DIAGRAM:

3.4 SEQUENCE DIAGRAM:

3.5 ACTIVITY DIAGRAM:

CHAPTER 4

4.0 IMPLEMENTATION:

4.1 ALGORITHM

4.2 MODULES:

4.3 MODULE DESCRIPTION:

CHAPTER 5

5.0 SYSTEM STUDY:

5.1 FEASIBILITY STUDY:

The feasibility of the project is analyzed in this phase, and a business proposal is put forth with a very general plan for the project and some cost estimates. During system analysis, the feasibility study of the proposed system is carried out to ensure that the proposed system is not a burden to the company. For feasibility analysis, some understanding of the major requirements for the system is essential.

Three key considerations involved in the feasibility analysis are 

  • ECONOMICAL FEASIBILITY
  • TECHNICAL FEASIBILITY
  • SOCIAL FEASIBILITY

5.1.1 ECONOMICAL FEASIBILITY:     

This study is carried out to check the economic impact that the system will have on the organization. The amount of funds that the company can pour into the research and development of the system is limited, so the expenditures must be justified. The developed system is well within the budget, and this was achieved because most of the technologies used are freely available. Only the customized products had to be purchased.

 

5.1.2 TECHNICAL FEASIBILITY   

This study is carried out to check the technical feasibility, that is, the technical requirements of the system. Any system developed must not place a high demand on the available technical resources, as this would lead to high demands being placed on the client. The developed system must have modest requirements, since only minimal or no changes are required for implementing this system.

5.1.3 SOCIAL FEASIBILITY:  

The aspect of study is to check the level of acceptance of the system by the user. This includes the process of training the user to use the system efficiently. The user must not feel threatened by the system, instead must accept it as a necessity. The level of acceptance by the users solely depends on the methods that are employed to educate the user about the system and to make him familiar with it. His level of confidence must be raised so that he is also able to make some constructive criticism, which is welcomed, as he is the final user of the system.

5.2 SYSTEM TESTING:

Testing is the process of checking whether the developed system works according to the original objectives and requirements. It is a set of activities that can be planned in advance and conducted systematically. Testing is vital to the success of the system. System testing makes the logical assumption that if all the parts of the system are correct, the overall goal will be successfully achieved. Inadequate testing, or no testing at all, leads to errors that may not appear until many months later.

This creates two problems: the time lag between the cause and the appearance of the problem, and the effect of system errors on the files and records within the system. A small system error can conceivably explode into a much larger problem. Effective testing early in the process translates directly into long-term cost savings from a reduced number of errors. Another reason for system testing is its utility as a user-oriented vehicle before implementation. The best program is worthless if it does not produce correct outputs.

5.2.1 UNIT TESTING:

Description: Test for application window properties.
Expected result: All the properties of the windows are to be properly aligned and displayed.

Description: Test for mouse operations.
Expected result: All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions.

A program represents the logical elements of a system. For a program to run satisfactorily, it must compile, process test data correctly, and tie in properly with other programs. Achieving an error-free program is the responsibility of the programmer. Program testing checks for two types of errors: syntax and logical. A syntax error is a program statement that violates one or more rules of the language in which it is written. An improperly defined field dimension or omitted keywords are common syntax errors. These errors are shown through error messages generated by the computer. For logic errors, the programmer must examine the output carefully.

5.1.2 FUNCTIONAL TESTING:

Functional testing of an application is used to prove that the application delivers correct results, using enough inputs to give an adequate level of confidence that it will work correctly for all sets of inputs. The functional testing will need to prove that the application works for each client type and that the personalization functions work correctly. When a program is tested, the actual output is compared with the expected output. When there is a discrepancy, the sequence of instructions must be traced to determine the problem. The process is facilitated by breaking the program into self-contained portions, each of which can be checked at certain key points. The idea is to compare program values against desk-calculated values to isolate the problems.

Description: Test for all modules.
Expected result: All peers should communicate in the group.

Description: Test for various peers in a distributed network framework, as it displays all users available in the group.
Expected result: The result after execution should give the accurate result.


5.1. 3 NON-FUNCTIONAL TESTING:

Non-functional software testing encompasses a rich spectrum of testing strategies, describing the expected results for every test case. It uses symbolic analysis techniques. This testing is used to check that an application will work in the operational environment. Non-functional testing includes:

  • Load testing
  • Performance testing
  • Usability testing
  • Reliability testing
  • Security testing

5.1.4 LOAD TESTING:

An important tool for implementing system tests is a load generator. A load generator is essential for testing quality requirements such as performance and stress. A load can be a real load; that is, the system can be put under real usage by having actual users connected to it, who will generate test input data for the system test.

Description: It is necessary to ascertain that the application behaves correctly under loads when the 'Server busy' response is received.
Expected result: Should designate another active node as a server.


5.1.5 PERFORMANCE TESTING:

Performance tests are utilized in order to determine the widely defined performance of the software system such as execution time associated with various parts of the code, response time and device utilization. The intent of this testing is to identify weak points of the software system and quantify its shortcomings.

Description: This is required to assure that the application performs adequately, having the capability to handle many peers, delivering its results in the expected time and using an acceptable level of resources; this is an aspect of operational management.
Expected result: Should handle large input values and produce accurate results in the expected time.


5.1.6 RELIABILITY TESTING:

Software reliability is the ability of a system or component to perform its required functions under stated conditions for a specified period of time, and this is ensured in this testing. Reliability can be expressed as the ability of the software to reveal defects under testing conditions, according to the specified requirements. It is the probability that a software system will operate without failure under given conditions for a given time interval, and it focuses on the behavior of the software element. It forms a part of the software quality control effort.

Description: This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in providing the application.
Expected result: In case of failure of the server, an alternate server should take over the job.


5.1.7 SECURITY TESTING:

Security testing evaluates system characteristics that relate to the availability, integrity and confidentiality of the system data and services. Users/Clients should be encouraged to make sure their security needs are very clearly known at requirements time, so that the security issues can be addressed by the designers and testers.

Description: Checking that the user identification is authenticated.
Expected result: In case of failure, the user should not be connected in the framework.

Description: Check whether group keys in a tree are shared by all peers.
Expected result: The peers should know the group key in the same group.


5.1.8 WHITE BOX TESTING:

White box testing, sometimes called glass-box testing, is a test case design method that uses the control structure of the procedural design to derive test cases. Using the white box testing method, the software engineer can derive test cases. White box testing focuses on the inner structure of the software to be tested.

Description: Exercise all logical decisions on their true and false sides.
Expected result: All the logical decisions must be valid.

Description: Execute all loops at their boundaries and within their operational bounds.
Expected result: All the loops must be finite.

Description: Exercise internal data structures to ensure their validity.
Expected result: All the data structures must be valid.


5.1.9 BLACK BOX TESTING:

Black box testing, also called behavioral testing, focuses on the functional requirements of the software. That is, black box testing enables the software engineer to derive sets of input conditions that will fully exercise all functional requirements for a program. Black box testing is not an alternative to white box techniques. Rather, it is a complementary approach that is likely to uncover a different class of errors than white box methods. Black box testing attempts to find errors with a focus on the inputs, outputs, and principal functions of a software module. The starting point of black box testing is either a specification or code. The contents of the box are hidden, and the stimulated software should produce the desired results.

Description: To check for incorrect or missing functions.
Expected result: All the functions must be valid.

Description: To check for interface errors.
Expected result: The entire interface must function normally.

Description: To check for errors in data structures or external database access.
Expected result: The database updates and retrieval must be done correctly.

Description: To check for initialization and termination errors.
Expected result: All the functions and data structures must be initialized properly and terminated normally.

All the above system testing strategies are carried out during development, as the documentation and institutionalization of the proposed goals and related policies are essential.

CHAPTER 6

6.0 SOFTWARE DESCRIPTION:

 

6.1 JAVA TECHNOLOGY:

Java technology is both a programming language and a platform.

 

The Java Programming Language

 

The Java programming language is a high-level language that can be characterized by all of the following buzzwords:

  • Simple
    • Architecture neutral
    • Object oriented
    • Portable
    • Distributed     
    • High performance
    • Interpreted     
    • Multithreaded
    • Robust
    • Dynamic
    • Secure     

With most programming languages, you either compile or interpret a program so that you can run it on your computer. The Java programming language is unusual in that a program is both compiled and interpreted. With the compiler, first you translate a program into an intermediate language called Java byte codes —the platform-independent codes interpreted by the interpreter on the Java platform. The interpreter parses and runs each Java byte code instruction on the computer. Compilation happens just once; interpretation occurs each time the program is executed. The following figure illustrates how this works.

You can think of Java byte codes as the machine code instructions for the Java Virtual Machine (Java VM). Every Java interpreter, whether it’s a development tool or a Web browser that can run applets, is an implementation of the Java VM. Java byte codes help make “write once, run anywhere” possible. You can compile your program into byte codes on any platform that has a Java compiler. The byte codes can then be run on any implementation of the Java VM. That means that as long as a computer has a Java VM, the same program written in the Java programming language can run on Windows 2000, a Solaris workstation, or on an iMac.
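As a minimal example of this compile-then-interpret workflow, the following program can be compiled once with javac and then run unchanged on any Java VM (the class name is only an illustration):

// HelloWorld.java — compiled once to byte codes, then interpreted by any Java VM.
public class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello, world!");
    }
}

// Compile to platform-independent byte codes:   javac HelloWorld.java
// Run the byte codes on any Java VM:            java HelloWorld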

6.2 THE JAVA PLATFORM:

A platform is the hardware or software environment in which a program runs. We’ve already mentioned some of the most popular platforms like Windows 2000, Linux, Solaris, and MacOS. Most platforms can be described as a combination of the operating system and hardware. The Java platform differs from most other platforms in that it’s a software-only platform that runs on top of other hardware-based platforms.

The Java platform has two components:

  • The Java Virtual Machine (Java VM)
  • The Java Application Programming Interface (Java API)

You’ve already been introduced to the Java VM. It’s the base for the Java platform and is ported onto various hardware-based platforms.

The Java API is a large collection of ready-made software components that provide many useful capabilities, such as graphical user interface (GUI) widgets. The Java API is grouped into libraries of related classes and interfaces; these libraries are known as packages. The next section, "What Can Java Technology Do?", highlights the functionality provided by some of the packages in the Java API.

The following figure depicts a program that’s running on the Java platform. As the figure shows, the Java API and the virtual machine insulate the program from the hardware.

Native code is code that, after compilation, runs on a specific hardware platform. As a platform-independent environment, the Java platform can be a bit slower than native code. However, smart compilers, well-tuned interpreters, and just-in-time byte code compilers can bring performance close to that of native code without threatening portability.

6.3 WHAT CAN JAVA TECHNOLOGY DO?

The most common types of programs written in the Java programming language are applets and applications. If you’ve surfed the Web, you’re probably already familiar with applets. An applet is a program that adheres to certain conventions that allow it to run within a Java-enabled browser.

However, the Java programming language is not just for writing cute, entertaining applets for the Web. The general-purpose, high-level Java programming language is also a powerful software platform. Using the generous API, you can write many types of programs.

An application is a standalone program that runs directly on the Java platform. A special kind of application known as a server serves and supports clients on a network. Examples of servers are Web servers, proxy servers, mail servers, and print servers. Another specialized program is a servlet.

A servlet can almost be thought of as an applet that runs on the server side. Java Servlets are a popular choice for building interactive web applications, replacing the use of CGI scripts. Servlets are similar to applets in that they are runtime extensions of applications. Instead of working in browsers, though, servlets run within Java Web servers, configuring or tailoring the server.
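As a small illustration of the servlet model described above, the sketch below answers HTTP GET requests. It assumes the standard javax.servlet API is available on the classpath; the class name and the response text are placeholders.

// A minimal servlet that answers HTTP GET requests inside a Java web server.
import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class HelloServlet extends HttpServlet {
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        response.setContentType("text/html");
        response.getWriter().println("<h1>Hello from a servlet</h1>");
    }
}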

How does the API support all these kinds of programs? It does so with packages of software components that provide a wide range of functionality. Every full implementation of the Java platform gives you the following features:

  • The essentials: Objects, strings, threads, numbers, input and output, data structures, system properties, date and time, and so on.
  • Applets: The set of conventions used by applets.
  • Networking: URLs, TCP (Transmission Control Protocol), UDP (User Data gram Protocol) sockets, and IP (Internet Protocol) addresses.
  • Internationalization: Help for writing programs that can be localized for users worldwide. Programs can automatically adapt to specific locales and be displayed in the appropriate language.
  • Security: Both low level and high level, including electronic signatures, public and private key management, access control, and certificates.
  • Software components: Known as JavaBeansTM, can plug into existing component architectures.
  • Object serialization: Allows lightweight persistence and communication via Remote Method Invocation (RMI); a short example follows this list.
  • Java Database Connectivity (JDBCTM): Provides uniform access to a wide range of relational databases.
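As a short illustration of object serialization (the nested class and file name below are placeholders introduced for this example):

// Write an object to a file and read it back using Java object serialization.
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializationExample {

    static class Message implements Serializable {
        private static final long serialVersionUID = 1L;
        String text;
        Message(String text) { this.text = text; }
    }

    public static void main(String[] args) throws Exception {
        ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream("message.ser"));
        out.writeObject(new Message("hello"));
        out.close();

        ObjectInputStream in = new ObjectInputStream(new FileInputStream("message.ser"));
        Message m = (Message) in.readObject();
        in.close();
        System.out.println(m.text);
    }
}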

The Java platform also has APIs for 2D and 3D graphics, accessibility, servers, collaboration, telephony, speech, animation, and more. The following figure depicts what is included in the Java 2 SDK.

 

6.4 HOW WILL JAVA TECHNOLOGY CHANGE MY LIFE?

We can’t promise you fame, fortune, or even a job if you learn the Java programming language. Still, it is likely to make your programs better and requires less effort than other languages. We believe that Java technology will help you do the following:

  • Get started quickly: Although the Java programming language is a powerful object-oriented language, it’s easy to learn, especially for programmers already familiar with C or C++.
  • Write less code: Comparisons of program metrics (class counts, method counts, and so on) suggest that a program written in the Java programming language can be four times smaller than the same program in C++.
  • Write better code: The Java programming language encourages good coding practices, and its garbage collection helps you avoid memory leaks. Its object orientation, its JavaBeans component architecture, and its wide-ranging, easily extendible API let you reuse other people’s tested code and introduce fewer bugs.
  • Develop programs more quickly: Your development time may be as much as twice as fast versus writing the same program in C++. Why? You write fewer lines of code and it is a simpler programming language than C++.
  • Avoid platform dependencies with 100% Pure Java: You can keep your program portable by avoiding the use of libraries written in other languages. The 100% Pure JavaTM Product Certification Program has a repository of historical process manuals, white papers, brochures, and similar materials online.
  • Write once, run anywhere: Because 100% Pure Java programs are compiled into machine-independent byte codes, they run consistently on any Java platform.
  • Distribute software more easily: You can upgrade applets easily from a central server. Applets take advantage of the feature of allowing new classes to be loaded “on the fly,” without recompiling the entire program.

 

6.5 ODBC:

 

Microsoft Open Database Connectivity (ODBC) is a standard programming interface for application developers and database systems providers. Before ODBC became a de facto standard for Windows programs to interface with database systems, programmers had to use proprietary languages for each database they wanted to connect to. Now, ODBC has made the choice of the database system almost irrelevant from a coding perspective, which is as it should be. Application developers have much more important things to worry about than the syntax that is needed to port their program from one database to another when business needs suddenly change.

Through the ODBC Administrator in Control Panel, you can specify the particular database that is associated with a data source that an ODBC application program is written to use. Think of an ODBC data source as a door with a name on it. Each door will lead you to a particular database. For example, the data source named Sales Figures might be a SQL Server database, whereas the Accounts Payable data source could refer to an Access database. The physical database referred to by a data source can reside anywhere on the LAN.

The ODBC system files are not installed on your system by Windows 95. Rather, they are installed when you setup a separate database application, such as SQL Server Client or Visual Basic 4.0. When the ODBC icon is installed in Control Panel, it uses a file called ODBCINST.DLL. It is also possible to administer your ODBC data sources through a stand-alone program called ODBCADM.EXE. There is a 16-bit and a 32-bit version of this program and each maintains a separate list of ODBC data sources.

From a programming perspective, the beauty of ODBC is that the application can be written to use the same set of function calls to interface with any data source, regardless of the database vendor. The source code of the application doesn’t change whether it talks to Oracle or SQL Server. We only mention these two as an example. There are ODBC drivers available for several dozen popular database systems. Even Excel spreadsheets and plain text files can be turned into data sources. The operating system uses the Registry information written by ODBC Administrator to determine which low-level ODBC drivers are needed to talk to the data source (such as the interface to Oracle or SQL Server). The loading of the ODBC drivers is transparent to the ODBC application program. In a client/server environment, the ODBC API even handles many of the network issues for the application programmer.

The advantages of this scheme are so numerous that you are probably thinking there must be some catch. The only disadvantage of ODBC is that it isn’t as efficient as talking directly to the native database interface. ODBC has had many detractors make the charge that it is too slow. Microsoft has always claimed that the critical factor in performance is the quality of the driver software that is used. In our humble opinion, this is true. The availability of good ODBC drivers has improved a great deal recently. And anyway, the criticism about performance is somewhat analogous to those who said that compilers would never match the speed of pure assembly language. Maybe not, but the compiler (or ODBC) gives you the opportunity to write cleaner programs, which means you finish sooner. Meanwhile, computers get faster every year.

6.6 JDBC:

In an effort to set an independent database standard API for Java; Sun Microsystems developed Java Database Connectivity, or JDBC. JDBC offers a generic SQL database access mechanism that provides a consistent interface to a variety of RDBMSs. This consistent interface is achieved through the use of “plug-in” database connectivity modules, or drivers. If a database vendor wishes to have JDBC support, he or she must provide the driver for each platform that the database and Java run on.

To gain a wider acceptance of JDBC, Sun based JDBC’s framework on ODBC. As you discovered earlier in this chapter, ODBC has widespread support on a variety of platforms. Basing JDBC on ODBC will allow vendors to bring JDBC drivers to market much faster than developing a completely new connectivity solution.

JDBC was announced in March of 1996. It was released for a 90 day public review that ended June 8, 1996. Because of user input, the final JDBC v1.0 specification was released soon after.

The remainder of this section will cover enough information about JDBC for you to know what it is about and how to use it effectively. This is by no means a complete overview of JDBC. That would fill an entire book.
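As a brief illustration of how JDBC is used from application code, the sketch below obtains a connection, runs a parameterized query, and reads the results. The connection URL, table, column names, and credentials are placeholders; the actual driver and URL depend on the database used in the project.

// Minimal JDBC usage: obtain a connection, run a query, read the results.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class JdbcExample {
    public static void main(String[] args) throws SQLException {
        // Placeholder connection string; a vendor driver URL or ODBC bridge name would go here.
        String url = "jdbc:odbc:SalesFigures";
        try (Connection con = DriverManager.getConnection(url, "user", "password");
             PreparedStatement ps = con.prepareStatement(
                     "SELECT id, name FROM customers WHERE id = ?")) {
            ps.setInt(1, 42);                       // bind the query parameter
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getInt("id") + " " + rs.getString("name"));
                }
            }
        }
    }
}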

 

6.7 JDBC Goals:

Few software packages are designed without goals in mind. JDBC is one that, because of its many goals, drove the development of the API. These goals, in conjunction with early reviewer feedback, have finalized the JDBC class library into a solid framework for building database applications in Java.

The goals that were set for JDBC are important. They will give you some insight as to why certain classes and functionalities behave the way they do. The eight design goals for JDBC are as follows:

SQL Level API

The designers felt that their main goal was to define a SQL interface for Java. Although not the lowest database interface level possible, it is at a low enough level for higher-level tools and APIs to be created. Conversely, it is at a high enough level for application programmers to use it confidently. Attaining this goal allows for future tool vendors to “generate” JDBC code and to hide many of JDBC’s complexities from the end user.

SQL Conformance

SQL syntax varies as you move from database vendor to database vendor. In an effort to support a wide variety of vendors, JDBC will allow any query statement to be passed through it to the underlying database driver. This allows the connectivity module to handle non-standard functionality in a manner that is suitable for its users.

JDBC must be implemental on top of common database interfaces

The JDBC SQL API must “sit” on top of other common SQL level APIs. This goal allows JDBC to use existing ODBC level drivers by the use of a software interface. This interface would translate JDBC calls to ODBC and vice versa.

  1. Provide a Java interface that is consistent with the rest of the Java system

Because of Java’s acceptance in the user community thus far, the designers feel that they should not stray from the current design of the core Java system.

  • Keep it simple

This goal probably appears in all software design goal listings. JDBC is no exception. Sun felt that the design of JDBC should be very simple, allowing for only one method of completing a task per mechanism. Allowing duplicate functionality only serves to confuse the users of the API.

  • Use strong, static typing wherever possible

Strong typing allows for more error checking to be done at compile time; consequently, fewer errors appear at runtime.

  • Keep the common cases simple

Because more often than not, the usual SQL calls used by the programmer are simple SELECT’s, INSERT’s, DELETE’s and UPDATE’s, these queries should be simple to perform with JDBC. However, more complex SQL statements should also be possible.

Finally, we decided to proceed with the implementation using Java Networking.

And for dynamically updating the cache table we go for MS Access database.

Java has two things: a programming language and a platform.

Java is a high-level programming language that is all of the following

  • Simple
  • Architecture-neutral
  • Object-oriented
  • Portable
  • Distributed
  • High-performance
  • Interpreted
  • Multithreaded
  • Robust
  • Dynamic
  • Secure

Java is also unusual in that each Java program is both compiled and interpreted. With a compiler, you translate a Java program into an intermediate language called Java byte codes: platform-independent code instructions that are passed to and run on the computer.

Compilation happens just once; interpretation occurs each time the program is executed. The figure illustrates how this works.

6.7 NETWORKING TCP/IP STACK:

The TCP/IP stack is shorter than the OSI one:

TCP is a connection-oriented protocol; UDP (User Datagram Protocol) is a connectionless protocol.

IP datagram’s:

The IP layer provides a connectionless and unreliable delivery system. It considers each datagram independently of the others. Any association between datagrams must be supplied by the higher layers. The IP layer supplies a checksum that includes its own header. The header includes the source and destination addresses. The IP layer handles routing through an Internet. It is also responsible for breaking up large datagrams into smaller ones for transmission and reassembling them at the other end.

UDP:

UDP is also connectionless and unreliable. What it adds to IP is a checksum for the contents of the datagram and port numbers. These are used to give a client/server model – see later.
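A minimal Java example of sending one UDP datagram is shown below; the host and port are placeholders, and delivery is not guaranteed because UDP is connectionless and unreliable.

// Send a single UDP datagram.
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;

public class UdpSendExample {
    public static void main(String[] args) throws Exception {
        byte[] data = "hello".getBytes();
        DatagramSocket socket = new DatagramSocket();
        socket.send(new DatagramPacket(data, data.length,
                InetAddress.getByName("localhost"), 9876));   // placeholder host/port
        socket.close();
    }
}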

TCP:

TCP supplies logic to give a reliable connection-oriented protocol above IP. It provides a virtual circuit that two processes can use to communicate.

Internet addresses

In order to use a service, you must be able to find it. The Internet uses an address scheme for machines so that they can be located. The address is a 32 bit integer which gives the IP address.

Network address:

Class A uses 8 bits for the network address with 24 bits left over for other addressing. Class B uses 16 bit network addressing. Class C uses 24 bit network addressing and class D uses all 32.

Subnet address:

Internally, the UNIX network is divided into sub networks. Building 11 is currently on one sub network and uses 10-bit addressing, allowing 1024 different hosts.

Host address:

8 bits are finally used for host addresses within our subnet. This places a limit of 256 machines that can be on the subnet.

Total address:

The 32 bit address is usually written as 4 integers separated by dots.

Port addresses

A service exists on a host, and is identified by its port. This is a 16 bit number. To send a message to a server, you send it to the port for that service of the host that it is running on. This is not location transparency! Certain of these ports are “well known”.

Sockets:

A socket is a data structure maintained by the system to handle network connections. A socket is created using the call socket. It returns an integer that is like a file descriptor. In fact, under Windows, this handle can be used with Read File and Write File functions.

#include <sys/types.h>
#include <sys/socket.h>
int socket(int family, int type, int protocol);

Here “family” will be AF_INET for IP communications, protocol will be zero, and type will depend on whether TCP or UDP is used. Two processes wishing to communicate over a network create a socket each. These are similar to two ends of a pipe – but the actual pipe does not yet exist.
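Since the implementation of this project uses Java networking rather than the C API above, an equivalent minimal TCP client in Java looks as follows (the host, port, and message are placeholders):

// Minimal TCP client: open a connection, send one line, read one line back.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

public class TcpClientExample {
    public static void main(String[] args) throws Exception {
        Socket socket = new Socket("localhost", 5000);   // placeholder host/port
        PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
        BufferedReader in = new BufferedReader(new InputStreamReader(socket.getInputStream()));
        out.println("hello");            // reliable, in-order delivery over the virtual circuit
        System.out.println(in.readLine());
        socket.close();
    }
}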

6.8 JFREE CHART:

JFreeChart is a free 100% Java chart library that makes it easy for developers to display professional quality charts in their applications. JFreeChart’s extensive feature set includes:

A consistent and well-documented API, supporting a wide range of chart types;

A flexible design that is easy to extend, and targets both server-side and client-side applications;

Support for many output types, including Swing components, image files (including PNG and JPEG), and vector graphics file formats (including PDF, EPS and SVG);

JFreeChart is “open source” or, more specifically, free software. It is distributed under the terms of the GNU Lesser General Public Licence (LGPL), which permits use in proprietary applications.
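A small example of producing a chart with JFreeChart and saving it as a PNG is given below. It is based on the JFreeChart 1.0.x API; the dataset values, titles, and output file name are placeholders chosen for this report.

// Build a simple XY line chart and write it to a PNG file.
import java.io.File;
import org.jfree.chart.ChartFactory;
import org.jfree.chart.ChartUtilities;
import org.jfree.chart.JFreeChart;
import org.jfree.chart.plot.PlotOrientation;
import org.jfree.data.xy.XYSeries;
import org.jfree.data.xy.XYSeriesCollection;

public class ChartExample {
    public static void main(String[] args) throws Exception {
        XYSeries series = new XYSeries("Throughput");
        series.add(1, 1.2);           // placeholder data points
        series.add(2, 1.8);
        series.add(3, 1.5);

        JFreeChart chart = ChartFactory.createXYLineChart(
                "Sample Chart", "Time", "Mbps",
                new XYSeriesCollection(series),
                PlotOrientation.VERTICAL, true, false, false);

        ChartUtilities.saveChartAsPNG(new File("chart.png"), chart, 640, 480);
    }
}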

 

6.8.1. Map Visualizations:

Charts showing values that relate to geographical areas. Some examples include: (a) population density in each state of the United States, (b) income per capita for each country in Europe, (c) life expectancy in each country of the world. The tasks in this project include: Sourcing freely redistributable vector outlines for the countries of the world, states/provinces in particular countries (USA in particular, but also other areas);

Creating an appropriate dataset interface (plus default implementation), a renderer, and integrating this with the existing XYPlot class in JFreeChart; testing, documenting, testing some more, documenting some more.

6.8.2. Time Series Chart Interactivity

Implement a new (to JFreeChart) feature for interactive time series charts — to display a separate control that shows a small version of ALL the time series data, with a sliding “view” rectangle that allows you to select the subset of the time series data to display in the main chart.

6.8.3. Dashboards

There is currently a lot of interest in dashboard displays. Create a flexible dashboard mechanism that supports a subset of JFreeChart chart types (dials, pies, thermometers, bars, and lines/time series) that can be delivered easily via both Java Web Start and an applet.

 

6.8.4. Property Editors

The property editor mechanism in JFreeChart only handles a small subset of the properties that can be set for charts. Extend (or reimplement) this mechanism to provide greater end-user control over the appearance of the charts.

CHAPTER 7

APPENDIX

7.1 SAMPLE SOURCE CODE

7.2 SAMPLE OUTPUT

CHAPTER 8

8.1 CONCLUSION:

The future wireless environment is expected to be a converged system that incorporates different access networks with diverse transmission features and capabilities. The increasing power and popularity of multihomed mobile terminals facilitate bandwidth aggregation for enhanced transmission reliability and data throughput. Optimizing SCTP is a critical step towards integrating heterogeneous wireless networks for efficient video delivery.

This paper proposes a novel distortion-aware concurrent multipath transfer scheme to support high-quality video streaming over heterogeneous wireless networks. Through modeling and analysis, we have developed solutions for per-path status estimation, congestion window adaption, flow rate allocation, and data retransmission. As future work, we will study the cost minimization problem of utilizing CMT for mobile video delivery in heterogeneous wireless networks.