Public Integrity Auditing for Shared Dynamic Cloud Data with Group User Revocation

05/08/2019 02/07/2019 by admin

REAL-TIME BIG DATA ANALYTICAL ARCHITECTURE FOR REMOTE SENSING APPLICATION

ABSTRACT:

Real-time remote sensing Big Data involves far more than it seems at first, and extracting useful information from it efficiently poses major computational challenges in analyzing, aggregating, and storing data that are collected remotely. In view of these factors, there is a need for a system architecture that supports both real-time and offline data processing. In this paper, we propose a real-time Big Data analytical architecture for remote sensing satellite applications.

The proposed architecture comprises three main units:

1) Remote sensing Big Data acquisition unit (RSDU);

2) Data processing unit (DPU); and

3) Data analysis decision unit (DADU).

First, the RSDU acquires data from the satellite and sends it to the Base Station, where initial processing takes place. Second, the DPU plays a vital role in the architecture for efficient processing of real-time Big Data by providing filtration, load balancing, and parallel processing. Third, the DADU is the upper-layer unit of the proposed architecture, responsible for compiling and storing the results and for generating decisions based on the results received from the DPU.

INTRODUCTION:

EXISTING SYSTEM:

Existing methods are inapplicable on standard computers, where it is not desirable or possible to load the entire image into memory before doing any processing. In this situation, it is necessary to load only part of the image and process it before saving the result to disk and proceeding to the next part. This corresponds to the concept of on-the-flow processing. Remote sensing processing can be seen as a chain of steps, each of which is generally independent of the following ones and generally focuses on a particular domain. For example, the image can be radiometrically corrected to compensate for atmospheric effects and indices can be computed before an object extraction based on these indices takes place.

The typical processing chain will process the whole image for each step, returning the final result after everything is done. For some processing chains, iterations between the different steps are required to find the correct set of parameters. Due to the variability of satellite images and the variety of the tasks that need to be performed, fully automated tasks are rare. Humans are still an important part of the loop. These concepts are linked in the sense that both rely on the ability to process only one part of the data.

In the case of simple algorithms, this is quite easy: the input is just split into different non-overlapping pieces that are processed one by one. But most algorithms do consider the neighborhood of each pixel. As a consequence, in most cases, the data will have to be split into partially overlapping pieces. The objective is to obtain the same result as the original algorithm as if the processing was done in one go. Depending on the algorithm, this is unfortunately not always possible.
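The splitting into partially overlapping pieces can be sketched as follows. This is a minimal illustration, not the method of any particular toolkit: the function names (`split_rows`, `vertical_mean3`, `filter_by_tiles`), the row-wise tiling, and the 3-row mean filter are all assumptions chosen to keep the example small. Each tile reads a margin of extra rows equal to the filter radius, and only the inner rows are kept, so the stitched result matches a one-pass run.

```python
def split_rows(height, tile_rows, margin):
    """Yield (read_start, read_end, keep_start, keep_end) row ranges.

    Each tile reads `margin` extra rows on both sides so that a filter
    with that radius sees the same neighbours as in a one-pass run.
    """
    for keep_start in range(0, height, tile_rows):
        keep_end = min(keep_start + tile_rows, height)
        read_start = max(keep_start - margin, 0)
        read_end = min(keep_end + margin, height)
        yield read_start, read_end, keep_start, keep_end


def vertical_mean3(rows):
    """3-row vertical mean filter; edge rows reuse the nearest row."""
    out = []
    last = len(rows) - 1
    for i, row in enumerate(rows):
        above = rows[max(i - 1, 0)]
        below = rows[min(i + 1, last)]
        out.append([(a + b + c) / 3 for a, b, c in zip(above, row, below)])
    return out


def filter_by_tiles(image, tile_rows, margin=1):
    """Apply vertical_mean3 tile by tile and stitch the kept rows."""
    result = []
    for r0, r1, k0, k1 in split_rows(len(image), tile_rows, margin):
        filtered = vertical_mean3(image[r0:r1])
        # Discard the margin rows; they were read only as neighbours.
        result.extend(filtered[k0 - r0:k1 - r0])
    return result


# Hypothetical 7x4 image held as a list of rows.
image = [[float(r * 10 + c) for c in range(4)] for r in range(7)]
print(filter_by_tiles(image, tile_rows=3) == vertical_mean3(image))  # True
```

For a filter with a larger neighbourhood, the margin would have to grow with the filter radius; algorithms that need global information cannot be reproduced this way, which is the limitation noted above.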

DISADVANTAGES:

A reader that loads the image, or part of the image, into memory from the file on disk;

A filter that carries out local processing that does not require access to neighboring pixels (a simple threshold, for example); the processing can happen on CPU or GPU;

A filter that requires the values of neighboring pixels to compute the value of a given pixel (a convolution filter is a typical example); the processing can happen on CPU or GPU;

A writer that outputs the resulting image in memory to a file on disk; note that the file could be written in several steps. With these components, it is possible to compute only part of the image through the whole pipeline, incurring only minimal computation overhead.

PROPOSED SYSTEM:

We present a remote sensing Big Data analytical architecture that is used to analyze real-time as well as offline data. First, the data are preprocessed remotely so that they become machine readable. Afterward, this useful information is transmitted to the Earth Base Station for further data processing. The Earth Base Station performs two types of processing: real-time and offline. Offline data are transmitted to an offline data-storage device, which allows the data to be used later. Real-time data are transmitted directly to the filtration and load balancer server, where a filtration algorithm extracts the useful information from the Big Data.

The load balancer, in turn, balances the processing load by distributing the real-time data equally among the servers. The filtration and load-balancing server not only filters the data and balances the load, but also enhances the system efficiency. The filtered data are then processed by the parallel servers and sent to the data aggregation unit (if required, the processed data can be stored in the result storage device) for comparison by the decision and analyzing server. The proposed architecture accepts remote access sensory data as well as direct access network data (e.g., GPRS, 3G, xDSL, or WAN). The proposed architecture and the algorithms are implemented and applied to remote sensing earth observatory data.

The proposed architecture is capable of dividing, load balancing, and parallel processing of only the useful data. It thus analyzes real-time remote sensing Big Data from an earth observatory system efficiently. Furthermore, the proposed architecture can store incoming raw data so that offline analysis can be performed on large stored dumps when required. Finally, a detailed analysis of remotely sensed earth observatory Big Data for land and sea areas is provided using .NET. In addition, various algorithms are proposed for each level of RSDU, DPU, and DADU to detect land as well as sea areas and to illustrate the working of the architecture.

ADVANTAGES:

The proposed architecture processes high-speed, large volumes of real-time remote sensory image data. It works on both DPU and DADU by taking data from the satellite as input.

Using our architecture for offline as well as online traffic, we perform a simple analysis on remote sensing earth observatory data. We assume that the data are big in nature and difficult for a single server to handle.

The data arrive continuously from a satellite at high speed. Hence, special algorithms are needed to process, analyze, and make decisions from that Big Data. Here, we analyze remote sensing data to find land, sea, or ice areas.

We have used the proposed architecture to perform the analysis and have proposed algorithms for handling, processing, analyzing, and decision-making on remote sensing Big Data images.

HARDWARE & SOFTWARE REQUIREMENTS:

HARDWARE REQUIREMENT:

- Processor – Pentium IV
- Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Key Board – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA

SOFTWARE REQUIREMENTS:

Operating System : Windows XP or Win7
Front End : JAVA JDK 1.7
Back End : MYSQL Server
Server : Apache Tomcat Server
Script : JSP Script
Document : MS-Office 2007

ARCHITECTURE DIAGRAM

MODULES:

DATA ANALYSIS DECISION UNIT (DADU):

DATA PROCESSING UNIT (DPU):

REMOTE SENSING APPLICATION RSDU:

FINDINGS AND DISCUSSION:

ALGORITHM DESIGN AND TESTING:

MODULES DESCRIPTION:

DATA PROCESSING UNIT (DPU):

In the data processing unit (DPU), the filtration and load balancer server (FLBS) has two basic responsibilities: filtration of data and load balancing of processing power. Filtration identifies the data that are useful for analysis; it lets only useful information through, whereas the rest of the data are blocked and discarded. This enhances the performance of the whole proposed system. The load-balancing part of the server divides the filtered data into parts and assigns them to the various processing servers. The filtration and load-balancing algorithm varies from analysis to analysis; e.g., if only sea wave and temperature data need to be analyzed, the measurements of these data are extracted and segmented into parts.

Each processing server has its own algorithm implementation for processing the incoming segments of data from the FLBS. Each processing server performs statistical calculations, measurements, and other mathematical or logical tasks to generate intermediate results for each segment of data. Since these servers perform their tasks independently and in parallel, the performance of the proposed system is dramatically enhanced, and the results for each segment are generated in real time. The results generated by each server are then sent to the aggregation server for compilation, organization, and storage for further processing.
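The filtering, block division, and parallel processing described for the DPU can be sketched as below. This is only an illustration under stated assumptions: the record layout (`kind`, `value`), the round-robin assignment, and the use of threads as stand-ins for processing servers are hypothetical and not taken from the proposed system.

```python
from concurrent.futures import ThreadPoolExecutor


def filter_records(records, wanted):
    """Keep only records whose 'kind' is needed for this analysis."""
    return [r for r in records if r["kind"] in wanted]


def split_blocks(items, block_size):
    """Divide the filtered data into fixed-size blocks."""
    return [items[i:i + block_size] for i in range(0, len(items), block_size)]


def assign_round_robin(blocks, n_servers):
    """Hand blocks to servers in turn so each gets a near-equal share."""
    queues = [[] for _ in range(n_servers)]
    for i, block in enumerate(blocks):
        queues[i % n_servers].append(block)
    return queues


def process_queue(queue):
    """Stand-in for a processing server: one partial result per block."""
    return [sum(r["value"] for r in block) / len(block) for block in queue]


# Hypothetical input: alternating useful and unneeded measurements.
records = [
    {"kind": "sea_temp" if i % 2 == 0 else "cloud", "value": float(i)}
    for i in range(12)
]
useful = filter_records(records, {"sea_temp"})
blocks = split_blocks(useful, 2)
queues = assign_round_robin(blocks, n_servers=2)

# Threads play the role of independent processing servers here.
with ThreadPoolExecutor(max_workers=2) as pool:
    partial_results = list(pool.map(process_queue, queues))

print(partial_results)  # [[1.0, 9.0], [5.0]]
```

In a real deployment the queues would be sent over the network to separate machines; the thread pool only demonstrates that the blocks can be processed independently.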

DATA ANALYSIS DECISION UNIT (DADU):

The DADU contains three major portions: the aggregation and compilation server, the results storage server(s), and the decision-making server. When the results are ready for compilation, the processing servers in the DPU send their partial results to the aggregation and compilation server, since the partial results are not yet in an organized and compiled form. The related results therefore need to be aggregated and organized into a proper form for further processing and storage. In the proposed architecture, the aggregation and compilation server is supported by various algorithms that compile, organize, store, and transmit the results. Again, the algorithm varies from requirement to requirement and depends on the analysis needs. The aggregation server stores the compiled and organized results in the results storage so that any server can use and process them at any time.

The aggregation server also sends a copy of each result to the decision-making server, which processes it to make a decision. The decision-making server is supported by decision algorithms that query the result for different things and then make various decisions (e.g., in our analysis, we identify land, sea, and ice, whereas other findings such as fire, storms, tsunamis, and earthquakes can also be detected). The decision algorithm must be robust and correct enough to produce results efficiently, discover hidden information, and make decisions. The decision part of the architecture is significant, since any small error in decision-making can degrade the efficiency of the whole analysis. Finally, the DADU displays or broadcasts the decisions so that any application can use them in real time. The applications can be any business software, general-purpose community software, or social networks that need those findings.

REMOTE SENSING APPLICATION RSDU:

Remote sensing promotes the expansion of earth observatory systems as cost-effective parallel data acquisition systems that satisfy specific computational requirements. The Earth and Space Science Society originally approved this solution as the standard for parallel processing in this context. However, as requirements for Big Data acquisition grew, it was soon recognized that traditional data processing technologies could not provide sufficient power for processing such data. Parallel processing of the massive volume of data was therefore required to analyze the Big Data efficiently. For that reason, the proposed RSDU is introduced in the remote sensing Big Data architecture; it gathers data from various satellites around the globe. The received raw data may be distorted by scattering and absorption by various atmospheric gases and dust particles. We assume that the satellite can correct the erroneous data.

To convert the raw data into image format and to enable effective data analysis, the remote sensing satellite preprocesses the data in many situations to integrate data from different sources, which not only decreases storage cost but also improves analysis accuracy. The data must be corrected by different methods to remove distortions caused by the motion of the platform relative to the earth, platform attitude, earth curvature, nonuniformity of illumination, variations in sensor characteristics, etc. The data are then transmitted to the Earth Base Station for further processing over a direct communication link. We divide the data processing procedure into two steps: real-time Big Data processing and offline Big Data processing. In offline data processing, the Earth Base Station transmits the data to the data center for storage, and the data are then used for future analyses. In real-time data processing, however, the data are transmitted directly to the filtration and load balancer server (FLBS), since storing incoming real-time data degrades the performance of real-time processing.

FINDINGS AND DISCUSSION:

Preprocessed and formatted data from the satellite contain all or some of the following parts, depending on the product.

1) Main product header (MPH): It includes the product's basic information, i.e., id, measurement and sensing time, orbit information, etc.

2) Specific product header (SPH): It contains information specific to each product or product group, i.e., number of data set descriptors (DSD), directory of remaining data sets in the file, etc.

3) Annotation data sets (ADS): It contains quality information, time-tagged processing parameters, geolocation tie points, solar angles, etc.

4) Global annotation data sets (GADS): It contains scaling factors, offsets, calibration information, etc.

5) Measurement data set (MDS): It contains measurements or geophysical parameters calculated from the measurements, including the quality flag and the time tag of each measurement. The image data are also stored in this part and are the main element of our analysis.

The MPH and SPH data are in ASCII format, whereas all the other data sets are in binary format. MDS, ADS, and GADS each consist of a sequence of records, with one or more data fields in each record. In our case, the MDS contains a number of records, and each record contains a number of fields. Each record of the MDS corresponds to one row of the satellite image, which is our main focus during analysis.
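To make the split between ASCII headers and binary records concrete, here is a small parsing sketch. The byte layout is entirely hypothetical: the `KEY=VALUE` header lines, the big-endian record with a time tag, a quality flag, and 16-bit pixels, and the names `parse_ascii_header` and `parse_mds` are assumptions for illustration and do not describe any real product specification.

```python
import struct


def parse_ascii_header(raw):
    """Parse KEY=VALUE lines of an ASCII header block into a dict."""
    header = {}
    for line in raw.decode("ascii").splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            header[key.strip()] = value.strip()
    return header


def parse_mds(raw, pixels_per_row):
    """Decode fixed-size binary records; each record is one image row."""
    # Assumed record: float64 time tag, uint8 quality flag, uint16 pixels.
    record = struct.Struct(f">dB{pixels_per_row}H")
    rows = []
    for offset in range(0, len(raw), record.size):
        time_tag, quality, *pixels = record.unpack_from(raw, offset)
        rows.append({"time": time_tag, "quality": quality, "pixels": pixels})
    return rows


# Build a tiny synthetic product: one ASCII header and two binary rows.
mph = b"PRODUCT=DEMO_001\nORBIT=12345\n"
mds = (struct.pack(">dB3H", 0.5, 0, 10, 20, 30)
       + struct.pack(">dB3H", 1.5, 1, 40, 50, 60))

header = parse_ascii_header(mph)
rows = parse_mds(mds, pixels_per_row=3)
print(header["ORBIT"], rows[1]["pixels"])
```

Because each record maps to one image row, a reader of this kind can hand rows to the filtration stage one at a time without loading the whole image.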

ALGORITHM DESIGN AND TESTING:

Our algorithms are proposed to process high-speed, large volumes of real-time remote sensory image data using the proposed architecture. They work on both DPU and DADU, taking data from the satellite as input to identify land and sea areas in the data set. The set contains four simple algorithms, i.e., algorithm I, algorithm II, algorithm III, and algorithm IV, which work on the filtration and load balancer, the processing servers, the aggregation server, and the decision-making server, respectively.

Algorithm I, i.e., the filtration and load balancer algorithm (FLBA), works on the filtration and load balancer to keep only the required data and discard all other information. It also provides load balancing by dividing the data into fixed-size blocks and sending them to the processing servers, i.e., one or more distinct blocks to each server. This filtration, dividing, and load-balancing task speeds up performance by neglecting unnecessary data and by enabling parallel processing.

Algorithm II, i.e., the processing and calculation algorithm (PCA), processes the filtered data and is implemented on each processing server. It calculates various parameters that are used in the decision-making process. The results of these calculations are then sent to the aggregation server for further processing.

Algorithm III, i.e., the aggregation and compilation algorithm (ACA), stores, compiles, and organizes the results so that the decision-making server can use them for land and sea area detection.

Algorithm IV, i.e., the decision-making algorithm (DMA), identifies land area and sea area by comparing the parameter results from the aggregation server with threshold values.
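The threshold comparison in the decision-making step might look like the sketch below. The text does not give the threshold values or the decision rule, so everything specific here is an assumption: the parameter names, the numeric thresholds, the majority vote, and the premise that land blocks show higher values than sea blocks are all hypothetical.

```python
def classify_block(stats, thresholds):
    """Label a block by majority vote over per-parameter thresholds.

    Assumption for illustration only: a parameter exceeding its
    threshold counts as one vote for 'land'.
    """
    votes = sum(1 for name, limit in thresholds.items() if stats[name] > limit)
    return "land" if votes * 2 > len(thresholds) else "sea"


# Hypothetical thresholds and aggregated block parameters.
thresholds = {"mean": 100.0, "sd": 15.0, "abs_diff": 40.0}
block_a = {"mean": 140.0, "sd": 22.0, "abs_diff": 35.0}
block_b = {"mean": 60.0, "sd": 5.0, "abs_diff": 10.0}

print(classify_block(block_a, thresholds))  # land
print(classify_block(block_b, thresholds))  # sea
```

A real decision algorithm would calibrate its thresholds against labeled imagery; the vote is just one simple way to combine several parameter comparisons.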

IMPLEMENTATION:

Like cloud computing, Big Data covers diverse technologies. The input of Big Data comes from social networks (Facebook, Twitter, LinkedIn, etc.), Web servers, satellite imagery, sensory data, banking transactions, etc. Despite the very recent emergence of Big Data architecture in scientific applications, numerous efforts toward Big Data analytics architecture can already be found in the literature. We propose a remote sensing Big Data architecture to analyze the Big Data in an efficient manner, as shown in Fig. 1. Fig. 1 depicts n satellites that obtain earth observatory Big Data images with sensors or conventional cameras, through which scenes are recorded using radiation. Special techniques are applied to process and interpret remote sensing imagery for the purpose of producing conventional maps, thematic maps, resource surveys, etc. We have divided the remote sensing Big Data architecture into three units: RSDU, DPU, and DADU.

In healthcare scenarios, medical practitioners gather massive volumes of data about patients, medical history, medications, and other details. These data are accumulated in drug-manufacturing companies. The nature of these data is very complex, and sometimes practitioners are unable to relate them to other information, which results in important information being missed. Employing advanced analytic techniques to organize and extract useful information from Big Data leads to personalized medication, and such techniques give insight into the hereditary causes of disease.

ALGORITHMS:

The filtration and load balancer algorithm (FLBA) takes the satellite data or product, filters it, divides it into segments, and performs load balancing.

The processing algorithm calculates results for different parameters for each incoming block and sends them to the next level. In step 1, it calculates the mean, the standard deviation (SD), the absolute difference, and the number of values that are greater than the maximum threshold. In the next step, the results are transmitted to the aggregation server.
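The per-block parameters named above can be computed as in this sketch. Two points are assumptions because the text does not define them: the SD is taken as the population standard deviation, and the "absolute difference" is interpreted as the difference between the largest and smallest value in the block. The function name and sample values are hypothetical.

```python
import statistics


def block_parameters(values, max_threshold):
    """Compute the per-block parameters sent on to the aggregation server."""
    return {
        "mean": statistics.fmean(values),
        # Assumption: population SD over the block.
        "sd": statistics.pstdev(values),
        # Assumption: spread between largest and smallest value.
        "abs_diff": abs(max(values) - min(values)),
        "count_above": sum(1 for v in values if v > max_threshold),
    }


# Hypothetical block of pixel values.
block = [2, 4, 4, 4, 5, 5, 7, 9]
print(block_parameters(block, max_threshold=4.5))
```

Each processing server would run such a function on every block it receives and forward the resulting dictionary, tagged with the block index, to the aggregation server.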

ACA collects the results from each processing server for each block Bi and then combines, organizes, and stores these results in an RDBMS database.
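The aggregation step can be illustrated with an in-memory SQLite database standing in for the RDBMS. The table name, columns, and sample values are hypothetical, and SQLite is a substitute chosen so that the sketch runs without a server; the software requirements above list MySQL as the back end.

```python
import sqlite3


def store_results(conn, results):
    """Insert partial results, keyed by block index, into one table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS block_results ("
        "block_id INTEGER PRIMARY KEY, mean REAL, sd REAL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO block_results VALUES (?, ?, ?)",
        [(r["block_id"], r["mean"], r["sd"]) for r in results],
    )
    conn.commit()


def load_ordered(conn):
    """Return the compiled results organized by block index."""
    return conn.execute(
        "SELECT block_id, mean, sd FROM block_results ORDER BY block_id"
    ).fetchall()


conn = sqlite3.connect(":memory:")
# Results arrive out of order from two hypothetical processing servers.
from_server_1 = [
    {"block_id": 2, "mean": 9.0, "sd": 1.0},
    {"block_id": 0, "mean": 1.0, "sd": 1.0},
]
from_server_2 = [{"block_id": 1, "mean": 5.0, "sd": 1.0}]

store_results(conn, from_server_1 + from_server_2)
compiled = load_ordered(conn)
print(compiled)  # [(0, 1.0, 1.0), (1, 5.0, 1.0), (2, 9.0, 1.0)]
```

Keying the rows by block index lets the decision-making server read the results in image order regardless of which processing server finished first.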

CONCLUSION AND FUTURE:

In this paper, we proposed an architecture for real-time Big Data analysis for remote sensing applications. The architecture efficiently processes and analyzes real-time and offline remote sensing Big Data for decision-making. It is composed of three major units: 1) RSDU; 2) DPU; and 3) DADU. These units implement algorithms for each level of the architecture depending on the required analysis. The real-time Big Data architecture is generic (application independent) and can be used for any type of remote sensing Big Data analysis. Furthermore, it filters, divides, and processes in parallel only the useful information, discarding all other data. These capabilities make it a better choice for real-time remote sensing Big Data analysis.

The algorithms proposed in this paper for each unit and subunit are used to analyze remote sensing data sets, which helps in better understanding land and sea areas. Researchers and organizations can use the proposed architecture for any type of remote sensory Big Data analysis by developing algorithms for each level of the architecture according to their analysis requirements. For future work, we are planning to extend the proposed architecture to make it compatible with Big Data analysis for all applications, e.g., sensors and social networking. We are also planning to use the proposed architecture to perform complex analysis on earth observatory data for decision-making in real time, such as earthquake prediction, tsunami prediction, fire detection, etc.

REFERENCES:

[1] D. Agrawal, S. Das, and A. E. Abbadi, “Big Data and cloud computing: Current state and future opportunities,” in Proc. Int. Conf. Extending Database Technol. (EDBT), 2011, pp. 530–533.

[2] J. Cohen, B. Dolan, M. Dunlap, J. M. Hellerstein, and C. Welton, “Mad skills: New analysis practices for Big Data,” PVLDB, vol. 2, no. 2, pp. 1481–1492, 2009.

[3] J. Dean and S. Ghemawat, “Mapreduce: Simplified data processing on large clusters,” Commun. ACM, vol. 51, no. 1, pp. 107–113, 2008.

[4] H. Herodotou et al., “Starfish: A self-tuning system for Big Data analytics,” in Proc. 5th Int. Conf. Innovative Data Syst. Res. (CIDR), 2011, pp. 261–272.

[5] K. Michael and K. W. Miller, “Big Data: New opportunities and new challenges [guest editors’ introduction],” IEEE Comput., vol. 46, no. 6, pp. 22–24, Jun. 2013.

[6] C. Eaton, D. Deroos, T. Deutsch, G. Lapis, and P. C. Zikopoulos, Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data. New York, NY, USA: Mc Graw-Hill, 2012.

[7] R. D. Schneider, Hadoop for Dummies Special Edition. Hoboken, NJ, USA: Wiley, 2012.

Proof of Ownership In Deduplicated Storage With Mobile Device Efficiency

05/08/2019 02/07/2019 by admin

Cloud storage such as Dropbox and Bitcasa is one of the most popular cloud services. Currently, with the prevalence of mobile cloud computing, users can even collaboratively edit the newest version of documents and synchronize the newest files on their smart mobile devices. A remarkable feature of current cloud storage is its virtually infinite storage. To support unlimited storage, the cloud storage provider uses data deduplication techniques to reduce the data to be stored and therefore reduce the storage expense. Moreover, the use of data deduplication also helps significantly reduce the need for bandwidth and therefore improves the user experience. Nevertheless, in spite of the above benefits, data deduplication has inherent security weaknesses. The most severe is that an adversary may download a file without authorization by presenting only the file hash. In this article we first review the previous solutions and identify their performance weaknesses. Then we propose an alternative design that achieves cloud server efficiency and especially mobile device efficiency.

1.2 INTRODUCTION

Mobile devices have become prevalent in recent years, and mobile computing has been a growing trend. Meanwhile, cloud computing has been one of the most significant shifts in computing in recent decades. Many tasks, such as document editing and file backup, have been shifted from end devices to the cloud. Therefore, with the convergence of mobile computing and cloud computing, along with the recent development of the 5G communication standard that establishes more reliable and faster communication channels, mobile cloud computing (MCC) could be a rapidly growing field that deserves to be investigated and explored.

Deduplicated Storage in Mobile Cloud Computing

Among cloud services, cloud storage with the capability of file backup and synchronization could be the most popular service that enables mobile users to access their files everywhere. Dropbox (https://www.dropbox.com/) and Bitcasa (https://www.bitcasa.com/) are two examples that offer easy-to-use file backup and synchronization services. Several remarkable features of such cloud storage can be identified. It has high availability, which means that the user's data are replicated over cloud servers worldwide and are guaranteed to be accessible whenever the user needs them. It offers flexibility through a pay-as-you-go model, which means that the user can gain additional storage immediately whenever the user is willing to make an extra payment. The most important feature is that it has virtually infinite storage space, which means that the user can back up whatever he/she wants to upload to the cloud. A renowned example is Bitcasa, which offers "unlimited storage" that enables the user to upload virtually everything. Offering infinite storage space might place a severe economic burden on the cloud storage provider.

However, a technique called data deduplication helps significantly reduce the cost of storage. Data deduplication has been widely implemented by cloud storage providers including Dropbox and Bitcasa. According to the report in [8] (http://www.snia.org), the use of data deduplication in business applications may reduce the data to be stored and thus achieve disk and bandwidth savings of more than 90 percent. The power of data deduplication comes from avoiding storing the same file multiple times. The storage saving is especially obvious when popular multimedia content such as music and movies is considered. Replicated content creates an additional storage need the first time it is uploaded, but no extra storage need for subsequent uploads. In addition to the storage saving, if the data content is already in storage, then the replicated content does not need to be transmitted, which saves bandwidth. Data deduplication can be categorized into two types depending on where the deduplication takes place: server (cloud) side deduplication and client (user) side deduplication. Server side deduplication is simple: the server, after receiving the file, checks whether it already has a copy in storage. The server discards the received file if it does, or creates a new file in storage if it does not.
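Server side deduplication as just described can be sketched in a few lines. The class and attribute names are hypothetical, and SHA-256 is an assumption used to detect identical content; the text does not say how the server compares files.

```python
import hashlib


class DedupServer:
    """Stores each distinct file once and tracks which users own it."""

    def __init__(self):
        self.files = {}   # digest -> file content
        self.owners = {}  # digest -> set of user names

    def upload(self, user, content):
        """Receive the full file, then keep it only if it is new."""
        digest = hashlib.sha256(content).hexdigest()
        is_new = digest not in self.files
        if is_new:
            self.files[digest] = content
        # Either way, the user is associated with the stored copy.
        self.owners.setdefault(digest, set()).add(user)
        return is_new


server = DedupServer()
first = server.upload("user1", b"popular movie bytes")
second = server.upload("user2", b"popular movie bytes")
print(first, second, len(server.files))  # True False 1
```

Note that the second upload still transmits the whole file before it is discarded, which is why this variant saves storage but not bandwidth.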

Server side deduplication does not save bandwidth because the server performs the deduplication only after the file has been received. Client side deduplication, on the other hand, adopts a more aggressive method: the user calculates and sends the hash of the file before uploading the file. Upon receiving the hash, the server checks whether the hash is already in storage. If so, the user is asked to send nothing and the server associates the user with the existing file; otherwise, the user is asked to upload the file. An illustrative example is shown in Fig. 2, where user 1 first uploads files F1 and F2 in Fig. 2a. Then the cloud knows from the hashes h(F1) and h(F3) sent by user 2 that a copy of F1 is already in storage, and it sends a positive acknowledgment for F1 and a negative acknowledgment for F3 to user 2. User 2, according to the acknowledgments, sends only F3, saving the transmission of F1. Public cloud storage services (e.g., Dropbox and Bitcasa) are more likely to adopt client side deduplication because of its storage and bandwidth savings. In particular, in addition to the reduced storage requirement, client side deduplication also reduces the need for file transmission, which reduces the waiting time for users and the energy consumption of the server. We particularly mention that even with the increased bandwidth of the coming 5G communication standard, the data rate of wireless links is still not comparable to that of wired links. Thus, if we consider mobile devices accessing cloud storage services, client side deduplication becomes an inevitable technique for MCC applications.
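The hash-first exchange of client side deduplication can be sketched as follows, mirroring the F1, F2, F3 example. The names `CloudStore`, `check`, and `client_upload` are hypothetical, and SHA-256 is an assumed choice of hash. The sketch deliberately reproduces the plain hash-only check, which is the weakness mentioned in the abstract: anyone who learns a file hash is associated with the file.

```python
import hashlib


def sha256_hex(content):
    return hashlib.sha256(content).hexdigest()


class CloudStore:
    def __init__(self):
        self.files = {}   # digest -> file content
        self.owners = {}  # digest -> set of user names

    def check(self, user, digest):
        """Return True (positive acknowledgment) if the file is stored."""
        if digest in self.files:
            self.owners[digest].add(user)
            return True
        return False

    def upload(self, user, content):
        digest = sha256_hex(content)
        self.files[digest] = content
        self.owners.setdefault(digest, set()).add(user)


def client_upload(store, user, files):
    """Send hashes first; transmit only files the cloud lacks."""
    bytes_sent = 0
    for content in files:
        if not store.check(user, sha256_hex(content)):
            store.upload(user, content)
            bytes_sent += len(content)
    return bytes_sent


F1, F2, F3 = b"file one", b"file number two", b"third"
store = CloudStore()
sent_by_user1 = client_upload(store, "user1", [F1, F2])
sent_by_user2 = client_upload(store, "user2", [F1, F3])
print(sent_by_user1, sent_by_user2)  # 23 5
```

User 2 transmits only the 5 bytes of F3, since the hash of F1 is already known to the cloud.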

1.3 LITERATURE SURVEY

DUPLESS: SERVER-AIDED ENCRYPTION FOR DEDUPLICATED STORAGE

AUTHOR: M. Bellare, S. Keelveedhi, and T. Ristenpart

PUBLISH: Proc. 22nd USENIX Conf. Sec. Symp., 2013, pp. 179–194.

EXPLANATION:

Cloud storage service providers such as Dropbox, Mozy, and others perform deduplication to save space by only storing one copy of each file uploaded. Should clients conventionally encrypt their files, however, savings are lost. Message-locked encryption (the most prominent manifestation of which is convergent encryption) resolves this tension. However it is inherently subject to brute-force attacks that can recover files falling into a known set. We propose an architecture that provides secure deduplicated storage resisting brute-force attacks, and realize it in a system called DupLESS. In DupLESS, clients encrypt under message-based keys obtained from a key-server via an oblivious PRF protocol. It enables clients to store encrypted data with an existing service, have the service perform deduplication on their behalf, and yet achieves strong confidentiality guarantees. We show that encryption for deduplicated storage can achieve performance and space savings close to that of using the storage service with plaintext data.

FAST AND SECURE LAPTOP BACKUPS WITH ENCRYPTED DE-DUPLICATION

AUTHOR: P. Anderson and L. Zhang

PUBLISH: Proc. 24th Int. Conf. Large Installation Syst. Admin., 2010, pp. 29–40.

EXPLANATION:

Many people now store large quantities of personal and corporate data on laptops or home computers. These often have poor or intermittent connectivity, and are vulnerable to theft or hardware failure. Conventional backup solutions are not well suited to this environment, and backup regimes are frequently inadequate. This paper describes an algorithm which takes advantage of the data which is common between users to increase the speed of backups, and reduce the storage requirements. This algorithm supports client-end per-user encryption which is necessary for confidential personal data. It also supports a unique feature which allows immediate detection of common subtrees, avoiding the need to query the backup system for every file. We describe a prototype implementation of this algorithm for Apple OS X, and present an analysis of the potential effectiveness, using real data obtained from a set of typical users. Finally, we discuss the use of this prototype in conjunction with remote cloud storage, and present an analysis of the typical cost savings.

SECURE DEDUPLICATION WITH EFFICIENT AND RELIABLE CONVERGENT KEY MANAGEMENT

AUTHOR: J. Li, X. Chen, M. Li, J. Li, P. Lee, and W. Lou

PUBLISH: IEEE Trans. Parallel Distrib. Syst., http://oi.ieeecomputersociety.org/10.1109/TPDS.2013.284, 2013

EXPLANATION:

Data deduplication is a technique for eliminating duplicate copies of data, and has been widely used in cloud storage to reduce storage space and upload bandwidth. Promising as it is, an arising challenge is to perform secure deduplication in cloud storage. Although convergent encryption has been extensively adopted for secure deduplication, a critical issue of making convergent encryption practical is to efficiently and reliably manage a huge number of convergent keys. This paper makes the first attempt to formally address the problem of achieving efficient and reliable key management in secure deduplication. We first introduce a baseline approach in which each user holds an independent master key for encrypting the convergent keys and outsourcing them to the cloud. However, such a baseline key management scheme generates an enormous number of keys with the increasing number of users and requires users to dedicatedly protect the master keys. To this end, we propose Dekey, a new construction in which users do not need to manage any keys on their own but instead securely distribute the convergent key shares across multiple servers. Security analysis demonstrates that Dekey is secure in terms of the definitions specified in the proposed security model. As a proof of concept, we implement Dekey using the Ramp secret sharing scheme and demonstrate that Dekey incurs limited overhead in realistic environments.

CHAPTER 2

2.0 SYSTEM ANALYSIS

2.1 EXISTING SYSTEM:

Data de duplication is one of important data compression techniques for eliminating duplicate copies of repeating data, and has been widely used in cloud storage to reduce the amount of storage space and save bandwidth. Previous de duplication systems cannot support differential authorization duplicate check, which is important in many applications. In such an authorized de duplication system, each user is issued a set of privileges during system initialization Each file uploaded to the cloud is also bounded by a set of privileges to specify which kind of users is allowed to perform the duplicate check and access the files.

Before submitting a duplicate check request for a file, the user needs to take this file and his own privileges as inputs. The user is able to find a duplicate for this file if and only if there is a copy of this file and a matched privilege stored in the cloud. Traditional deduplication systems based on convergent encryption, although providing confidentiality to some extent, do not support the duplicate check with differential privileges. In other words, no differential privileges have been considered in deduplication based on the convergent encryption technique. Realizing both deduplication and differential authorization duplicate check at the same time therefore seems contradictory.
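The convergent encryption mentioned above can be illustrated with a minimal Java sketch. It assumes the convergent key is the first 16 bytes of SHA-256 over the plaintext and uses AES in CBC mode with a fixed all-zero IV so that encryption is deterministic. The class name, method names, and these parameter choices are illustrative assumptions, not details taken from this project.

```java
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical helper class: a sketch of convergent encryption, not the project's code.
final class ConvergentEncryptionSketch {

    /** Derives the convergent key from the file content itself (assumed: SHA-256, truncated to 128 bits). */
    static byte[] convergentKey(byte[] plaintext) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(plaintext);
            return Arrays.copyOf(digest, 16);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Encrypts the file under its own convergent key, so equal files give equal ciphertexts. */
    static byte[] encrypt(byte[] plaintext) {
        return run(Cipher.ENCRYPT_MODE, convergentKey(plaintext), plaintext);
    }

    /** Decrypts with a convergent key that the owner kept from upload time. */
    static byte[] decrypt(byte[] key, byte[] ciphertext) {
        return run(Cipher.DECRYPT_MODE, key, ciphertext);
    }

    private static byte[] run(int mode, byte[] key, byte[] input) {
        try {
            Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
            // A fixed zero IV is an assumption made only to keep the output deterministic.
            cipher.init(mode, new SecretKeySpec(key, "AES"), new IvParameterSpec(new byte[16]));
            return cipher.doFinal(input);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because the key depends only on the content, two users who hold the same file produce the same ciphertext, which is what lets the cloud detect the duplicate without seeing the plaintext. The same property is why low-entropy files remain guessable under this approach.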

2.1.1 DISADVANTAGES:

Deduplication systems cannot support differential authorization duplicate check.
One critical challenge of cloud storage services is the management of the ever-increasing volume of data.
Users’ sensitive data are susceptible to both insider and outsider attacks.
Deduplication is sometimes impossible.

2.2 PROPOSED SYSTEM:

We propose an alternative design that strikes a balance between server side efficiency and user side efficiency. Before introducing the scheme’s details, we present two observations. First, the Merkle-tree-based POW schemes are I/O efficient at the server side because the Merkle tree root can be thought of as a compact summary of the file. Therefore, there is no need for the cloud to access the disk to retrieve the file. Second, the user side is computationally efficient in the s-POW schemes because the user is required only to answer a few bits of the file. With these two observations, our design strategy is to use a probabilistic data structure as the compact summary of the file, in contrast to the deterministic data structure (the Merkle hash tree) in the POW schemes. The query challenge is also changed to random blocks, in contrast to the random bits in s-POW schemes.

2.2.1 ADVANTAGES:

The performance of a POW scheme is determined by the bandwidth requirement, the I/O overhead at both user and server sides, and the computation overhead at both sides. Of these, the I/O overhead is less well known in POW design. More specifically, cloud storage usually has a storage hierarchy: the memory (primary storage) and disk (secondary storage).

The execution of a POW scheme might require the user and cloud to access the file stored in the disk multiple times. The server might also need to keep the verification object in either the memory or the disk to verify the user’s claim.

All of the above might result in a huge amount of I/O delay because of the access time gap between the memory and disk. In this article we focus only on the abuse of a file hash to gain the ownership of the file and aim to design a POW scheme with minimum performance overhead.

To prevent unauthorized access, a secure proof of ownership (POW) protocol is also needed to provide the proof that the user indeed owns the same file when a duplicate is found.
It keeps the overhead minimal compared with normal convergent encryption and file upload operations.
Data confidentiality is maintained.
It is secure compared with existing techniques.

2.3 HARDWARE & SOFTWARE REQUIREMENTS:

2.3.1 HARDWARE REQUIREMENT:

- Processor – Pentium IV
- Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Key Board – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA

2.3.2 SOFTWARE REQUIREMENTS:

JAVA

Operating System : Windows XP or Win7
Front End : JAVA JDK 1.7
Back End : MYSQL Server
Server : Apache Tomcat Server
Script : JSP Script
Document : MS-Office 2007

CHAPTER 3

3.0 SYSTEM DESIGN:

Data Flow Diagram / Use Case Diagram / Flow Diagram:

The DFD is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on these data, and the output data generated by the system.

The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components. These components are the system process, the data used by the process, an external entity that interacts with the system and the information flows in the system.

DFD shows how the information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.

A DFD may be used to represent a system at any level of abstraction. A DFD may be partitioned into levels that represent increasing information flow and functional detail.

NOTATION:

SOURCE OR DESTINATION OF DATA:

External sources or destinations, which may be people or organizations or other entities

DATA STORE:

Here the data referenced by a process is stored and retrieved.

PROCESS:

People, procedures, or devices that produce data. The physical component is not identified.

DATA FLOW:

Data moves in a specific direction from an origin to a destination. The data flow is a “packet” of data.

MODELING RULES:

There are several common modeling rules when creating DFDs:

All processes must have at least one data flow in and one data flow out.
All processes should modify the incoming data, producing new forms of outgoing data.
Each data store must be involved with at least one data flow.
Each external entity must be involved with at least one data flow.
A data flow must be attached to at least one process.

3.1 ARCHITECTURE DIAGRAM

3.2 DATAFLOW DIAGRAM

USER:

ADMIN:

UML DIAGRAMS:

3.2 USE CASE DIAGRAM:

3.3 CLASS DIAGRAM:

3.4 SEQUENCE DIAGRAM:

SENDER USER:

RECEIVER USER:

3.5 ACTIVITY DIAGRAM:

SENDER LOGIN:

RECEIVER LOGIN:

CHAPTER 4

4.0 IMPLEMENTATION:

MOBILE CLOUD COMPUTING:

Mobile Cloud Computing (MCC) is the combination of cloud computing, mobile computing, and wireless networks to bring rich computational resources to mobile users, network operators, and cloud computing providers. The ultimate goal of MCC is to enable execution of rich mobile applications on a plethora of mobile devices, with a rich user experience. MCC provides business opportunities for mobile network operators as well as cloud providers. MCC can be described as “a rich mobile computing technology that leverages unified elastic resources of varied clouds and network technologies toward unrestricted functionality, storage, and mobility to serve a multitude of mobile devices anywhere, anytime through the channel of Ethernet or Internet regardless of heterogeneous environments and platforms based on the pay-as-you-use principle.”

ARCHITECTURE:

MCC uses computational augmentation approaches by which resource-constrained mobile devices can utilize the computational resources of varied cloud-based resources. In MCC, there are four types of cloud-based resources, namely distant immobile clouds, proximate immobile computing entities, proximate mobile computing entities, and hybrid (a combination of the other three models). Giant clouds such as Amazon EC2 are in the distant immobile group, whereas cloudlets or surrogates are members of the proximate immobile computing entities. Smartphones, tablets, handheld devices, and wearable computing devices are part of the third group of cloud-based resources, the proximate mobile computing entities.

DIAGRAM:

In the MCC landscape, an amalgam of mobile computing, cloud computing, and communication networks (to augment smartphones) creates several complex challenges such as Mobile Computation Offloading, Seamless Connectivity, Long WAN Latency, Mobility Management, Context-Processing, Energy Constraint, Vendor/data Lock-in, Security and Privacy, Elasticity that hinder MCC success and adoption.

Although significant research and development in MCC is available in the literature, efforts in the following domains are still lacking:

Architectural issues: Reference architecture for heterogeneous MCC environment is a crucial requirement for unleashing the power of mobile computing towards unrestricted ubiquitous computing.
Energy-efficient transmission: MCC requires frequent transmissions between the cloud platform and mobile devices. Due to the stochastic nature of wireless networks, the transmission protocol should be carefully designed.
Context-awareness issues: Context-aware and socially-aware computing are inseparable traits of contemporary handheld computers. To achieve the vision of mobile computing among heterogeneous converged networks and computing devices, designing resource-efficient environment-aware applications is an essential need.
Live VM migration issues: Executing a resource-intensive mobile application via Virtual Machine (VM) migration-based application offloading involves encapsulating the application in a VM instance and migrating it to the cloud, which is a challenging task due to the additional overhead of deploying and managing a VM on mobile devices.
Mobile communication congestion issues: Mobile data traffic is rising sharply with ever-increasing mobile user demand for cloud resources, which affects mobile network operators and demands future efforts to enable smooth communication between mobile and cloud endpoints.
Trust, security, and privacy issues: Trust is an essential factor for the success of the burgeoning MCC paradigm.

PROOF OF OWNERSHIP:

An even more severe and direct security threat from using deduplicated cloud storage is that the adversary may gain ownership of files merely by eavesdropping on file hashes. A closer look at client side deduplication shows that anyone in possession of the file hash can gain ownership of the file by uploading the file hash. More specifically, the cloud treats the hash as a store request for a file already in storage, avoids the redundant file transmission, and then adds the user as an additional owner of the file. An illustrative example is shown in Fig. 3d. Such a situation is clearly undesirable: in theory, the adversary cannot infer the file content from the hash.

However, in this case, once the adversary knows the hash, it is able to download the entire file content. On the other hand, in practice, the user considers the hash harmless and in some cases publishes the hashes as timestamps. The publicly available hashes can then be abused to gain the file. This security weakness comes from using a static and short piece of information (the hash) as a way of claiming file ownership. Motivated by this observation, Halevi et al. [10] introduce the notion of proof of ownership (POW). A POW scheme is jointly executed by the cloud and the user such that the user can prove to the cloud that it is indeed in possession of the file.

4.1 ALGORITHM:

PUBLIC KEY INFRASTRUCTURE (PKI) AND PRIVATE KEY GENERATOR (PKG):

In cryptography, the ElGamal encryption system is an asymmetric key encryption algorithm for public-key cryptography which is based on the Diffie–Hellman key exchange. It was described by Taher Elgamal in 1985. ElGamal encryption is used in the free GNU Privacy Guard software, recent versions of PGP, and other cryptosystems. The DSA (Digital Signature Algorithm) is a variant of the ElGamal signature scheme, which should not be confused with ElGamal encryption. The security of the ElGamal scheme depends on the properties of the underlying group as well as any padding scheme used on the messages.

If the computational Diffie–Hellman assumption (CDH) holds in the underlying cyclic group, then the encryption function is one-way. If the decisional Diffie–Hellman assumption (DDH) holds in that group, then ElGamal achieves semantic security. Semantic security is not implied by the computational Diffie–Hellman assumption alone. See decisional Diffie–Hellman assumption for a discussion of groups where the assumption is believed to hold.

To achieve chosen-ciphertext security, the scheme must be further modified, or an appropriate padding scheme must be used. Depending on the modification, the DDH assumption may or may not be necessary.

Other schemes related to ElGamal which achieve security against chosen ciphertext attacks have also been proposed. The Cramer–Shoup cryptosystem is secure under chosen ciphertext attack assuming DDH holds for the underlying group. Its proof does not use the random oracle model. Another proposed scheme is DHAES, whose proof requires an assumption that is weaker than the DDH assumption.
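As a rough illustration of the ElGamal encryption described above, the following Java sketch implements the textbook scheme with `java.math.BigInteger`. The class name, the toy prime 2^31 − 1, and the base 7 are assumptions chosen only to keep the example short. They are far too small for real security, and the sketch includes no padding, so it does not provide chosen-ciphertext security.

```java
import java.math.BigInteger;
import java.security.SecureRandom;

// Hypothetical class: textbook ElGamal over a toy group, for illustration only.
final class ElGamalSketch {
    static final BigInteger P = BigInteger.valueOf(2147483647L); // toy prime 2^31 - 1 (assumed parameter)
    static final BigInteger G = BigInteger.valueOf(7);           // assumed base element
    private static final SecureRandom RNG = new SecureRandom();

    /** Picks a random exponent in the range [1, P - 2]. */
    static BigInteger randomExponent() {
        BigInteger max = P.subtract(BigInteger.valueOf(2));
        BigInteger x;
        do {
            x = new BigInteger(P.bitLength(), RNG);
        } while (x.signum() == 0 || x.compareTo(max) > 0);
        return x;
    }

    /** Public key h = g^x mod p for private key x. */
    static BigInteger publicKey(BigInteger x) {
        return G.modPow(x, P);
    }

    /** Encrypts m (1 <= m < p) as the pair {g^k mod p, m * h^k mod p} with a fresh random k. */
    static BigInteger[] encrypt(BigInteger h, BigInteger m) {
        BigInteger k = randomExponent();
        BigInteger c1 = G.modPow(k, P);
        BigInteger c2 = m.multiply(h.modPow(k, P)).mod(P);
        return new BigInteger[] { c1, c2 };
    }

    /** Recovers m = c2 * (c1^x)^-1 mod p. */
    static BigInteger decrypt(BigInteger x, BigInteger[] c) {
        BigInteger shared = c[0].modPow(x, P);
        return c[1].multiply(shared.modInverse(P)).mod(P);
    }
}
```

The shared value `c1^x = h^k` is the Diffie–Hellman secret, which is why the one-wayness of the scheme rests on the CDH assumption discussed above.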

4.2 MODULES:

SECURE USER MODULES:

DEDUPLICATED STORAGE:

CHECK DEDUPLICATES:

APPLY POW SCHEME:

SECURE SEND KEY:

4.3 MODULE DESCRIPTION:

SECURE USER MODULES:

In this module, users must be authenticated before they can access the details held in the ontology system. Before accessing or searching the details, a user must have an account; otherwise, the user must register first.

Registration
File View
Encryption
Download
Upload Files
Encrypt and save to cloud

DEDUPLICATED STORAGE:

Client side deduplication incurs its own security weaknesses. First, the privacy of the file’s existence in the cloud may be compromised because the adversary may try to upload candidate files to see whether deduplication takes place. If deduplication takes place, this is an indicator of the file’s existence. Otherwise, the adversary may infer the file’s nonexistence. The situation becomes even worse when we consider low-entropy files, because the adversary may exhaustively create different files and upload the hashes to check the file’s existence. For example, a curious colleague may query his/her manager’s salary by uploading different salary sheets, because the sheets are of a similar form, which restricts the number of file contents to be tested.

CHECK DEDUPLICATES:

Data deduplication can be categorized into two types depending on where the deduplication takes place: server (cloud) side deduplication and client (user) side deduplication. Server side deduplication is simple: the server, after receiving the file, checks whether it already has a copy in storage. The server discards the received file if it does, or creates a new file in the storage if it does not. We can see that server side deduplication does not produce bandwidth savings because the server performs the deduplication after the file has been received. On the other hand, client side deduplication adopts a more aggressive method: the user calculates and sends the hash of the file before uploading the file. Upon receiving the hash, the server checks whether the hash is already in storage. If so, the user is asked to send nothing and the server associates the user with the existing file. Otherwise, the user is asked to upload the file. An illustrative example is shown in Fig. 2, where user 1 first uploads files F1 and F2 in Fig. 2a.

Then the cloud knows from the hashes h(F1) and h(F3) sent by user 2 that there is already a copy of F1 in storage, and sends a positive acknowledgment for F1 and a negative acknowledgment for F3 to user 2. User 2, according to the acknowledgments, sends only F3, saving the transmission of F1. Public cloud storage services (e.g. Dropbox and Bitcasa) are more likely to adopt client side deduplication because of its storage and bandwidth savings. In particular, in addition to the reduced storage requirement, client side deduplication can also reduce the need for file transmission, reducing the waiting time for users and the energy consumption of the server. We particularly mention that even with the increased bandwidth of the coming 5G communication standard, the data rate of wireless links is still not comparable to that of wired links. Thus, if we consider mobile devices accessing cloud storage services, client side deduplication becomes an inevitable technique for MCC applications.
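The client side exchange described above can be sketched as a small in-memory model. The class and method names are hypothetical, SHA-256 is assumed as the file hash, and the maps stand in for the cloud's storage. This is an illustration of the message flow, not the project's implementation.

```java
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory model of a cloud that performs client side deduplication.
final class ClientSideDedupSketch {
    private final Map<String, byte[]> files = new HashMap<String, byte[]>();
    private final Map<String, Set<String>> owners = new HashMap<String, Set<String>>();

    /** Hash the user computes locally before contacting the cloud (assumed: SHA-256). */
    static String sha256Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(data);
            return String.format("%064x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Step 1: the user sends only the hash. A positive acknowledgment (true) means no upload is needed. */
    boolean checkDuplicate(String user, String hash) {
        Set<String> fileOwners = owners.get(hash);
        if (fileOwners == null) {
            return false; // negative acknowledgment: the user must upload the file
        }
        fileOwners.add(user); // ownership granted on the hash alone
        return true;
    }

    /** Step 2: performed only after a negative acknowledgment. */
    void upload(String user, byte[] file) {
        String hash = sha256Hex(file);
        if (!files.containsKey(hash)) {
            files.put(hash, file.clone());
            owners.put(hash, new HashSet<String>());
        }
        owners.get(hash).add(user);
    }

    boolean isOwner(String user, String hash) {
        Set<String> fileOwners = owners.get(hash);
        return fileOwners != null && fileOwners.contains(user);
    }

    int storedFileCount() {
        return files.size();
    }
}
```

Note that `checkDuplicate` grants ownership to whoever presents a known hash. That is exactly the weakness discussed under PROOF OF OWNERSHIP, and it is the step a POW protocol would have to guard.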

APPLY POW SCHEME:

The Merkle-tree-based POW schemes perform very well on the server side since only a small index (the tree root) needs to be stored in the main memory. However, the proof of ownership is achieved by the user sending an authentication path of size O(log |f|) to the cloud, resulting in more communication overhead and computation load on the cloud. The I/O overhead on the user side is also increased, compared with the s-POW schemes, because the user needs to retrieve the entire file. At the other extreme, although the s-POW schemes have great computation and I/O efficiency on the user side, their I/O burden on the cloud is significantly increased since the cloud is required to retrieve random bits from the secondary storage.
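To make the tree root and the O(log |f|) authentication path concrete, here is a minimal Merkle tree sketch in Java. The class name, the use of SHA-256, the fixed block size, and the rule of duplicating the last node on odd-sized levels are all assumptions made for illustration. The cited POW schemes may construct their trees differently.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical class: a plain Merkle tree over fixed-size file blocks.
final class MerkleTreeSketch {

    static byte[] sha256(byte[]... parts) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            for (byte[] part : parts) {
                md.update(part);
            }
            return md.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Hashes each block of the file into one leaf. */
    static List<byte[]> leaves(byte[] file, int blockSize) {
        List<byte[]> out = new ArrayList<byte[]>();
        for (int off = 0; off < file.length; off += blockSize) {
            out.add(sha256(Arrays.copyOfRange(file, off, Math.min(file.length, off + blockSize))));
        }
        if (out.isEmpty()) {
            out.add(sha256(new byte[0]));
        }
        return out;
    }

    /** Pairs up nodes; an unpaired last node is hashed with itself (assumed convention). */
    static List<byte[]> parentLevel(List<byte[]> level) {
        List<byte[]> next = new ArrayList<byte[]>();
        for (int i = 0; i < level.size(); i += 2) {
            byte[] right = (i + 1 < level.size()) ? level.get(i + 1) : level.get(i);
            next.add(sha256(level.get(i), right));
        }
        return next;
    }

    /** The compact summary the server keeps in memory. */
    static byte[] root(byte[] file, int blockSize) {
        List<byte[]> level = leaves(file, blockSize);
        while (level.size() > 1) {
            level = parentLevel(level);
        }
        return level.get(0);
    }

    /** Sibling hashes from the leaf at `index` up to the root: one per tree level. */
    static List<byte[]> authPath(byte[] file, int blockSize, int index) {
        List<byte[]> path = new ArrayList<byte[]>();
        List<byte[]> level = leaves(file, blockSize);
        while (level.size() > 1) {
            int sibling = index ^ 1;
            path.add(sibling < level.size() ? level.get(sibling) : level.get(index));
            level = parentLevel(level);
            index /= 2;
        }
        return path;
    }

    /** Server side check: recompute the root from one block and its path. */
    static boolean verify(byte[] root, byte[] block, int index, List<byte[]> path) {
        byte[] hash = sha256(block);
        for (byte[] sibling : path) {
            hash = (index % 2 == 0) ? sha256(hash, sibling) : sha256(sibling, hash);
            index /= 2;
        }
        return Arrays.equals(hash, root);
    }
}
```

The server only needs `root` to run `verify`, which matches the server side efficiency noted above, while the user must read the whole file to build the path.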

In this article we propose an alternative design that strikes a balance between server side efficiency and user side efficiency. Before introducing the scheme’s details, we present two observations. First, the Merkle-tree-based POW schemes are I/O efficient at the server side because the Merkle tree root can be thought of as a compact summary of the file. Therefore, there is no need for the cloud to access the disk to retrieve the file. Second, the user side is computationally efficient in the s-POW schemes because the user is required only to answer a few bits of the file. With these two observations, our design strategy is to use a probabilistic data structure as the compact summary of the file, in contrast to the deterministic data structure (the Merkle hash tree) in the POW schemes. The query challenge is also changed to random blocks, in contrast to the random bits in s-POW schemes.
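One way to picture a probabilistic summary with random-block challenges is a Bloom filter built over the file's blocks. The text above does not name the data structure, so the Bloom filter, the class name, the double-hashing scheme, and the parameters `m` (bits) and `k` (hash positions) are all assumptions in this sketch.

```java
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical class: a Bloom filter over (index, block) pairs as the server's file summary.
final class BloomPowSketch {
    private final BitSet bits;
    private final int m; // number of bits in the filter (assumed parameter)
    private final int k; // number of positions set per block (assumed parameter)

    /** Built once by the server when the file is first uploaded. */
    BloomPowSketch(byte[] file, int blockSize, int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
        int index = 0;
        for (int off = 0; off < file.length; off += blockSize, index++) {
            byte[] block = Arrays.copyOfRange(file, off, Math.min(file.length, off + blockSize));
            for (int pos : positions(index, block)) {
                bits.set(pos);
            }
        }
    }

    /** Derives k bit positions from SHA-256(index || block) by double hashing. */
    private int[] positions(int index, byte[] block) {
        ByteBuffer digest = ByteBuffer.wrap(sha256(index, block));
        long h1 = digest.getLong();
        long h2 = digest.getLong();
        int[] out = new int[k];
        for (int j = 0; j < k; j++) {
            long mixed = h1 + (long) j * h2;
            out[j] = (int) (((mixed % m) + m) % m); // map into [0, m)
        }
        return out;
    }

    private static byte[] sha256(int index, byte[] block) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            md.update(ByteBuffer.allocate(4).putInt(index).array());
            md.update(block);
            return md.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Checks the user's answer to a challenge on block `index` without touching the disk. */
    boolean verifyBlock(int index, byte[] block) {
        for (int pos : positions(index, block)) {
            if (!bits.get(pos)) {
                return false;
            }
        }
        return true;
    }
}
```

In such a design the server would challenge several randomly chosen block indices and call `verifyBlock` on each answer. Because a Bloom filter has false positives, a wrong block passes with small probability, which is the price of a summary that is probabilistic rather than deterministic.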

SECURE SEND KEY:

Once the key request is received, the sender can either send the key or decline the request. With this key and the request id, which was generated at the time the key request was sent, the receiver can decrypt the message.

CHAPTER 5

5.0 SYSTEM STUDY:

5.1 FEASIBILITY STUDY:

The feasibility of the project is analyzed in this phase and business proposal is put forth with a very general plan for the project and some cost estimates. During system analysis the feasibility study of the proposed system is to be carried out. This is to ensure that the proposed system is not a burden to the company. For feasibility analysis, some understanding of the major requirements for the system is essential.

Three key considerations involved in the feasibility analysis are

ECONOMICAL FEASIBILITY
TECHNICAL FEASIBILITY
SOCIAL FEASIBILITY

5.1.1 ECONOMICAL FEASIBILITY:

This study is carried out to check the economic impact that the system will have on the organization. The amount of funds that the company can pour into the research and development of the system is limited. The expenditures must be justified. The developed system is well within the budget, and this was achieved because most of the technologies used are freely available. Only the customized products had to be purchased.

5.1.2 TECHNICAL FEASIBILITY

This study is carried out to check the technical feasibility, that is, the technical requirements of the system. Any system developed must not place a high demand on the available technical resources, since this would in turn place high demands on the client. The developed system must have modest requirements, as only minimal or no changes are required for implementing this system.

5.1.3 SOCIAL FEASIBILITY:

This aspect of the study checks the level of acceptance of the system by the user. This includes the process of training the user to use the system efficiently. The user must not feel threatened by the system, but must instead accept it as a necessity. The level of acceptance by the users depends solely on the methods that are employed to educate the user about the system and to make him familiar with it. His level of confidence must be raised so that he is also able to make some constructive criticism, which is welcomed, as he is the final user of the system.

5.2 SYSTEM TESTING:

Errors that surface only after a system goes into use create two problems: the time lag between the cause and the appearance of the problem, and the effect of the system errors on the files and records within the system. A small system error can conceivably explode into a much larger problem. Effective testing early in the process translates directly into long-term cost savings from a reduced number of errors. Another reason for system testing is its utility as a user-oriented vehicle before implementation. The best programs are worthless if they do not produce the correct outputs.

5.2.1 UNIT TESTING:

Description	Expected result
Test for application window properties.	All the properties of the windows are to be properly aligned and displayed.
Test for mouse operations.	All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions.

A program represents the logical elements of a system. For a program to run satisfactorily, it must compile and test data correctly and tie in properly with other programs. Achieving an error-free program is the responsibility of the programmer. Program testing checks for two types of errors: syntax and logical. A syntax error is a program statement that violates one or more rules of the language in which it is written. An improperly defined field dimension or omitted keywords are common syntax errors. These errors are shown through error messages generated by the computer. For logic errors, the programmer must examine the output carefully.

5.2.2 FUNCTIONAL TESTING:

Functional testing of an application is used to prove that the application delivers correct results, using enough inputs to give an adequate level of confidence that it will work correctly for all sets of inputs. The functional testing will need to prove that the application works for each client type and that the personalization functions work correctly. When a program is tested, the actual output is compared with the expected output. When there is a discrepancy, the sequence of instructions must be traced to determine the problem. The process is facilitated by breaking the program into self-contained portions, each of which can be checked at certain key points. The idea is to compare program values against desk-calculated values to isolate the problems.

Description	Expected result
Test for all modules.	All peers should communicate in the group.
Test for various peers in a distributed network framework, which should display all users available in the group.	The result after execution should be accurate.

5.2.3 NON-FUNCTIONAL TESTING:

Non-functional software testing encompasses a rich spectrum of testing strategies, describing the expected results for every test case. It uses symbolic analysis techniques. This testing is used to check that an application will work in the operational environment. Non-functional testing includes:

Load testing
Performance testing
Usability testing
Reliability testing
Security testing

5.2.4 LOAD TESTING:

An important tool for implementing system tests is a load generator. A load generator is essential for testing quality requirements such as performance and stress. The load can be a real load; that is, the system can be tested under real usage by having actual telephone users connected to it. They will generate test input data for the system test.

Description	Expected result
It is necessary to ascertain that the application behaves correctly under loads when ‘Server busy’ response is received.	Should designate another active node as a Server.

5.2.5 PERFORMANCE TESTING:

Performance tests are utilized in order to determine the widely defined performance of the software system such as execution time associated with various parts of the code, response time and device utilization. The intent of this testing is to identify weak points of the software system and quantify its shortcomings.

Description	Expected result
This is required to assure that an application performs adequately, having the capability to handle many peers, delivering its results in the expected time, and using an acceptable level of resources; it is an aspect of operational management.	Should handle large input values and produce accurate results in the expected time.

5.2.6 RELIABILITY TESTING:

Software reliability is the ability of a system or component to perform its required functions under stated conditions for a specified period of time, and it is ensured in this testing. Reliability can be expressed as the ability of the software to reveal defects under testing conditions, according to the specified requirements. It is the probability that a software system will operate without failure under given conditions for a given time interval, and it focuses on the behavior of the software element. It forms a part of the work of the software quality control team.

Description	Expected result
This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in providing the application.	In case of failure of the server, an alternate server should take over the job.

5.2.7 SECURITY TESTING:

Security testing evaluates system characteristics that relate to the availability, integrity and confidentiality of the system data and services. Users/Clients should be encouraged to make sure their security needs are very clearly known at requirements time, so that the security issues can be addressed by the designers and testers.

Description	Expected result
Checking that the user identification is authenticated.	In case of failure, the user should not be connected to the framework.
Check whether group keys in a tree are shared by all peers.	The peers should know group key in the same group.

5.2.8 WHITE BOX TESTING:

White box testing, sometimes called glass-box testing, is a test case design method that uses the control structure of the procedural design to derive test cases. White box testing focuses on the inner structure of the software to be tested.

Description	Expected result
Exercise all logical decisions on their true and false sides.	All the logical decisions must be valid.
Execute all loops at their boundaries and within their operational bounds.	All the loops must be finite.
Exercise internal data structures to ensure their validity.	All the data structures must be valid.

5.2.9 BLACK BOX TESTING:

Black box testing, also called behavioral testing, focuses on the functional requirements of the software. That is, black box testing enables the software engineer to derive sets of input conditions that will fully exercise all functional requirements for a program. Black box testing is not an alternative to white box techniques. Rather, it is a complementary approach that is likely to uncover a different class of errors than white box methods. Black box testing attempts to find errors by focusing on the inputs, outputs, and principal function of a software module. The starting point of black box testing is either a specification or code. The contents of the box are hidden, and the stimulated software should produce the desired results.

Description	Expected result
To check for incorrect or missing functions.	All the functions must be valid.
To check for interface errors.	The entire interface must function normally.
To check for errors in data structures or external database access.	Database updates and retrieval must be performed correctly.
To check for initialization and termination errors.	All the functions and data structures must be initialized properly and terminated normally.

All of the above system testing strategies are carried out, as the development, documentation, and institutionalization of the proposed goals and related policies are essential.

CHAPTER 6

6.0 SOFTWARE DESCRIPTION:

6.1 JAVA TECHNOLOGY:

Java technology is both a programming language and a platform.

The Java Programming Language

The Java programming language is a high-level language that can be characterized by all of the following buzzwords:

- Simple
- Architecture neutral
- Object oriented
- Portable
- Distributed
- High performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure

With most programming languages, you either compile or interpret a program so that you can run it on your computer. The Java programming language is unusual in that a program is both compiled and interpreted. With the compiler, first you translate a program into an intermediate language called Java byte codes —the platform-independent codes interpreted by the interpreter on the Java platform. The interpreter parses and runs each Java byte code instruction on the computer. Compilation happens just once; interpretation occurs each time the program is executed. The following figure illustrates how this works.

You can think of Java byte codes as the machine code instructions for the Java Virtual Machine (Java VM). Every Java interpreter, whether it’s a development tool or a Web browser that can run applets, is an implementation of the Java VM. Java byte codes help make “write once, run anywhere” possible. You can compile your program into byte codes on any platform that has a Java compiler. The byte codes can then be run on any implementation of the Java VM. That means that as long as a computer has a Java VM, the same program written in the Java programming language can run on Windows 2000, a Solaris workstation, or on an iMac.

6.2 THE JAVA PLATFORM:

A platform is the hardware or software environment in which a program runs. We’ve already mentioned some of the most popular platforms like Windows 2000, Linux, Solaris, and MacOS. Most platforms can be described as a combination of the operating system and hardware. The Java platform differs from most other platforms in that it’s a software-only platform that runs on top of other hardware-based platforms.

The Java platform has two components:

The Java Virtual Machine (Java VM)
The Java Application Programming Interface (Java API)

You’ve already been introduced to the Java VM. It’s the base for the Java platform and is ported onto various hardware-based platforms.

The Java API is a large collection of ready-made software components that provide many useful capabilities, such as graphical user interface (GUI) widgets. The Java API is grouped into libraries of related classes and interfaces; these libraries are known as packages. The next section, What Can Java Technology Do?, highlights what functionality some of the packages in the Java API provide.

The following figure depicts a program that’s running on the Java platform. As the figure shows, the Java API and the virtual machine insulate the program from the hardware.

Native code is code that, once compiled, runs on a specific hardware platform. As a platform-independent environment, the Java platform can be a bit slower than native code. However, smart compilers, well-tuned interpreters, and just-in-time byte code compilers can bring performance close to that of native code without threatening portability.

6.3 WHAT CAN JAVA TECHNOLOGY DO?

The most common types of programs written in the Java programming language are applets and applications. If you’ve surfed the Web, you’re probably already familiar with applets. An applet is a program that adheres to certain conventions that allow it to run within a Java-enabled browser.

However, the Java programming language is not just for writing cute, entertaining applets for the Web. The general-purpose, high-level Java programming language is also a powerful software platform. Using the generous API, you can write many types of programs.

An application is a standalone program that runs directly on the Java platform. A special kind of application known as a server serves and supports clients on a network. Examples of servers are Web servers, proxy servers, mail servers, and print servers. Another specialized program is a servlet.

A servlet can almost be thought of as an applet that runs on the server side. Java Servlets are a popular choice for building interactive web applications, replacing the use of CGI scripts. Servlets are similar to applets in that they are runtime extensions of applications. Instead of working in browsers, though, servlets run within Java Web servers, configuring or tailoring the server.

How does the API support all these kinds of programs? It does so with packages of software components that provide a wide range of functionality. Every full implementation of the Java platform gives you the following features:

The essentials: Objects, strings, threads, numbers, input and output, data structures, system properties, date and time, and so on.
Applets: The set of conventions used by applets.
Networking: URLs, TCP (Transmission Control Protocol), UDP (User Datagram Protocol) sockets, and IP (Internet Protocol) addresses.
Internationalization: Help for writing programs that can be localized for users worldwide. Programs can automatically adapt to specific locales and be displayed in the appropriate language.
Security: Both low level and high level, including electronic signatures, public and private key management, access control, and certificates.
Software components: Known as JavaBeans™, these can plug into existing component architectures.
Object serialization: Allows lightweight persistence and communication via Remote Method Invocation (RMI).
Java Database Connectivity (JDBC™): Provides uniform access to a wide range of relational databases.
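As an illustration of the object serialization feature listed above, the following is a minimal sketch that writes an object to a byte array and reads it back. The class `SerializationSketch` and its nested `Reading` type are hypothetical names introduced only for this example; the stream classes come from the standard `java.io` package.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class SerializationSketch {
    // Hypothetical value class, used only for this illustration.
    static class Reading implements Serializable {
        private static final long serialVersionUID = 1L;
        final String sensor;
        final double value;

        Reading(String sensor, double value) {
            this.sensor = sensor;
            this.value = value;
        }
    }

    // Serializes the object into an in-memory byte array.
    static byte[] toBytes(Serializable object) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(object);
        } catch (IOException e) {
            throw new IllegalStateException("serialization failed", e);
        }
        return buffer.toByteArray();
    }

    // Rebuilds an object from bytes produced by toBytes.
    static Object fromBytes(byte[] bytes) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("deserialization failed", e);
        }
    }

    public static void main(String[] args) {
        Reading copy = (Reading) fromBytes(toBytes(new Reading("temp-1", 21.5)));
        System.out.println(copy.sensor + " = " + copy.value); // prints "temp-1 = 21.5"
    }
}
```

The same byte stream could be written to a file or sent over a socket, which is the sense in which serialization supports lightweight persistence and remote communication.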

The Java platform also has APIs for 2D and 3D graphics, accessibility, servers, collaboration, telephony, speech, animation, and more. The following figure depicts what is included in the Java 2 SDK.

6.4 HOW WILL JAVA TECHNOLOGY CHANGE MY LIFE?

We can’t promise you fame, fortune, or even a job if you learn the Java programming language. Still, it is likely to make your programs better and to require less effort than other languages. We believe that Java technology will help you do the following:

Get started quickly: Although the Java programming language is a powerful object-oriented language, it’s easy to learn, especially for programmers already familiar with C or C++.
Write less code: Comparisons of program metrics (class counts, method counts, and so on) suggest that a program written in the Java programming language can be four times smaller than the same program in C++.
Write better code: The Java programming language encourages good coding practices, and its garbage collection helps you avoid memory leaks. Its object orientation, its JavaBeans component architecture, and its wide-ranging, easily extendible API let you reuse other people’s tested code and introduce fewer bugs.
Develop programs more quickly: Your development may be as much as twice as fast as writing the same program in C++. Why? You write fewer lines of code, and Java is a simpler programming language than C++.
Avoid platform dependencies with 100% Pure Java: You can keep your program portable by avoiding the use of libraries written in other languages. The 100% Pure Java™ Product Certification Program has a repository of historical process manuals, white papers, brochures, and similar materials online.
Write once, run anywhere: Because 100% Pure Java programs are compiled into machine-independent byte codes, they run consistently on any Java platform.
Distribute software more easily: You can upgrade applets easily from a central server. Applets take advantage of the feature of allowing new classes to be loaded “on the fly,” without recompiling the entire program.

6.5 ODBC:

Microsoft Open Database Connectivity (ODBC) is a standard programming interface for application developers and database systems providers. Before ODBC became a de facto standard for Windows programs to interface with database systems, programmers had to use proprietary languages for each database they wanted to connect to. Now, ODBC has made the choice of the database system almost irrelevant from a coding perspective, which is as it should be. Application developers have much more important things to worry about than the syntax that is needed to port their program from one database to another when business needs suddenly change.

Through the ODBC Administrator in Control Panel, you can specify the particular database that is associated with a data source that an ODBC application program is written to use. Think of an ODBC data source as a door with a name on it. Each door will lead you to a particular database. For example, the data source named Sales Figures might be a SQL Server database, whereas the Accounts Payable data source could refer to an Access database. The physical database referred to by a data source can reside anywhere on the LAN.

The ODBC system files are not installed on your system by Windows 95. Rather, they are installed when you setup a separate database application, such as SQL Server Client or Visual Basic 4.0. When the ODBC icon is installed in Control Panel, it uses a file called ODBCINST.DLL. It is also possible to administer your ODBC data sources through a stand-alone program called ODBCADM.EXE. There is a 16-bit and a 32-bit version of this program and each maintains a separate list of ODBC data sources.

From a programming perspective, the beauty of ODBC is that the application can be written to use the same set of function calls to interface with any data source, regardless of the database vendor. The source code of the application doesn’t change whether it talks to Oracle or SQL Server. We only mention these two as an example. There are ODBC drivers available for several dozen popular database systems. Even Excel spreadsheets and plain text files can be turned into data sources. The operating system uses the Registry information written by ODBC Administrator to determine which low-level ODBC drivers are needed to talk to the data source (such as the interface to Oracle or SQL Server). The loading of the ODBC drivers is transparent to the ODBC application program. In a client/server environment, the ODBC API even handles many of the network issues for the application programmer.

The advantages of this scheme are so numerous that you are probably thinking there must be some catch. The only disadvantage of ODBC is that it isn’t as efficient as talking directly to the native database interface. ODBC has had many detractors make the charge that it is too slow. Microsoft has always claimed that the critical factor in performance is the quality of the driver software that is used. In our humble opinion, this is true. The availability of good ODBC drivers has improved a great deal recently. And anyway, the criticism about performance is somewhat analogous to those who said that compilers would never match the speed of pure assembly language. Maybe not, but the compiler (or ODBC) gives you the opportunity to write cleaner programs, which means you finish sooner. Meanwhile, computers get faster every year.

6.6 JDBC:

In an effort to set an independent database standard API for Java, Sun Microsystems developed Java Database Connectivity, or JDBC. JDBC offers a generic SQL database access mechanism that provides a consistent interface to a variety of RDBMSs. This consistent interface is achieved through the use of “plug-in” database connectivity modules, or drivers. If a database vendor wishes to have JDBC support, he or she must provide the driver for each platform that the database and Java run on.

To gain a wider acceptance of JDBC, Sun based JDBC’s framework on ODBC. As you discovered earlier in this chapter, ODBC has widespread support on a variety of platforms. Basing JDBC on ODBC will allow vendors to bring JDBC drivers to market much faster than developing a completely new connectivity solution.

JDBC was announced in March of 1996. It was released for a 90 day public review that ended June 8, 1996. Because of user input, the final JDBC v1.0 specification was released soon after.

The remainder of this section will cover enough information about JDBC for you to know what it is about and how to use it effectively. This is by no means a complete overview of JDBC. That would fill an entire book.

6.7 JDBC Goals:

Few software packages are designed without goals in mind. JDBC is one that, because of its many goals, drove the development of the API. These goals, in conjunction with early reviewer feedback, have finalized the JDBC class library into a solid framework for building database applications in Java.

The goals that were set for JDBC are important. They will give you some insight as to why certain classes and functionalities behave the way they do. The design goals for JDBC are as follows:

SQL Level API

The designers felt that their main goal was to define a SQL interface for Java. Although not the lowest database interface level possible, it is at a low enough level for higher-level tools and APIs to be created. Conversely, it is at a high enough level for application programmers to use it confidently. Attaining this goal allows for future tool vendors to “generate” JDBC code and to hide many of JDBC’s complexities from the end user.

SQL Conformance

SQL syntax varies as you move from database vendor to database vendor. In an effort to support a wide variety of vendors, JDBC will allow any query statement to be passed through it to the underlying database driver. This allows the connectivity module to handle non-standard functionality in a manner that is suitable for its users.

JDBC must be implementable on top of common database interfaces

The JDBC SQL API must “sit” on top of other common SQL level APIs. This goal allows JDBC to use existing ODBC level drivers by the use of a software interface. This interface would translate JDBC calls to ODBC and vice versa.

Provide a Java interface that is consistent with the rest of the Java system

Because of Java’s acceptance in the user community thus far, the designers feel that they should not stray from the current design of the core Java system.

Keep it simple

This goal probably appears in all software design goal listings. JDBC is no exception. Sun felt that the design of JDBC should be very simple, allowing for only one method of completing a task per mechanism. Allowing duplicate functionality only serves to confuse the users of the API.

Use strong, static typing wherever possible

Strong typing allows more error checking to be done at compile time, so fewer errors appear at runtime.

Keep the common cases simple

Because the usual SQL calls used by the programmer are, more often than not, simple SELECTs, INSERTs, DELETEs and UPDATEs, these queries should be simple to perform with JDBC. However, more complex SQL statements should also be possible.
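To make the "common case" concrete, here is a minimal sketch of a simple parameterized SELECT through the standard `java.sql` API. The table `cache_entries`, its columns, and the URL `jdbc:example:demo` are hypothetical placeholders and are not taken from the project; a real run needs a JDBC driver for the chosen database on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

class JdbcSelectSketch {
    // Runs a simple parameterized SELECT and collects one column.
    // Table and column names are hypothetical.
    static List<String> keysOlderThan(Connection conn, long cutoffMillis) throws SQLException {
        String sql = "SELECT entry_key FROM cache_entries WHERE updated_at < ?";
        List<String> keys = new ArrayList<>();
        try (PreparedStatement stmt = conn.prepareStatement(sql)) {
            stmt.setLong(1, cutoffMillis);
            try (ResultSet rows = stmt.executeQuery()) {
                while (rows.next()) {
                    keys.add(rows.getString("entry_key"));
                }
            }
        }
        return keys;
    }

    // Opens a connection for the given JDBC URL and runs the query.
    // For this sketch only, any SQLException (including a missing driver)
    // is reported on stderr and an empty list is returned.
    static List<String> tryQuery(String jdbcUrl, long cutoffMillis) {
        try (Connection conn = DriverManager.getConnection(jdbcUrl)) {
            return keysOlderThan(conn, cutoffMillis);
        } catch (SQLException e) {
            System.err.println("Query skipped: " + e.getMessage());
            return new ArrayList<>();
        }
    }

    public static void main(String[] args) {
        // Placeholder URL; replace it with the URL of an installed driver.
        System.out.println(tryQuery("jdbc:example:demo", System.currentTimeMillis()));
    }
}
```

The query method takes a `Connection` rather than a URL so that the same code works with whichever driver is plugged in, which mirrors the driver-independence goal described above.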

Finally, we decided to proceed with the implementation using Java Networking.

And for dynamically updating the cache table we go for MS Access database.

Java is two things: a programming language and a platform.

Java is a high-level programming language that is all of the following:

Simple
Architecture-neutral
Object-oriented
Portable
Distributed
High-performance
Interpreted
Multithreaded
Robust
Dynamic
Secure

Java is also unusual in that each Java program is both compiled and interpreted. With a compiler, you translate a Java program into an intermediate language called Java byte codes. These platform-independent instructions are then parsed and run on the computer by the interpreter.

Compilation happens just once; interpretation occurs each time the program is executed. The figure illustrates how this works.

6.7 NETWORKING TCP/IP STACK:

The TCP/IP stack is shorter than the OSI one:

TCP is a connection-oriented protocol; UDP (User Datagram Protocol) is a connectionless protocol.

IP datagrams:

The IP layer provides a connectionless and unreliable delivery system. It considers each datagram independently of the others. Any association between datagrams must be supplied by the higher layers. The IP layer supplies a checksum that includes its own header. The header includes the source and destination addresses. The IP layer handles routing through an Internet. It is also responsible for breaking up large datagrams into smaller ones for transmission and reassembling them at the other end.

UDP:

UDP is also connectionless and unreliable. What it adds to IP is a checksum for the contents of the datagram and port numbers. These are used to give a client/server model – see later.
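The connectionless model can be illustrated with a minimal sketch that sends a single datagram to a port on the loopback interface using the standard `java.net` classes. The class name `UdpLoopbackSketch` and the two-second timeout are assumptions made for this example; no connection is set up before the datagram is sent.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

class UdpLoopbackSketch {
    // Sends one datagram to a receiver bound on the loopback interface
    // and returns the text that arrived.
    static String sendAndReceive(String message) {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (DatagramSocket receiver = new DatagramSocket(0, loopback); // port 0: any free port
             DatagramSocket sender = new DatagramSocket()) {
            receiver.setSoTimeout(2000); // UDP gives no delivery guarantee, so do not wait forever
            byte[] payload = message.getBytes(StandardCharsets.UTF_8);
            // The destination address and port travel with every datagram.
            sender.send(new DatagramPacket(payload, payload.length, loopback, receiver.getLocalPort()));
            DatagramPacket incoming = new DatagramPacket(new byte[1024], 1024);
            receiver.receive(incoming);
            return new String(incoming.getData(), 0, incoming.getLength(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sendAndReceive("ping"));
    }
}
```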

TCP:

TCP supplies logic to give a reliable connection-oriented protocol above IP. It provides a virtual circuit that two processes can use to communicate.

Internet addresses

In order to use a service, you must be able to find it. The Internet uses an address scheme for machines so that they can be located. The address is a 32 bit integer which gives the IP address.

Network address:

Class A uses 8 bits for the network address with 24 bits left over for other addressing. Class B uses 16 bit network addressing. Class C uses 24 bit network addressing and class D uses all 32.

Subnet address:

Internally, the UNIX network is divided into sub networks. Building 11 is currently on one sub network and uses 10-bit addressing, allowing 1024 different hosts.

Host address:

8 bits are finally used for host addresses within our subnet. This places a limit of 256 machines that can be on the subnet.

Total address:

The 32 bit address is usually written as 4 integers separated by dots.
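The addressing rules above can be sketched in a few lines: the 32-bit address is split into four 8-bit integers for the dotted form, and the classful network class follows from the leading bits of the first integer. The class and method names below are hypothetical, and the sample address is chosen only for illustration.

```java
class Ipv4AddressSketch {
    // Formats a 32-bit address as four decimal integers separated by dots.
    static String toDottedQuad(int address) {
        return ((address >>> 24) & 0xFF) + "." + ((address >>> 16) & 0xFF) + "."
                + ((address >>> 8) & 0xFF) + "." + (address & 0xFF);
    }

    // Returns the classful address class from the leading bits of the first octet.
    static char addressClass(int address) {
        int first = (address >>> 24) & 0xFF;
        if (first < 128) return 'A'; // leading bit 0: 8-bit network address
        if (first < 192) return 'B'; // leading bits 10: 16-bit network address
        if (first < 224) return 'C'; // leading bits 110: 24-bit network address
        if (first < 240) return 'D'; // leading bits 1110
        return 'E';                  // leading bits 1111
    }

    public static void main(String[] args) {
        int address = 0xC0A8010A;
        // prints "192.168.1.10 is class C"
        System.out.println(toDottedQuad(address) + " is class " + addressClass(address));
    }
}
```

With 8 bits left for the host part, as in the subnet described above, the last integer of the dotted form ranges over 256 values.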

Port addresses

A service exists on a host, and is identified by its port. This is a 16 bit number. To send a message to a server, you send it to the port for that service of the host that it is running on. This is not location transparency! Certain of these ports are “well known”.

Sockets:

A socket is a data structure maintained by the system to handle network connections. A socket is created using the call socket. It returns an integer that is like a file descriptor. In fact, under Windows, this handle can be used with the ReadFile and WriteFile functions.

#include <sys/types.h>

#include <sys/socket.h>

int socket(int family, int type, int protocol);

Here “family” will be AF_INET for IP communications, protocol will be zero, and type will depend on whether TCP or UDP is used. Two processes wishing to communicate over a network create a socket each. These are similar to two ends of a pipe – but the actual pipe does not yet exist.
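In Java, the same idea is exposed through `java.net.ServerSocket` and `java.net.Socket` rather than the C call shown above. The following is a minimal sketch, assuming both ends live in one process on the loopback interface so that it runs without a second machine; the class name `TcpSocketPairSketch` and the `echo: ` reply format are assumptions made for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

class TcpSocketPairSketch {
    // Creates the two socket ends, sends one line from client to server,
    // and returns the reply that the server end sends back.
    static String exchange(String request) {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket listener = new ServerSocket(0, 1, loopback); // port 0: any free port
             Socket client = new Socket(loopback, listener.getLocalPort());
             Socket server = listener.accept()) { // the connection now exists
            client.setSoTimeout(2000);
            server.setSoTimeout(2000);
            writer(client).println(request);
            String received = reader(server).readLine();
            writer(server).println("echo: " + received);
            return reader(client).readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Auto-flushing text writer over the socket's output stream.
    private static PrintWriter writer(Socket socket) throws IOException {
        return new PrintWriter(
                new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.UTF_8), true);
    }

    // Line-oriented reader over the socket's input stream.
    private static BufferedReader reader(Socket socket) throws IOException {
        return new BufferedReader(
                new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        System.out.println(exchange("hello")); // prints "echo: hello"
    }
}
```

In a real client/server program the two ends would run in separate processes, and the server would typically hand each accepted socket to its own thread.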

6.8 JFREE CHART:

JFreeChart is a free 100% Java chart library that makes it easy for developers to display professional quality charts in their applications. JFreeChart’s extensive feature set includes:

A consistent and well-documented API, supporting a wide range of chart types;

A flexible design that is easy to extend, and targets both server-side and client-side applications;

Support for many output types, including Swing components, image files (including PNG and JPEG), and vector graphics file formats (including PDF, EPS and SVG);

JFreeChart is “open source” or, more specifically, free software. It is distributed under the terms of the GNU Lesser General Public Licence (LGPL), which permits use in proprietary applications.

6.8.1. Map Visualizations:

Charts showing values that relate to geographical areas. Some examples include: (a) population density in each state of the United States, (b) income per capita for each country in Europe, (c) life expectancy in each country of the world. The tasks in this project include: Sourcing freely redistributable vector outlines for the countries of the world, states/provinces in particular countries (USA in particular, but also other areas);

Creating an appropriate dataset interface (plus default implementation) and a renderer, and integrating these with the existing XYPlot class in JFreeChart; Testing, documenting, testing some more, documenting some more.

6.8.2. Time Series Chart Interactivity

Implement a new (to JFreeChart) feature for interactive time series charts — to display a separate control that shows a small version of ALL the time series data, with a sliding “view” rectangle that allows you to select the subset of the time series data to display in the main chart.

6.8.3. Dashboards

There is currently a lot of interest in dashboard displays. Create a flexible dashboard mechanism that supports a subset of JFreeChart chart types (dials, pies, thermometers, bars, and lines/time series) that can be delivered easily via both Java Web Start and an applet.

6.8.4. Property Editors

The property editor mechanism in JFreeChart only handles a small subset of the properties that can be set for charts. Extend (or reimplement) this mechanism to provide greater end-user control over the appearance of the charts.

CHAPTER 7

7.0 APPENDIX

7.1 SAMPLE SCREEN SHOTS:

7.2 SAMPLE SOURCE CODE:

CHAPTER 8

8.1 CONCLUSION AND FUTURE:

We propose an alternative POW design to address the problem of unauthorized file downloading in deduplicated cloud storage. In our design, the use of a probabilistic data structure, the Bloom filter, is the primary contributor to the overhead reduction. Since the Bloom filter has been used widely in various applications and is easy to implement, our proposed POW scheme is considered realistic and can be deployed in real-world cloud storage services. Although the Bloom filter reduces the I/O needs, its size may grow with the number of files stored in the cloud. The Bloom filter may also become so large that it needs to be partitioned, with part of it stored on disk. Thus, one possible future research focus is to develop a more succinct data structure or devise a new index mechanism such that the index (the Bloom filter in this article) can fit into memory even in the case of a huge number of files in the cloud.
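For readers unfamiliar with the data structure named above, the following is a minimal, generic Bloom filter sketch. It is not the POW scheme itself: the class name, the double-hashing index derivation, and the sizes used in `main` are assumptions made for this illustration.

```java
import java.util.BitSet;

class BloomFilterSketch {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    BloomFilterSketch(int size, int hashCount) {
        if (size <= 0 || hashCount <= 0) {
            throw new IllegalArgumentException("size and hashCount must be positive");
        }
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Derives the i-th bit position from two base hashes (double hashing).
    private int index(String item, int i) {
        int h1 = item.hashCode();
        int h2 = Integer.rotateLeft(h1 * 0x9E3779B9, 15) | 1; // second hash, forced odd
        return Math.floorMod(h1 + i * h2, size);
    }

    // Sets every bit position derived from the item.
    void add(String item) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(item, i));
        }
    }

    // false means "definitely absent"; true means "possibly present".
    boolean mightContain(String item) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(item, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BloomFilterSketch filter = new BloomFilterSketch(1 << 16, 4);
        filter.add("file-block-0001");
        System.out.println(filter.mightContain("file-block-0001")); // prints "true"
    }
}
```

The bit array has a fixed size chosen at construction, which is why, as noted above, a filter sized for a very large number of files may no longer fit in memory.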

Privacy-Preserving Detection of Sensitive Data Exposure

05/08/201902/07/2019 by admin

An initiative data prefetching scheme on the storage servers in distributed file systems for cloud computing. In this prefetching technique, the client machines are not substantially involved in the process of data prefetching, but the storage servers can directly prefetch the data after analyzing the history of disk I/O access events, and then send the prefetched data to the relevant client machines proactively. To put this technique to work, the information about client nodes is piggybacked onto the real client I/O requests, and then forwarded to the relevant storage server. Next, two prediction algorithms have been proposed to forecast future block access operations for directing what data should be fetched on storage servers in advance.

Finally, the prefetched data can be pushed to the relevant client machine from the storage server. Through a series of evaluation experiments with a collection of application benchmarks, we have demonstrated that our presented initiative prefetching technique can help distributed file systems in cloud environments achieve better I/O performance. In particular, configuration-limited client machines in the cloud are not responsible for predicting I/O access operations, which contributes to better system performance on them.

1.2 INTRODUCTION

The assimilation of distributed computing for search engines, multimedia websites, and data-intensive applications has brought about the generation of data at unprecedented speed. For instance, the amount of data created, replicated, and consumed in the United States is projected to double every three years through the end of this decade. In general, the file system deployed in a distributed computing environment is called a distributed file system, which is typically used as a backend storage system to provide I/O services for various sorts of data-intensive applications in cloud computing environments. In fact, the distributed file system employs multiple distributed I/O devices by striping file data across the I/O nodes, and uses high aggregate bandwidth to meet the growing I/O requirements of distributed and parallel scientific applications.

However, because distributed file systems scale both numerically and geographically, the network delay is becoming the dominant factor in remote file system access [26], [34]. With regard to this issue, numerous data prefetching mechanisms have been proposed to hide the latency in distributed file systems caused by network communication and disk operations. In these conventional prefetching mechanisms, the client file system (which is a part of the file system and runs on the client machine) is supposed to predict future access by analyzing the history of past I/O access without any application intervention. After that, the client file system may send relevant I/O requests to storage servers to read the relevant data in. Consequently, applications with intensive read workloads automatically gain not only better use of available bandwidth, but also fewer file operations via batched I/O requests through prefetching.

On the other hand, mobile devices generally have limited processing power, battery life and storage, but cloud computing offers an illusion of infinite computing resources. To combine mobile devices and cloud computing into a new infrastructure, the mobile cloud computing research field emerged [45]. Namely, mobile cloud computing provides mobile applications with data storage and processing services in clouds, obviating the need for a powerful hardware configuration, because all resource-intensive computing can be completed in the cloud. Thus, conventional prefetching schemes are not the best-suited optimization strategies for distributed file systems to boost I/O performance in mobile clouds, since these schemes require the client file systems running on client machines to proactively issue prefetching requests after analyzing the access events they have recorded, which inevitably places extra load on the client nodes.

Furthermore, considering that disk I/O events alone can reveal the disk tracks that offer critical information for I/O optimization tactics, certain prefetching techniques have been proposed in succession to read the data on the disk in advance after analyzing disk I/O traces. But this kind of prefetching only works for local file systems, and the prefetched data is cached on the local machine to fulfill the application’s I/O requests passively. In brief, although block access history reveals the behavior of disk tracks, there are no prefetching schemes on storage servers in a distributed file system for yielding better system performance. The reason is the difficulty of modeling the block access history to generate block access patterns and of deciding the destination client machine to which the storage servers should push the prefetched data.

1.3 LITERATURE SURVEY

PARTIAL REPLICATION OF METADATA TO ACHIEVE HIGH METADATA AVAILABILITY IN PARALLEL FILE SYSTEMS

AUTHOR: J. Liao, Y. Ishikawa

PUBLISH: In the Proceedings of 41st International Conference on Parallel Processing (ICPP ’12), pp. 168–177, 2012.

EXPLANATION:

This paper presents PARTE, a prototype parallel file system with active/standby configured metadata servers (MDSs). PARTE replicates and distributes a part of files’ metadata to the corresponding metadata stripes on the storage servers (OSTs) with a per-file granularity, meanwhile the client file system (client) keeps certain sent metadata requests. If the active MDS has crashed for some reason, these client backup requests will be replayed by the standby MDS to restore the lost metadata. In case one or more backup requests are lost due to network problems or dead clients, the latest metadata saved in the associated metadata stripes will be used to construct consistent and up-to-date metadata on the standby MDS. Moreover, the clients and OSTs can work in both normal mode and recovery mode in the PARTE file system. This differs from conventional active/standby configured MDSs parallel file systems, which hang all I/O requests and metadata requests during restoration of the lost metadata. In the PARTE file system, previously connected clients can continue to perform I/O operations and relevant metadata operations, because OSTs work as temporary MDSs during that period by using the replicated metadata in the relevant metadata stripes. Through examination of experimental results, we show the feasibility of the main ideas presented in this paper for providing high availability metadata service with only a slight overhead effect on I/O performance. Furthermore, since previously connected clients are never hanged during metadata recovery, in contrast to conventional systems, a better overall I/O data throughput can be achieved with PARTE.

EVALUATING PERFORMANCE AND ENERGY IN FILE SYSTEM SERVER WORKLOADS

AUTHOR: P. Sehgal, V. Tarasov, E. Zadok

PUBLISH: the 8th USENIX Conference on File and Storage Technologies (FAST ’10), pp.253-266, 2010.

EXPLANATION:

Recently, power has emerged as a critical factor in designing components of storage systems, especially for power-hungry data centers. While there is some research into power-aware storage stack components, there are no systematic studies evaluating each component’s impact separately. This paper evaluates the file system’s impact on energy consumption and performance. We studied several popular Linux file systems, with various mount and format options, using the FileBench workload generator to emulate four server workloads: Web, database, mail, and file server. In case of a server node consisting of a single disk, CPU power generally exceeds disk-power consumption. However, file system design, implementation, and available features have a significant effect on CPU/disk utilization, and hence on performance and power. We discovered that default file system options are often suboptimal, and even poor. We show that a careful matching of expected workloads to file system types and options can improve power-performance efficiency by a factor ranging from 1.05 to 9.4 times.

FLEXIBLE, WIDE-AREA STORAGE FOR DISTRIBUTED SYSTEMS WITH WHEELFS

AUTHOR: J. Stribling, Y. Sovran, I. Zhang and R. Morris et al

PUBLISH: In Proceedings of the 6th USENIX symposium on Networked systems design and implementation (NSDI’09), USENIX Association, pp. 43–58, 2009.

EXPLANATION:

WheelFS is a wide-area distributed storage system intended to help multi-site applications share data and gain fault tolerance. WheelFS takes the form of a distributed file system with a familiar POSIX interface. Its design allows applications to adjust the tradeoff between prompt visibility of updates from other sites and the ability for sites to operate independently despite failures and long delays. WheelFS allows these adjustments via semantic cues, which provide application control over consistency, failure handling, and file and replica placement. WheelFS is implemented as a user-level file system and is deployed on PlanetLab and Emulab. Three applications (a distributed Web cache, an email service and large file distribution) demonstrate that WheelFS’s file system interface simplifies construction of distributed applications by allowing reuse of existing software. These applications would perform poorly with the strict semantics implied by a traditional file system interface, but by providing cues to WheelFS they are able to achieve good performance. Measurements show that applications built on WheelFS deliver comparable performance to services such as CoralCDN and BitTorrent that use specialized wide-area storage systems.

CHAPTER 2

2.0 SYSTEM ANALYSIS

2.1 EXISTING SYSTEM:

The file system deployed in a distributed computing environment is called a distributed file system, which is typically used as a backend storage system to provide I/O services for various sorts of data-intensive applications in cloud computing environments. In fact, the distributed file system employs multiple distributed I/O devices by striping file data across the I/O nodes, and uses high aggregate bandwidth to meet the growing I/O requirements of distributed and parallel scientific applications. The Sysbench benchmark is used to create OLTP workloads, since it is able to create OLTP workloads similar to those that exist in real systems. All the configured client file systems executed the same script, and each of them ran several threads that issued OLTP requests. Because Sysbench requires MySQL installed as a backend for OLTP workloads, we configured the mysqld process to use 16 cores of the storage servers. As a consequence, it is possible to measure the response time to client requests while handling the generated workloads.

2.1.1 DISADVANTAGES:

Network delay in numerically and geographically remote file system access

Mobile devices generally have limited processing power, battery life and storage

2.2 PROPOSED SYSTEM:

Certain prefetching techniques have been proposed in succession to read the data on the disk in advance after analyzing disk I/O traces. But this kind of prefetching only works for local file systems, and the prefetched data is cached on the local machine to fulfill the application’s I/O requests passively. In brief, although block access history reveals the behavior of disk tracks, there are no prefetching schemes on storage servers in a distributed file system for yielding better system performance. The reason is the difficulty of modeling the block access history to generate block access patterns and of deciding the destination client machine to which the storage servers should push the prefetched data. To yield attractive I/O performance in the distributed file system deployed in a mobile cloud environment or a cloud environment that has many resource-limited client machines, this paper presents an initiative data prefetching mechanism. The proposed mechanism first analyzes disk I/O tracks to predict the future disk I/O access so that the storage servers can fetch data in advance, and then forwards the prefetched data to the relevant client file systems for potential future use.

This paper makes the following two contributions:

1) Chaotic time series prediction and linear regression prediction to forecast disk I/O access. We have modeled the disk I/O access operations and classified them into two kinds of access patterns, i.e. the random access pattern and the sequential access pattern. To predict the future I/O access belonging to each pattern as accurately as possible (note that the future I/O access indicates what data will be requested in the near future), two prediction algorithms have been proposed: the chaotic time series prediction algorithm and the linear regression prediction algorithm, respectively.

2) Initiative data prefetching on storage servers. The client file systems do not intervene, except for piggybacking their information onto relevant I/O requests to the storage servers. The storage servers log disk I/O access and classify access patterns after modeling disk I/O events. Next, by properly using the two proposed prediction algorithms, the storage servers predict the future disk I/O access to guide data prefetching. Finally, the storage servers proactively forward the prefetched data to the relevant client file systems to satisfy future application requests.
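The linear regression part of the first contribution can be sketched as follows. This is a simplified illustration, not the paper's algorithm: the constant-stride test for the sequential pattern, the use of block numbers indexed by access order, and all names are assumptions made here, and the chaotic time series predictor for the random pattern is not shown.

```java
class AccessPredictionSketch {
    // Assumed rule for this sketch: a history counts as sequential when
    // every step between consecutive block numbers has the same nonzero stride.
    static boolean isSequential(long[] blocks) {
        if (blocks.length < 3) {
            return false;
        }
        long stride = blocks[1] - blocks[0];
        if (stride == 0) {
            return false;
        }
        for (int i = 2; i < blocks.length; i++) {
            if (blocks[i] - blocks[i - 1] != stride) {
                return false;
            }
        }
        return true;
    }

    // Fits block = intercept + slope * t by least squares over t = 0..n-1
    // and returns the predicted block number for the next access (t = n).
    static long predictNext(long[] blocks) {
        int n = blocks.length;
        if (n < 2) {
            throw new IllegalArgumentException("need at least two logged accesses");
        }
        double meanT = (n - 1) / 2.0;
        double meanBlock = 0;
        for (long block : blocks) {
            meanBlock += block;
        }
        meanBlock /= n;
        double covariance = 0;
        double variance = 0;
        for (int t = 0; t < n; t++) {
            covariance += (t - meanT) * (blocks[t] - meanBlock);
            variance += (t - meanT) * (t - meanT);
        }
        double slope = covariance / variance;
        double intercept = meanBlock - slope * meanT;
        return Math.round(intercept + slope * n);
    }

    public static void main(String[] args) {
        long[] history = {100, 108, 116, 124}; // hypothetical logged block numbers
        if (isSequential(history)) {
            System.out.println("prefetch block " + predictNext(history)); // prints "prefetch block 132"
        }
    }
}
```

On a storage server, the predicted block would be read in advance and pushed to the client identified by the piggybacked information, as described in the second contribution.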

2.2.1 ADVANTAGES:

Applications with read-intensive workloads automatically make better use of available bandwidth.

Fewer file operations via batched I/O requests through prefetching

Cloud computing offers an illusion of infinite computing resources

2.3 HARDWARE & SOFTWARE REQUIREMENTS:

2.3.1 HARDWARE REQUIREMENT:

- Processor – Pentium IV
- Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Key Board – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA

2.3.2 SOFTWARE REQUIREMENTS:

JAVA

Operating System : Windows XP or Win7
Front End : JAVA JDK 1.7
Script : Java Script
Document : MS-Office 2007

CHAPTER 3

3.0 SYSTEM DESIGN:

Data Flow Diagram / Use Case Diagram / Flow Diagram:

The DFD is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on these data, and the output data generated by the system.

The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components. These components are the system process, the data used by the process, an external entity that interacts with the system and the information flows in the system.

DFD shows how the information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.

A DFD may be used to represent a system at any level of abstraction. It may be partitioned into levels that represent increasing information flow and functional detail.

NOTATION:

SOURCE OR DESTINATION OF DATA:

External sources or destinations, which may be people or organizations or other entities

DATA SOURCE:

Here the data referenced by a process is stored and retrieved.

PROCESS:

People, procedures, or devices that produce data. The physical component is not identified.

DATA FLOW:

Data moves in a specific direction from an origin to a destination. The data flow is a “packet” of data.

MODELING RULES:

There are several common modeling rules when creating DFDs:

All processes must have at least one data flow in and one data flow out.
All processes should modify the incoming data, producing new forms of outgoing data.
Each data store must be involved with at least one data flow.
Each external entity must be involved with at least one data flow.
A data flow must be attached to at least one process.
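The modeling rules above can be checked mechanically. The following is a minimal sketch, assuming a DFD is represented as a set of process names plus a list of directed flows between named nodes; the class DfdRuleCheck and its methods are hypothetical names introduced only for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch only: this representation of a DFD is an assumption.
final class DfdRuleCheck {

    /** A directed data flow between two named nodes (process, store, or external entity). */
    static final class Flow {
        final String from;
        final String to;

        Flow(String from, String to) {
            this.from = from;
            this.to = to;
        }
    }

    /** Rule 1: returns the processes that lack an incoming flow or an outgoing flow. */
    static Set<String> processesMissingFlows(Set<String> processes, List<Flow> flows) {
        Set<String> hasIn = new HashSet<String>();
        Set<String> hasOut = new HashSet<String>();
        for (Flow f : flows) {
            hasOut.add(f.from);
            hasIn.add(f.to);
        }
        Set<String> violations = new TreeSet<String>();
        for (String p : processes) {
            if (!hasIn.contains(p) || !hasOut.contains(p)) {
                violations.add(p);
            }
        }
        return violations;
    }

    /** Rule 5: true if every flow is attached to at least one process. */
    static boolean everyFlowTouchesProcess(Set<String> processes, List<Flow> flows) {
        for (Flow f : flows) {
            if (!processes.contains(f.from) && !processes.contains(f.to)) {
                return false;
            }
        }
        return true;
    }
}
```

A process that only receives data, for example, is reported by processesMissingFlows because it has no outgoing flow.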

3.1 ARCHITECTURE DIAGRAM

3.2 DATAFLOW DIAGRAM

UML DIAGRAMS:

3.2 USE CASE DIAGRAM:

3.3 CLASS DIAGRAM:

3.4 SEQUENCE DIAGRAM:

3.5 ACTIVITY DIAGRAM:

CHAPTER 4

4.0 IMPLEMENTATION:

I/O ACCESS PREDICTION

4.1 ALGORITHM

MARKOV MODEL PREDICTION ALGORITHM

LINEAR PREDICTION ALGORITHM

4.2 MODULES:

SERVER CLIENT MODULE:

DISTRIBUTED FILE SYSTEMS:

INITIATIVE DATA PREFETCHING:

ANALYSIS OF PREDICTIONS:

4.3 MODULE DESCRIPTION:

SERVER CLIENT MODULE:

DISTRIBUTED FILE SYSTEMS:

INITIATIVE DATA PREFETCHING:

ANALYSIS OF PREDICTIONS:

CHAPTER 5

5.0 SYSTEM STUDY:

5.1 FEASIBILITY STUDY:

Three key considerations involved in the feasibility analysis are

ECONOMICAL FEASIBILITY
TECHNICAL FEASIBILITY
SOCIAL FEASIBILITY

5.1.1 ECONOMICAL FEASIBILITY:

5.1.2 TECHNICAL FEASIBILITY

5.1.3 SOCIAL FEASIBILITY:

5.2 SYSTEM TESTING:

5.2.1 UNIT TESTING:

Description	Expected result
Test for application window properties.	All the properties of the windows are to be properly aligned and displayed.
Test for mouse operations.	All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions.

5.1.2 FUNCTIONAL TESTING:

Description	Expected result
Test for all modules.	All peers should communicate in the group.
Test for various peers in a distributed network framework, displaying all users available in the group.	The result after execution should be accurate.

5.1.3 NON-FUNCTIONAL TESTING:

Load testing
Performance testing
Usability testing
Reliability testing
Security testing

5.1.4 LOAD TESTING:

Description	Expected result
It is necessary to ascertain that the application behaves correctly under load when a ‘Server busy’ response is received.	Should designate another active node as the server.

5.1.5 PERFORMANCE TESTING:

Description	Expected result
This is required to ensure that the application performs adequately: it must handle many peers, deliver its results in the expected time, and use an acceptable level of resources. It is an aspect of operational management.	Should handle large input values and produce accurate results in the expected time.

5.1.6 RELIABILITY TESTING:

Description	Expected result
This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in providing the application.	In case of failure of the server, an alternate server should take over the job.

5.1.7 SECURITY TESTING:

Description	Expected result
Check that the user identification is authenticated.	In case of failure, the user should not be connected to the framework.
Check whether group keys in a tree are shared by all peers.	Peers in the same group should know the group key.

5.1.8 WHITE BOX TESTING:

Description	Expected result
Exercise all logical decisions on their true and false sides.	All the logical decisions must be valid.
Execute all loops at their boundaries and within their operational bounds.	All the loops must be finite.
Exercise internal data structures to ensure their validity.	All the data structures must be valid.

5.1.9 BLACK BOX TESTING:

Description	Expected result
To check for incorrect or missing functions.	All the functions must be valid.
To check for interface errors.	The entire interface must function normally.
To check for errors in data structures or external database access.	Database updates and retrieval must work correctly.
To check for initialization and termination errors.	All the functions and data structures must be initialized properly and terminated normally.

All the above system testing strategies are carried out, as the development, documentation, and institutionalization of the proposed goals and related policies are essential.

CHAPTER 6

6.0 SOFTWARE DESCRIPTION:

6.1 JAVA TECHNOLOGY:

Java technology is both a programming language and a platform.

The Java Programming Language

The Java programming language is a high-level language that can be characterized by all of the following buzzwords:

- Simple
- Architecture neutral
- Object oriented
- Portable
- Distributed
- High performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure

6.2 THE JAVA PLATFORM:

The Java platform has two components:

The Java Virtual Machine (Java VM)
The Java Application Programming Interface (Java API)

You’ve already been introduced to the Java VM. It’s the base for the Java platform and is ported onto various hardware-based platforms.

The following figure depicts a program that’s running on the Java platform. As the figure shows, the Java API and the virtual machine insulate the program from the hardware.

6.3 WHAT CAN JAVA TECHNOLOGY DO?

The essentials: Objects, strings, threads, numbers, input and output, data structures, system properties, date and time, and so on.
Applets: The set of conventions used by applets.
Networking: URLs, TCP (Transmission Control Protocol), UDP (User Datagram Protocol) sockets, and IP (Internet Protocol) addresses.
Internationalization: Help for writing programs that can be localized for users worldwide. Programs can automatically adapt to specific locales and be displayed in the appropriate language.
Security: Both low level and high level, including electronic signatures, public and private key management, access control, and certificates.
Software components: Known as JavaBeans™, can plug into existing component architectures.
Object serialization: Allows lightweight persistence and communication via Remote Method Invocation (RMI).
Java Database Connectivity (JDBC™): Provides uniform access to a wide range of relational databases.

The Java platform also has APIs for 2D and 3D graphics, accessibility, servers, collaboration, telephony, speech, animation, and more. The following figure depicts what is included in the Java 2 SDK.

6.4 HOW WILL JAVA TECHNOLOGY CHANGE MY LIFE?

Get started quickly: Although the Java programming language is a powerful object-oriented language, it’s easy to learn, especially for programmers already familiar with C or C++.
Write less code: Comparisons of program metrics (class counts, method counts, and so on) suggest that a program written in the Java programming language can be four times smaller than the same program in C++.
Write better code: The Java programming language encourages good coding practices, and its garbage collection helps you avoid memory leaks. Its object orientation, its JavaBeans component architecture, and its wide-ranging, easily extendible API let you reuse other people’s tested code and introduce fewer bugs.
Develop programs more quickly: Development may be as much as twice as fast as writing the same program in C++, because you write fewer lines of code in a simpler programming language.
Avoid platform dependencies with 100% Pure Java: You can keep your program portable by avoiding the use of libraries written in other languages. The 100% Pure Java™ Product Certification Program has a repository of historical process manuals, white papers, brochures, and similar materials online.
Write once, run anywhere: Because 100% Pure Java programs are compiled into machine-independent byte codes, they run consistently on any Java platform.
Distribute software more easily: You can upgrade applets easily from a central server. Applets take advantage of the feature of allowing new classes to be loaded “on the fly,” without recompiling the entire program.

6.5 ODBC:

6.6 JDBC:

JDBC was announced in March of 1996. It was released for a 90 day public review that ended June 8, 1996. Because of user input, the final JDBC v1.0 specification was released soon after.

6.7 JDBC Goals:

The goals that were set for JDBC are important. They will give you some insight as to why certain classes and functionalities behave the way they do. The eight design goals for JDBC are as follows:

SQL Level API

SQL Conformance

JDBC must be implementable on top of common database interfaces

Provide a Java interface that is consistent with the rest of the Java system

Because of Java’s acceptance in the user community thus far, the designers feel that they should not stray from the current design of the core Java system.

Keep it simple

Use strong, static typing wherever possible

Strong typing allows more error checking to be done at compile time, so fewer errors appear at runtime.

Keep the common cases simple

Finally, we decided to proceed with the implementation using Java Networking.

For dynamically updating the cache table, we use an MS Access database.

Java has two parts: a programming language and a platform.

Java is a high-level programming language that is all of the following:

- Simple
- Architecture-neutral
- Object-oriented
- Portable
- Distributed
- High-performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure

Compilation happens just once; interpretation occurs each time the program is executed. The figure illustrates how this works.

6.7 NETWORKING TCP/IP STACK:

The TCP/IP stack is shorter than the OSI one:

TCP is a connection-oriented protocol; UDP (User Datagram Protocol) is a connectionless protocol.

IP datagrams:

UDP:

UDP is also connectionless and unreliable. What it adds to IP is a checksum for the contents of the datagram and port numbers. These are used to give a client/server model – see later.

TCP:

TCP supplies logic to give a reliable connection-oriented protocol above IP. It provides a virtual circuit that two processes can use to communicate.

Internet addresses

In order to use a service, you must be able to find it. The Internet uses an address scheme for machines so that they can be located. The address is a 32 bit integer which gives the IP address.

Network address:

Class A uses 8 bits for the network address with 24 bits left over for other addressing. Class B uses 16 bit network addressing. Class C uses 24 bit network addressing and class D uses all 32.

Subnet address:

Internally, the UNIX network is divided into sub networks. Building 11 is currently on one sub network and uses 10-bit addressing, allowing 1024 different hosts.

Host address:

8 bits are finally used for host addresses within our subnet. This places a limit of 256 machines that can be on the subnet.

Total address:

The 32 bit address is usually written as 4 integers separated by dots.
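As a small illustration of the dotted notation, the sketch below converts between a 32-bit address held in a Java int and its four-octet text form. The class name Ipv4Text and its methods are hypothetical helpers written for this example; they are not part of the Java networking API.

```java
// Illustrative sketch only: a hand-written helper, not a java.net class.
final class Ipv4Text {

    /** Formats a 32-bit address as four decimal octets separated by dots. */
    static String toDottedQuad(int address) {
        return ((address >>> 24) & 0xFF) + "."
                + ((address >>> 16) & 0xFF) + "."
                + ((address >>> 8) & 0xFF) + "."
                + (address & 0xFF);
    }

    /** Parses dotted text such as 192.168.0.1 back into a 32-bit value. */
    static int parse(String text) {
        String[] parts = text.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("expected four octets: " + text);
        }
        int value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("octet out of range: " + text);
            }
            value = (value << 8) | octet; // shift earlier octets up, append this one
        }
        return value;
    }
}
```

For example, the value 0xC0A80001 is written as 192.168.0.1: each octet is one byte of the 32-bit integer, most significant byte first.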

Port addresses

Sockets:

#include <sys/types.h>

#include <sys/socket.h>

int socket(int family, int type, int protocol);

6.8 JFREE CHART:

JFreeChart is a free 100% Java chart library that makes it easy for developers to display professional quality charts in their applications. JFreeChart’s extensive feature set includes:

A consistent and well-documented API, supporting a wide range of chart types;

A flexible design that is easy to extend, and targets both server-side and client-side applications;

Support for many output types, including Swing components, image files (including PNG and JPEG), and vector graphics file formats (including PDF, EPS and SVG);

JFreeChart is “open source” or, more specifically, free software. It is distributed under the terms of the GNU Lesser General Public Licence (LGPL), which permits use in proprietary applications.

6.8.1. Map Visualizations:

6.8.2. Time Series Chart Interactivity

6.8.3. Dashboards

6.8.4. Property Editors

CHAPTER 7

7.0 APPENDIX

7.1 SAMPLE SCREEN SHOTS:

7.2 SAMPLE SOURCE CODE:

CHAPTER 8

8.1 CONCLUSION AND FUTURE WORK:

We have proposed, implemented and evaluated an initiative data prefetching approach on the storage servers for distributed file systems, which can be employed as a backend storage system in a cloud environment that may have certain resource-limited client machines. To be specific, the storage servers are capable of predicting future disk I/O access to guide fetching data in advance after analyzing the existing logs, and then they proactively push the prefetched data to relevant client file systems for satisfying future applications’ requests.

For the purpose of effectively modeling disk I/O access patterns and accurately forwarding the prefetched data, the information about client file systems is piggybacked onto relevant I/O requests and then transferred from client nodes to the corresponding storage server nodes. Therefore, the client file systems running on the client nodes neither log I/O events nor conduct I/O access prediction; consequently, the thin client nodes can focus on performing necessary tasks with limited computing capacity and energy endurance.
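The server-side logging and pattern classification described above can be sketched as follows. This is a minimal illustration under stated assumptions: the piggybacked client information is reduced to a clientId string, and a stream is treated as sequential when most consecutive reads are contiguous. The contiguity heuristic, the threshold parameter, and the names AccessLog, IoEvent, and classify are all assumptions made for this example; the source does not specify how patterns are classified.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: the classification heuristic is an assumption.
final class AccessLog {

    /** One logged read; clientId stands in for the piggybacked client information. */
    static final class IoEvent {
        final String clientId;
        final long offset;
        final int length;

        IoEvent(String clientId, long offset, int length) {
            this.clientId = clientId;
            this.offset = offset;
            this.length = length;
        }
    }

    enum Pattern { SEQUENTIAL, RANDOM }

    // Per-client event history, kept on the storage server only.
    private final Map<String, List<IoEvent>> byClient = new HashMap<String, List<IoEvent>>();

    /** Logs one request on the server side; clients keep no log of their own. */
    void record(IoEvent event) {
        List<IoEvent> events = byClient.get(event.clientId);
        if (events == null) {
            events = new ArrayList<IoEvent>();
            byClient.put(event.clientId, events);
        }
        events.add(event);
    }

    /**
     * Returns SEQUENTIAL when at least the given fraction of consecutive request
     * pairs are contiguous (next offset == previous offset + previous length).
     */
    Pattern classify(String clientId, double threshold) {
        List<IoEvent> events = byClient.get(clientId);
        if (events == null || events.size() < 2) {
            return Pattern.RANDOM; // too little history to call it sequential
        }
        int contiguous = 0;
        for (int i = 1; i < events.size(); i++) {
            IoEvent prev = events.get(i - 1);
            if (events.get(i).offset == prev.offset + prev.length) {
                contiguous++;
            }
        }
        int pairs = events.size() - 1;
        return contiguous >= threshold * pairs ? Pattern.SEQUENTIAL : Pattern.RANDOM;
    }
}
```

With a threshold of 0.8, a client that reads three adjacent 4 KB blocks is classified as sequential, while a client that jumps between unrelated offsets is classified as random. The classification result would then select which prediction algorithm to run.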

The initiative prefetching scheme can be applied in a distributed file system for a mobile cloud computing environment, in which there are many tablet computers and smart terminals. The current implementation of our proposed initiative prefetching scheme can classify only two access patterns and supports two corresponding prediction algorithms for predicting future disk I/O access. In the future, we plan to classify patterns for a wider range of application benchmarks by utilizing the horizontal visibility graph technique. Applying network-delay-aware replica selection techniques to reduce network transfer time when prefetching data among several replicas is another task in our future work.

Privacy Policy Inference of User-Uploaded Images on Content Sharing Sites

05/08/2019 02/07/2019 by admin

With the increasing volume of images users share through social sites, maintaining privacy has become a major problem, as demonstrated by a recent wave of publicized incidents where users inadvertently shared personal information. In light of these incidents, the need of tools to help users control access to their shared content is apparent. Toward addressing this need, we propose an Adaptive Privacy Policy Prediction (A3P) system to help users compose privacy settings for their images. We examine the role of social context, image content, and metadata as possible indicators of users’ privacy preferences.

We propose a two-level framework which according to the user’s available history on the site, determines the best available privacy policy for the user’s images being uploaded. Our solution relies on an image classification framework for image categories which may be associated with similar policies, and on a policy prediction algorithm to automatically generate a policy for each newly uploaded image, also according to users’ social features. Over time, the generated policies will follow the evolution of users’ privacy attitude. We provide the results of our extensive evaluation over 5,000 policies, which demonstrate the effectiveness of our system, with prediction accuracies over 90 percent.

1.2 INTRODUCTION

Images are now one of the key enablers of users’ connectivity. Sharing takes place both among previously established groups of known people or social circles (e.g., Google+, Flickr, or Picasa), and increasingly with people outside the users’ social circles, for purposes of social discovery: to help them identify new peers and learn about peers’ interests and social surroundings. However, semantically rich images may reveal content-sensitive information. Consider a photo of a student’s 2012 graduation ceremony, for example.

It could be shared within a Google+ circle or Flickr group, but may unnecessarily expose the student’s family members and other friends. Sharing images within online content sharing sites, therefore, may quickly lead to unwanted disclosure and privacy violations. Further, the persistent nature of online media makes it possible for other users to collect rich aggregated information about the owner of the published content and the subjects in the published content. The aggregated information can result in unexpected exposure of one’s social environment and lead to abuse of one’s personal information.

Most content sharing websites allow users to enter their privacy preferences. Unfortunately, recent studies have shown that users struggle to set up and maintain such privacy settings. One of the main reasons provided is that given the amount of shared information this process can be tedious and error-prone. Therefore, many have acknowledged the need of policy recommendation systems which can assist users to easily and properly configure privacy settings. However, existing proposals for automating privacy settings appear to be inadequate to address the unique privacy needs of images due to the amount of information implicitly carried within images, and their relationship with the online environment wherein they are exposed.

1.3 LITERATURE SURVEY

TITLE NAME: SHEEPDOG: GROUP AND TAG RECOMMENDATION FOR FLICKR PHOTOS BY AUTOMATIC SEARCH-BASED LEARNING

AUTHOR: H.-M. Chen, M.-H. Chang, P.-C. Chang, M.-C. Tien, W. H. Hsu, and J.-L. Wu

PUBLISH: Proc. 16th ACM Int. Conf. Multimedia, 2008, pp. 737–740.

EXPLANATION:

Online photo albums have been prevalent in recent years and have resulted in more and more applications developed to provide convenient functionalities for photo sharing. In this paper, we propose a system named SheepDog to automatically add photos into appropriate groups and recommend suitable tags for users on Flickr. We adopt concept detection to predict relevant concepts of a photo and probe into the issue about training data collection for concept classification. From the perspective of gathering training data by web searching, we introduce two mechanisms and investigate their performances of concept detection. Based on some existing information from Flickr, a ranking-based method is applied not only to obtain reliable training data, but also to provide reasonable group/tag recommendations for input photos. We evaluate this system with a rich set of photos and the results demonstrate the effectiveness of our work.

TITLE NAME: CONNECTING CONTENT TO COMMUNITY IN SOCIAL MEDIA VIA IMAGE CONTENT, USER TAGS AND USER COMMUNICATION

AUTHOR: M. D. Choudhury, H. Sundaram, Y.-R. Lin, A. John, and D. D. Seligmann

PUBLISH: Proc. IEEE Int. Conf. Multimedia Expo, 2009, pp. 1238–1241.

EXPLANATION:

In this paper we develop a recommendation framework to connect image content with communities in online social media. The problem is important because users are looking for useful feedback on their uploaded content, but finding the right community for feedback is challenging for the end user. Social media are characterized by both content and community. Hence, in our approach, we characterize images through three types of features: visual features, user generated text tags, and social interaction (user communication history in the form of comments). A recommendation framework based on learning a latent space representation of the groups is developed to recommend the most likely groups for a given image. The model was tested on a large corpus of Flickr images comprising 15,689 images. Our method outperforms the baseline method, with a mean precision 0.62 and mean recall 0.69. Importantly, we show that fusing image content, text tags with social interaction features outperforms the case of only using image content or tags.

TITLE NAME: ANALYSING FACEBOOK FEATURES TO SUPPORT EVENT DETECTION FOR PHOTO-BASED FACEBOOK APPLICATIONS

AUTHOR: M. Rabbath, P. Sandhaus, and S. Boll

PUBLISH: Proc. 2nd ACM Int. Conf. Multimedia Retrieval, 2012, pp. 11:1–11:8.

EXPLANATION:

Facebook is witnessing an explosion in the number of shared photos: with 100 million photo uploads a day, it creates as much as a whole Flickr every two months in terms of volume. Facebook also has one of the healthiest platforms to support third party applications, many of which deal with photos and related events. While it is essential for many Facebook applications, until now there is no easy way to detect and link photos that are related to the same events, which are usually distributed between friends and albums. In this work, we introduce an approach that exploits Facebook features to link photos related to the same event. In the current situation where the EXIF header of photos is missing in Facebook, we extract visual-based, tagged areas-based, friendship-based and structure-based features. We evaluate each of these features and use the results in our approach. We introduce and evaluate a semi-supervised probabilistic approach that takes into account the evaluation of these features. In this approach we create a lookup table of the initialization values of our model variables and make it available for other Facebook applications or researchers to use. The evaluation of our approach showed promising results and it outperformed the baseline method of using the unsupervised EM algorithm in estimating the parameters of a Gaussian mixture model. We also give two examples of the applicability of this approach to help Facebook applications in better serving the user.

CHAPTER 2

2.0 SYSTEM ANALYSIS

2.1 EXISTING SYSTEM:

Image content sharing environments such as Flickr or YouTube contain a large amount of private resources such as photos showing weddings, family holidays, and private parties. These resources can be of a highly sensitive nature, disclosing many details of the users’ private sphere. In order to support users in making privacy decisions in the context of image sharing and to provide them with a better overview of privacy-related visual content available on the Web, techniques are needed to automatically detect private images and to enable privacy-oriented image search.

To this end, we learn privacy classifiers trained on a large set of manually assessed Flickr photos, combining textual metadata of images with a variety of visual features. We employ the resulting classification models for specifically searching for private photos, and for diversifying query results to provide users with a better coverage of private and public content. Most content sharing websites allow users to enter their privacy preferences. Unfortunately, recent studies have shown that users struggle to set up and maintain such privacy settings.

One of the main reasons provided is that, given the amount of shared information, this process can be tedious and error-prone. This motivates policy recommendation systems which can assist users to easily and properly configure privacy settings.

2.1.1 DISADVANTAGES:

Sharing images within online content sharing sites may quickly lead to unwanted disclosure and privacy violations.
Further, the persistent nature of online media makes it possible for other users to collect rich aggregated information about the owner of the published content and the subjects in the published content.
The aggregated information can result in unexpected exposure of one’s social environment and lead to abuse of one’s personal information.

2.2 PROPOSED SYSTEM:

We propose an Adaptive Privacy Policy Prediction (A3P) system which aims to provide users a hassle free privacy settings experience by automatically generating personalized policies. The A3P system handles user uploaded images, and factors in the following criteria that influence one’s privacy settings of images:

The impact of social environment and personal characteristics: Social context of users, such as their profile information and relationships with others may provide useful information regarding users’ privacy preferences. For example, users interested in photography may like to share their photos with other amateur photographers. Users who have several family members among their social contacts may share with them pictures related to family events. However, using common policies across all users or across users with similar traits may be too simplistic and not satisfy individual preferences.

Users may have drastically different opinions even on the same type of images. For example, a privacy adverse person may be willing to share all his personal images while a more conservative person may just want to share personal images with his family members. In light of these considerations, it is important to find the balancing point between the impact of social environment and users’ individual characteristics in order to predict the policies that match each individual’s needs.

The role of image’s content and metadata: In general, similar images often incur similar privacy preferences, especially when people appear in the images. For example, one may upload several photos of his kids and specify that only his family members are allowed to see these photos. He may upload some other photos of landscapes which he took as a hobby, and for these photos he may set a privacy preference allowing anyone to view and comment on the photos. Analyzing the visual content may not be sufficient to capture users’ privacy preferences. Tags and other metadata are indicative of the social context of the image, including where it was taken and why, and also provide a synthetic description of images, complementing the information obtained from visual content analysis.

2.2.1 ADVANTAGES:

The A3P-core focuses on analyzing each individual user’s own images and metadata, while the A3P-Social offers a community perspective of privacy setting recommendations for a user’s potential privacy improvement.

The A3P-core algorithm is parameterized based on user groups and also factors in possible outliers, and a new A3P-social module develops the notion of social context to refine and extend the prediction power of our system.

We design the interaction flows between the two building blocks to balance the benefits from meeting personal characteristics and obtaining community advice.

2.3 HARDWARE & SOFTWARE REQUIREMENTS:

2.3.1 HARDWARE REQUIREMENT:

- Processor – Pentium IV
- Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Key Board – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA

2.3.2 SOFTWARE REQUIREMENTS:

Operating System : Windows XP or Win7
Front End : JAVA JDK 1.7
Back End : MYSQL Server
Server : Apache Tomcat Server
Script : JSP Script
Document : MS-Office 2007

CHAPTER 3

3.0 SYSTEM DESIGN:

Data Flow Diagram / Use Case Diagram / Flow Diagram:

The DFD is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on these data, and the output data generated by the system.

The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components. These components are the system process, the data used by the process, an external entity that interacts with the system and the information flows in the system.

DFD shows how the information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.

A DFD may be used to represent a system at any level of abstraction. It may be partitioned into levels that represent increasing information flow and functional detail.

NOTATION:

SOURCE OR DESTINATION OF DATA:

External sources or destinations, which may be people or organizations or other entities

DATA SOURCE:

Here the data referenced by a process is stored and retrieved.

PROCESS:

People, procedures, or devices that produce data. The physical component is not identified.

DATA FLOW:

Data moves in a specific direction from an origin to a destination. The data flow is a “packet” of data.

MODELING RULES:

There are several common modeling rules when creating DFDs:

All processes must have at least one data flow in and one data flow out.
All processes should modify the incoming data, producing new forms of outgoing data.
Each data store must be involved with at least one data flow.
Each external entity must be involved with at least one data flow.
A data flow must be attached to at least one process.

3.1 ARCHITECTURE DIAGRAM

3.2 DATAFLOW DIAGRAM

ADMIN:

USER:

UML DIAGRAMS:

3.2 USE CASE DIAGRAM:

ADMIN:

USER:

3.3 CLASS DIAGRAM:

ADMIN:

USER:

3.4 SEQUENCE DIAGRAM:

ADMIN:

USER:

3.5 ACTIVITY DIAGRAM:

ADMIN:

USER:

CHAPTER 4

4.0 IMPLEMENTATION:

A3P-CORE

There are two major components in A3P-core: (i) Image classification and (ii) Adaptive policy prediction. For each user, his/her images are first classified based on content and metadata. Then, privacy policies of each category of images are analyzed for the policy prediction. Adopting a two-stage approach is more suitable for policy recommendation than applying the common one-stage data mining approaches to mine both image features and policies together. Recall that when a user uploads a new image, the user is waiting for a recommended policy.

The two-stage approach allows the system to employ the first stage to classify the new image and find the candidate sets of images for the subsequent policy recommendation. The one-stage mining approach, by contrast, would not be able to locate the right class of the new image because its classification criteria need both image features and policies, whereas the policies of the new image are not available yet. Moreover, combining both image features and policies into a single classifier would lead to a system which is very dependent on the specific syntax of the policy. If a change in the supported policies were to be introduced, the whole learning model would need to change.
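The separation of the two stages can be illustrated with a minimal sketch. It assumes that image features are plain string labels, that stage one picks the known category sharing the most features with the new image, and that stage two recommends the policy the user applied most often in that category. These choices and the names TwoStageRecommender, classify, and recommend are simplifications made for illustration; they are not the classification or prediction algorithms of the A3P-core.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: both stages are deliberately simplified stand-ins.
final class TwoStageRecommender {

    private final Map<String, Set<String>> featuresByCategory = new HashMap<String, Set<String>>();
    private final Map<String, List<String>> policiesByCategory = new HashMap<String, List<String>>();

    /** Records one past upload: its category, its features, and the policy the user chose. */
    void addHistory(String category, Set<String> features, String policy) {
        Set<String> known = featuresByCategory.get(category);
        if (known == null) {
            known = new HashSet<String>();
            featuresByCategory.put(category, known);
        }
        known.addAll(features);

        List<String> policies = policiesByCategory.get(category);
        if (policies == null) {
            policies = new ArrayList<String>();
            policiesByCategory.put(category, policies);
        }
        policies.add(policy);
    }

    /** Stage 1: uses image features only. Returns null when no category overlaps. */
    String classify(Set<String> features) {
        String best = null;
        int bestOverlap = 0;
        for (Map.Entry<String, Set<String>> entry : featuresByCategory.entrySet()) {
            Set<String> common = new HashSet<String>(features);
            common.retainAll(entry.getValue());
            if (common.size() > bestOverlap) {
                best = entry.getKey();
                bestOverlap = common.size();
            }
        }
        return best;
    }

    /** Stage 2: uses only the policies stored for the category found in stage 1. */
    String recommend(Set<String> features) {
        String category = classify(features);
        if (category == null) {
            return null;
        }
        Map<String, Integer> counts = new HashMap<String, Integer>();
        String best = null;
        int bestCount = 0;
        for (String policy : policiesByCategory.get(category)) {
            Integer seen = counts.get(policy);
            int count = (seen == null ? 0 : seen) + 1;
            counts.put(policy, count);
            if (count > bestCount) {
                best = policy;
                bestCount = count;
            }
        }
        return best;
    }
}
```

Note that classify never looks at policies, so a new image with no policy yet can still be placed in a category. The policy representation could also change without touching stage one, which is the decoupling argued for above.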

A3P-SOCIAL

The A3P-social employs a multi-criteria inference mechanism that generates representative policies by leveraging key information related to the user’s social context and his general attitude toward privacy. As mentioned earlier, A3P-social will be invoked by the A3P-core in two scenarios. One is when the user is new to a site and does not have enough images stored for the A3P-core to infer meaningful and customized policies. The other is when the system notices significant changes of privacy trend in the user’s social circle, which may be of interest for the user to possibly adjust his/her privacy settings accordingly. In what follows, we first present the types of social context considered by A3P-social, and then present the policy recommendation process.

4.1 ALGORITHM

Our algorithm performs better for users with certain characteristics. Therefore, we study possible factors relevant to the performance of our algorithm. We used a least squares multiple regression analysis, regressing performance of the A3P-core to the following possible predictors:

4.2 MODULES:

WEB-BASED IMAGE SHARING SERVICES:

METADATA-BASED CLASSIFICATION:

CONTENT-BASED CLASSIFICATION:

ADAPTIVE POLICY PREDICTION:

4.3 MODULE DESCRIPTION:

WEB-BASED IMAGE SHARING SERVICES:

Sharing images within online content sharing sites, therefore, may quickly lead to unwanted disclosure and privacy violations. Further, the persistent nature of online media makes it possible for other users to collect rich aggregated information about the owner of the published content and the subjects in the published content. The aggregated information can result in unexpected exposure of one’s social environment and lead to abuse of one’s personal information. We expected that frequency of sharing pictures and frequency of changing privacy settings would be significantly related to performance, but the results indicate that the frequency of social network use, frequency of uploading images and frequency of changing settings are not related to the performance our algorithm obtains with privacy settings predictions. This is a particularly useful result as it indicates that our algorithm will perform equally well for users who frequently use and share images on social networks as well as for users who may have limited access or limited information to share.

METADATA-BASED CLASSIFICATION:

We propose a hierarchical image classification which first classifies images based on their contents and then refines each category into subcategories based on their metadata. Images that do not have metadata will be grouped only by content. Such a hierarchical classification gives a higher priority to image content and minimizes the influence of missing tags. Note that some images may be included in multiple categories as long as they contain the typical content features or metadata of those categories. The metadata-based classification groups images into subcategories under the aforementioned baseline categories.

The process consists of three main steps.

The third step is to find the subcategory that an image belongs to. This is an incremental procedure. At the beginning, the first image forms a subcategory by itself, and the representative hypernyms of the image become the subcategory’s representative hypernyms. Then, we compute the distance between the representative hypernyms of a new incoming image and those of each existing subcategory.

CONTENT-BASED CLASSIFICATION:

Our approach to content-based classification is based on an efficient and yet accurate image similarity approach. Specifically, our classification algorithm compares image signatures defined based on quantified and sanitized version of Haar wavelet transformation. For each image, the wavelet transform encodes frequency and spatial information related to image color, size, invariant transform, shape, texture, symmetry, etc. Then, a small number of coefficients are selected to form the signature of the image. The content similarity among images is then determined by the distance among their image signatures.
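The signature idea above can be illustrated with a minimal sketch. It is not the system's actual implementation: it assumes a one-dimensional row of pixel intensities rather than a full image, an unnormalized Haar decomposition, quantization of each kept coefficient to its sign, and a distance equal to the number of signed coefficients the two signatures do not share. The class name `HaarSignature` and the parameter `k` are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only: a 1D Haar wavelet signature and a simple signature distance.
final class HaarSignature {

    // Full 1D Haar decomposition using pairwise averages and differences.
    // Assumption: the input length is a power of two.
    static double[] haar(double[] input) {
        if (input.length == 0 || Integer.bitCount(input.length) != 1) {
            throw new IllegalArgumentException("length must be a power of two");
        }
        double[] data = input.clone();
        for (int len = data.length; len > 1; len /= 2) {
            double[] tmp = new double[len];
            for (int i = 0; i < len / 2; i++) {
                tmp[i] = (data[2 * i] + data[2 * i + 1]) / 2.0;           // average
                tmp[len / 2 + i] = (data[2 * i] - data[2 * i + 1]) / 2.0; // detail
            }
            System.arraycopy(tmp, 0, data, 0, len);
        }
        return data;
    }

    // Keep the k largest-magnitude detail coefficients, quantized to their sign.
    // Each entry is +index or -index; index 0 (the overall average) is skipped.
    static Set<Integer> signature(double[] pixels, int k) {
        double[] c = haar(pixels);
        Integer[] idx = new Integer[c.length - 1];
        for (int i = 1; i < c.length; i++) {
            idx[i - 1] = i;
        }
        Arrays.sort(idx, (a, b) -> Double.compare(Math.abs(c[b]), Math.abs(c[a])));
        Set<Integer> sig = new HashSet<>();
        for (int j = 0; j < Math.min(k, idx.length); j++) {
            int i = idx[j];
            if (c[i] != 0.0) {
                sig.add(c[i] > 0 ? i : -i);
            }
        }
        return sig;
    }

    // Distance: number of signed coefficients present in exactly one signature.
    static int distance(Set<Integer> a, Set<Integer> b) {
        Set<Integer> union = new HashSet<>(a);
        union.addAll(b);
        Set<Integer> common = new HashSet<>(a);
        common.retainAll(b);
        return union.size() - common.size();
    }

    public static void main(String[] args) {
        Set<Integer> first = signature(new double[]{4, 2, 6, 8}, 2);
        Set<Integer> second = signature(new double[]{8, 6, 2, 4}, 2);
        System.out.println("distance = " + distance(first, second));
    }
}
```

A smaller distance means more similar content; a new image would be assigned to the class whose stored signatures are closest under this measure.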

Our selected similarity criteria include texture, symmetry, shape (radial symmetry and phase congruency), and SIFT. We also account for color and size. We set the system to start from five generic image classes: (a) explicit (e.g., nudity, violence, drinking, etc.), (b) adults, (c) kids, (d) scenery (e.g., beach, mountains), and (e) animals. As a preprocessing step, we populate the five baseline classes by manually assigning to each class a number of images crawled from Google Images, resulting in about 1,000 images per class. Having a large image data set beforehand reduces the chance of misclassification. Then, we generate signatures of all the images and store them in the database.

To evaluate the accuracy of our content classifier, we conducted some preliminary tests. Specifically, we tested our classifier against a ground-truth data set, Image-net.org. In Image-net, over 10 million images are collected and classified according to the WordNet structure. For each image class, we use the first half set of images as the training data set and classify the next 800 images. The classification result was recorded as correct if the synset’s main search term or the direct hypernym is returned as a class. The average accuracy of our classifier is above 94 percent.

ADAPTIVE POLICY PREDICTION:

The policy prediction algorithm provides a predicted policy of a newly uploaded image to the user for his/her reference. More importantly, the predicted policy will reflect the possible changes of a user’s privacy concerns. The prediction process consists of three main phases: (i) policy normalization; (ii) policy mining; and (iii) policy prediction. The policy normalization is a simple decomposition process to convert a user policy into a set of atomic rules in which the data (D) component is a single-element set.
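The normalization step described above can be sketched as follows. This is an illustration under assumptions, not the system's actual code: the `Policy` shape (subjects, data, actions, conditions) and the class name `PolicyNormalizer` are hypothetical, chosen to match the components mentioned in the text. Each atomic rule keeps the original subjects, actions, and conditions but carries exactly one data item.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch only: decompose a policy into atomic rules with a single-element data set.
final class PolicyNormalizer {

    // Hypothetical policy shape: who (subjects) may do what (actions)
    // to which items (data), under which conditions.
    static final class Policy {
        final Set<String> subjects;
        final Set<String> data;
        final Set<String> actions;
        final Set<String> conditions;

        Policy(Set<String> subjects, Set<String> data, Set<String> actions, Set<String> conditions) {
            this.subjects = subjects;
            this.data = data;
            this.actions = actions;
            this.conditions = conditions;
        }
    }

    // One atomic rule per data item; items are visited in sorted order so the output is deterministic.
    static List<Policy> normalize(Policy policy) {
        List<Policy> atomic = new ArrayList<>();
        for (String item : new TreeSet<>(policy.data)) {
            atomic.add(new Policy(policy.subjects, Collections.singleton(item),
                    policy.actions, policy.conditions));
        }
        return atomic;
    }

    public static void main(String[] args) {
        Policy policy = new Policy(
                new HashSet<>(Arrays.asList("family")),
                new HashSet<>(Arrays.asList("image", "comments")),
                new HashSet<>(Arrays.asList("view")),
                new HashSet<>(Arrays.asList("none")));
        for (Policy rule : normalize(policy)) {
            System.out.println(rule.subjects + " " + rule.data + " " + rule.actions + " " + rule.conditions);
        }
    }
}
```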

We propose a hierarchical mining approach for policy mining. Our approach leverages association rule mining techniques to discover popular patterns in policies. Policy mining is carried out within the same category of the new image because images in the same category are more likely under the similar level of privacy protection. The basic idea of the hierarchical mining is to follow a natural order in which a user defines a policy.

Given an image, a user usually first decides who can access the image, then thinks about what specific access rights (e.g., view only or download) should be given, and finally refines the access conditions, such as setting the expiration date. Correspondingly, the hierarchical mining first looks for popular subjects defined by the user, then looks for popular actions in the policies containing the popular subjects, and finally looks for popular conditions in the policies containing both the popular subjects and the popular actions.
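The subject, then action, then condition order can be sketched with plain frequency counting. This is a simplification and an assumption: the text refers to association rule mining, whereas the sketch below only counts relative frequencies level by level, and it measures support against the rules remaining at each level. The names `HierarchicalMiner`, `Rule`, and `minSupport` are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch only: level-by-level frequency mining over atomic rules.
final class HierarchicalMiner {

    // Hypothetical atomic rule with one subject, one action, and one condition.
    static final class Rule {
        final String subject;
        final String action;
        final String condition;

        Rule(String subject, String action, String condition) {
            this.subject = subject;
            this.action = action;
            this.condition = condition;
        }
    }

    // Values whose relative frequency within the given list is at least minSupport.
    static Set<String> popular(List<String> values, double minSupport) {
        Map<String, Integer> counts = new HashMap<>();
        for (String value : values) {
            counts.merge(value, 1, Integer::sum);
        }
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= minSupport * values.size()) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    // Returns patterns of the form "subject|action|condition".
    static List<String> mine(List<Rule> rules, double minSupport) {
        List<String> patterns = new ArrayList<>();
        List<String> subjects = new ArrayList<>();
        for (Rule r : rules) {
            subjects.add(r.subject);
        }
        for (String s : popular(subjects, minSupport)) {
            // Level 2: actions among rules that contain the popular subject.
            List<String> actions = new ArrayList<>();
            for (Rule r : rules) {
                if (r.subject.equals(s)) actions.add(r.action);
            }
            for (String a : popular(actions, minSupport)) {
                // Level 3: conditions among rules with both the subject and the action.
                List<String> conditions = new ArrayList<>();
                for (Rule r : rules) {
                    if (r.subject.equals(s) && r.action.equals(a)) conditions.add(r.condition);
                }
                for (String c : popular(conditions, minSupport)) {
                    patterns.add(s + "|" + a + "|" + c);
                }
            }
        }
        return patterns;
    }

    public static void main(String[] args) {
        List<Rule> rules = new ArrayList<>();
        rules.add(new Rule("family", "view", "none"));
        rules.add(new Rule("family", "view", "none"));
        rules.add(new Rule("friends", "view", "expires"));
        System.out.println(mine(rules, 0.5));
    }
}
```

Narrowing the candidate set at each level mirrors the order in which a user is said to define a policy, and it keeps the later counts small.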

CHAPTER 5

5.0 SYSTEM STUDY:

5.1 FEASIBILITY STUDY:

Three key considerations involved in the feasibility analysis are

ECONOMICAL FEASIBILITY
TECHNICAL FEASIBILITY
SOCIAL FEASIBILITY

5.1.1 ECONOMICAL FEASIBILITY:

5.1.2 TECHNICAL FEASIBILITY

5.1.3 SOCIAL FEASIBILITY:

5.2 SYSTEM TESTING:

5.2.1 UNIT TESTING:

Description	Expected result
Test for application window properties.	All the properties of the windows are to be properly aligned and displayed.
Test for mouse operations.	All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions.

5.2.2 FUNCTIONAL TESTING:

Description	Expected result
Test for all modules.	All peers should communicate in the group.
Test for various peers in a distributed network framework, which displays all users available in the group.	The result after execution should be accurate.

5.2.3 NON-FUNCTIONAL TESTING:

Load testing
Performance testing
Usability testing
Reliability testing
Security testing

5.2.4 LOAD TESTING:

Description	Expected result
It is necessary to ascertain that the application behaves correctly under loads when ‘Server busy’ response is received.	Should designate another active node as a Server.

5.2.5 PERFORMANCE TESTING:

Description	Expected result
This is required to assure that the application performs adequately, having the capability to handle many peers, delivering its results in the expected time, and using an acceptable level of resources; it is an aspect of operational management.	Should handle large input values and produce accurate results in the expected time.

5.2.6 RELIABILITY TESTING:

Description	Expected result
This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in providing the application.	In case of failure of the server, an alternate server should take over the job.

5.2.7 SECURITY TESTING:

Description	Expected result
Check that the user identification is authenticated.	In case of failure, the user should not be connected to the framework.
Check whether group keys in a tree are shared by all peers.	Peers in the same group should know the group key.

5.2.8 WHITE BOX TESTING:

Description	Expected result
Exercise all logical decisions on their true and false sides.	All the logical decisions must be valid.
Execute all loops at their boundaries and within their operational bounds.	All the loops must be finite.
Exercise internal data structures to ensure their validity.	All the data structures must be valid.

5.2.9 BLACK BOX TESTING:

Description	Expected result
To check for incorrect or missing functions.	All the functions must be valid.
To check for interface errors.	The entire interface must function normally.
To check for errors in data structures or external database access.	Database updates and retrieval must complete successfully.
To check for initialization and termination errors.	All the functions and data structures must be initialized properly and terminated normally.

All of the above system testing strategies are carried out, since the development, documentation, and institutionalization of the proposed goals and related policies are essential.

CHAPTER 6

6.0 SOFTWARE DESCRIPTION:

6.1 JAVA TECHNOLOGY:

Java technology is both a programming language and a platform.

The Java Programming Language

The Java programming language is a high-level language that can be characterized by all of the following buzzwords:

- Simple
- Architecture neutral
- Object oriented
- Portable
- Distributed
- High performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure

6.2 THE JAVA PLATFORM:

The Java platform has two components:

The Java Virtual Machine (Java VM)
The Java Application Programming Interface (Java API)

You’ve already been introduced to the Java VM. It’s the base for the Java platform and is ported onto various hardware-based platforms.

The following figure depicts a program that’s running on the Java platform. As the figure shows, the Java API and the virtual machine insulate the program from the hardware.

6.3 WHAT CAN JAVA TECHNOLOGY DO?

The essentials: Objects, strings, threads, numbers, input and output, data structures, system properties, date and time, and so on.
Applets: The set of conventions used by applets.
Networking: URLs, TCP (Transmission Control Protocol), UDP (User Datagram Protocol) sockets, and IP (Internet Protocol) addresses.
Internationalization: Help for writing programs that can be localized for users worldwide. Programs can automatically adapt to specific locales and be displayed in the appropriate language.
Security: Both low level and high level, including electronic signatures, public and private key management, access control, and certificates.
Software components: Known as JavaBeans™, can plug into existing component architectures.
Object serialization: Allows lightweight persistence and communication via Remote Method Invocation (RMI).
Java Database Connectivity (JDBC™): Provides uniform access to a wide range of relational databases.

The Java platform also has APIs for 2D and 3D graphics, accessibility, servers, collaboration, telephony, speech, animation, and more. The following figure depicts what is included in the Java 2 SDK.

6.4 HOW WILL JAVA TECHNOLOGY CHANGE MY LIFE?

Get started quickly: Although the Java programming language is a powerful object-oriented language, it’s easy to learn, especially for programmers already familiar with C or C++.
Write less code: Comparisons of program metrics (class counts, method counts, and so on) suggest that a program written in the Java programming language can be four times smaller than the same program in C++.
Write better code: The Java programming language encourages good coding practices, and its garbage collection helps you avoid memory leaks. Its object orientation, its JavaBeans component architecture, and its wide-ranging, easily extendible API let you reuse other people’s tested code and introduce fewer bugs.
Develop programs more quickly: Your development time may be as much as twice as fast versus writing the same program in C++. Why? You write fewer lines of code and it is a simpler programming language than C++.
Avoid platform dependencies with 100% Pure Java: You can keep your program portable by avoiding the use of libraries written in other languages. The 100% Pure Java™ Product Certification Program has a repository of historical process manuals, white papers, brochures, and similar materials online.
Write once, run anywhere: Because 100% Pure Java programs are compiled into machine-independent byte codes, they run consistently on any Java platform.
Distribute software more easily: You can upgrade applets easily from a central server. Applets take advantage of the feature of allowing new classes to be loaded “on the fly,” without recompiling the entire program.

6.5 ODBC:

6.6 JDBC:

JDBC was announced in March of 1996. It was released for a 90 day public review that ended June 8, 1996. Because of user input, the final JDBC v1.0 specification was released soon after.

6.7 JDBC Goals:

The goals that were set for JDBC are important. They will give you some insight as to why certain classes and functionalities behave the way they do. The eight design goals for JDBC are as follows:

SQL Level API

SQL Conformance

JDBC must be implementable on top of common database interfaces

Provide a Java interface that is consistent with the rest of the Java system

Because of Java’s acceptance in the user community thus far, the designers feel that they should not stray from the current design of the core Java system.

Keep it simple

Use strong, static typing wherever possible

Strong typing allows more error checking to be done at compile time, so fewer errors appear at runtime.

Keep the common cases simple

Finally, we decided to proceed with the implementation using Java Networking.

For dynamically updating the cache table, we use an MS Access database.

Java is two things: a programming language and a platform.

Java is a high-level programming language that is all of the following:

- Simple
- Architecture-neutral
- Object-oriented
- Portable
- Distributed
- High-performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure

Compilation happens just once; interpretation occurs each time the program is executed. The figure illustrates how this works.

6.7 NETWORKING TCP/IP STACK:

The TCP/IP stack is shorter than the OSI one:

TCP is a connection-oriented protocol; UDP (User Datagram Protocol) is a connectionless protocol.

IP datagrams:

UDP:

UDP is also connectionless and unreliable. What it adds to IP is a checksum for the contents of the datagram and port numbers. These are used to give a client/server model – see later.

TCP:

TCP supplies logic to give a reliable connection-oriented protocol above IP. It provides a virtual circuit that two processes can use to communicate.

Internet addresses

In order to use a service, you must be able to find it. The Internet uses an address scheme for machines so that they can be located. The address is a 32 bit integer which gives the IP address.

Network address:

Class A uses 8 bits for the network address with 24 bits left over for other addressing. Class B uses 16 bit network addressing. Class C uses 24 bit network addressing and class D uses all 32.

Subnet address:

Internally, the UNIX network is divided into sub networks. Building 11 is currently on one sub network and uses 10-bit addressing, allowing 1024 different hosts.

Host address:

8 bits are finally used for host addresses within our subnet. This places a limit of 256 machines that can be on the subnet.

Total address:

The 32 bit address is usually written as 4 integers separated by dots.
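The addressing scheme above can be made concrete with a short sketch that converts between a 32-bit integer and dotted notation and reports the classful address class from the first octet. The class name `Ipv4Address` and its method names are hypothetical; the class boundaries follow the standard classful ranges (A: 0 to 127, B: 128 to 191, C: 192 to 223, D: 224 to 239).

```java
// Illustrative sketch only: 32-bit IPv4 addresses, dotted notation, and classful address classes.
final class Ipv4Address {

    // Write the 32-bit address as four integers (one per byte) separated by dots.
    static String toDotted(int address) {
        return ((address >>> 24) & 0xFF) + "."
                + ((address >>> 16) & 0xFF) + "."
                + ((address >>> 8) & 0xFF) + "."
                + (address & 0xFF);
    }

    // Parse dotted notation back into a 32-bit integer.
    static int parse(String dotted) {
        String[] parts = dotted.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("expected four octets: " + dotted);
        }
        int address = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("octet out of range: " + part);
            }
            address = (address << 8) | octet;
        }
        return address;
    }

    // Classful address class, decided by the leading bits of the first octet.
    static char addressClass(int address) {
        int first = (address >>> 24) & 0xFF;
        if (first < 128) return 'A'; // leading bit 0: 8-bit network part
        if (first < 192) return 'B'; // leading bits 10: 16-bit network part
        if (first < 224) return 'C'; // leading bits 110: 24-bit network part
        if (first < 240) return 'D'; // leading bits 1110
        return 'E';
    }

    public static void main(String[] args) {
        int address = parse("192.168.1.10");
        System.out.println(toDotted(address) + " is class " + addressClass(address));
    }
}
```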

Port addresses

Sockets:

#include <sys/types.h>

#include <sys/socket.h>

int socket(int family, int type, int protocol);

6.8 JFREE CHART:

JFreeChart is a free 100% Java chart library that makes it easy for developers to display professional quality charts in their applications. JFreeChart’s extensive feature set includes:

A consistent and well-documented API, supporting a wide range of chart types;

A flexible design that is easy to extend, and targets both server-side and client-side applications;

Support for many output types, including Swing components, image files (including PNG and JPEG), and vector graphics file formats (including PDF, EPS and SVG);

JFreeChart is “open source” or, more specifically, free software. It is distributed under the terms of the GNU Lesser General Public Licence (LGPL), which permits use in proprietary applications.

6.8.1. Map Visualizations:

6.8.2. Time Series Chart Interactivity

6.8.3. Dashboards

6.8.4. Property Editors

CHAPTER 7

7.0 APPENDIX

7.1 SAMPLE SCREEN SHOTS:

7.2 SAMPLE SOURCE CODE:

CHAPTER 8

8.1 CONCLUSION

With A3P-Social, we achieve a much higher accuracy, demonstrating that simply considering privacy inclination is not enough, and that “social context” truly matters. Specifically, the overall accuracy of A3P-social is above 95 percent. For 88.6 percent of the users, all predicted policies are correct, and the number of missed policies is 33 (for over 2,600 predictions). Also, we note that in this case, there is no significant difference across image types.

We compared the performance of A3P-Social with alternative, popular recommendation methods. Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them. In our case, the vectors are the users’ attributes defining their social profile. The algorithm using cosine similarity scans all user profiles and computes the cosine similarity of the social contexts between the new user and the existing users. Then, it finds the top two users with the highest similarity score with the candidate user and feeds the associated images to the remaining functions in the A3P-core.
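The cosine-similarity baseline can be sketched as follows. This is an illustration under assumptions: social profiles are taken to be numeric attribute vectors of equal length, and the names `CosineRecommender`, `cosine`, and `topTwo` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: pick the two existing users most similar to a new user.
final class CosineRecommender {

    // Cosine of the angle between two attribute vectors; 0.0 if either vector is all zeros.
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have equal length");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    // Scan all existing profiles and return the ids of the two most similar users.
    static List<String> topTwo(double[] newUser, Map<String, double[]> existing) {
        List<String> ids = new ArrayList<>(existing.keySet());
        Collections.sort(ids); // alphabetical order first, so ties are broken deterministically
        ids.sort((x, y) -> Double.compare(
                cosine(newUser, existing.get(y)), cosine(newUser, existing.get(x))));
        return new ArrayList<>(ids.subList(0, Math.min(2, ids.size())));
    }

    public static void main(String[] args) {
        Map<String, double[]> profiles = new HashMap<>();
        profiles.put("alice", new double[]{1, 1, 0});
        profiles.put("bob", new double[]{1, 0, 0});
        profiles.put("carol", new double[]{0, 0, 1});
        System.out.println(topTwo(new double[]{1, 1, 0}, profiles));
    }
}
```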

We have proposed an Adaptive Privacy Policy Prediction (A3P) system that helps users automate the privacy policy settings for their uploaded images. The A3P system provides a comprehensive framework to infer privacy preferences based on the information available for a given user. We also effectively tackled the issue of cold-start, leveraging social context information. Our experimental study proves that our A3P is a practical tool that offers significant improvements over current approaches to privacy.

Predicting Asthma-Related Emergency Department Visits Using Big Data

05/08/201902/07/2019 by admin

Asthma is one of the most prevalent and costly chronic conditions in the United States, and it cannot be cured. However, accurate and timely surveillance data could allow for timely and targeted interventions at the community or individual level. Current national asthma disease surveillance systems can have data availability lags of up to two weeks. Rapid progress has been made in gathering non-traditional, digital information to perform disease surveillance.

We introduce a novel method of using multiple data sources for predicting the number of asthma-related emergency department (ED) visits in a specific area. Twitter data, Google search interests, and environmental sensor data were collected for this purpose. Our preliminary findings show that our model can predict the number of asthma ED visits based on near-real-time environmental and social media data with approximately 70% precision. The results can be helpful for public health surveillance, emergency department preparedness, and targeted patient interventions.

1.2 INTRODUCTION:

Asthma is one of the most prevalent and costly chronic conditions in the United States, with 25 million people affected. Asthma accounts for about two million emergency department (ED) visits, half a million hospitalizations, and 3,500 deaths, and incurs more than 50 billion dollars in direct medical costs annually. Moreover, asthma is a leading cause of lost productivity, with nearly 11 million missed school days and more than 14 million missed work days every year due to asthma. Although asthma cannot be cured, many of its adverse events can be prevented by appropriate medication use and avoidance of environmental triggers. The prediction of population- and individual-level risk for asthma adverse events using accurate and timely surveillance data could guide timely and targeted interventions to reduce the societal burden of asthma. At the population level, current national asthma disease surveillance programs rely on weekly reports to the Centers for Disease Control and Prevention (CDC) of data collected from various local resources by state health departments.

Notoriously, such data have a lag-time of weeks, therefore providing retrospective information that is not amenable to proactive and timely preventive interventions. At the individual level, known predictors of asthma ED visits and hospitalizations include past acute care utilization, medication use, and sociodemographic characteristics. Common data sources for these variables include electronic medical records (EMR), medical insurance claims data, and population surveys, all of which, also, are subject to significant time lag. In an ongoing quality improvement project for asthma care, Parkland Center for Clinical Innovation (PCCI) researchers have built an asthma predictive model relying on a combination of EMR and claim data to predict the risk for asthma-related ED visits within three months of data collection [Unpublished reports from PCCI]. Although the model performance (C-statistic 72%) and prediction timeframe (three months) are satisfying, a narrower prediction timeframe potentially could provide additional risk-stratification for more efficiency and timeliness in resource deployment. For instance, resources might be prioritized to first serve patients at high risk for an asthma ED visit within 2 weeks of data collection, while being safely deferred for patients with a later predicted high-risk period.

Novel sources of timely data on population- and individual-level asthma activities are needed to provide additional temporal and geographical granularity to asthma risk stratification. Short of collecting information directly from individual patients (a time- and resource-intensive endeavor), readily available public data will have to be repurposed intelligently to provide the required information. There has been increasing interest in gathering non-traditional, digital information to perform disease surveillance. These include diverse datasets such as those stemming from social media, internet search, and environmental data. Twitter is an online social media platform that enables users to post and read 140-character messages called “tweets”. It is a popular data source for disease surveillance using social media since it can provide nearly instant access to real-time social opinions. More importantly, tweets are often tagged by geographic location and time stamps potentially providing information for disease surveillance.

Another notable non-traditional disease surveillance system has been a data-aggregating tool called Google Flu Trends, which uses aggregated search data to estimate flu activity. Google Trends was quite successful in its estimation of influenza-like illness. It is based on Google’s search engine, which tracks how often a particular search term is entered relative to the total search volume across a particular area. This enables access to the latest data from web search interest trends on a variety of topics, including diseases like asthma. Air pollutants are known triggers for asthma symptoms and exacerbations. The United States Environmental Protection Agency (EPA) provides access to monitored air quality data collected at outdoor sensors across the country, which could be used as a data source for asthma prediction. Meanwhile, as health reform progresses, the quantity and variety of health records being made available electronically are increasing dramatically. In contrast to traditional disease surveillance systems, these new data sources have the potential to enable health organizations to respond to chronic conditions, like asthma, in real time. This in turn implies that health organizations can appropriately plan for staffing and equipment availability in a flexible manner. They can also provide early warning signals to the people at risk for asthma adverse events, and enable timely, proactive, and targeted preventive and therapeutic interventions.

1.3 LITERATURE SURVEY:

USE OF HANGEUL TWITTER TO TRACK AND PREDICT HUMAN INFLUENZA INFECTION

AUTHOR: Kim, Eui-Ki, et al.

PUBLISH: PloS one vol. 8, no.7, e69305, 2013.

EXPLANATION:

Influenza epidemics arise through the accumulation of viral genetic changes. The emergence of new virus strains coincides with a higher level of influenza-like illness (ILI), which is seen as a peak of a normal season. Monitoring the spread of an epidemic influenza in populations is a difficult and important task. Twitter is a free social networking service whose messages can improve the accuracy of forecasting models by providing early warnings of influenza outbreaks. In this study, we have examined the use of information embedded in the Hangeul Twitter stream to detect rapidly evolving public awareness or concern with respect to influenza transmission and developed regression models that can track levels of actual disease activity and predict influenza epidemics in the real world. Our prediction model using a delay mode provides not only a real-time assessment of the current influenza epidemic activity but also a significant improvement in prediction performance at the initial phase of ILI peak when prediction is of most importance.

A NEW AGE OF PUBLIC HEALTH: IDENTIFYING DISEASE OUTBREAKS BY ANALYZING TWEETS

AUTHOR: Krieck, Manuela, Johannes Dreesman, Lubomir Otrusina, and Kerstin Denecke.

PUBLISH: In Proceedings of Health Web-Science Workshop, ACM Web Science Conference. 2011.

EXPLANATION:

Traditional disease surveillance is a very time-consuming reporting process. Cases of notifiable diseases are reported to the different levels in the national health care system before actions can be taken. However, early detection of disease activity followed by a rapid response is crucial to reduce the impact of epidemics. To address this challenge, alternative sources of information are investigated for disease surveillance. In this paper, the relevance of Twitter messages for outbreak detection is investigated from two directions. First, Twitter messages potentially related to disease outbreaks are retrospectively searched and analyzed. Second, incoming Twitter messages are assessed with respect to their relevance for outbreak detection. The studies show that Twitter messages can be, to a certain extent, highly relevant for the early detection of hints of public health threats. According to the German Protection against Infection Act (Infektionsschutzgesetz (IfSG), 2001), traditional disease surveillance relies on data from mandatory reporting of cases by physicians and laboratories. They inform local county health departments (Landkreis), which in turn report to state health departments (Land). At the end of the reporting pipeline, the national surveillance institute (Robert Koch Institute) is informed about the outbreak. It is clear that these different stages of reporting take time and delay a timely reaction.

TOWARDS DETECTING INFLUENZA EPIDEMICS BY ANALYZING TWITTER MESSAGES

AUTHOR: Culotta, Aron.

PUBLISH: In Proceedings of the first workshop on social media analytics, pp. 115-122. ACM, 2010.

EXPLANATION:

Rapid response to a health epidemic is critical to reduce loss of life. Existing methods mostly rely on expensive surveys of hospitals across the country, typically with lag times of one to two weeks for influenza reporting, and even longer for less common diseases. In response, there have been several recently proposed solutions to estimate a population’s health from Internet activity, most notably Google’s Flu Trends service, which correlates search term frequency with influenza statistics reported by the Centers for Disease Control and Prevention (CDC). In this paper, we analyze messages posted on the micro-blogging site Twitter.com to determine if a similar correlation can be uncovered. We propose several methods to identify influenza-related messages and compare a number of regression models to correlate these messages with CDC statistics. Using over 500,000 messages spanning 10 weeks, we find that our best model achieves a correlation of .78 with CDC statistics by leveraging a document classifier to identify relevant messages.
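The kind of correlation reported above can be computed with the Pearson coefficient. The sketch below is a generic illustration, not the cited study's code: it assumes two equally long series, such as weekly counts of flu-related messages and the corresponding weekly surveillance figures, and the class name `Pearson` is hypothetical.

```java
// Illustrative sketch only: Pearson correlation between two equally long series.
final class Pearson {

    // Returns a value in [-1, 1], or NaN if either series is constant.
    static double correlation(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two series of equal length with at least two points");
        }
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < x.length; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= x.length;
        meanY /= y.length;

        double sxy = 0.0; // co-deviation of x and y
        double sxx = 0.0; // squared deviation of x
        double syy = 0.0; // squared deviation of y
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    public static void main(String[] args) {
        // Made-up weekly values, for demonstration only.
        double[] messages = {120, 150, 180, 260, 240};
        double[] reported = {1.1, 1.3, 1.8, 2.4, 2.6};
        System.out.println("r = " + correlation(messages, reported));
    }
}
```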

CHAPTER 2

2.0 SYSTEM ANALYSIS

2.1 EXISTING SYSTEM:

With the increased availability of information on the Web, a new research area has developed in recent years, namely Infodemiology. It can be defined as the “science of distribution and determinants of information in an electronic medium, specifically the Internet, or in a population, with the ultimate aim to inform public health and public policy”. As part of this research area, several kinds of data have been studied for their applicability in the context of disease surveillance. Google Flu Trends exploits search behavior to monitor current flu-related disease activity. Carneiro and Mylonakis showed that Google Flu Trends can detect regional outbreaks of influenza 7–10 days before conventional Centers for Disease Control and Prevention surveillance systems.

Twitter messages and their relevance for disease outbreak detection have already been reported on; tweets in particular have been shown to be useful for predicting outbreaks such as a Norovirus outbreak at a university. Chew et al. analysed Twitter news during the 2009 influenza epidemic. They compared the use of the terms “H1N1” and “swine flu” over time. Furthermore, they analysed the content of the tweets (ten content concepts) and validated Twitter as a source of real-time content. They analysed the data via Infovigil, an infosurveillance system, using automated coding. To find out whether there is a relationship between automated and manual coding, the tweets were evaluated using Pearson’s correlation. Chew et al. found a significant correlation between both codings in seven content concepts. It remains to be investigated whether this source might be of relevance for detecting disease outbreaks in Germany. Therefore, only German keywords are exploited to identify Twitter messages. Further, we are interested not only in influenza-like illnesses, as in the studies available so far, but also in other infectious diseases (e.g. Norovirus and Salmonella).

2.1.1 DISADVANTAGES:

Twitter messages have a common format:

[username] [text] [date time client]

The length is restricted to 140 characters. In terms of linguistics, each Twitter user can write as he or she likes. Thus, the variety ranges from complete sentences to lists of keywords. Hashtags, i.e. terms that are combined with a hash (e.g. #flu), denote topics and are primarily used by experienced users. Categorized by content in more detail, Twitter messages can:

• Provide information,

• Express opinions, or

• Report personal issues.

When information is provided, the authority of that information cannot normally be determined, so it might be unverified. Opinions are often expressed with humor or sarcasm and may be highly contradictory in the emotions that are expressed.

2.2 PROPOSED SYSTEM:

Our proposed method leverages social media, internet search, and environmental air quality data to estimate ED visits for asthma in a relatively discrete geographic area (a metropolitan area) within a relatively short time period (days). To this end, we have gathered asthma-related ED visit data, social media data from Twitter, internet users’ search interests from Google, and pollution sensor data from the EPA, all from the same geographic area and time period, to create a model for predicting asthma-related ED visits. This work differs from extant studies, which typically predict the spread of contagious diseases using social media such as Twitter. Unlike influenza or other viral diseases, asthma is a non-communicable health condition, and we demonstrate the utility and value of linking big data from diverse sources in developing predictive models for non-communicable diseases, with a specific focus on asthma.

Research studies have explored the use of novel data sources to propose rapid, cost-effective health status surveillance methodologies. Some of the early studies rely on document classification, suggesting that Twitter data can be highly relevant for early detection of public health threats. Others employ more complex linguistic analysis, such as the Ailment Topic Aspect Model, which is useful for syndrome surveillance. This type of analysis is useful for demonstrating the significance of social media as a promising new data source for health surveillance. Other recent studies have linked social media data with real-world disease incidence to generate actionable knowledge useful for making health care decisions. These include studies that analyzed Twitter messages related to influenza and correlated them with reported CDC statistics, and work that validated Twitter as a real-time content, sentiment, and public attention trend-tracking tool. Collier employed supervised classifiers (SVM and Naive Bayes) to classify tweets into four self-reported protective behavior categories. This study adds to evidence supporting a high degree of correlation between pre-diagnostic social media signals and diagnostic influenza case data.

2.2.1 ADVANTAGES:

Our work uses a combination of data from multiple sources to predict the number of asthma-related ED visits in near real-time. In doing so, we exploit geographic information associated with each dataset. We describe the techniques to process multiple types of datasets, to extract signals from each, integrate, and feed into a prediction model using machine learning algorithms, and demonstrate the feasibility of such a prediction.

The main contributions of this work are:

• Analysis of tweets with respect to their relevance for disease surveillance,

• Content analysis and content classification of tweets,

• Linguistic analysis of disease-reporting twitter messages,

• Recommendations on search patterns for tweet search in the context of disease surveillance.

2.3 HARDWARE & SOFTWARE REQUIREMENTS:

2.3.1 HARDWARE REQUIREMENT:

- Processor – Pentium IV
- Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Key Board – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA

2.3.2 SOFTWARE REQUIREMENTS:

Operating System : Windows XP or Win7
Front End : JAVA JDK 1.7
Back End : MYSQL Server
Server : Apache Tomcat Server
Script : JSP Script
Document : MS-Office 2007

CHAPTER 3

3.0 SYSTEM DESIGN:

Data Flow Diagram / Use Case Diagram / Flow Diagram:

The DFD is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on these data, and the output data generated by the system.

The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components. These components are the system process, the data used by the process, an external entity that interacts with the system and the information flows in the system.

DFD shows how the information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.

A DFD may be used to represent a system at any level of abstraction. A DFD may be partitioned into levels that represent increasing information flow and functional detail.

NOTATION:

SOURCE OR DESTINATION OF DATA:

External sources or destinations, which may be people or organizations or other entities

DATA SOURCE:

Here the data referenced by a process is stored and retrieved.

PROCESS:

People, procedures or devices that produce data. The physical component is not identified.

DATA FLOW:

Data moves in a specific direction from an origin to a destination. The data flow is a “packet” of data.

MODELING RULES:

There are several common modeling rules when creating DFDs:

All processes must have at least one data flow in and one data flow out.
All processes should modify the incoming data, producing new forms of outgoing data.
Each data store must be involved with at least one data flow.
Each external entity must be involved with at least one data flow.
A data flow must be attached to at least one process.
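
The structural rules above can be checked mechanically. The following is a minimal sketch, not part of the original project: the class `DfdRuleCheck`, its `Kind` and `Flow` types, and the node names used in `main` are all hypothetical. The rule that a process should modify its incoming data is a semantic property and is not checked here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical checker for the structural DFD modeling rules. */
class DfdRuleCheck {
    enum Kind { PROCESS, DATA_STORE, EXTERNAL_ENTITY }

    /** A directed data flow between two named nodes. */
    static final class Flow {
        final String from;
        final String to;
        Flow(String from, String to) { this.from = from; this.to = to; }
    }

    /** Returns one message per violated rule; an empty list means the diagram passes. */
    static List<String> check(Map<String, Kind> nodes, List<Flow> flows) {
        List<String> problems = new ArrayList<>();
        for (Flow f : flows) {
            // A data flow must be attached to at least one process.
            if (nodes.get(f.from) != Kind.PROCESS && nodes.get(f.to) != Kind.PROCESS) {
                problems.add("flow " + f.from + " -> " + f.to + " touches no process");
            }
        }
        for (Map.Entry<String, Kind> e : nodes.entrySet()) {
            String name = e.getKey();
            boolean in = false;
            boolean out = false;
            for (Flow f : flows) {
                if (f.to.equals(name)) in = true;
                if (f.from.equals(name)) out = true;
            }
            if (e.getValue() == Kind.PROCESS) {
                // A process needs at least one flow in and one flow out.
                if (!in || !out) {
                    problems.add("process " + name + " lacks an input or an output flow");
                }
            } else if (!in && !out) {
                // Data stores and external entities need at least one flow.
                problems.add(name + " is not involved in any data flow");
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        // Illustrative diagram only; the node names are assumptions.
        Map<String, Kind> nodes = new LinkedHashMap<>();
        nodes.put("Twitter user", Kind.EXTERNAL_ENTITY);
        nodes.put("Filter Tweet", Kind.PROCESS);
        nodes.put("Asthma Tweets", Kind.DATA_STORE);

        List<Flow> flows = new ArrayList<>();
        flows.add(new Flow("Twitter user", "Filter Tweet"));
        flows.add(new Flow("Filter Tweet", "Asthma Tweets"));

        System.out.println("Violations: " + check(nodes, flows));
    }
}
```

A check of this kind is only a sanity test on the diagram's connectivity; it says nothing about whether the diagram models the system correctly.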

3.1 ARCHITECTURE DIAGRAM:

3.2 DATAFLOW DIAGRAM:

UML DIAGRAMS:

3.2 USE CASE DIAGRAM:

3.3 CLASS DIAGRAM:

3.4 SEQUENCE DIAGRAM:

3.5 ACTIVITY DIAGRAM:

[Activity diagram. Node labels: New Tweet, Filter Tweet, Asthma Tweets, Alert Email, Friend Follow, Friends list]

CHAPTER 4

4.0 IMPLEMENTATION:

DISEASE CONTROL AND PREVENTION (CDC):

Current national asthma disease surveillance programs rely on weekly reports to the Centers for Disease Control and Prevention (CDC) of data collected from various local resources by state health departments [4]. Notoriously, such data have a lag-time of weeks, therefore providing retrospective information that is not amenable to proactive and timely preventive interventions. At the individual level, known predictors of asthma ED visits and hospitalizations include past acute care utilization, medication use, and sociodemographic characteristics. Common data sources for these variables include electronic medical records (EMR), medical insurance claims data, and population surveys, all of which are also subject to significant time lag. In an ongoing quality improvement project for asthma care, Parkland Center for Clinical Innovation (PCCI) researchers have built an asthma predictive model relying on a combination of EMR and claims data to predict the risk for asthma-related ED visits within three months of data collection.

Although the model performance (C-statistic 72%) and prediction timeframe (three months) are satisfying, a narrower prediction timeframe potentially could provide additional risk-stratification for more efficiency and timeliness in resource deployment. For instance, resources might be prioritized to first serve patients at high risk for an asthma ED visit within 2 weeks of data collection, while being safely deferred for patients with a later predicted high-risk period. Novel sources of timely data on population- and individual-level asthma activities are needed to provide additional temporal and geographical granularity to asthma risk stratification. Short of collecting information directly from individual patients (a time- and resource-intensive endeavor), readily available public data will have to be repurposed intelligently to provide the required information.

4.1 ALGORITHM:

MACHINE LEARNING ALGORITHMS:

Our research objective is to leverage social media, internet search, and environmental air quality data to estimate ED visits for asthma in a relatively discrete geographic area (a metropolitan area) within a relatively short time period (days). To this end, we have gathered asthma related ED visits data, social media data from Twitter, internet users’ search interests from Google and pollution sensor data from the EPA, all from the same geographic area and time period, to create a model for predicting asthma related ED visits. This work is different from extant studies that typically predict the spread of contagious diseases using social media such as Twitter. Unlike influenza or other viral diseases, asthma is a non-communicable health condition and we demonstrate the utility and value of linking big data from diverse sources in developing predictive models for non-communicable diseases with a specific focus on asthma.

4.2 MODULES:

EMERGENCY DEPARTMENT VISITS:

ENVIRONMENTAL SENSORS (EMR):

OUR PREDICTION SENSOR DATA:

ASTHMA PREDICTION RESULTS:

4.3 MODULE DESCRIPTION:

EMERGENCY DEPARTMENT VISITS:

We introduce a novel method of using multiple data sources for predicting the number of asthma-related emergency department (ED) visits in a specific area. Twitter data, Google search interests and environmental sensor data were collected for this purpose. Moreover, asthma is a leading cause of lost productivity, with nearly 11 million missed school days and more than 14 million missed work days every year. Although asthma cannot be cured, many of its adverse events can be prevented by appropriate medication use and avoidance of environmental triggers.

The prediction of population- and individual-level risk for asthma adverse events using accurate and timely surveillance data could guide timely and targeted interventions, to reduce the societal burden of asthma. At the population level, current national asthma disease surveillance programs rely on weekly reports to the Centers for Disease Control and Prevention (CDC) of data collected from various local resources by state health departments. Notoriously, such data have a lag-time of weeks, therefore providing retrospective information that is not amenable to proactive and timely preventive interventions. At the individual level, known predictors of asthma ED visits and hospitalizations include past acute care utilization, medication use, and sociodemographic characteristics.

ENVIRONMENTAL SENSORS (EMR):

Common data sources for these variables include electronic medical records (EMR), medical insurance claims data, and population surveys, all of which, also, are subject to significant time lag. In an ongoing quality improvement project for asthma care, Parkland Center for Clinical Innovation (PCCI) researchers have built an asthma predictive model relying on a combination of EMR and claim data to predict the risk for asthma-related ED visits within three months of data collection [Unpublished reports from PCCI]. Although the model performance (C-statistic 72%) and prediction timeframe (three months) are satisfying, a narrower prediction timeframe potentially could provide additional risk-stratification for more efficiency and timeliness in resource deployment.

For instance, resources might be prioritized to first serve patients at high risk for an asthma ED visit within 2 weeks of data collection, while being safely deferred for patients with a later predicted high-risk period. Novel sources of timely data on population- and individual-level asthma activities are needed to provide additional temporal and geographical granularity to asthma risk stratification. Short of collecting information directly from individual patients (a time- and resource-intensive endeavor), readily available public data will have to be repurposed intelligently to provide the required information. There has been increasing interest in gathering non-traditional, digital information to perform disease surveillance. These include diverse datasets such as those stemming from social media, internet search, and environmental data. Twitter is an online social media platform that enables users to post and read 140-character messages called “tweets”. It is a popular data source for disease surveillance using social media since it can provide nearly instant access to real-time social opinions. More importantly, tweets are often tagged by geographic location and time stamps potentially providing information for disease surveillance.

OUR PREDICTION SENSOR DATA:

We first analyzed the association between the asthma-related ED visits and data from Twitter, Google trends, and Air Quality sensors, using the Pearson correlation coefficient. We also examined the association between asthma-related tweet counts and ED visit counts for abdominal pain/constipation patients, to control for non-asthma-specific variations in ED visit counts. Then, we designed and implemented a prediction model to estimate the incidence of asthma ED visits at CMC using a combination of independent variables from the above data sources.
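
The association measure named above can be sketched as follows. This is a minimal illustration of the standard sample Pearson correlation coefficient, not the study's actual code; the class name `PearsonSketch` and the daily counts in `main` are made up for the example.

```java
/** Minimal sketch of the Pearson correlation coefficient between two series. */
class PearsonSketch {
    /**
     * Returns r = cov(x, y) / (sd(x) * sd(y)) for two equally long series.
     * The result is NaN if either series is constant.
     */
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("series must have equal length >= 2");
        }
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double cov = 0.0;
        double varX = 0.0;
        double varY = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        // The common 1/(n-1) factors cancel, so raw sums are sufficient.
        return cov / Math.sqrt(varX * varY);
    }

    public static void main(String[] args) {
        // Hypothetical daily counts, for illustration only.
        double[] asthmaTweets = {12, 15, 9, 20, 18};
        double[] edVisits = {30, 34, 25, 41, 37};
        System.out.println("r = " + pearson(asthmaTweets, edVisits));
    }
}
```

The same function can be applied to each candidate signal (tweet counts, search interest, each pollutant) against the ED visit counts to decide which inputs to keep.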

Twitter offers streaming APIs to give developers and researchers low latency access to its global stream of data. Public streams, which can provide access to the public data flowing through Twitter, were used in this study. Studies have estimated that using Twitter’s Streaming API, researchers can expect to receive 1% of the tweets in near real-time. Twitter4j, an unofficial Java library for the Twitter API, was used to access tweet information from the Twitter Streaming API.

Two different Twitter data sets were collected in this study:

(1) The general twitter stream: a simple collection of JSON grabbed from the general Twitter stream. The general tweet counts were used to estimate the Twitter population and further normalize asthma tweet counts.

(2) The asthma-related stream: to collect only tweets containing any of 19 related keywords that were suggested by our clinical collaborators from PCCI. The asthma stream is limited to 1% of full tweets as well.
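
The two data sets can be related as in the sketch below. It rests on stated assumptions: the three keywords are placeholders (the study's 19 keywords are not listed in the text), and "per 1,000 general tweets" is one plausible normalization, not necessarily the one the authors used. The class `AsthmaTweetSketch` is hypothetical and does not call the Twitter API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical keyword filter and count normalization for collected tweets. */
class AsthmaTweetSketch {
    // Placeholder keywords; the real keyword list is not given in the text.
    static final List<String> KEYWORDS = Arrays.asList("asthma", "inhaler", "wheezing");

    /** True if the tweet text contains any keyword, ignoring case. */
    static boolean isAsthmaRelated(String text) {
        String lower = text.toLowerCase(Locale.ROOT);
        for (String keyword : KEYWORDS) {
            if (lower.contains(keyword)) {
                return true;
            }
        }
        return false;
    }

    /** Asthma tweets per 1,000 general tweets collected in the same period (assumed scale). */
    static double normalizedCount(long asthmaTweets, long generalTweets) {
        if (generalTweets <= 0) {
            throw new IllegalArgumentException("generalTweets must be positive");
        }
        return 1000.0 * asthmaTweets / generalTweets;
    }

    public static void main(String[] args) {
        System.out.println(isAsthmaRelated("Forgot my INHALER again"));
        System.out.println(normalizedCount(5, 1000));
    }
}
```

Dividing by the general stream volume keeps a day with heavy overall Twitter traffic from looking like a day with unusually many asthma tweets.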

ASTHMA PREDICTION RESULTS:

Based on the results of the correlation analysis, asthma tweets, CO, NO2 and PM2.5 were selected as inputs into our prediction model. We are only reporting results for the Decision Tree and Artificial Neural Network (ANN) techniques, as the Naive Bayes and SVM techniques did not yield good prediction results. First, a backward feature selection algorithm was used to examine whether the addition of Twitter data would improve the prediction. As shown in Table VI, combining air quality data with tweets resulted in higher prediction accuracy. Additionally, we evaluated prediction precision. Given that our prediction task is a three-way classification, each technique resulted in different prediction and/or precision in different classes (Table VII). Decision Tree performed well in predicting the “High” class, while ANN, after Adaptive Boosting, worked well for the “Low” class. Stacking the two techniques performed well for the “Medium” class.
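
Per-class precision and overall accuracy for a three-way classification can be computed from a confusion matrix, as in the sketch below. The class `PrecisionSketch` is hypothetical and the matrix in `main` contains made-up counts, not the values from Table VI or Table VII.

```java
/** Minimal sketch: accuracy and per-class precision from a confusion matrix. */
class PrecisionSketch {
    static final String[] CLASSES = {"Low", "Medium", "High"};

    /** confusion[actual][predicted]; precision = correct predictions of c / all predictions of c. */
    static double precision(int[][] confusion, int c) {
        int predictedAsC = 0;
        for (int actual = 0; actual < confusion.length; actual++) {
            predictedAsC += confusion[actual][c];
        }
        return predictedAsC == 0 ? Double.NaN : (double) confusion[c][c] / predictedAsC;
    }

    /** Fraction of all samples on the diagonal of the matrix. */
    static double accuracy(int[][] confusion) {
        int correct = 0;
        int total = 0;
        for (int actual = 0; actual < confusion.length; actual++) {
            for (int predicted = 0; predicted < confusion[actual].length; predicted++) {
                total += confusion[actual][predicted];
                if (actual == predicted) {
                    correct += confusion[actual][predicted];
                }
            }
        }
        return (double) correct / total;
    }

    public static void main(String[] args) {
        // Made-up counts for illustration only.
        int[][] confusion = {
            {8, 1, 1},
            {2, 6, 2},
            {0, 1, 9}
        };
        System.out.println("accuracy = " + accuracy(confusion));
        for (int c = 0; c < CLASSES.length; c++) {
            System.out.println(CLASSES[c] + " precision = " + precision(confusion, c));
        }
    }
}
```

Reporting precision per class, as the text does, makes it visible when one technique is strong on "High" days and another on "Low" days even if their overall accuracies are similar.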

The results of our analysis are promising because they perform with a fairly high level of accuracy overall. As noted in the introduction, traditional asthma ED visit models are useful for predicting events in a three month window and have an accuracy of approximately 70%. It is to be noted that “traditional models” estimate a risk score for asthma ED visit for each individual patient, whereas our “Twitter/ Environmental data model” predicts the risk for a daily number of ED visits being High, Low, or Medium. The former is an individual-level risk model, while the latter is a population-level risk model. Our population-level asthma risk prediction model has the potential for complementing current individual-level models, and may lead to a shorter time window and better accuracy of prediction. This in turn has implications for better planning and proactive treatment in specific geo-locations at specific time periods.

CHAPTER 5

5.0 SYSTEM STUDY:

5.1 FEASIBILITY STUDY:

Three key considerations involved in the feasibility analysis are

ECONOMICAL FEASIBILITY
TECHNICAL FEASIBILITY
SOCIAL FEASIBILITY

5.1.1 ECONOMICAL FEASIBILITY:

5.1.2 TECHNICAL FEASIBILITY

5.1.3 SOCIAL FEASIBILITY:

5.2 SYSTEM TESTING:

5.2.1 UNIT TESTING:

Description	Expected result
Test for application window properties.	All the properties of the windows are to be properly aligned and displayed.
Test for mouse operations.	All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions.

5.1.2 FUNCTIONAL TESTING:

Description	Expected result
Test for all modules.	All peers should communicate in the group.
Test for various peers in a distributed network framework, as it displays all users available in the group.	The result after execution should be accurate.

5.1.3 NON-FUNCTIONAL TESTING:

Load testing
Performance testing
Usability testing
Reliability testing
Security testing

5.1.4 LOAD TESTING:

Description	Expected result
It is necessary to ascertain that the application behaves correctly under loads when ‘Server busy’ response is received.	Should designate another active node as a Server.

5.1.5 PERFORMANCE TESTING:

Description	Expected result
This is required to assure that an application performs adequately, having the capability to handle many peers, delivering its results in the expected time and using an acceptable level of resources; it is an aspect of operational management.	Should handle large input values and produce accurate results in the expected time.

5.1.6 RELIABILITY TESTING:

Description	Expected result
This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in providing the application.	In case of failure of the server, an alternate server should take over the job.

5.1.7 SECURITY TESTING:

Description	Expected result
Checking that the user identification is authenticated.	In case of failure, the user should not be connected to the framework.
Check whether group keys in a tree are shared by all peers.	The peers should know group key in the same group.

5.1.8 WHITE BOX TESTING:

Description	Expected result
Exercise all logical decisions on their true and false sides.	All the logical decisions must be valid.
Execute all loops at their boundaries and within their operational bounds.	All the loops must be finite.
Exercise internal data structures to ensure their validity.	All the data structures must be valid.

5.1.9 BLACK BOX TESTING:

Description	Expected result
To check for incorrect or missing functions.	All the functions must be valid.
To check for interface errors.	The entire interface must function normally.
To check for errors in data structures or external database access.	Database updates and retrieval must be performed.
To check for initialization and termination errors.	All the functions and data structures must be initialized properly and terminated normally.

All of the above system testing strategies are carried out because the development, documentation and institutionalization of the proposed goals and related policies are essential.

CHAPTER 6

6.0 SOFTWARE DESCRIPTION:

6.1 JAVA TECHNOLOGY:

Java technology is both a programming language and a platform.

The Java Programming Language

The Java programming language is a high-level language that can be characterized by all of the following buzzwords:

- Simple
- Architecture neutral
- Object oriented
- Portable
- Distributed
- High performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure

6.2 THE JAVA PLATFORM:

The Java platform has two components:

The Java Virtual Machine (Java VM)
The Java Application Programming Interface (Java API)

You’ve already been introduced to the Java VM. It’s the base for the Java platform and is ported onto various hardware-based platforms.

The following figure depicts a program that’s running on the Java platform. As the figure shows, the Java API and the virtual machine insulate the program from the hardware.

6.3 WHAT CAN JAVA TECHNOLOGY DO?

The essentials: Objects, strings, threads, numbers, input and output, data structures, system properties, date and time, and so on.
Applets: The set of conventions used by applets.
Networking: URLs, TCP (Transmission Control Protocol), UDP (User Datagram Protocol) sockets, and IP (Internet Protocol) addresses.
Internationalization: Help for writing programs that can be localized for users worldwide. Programs can automatically adapt to specific locales and be displayed in the appropriate language.
Security: Both low level and high level, including electronic signatures, public and private key management, access control, and certificates.
Software components: Known as JavaBeans^TM, can plug into existing component architectures.
Object serialization: Allows lightweight persistence and communication via Remote Method Invocation (RMI).
Java Database Connectivity (JDBC^TM): Provides uniform access to a wide range of relational databases.

The Java platform also has APIs for 2D and 3D graphics, accessibility, servers, collaboration, telephony, speech, animation, and more. The following figure depicts what is included in the Java 2 SDK.

6.4 HOW WILL JAVA TECHNOLOGY CHANGE MY LIFE?

Get started quickly: Although the Java programming language is a powerful object-oriented language, it’s easy to learn, especially for programmers already familiar with C or C++.
Write less code: Comparisons of program metrics (class counts, method counts, and so on) suggest that a program written in the Java programming language can be four times smaller than the same program in C++.
Write better code: The Java programming language encourages good coding practices, and its garbage collection helps you avoid memory leaks. Its object orientation, its JavaBeans component architecture, and its wide-ranging, easily extendible API let you reuse other people’s tested code and introduce fewer bugs.
Develop programs more quickly: Your development time may be as much as twice as fast versus writing the same program in C++. Why? You write fewer lines of code and it is a simpler programming language than C++.
Avoid platform dependencies with 100% Pure Java: You can keep your program portable by avoiding the use of libraries written in other languages. The 100% Pure Java^TM Product Certification Program has a repository of historical process manuals, white papers, brochures, and similar materials online.
Write once, run anywhere: Because 100% Pure Java programs are compiled into machine-independent byte codes, they run consistently on any Java platform.
Distribute software more easily: You can upgrade applets easily from a central server. Applets take advantage of the feature of allowing new classes to be loaded “on the fly,” without recompiling the entire program.

6.5 ODBC:

6.6 JDBC:

JDBC was announced in March of 1996. It was released for a 90 day public review that ended June 8, 1996. Because of user input, the final JDBC v1.0 specification was released soon after.

6.7 JDBC Goals:

The goals that were set for JDBC are important. They will give you some insight as to why certain classes and functionalities behave the way they do. The eight design goals for JDBC are as follows:

SQL Level API

SQL Conformance

JDBC must be implementable on top of common database interfaces

Provide a Java interface that is consistent with the rest of the Java system

Because of Java’s acceptance in the user community thus far, the designers feel that they should not stray from the current design of the core Java system.

Keep it simple

Use strong, static typing wherever possible

Strong typing allows for more error checking to be done at compile time; also, fewer errors appear at runtime.

Keep the common cases simple

Finally, we decided to proceed with the implementation using Java Networking.

For dynamically updating the cache table, we use an MS Access database.

Java is two things: a programming language and a platform.

Java is a high-level programming language that is all of the following:

- Simple
- Architecture-neutral
- Object-oriented
- Portable
- Distributed
- High-performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure

Compilation happens just once; interpretation occurs each time the program is executed. The figure illustrates how this works.

6.7 NETWORKING TCP/IP STACK:

The TCP/IP stack is shorter than the OSI one:

TCP is a connection-oriented protocol; UDP (User Datagram Protocol) is a connectionless protocol.

IP datagrams:

UDP:

UDP is also connectionless and unreliable. What it adds to IP is a checksum for the contents of the datagram and port numbers. These are used to give a client/server model – see later.

TCP:

TCP supplies logic to give a reliable connection-oriented protocol above IP. It provides a virtual circuit that two processes can use to communicate.

Internet addresses

In order to use a service, you must be able to find it. The Internet uses an address scheme for machines so that they can be located. The address is a 32 bit integer which gives the IP address.

Network address:

Class A uses 8 bits for the network address with 24 bits left over for other addressing. Class B uses 16 bit network addressing. Class C uses 24 bit network addressing and class D uses all 32.

Subnet address:

Internally, the UNIX network is divided into sub networks. Building 11 is currently on one sub network and uses 10-bit addressing, allowing 1024 different hosts.

Host address:

8 bits are finally used for host addresses within our subnet. This places a limit of 256 machines that can be on the subnet.

Total address:

The 32 bit address is usually written as 4 integers separated by dots.
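
The relationship between the dotted notation and the underlying 32-bit value can be sketched as follows. The class `Ipv4Sketch` and its method names are hypothetical; the class ranges used in `addressClass` are the standard classful ranges by first octet (A: 0–127, B: 128–191, C: 192–223, D: 224–239, E: 240–255).

```java
/** Minimal sketch: dotted-quad IPv4 text to a 32-bit value and back. */
class Ipv4Sketch {
    /** Packs four octets into the low 32 bits of a long (a long avoids sign issues). */
    static long toLong(String dotted) {
        String[] parts = dotted.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("expected four octets: " + dotted);
        }
        long value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("octet out of range: " + part);
            }
            value = (value << 8) | octet;
        }
        return value;
    }

    /** Unpacks a 32-bit value into the usual four integers separated by dots. */
    static String toDotted(long value) {
        return ((value >> 24) & 0xFF) + "." + ((value >> 16) & 0xFF) + "."
                + ((value >> 8) & 0xFF) + "." + (value & 0xFF);
    }

    /** Classful address class, decided by the first octet. */
    static char addressClass(long value) {
        int first = (int) ((value >> 24) & 0xFF);
        if (first < 128) return 'A';
        if (first < 192) return 'B';
        if (first < 224) return 'C';
        if (first < 240) return 'D';
        return 'E';
    }

    public static void main(String[] args) {
        long packed = toLong("192.168.1.10");
        System.out.println(packed + " -> " + toDotted(packed)
                + " (class " + addressClass(packed) + ")");
    }
}
```

In application code the standard `java.net.InetAddress` class would normally be used instead; the manual version is shown only to make the bit layout visible.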

Port addresses

Sockets:

#include <sys/types.h>

#include <sys/socket.h>

int socket(int family, int type, int protocol);
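
The prototype above is the C call that creates a socket. Since the project itself is written in Java, the following is a minimal sketch of the same idea using the standard `java.net.ServerSocket` and `java.net.Socket` classes. The class `LoopbackTcpSketch` and its `roundTrip` method are hypothetical, and the example assumes the loopback interface is available.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

/** Minimal sketch of a TCP virtual circuit between two endpoints on one machine. */
class LoopbackTcpSketch {
    /** Sends one line from a client socket and returns what the server side reads. */
    static String roundTrip(String message) {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // Port 0 asks the operating system for any free port.
        try (ServerSocket server = new ServerSocket(0, 1, loopback);
             Socket client = new Socket(loopback, server.getLocalPort());
             Socket accepted = server.accept()) {
            PrintWriter out = new PrintWriter(
                    new OutputStreamWriter(client.getOutputStream(), StandardCharsets.UTF_8), true);
            out.println(message); // auto-flush pushes the line onto the connection
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(accepted.getInputStream(), StandardCharsets.UTF_8));
            return in.readLine();
        } catch (IOException e) {
            throw new IllegalStateException("loopback connection failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("server received: " + roundTrip("hello"));
    }
}
```

The port number plays the role described under port addresses: the IP address locates the machine, and the port selects the service on it. A real server would accept connections in a loop, usually on separate threads, rather than handling a single client as here.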

6.8 JFREE CHART:

JFreeChart is a free 100% Java chart library that makes it easy for developers to display professional quality charts in their applications. JFreeChart’s extensive feature set includes:

A consistent and well-documented API, supporting a wide range of chart types;

A flexible design that is easy to extend, and targets both server-side and client-side applications;

Support for many output types, including Swing components, image files (including PNG and JPEG), and vector graphics file formats (including PDF, EPS and SVG);

JFreeChart is “open source” or, more specifically, free software. It is distributed under the terms of the GNU Lesser General Public Licence (LGPL), which permits use in proprietary applications.

6.8.1. Map Visualizations:

6.8.2. Time Series Chart Interactivity

6.8.3. Dashboards

6.8.4. Property Editors

CHAPTER 7

7.0 APPENDIX

7.1 SAMPLE SCREEN SHOTS:

7.2 SAMPLE SOURCE CODE:

CHAPTER 8

8.1 CONCLUSION AND FUTURE:

In this study, we have provided preliminary evidence that social media and environmental data can be leveraged to accurately predict asthma ED visits at a population level. We are in the process of confirming these preliminary findings by collecting larger clinical datasets across different seasons and multiple hospitals. Our continued work is focused on extending this research to propose a temporal prediction model that analyzes the trends in tweets and air quality index changes, and estimates the time lag between these changes and the number of asthma ED visits.

We also are collecting air quality index data over a longer time period to examine the effects of seasonal variations. In addition, we would like to explore the effect of relevant data from other types of social media interactions, e.g. blogs and discussion forums, on our asthma visit prediction model. Additional studies are needed to examine how combining real-time or near-real-time social media and environmental data with more traditional data might affect the performance and timing of current individual-level prediction models for asthma, and eventually, for other chronic conditions. In future projects, we intend to extend our work to diseases with geographical and temporal variability, e.g., COPD and diabetes.

Performing Initiative Data Prefetching in Distributed File Systems for Cloud Computing

05/08/201902/07/2019 by admin

1.2 INTRODUCTION

1.3 LITERATURE SURVEY

PARTIAL REPLICATION OF METADATA TO ACHIEVE HIGH METADATA AVAILABILITY IN PARALLEL FILE SYSTEMS

AUTHOR: J. Liao, Y. Ishikawa

PUBLISH: In the Proceedings of 41st International Conference on Parallel Processing (ICPP ’12), pp. 168–177, 2012.

EXPLANATION:

EVALUATING PERFORMANCE AND ENERGY IN FILE SYSTEM SERVER WORKLOADS

AUTHOR: P. Sehgal, V. Tarasov, E. Zadok

PUBLISH: In Proceedings of the 8th USENIX Conference on File and Storage Technologies (FAST ’10), pp. 253–266, 2010.

EXPLANATION:

FLEXIBLE, WIDE-AREA STORAGE FOR DISTRIBUTED SYSTEMS WITH WHEELFS

AUTHOR: J. Stribling, Y. Sovran, I. Zhang and R. Morris et al

PUBLISH: In Proceedings of the 6th USENIX symposium on Networked systems design and implementation (NSDI’09), USENIX Association, pp. 43–58, 2009.

EXPLANATION:

CHAPTER 2

2.0 SYSTEM ANALYSIS

2.1 EXISTING SYSTEM:

2.1.1 DISADVANTAGES:

Network delay in numerically and geographically remote file system access

Mobile devices generally have limited processing power, battery life and storage

2.2 PROPOSED SYSTEM:

This paper makes the following two contributions:

2.2.1 ADVANTAGES:

Applications that have intensive read workloads can automatically make better use of the available bandwidth.

Fewer file operations, via batched I/O requests through prefetching

Cloud computing offers an illusion of infinite computing resources

2.3 HARDWARE & SOFTWARE REQUIREMENTS:

2.3.1 HARDWARE REQUIREMENT:

- Processor – Pentium IV
- Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Key Board – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA

2.3.2 SOFTWARE REQUIREMENTS:

JAVA

Operating System : Windows XP or Win7
Front End : JAVA JDK 1.7
Script : Java Script
Document : MS-Office 2007

CHAPTER 3

3.0 SYSTEM DESIGN:

Data Flow Diagram / Use Case Diagram / Flow Diagram:

The DFD is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on these data, and the output data generated by the system.

The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components. These components are the system process, the data used by the process, an external entity that interacts with the system and the information flows in the system.

DFD shows how the information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.

A DFD may be used to represent a system at any level of abstraction. A DFD may be partitioned into levels that represent increasing information flow and functional detail.

NOTATION:

SOURCE OR DESTINATION OF DATA:

External sources or destinations, which may be people or organizations or other entities

DATA SOURCE:

Here the data referenced by a process is stored and retrieved.

PROCESS:

People, procedures or devices that produce data. The physical component is not identified.

DATA FLOW:

Data moves in a specific direction from an origin to a destination. The data flow is a “packet” of data.

MODELING RULES:

There are several common modeling rules when creating DFDs:

All processes must have at least one data flow in and one data flow out.
All processes should modify the incoming data, producing new forms of outgoing data.
Each data store must be involved with at least one data flow.
Each external entity must be involved with at least one data flow.
A data flow must be attached to at least one process.

3.1 ARCHITECTURE DIAGRAM

3.2 DATAFLOW DIAGRAM

UML DIAGRAMS:

3.2 USE CASE DIAGRAM:

3.3 CLASS DIAGRAM:

3.4 SEQUENCE DIAGRAM:

3.5 ACTIVITY DIAGRAM:

CHAPTER 4

4.0 IMPLEMENTATION:

I/O ACCESS PREDICTION

4.1 ALGORITHM

MARKOV MODEL PREDICTION ALGORITHM

LINEAR PREDICTION ALGORITHM

4.2 MODULES:

SERVER CLIENT MODULE:

DISTRIBUTED FILE SYSTEMS:

INITIATIVE DATA PREFETCHING:

ANALYSIS OF PREDICTIONS:

4.3 MODULE DESCRIPTION:

SERVER CLIENT MODULE:

DISTRIBUTED FILE SYSTEMS:

INITIATIVE DATA PREFETCHING:

ANALYSIS OF PREDICTIONS:

CHAPTER 5

5.0 SYSTEM STUDY:

5.1 FEASIBILITY STUDY:

Three key considerations involved in the feasibility analysis are

ECONOMICAL FEASIBILITY
TECHNICAL FEASIBILITY
SOCIAL FEASIBILITY

5.1.1 ECONOMICAL FEASIBILITY:

5.1.2 TECHNICAL FEASIBILITY

5.1.3 SOCIAL FEASIBILITY:

5.2 SYSTEM TESTING:

5.2.1 UNIT TESTING:

Description	Expected result
Test for application window properties.	All the properties of the windows are to be properly aligned and displayed.
Test for mouse operations.	All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions.

5.1.2 FUNCTIONAL TESTING:

Description	Expected result
Test for all modules.	All peers should communicate in the group.
Test for various peer in a distributed network framework as it display all users available in the group.	The result after execution should give the accurate result.

5.2.3 NON-FUNCTIONAL TESTING:

Load testing
Performance testing
Usability testing
Reliability testing
Security testing

5.2.4 LOAD TESTING:

Description	Expected result
It is necessary to ascertain that the application behaves correctly under loads when ‘Server busy’ response is received.	Should designate another active node as a Server.

5.2.5 PERFORMANCE TESTING:

Description	Expected result
This is required to assure that the application performs adequately: it must handle many peers, deliver its results in the expected time, and use an acceptable level of resources. It is an aspect of operational management.	Should handle large input values and produce accurate results in the expected time.

5.2.6 RELIABILITY TESTING:

Description	Expected result
This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in providing the application.	In case of failure of the server, an alternate server should take over the job.

5.2.7 SECURITY TESTING:

Description	Expected result
Checking that the user identification is authenticated.	In case of failure, the user should not be connected to the framework.
Check whether group keys in a tree are shared by all peers.	All peers in the same group should know the group key.

5.2.8 WHITE BOX TESTING:

Description	Expected result
Exercise all logical decisions on their true and false sides.	All the logical decisions must be valid.
Execute all loops at their boundaries and within their operational bounds.	All the loops must be finite.
Exercise internal data structures to ensure their validity.	All the data structures must be valid.

5.2.9 BLACK BOX TESTING:

Description	Expected result
To check for incorrect or missing functions.	All the functions must be valid.
To check for interface errors.	The entire interface must function normally.
To check for errors in data structures or external database access.	Database updates and retrievals must complete successfully.
To check for initialization and termination errors.	All the functions and data structures must be initialized properly and terminated normally.

All the above system testing strategies are carried out because the development, documentation, and institutionalization of the proposed goals and related policies are essential.

CHAPTER 6

6.0 SOFTWARE DESCRIPTION:

6.1 JAVA TECHNOLOGY:

Java technology is both a programming language and a platform.

The Java Programming Language

The Java programming language is a high-level language that can be characterized by all of the following buzzwords:

- Simple
- Architecture neutral
- Object oriented
- Portable
- Distributed
- High performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure

6.2 THE JAVA PLATFORM:

The Java platform has two components:

The Java Virtual Machine (Java VM)
The Java Application Programming Interface (Java API)

You’ve already been introduced to the Java VM. It’s the base for the Java platform and is ported onto various hardware-based platforms.

The following figure depicts a program that’s running on the Java platform. As the figure shows, the Java API and the virtual machine insulate the program from the hardware.

6.3 WHAT CAN JAVA TECHNOLOGY DO?

The essentials: Objects, strings, threads, numbers, input and output, data structures, system properties, date and time, and so on.
Applets: The set of conventions used by applets.
Networking: URLs, TCP (Transmission Control Protocol) and UDP (User Datagram Protocol) sockets, and IP (Internet Protocol) addresses.
Internationalization: Help for writing programs that can be localized for users worldwide. Programs can automatically adapt to specific locales and be displayed in the appropriate language.
Security: Both low level and high level, including electronic signatures, public and private key management, access control, and certificates.
Software components: Known as JavaBeans^TM, can plug into existing component architectures.
Object serialization: Allows lightweight persistence and communication via Remote Method Invocation (RMI).
Java Database Connectivity (JDBC^TM): Provides uniform access to a wide range of relational databases.

The Java platform also has APIs for 2D and 3D graphics, accessibility, servers, collaboration, telephony, speech, animation, and more. The following figure depicts what is included in the Java 2 SDK.

6.4 HOW WILL JAVA TECHNOLOGY CHANGE MY LIFE?

Get started quickly: Although the Java programming language is a powerful object-oriented language, it’s easy to learn, especially for programmers already familiar with C or C++.
Write less code: Comparisons of program metrics (class counts, method counts, and so on) suggest that a program written in the Java programming language can be four times smaller than the same program in C++.
Write better code: The Java programming language encourages good coding practices, and its garbage collection helps you avoid memory leaks. Its object orientation, its JavaBeans component architecture, and its wide-ranging, easily extendible API let you reuse other people’s tested code and introduce fewer bugs.
Develop programs more quickly: Your development time may be as much as twice as fast versus writing the same program in C++. Why? You write fewer lines of code and it is a simpler programming language than C++.
Avoid platform dependencies with 100% Pure Java: You can keep your program portable by avoiding the use of libraries written in other languages. The 100% Pure Java^TM Product Certification Program has a repository of historical process manuals, white papers, brochures, and similar materials online.
Write once, run anywhere: Because 100% Pure Java programs are compiled into machine-independent byte codes, they run consistently on any Java platform.
Distribute software more easily: You can upgrade applets easily from a central server. Applets take advantage of the feature of allowing new classes to be loaded “on the fly,” without recompiling the entire program.

6.5 ODBC:

6.6 JDBC:

JDBC was announced in March of 1996. It was released for a 90 day public review that ended June 8, 1996. Because of user input, the final JDBC v1.0 specification was released soon after.

6.7 JDBC Goals:

The goals that were set for JDBC are important. They will give you some insight as to why certain classes and functionalities behave the way they do. The eight design goals for JDBC are as follows:

SQL Level API

SQL Conformance

JDBC must be implementable on top of common database interfaces

Provide a Java interface that is consistent with the rest of the Java system

Because of Java’s acceptance in the user community thus far, the designers feel that they should not stray from the current design of the core Java system.

Keep it simple

Use strong, static typing wherever possible

Strong typing allows more error checking to be done at compile time, so fewer errors appear at runtime.

Keep the common cases simple

Finally, we decided to proceed with the implementation using Java Networking.

For dynamically updating the cache table, we use an MS Access database.

Java is two things: a programming language and a platform.

Java is a high-level programming language that is all of the following:

- Simple
- Architecture-neutral
- Object-oriented
- Portable
- Distributed
- High-performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure

Compilation happens just once; interpretation occurs each time the program is executed. The figure illustrates how this works.

6.7 NETWORKING TCP/IP STACK:

The TCP/IP stack is shorter than the OSI one:

TCP is a connection-oriented protocol; UDP (User Datagram Protocol) is a connectionless protocol.

IP datagrams:

UDP:

UDP is also connectionless and unreliable. What it adds to IP is a checksum for the contents of the datagram and port numbers. These are used to give a client/server model – see later.

TCP:

TCP supplies logic to give a reliable connection-oriented protocol above IP. It provides a virtual circuit that two processes can use to communicate.

Internet addresses

In order to use a service, you must be able to find it. The Internet uses an address scheme for machines so that they can be located. The address is a 32 bit integer which gives the IP address.

Network address:

Class A uses 8 bits for the network address, with 24 bits left over for other addressing. Class B uses 16-bit network addressing and class C uses 24-bit network addressing. Class D addresses are reserved for multicast groups and are not split into network and host parts.

Subnet address:

Internally, the UNIX network is divided into sub networks. Building 11 is currently on one sub network and uses 10-bit addressing, allowing 1024 different hosts.

Host address:

8 bits are finally used for host addresses within our subnet. This places a limit of 256 machines that can be on the subnet.

Total address:

The 32 bit address is usually written as 4 integers separated by dots.
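The relationship between the dotted form and the 32-bit integer can be sketched in Java as follows. This is a minimal illustration, not part of the project code: the class name `Ipv4Address` and its methods are hypothetical, and `hostPart` assumes the 8-bit host field described above.

```java
final class Ipv4Address {
    private Ipv4Address() {}

    /** Packs a dotted-quad string such as "192.168.1.10" into a 32-bit value (held in a long). */
    static long toInt(String dotted) {
        String[] parts = dotted.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("expected four octets: " + dotted);
        }
        long value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("octet out of range: " + part);
            }
            value = (value << 8) | octet; // shift earlier octets up, append this one
        }
        return value;
    }

    /** Unpacks a 32-bit value back into four integers separated by dots. */
    static String toDotted(long value) {
        return ((value >> 24) & 0xFF) + "." + ((value >> 16) & 0xFF) + "."
                + ((value >> 8) & 0xFF) + "." + (value & 0xFF);
    }

    /** Host number under the assumed 8-bit host field (the lowest octet). */
    static int hostPart(long value) {
        return (int) (value & 0xFF);
    }

    public static void main(String[] args) {
        long packed = toInt("192.168.1.10");
        System.out.println(packed + " <-> " + toDotted(packed) + ", host " + hostPart(packed));
    }
}
```

A `long` is used only because Java's `int` is signed; the value itself still fits in 32 bits.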

Port addresses

Sockets:

#include <sys/types.h>

#include <sys/socket.h>

int socket(int family, int type, int protocol);
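Since the project is implemented with Java Networking, the same idea can be sketched with the standard `java.net` classes, where `Socket` and `ServerSocket` play the role of a TCP (stream) socket created by the C call above. The class name `LoopbackTcpDemo` and the single-process loopback setup are assumptions made only to keep the example self-contained.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

final class LoopbackTcpDemo {
    private LoopbackTcpDemo() {}

    /** Sends one line over a TCP connection on the loopback interface and returns what the server side read. */
    static String sendAndReceive(String line) {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // Port 0 asks the operating system for any free port.
        try (ServerSocket server = new ServerSocket(0, 1, loopback);
             Socket client = new Socket(loopback, server.getLocalPort());
             Socket accepted = server.accept()) {
            // Client side: write a line; autoflush pushes it onto the virtual circuit.
            PrintWriter out = new PrintWriter(client.getOutputStream(), true);
            out.println(line);
            // Server side: read the line from the accepted connection.
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(accepted.getInputStream()));
            return in.readLine();
        } catch (IOException e) {
            throw new IllegalStateException("loopback connection failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("server received: " + sendAndReceive("hello"));
    }
}
```

In a real client/server deployment the two ends would run in separate processes or threads; here both ends live in one thread, which works because the message is small and the connection is queued before `accept()` is called.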

6.8 JFREE CHART:

JFreeChart is a free 100% Java chart library that makes it easy for developers to display professional quality charts in their applications. JFreeChart’s extensive feature set includes:

A consistent and well-documented API, supporting a wide range of chart types;

A flexible design that is easy to extend, and targets both server-side and client-side applications;

Support for many output types, including Swing components, image files (including PNG and JPEG), and vector graphics file formats (including PDF, EPS and SVG);

JFreeChart is “open source” or, more specifically, free software. It is distributed under the terms of the GNU Lesser General Public Licence (LGPL), which permits use in proprietary applications.

6.8.1. Map Visualizations:

6.8.2. Time Series Chart Interactivity

6.8.3. Dashboards

6.8.4. Property Editors

CHAPTER 7

7.0 APPENDIX

7.1 SAMPLE SCREEN SHOTS:

7.2 SAMPLE SOURCE CODE:

CHAPTER 8

8.1 CONCLUSION AND FUTURE WORK:

Passive IP Traceback Disclosing the Locations of IP Spoofers From Path Backscatter

05/08/201902/07/2019 by admin

It has long been known that attackers may use forged source IP addresses to conceal their real locations. To capture the spoofers, a number of IP traceback mechanisms have been proposed. However, due to the challenges of deployment, there is not yet a widely adopted IP traceback solution, at least at the Internet level. As a result, the mist on the locations of spoofers has never been dissipated.

This paper proposes passive IP traceback (PIT), which bypasses the deployment difficulties of IP traceback techniques. PIT investigates Internet Control Message Protocol error messages (named path backscatter) triggered by spoofing traffic, and tracks the spoofers based on publicly available information (e.g., topology). In this way, PIT can find the spoofers without any deployment requirement.

This paper illustrates the causes, collection, and statistical results on path backscatter, demonstrates the processes and effectiveness of PIT, and shows the captured locations of spoofers obtained by applying PIT to the path backscatter data set.

These results can help further reveal IP spoofing, which has been studied for a long time but never well understood. Though PIT cannot work in all spoofing attacks, it may be the most useful mechanism to trace spoofers before an Internet-level traceback system is actually deployed.

1.2 INTRODUCTION

IP spoofing, which means attackers launching attacks with forged source IP addresses, has long been recognized as a serious security problem on the Internet. By using addresses that are assigned to others or not assigned at all, attackers can avoid exposing their real locations, enhance the effect of an attack, or launch reflection-based attacks. A number of notorious attacks rely on IP spoofing, including SYN flooding, SMURF, and DNS amplification. A DNS amplification attack that severely degraded the service of a Top Level Domain (TLD) name server has been reported. Though there has been a popular conventional wisdom that DoS attacks are launched from botnets and spoofing is no longer critical, the report of ARBOR at the NANOG 50th meeting shows that spoofing is still significant in observed DoS attacks. Indeed, based on the backscatter messages captured by the UCSD Network Telescopes, spoofing activities are still frequently observed.

Capturing the origins of IP spoofing traffic is of great importance. As long as the real locations of spoofers are not disclosed, they cannot be deterred from launching further attacks. Even just approaching the spoofers, for example by determining the ASes or networks they reside in, lets attackers be located in a smaller area, and filters can be placed closer to the attacker before attacking traffic gets aggregated. Last but not least, identifying the origins of spoofing traffic can help build a reputation system for ASes, which would help push the corresponding ISPs to verify IP source addresses.

Instead of proposing another IP traceback mechanism with improved tracking capability, we propose a novel solution, named Passive IP Traceback (PIT), to bypass the challenges in deployment. Routers may fail to forward an IP spoofing packet due to various reasons, e.g., TTL exceeding. In such cases, the routers may generate an ICMP error message (named path backscatter) and send the message to the spoofed source address. Because the routers can be close to the spoofers, the path backscatter messages may potentially disclose the locations of the spoofers. PIT exploits these path backscatter messages to find the location of the spoofers. With the locations of the spoofers known, the victim can seek help from the corresponding ISP to filter out the attacking packets, or take other counterattacks. PIT is especially useful for the victims in reflection based spoofing attacks, e.g., DNS amplification attacks. The victims can find the locations of the spoofers directly from the attacking traffic.

In this article, we first illustrate the generation, types, collection, and security issues of path backscatter messages in Section III. Then, in Section IV, we present PIT, which tracks the location of the spoofers based on path backscatter messages together with the topology and routing information. We discuss how to apply PIT when both topology and routing are known, when only topology is known, and when neither is known. We also present two effective algorithms to apply PIT in large-scale networks. In the following section, we first show the statistical results on path backscatter messages. Then we evaluate the two key mechanisms of PIT which work without routing information. Finally, we give the tracking result of applying PIT to the path backscatter message dataset: a number of ASes in which spoofers are found.

Our work has the following contributions:

1) This is the first known article which deeply investigates path backscatter messages. These messages are valuable to help understand spoofing activities. Though Moore et al. [8] have exploited backscatter messages, which are generated by the targets of spoofing messages, to study Denial of Service (DoS), path backscatter messages, which are sent by intermediate devices rather than the targets, have not been used in traceback.

2) A practical and effective IP traceback solution based on path backscatter messages, i.e., PIT, is proposed. PIT bypasses the deployment difficulties of existing IP traceback mechanisms and is in fact already in force. Because path backscatter messages are not generated with stable probability, PIT cannot work in all attacks, but it does work in a number of spoofing activities. At the least, it may be the most useful traceback mechanism before an AS-level traceback system is actually deployed.

3) Through applying PIT to the path backscatter dataset, a number of locations of spoofers are captured and presented. Though this is not a complete list, it is the first known list disclosing the locations of spoofers.

1.3 LITERATURE SURVEY

DEFENSE AGAINST SPOOFED IP TRAFFIC USING HOP-COUNT FILTERING

PUBLICATION: IEEE/ACM Trans. Netw., vol. 15, no. 1, pp. 40–53, Feb. 2007.

AUTHORS: H. Wang, C. Jin, and K. G. Shin

EXPLANATION:

IP spoofing has often been exploited by Distributed Denial of Service (DDoS) attacks to: 1) conceal flooding sources and dilute localities in flooding traffic, and 2) coax legitimate hosts into becoming reflectors, redirecting and amplifying flooding traffic. Thus, the ability to filter spoofed IP packets near victim servers is essential to their own protection and to preventing them from becoming involuntary DoS reflectors. Although an attacker can forge any field in the IP header, he cannot falsify the number of hops an IP packet takes to reach its destination. More importantly, since the hop-count values are diverse, an attacker cannot randomly spoof IP addresses while maintaining consistent hop-counts. On the other hand, an Internet server can easily infer the hop-count information from the Time-to-Live (TTL) field of the IP header. Using a mapping between IP addresses and their hop-counts, the server can distinguish spoofed IP packets from legitimate ones. Based on this observation, we present a novel filtering technique, called Hop-Count Filtering (HCF), which builds an accurate IP-to-hop-count (IP2HC) mapping table to detect and discard spoofed IP packets. HCF is easy to deploy, as it does not require any support from the underlying network. Through analysis using network measurement data, we show that HCF can identify close to 90% of spoofed IP packets, and then discard them with little collateral damage. We implement and evaluate HCF in the Linux kernel, demonstrating its effectiveness with experimental measurements.
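The hop-count inference described above can be sketched as follows. This is an illustrative sketch, not the HCF authors' implementation: the class `HopCountFilter`, its method names, and the list of common initial TTL values are assumptions, and the rule of picking the smallest common initial TTL not below the observed TTL is one simple way to guess the sender's starting value.

```java
import java.util.HashMap;
import java.util.Map;

final class HopCountFilter {
    // Assumed set of initial TTL values used by common operating systems.
    private static final int[] INITIAL_TTLS = {30, 32, 60, 64, 128, 255};

    // IP-to-hop-count (IP2HC) table learned from traffic believed to be legitimate.
    private final Map<String, Integer> ip2hc = new HashMap<>();

    /** Infers the hop count as (guessed initial TTL) minus (TTL observed at the server). */
    static int inferHopCount(int observedTtl) {
        if (observedTtl < 1 || observedTtl > 255) {
            throw new IllegalArgumentException("TTL out of range: " + observedTtl);
        }
        for (int initial : INITIAL_TTLS) {
            if (initial >= observedTtl) {
                return initial - observedTtl;
            }
        }
        throw new AssertionError("unreachable: 255 bounds every valid TTL");
    }

    /** Records the hop count seen for a source address. */
    void learn(String sourceIp, int observedTtl) {
        ip2hc.put(sourceIp, inferHopCount(observedTtl));
    }

    /** Flags a packet whose inferred hop count disagrees with the stored one. */
    boolean looksSpoofed(String sourceIp, int observedTtl) {
        Integer expected = ip2hc.get(sourceIp);
        if (expected == null) {
            return false; // unknown source: this sketch gives no verdict
        }
        return expected != inferHopCount(observedTtl);
    }

    public static void main(String[] args) {
        HopCountFilter filter = new HopCountFilter();
        filter.learn("198.51.100.9", 52);                         // 64 - 52 = 12 hops
        System.out.println(filter.looksSpoofed("198.51.100.9", 116)); // 128 - 116 = 12 hops
        System.out.println(filter.looksSpoofed("198.51.100.9", 60));  // 4 hops, mismatch
    }
}
```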

DYNAMIC PROBABILISTIC PACKET MARKING FOR EFFICIENT IP TRACEBACK

PUBLICATION: Comput. Netw., vol. 51, no. 3, pp. 866–882, 2007.

AUTHORS: J. Liu, Z.-J. Lee, and Y.-C. Chung

EXPLANATION:

Recently, denial-of-service (DoS) attack has become a pressing problem due to the lack of an efficient method to locate the real attackers and ease of launching an attack with readily available source codes on the Internet. Traceback is a subtle scheme to tackle DoS attacks. Probabilistic packet marking (PPM) is a new way for practical IP traceback. Although PPM enables a victim to pinpoint the attacker’s origin to within 2–5 equally possible sites, it has been shown that PPM suffers from uncertainty under spoofed marking attack. Furthermore, the uncertainty factor can be amplified significantly under distributed DoS attack, which may diminish the effectiveness of PPM. In this work, we present a new approach, called dynamic probabilistic packet marking (DPPM), to further improve the effectiveness of PPM. Instead of using a fixed marking probability, we propose to deduce the traveling distance of a packet and then choose a proper marking probability. DPPM may completely remove uncertainty and enable victims to precisely pinpoint the attacking origin even under spoofed marking DoS attacks. DPPM supports incremental deployment. Formal analysis indicates that DPPM outperforms PPM in most aspects.
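To make the contrast between a fixed and a distance-dependent marking probability concrete, the sketch below computes the chance that a mark written by the i-th router on a path of n routers reaches the victim without being overwritten. The rule p = 1/d (d being the number of hops the packet has travelled) is an assumption chosen for illustration, and the class and method names are hypothetical. Under that rule every router's mark survives with the same probability 1/n, whereas a fixed probability favours routers close to the victim.

```java
final class MarkingProbability {
    private MarkingProbability() {}

    /** Fixed scheme: router i marks with p, and routers i+1..n must each not re-mark. */
    static double fixedSurvival(int i, int n, double p) {
        double survive = p;
        for (int j = i + 1; j <= n; j++) {
            survive *= (1.0 - p);
        }
        return survive;
    }

    /** Distance-based scheme (assumed rule): the router at hop d marks with probability 1/d. */
    static double dynamicSurvival(int i, int n) {
        double survive = 1.0 / i;
        for (int j = i + 1; j <= n; j++) {
            survive *= (1.0 - 1.0 / j);
        }
        return survive; // telescopes to 1/n for every i
    }

    public static void main(String[] args) {
        int n = 10;
        for (int i = 1; i <= n; i++) {
            System.out.printf("router %2d: fixed=%.4f dynamic=%.4f%n",
                    i, fixedSurvival(i, n, 0.04), dynamicSurvival(i, n));
        }
    }
}
```

The uniform result follows from the telescoping product (1/i) · (i/(i+1)) · ... · ((n-1)/n) = 1/n.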

FLEXIBLE DETERMINISTIC PACKET MARKING: AN IP TRACEBACK SYSTEM TO FIND THE REAL SOURCE OF ATTACKS

PUBLICATION: IEEE Trans. Parallel Distrib. Syst., vol. 20, no. 4, pp. 567–580, Apr. 2009.

AUTHORS: Y. Xiang, W. Zhou, and M. Guo

EXPLANATION:

IP traceback is the enabling technology to control Internet crime. In this paper we present a novel and practical IP traceback system called Flexible Deterministic Packet Marking (FDPM), which provides a defense system with the ability to find out the real sources of attacking packets that traverse the network. While a number of other traceback schemes exist, FDPM provides innovative features to trace the source of IP packets and can obtain better tracing capability than others. In particular, FDPM adopts a flexible mark length strategy to make it compatible with different network environments; it also adaptively changes its marking rate according to the load of the participating router by a flexible flow-based marking scheme. Evaluations on both simulation and real system implementation demonstrate that FDPM requires a moderately small number of packets to complete the traceback process, adds little additional load to routers, and can trace a large number of sources in one traceback process with low false positive rates. The built-in overload prevention mechanism makes this system capable of achieving a satisfactory traceback result even when the router is heavily loaded. It has been used not only to trace DDoS attacking packets but also to enhance the filtering of attacking traffic.

CHAPTER 2

2.0 SYSTEM ANALYSIS

2.1 EXISTING SYSTEM:

In the existing IP marking approach, routers probabilistically write some encoding of partial path information into the packets during forwarding. A basic technique, the edge sampling algorithm, is to write edge information into the packets. This scheme reserves two static fields of the size of an IP address, start and end, and a static distance field in each packet. Each router updates these fields as follows. Each router marks the packet with a probability. When the router decides to mark the packet, it writes its own IP address into the start field and writes zero into the distance field. Otherwise, if the distance field is already zero, which indicates that the previous router marked the packet, it writes its own IP address into the end field, thus representing the edge between itself and the previous router.

If the router does not mark the packet, it always increments the distance field. Thus the distance field in the packet indicates the number of routers the packet has traversed from the router which marked the packet to the victim. The distance field should be updated using a saturating addition, meaning that the distance field is not allowed to wrap. The mandatory increment of the distance field is used to avoid spoofing by an attacker. Using such a scheme, any packet written by the attacker will have a distance field greater than or equal to the length of the real attack path. We call a router a false positive if it is in the reconstructed attack graph but not in the real attack graph. Similarly, we call a router a false negative if it is in the true attack graph but not in the reconstructed attack graph. We call a solution to the IP traceback problem robust if it has a very low rate of false negatives and false positives.
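The per-router marking step described above can be sketched as follows. This is a minimal illustration rather than a router implementation: the `EdgeSampling` and `Mark` names, the 255 cap on the distance field, and passing the random draw in as a parameter (so the behaviour is reproducible) are all assumptions.

```java
final class EdgeSampling {
    private EdgeSampling() {}

    /** The three marking fields carried in each packet. */
    static final class Mark {
        String start;
        String end;
        int distance;
    }

    // Assumed width of the distance field: it saturates at 255 instead of wrapping.
    static final int MAX_DISTANCE = 255;

    /**
     * Applies one router's update. 'draw' is a uniform random number in [0, 1);
     * the router marks the packet when draw < markingProbability.
     */
    static void forward(Mark mark, String routerIp, double markingProbability, double draw) {
        if (draw < markingProbability) {
            mark.start = routerIp; // begin a new edge at this router
            mark.distance = 0;
        } else {
            if (mark.distance == 0) {
                mark.end = routerIp; // previous router marked: complete the edge
            }
            if (mark.distance < MAX_DISTANCE) {
                mark.distance++; // mandatory, saturating increment
            }
        }
    }

    public static void main(String[] args) {
        Mark mark = new Mark();
        forward(mark, "10.0.0.1", 0.04, 0.01); // marks
        forward(mark, "10.0.0.2", 0.04, 0.50); // completes the edge
        forward(mark, "10.0.0.3", 0.04, 0.90); // only increments distance
        System.out.println(mark.start + " -> " + mark.end + ", distance " + mark.distance);
    }
}
```

In a live system the draw would come from a random number generator at each router; the victim then collects (start, end, distance) triples from many packets to rebuild the attack graph.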

2.1.1 DISADVANTAGES:

The existing approach has a very high computation overhead for the victim to reconstruct the attack paths, and gives a large number of false positives when the denial-of-service attack originates from multiple attackers.

The existing approach can require days of computation to reconstruct the attack paths and give thousands of false positives even when there are only 25 distributed attackers. This approach is also vulnerable to compromised routers.

If a router is compromised, it can forge markings from other uncompromised routers and hence lead the reconstruction to wrong results. Even worse, the victim will not be able to tell that a router is compromised just from the information in the packets it receives.

2.2 PROPOSED SYSTEM:

We propose a novel solution, named Passive IP Traceback (PIT), to bypass the challenges in deployment. Routers may fail to forward an IP spoofing packet due to various reasons, e.g., TTL exceeding. In such cases, the routers may generate an ICMP error message (named path backscatter) and send the message to the spoofed source address. Because the routers can be close to the spoofers, the path backscatter messages may potentially disclose the locations of the spoofers. PIT exploits these path backscatter messages to find the location of the spoofers. With the locations of the spoofers known, the victim can seek help from the corresponding ISP to filter out the attacking packets, or take other counterattacks. PIT is especially useful for the victims in reflection based spoofing attacks, e.g., DNS amplification attacks. The victims can find the locations of the spoofers directly from the attacking traffic.

We present PIT, which tracks the location of the spoofers based on path backscatter messages together with the topology and routing information. We discuss how to apply PIT when both topology and routing are known, when only topology is known, and when neither is known. We also present two effective algorithms to apply PIT in large-scale networks. In the following section, we first show the statistical results on path backscatter messages. Then we evaluate the two key mechanisms of PIT which work without routing information. Finally, we give the tracking result of applying PIT to the path backscatter message dataset: a number of ASes in which spoofers are found.

2.2.1 ADVANTAGES:

1) This is the first known article which deeply investigates path backscatter messages. These messages are valuable to help understand spoofing activities. Though Moore et al. [8] have exploited backscatter messages, which are generated by the targets of spoofing messages, to study Denial of Service (DoS), path backscatter messages, which are sent by intermediate devices rather than the targets, have not been used in traceback.

2) A practical and effective IP traceback solution based on path backscatter messages, i.e., PIT, is proposed. PIT bypasses the deployment difficulties of existing IP traceback mechanisms and is in fact already in force. Because path backscatter messages are not generated with stable probability, PIT cannot work in all attacks, but it does work in a number of spoofing activities. At the least, it may be the most useful traceback mechanism before an AS-level traceback system is actually deployed.

3) Through applying PIT on the path backscatter dataset, a number of locations of spoofers are captured and presented. Though this is not a complete list, it is the first known list disclosing the locations of spoofers.

2.3 HARDWARE & SOFTWARE REQUIREMENTS:

2.3.1 HARDWARE REQUIREMENT:

- Processor – Pentium IV
- Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Key Board – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA

2.3.2 SOFTWARE REQUIREMENTS:

Operating System : Windows XP or Win7
Front End : JAVA JDK 1.7
Document : MS-Office 2007

CHAPTER 3

3.0 SYSTEM DESIGN:

Data Flow Diagram / Use Case Diagram / Flow Diagram:

The DFD is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on these data, and the output data generated by the system.

The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components. These components are the system processes, the data used by the processes, the external entities that interact with the system, and the information flows in the system.

A DFD shows how information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.

A DFD may be used to represent a system at any level of abstraction. It may be partitioned into levels that represent increasing information flow and functional detail.

NOTATION:

SOURCE OR DESTINATION OF DATA:

External sources or destinations, which may be people or organizations or other entities

DATA STORE:

Here the data referenced by a process is stored and retrieved.

PROCESS:

People, procedures, or devices that produce data. The physical component is not identified.

DATA FLOW:

Data moves in a specific direction from an origin to a destination. The data flow is a “packet” of data.

MODELING RULES:

There are several common modeling rules when creating DFDs:

All processes must have at least one data flow in and one data flow out.
All processes should modify the incoming data, producing new forms of outgoing data.
Each data store must be involved with at least one data flow.
Each external entity must be involved with at least one data flow.
A data flow must be attached to at least one process.

3.1 ARCHITECTURE DIAGRAM

3.2 DATAFLOW DIAGRAM:

LEVEL 1 (diagram labels): Base station; View request; Router checks the node; Message sent via router.

LEVEL 2 (diagram labels): Node; Exists; Send request; Receive message; Check IP address and check verification node; Clear spoofing attacks.

LEVEL 3 (diagram labels): Router; IP address; Router checks each node; Check verification same/diff node to each data; Response to node; Detect spoofing origin and send message to original node.

3.2.1 UML DIAGRAMS:

3.2.2 USE CASE DIAGRAM:

Diagram labels: Base station; Router; Create message; View request; Message sent via router; Router checks each node; Check verification same/diff node to each data; Response to client; Detect spoofing origin; Send message; Node; Send request.

3.2.3 CLASS DIAGRAM:

Node: IP Address; Send request; View message()

Base station: IP Address; View request; Send message via router; Socket connection(); Send message()

Router: IP Address; Router checks each node; Detect spoofing(); Receive message(); Response to node; Send message()

3.2.4 SEQUENCE DIAGRAM:

Participants (diagram labels): Source; Base station; Destination.

Messages (diagram labels): Establish communication; Connection established; Send encoded data; Check verification; Form routing; Routing finished; Detect spoofing; Receiving Ack; Data received; Routing success; Connection terminate.

3.2.5 ACTIVITY DIAGRAM:

Diagram labels: Node; Router; Base station; IP Address; Send request; IP Address & View request; Message sent via router; Router checks the node; Router checks each node; Check; Check verification same/diff node to each data; Yes: start message receive; Clear jamming and send message to node; Response to client; Message received.

CHAPTER 4

4.0 IMPLEMENTATION:

4.1 ALGORITHM

We designed an algorithm specified in Fig. 6. This algorithm first finds a shortest path from r to od. Starting from the second vertex along the path, it checks whether removing the vertex disconnects r from od. Whenever such a vertex c is found, the vertex is removed from G, and the set containing all the vertices that are still connected with r is the suspect set.
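The steps above can be sketched in Java as follows. Fig. 6 is not reproduced here, so this is only one reading of the description: the class `SuspectSetFinder`, the use of breadth-first search on an undirected graph with integer vertex IDs, skipping od itself when testing vertices, and returning the whole component of r when no separating vertex exists are all assumptions of this sketch.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class SuspectSetFinder {
    private final Map<Integer, List<Integer>> adjacency = new HashMap<>();

    /** Adds an undirected edge to the topology graph G. */
    void addEdge(int a, int b) {
        neighbours(a).add(b);
        neighbours(b).add(a);
    }

    private List<Integer> neighbours(int v) {
        List<Integer> list = adjacency.get(v);
        if (list == null) {
            list = new ArrayList<>();
            adjacency.put(v, list);
        }
        return list;
    }

    /** Breadth-first search from 'start', treating 'removed' (may be null) as deleted. */
    private Map<Integer, Integer> bfs(int start, Integer removed) {
        Map<Integer, Integer> parent = new HashMap<>();
        Deque<Integer> queue = new ArrayDeque<>();
        parent.put(start, start);
        queue.add(start);
        while (!queue.isEmpty()) {
            int v = queue.remove();
            for (int w : neighbours(v)) {
                if ((removed != null && w == removed) || parent.containsKey(w)) {
                    continue;
                }
                parent.put(w, v);
                queue.add(w);
            }
        }
        return parent;
    }

    /** Returns the suspect set for reflecting device r and original destination od. */
    Set<Integer> suspects(int r, int od) {
        Map<Integer, Integer> parent = bfs(r, null);
        if (!parent.containsKey(od)) {
            throw new IllegalArgumentException("od is not reachable from r");
        }
        // Rebuild a shortest path r -> ... -> od from the BFS parents.
        List<Integer> path = new ArrayList<>();
        for (int v = od; v != r; v = parent.get(v)) {
            path.add(v);
        }
        path.add(r);
        Collections.reverse(path);
        // From the second vertex on, look for a vertex c whose removal separates r from od.
        for (int i = 1; i < path.size() - 1; i++) {
            int c = path.get(i);
            Map<Integer, Integer> reached = bfs(r, c);
            if (!reached.containsKey(od)) {
                return new TreeSet<>(reached.keySet());
            }
        }
        // Assumed fallback: no such vertex, so everything connected to r stays suspect.
        return new TreeSet<>(parent.keySet());
    }

    public static void main(String[] args) {
        SuspectSetFinder g = new SuspectSetFinder();
        g.addEdge(0, 1); g.addEdge(0, 5); g.addEdge(1, 2);
        g.addEdge(5, 2); g.addEdge(2, 3); g.addEdge(3, 4);
        System.out.println(g.suspects(0, 4));
    }
}
```

In the example graph, vertex 1 is not a separating vertex because r can still reach od through vertex 5, while removing vertex 2 cuts every path to od and leaves vertices 0, 1, and 5 connected with r.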

4.2 MODULES:

NETWORK SECURITY:

DENIAL OF SERVICE (DOS):

PATH BACKSCATTER:

IP SPOOFING METHOD:

IP TRACEBACK METHOD:

4.3 MODULE DESCRIPTION:

NETWORK SECURITY:

Decoy network-accessible resources may be deployed in a network as surveillance and early-warning tools, because such resources are not normally accessed for legitimate purposes. Techniques used by attackers who attempt to compromise these decoy resources are studied during and after an attack to keep an eye on new exploitation techniques. Such analysis may be used to further tighten the security of the actual network being protected. A decoy server can also direct an attacker's attention away from legitimate servers: it encourages attackers to spend their time and energy on the decoy while distracting their attention from the data on the real server. Similarly, a decoy network can be set up with intentional vulnerabilities. Its purpose is also to invite attacks so that the attackers' methods can be studied and that information can be used to increase network security.

DENIAL OF SERVICE (DOS):

In computing, a denial-of-service (DoS) attack is an attempt to make a machine or network resource unavailable to its intended users, such as to temporarily or indefinitely interrupt or suspend services of a host connected to the Internet. A distributed denial-of-service (DDoS) is where the attack source is more than one, often thousands of, unique IP addresses. It is analogous to a group of people crowding the entry door or gate to a shop or business, and not letting legitimate parties enter into the shop or business, disrupting normal operations.

Criminal perpetrators of DoS attacks often target sites or services hosted on high-profile web servers, such as banks or credit card payment gateways, but motives of revenge, blackmail, or activism can be behind other attacks. A denial-of-service attack is characterized by an explicit attempt by attackers to prevent legitimate users of a service from using that service. There are two general forms of DoS attacks: those that crash services and those that flood services.

The most serious attacks are distributed and in many or most cases involve forging of IP sender addresses (IP address spoofing) so that the location of the attacking machines cannot easily be identified, nor can filtering be done based on the source address.

PATH BACKSCATTER:

We previously presented a preliminary statistical result on path backscatter messages and discussed the possibility of tracing spoofers based on these messages. However, the generation and collection of path backscatter messages were not well investigated, and the traceback mechanisms were not designed. In this article, we make a thorough analysis of path backscatter messages, present the traceback mechanisms, and give the traceback results. Each message contains the source address of the reflecting device and the IP header of the original packet. Thus, from each path backscatter message, we can get 1) the IP address of the reflecting device, which is on the path from the attacker to the destination of the spoofing packet; and 2) the IP address of the original destination of the spoofing packet. The original IP header also contains other valuable information, e.g., the remaining TTL of the spoofing packet. Note that because some network devices may rewrite addresses (e.g., NAT), the original source address and the destination address may be different.
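As a minimal sketch of how these fields could be read, the Java class below extracts the remaining TTL and the original source and destination addresses from the IPv4 header embedded in a path backscatter message. The class name `PathBackscatterSketch` and its field names are hypothetical, and the sketch assumes the caller has already separated the reflecting device's address and the embedded header bytes from the captured message. The byte offsets (TTL at byte 8, source at bytes 12 to 15, destination at bytes 16 to 19) follow the standard IPv4 header layout.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Arrays;

// Hypothetical value class for the information carried by one path backscatter message.
class PathBackscatterSketch {
    final String reflector;            // source address of the reflecting device
    final String originalSource;       // (possibly spoofed) source of the original packet
    final String originalDestination;  // original destination of the spoofing packet
    final int remainingTtl;            // TTL left when the packet was reflected

    PathBackscatterSketch(String reflector, String originalSource,
                          String originalDestination, int remainingTtl) {
        this.reflector = reflector;
        this.originalSource = originalSource;
        this.originalDestination = originalDestination;
        this.remainingTtl = remainingTtl;
    }

    // Parses the embedded IPv4 header of the original packet.
    static PathBackscatterSketch parse(String reflector, byte[] header) {
        if (header.length < 20 || ((header[0] & 0xF0) >> 4) != 4) {
            throw new IllegalArgumentException("expected an IPv4 header of at least 20 bytes");
        }
        int ttl = header[8] & 0xFF; // mask to read the byte as unsigned
        String source = dotted(Arrays.copyOfRange(header, 12, 16));
        String destination = dotted(Arrays.copyOfRange(header, 16, 20));
        return new PathBackscatterSketch(reflector, source, destination, ttl);
    }

    // Formats four raw bytes as a dotted IPv4 address; no DNS lookup is performed.
    private static String dotted(byte[] four) {
        try {
            return InetAddress.getByAddress(four).getHostAddress();
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        byte[] header = new byte[20];
        header[0] = 0x45;                       // version 4, header length 5 words
        header[8] = 3;                          // remaining TTL
        header[12] = 1; header[13] = 2; header[14] = 3; header[15] = 4;
        header[16] = (byte) 192; header[17] = (byte) 168; header[18] = 0; header[19] = 1;
        PathBackscatterSketch m = parse("203.0.113.7", header);
        System.out.println(m.reflector + " reflected a packet for "
                + m.originalDestination + " with TTL " + m.remainingTtl);
    }
}
```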

IP SPOOFING METHOD:

Our tracking mechanisms have two limitations. The first is that the network topology and the mapping from the addresses of r and od must be known. The second is that the tracking is performed based on loose assumptions about paths. Thus, the spoofer can be accurately located only when path backscatter messages come from a very special vertex, i.e., a stub AS. In this section, we discuss how to overcome these limitations by using other information contained in path backscatter messages.

We found there are three special types of path backscatter messages which are more useful for tracing spoofers:

1) The path backscatter messages whose original hop count is 0 or 1. Such messages are generated 1 or 2 hops from the spoofers. They very likely come from the gateway of the spoofer.

2) The path backscatter messages whose type is ‘Redirect’. Such messages must be from a gateway of the spoofer.

3) The path backscatter messages whose original destination is a private address or an unallocated address. Such messages are typically generated by the first DFZ router on the path from the spoofer to the original destination, e.g., the egress router of the AS in which the spoofer resides. Though such path backscatter messages are generated in very special cases, they are not rare. In particular, there is a large number of path backscatter messages whose original destination is a private address.
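The three special types listed above can be expressed as a small classifier. The following Java sketch is illustrative only: the class name `SpecialBackscatterSketch`, the enum constants, and the method signature are hypothetical. It assumes the original hop count has already been estimated by the caller, uses ICMP type 5 for Redirect, and checks only the private IPv4 ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16); detecting unallocated addresses would need registry data and is left out.

```java
import java.util.EnumSet;

// Hypothetical classifier for the three special kinds of path backscatter messages.
class SpecialBackscatterSketch {

    enum Kind { NEAR_SPOOFER, REDIRECT, PRIVATE_DESTINATION }

    static final int ICMP_REDIRECT = 5; // ICMP message type for Redirect

    // True for the private IPv4 ranges 10/8, 172.16/12 and 192.168/16.
    static boolean isPrivateIpv4(String dotted) {
        String[] parts = dotted.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("not a dotted IPv4 address: " + dotted);
        }
        int a = Integer.parseInt(parts[0]);
        int b = Integer.parseInt(parts[1]);
        return a == 10 || (a == 172 && b >= 16 && b <= 31) || (a == 192 && b == 168);
    }

    // Returns every special kind that applies; an empty set means an ordinary message.
    static EnumSet<Kind> classify(int icmpType, int originalHopCount, String originalDestination) {
        EnumSet<Kind> kinds = EnumSet.noneOf(Kind.class);
        if (originalHopCount == 0 || originalHopCount == 1) {
            kinds.add(Kind.NEAR_SPOOFER);
        }
        if (icmpType == ICMP_REDIRECT) {
            kinds.add(Kind.REDIRECT);
        }
        if (isPrivateIpv4(originalDestination)) {
            kinds.add(Kind.PRIVATE_DESTINATION);
        }
        return kinds;
    }

    public static void main(String[] args) {
        System.out.println(classify(11, 1, "10.1.2.3"));
        System.out.println(classify(5, 3, "8.8.8.8"));
    }
}
```

A set is returned rather than a single label because one message can satisfy more than one of the three conditions.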

IP TRACEBACK METHOD:

PIT is very different from any existing traceback mechanism. The main difference is that path backscatter messages are not generated with a fixed probability. Thus, we separate the evaluation into three parts: the first is the statistical results on path backscatter messages; the second is the evaluation of the presented traceback mechanisms with respect to the uncertainty of path backscatter generation, since the effectiveness of the mechanisms is actually determined by the structural features of the networks; the last is the result of applying the traceback mechanisms to the path backscatter message dataset.

In this article, we proposed Passive IP Traceback (PIT), which tracks spoofers based on path backscatter messages and publicly available information. We illustrated the causes, collection, and statistical results of path backscatter. We specified how to apply PIT when the topology and routing are both known, when the routing is unknown, and when neither is known. We presented two effective algorithms to apply PIT in large-scale networks and proved their correctness. We demonstrated the effectiveness of PIT based on deduction and simulation. We showed the captured locations of spoofers by applying PIT to the path backscatter dataset. These results can help further reveal IP spoofing, which has been studied for a long time but never well understood.

CHAPTER 5

5.0 SYSTEM STUDY:

5.1 FEASIBILITY STUDY:

Three key considerations involved in the feasibility analysis are

ECONOMICAL FEASIBILITY
TECHNICAL FEASIBILITY
SOCIAL FEASIBILITY

5.1.1 ECONOMICAL FEASIBILITY:

5.1.2 TECHNICAL FEASIBILITY

5.1.3 SOCIAL FEASIBILITY:

5.2 SYSTEM TESTING:

5.2.1 UNIT TESTING:

Description	Expected result
Test for application window properties.	All the properties of the windows are to be properly aligned and displayed.
Test for mouse operations.	All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions.

5.2.2 FUNCTIONAL TESTING:

Description	Expected result
Test for all modules.	All peers should communicate in the group.
Test for various peers in a distributed network framework, as it displays all users available in the group.	The result after execution should be accurate.

5.2.3 NON-FUNCTIONAL TESTING:

Load testing
Performance testing
Usability testing
Reliability testing
Security testing

5.2.4 LOAD TESTING:

Description	Expected result
It is necessary to ascertain that the application behaves correctly under loads when ‘Server busy’ response is received.	Should designate another active node as a Server.

5.2.5 PERFORMANCE TESTING:

Description	Expected result
This is required to ensure that the application performs adequately, having the capability to handle many peers, delivering its results in the expected time, and using an acceptable level of resources; it is an aspect of operational management.	Should handle large input values and produce accurate results in the expected time.

5.2.6 RELIABILITY TESTING:

Description	Expected result
This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in providing the application.	In case of failure of the server, an alternate server should take over the job.

5.2.7 SECURITY TESTING:

Description	Expected result
Checking that the user identification is authenticated.	In case of failure, the user should not be connected to the framework.
Check whether group keys in a tree are shared by all peers.	The peers should know group key in the same group.

5.2.8 WHITE BOX TESTING:

Description	Expected result
Exercise all logical decisions on their true and false sides.	All the logical decisions must be valid.
Execute all loops at their boundaries and within their operational bounds.	All the loops must be finite.
Exercise internal data structures to ensure their validity.	All the data structures must be valid.

5.2.9 BLACK BOX TESTING:

Description	Expected result
To check for incorrect or missing functions.	All the functions must be valid.
To check for interface errors.	The entire interface must function normally.
To check for errors in data structures or external database access.	Database updates and retrieval must succeed.
To check for initialization and termination errors.	All the functions and data structures must be initialized properly and terminated normally.

All of the above system testing strategies are carried out, since the development, documentation, and institutionalization of the proposed goals and related policies are essential.

CHAPTER 6

6.0 SOFTWARE DESCRIPTION:

6.1 JAVA TECHNOLOGY:

Java technology is both a programming language and a platform.

The Java Programming Language

The Java programming language is a high-level language that can be characterized by all of the following buzzwords:

- Simple
- Architecture neutral
- Object oriented
- Portable
- Distributed
- High performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure

6.2 THE JAVA PLATFORM:

The Java platform has two components:

The Java Virtual Machine (Java VM)
The Java Application Programming Interface (Java API)

You’ve already been introduced to the Java VM. It’s the base for the Java platform and is ported onto various hardware-based platforms.

The following figure depicts a program that’s running on the Java platform. As the figure shows, the Java API and the virtual machine insulate the program from the hardware.

6.3 WHAT CAN JAVA TECHNOLOGY DO?

The essentials: Objects, strings, threads, numbers, input and output, data structures, system properties, date and time, and so on.
Applets: The set of conventions used by applets.
Networking: URLs, TCP (Transmission Control Protocol), UDP (User Datagram Protocol) sockets, and IP (Internet Protocol) addresses.
Internationalization: Help for writing programs that can be localized for users worldwide. Programs can automatically adapt to specific locales and be displayed in the appropriate language.
Security: Both low level and high level, including electronic signatures, public and private key management, access control, and certificates.
Software components: Known as JavaBeans™, can plug into existing component architectures.
Object serialization: Allows lightweight persistence and communication via Remote Method Invocation (RMI).
Java Database Connectivity (JDBC™): Provides uniform access to a wide range of relational databases.

The Java platform also has APIs for 2D and 3D graphics, accessibility, servers, collaboration, telephony, speech, animation, and more. The following figure depicts what is included in the Java 2 SDK.

6.4 HOW WILL JAVA TECHNOLOGY CHANGE MY LIFE?

Get started quickly: Although the Java programming language is a powerful object-oriented language, it’s easy to learn, especially for programmers already familiar with C or C++.
Write less code: Comparisons of program metrics (class counts, method counts, and so on) suggest that a program written in the Java programming language can be four times smaller than the same program in C++.
Write better code: The Java programming language encourages good coding practices, and its garbage collection helps you avoid memory leaks. Its object orientation, its JavaBeans component architecture, and its wide-ranging, easily extendible API let you reuse other people’s tested code and introduce fewer bugs.
Develop programs more quickly: Development may be as much as twice as fast as writing the same program in C++. Why? You write fewer lines of code, and Java is a simpler programming language than C++.
Avoid platform dependencies with 100% Pure Java: You can keep your program portable by avoiding the use of libraries written in other languages. The 100% Pure Java™ Product Certification Program has a repository of historical process manuals, white papers, brochures, and similar materials online.
Write once, run anywhere: Because 100% Pure Java programs are compiled into machine-independent byte codes, they run consistently on any Java platform.
Distribute software more easily: You can upgrade applets easily from a central server. Applets take advantage of the feature of allowing new classes to be loaded “on the fly,” without recompiling the entire program.

6.5 ODBC:

6.6 JDBC:

JDBC was announced in March of 1996. It was released for a 90 day public review that ended June 8, 1996. Because of user input, the final JDBC v1.0 specification was released soon after.

6.7 JDBC Goals:

The goals that were set for JDBC are important. They will give you some insight as to why certain classes and functionalities behave the way they do. The eight design goals for JDBC are as follows:

SQL Level API

SQL Conformance

JDBC must be implementable on top of common database interfaces

Provide a Java interface that is consistent with the rest of the Java system

Because of Java’s acceptance in the user community thus far, the designers feel that they should not stray from the current design of the core Java system.

Keep it simple

Use strong, static typing wherever possible

Strong typing allows more error checking to be done at compile time, so fewer errors appear at runtime.

Keep the common cases simple

Finally, we decided to proceed with the implementation using Java Networking.

For dynamically updating the cache table, we use an MS Access database.

Java is two things: a programming language and a platform.

Java is a high-level programming language that is all of the following:

- Simple
- Architecture-neutral
- Object-oriented
- Portable
- Distributed
- High-performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure

Compilation happens just once; interpretation occurs each time the program is executed. The figure illustrates how this works.

[Figure: a Java program is processed by the compiler and then by the interpreter to run as "My Program".]

6.7 NETWORKING TCP/IP STACK:

The TCP/IP stack is shorter than the OSI one:

TCP is a connection-oriented protocol; UDP (User Datagram Protocol) is a connectionless protocol.

IP datagrams:

UDP:

UDP is also connectionless and unreliable. What it adds to IP is a checksum for the contents of the datagram and port numbers. These are used to give a client/server model – see later.

TCP:

TCP supplies logic to give a reliable connection-oriented protocol above IP. It provides a virtual circuit that two processes can use to communicate.

Internet addresses

In order to use a service, you must be able to find it. The Internet uses an address scheme for machines so that they can be located. The address is a 32 bit integer which gives the IP address.

Network address:

Class A uses 8 bits for the network address with 24 bits left over for other addressing. Class B uses 16 bit network addressing. Class C uses 24 bit network addressing and class D uses all 32.

Subnet address:

Internally, the UNIX network is divided into sub networks. Building 11 is currently on one sub network and uses 10-bit addressing, allowing 1024 different hosts.

Host address:

8 bits are finally used for host addresses within our subnet. This places a limit of 256 machines that can be on the subnet.

Total address:

The 32 bit address is usually written as 4 integers separated by dots.
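The dotted notation can be illustrated with a short Java sketch that converts between a 32-bit integer and its four dot-separated integers. The class and method names (`DottedQuadSketch`, `toDotted`, `fromDotted`) are hypothetical and chosen only for this example.

```java
// Hypothetical helper showing the relation between a 32-bit address and dotted notation.
class DottedQuadSketch {

    // Splits the 32-bit value into four 8-bit fields, most significant first.
    static String toDotted(int address) {
        return ((address >>> 24) & 0xFF) + "."
                + ((address >>> 16) & 0xFF) + "."
                + ((address >>> 8) & 0xFF) + "."
                + (address & 0xFF);
    }

    // Packs four integers in the range 0..255 back into one 32-bit value.
    static int fromDotted(String dotted) {
        String[] parts = dotted.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("expected four fields: " + dotted);
        }
        int address = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("field out of range: " + part);
            }
            address = (address << 8) | octet;
        }
        return address;
    }

    public static void main(String[] args) {
        int address = 0xC0A80001;
        System.out.println(toDotted(address));            // prints 192.168.0.1
        System.out.println(fromDotted("192.168.0.1") == address); // prints true
    }
}
```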

Port addresses

Sockets:

#include <sys/types.h>

#include <sys/socket.h>

int socket(int family, int type, int protocol);

6.8 JFREE CHART:

JFreeChart is a free 100% Java chart library that makes it easy for developers to display professional quality charts in their applications. JFreeChart’s extensive feature set includes:

A consistent and well-documented API, supporting a wide range of chart types;

A flexible design that is easy to extend, and targets both server-side and client-side applications;

Support for many output types, including Swing components, image files (including PNG and JPEG), and vector graphics file formats (including PDF, EPS and SVG);

JFreeChart is “open source” or, more specifically, free software. It is distributed under the terms of the GNU Lesser General Public Licence (LGPL), which permits use in proprietary applications.

6.8.1. Map Visualizations:

6.8.2. Time Series Chart Interactivity

6.8.3. Dashboards

6.8.4. Property Editors

CHAPTER 7

7.0 APPENDIX

7.1 SAMPLE SCREEN SHOTS:

7.2 SAMPLE SOURCE CODE:

CHAPTER 8

8.1 CONCLUSION:

We try to dissipate the mist over the locations of spoofers by investigating path backscatter messages. In this article, we proposed Passive IP Traceback (PIT), which tracks spoofers based on path backscatter messages and publicly available information. We illustrated the causes, collection, and statistical results of path backscatter. We specified how to apply PIT when the topology and routing are both known, when the routing is unknown, and when neither is known.

We presented two effective algorithms to apply PIT in large-scale networks and proved their correctness. We demonstrated the effectiveness of PIT based on deduction and simulation. We showed the captured locations of spoofers by applying PIT to the path backscatter dataset. These results can help further reveal IP spoofing, which has been studied for a long time but never well understood.

Optimal Configuration of Network Coding in Ad Hoc Networks

05/08/201902/07/2019 by admin

Abstract:

This work analyzes the impact of network coding (NC) configuration on the performance of ad hoc networks with the consideration of two significant factors, namely, the throughput loss and the decoding loss, which are jointly treated as the overhead of NC. In particular, physical-layer NC and random linear NC are adopted in static and mobile ad hoc networks (MANETs), respectively. Furthermore, we characterize the goodput and the delay/goodput tradeoff in static networks, which are also analyzed in MANETs for different mobility models (i.e., the random independent and identically distributed mobility model and the random walk model) and transmission schemes.

Introduction:

Network coding was initially designed as a kind of source coding. Further studies showed that the capacity of wired networks can be improved by network coding (NC), which can fully utilize the network resources.

Due to this advantage, how to employ NC in wireless ad hoc networks has been intensively studied in recent years with the purpose of improving the throughput and delay performance. The main difference between wired networks and wireless networks is that there is non-negligible interference between nodes in wireless networks.

Therefore, it is important to design the NC in wireless ad hoc networks with interference so as to improve system performance measures such as the goodput and the delay/goodput tradeoff.

Existing System:

The probability that the random linear NC was valid for a multicast connection problem on an arbitrary network with independent sources was at least (1 − d/q)^η, where η was the number of links with associated random coefficients, d was the number of receivers, and q was the size of the Galois field F_q.
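To see how this lower bound behaves as the field size grows, the following Java sketch evaluates (1 − d/q)^η for a few values of q. The class name `RlncValidityBoundSketch`, the method name, and the sample values of d and η are assumptions made purely for illustration; they are not taken from the analysis.

```java
// Hypothetical numeric illustration of the lower bound (1 - d/q)^eta.
class RlncValidityBoundSketch {

    // d: number of receivers, q: Galois field size, eta: number of coded links.
    static double lowerBound(int d, long q, int eta) {
        if (q <= d) {
            return 0.0; // the base is not positive, so the bound carries no information
        }
        return Math.pow(1.0 - (double) d / q, eta);
    }

    public static void main(String[] args) {
        int d = 2;    // assumed number of receivers
        int eta = 10; // assumed number of links with random coefficients
        for (long q : new long[] {16L, 256L, 65536L}) {
            System.out.printf("q = %d -> bound = %.4f%n", q, lowerBound(d, q, eta));
        }
    }
}
```

For fixed d and η the bound increases toward 1 as q grows, at the cost of more bits per coding coefficient.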

It was obvious that a large q was required to guarantee that the system with RLNC was valid. When the throughput loss and the decoding loss are considered, the traditional definition of throughput in ad hoc networks is no longer appropriate, since it does not account for the bits of NC coefficients and the linearly correlated packets that do not carry any valuable data. Instead, the goodput and the delay/goodput tradeoff, which only take into account the successfully decoded data, are investigated in this paper.

Moreover, if we treat the data size of each packet, the generation size (the number of packets that are combined by NC as a group), and the NC coefficient Galois field as the configuration of NC, it is necessary to find the scaling laws of the optimal configuration for a given network model and transmission scheme.

Disadvantages:

Throughput loss.
The decoding loss.
Time delay.

Proposed System:

The proposed system starts from the basic idea of NC and the scaling laws of throughput loss and decoding loss. Furthermore, some useful concepts and parameters are listed. Finally, we give the definitions of some network performance metrics.

Physical-layer network coding (PNC) is designed based on the channel state information (CSI) and the network topology. PNC is appropriate for static networks, since the CSI and network topology are known in advance in the static case.

There are G nodes in one cell, and node i (i = 1, 2, . . . , G) holds packet x_i. All of the G packets are independent, and they belong to the same unicast session. The packets are transmitted simultaneously to a node i' in the next cell. g_ii' is a complex number that represents the CSI between i and i' in the frequency domain.

Advantages:

System minimizes data loss.
System reduces time delay.

Modules:

Network Topology:

The network consists of n randomly and evenly distributed static nodes in a unit square area. These nodes are randomly grouped into S–D pairs.

Transmission Model:

The protocol model is a simplified version of the physical model, since it ignores long-distance interference and transmission. Moreover, it has been indicated that, as far as scaling laws are concerned, the physical model can be treated as the protocol model when transmission is allowed only if the signal-to-interference-plus-noise ratio is larger than a given threshold.

Transmission Schemes for Mobile Networks:

When the relay receives a new packet, it combines the packet it already holds with the one it receives, using randomly selected coefficients, and generates a new packet. Simultaneous transmission in one cell is not allowed, since it is hard for the receiver to obtain multiple CSI from different transmitters at the same time. Hence, we employ random linear NC for the mobile models.

Conclusion:

We analyzed the NC configuration in both static and mobile ad hoc networks to optimize the delay/goodput tradeoff and the goodput, with the consideration of the throughput loss and decoding loss of NC. These results reveal the impact of network scale on the NC system, which has not been studied in previous works. Moreover, we also compared the performance with that of the corresponding networks without NC.

On Traffic-Aware Partition and Aggregation in MapReduce for Big Data Applications

05/08/201902/07/2019 by admin

To further reduce network traffic within a MapReduce job, we consider aggregating data with the same keys before sending them to remote reduce tasks. Although a similar function, called combine, has already been adopted by Hadoop, it operates immediately after a map task solely on that task's generated data, failing to exploit the data aggregation opportunities among multiple tasks on different machines. We jointly consider data partition and aggregation for a MapReduce job with the objective of minimizing the total network traffic. In particular, we propose a distributed algorithm for big data applications by decomposing the original large-scale problem into several subproblems that can be solved in parallel. Moreover, an online algorithm is designed to deal with the data partition and aggregation in a dynamic manner. Finally, extensive simulation results demonstrate that our proposals can significantly reduce network traffic cost in both offline and online cases.

1.2 INTRODUCTION

MapReduce has emerged as the most popular computing framework for big data processing due to its simple programming model and automatic management of parallel execution. MapReduce and its open source implementation Hadoop have been adopted by leading companies, such as Yahoo!, Google, and Facebook, for various big data applications, such as machine learning, bioinformatics, and cybersecurity. MapReduce divides a computation into two main phases, namely map and reduce, which in turn are carried out by several map tasks and reduce tasks, respectively. In the map phase, map tasks are launched in parallel to convert the original input splits into intermediate data in the form of key/value pairs. These key/value pairs are stored on the local machine and organized into multiple data partitions, one per reduce task. In the reduce phase, each reduce task fetches its own share of data partitions from all map tasks to generate the final result.

There is a shuffle step between map and reduce phase.

In this step, the data produced by the map phase are ordered, partitioned, and transferred to the appropriate machines executing the reduce phase. The resulting network traffic pattern from all map tasks to all reduce tasks can cause a great volume of network traffic, imposing a serious constraint on the efficiency of data analytic applications. For example, with tens of thousands of machines, data shuffling accounts for 58.6% of the cross-pod traffic and amounts to over 200 petabytes in total in the analysis of SCOPE jobs. For shuffle-heavy MapReduce tasks, the high traffic could incur a considerable performance overhead of up to 30-40%. By default, intermediate data are shuffled according to a hash function in Hadoop, which would lead to large network traffic because it ignores the network topology and the data size associated with each key.
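The traffic-oblivious behavior of hash partitioning can be seen in a few lines of Java. The sketch below has no Hadoop dependency; the class name `HashPartitionSketch` is hypothetical, and the masking-and-modulo rule is written under the assumption that it mirrors how Hadoop's default hash partitioner maps a key to a reduce task.

```java
// Hypothetical stand-alone sketch of hash-based partitioning of intermediate keys.
class HashPartitionSketch {

    // Masks the sign bit so the result is never negative, then takes the remainder.
    static int partition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 2;
        for (String key : new String[] {"K1", "K2", "K3"}) {
            // The chosen reducer depends only on the key's hash code,
            // not on where the data sits or how large it is.
            System.out.println(key + " -> reduce task " + partition(key, numReduceTasks));
        }
    }
}
```

Neither the data size per key nor the location of the map and reduce tasks enters the computation, which is why the resulting assignment can be far from traffic-optimal.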

We consider a toy example with two map tasks and two reduce tasks, where intermediate data of three keys K1, K2, and K3 are denoted by rectangle bars under each machine. If the hash function assigns data of K1 and K3 to reducer 1, and K2 to reducer 2, a large amount of traffic will go through the top switch. To tackle this problem incurred by the traffic-oblivious partition scheme, we take into account both task locations and the data size associated with each key in this paper. By assigning keys with larger data size to reduce tasks closer to map tasks, network traffic can be significantly reduced. In the same example above, if we assign K1 and K3 to reducer 2, and K2 to reducer 1, as shown in Fig. 1(b), the data transferred through the top switch will be significantly reduced.
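The toy example can be reproduced numerically with a small Java sketch. The data sizes below are invented for illustration, since the figure's actual values are not available in the text, and the class name `TopSwitchTrafficSketch` and both method names are hypothetical. The sketch assumes one map task and one reducer per machine, so any key data produced on a machine other than its reducer's machine crosses the top switch. The greedy rule shown is a simple stand-in for the idea of placing keys near their data; it is not the optimization algorithm proposed in the paper and it ignores reducer load balance.

```java
import java.util.Arrays;

// Hypothetical model: machine m hosts map task m and reducer m.
class TopSwitchTrafficSketch {

    // sizes[m][k] = data of key k produced on machine m;
    // reducerOf[k] = machine hosting the reducer that key k is assigned to.
    static long crossTraffic(long[][] sizes, int[] reducerOf) {
        long total = 0;
        for (int m = 0; m < sizes.length; m++) {
            for (int k = 0; k < sizes[m].length; k++) {
                if (reducerOf[k] != m) {
                    total += sizes[m][k]; // this data must cross the top switch
                }
            }
        }
        return total;
    }

    // Assigns each key to the machine that already holds most of its data.
    static int[] greedyAssignment(long[][] sizes) {
        int keys = sizes[0].length;
        int[] reducerOf = new int[keys];
        for (int k = 0; k < keys; k++) {
            int best = 0;
            for (int m = 1; m < sizes.length; m++) {
                if (sizes[m][k] > sizes[best][k]) {
                    best = m;
                }
            }
            reducerOf[k] = best;
        }
        return reducerOf;
    }

    public static void main(String[] args) {
        // Invented sizes for keys K1, K2, K3 on machines 0 and 1.
        long[][] sizes = { {1, 5, 1}, {4, 1, 6} };
        int[] hashLike = {0, 1, 0};           // K1, K3 -> reducer 1; K2 -> reducer 2
        int[] trafficAware = greedyAssignment(sizes);
        System.out.println("hash-like assignment:     " + crossTraffic(sizes, hashLike));
        System.out.println("traffic-aware " + Arrays.toString(trafficAware)
                + ": " + crossTraffic(sizes, trafficAware));
    }
}
```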

To further reduce network traffic within a MapReduce job, we consider aggregating data with the same keys before sending them to remote reduce tasks. Although a similar function, called combine, has already been adopted by Hadoop, it operates immediately after a map task solely on that task's generated data, failing to exploit the data aggregation opportunities among multiple tasks on different machines. As an example shown in Fig. 2(a), in the traditional scheme, two map tasks individually send data of key K1 to the reduce task. If we aggregate the data of the same keys before sending them over the top switch, as shown in Fig. 2(b), the network traffic will be reduced.
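The effect of aggregating across map tasks can be sketched in Java as follows. The class name `AggregationSketch`, the use of integer counts as values, and the choice of summation as the merge function are assumptions for illustration; the sketch only applies when the reduce function is associative and commutative, as summation is.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical aggregator that merges the outputs of several map tasks by key.
class AggregationSketch {

    // Merges values of equal keys so that one record per key crosses the top switch.
    static Map<String, Integer> aggregate(List<Map<String, Integer>> mapOutputs) {
        Map<String, Integer> merged = new TreeMap<>();
        for (Map<String, Integer> output : mapOutputs) {
            for (Map.Entry<String, Integer> e : output.entrySet()) {
                merged.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return merged;
    }

    // Number of key/value records that would be sent without aggregation.
    static int recordsWithoutAggregation(List<Map<String, Integer>> mapOutputs) {
        int records = 0;
        for (Map<String, Integer> output : mapOutputs) {
            records += output.size();
        }
        return records;
    }

    public static void main(String[] args) {
        Map<String, Integer> mapTask1 = new HashMap<>();
        mapTask1.put("K1", 3);
        Map<String, Integer> mapTask2 = new HashMap<>();
        mapTask2.put("K1", 4);
        List<Map<String, Integer>> outputs = new ArrayList<>();
        outputs.add(mapTask1);
        outputs.add(mapTask2);
        Map<String, Integer> merged = aggregate(outputs);
        System.out.println("records sent without aggregation: " + recordsWithoutAggregation(outputs));
        System.out.println("records sent with aggregation:    " + merged.size() + " " + merged);
    }
}
```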

In this paper, we jointly consider data partition and aggregation for a MapReduce job with an objective that is to minimize the total network traffic. In particular, we propose a distributed algorithm for big data applications by decomposing the original large-scale problem into several subproblems that can be solved in parallel. Moreover, an online algorithm is designed to deal with the data partition and aggregation in a dynamic manner. Finally, extensive simulation results demonstrate that our proposals can significantly reduce network traffic cost in both offline and online cases.

1.3 LITERATURE SURVEY

MAPREDUCE: SIMPLIFIED DATA PROCESSING ON LARGE CLUSTERS

AUTHOR: J. Dean and S. Ghemawat

PUBLISH: Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.

EXPLANATION:

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google’s clusters every day.

CLOUDBLAST: COMBINING MAPREDUCE AND VIRTUALIZATION ON DISTRIBUTED RESOURCES FOR BIOINFORMATICS APPLICATIONS

AUTHOR: A. Matsunaga, M. Tsugawa, and J. Fortes

PUBLISH: IEEE Fourth International Conference on. IEEE, 2008, pp. 222–229.

EXPLANATION:

This paper proposes and evaluates an approach to the parallelization, deployment and management of bioinformatics applications that integrates several emerging technologies for distributed computing. The proposed approach uses the MapReduce paradigm to parallelize tools and manage their execution, machine virtualization to encapsulate their execution environments and commonly used data sets into flexibly deployable virtual machines, and network virtualization to connect resources behind firewalls/NATs while preserving the necessary performance and the communication environment. An implementation of this approach is described and used to demonstrate and evaluate the proposed approach. The implementation integrates Hadoop, Virtual Workspaces, and ViNe as the MapReduce, virtual machine and virtual network technologies, respectively, to deploy the commonly used bioinformatics tool NCBI BLAST on a WAN-based test bed consisting of clusters at two distinct locations, the University of Florida and the University of Chicago. This WAN-based implementation, called CloudBLAST, was evaluated against both non-virtualized and LAN-based implementations in order to assess the overheads of machine and network virtualization, which were shown to be insignificant. To compare the proposed approach against an MPI-based solution, CloudBLAST performance was experimentally contrasted against the publicly available mpiBLAST on the same WAN-based test bed. Both versions demonstrated performance gains as the number of available processors increased, with CloudBLAST delivering speedups of 57 against 52.4 of MPI version, when 64 processors on 2 sites were used. The results encourage the use of the proposed approach for the execution of large-scale bioinformatics applications on emerging distributed environments that provide access to computing resources as a service.

MAP TASK SCHEDULING IN MAPREDUCE WITH DATA LOCALITY: THROUGHPUT AND HEAVY-TRAFFIC OPTIMALITY

AUTHOR: W. Wang, K. Zhu, L. Ying, J. Tan, and L. Zhang

PUBLISH: INFOCOM, 2013 Proceedings IEEE. IEEE, 2013, pp. 1609–1617.

EXPLANATION:

Scheduling map tasks to improve data locality is crucial to the performance of MapReduce. Many works have been devoted to increasing data locality for better efficiency. However, to the best of our knowledge, fundamental limits of MapReduce computing clusters with data locality, including the capacity region and theoretical bounds on the delay performance, have not been studied. In this paper, we address these problems from a stochastic network perspective. Our focus is to strike the right balance between data-locality and load-balancing to simultaneously maximize throughput and minimize delay.

We present a new queueing architecture and propose a map task scheduling algorithm constituted by the Join the Shortest Queue policy together with the MaxWeight policy. We identify an outer bound on the capacity region, and then prove that the proposed algorithm stabilizes any arrival rate vector strictly within this outer bound. It shows that the algorithm is throughput optimal and the outer bound coincides with the actual capacity region. Further, we study the number of backlogged tasks under the proposed algorithm, which is directly related to the delay performance based on Little’s law. We prove that the proposed algorithm is heavy-traffic optimal, i.e., it asymptotically minimizes the number of backlogged tasks as the arrival rate vector approaches the boundary of the capacity region. Therefore, the proposed algorithm is also delay optimal in the heavy-traffic regime.

CHAPTER 2

2.0 SYSTEM ANALYSIS

2.1 EXISTING SYSTEM:

Existing work considers the problem of optimizing network usage in MapReduce scheduling. The reason for the interest in network usage is twofold. Firstly, network utilization is a quantity of independent interest, as it is directly related to the throughput of the system. Note that the total amount of data processed in unit time is simply (CPU utilization)·(CPU capacity) + (network utilization)·(network capacity). CPU utilization will always be 1 as long as there are enough jobs in the map queue, but network utilization can be very sensitive to scheduling. Network utilization has been identified as a key component in the optimization of MapReduce systems in several previous works.

Secondly, optimizing network usage could lead us to algorithms with smaller mean response time. The main motivation for this direction of our work comes from earlier results on the aforementioned overlap between the map and shuffle phases, where two algorithms are shown to yield significantly better mean response time than Hadoop’s fair scheduler. However, we observed that neither of these two algorithms explicitly attempted to optimize network usage, which suggested room for improvement. Since MapReduce has become one of the most popular frameworks for large-scale distributed computing, there exists a huge body of work regarding performance optimization of MapReduce.

For instance, researchers have tried to optimize MapReduce systems by efficiently detecting and eliminating the so-called “stragglers”, providing better locality of data, preventing starvation caused by large jobs, and analyzing the problem from a purely theoretical viewpoint. The shuffle workload available at any given time is closely related to the output rate of the map phase, due to the inherent dependency between the map and shuffle phases. In particular, when the job that is being processed is ‘map-heavy,’ the available workload of the same job in the shuffle phase is upper-bounded by the output rate of the map phase. Therefore, poor scheduling of map tasks can have adverse effects on the throughput of the shuffle phase, causing the network to be idle and the efficiency of the entire system to decrease.

2.1.1 DISADVANTAGES:

The existing model, called the overlapping tandem queue model, is a job-level model for MapReduce where the map and shuffle phases of the MapReduce framework are modeled as two queues that are put in tandem. Since it is a job-level model, each job is represented by only the map size and the shuffle size. This simplification is justified by the introduction of two main assumptions. The first assumption states that each job consists of a large number of small-sized tasks, which allows us to represent the progress of each phase by real numbers.

The job-level model offers two big advantages over the more complicated task-level models.

Firstly, it gives rise to algorithms that are much simpler than those of task-level models, which improves their chances of being deployed in an actual system.

Secondly, the number of jobs in a system is often smaller than the number of tasks by several orders of magnitude, making the problem computationally much less strenuous. Note, however, that there are still open questions regarding the general applicability of the additional assumptions of the job-level model, which are interesting research questions in their own right.

2.2 PROPOSED SYSTEM:

Prior work has presented a MapReduce resource allocation system that enhances the performance of MapReduce jobs in the cloud by locating intermediate data on the local machines or close-by physical machines. This locality-awareness reduces the network traffic generated in the shuffle phase in the cloud data center. However, little work has studied how to optimize the network performance of the shuffle process, which generates large amounts of data traffic in MapReduce jobs. A critical factor in the network performance of the shuffle phase is the intermediate data partition. The default scheme adopted by Hadoop is hash-based partition, which can yield unbalanced loads among reduce tasks because it is unaware of the data size associated with each key.
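The load imbalance described above can be reproduced with a few lines of plain Java. This is a minimal sketch under stated assumptions: the partition rule mirrors the one used by Hadoop's default HashPartitioner (mask the sign bit of the key's hash code, then take the remainder), while the class name, the keys, and the data sizes are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch: keys and sizes are made-up values.
class HashPartitionSketch {
    // Same rule as Hadoop's default HashPartitioner:
    // mask the sign bit, then take the remainder.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Sums the data size routed to each reducer.
    static long[] reducerLoads(Map<String, Long> keySizes, int numReducers) {
        long[] loads = new long[numReducers];
        for (Map.Entry<String, Long> e : keySizes.entrySet()) {
            loads[partition(e.getKey(), numReducers)] += e.getValue();
        }
        return loads;
    }

    public static void main(String[] args) {
        Map<String, Long> keySizes = new LinkedHashMap<String, Long>();
        keySizes.put("a", 900L);  // one heavy key
        keySizes.put("b", 10L);
        keySizes.put("c", 20L);
        keySizes.put("d", 30L);
        long[] loads = reducerLoads(keySizes, 2);
        for (int r = 0; r < loads.length; r++) {
            System.out.println("reducer " + r + ": " + loads[r]);
        }
    }
}
```

The keys are split evenly (two per reducer), yet almost all of the data lands on one reducer because the hash rule never looks at the data size behind each key.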

We have developed a fairness-aware key partition approach that keeps track of the distribution of intermediate keys’ frequencies, and guarantees a fair distribution among reduce tasks. Earlier work introduced a combiner function that reduces the amount of data to be shuffled and merged to reduce tasks. Another proposal is an in-mapper combining scheme that exploits the fact that mappers can preserve state across the processing of multiple input key/value pairs and defer emission of intermediate data until all input records have been processed. Both proposals are constrained to a single map task, ignoring the data aggregation opportunities from multiple map tasks. A MapReduce-like system has also been proposed to decrease the traffic by pushing aggregation from the edge into the network.
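The in-mapper combining idea can be sketched in plain Java without any Hadoop dependency. The example below is an illustration under assumptions: the class and method names are hypothetical, and a word count stands in for the real map function.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch in plain Java (no Hadoop dependency); names are hypothetical.
class InMapperCombiningSketch {
    // Plain mapper: one intermediate record per input token.
    static int emittedWithoutCombining(String[] tokens) {
        return tokens.length;
    }

    // In-mapper combining: keep per-key counts as mapper state and
    // emit them only after all input records are processed.
    static Map<String, Integer> combineInMapper(String[] tokens) {
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        for (String t : tokens) {
            Integer old = counts.get(t);
            counts.put(t, old == null ? 1 : old + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] tokens = "to be or not to be".split(" ");
        Map<String, Integer> combined = combineInMapper(tokens);
        System.out.println("records without combining: " + emittedWithoutCombining(tokens));
        System.out.println("records with combining: " + combined.size());
        System.out.println(combined);
    }
}
```

The saving is limited to what a single map task sees, which is the constraint noted above: identical keys produced by other map tasks are still shuffled separately.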

2.2.1 ADVANTAGES:

We compare our proposed distributed algorithm with the optimal solution obtained by solving the MILP formulation. Due to the high computational complexity of the MILP formulation, we consider small-scale problem instances with 10 keys in this set of simulations.

The performance of our distributed algorithm is very close to the optimal solution. Although network traffic cost increases as the number of keys grows for all algorithms, the performance enhancement of our proposed algorithms over the other two schemes becomes larger.

We also compare our distributed algorithm with the other two schemes under a default simulation setting with a number of parameters, and then study the performance by changing one parameter while fixing the others. We consider a MapReduce job with 100 keys; the other parameters are the same as above. The network traffic cost is an increasing function of the number of keys from 1 to 100 under all algorithms.

2.3 HARDWARE & SOFTWARE REQUIREMENTS:

2.3.1 HARDWARE REQUIREMENT:

- Processor – Pentium IV
- Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Key Board – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA

2.3.2 SOFTWARE REQUIREMENTS:

Operating System : Windows XP or Win7
Front End : JAVA JDK 1.7
Script : JavaScript
Tool : NetBeans 7
Document : MS-Office 2007

CHAPTER 3

3.0 SYSTEM DESIGN:

Data Flow Diagram / Use Case Diagram / Flow Diagram:

A data flow diagram (DFD) is also called a bubble chart. It is a simple graphical formalism that represents a system in terms of the input data to the system, the processing carried out on these data, and the output data generated by the system.

The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components. These components are the system process, the data used by the process, an external entity that interacts with the system and the information flows in the system.

DFD shows how the information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.

A DFD may be used to represent a system at any level of abstraction, and may be partitioned into levels that represent increasing information flow and functional detail.

NOTATION:

SOURCE OR DESTINATION OF DATA:

External sources or destinations, which may be people or organizations or other entities

DATA STORE:

Here the data referenced by a process is stored and retrieved.

PROCESS:

People, procedures or devices that produce data; the physical component is not identified.

DATA FLOW:

Data moves in a specific direction from an origin to a destination. The data flow is a “packet” of data.

MODELING RULES:

There are several common modeling rules when creating DFDs:

All processes must have at least one data flow in and one data flow out.
All processes should modify the incoming data, producing new forms of outgoing data.
Each data store must be involved with at least one data flow.
Each external entity must be involved with at least one data flow.
A data flow must be attached to at least one process.

3.1 ARCHITECTURE DIAGRAM

3.2 DATAFLOW DIAGRAM

[Figure: data flow diagram. Labels: Server, Access Layer, Cross Layer, Use Hash Partition, Traffic Aware Partition, Send data through head node, Mapper, Receiver, Aggregation Layer, Map Reducer, OHRA, OHNA.]

UML DIAGRAMS:

3.2 USE CASE DIAGRAM:

[Figure: use case diagram. Labels: Source, Destination, Establish connection, Send the data, Data send into destination, Data Aggregation Layer, Receive data, Neighbor Nodes, View data, Base station, Form the cluster.]

3.3 CLASS DIAGRAM:

[Figure: class diagram. Labels: Source, Base station, Destination, Move Nodes, System Address, Data send, Data info, Destination address, Maintain Details, Node info, Data length, Data Send (), Transmitting (), System Address (), Verify (), Receive data (), View data (), Connection (), Hop routing ().]

3.4 SEQUENCE DIAGRAM:

[Figure: sequence diagram. Labels: Source, Base station, Destination, Data Aggregation Layer, Traffic Aware Partition, Map Reducer, Establish communication, Connection established, Send data, Form routing, Routing Finished, Receiving Ack, Data received, Connection terminate.]

3.5 ACTIVITY DIAGRAM:

[Figure: activity diagram. Labels: Source, Destination, Connection establish, Send data, Aggregation Node, Receive Ack, Using Mapper, Data transfer, Map Reducer, Base station, Receive data, View data, with True/False decision branches.]

CHAPTER 4

4.0 IMPLEMENTATION:

ONLINE EXTENSION OF HRA AND HNA

In this section, we conduct extensive simulations to evaluate the performance of our proposed distributed algorithm DA. We compare DA with HNA, which is the default method in Hadoop. To the best of our knowledge, we are the first to propose an aggregator placement algorithm, so we also compare DA with HRA, which uses a random aggregator placement. All simulation results are averaged over 30 random instances.

• HNA: Hash-based partition with No Aggregation. It exploits the traditional hash partitioning for the intermediate data, which are transferred to reducers without going through aggregators. It is the default method in Hadoop.

• HRA: Hash-based partition with Random Aggregation. It adds a random aggregator placement algorithm to traditional Hadoop. By randomly placing aggregators in the shuffle phase, it aims to reduce the network traffic cost compared with the traditional method in Hadoop.

We compare our proposed distributed algorithm with the optimal solution obtained by solving the MILP formulation. Due to the high computational complexity of the MILP formulation, we consider small-scale problem instances with 10 keys in this set of simulations. Each key is associated with a random data size within [1-50]. There are 20 mappers and 2 reducers on a cluster of 20 machines. The parameter α is set to 0.5. The distance between any two machines is randomly chosen within [1-60]. As shown in Fig. 7, the performance of our distributed algorithm is very close to the optimal solution. Although network traffic cost increases as the number of keys grows for all algorithms, the performance enhancement of our proposed algorithms over the other two schemes becomes larger. When the number of keys is set to 10, the default algorithm HNA has a cost of 5.0 × 10^4 while the optimal solution has a cost of only 2.7 × 10^4, a 46% traffic reduction.

4.1 ALGORITHM

DISTRIBUTED ALGORITHM

The problem above can be solved by highly efficient approximation algorithms, e.g., branch-and-bound, and fast off-the-shelf solvers, e.g., CPLEX, for moderate-sized input. An additional challenge arises in dealing with the MapReduce job for big data. In such a job, there are hundreds or even thousands of keys, each of which is associated with a set of variables (e.g., x^p_ij and y^p_k) and constraints in our formulation, leading to a large-scale optimization problem that is hardly handled by existing algorithms and solvers in practice.

ONLINE ALGORITHM

We take the data size m^p_i and the data aggregation ratio α_j as input of our algorithms. In order to get their values, we need to wait for all mappers to finish before starting reduce tasks, or conduct estimation via profiling on a small set of data. In practice, map and reduce tasks may partially overlap in execution to increase system throughput, and it is difficult to estimate system parameters with high accuracy for big data applications. These observations motivate us to design an online algorithm that dynamically adjusts data partition and aggregation during the execution of map and reduce tasks.

4.2 MODULES:

SERVER CLIENTS:

DISTRIBUTED DATA:

SCHEDULING TASK:

NETWORK TRAFFIC TRACES:

MAPREDUCE TASK:

4.3 MODULE DESCRIPTION:

SERVER CLIENTS:

Client-server computing or networking is a distributed application architecture that partitions tasks or workloads between service providers (servers) and service requesters, called clients. Often clients and servers operate over a computer network on separate hardware. A server machine is a high-performance host that runs one or more server programs which share its resources with clients. A client does not share any of its resources, but requests a server’s content or service. Clients therefore initiate communication sessions with servers, which await (listen to) incoming requests.

DISTRIBUTED DATA:

We develop a distributed algorithm to solve the problem on multiple machines in a parallel manner. Our basic idea is to decompose the original large-scale problem into several distributively solvable subproblems that are coordinated by a high-level master problem. We jointly consider data partition and aggregation for a MapReduce job with an objective that is to minimize the total network traffic. In particular, we propose a distributed algorithm for big data applications by decomposing the original large-scale problem into several subproblems that can be solved in parallel. Moreover, an online algorithm is designed to deal with the data partition and aggregation in a dynamic manner. Finally, extensive simulation results demonstrate that our proposals can significantly reduce network traffic cost in both offline and online cases.

SCHEDULING TASK:

MapReduce divides a computation into two main phases, namely map and reduce which in turn are carried out by several map tasks and reduce tasks, respectively. In the map phase, map tasks are launched in parallel to convert the original input splits into intermediate data in a form of key/value pairs. These key/value pairs are stored on local machine and organized into multiple data partitions, one per reduce task. In the reduce phase, each reduce task fetches its own share of data partitions from all map tasks to generate the final result. There is a shuffle step between map and reduce phase. In this step, the data produced by the map phase are ordered, partitioned and transferred to the appropriate machines executing the reduce phase. The resulting network traffic pattern from all map tasks to all reduce tasks can cause a great volume of network traffic, imposing a serious constraint on the efficiency of data analytic applications.
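The map, shuffle, and reduce steps described above can be traced in a single-process simulation. This is a minimal sketch and not Hadoop code: the class and method names are hypothetical, a word count stands in for the real computation, and the hash rule for assigning keys to reduce tasks is an assumption borrowed from Hadoop's default partitioner.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative single-process simulation; not Hadoop code. Names are hypothetical.
class MapShuffleReduceSketch {
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Map task: turn one input split into intermediate keys (value 1 each),
    // organized into one partition per reduce task.
    static List<List<String>> mapTask(String split, int numReducers) {
        List<List<String>> partitions = new ArrayList<List<String>>();
        for (int r = 0; r < numReducers; r++) {
            partitions.add(new ArrayList<String>());
        }
        for (String word : split.split(" ")) {
            partitions.get(partition(word, numReducers)).add(word);
        }
        return partitions;
    }

    // Reduce task r: fetch partition r from every map task (the shuffle)
    // and sum the counts per key.
    static Map<String, Integer> reduceTask(List<List<List<String>>> mapOutputs, int r) {
        Map<String, Integer> result = new TreeMap<String, Integer>();
        for (List<List<String>> output : mapOutputs) {
            for (String key : output.get(r)) {
                Integer old = result.get(key);
                result.put(key, old == null ? 1 : old + 1);
            }
        }
        return result;
    }

    // Runs all map tasks, then all reduce tasks, and merges the final results.
    static Map<String, Integer> run(String[] splits, int numReducers) {
        List<List<List<String>>> mapOutputs = new ArrayList<List<List<String>>>();
        for (String split : splits) {
            mapOutputs.add(mapTask(split, numReducers));
        }
        Map<String, Integer> all = new TreeMap<String, Integer>();
        for (int r = 0; r < numReducers; r++) {
            all.putAll(reduceTask(mapOutputs, r));
        }
        return all;
    }

    public static void main(String[] args) {
        String[] splits = { "a b a", "b c a" };
        System.out.println(run(splits, 2));
    }
}
```

Every reduce task reads from every map task, which is the all-to-all traffic pattern that makes the shuffle step expensive on a real cluster.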

NETWORK TRAFFIC TRACES:

To reduce network traffic within a MapReduce job, we consider aggregating data with the same keys before sending them to remote reduce tasks. Although a similar function, called combiner, has already been adopted by Hadoop, it operates immediately after a map task solely on that task’s generated data, failing to exploit the data aggregation opportunities among multiple tasks on different machines. As an example shown in Fig. 2(a), in the traditional scheme, two map tasks individually send data of key K1 to the reduce task. If we aggregate the data of the same keys before sending them over the top switch, as shown in Fig. 2(b), the network traffic will be reduced. We tested the real network traffic cost in Hadoop using a real data source, the latest dump files in Wikimedia (http://dumps.wikimedia.org/enwiki/latest/). In the meantime, we executed our distributed algorithm using the same data source for comparison. Since our distributed algorithm is based on a known aggregation ratio α, we have done some experiments to evaluate it in a Hadoop environment.
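The effect of aggregating before the top switch can be sketched with a simple cost model. The example below assumes that traffic cost is data size multiplied by hop distance and that an aggregator outputs a fraction α of what it receives; the class name, hop counts, data sizes, and the value of α are all hypothetical and are not taken from Fig. 2.

```java
// Illustrative sketch: topology, sizes, and ratio are assumptions.
class AggregationTrafficSketch {
    // Cost of moving `size` units of data over `distance` hops.
    static double cost(double size, int distance) {
        return size * distance;
    }

    // Each mapper sends its data directly to the reducer.
    static double withoutAggregation(double[] sizes, int hopsToReducer) {
        double total = 0;
        for (double s : sizes) {
            total += cost(s, hopsToReducer);
        }
        return total;
    }

    // Mappers send to a nearby aggregator, which forwards
    // alpha * (sum of inputs) to the reducer.
    static double withAggregation(double[] sizes, int hopsToAggregator,
                                  int hopsAggregatorToReducer, double alpha) {
        double total = 0;
        double received = 0;
        for (double s : sizes) {
            total += cost(s, hopsToAggregator);
            received += s;
        }
        return total + cost(alpha * received, hopsAggregatorToReducer);
    }

    public static void main(String[] args) {
        double[] sizes = { 4.0, 4.0 };  // two map tasks with data for key K1
        System.out.println("no aggregation: " + withoutAggregation(sizes, 3));
        System.out.println("aggregation:    " + withAggregation(sizes, 1, 2, 0.5));
    }
}
```

Under these assumed numbers the aggregated route costs less because only the reduced data crosses the longer path; with α close to 1 or a badly placed aggregator the saving would shrink or disappear.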

MAPREDUCE TASK:

We focus on MapReduce performance improvement by optimizing its data transmission. Prior work has investigated whether optimizing network usage can lead to better system performance, and found that high network utilization and low network congestion should be achieved simultaneously for a job to perform well. A MapReduce resource allocation system has also been presented that enhances the performance of MapReduce jobs in the cloud by locating intermediate data on the local machines or close-by physical machines. This locality-awareness reduces the network traffic generated in the shuffle phase in the cloud data center. However, little work has studied how to optimize the network performance of the shuffle process, which generates large amounts of data traffic in MapReduce jobs. A critical factor in the network performance of the shuffle phase is the intermediate data partition. The default scheme adopted by Hadoop is hash-based partition, which can yield unbalanced loads among reduce tasks because it is unaware of the data size associated with each key.

To overcome this shortcoming, we have developed a fairness-aware key partition approach that keeps track of the distribution of intermediate keys’ frequencies, and guarantees a fair distribution among reduce tasks. In addition to data partition, many efforts have been made on local aggregation, in-mapper combining and in-network aggregation to reduce network traffic within MapReduce jobs. Earlier work introduced a combiner function that reduces the amount of data to be shuffled and merged to reduce tasks. Another proposal is an in-mapper combining scheme that exploits the fact that mappers can preserve state across the processing of multiple input key/value pairs and defer emission of intermediate data until all input records have been processed. Both proposals are constrained to a single map task, ignoring the data aggregation opportunities from multiple map tasks. A MapReduce-like system has also been proposed to decrease the traffic by pushing aggregation from the edge into the network.

CHAPTER 5

5.0 SYSTEM STUDY:

5.1 FEASIBILITY STUDY:

Three key considerations involved in the feasibility analysis are

ECONOMICAL FEASIBILITY
TECHNICAL FEASIBILITY
SOCIAL FEASIBILITY

5.1.1 ECONOMICAL FEASIBILITY:

5.1.2 TECHNICAL FEASIBILITY

5.1.3 SOCIAL FEASIBILITY:

5.2 SYSTEM TESTING:

5.2.1 UNIT TESTING:

Description	Expected result
Test for application window properties.	All the properties of the windows are to be properly aligned and displayed.
Test for mouse operations.	All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions.

5.2.2 FUNCTIONAL TESTING:

Description	Expected result
Test for all modules.	All peers should communicate in the group.
Test for various peers in a distributed network framework, as it displays all users available in the group.	The result after execution should be accurate.

5.2.3 NON-FUNCTIONAL TESTING:

Load testing
Performance testing
Usability testing
Reliability testing
Security testing

5.2.4 LOAD TESTING:

Description	Expected result
It is necessary to ascertain that the application behaves correctly under loads when ‘Server busy’ response is received.	Should designate another active node as a Server.

5.2.5 PERFORMANCE TESTING:

Description	Expected result
This is required to assure that an application performs adequately, having the capability to handle many peers, delivering its results in the expected time and using an acceptable level of resources; it is an aspect of operational management.	Should handle large input values, and produce accurate results in the expected time.

5.2.6 RELIABILITY TESTING:

Description	Expected result
This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in providing the application.	In case of failure of the server, an alternate server should take over the job.

5.2.7 SECURITY TESTING:

Description	Expected result
Checking that the user identification is authenticated.	In case of failure, the user should not be connected to the framework.
Check whether group keys in a tree are shared by all peers.	The peers in the same group should know the group key.

5.2.8 WHITE BOX TESTING:

Description	Expected result
Exercise all logical decisions on their true and false sides.	All the logical decisions must be valid.
Execute all loops at their boundaries and within their operational bounds.	All the loops must be finite.
Exercise internal data structures to ensure their validity.	All the data structures must be valid.

5.2.9 BLACK BOX TESTING:

Description	Expected result
To check for incorrect or missing functions.	All the functions must be valid.
To check for interface errors.	The entire interface must function normally.
To check for errors in data structures or external database access.	Database update and retrieval must work correctly.
To check for initialization and termination errors.	All the functions and data structures must be initialized properly and terminated normally.

All the above system testing strategies are carried out, as the development, documentation and institutionalization of the proposed goals and related policies are essential.

CHAPTER 6

6.0 SOFTWARE DESCRIPTION:

6.1 JAVA TECHNOLOGY:

Java technology is both a programming language and a platform.

The Java Programming Language

The Java programming language is a high-level language that can be characterized by all of the following buzzwords:

- Simple
- Architecture neutral
- Object oriented
- Portable
- Distributed
- High performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure

6.2 THE JAVA PLATFORM:

The Java platform has two components:

The Java Virtual Machine (Java VM)
The Java Application Programming Interface (Java API)

You’ve already been introduced to the Java VM. It’s the base for the Java platform and is ported onto various hardware-based platforms.

The following figure depicts a program that’s running on the Java platform. As the figure shows, the Java API and the virtual machine insulate the program from the hardware.

6.3 WHAT CAN JAVA TECHNOLOGY DO?

The essentials: Objects, strings, threads, numbers, input and output, data structures, system properties, date and time, and so on.
Applets: The set of conventions used by applets.
Networking: URLs, TCP (Transmission Control Protocol), UDP (User Datagram Protocol) sockets, and IP (Internet Protocol) addresses.
Internationalization: Help for writing programs that can be localized for users worldwide. Programs can automatically adapt to specific locales and be displayed in the appropriate language.
Security: Both low level and high level, including electronic signatures, public and private key management, access control, and certificates.
Software components: Known as JavaBeans^TM, can plug into existing component architectures.
Object serialization: Allows lightweight persistence and communication via Remote Method Invocation (RMI).
Java Database Connectivity (JDBC^TM): Provides uniform access to a wide range of relational databases.

The Java platform also has APIs for 2D and 3D graphics, accessibility, servers, collaboration, telephony, speech, animation, and more. The following figure depicts what is included in the Java 2 SDK.

6.4 HOW WILL JAVA TECHNOLOGY CHANGE MY LIFE?

Get started quickly: Although the Java programming language is a powerful object-oriented language, it’s easy to learn, especially for programmers already familiar with C or C++.
Write less code: Comparisons of program metrics (class counts, method counts, and so on) suggest that a program written in the Java programming language can be four times smaller than the same program in C++.
Write better code: The Java programming language encourages good coding practices, and its garbage collection helps you avoid memory leaks. Its object orientation, its JavaBeans component architecture, and its wide-ranging, easily extendible API let you reuse other people’s tested code and introduce fewer bugs.
Develop programs more quickly: Your development time may be as much as twice as fast versus writing the same program in C++. Why? You write fewer lines of code and it is a simpler programming language than C++.
Avoid platform dependencies with 100% Pure Java: You can keep your program portable by avoiding the use of libraries written in other languages. The 100% Pure Java^TM Product Certification Program has a repository of historical process manuals, white papers, brochures, and similar materials online.
Write once, run anywhere: Because 100% Pure Java programs are compiled into machine-independent byte codes, they run consistently on any Java platform.
Distribute software more easily: You can upgrade applets easily from a central server. Applets take advantage of the feature of allowing new classes to be loaded “on the fly,” without recompiling the entire program.

6.5 ODBC:

6.6 JDBC:

JDBC was announced in March of 1996. It was released for a 90 day public review that ended June 8, 1996. Because of user input, the final JDBC v1.0 specification was released soon after.

6.7 JDBC Goals:

The goals that were set for JDBC are important. They will give you some insight as to why certain classes and functionalities behave the way they do. The eight design goals for JDBC are as follows:

SQL Level API

SQL Conformance

JDBC must be implementable on top of common database interfaces

Provide a Java interface that is consistent with the rest of the Java system

Because of Java’s acceptance in the user community thus far, the designers feel that they should not stray from the current design of the core Java system.

Keep it simple

Use strong, static typing wherever possible

Strong typing allows for more error checking to be done at compile time; also, fewer errors appear at runtime.

Keep the common cases simple

Finally, we decided to proceed with the implementation using Java Networking.

And for dynamically updating the cache table we go for MS Access database.

Java is two things: a programming language and a platform.

Java is a high-level programming language that is all of the following:

- Simple
- Architecture-neutral
- Object-oriented
- Portable
- Distributed
- High-performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure

Compilation happens just once; interpretation occurs each time the program is executed. The figure illustrates how this works.

[Figure: a Java program passes through the compiler and then the interpreter to run as My Program.]

6.7 NETWORKING TCP/IP STACK:

The TCP/IP stack is shorter than the OSI one:

TCP is a connection-oriented protocol; UDP (User Datagram Protocol) is a connectionless protocol.

IP datagrams:

UDP:

UDP is also connectionless and unreliable. What it adds to IP is a checksum for the contents of the datagram and port numbers. These are used to give a client/server model – see later.

TCP:

TCP supplies logic to give a reliable connection-oriented protocol above IP. It provides a virtual circuit that two processes can use to communicate.

Internet addresses

In order to use a service, you must be able to find it. The Internet uses an address scheme for machines so that they can be located. The address is a 32 bit integer which gives the IP address.

Network address:

Class A uses 8 bits for the network address with 24 bits left over for other addressing. Class B uses 16 bit network addressing. Class C uses 24 bit network addressing and class D uses all 32.

Subnet address:

Internally, the UNIX network is divided into sub networks. Building 11 is currently on one sub network and uses 10-bit addressing, allowing 1024 different hosts.

Host address:

8 bits are finally used for host addresses within our subnet. This places a limit of 256 machines that can be on the subnet.

Total address:

The 32 bit address is usually written as 4 integers separated by dots.
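The dotted notation can be derived from the 32-bit value by splitting it into four 8-bit groups. The following is a minimal sketch; the class and method names are hypothetical, and the sample address is chosen only for illustration.

```java
// Illustrative sketch; class and method names are hypothetical.
class DottedQuadSketch {
    // Formats a 32-bit IPv4 address as four decimal integers separated by dots.
    static String toDotted(int address) {
        return ((address >>> 24) & 0xFF) + "." + ((address >>> 16) & 0xFF) + "."
             + ((address >>> 8) & 0xFF) + "." + (address & 0xFF);
    }

    // Parses dotted notation back into the 32-bit value.
    static int fromDotted(String dotted) {
        String[] parts = dotted.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("expected 4 parts: " + dotted);
        }
        int address = 0;
        for (String p : parts) {
            int octet = Integer.parseInt(p);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("octet out of range: " + p);
            }
            address = (address << 8) | octet;
        }
        return address;
    }

    public static void main(String[] args) {
        int address = 0xC0A80001;  // 192.168.0.1
        System.out.println(toDotted(address));
        System.out.println(Integer.toHexString(fromDotted("192.168.0.1")));
    }
}
```

The unsigned shift (`>>>`) matters here because Java's `int` is signed; addresses with the top bit set would otherwise produce negative groups.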

Port addresses

Sockets:

#include <sys/types.h>

#include <sys/socket.h>

int socket(int family, int type, int protocol);

6.8 JFREE CHART:

JFreeChart is a free 100% Java chart library that makes it easy for developers to display professional quality charts in their applications. JFreeChart’s extensive feature set includes:

A consistent and well-documented API, supporting a wide range of chart types;

A flexible design that is easy to extend, and targets both server-side and client-side applications;

Support for many output types, including Swing components, image files (including PNG and JPEG), and vector graphics file formats (including PDF, EPS and SVG);

JFreeChart is “open source” or, more specifically, free software. It is distributed under the terms of the GNU Lesser General Public Licence (LGPL), which permits use in proprietary applications.

6.8.1. Map Visualizations:

6.8.2. Time Series Chart Interactivity

6.8.3. Dashboards

6.8.4. Property Editors

CHAPTER 7

7.0 APPENDIX

7.1 SAMPLE SCREEN SHOTS:

7.2 SAMPLE SOURCE CODE:

CHAPTER 8

8.1 CONCLUSION

In this paper, we study the joint optimization of intermediate data partition and aggregation in MapReduce to minimize network traffic cost for big data applications. We propose a three-layer model for this problem and formulate it as a mixed-integer nonlinear problem, which is then transferred into a linear form that can be solved by mathematical tools. To deal with the large-scale formulation due to big data, we design a distributed algorithm to solve the problem on multiple machines. Furthermore, we extend our algorithm to handle the MapReduce job in an online manner when some system parameters are not given. Finally, we conduct extensive simulations to evaluate our proposed algorithm under both offline cases and online cases. The simulation results demonstrate that our proposals can effectively reduce network traffic cost under various network settings.
CHAPTER 9

Network-Based Modeling and Intelligent Data Mining of Social Media for Improving Care

05/08/201902/07/2019 by admin

Abstract—Intelligently extracting knowledge from social mediahas recently attracted great interest from the Biomedical andHealth Informatics community to simultaneously improve healthcareoutcomes and reduce costs using consumer-generated opinion.We propose a two-step analysis framework that focuses on positiveand negative sentiment, as well as the side effects of treatment, inusers’ forum posts, and identifies user communities (modules) andinfluential users for the purpose of ascertaining user opinion ofcancer treatment. We used a self-organizing map to analyze wordfrequency data derived from users’ forum posts. We then introduceda novel network-based approach for modeling users’ foruminteractions and employed a network partitioning method based onoptimizing a stability qualitymeasure.This allowed us to determineconsumer opinion and identify influential users within the retrievedmodules using information derived frombothword-frequency dataand network-based properties. Our approach can expand researchinto intelligently mining social media data for consumer opinionof various treatments to provide rapid, up-to-date information forthe pharmaceutical industry, hospitals, and medical staff, on theeffectiveness (or ineffectiveness) of future treatments.Index Terms—Datamining, complex networks, neural networks,semantic web, social computing.I. INTRODUCTIONSOCIAL media is providing limitless opportunities for patientsto discuss their experiences with drugs and devices,and for companies to receive feedback on their products andservices [1]–[3]. Pharmaceutical companies are prioritizing socialnetwork monitoring within their IT departments, creatingan opportunity for rapid dissemination and feedback of productsand services to optimize and enhance delivery, increase turnoverand profit, and reduce costs [4]. 
Social media data harvestingfor bio-surveillance have also been reported [5].Social media enables a virtual networking environment.Modelingsocial media using available network modeling and computationaltools is one way of extracting knowledge and trendsfrom the information ‘cloud:’ a social network is a structuremade of nodes and edges that connect nodes in various relationships.Graphical representation is the most common methodto visually represent the information. Network modeling couldManuscript received January 24, 2014; revised May 4, 2014 and June 19,2014; accepted June 30, 2014. Date of publication July 10, 2014; date of currentversion December 30, 2014.A. Akay and B.-E. Erlandsson are with the School of Technology andHealth, Royal Institute of Technology, Stockholm SE-141 52, Sweden (e-mail:altu@kth.se; bjorn-erik.erlandsson@sth.kth.se).A. Dragomir, is with the Department of Biomedical Engineering, Universityof Houston, Houston, TX 77204–5060 USA (e-mail: adragomir@uh.edu).Color versions of one or more of the figures in this paper are available onlineat http://ieeexplore.ieee.org.Digital Object Identifier 10.1109/JBHI.2014.2336251also be used for studying the simulation of network propertiesand its internal dynamics.A sociomatrix can be used to construct representations ofa social network structure. Node degree, network density, andother large-scale parameters can derive information about theimportance of certain entities within the network. Such communitiesare clusters or modules. Specific algorithms can performnetwork-clustering, one of the fundamental tasks in networkanalysis. Detecting particular user communities requires identifyingspecific, networked nodes that will allow informationextraction. Healthcare providers could use patient opinion toimprove their services. Physicians could collect feedback fromother doctors and patients to improve their treatment recommendationsand results. 
Patients could use other consumers’ knowledge in making better-informed healthcare decisions.

The nature of social networks makes data collection difficult. Several methods have been employed, such as link mining [6], classification through links [7], predictions based on objects [8], links [9], existence [10], estimation [11], object [12], group [13], and subgroup detection [14], and mining the data [15], [16]. Link prediction, viral marketing, online discussion groups (and rankings) allow for the development of solutions based on user feedback.

Traditional social sciences use surveys and involve subjects in the data collection process, resulting in small sample sizes per study. With social media, more content is readily available, particularly when combined with web-crawling and scraping software that would allow real-time monitoring of changes within the network.

Previous studies used technical solutions to extract user sentiment on influenza [17], technology stocks [18], context and sentence structure [19], online shopping [20], multiple classifications [21], government health monitoring [22], specific terms relating to consumer satisfaction [23], polarity of newspaper articles [24], and assessment of user satisfaction from companies [25], [26]. Despite the extensive literature, none have identified influential users, or how forum relationships affect network dynamics.

In the first stage of our current study, we employ exploratory analysis using self-organizing maps (SOMs) to assess correlations between user posts and positive or negative opinion on the drug. In the second stage, we model the users and their posts using a network-based approach. We build on our previous study [27] and use an enhanced method for identifying user communities (modules) and influential users therein. The current approach effectively searches for potential levels of organization (scales) within the networks and uncovers dense modules using a partition stability quality measure [28]. The approach enables us to find the optimal network partition. We subsequently enrich the retrieved modules with word frequency information from module-contained users’ posts to derive local and global measures of users’ opinion and raise flags on potential side effects of Erlotinib, a drug used in the treatment of one of the most prevalent cancers: lung cancer [29].

2168-2194 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

AKAY et al.: NETWORK-BASED MODELING AND INTELLIGENT DATA MINING OF SOCIAL MEDIA FOR IMPROVING CARE

Fig. 1. Processing tree in Rapidminer to ascertain the TF-IDF scores of words in the data.

II. METHODS

A. Initial Data Search and Collection

We first searched for the most popular cancer message boards. We initially focused on the number of posts on lung cancer. The chart below gives the number of posts on lung cancer per forum:

Forums                     Posts on Lung Cancer
Cancer-forums.net          36 051
cancerforums.net           34 328
forums.stupidcancer.org    17
csn.cancer.org/forum       7959

We chose lung cancer because, according to the most recent statistics, it is the most commonly diagnosed cancer in the world for both sexes [30], and the second most prevalent cancer in the US between both sexes [31], [32]. We then compiled a list of drugs used by lung cancer patients to ascertain which drug was the most discussed in the forums. The drug Erlotinib (trade name Tarceva) was the most frequently discussed drug in the message boards. A further search revealed that Cancerforums.net, despite having slightly fewer posts on lung cancer, had more posts dedicated to Erlotinib than the other three message boards mentioned above.

Next, we performed a search of the drug, using both the trade name (Tarceva) and drug name (Erlotinib).
The trade name garnered more results (498) compared to the drug name (66). The search using the trade name returned 920 posts, from 2009 to the present date.

B. Initial Text Mining and Preprocessing

A Rapidminer (www.rapidminer.com) [33] data collection and processing tree was developed to look for the most common positive and negative words, and their term-frequency-inverse document frequency (TF-IDF) scores within each post. Fig. 1 shows the data collection and processing tree. We initially uploaded the data into the first component (‘Read Excel’). The uploaded data was then processed in the second component (‘Process Documents to Data’) using several subcomponents (‘Extract Content’, ‘Tokenize’, ‘Transform Cases’, ‘Filter Stopwords’, and ‘Filter Tokens’, respectively) that filtered excess noise (misspelled words, common stop words, etc.) to ensure a uniform set of variables that can be measured. The final component (‘Processed Data’) contained the final word list, with each word carrying a specific TF-IDF score.

We then assigned weights for each of the words found in the user posts using the following formula:

$$\mathrm{weight}_{t,d} = \begin{cases} \log(\mathrm{tf}_{t,d} + 1)\,\log\dfrac{n}{x_t} & \text{if } \mathrm{tf}_{t,d} \ge 1 \\ 0 & \text{otherwise} \end{cases}$$

in which $\mathrm{tf}_{t,d}$ represents the frequency of the word (t) in the document (d), n represents the number of documents within the entire collection, and $x_t$ represents the number of documents where t occurs [30].

C. Cataloging and Tagging Text Data

Text data containing the highest TF-IDF scores were tagged with a modified NLTK toolkit (http://www.nltk.org/) [34] using MATLAB to ensure that they reflected the negativity of a negative word and the positivity of a positive word in context. This approach has been used before to place negative tags on positive words [35]. We added a positive tag on negative words. We used the NLTK toolkit for the analysis, and classification, of words to match their exact meanings within the contextual settings.
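As a rough illustration of the weighting formula in Section II-B, the sketch below computes the weights for a handful of tokenized posts. It is not the Rapidminer pipeline used in the study. The function names (`tfidf_weight`, `weight_table`), the toy posts, and the choice of the natural logarithm are assumptions made for the example, since the formula does not state a base.

```python
import math

def tfidf_weight(tf, n_docs, docs_with_term):
    """log(tf + 1) * log(n / x_t) if tf >= 1, else 0 (natural log is an assumption)."""
    if tf < 1:
        return 0.0
    return math.log(tf + 1) * math.log(n_docs / docs_with_term)

def weight_table(posts):
    """posts: list of token lists. Returns one {term: weight} dict per post."""
    n_docs = len(posts)
    # x_t: number of posts in which each term occurs at least once
    doc_freq = {}
    for tokens in posts:
        for term in set(tokens):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    table = []
    for tokens in posts:
        counts = {}
        for term in tokens:
            counts[term] = counts.get(term, 0) + 1
        table.append({t: tfidf_weight(c, n_docs, doc_freq[t]) for t, c in counts.items()})
    return table

# Hypothetical, already tokenized forum posts
posts = [["rash", "bad", "rash"], ["good", "hope"], ["good", "rash"]]
weights = weight_table(posts)
```

A term that occurs in every post gets weight zero because log(n / x_t) vanishes, which is why very common words contribute little to the wordlist vectors.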
For example, the context should be considered in phrases such as ‘I do not feel great’ so that the term ‘great’ would be adequately tagged as a negative one (in our case it is tagged as ‘great_n’ before it is returned to its specific position). Das and Chen used a similar approach in classifying words [18]. We went one step further and considered positive tags on negative words. A sentence that states ‘No side effects so I am happy!’ resulted in the word ‘No’ being tagged as ‘No_p’ (reflecting its positive context) before it is returned to its specific position. These tagged words were thus reclassified based on the context of the post.

We then reduced the number of similar words, both manually (checking the words using online dictionaries such as Merriam-Webster (http://www.merriam-webster.com/)) and automatically (synonym database software such as the Thesaurus Synonym Database (http://www.language-databases.com/) and Google’s synonym search finder).

Our final wordlist was pruned using the aforementioned methods, with the results displayed in Table I, divided into positive and negative words.

We eliminated each word that appeared fewer than ten times. This allowed us to achieve a uniform set of measurements while eliminating statistically insignificant outliers. The end result was a modified wordlist of 110 words (55 positive words and 55 negative words) shown in Table I.

IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, VOL. 19, NO. 1, JANUARY 2015

TABLE I
FINAL POSTANALYSIS WORDLIST

Positive       Negative
Agree          Bad
Appreciate     Cannot
Beneficial     Concern
Benefit        Concerns
Comfort        Damage
Comfortable    Dangerous
Ease           Depression
Easier         Didn
Effective      Died
Enjoy          Difficult
Favorable      Discomfort
Favorably      Don’t
Feasible       Doubt
Good           Error
Grateful       Failure
Great          Fear
Greater        Hard
Greatest       Hasn
Greatly        Hate
Help           Hurt
Helped         Impossible
Helpful        Isn
Helping        Lack
Helps          Limited
Honest         Lose
Honestly       Loss
Hope           Lost
Hoped          Miss
Hopeful        Nasty
Hopefully      Nausea
Hopes          Negative
Hoping         No
Importance     Not
Important      Pain
Importantly    Painful
Impresses      Poor
Improve        Problem
Improved       Problems
Improvement    Sad
Improves       Sacred
Improving      Scary
Inspiration    Severe
Like           Sorry
Love           Sucks
Loved          Suffer
Positive       Suffering
Right          Terrible
Success        Unable
Successful     Unfortunately
Support        Wasn
Thank          Weak
Thanks         Worried
Useful         Worse
Well           Worst
Wonderful      Wrong

In a parallel procedure, we automatically browsed the user posts to look for side effects of Erlotinib. To this goal, we used the National Library of Medicine’s Medical Subject Heading (MeSH), which is a controlled vocabulary (http://www.nlm.nih.gov/mesh/) that consists of a hierarchy of descriptors and qualifiers that are used to annotate medical terms. A custom designed program was used to map words in the forum to the MeSH database. A list of words present in forum posts that were associated with treatment side effects was thus compiled. This was done by selecting the words simultaneously annotated with a specific list of qualifiers in MeSH (CI – chemically induced; CO – complications; DI – diagnosis; PA – pathology; and PP – physiopathology). We then compared the full list of side effect words with the results that were fed into the Rapidminer processing tree: we kept the side effect words with the highest TF-IDF scores (ensuring that each word appeared at least ten times in the forum posts).

TABLE II
FINAL SIDE EFFECTS WORDLIST

Acne
Cachexia
Headaches
Itching
Lesion
Pneumonia
Rash
Tremor
Weakness
Vomiting

Fig. 2. Thread model where nodes represent users/posts and the edges represent information transferred among users.

Table II shows the final wordlist of the side effects.
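The frequency cutoff applied to both wordlists (words appearing fewer than ten times in the forum posts were dropped) can be sketched as follows. This is only an illustration under simple assumptions: posts are already tokenized and lowercased, and the helper name `prune_wordlist` and the toy data are hypothetical.

```python
from collections import Counter

def prune_wordlist(posts, candidates, min_count=10):
    """Keep candidate words that occur at least min_count times across all posts."""
    counts = Counter(tok for tokens in posts for tok in tokens if tok in candidates)
    return sorted(w for w in candidates if counts[w] >= min_count)

# Hypothetical tokenized posts: 'rash' and 'itching' occur 10 times, 'tremor' only 3
posts = [["rash", "itching"]] * 10 + [["tremor"]] * 3
kept = prune_wordlist(posts, {"rash", "itching", "tremor", "acne"})
```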
We subjected the initial side effect wordlist to the same methods that were used for Table I.

After these preprocessing steps, our forum data was represented as two sets of vectors containing the TF-IDF scores of the words in the two wordlists. Namely, each user post in the forum was transformed into a vector of 110 variables representing the TF-IDF scores of positive and negative words, and a 10-variable vector containing the TF-IDF scores corresponding to the side effect terms (see Fig. 3, steps A1-A3).

D. Consumer Sentiment Using a SOM

For this part of the analysis, all posts were manually labeled as positive or negative according to the general user opinion observed within the post before feeding the collected data for exploratory analysis via SOMs. The manual labeling allowed us to use this as a method of results validation.

SOMs are neural networks that produce a low-dimensional representation of high-dimensional data [33]. Within this network, a layer represents the output space, with each neuron assigned a specific weight. The weight values reflect the cluster content. The SOM displays the data to the network, bringing together similar data weights to similar neurons.

The benefits and capabilities have been demonstrated where, despite the reduction of the space size, the information and identification schema of the clusters remained the same [36]. When new data is fed into the network, the closest weights matching the data change to reflect the new data. The neurons farther from the new data rarely change. This process continues until data is no longer fed, resulting in a two-dimensional map.

The SOM toolbox (www.cis.hut.fi/projects/somtoolbox) [37] was used and the SOM was fed with the TF-IDF vectors of our first wordlist (see Table I). The purpose was to assess the existence of clusters in the data and how the SOM weights of these clusters would correlate to positive and negative opinion.
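The update behavior described in Section II-D (the closest neuron moves toward each new input while distant neurons barely change) can be sketched in plain Python. The study itself used the SOM Toolbox, so everything here is an assumption made for illustration: the function names, the Gaussian neighborhood, the linear decay of learning rate and radius, and the toy data.

```python
import math
import random

def best_matching_neuron(weights, x):
    """Grid position (row, col) of the neuron whose weight vector is closest to x."""
    best, best_dist = None, float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            dist = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
            if dist < best_dist:
                best, best_dist = (r, c), dist
    return best

def train_som(data, rows, cols, epochs=20, lr0=0.5, seed=0):
    """Train a rows x cols map; returns weights[row][col] as lists of floats."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    radius0 = max(rows, cols) / 2
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1 - frac)                      # learning rate decays linearly
        radius = max(radius0 * (1 - frac), 0.5)    # neighborhood shrinks over time
        for x in data:
            br, bc = best_matching_neuron(weights, x)
            for r in range(rows):
                for c in range(cols):
                    grid_dist2 = (r - br) ** 2 + (c - bc) ** 2
                    # Neurons far from the best match on the grid get a small update
                    h = math.exp(-grid_dist2 / (2 * radius ** 2))
                    w = weights[r][c]
                    for k in range(dim):
                        w[k] += lr * h * (x[k] - w[k])
    return weights

# Hypothetical two-dimensional inputs standing in for the 110-variable TF-IDF vectors
data = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
som = train_som(data, rows=3, cols=3)
```

After training, `best_matching_neuron(som, post_vector)` gives the map position of a post, which is the basis for inspecting where positive and negative posts cluster.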
The SOM was trained using various map sizes, using quantization and topographic errors as validation measures. The former is the average distance between every input vector and its best matching neuron (BMN), and measures how well the trained map fits the input data [33]. The latter measures how well the map preserves the topology of the data: it is the proportion of input vectors whose first and second BMNs are not adjacent on the map.

The best map size was based on the minimum values of the quantization (0.24) and topographic ($10^{-5}$) errors. The wordlist data was mapped and the emerging weights were analyzed for positive and negative variable correlations of the wordlist. Words of no interest, and groups containing three or fewer words, were eliminated.

Subgroups were visually identified and analyzed for further information on the consumer opinion of Erlotinib.

E. Modeling Forum Postings Using Network Analysis

Discovering influential users was the next step in our analysis. To this goal, we built networks from forum posts and their replies, while accounting for content-based grouping of posts resulting from the existing forum threads. Networks are composed of nodes and their connections: they are either nondirectional (a connection between two nodes without a direction) or directional (a connection with an origin and an end). The nodal degree of the latter measures the number of connections from the origin to the destination. Four node types have been identified [38] within a network: isolated, transmitter, receptor, and carrier. The network’s density measures the current number of connections relative to the number of possible connections.

Network-based analysis is widely used in social network analysis because of its ability to both model and analyze intersocial dynamics.
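The node types and the density measure mentioned above can be made concrete with a small sketch for a directed network. The definitions used here are the usual ones and are an assumption, since the text only names the four types: an isolated node has no edges, a transmitter only has outgoing edges, a receptor only has incoming edges, and a carrier has both. Density is taken as the number of edges divided by the n(n - 1) possible directed edges. Function names and the toy network are hypothetical.

```python
def degrees(nodes, edges):
    """In-degree and out-degree of every node for a list of (source, target) edges."""
    indeg = {n: 0 for n in nodes}
    outdeg = {n: 0 for n in nodes}
    for src, dst in edges:
        outdeg[src] += 1
        indeg[dst] += 1
    return indeg, outdeg

def node_type(indeg, outdeg):
    """Classify a node by its in-degree and out-degree (assumed definitions)."""
    if indeg == 0 and outdeg == 0:
        return "isolated"
    if indeg == 0:
        return "transmitter"
    if outdeg == 0:
        return "receptor"
    return "carrier"

def density(nodes, edges):
    """Share of the n * (n - 1) possible directed edges that are present."""
    n = len(nodes)
    return len(set(edges)) / (n * (n - 1)) if n > 1 else 0.0

# Hypothetical four-user network
nodes = [1, 2, 3, 4]
edges = [(1, 2), (2, 3)]
indeg, outdeg = degrees(nodes, edges)
types = {n: node_type(indeg[n], outdeg[n]) for n in nodes}
```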
We devised a directional network model due to the nature of the forum under scrutiny (multiple threads with multiple thread initiators) and its internal dynamics among the members (members reply to thread initiators as well as to other users). Fig. 2 describes the approach we chose to build our network, which shows how each posting-reply pair is modeled. Based on the nature of the forum, all of the posters within each thread are context posters for the thread initiator (e.g., Node 1 is the thread initiator in Fig. 2 and Nodes 2, 3, 4, and 5 represent context posters). Thus, all of the posters receive an incoming edge from the thread initiator. Some context posters respond directly to another poster, using the forum option ‘Reply.’ We used bidirectional edges to reflect the ensuing information transfer from the poster to the replier and vice versa (in Fig. 2, Node 5 is a direct replier to Node 4, as is Node 3 to Node 2). This user-interaction model allowed us to build a network that faithfully reflects the information content of the forum.

Fig. 3. Diagram describing the framework of our network-based analysis. First, the posts collected from the forum via Rapidminer are preprocessed using the NLTK Toolbox (Step A1) and transformed into two wordlists (Step A2). For this step, direct mapping to the MeSH vocabulary is used to identify words representing side-effects. Based on the two wordlists, forum posts are transformed into numerical vectors containing word-frequency based TF-IDF scores (Step A3). In parallel, forum posts and replies are modeled as a directed network (Step B1). Obtained network is further refined to identify communities/modules of highly interacting users, based on the MCSD method [28] (Step B2). Finally, the two wordlist vectors datasets (their info reflecting the forum information content) are overlaid onto the network modules to identify influential users and highlight side-effects intensively discussed within the modules, respectively (Step B3).

F. Identifying Subgraphs

Our modeling framework has consequently converted the forum posts into several large directional networks containing a number of densely connected units (or modules) (see Fig. 3, step B1). These modules have the characteristic that they are more densely connected internally (within the unit) than externally (outside the unit). We chose a multiscale method that uses local and global criteria for identifying the modules, while maximizing a partition quality measure called stability [28].

The stability measure considers the network as a Markov chain, with nodes representing states and edges being possible transitions among these states. In [28], the authors proposed an approach in which transition probabilities for a random walk of length t (t being the Markov time) enable multiscale analysis. With increasing scale t, larger and larger modules are found. The stability of a walk of length t can be expressed as

$$Q_{M_t} = \frac{1}{2m} \sum_{i,j} \left[ (A_t)_{ij} - \frac{d_i d_j}{2m} \right] \delta(i,j) \qquad (1)$$

where $A_t$ is the adjacency matrix, t is the length of the random walk, m is the number of edges, i and j are nodes, $d_i$ is node i’s strength (and $d_j$ node j’s), and $\delta(i,j)$ equals one if nodes i and j belong to the same module and zero otherwise. $A_t$ is computed as follows (in order to account for the random walk): $A_t = D \cdot M^t$, where $M = D^{-1} \cdot A$ (D being the diagonal matrix containing the degree vector giving for each node its degree) [28].

The method for identifying the optimal modules is based on alternating local and global criteria that expand modules by adding neighbor nodes, reassigning nodes to different modules, and merging significantly overlapping modules until no further optimization is feasible, according to (1).
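To make (1) concrete, the sketch below evaluates the stability of a given partition on a small network. It only scores a partition; it is not the module search procedure of [28]. Several simplifications are assumptions of this example: the network is undirected and unweighted, the Markov time t is a non-negative integer (the fractional values of t used in the study would need a non-integer matrix power), and the names `mat_mul`, `stability`, and `module_of` are hypothetical.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def stability(adj, module_of, t=1):
    """Stability of a partition, following (1), for an integer Markov time t.

    adj: symmetric 0/1 adjacency matrix; module_of[i]: module label of node i.
    """
    n = len(adj)
    deg = [sum(row) for row in adj]      # d_i
    two_m = sum(deg)                     # 2m
    # M = D^-1 A: one-step transition probabilities of the random walk
    step = [[adj[i][j] / deg[i] if deg[i] else 0.0 for j in range(n)]
            for i in range(n)]
    m_t = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(t):
        m_t = mat_mul(m_t, step)
    # A_t = D M^t
    a_t = [[deg[i] * m_t[i][j] for j in range(n)] for i in range(n)]
    total = 0.0
    for i in range(n):
        for j in range(n):
            if module_of[i] == module_of[j]:     # delta(i, j) = 1
                total += a_t[i][j] - deg[i] * deg[j] / two_m
    return total / two_m

# Hypothetical network: two triangles joined by a single edge
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = [[0] * 6 for _ in range(6)]
for a, b in edges:
    adj[a][b] = adj[b][a] = 1

q_two_modules = stability(adj, [0, 0, 0, 1, 1, 1], t=1)
q_one_module = stability(adj, [0, 0, 0, 0, 0, 0], t=2)
```

For t = 1 the matrix A_t equals A, so the score reduces to the familiar modularity of the partition; placing every node in a single module always scores zero.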
The approach follows similar methods presented in [28], [39], and [40].

Several partitioning schemes were obtained depending on the range of scales employed by the method, with the optimal partitioning having the largest stability. We named the modules thus retrieved information modules (see step B2 in Fig. 3).

G. Module Average Opinion and User Average Opinion

We then proceeded to refine the information modules by feeding them with the information obtained from the forum posts (using the wordlist vectors). In a first step, we aimed at identifying influential users within our networks. Influential users are users who broker most of the information transfer within network modules and whose opinion, in terms of positive or negative sentiment towards the treatment, is ‘spread’ to the other users within their containing modules. To this goal, we enriched the information modules obtained as described in Section II-F with the TF-IDF scores of the user posts corresponding to the users found in each module. The TF-IDF scores from the wordlist of positive and negative words (see Table I) were used to build two forms of measurement. The global measure (pertaining to the whole information module) is represented by the module average opinion (MAO). It examines the TF-IDF scores of postings matching the nodes in a specific module:

$$\mathrm{MAO} = \frac{\mathrm{Sum}_{+} - \mathrm{Sum}_{-}}{\mathrm{Sum}_{\mathrm{all}}}$$

$\mathrm{Sum}_{+} = \sum_{i}\sum_{j} x_{ij}$ is the total sum of the TF-IDF scores matching the positive words in the wordlist vectors within the module. The index i is the post index. The index j is the wordlist index (matching the positive words in the list).

$\mathrm{Sum}_{-} = \sum_{i}\sum_{j} x_{ij}$ is the total sum of the TF-IDF scores matching the negative words in the wordlist vectors within the module. The index i is the post index. The index j is the wordlist index (matching the negative words in the list).

$\mathrm{Sum}_{\mathrm{all}} = \sum_{i=1}^{N}\sum_{k=1}^{M} x_{ik}$ is the sum of both of the aforementioned sums.
The index k runs across the variables of the entire wordlist.

The local measure, the user average opinion (UAO), captures the opinion of the specific user at each node in the module. It examines the TF-IDF scores averaged over the collected posts of the user:

$$\mathrm{UAO}_i = \frac{\mathrm{Sum}^{i}_{+} - \mathrm{Sum}^{i}_{-}}{\mathrm{Sum}^{i}_{\mathrm{all}}}$$

$\mathrm{Sum}^{i}_{+} = \sum_{j \in P} x_{ij}$ is the sum of the TF-IDF scores matching positive words for the ith user’s wordlist vector. P is the index set denoting the wordlist’s positive variables.

$\mathrm{Sum}^{i}_{-} = \sum_{j \in N} x_{ij}$ is the sum of the TF-IDF scores matching negative words for the ith user’s wordlist vector. N is the index set denoting the wordlist’s negative variables.

$\mathrm{Sum}^{i}_{\mathrm{all}} = \sum_{j=1}^{M} x_{ij}$ is the total of both sums, and j is the index of the whole wordlist.

Fig. 4. U-Matrix of the posts from the Cancerforums.net forum.

H. Information Brokers Within the Information Modules

We first ranked individual nodes in terms of their total number of connecting edges (in and out-degree) to identify influential users within the modules.

We then looked for nodes in each module based on the following criteria:

1) The nodes have the highest degrees within the module (highest number of edges).
2) The UAO scores have the same sign as the MAO of the containing module.

The nodes that met these criteria were dubbed information brokers. Their large nodal degrees ensure increased information transfer compared to other nodes, while their matching UAO and MAO scores reflect consistency of positive or negative opinion within the containing module.

I. Network-Based Identification of Side Effects

In the second step of our network-based analysis, we devised a strategy for identifying potential side effects that occur during the treatment and that user posts on the forum highlight. To this goal, we overlay the TF-IDF scores of the second wordlist (see Table II) onto modules obtained in Section II-F.
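The MAO and UAO measures of Section II-G and the two broker criteria of Section II-H can be sketched together. This is an illustration rather than the study's implementation: the helper names, the `top_k` cutoff used to decide which nodes count as "highest degree", and the four-word toy wordlist are all assumptions.

```python
def opinion_score(vectors, positive_idx, negative_idx):
    """(Sum+ - Sum-) / Sum_all over a set of TF-IDF wordlist vectors."""
    pos = sum(v[j] for v in vectors for j in positive_idx)
    neg = sum(v[j] for v in vectors for j in negative_idx)
    total = pos + neg
    return (pos - neg) / total if total else 0.0

def information_brokers(module_posts, degree, positive_idx, negative_idx, top_k=2):
    """module_posts: {user: [tfidf_vector, ...]}; degree: {user: in + out degree}.

    Returns the module average opinion and the users among the top_k by degree
    whose user average opinion has the same sign as the MAO.
    """
    all_vectors = [v for vectors in module_posts.values() for v in vectors]
    mao = opinion_score(all_vectors, positive_idx, negative_idx)
    ranked = sorted(module_posts, key=lambda u: degree[u], reverse=True)[:top_k]
    brokers = []
    for user in ranked:
        uao = opinion_score(module_posts[user], positive_idx, negative_idx)
        if uao * mao > 0:        # same sign as the module's opinion
            brokers.append(user)
    return mao, brokers

# Hypothetical module: wordlist positions 0-1 are positive words, 2-3 negative
module_posts = {
    "ann": [[0.5, 0.2, 0.0, 0.1]],
    "bob": [[0.0, 0.1, 0.6, 0.3]],
    "cy": [[0.4, 0.4, 0.1, 0.0], [0.3, 0.0, 0.0, 0.0]],
}
degree = {"ann": 5, "bob": 4, "cy": 1}
mao, brokers = information_brokers(module_posts, degree, [0, 1], [2, 3])
```

In this toy module the overall opinion is positive, so the well-connected but negative user is not reported as a broker even though the degree criterion is met.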
The TF-IDF scores within each module will thus directly reflect how frequently a certain side-effect is mentioned in module posts. Subsequently, a statistical test (such as the t-test, for example) can be used to compare the values of the TF-IDF scores within the module to those of the overall forum population and identify variables (side-effects) that have significantly higher scores.

Fig. 3 presents a diagram that visually describes the steps in our network-based analysis.

III. RESULTS

Fig. 4 shows the unified matrix resulting from the SOM analysis for the wordlist vectors corresponding to the positive and negative terms from the message board Cancerforums.net. A subset consisting of 30% of the data was used for training the SOM. We used a 12 × 12 map size with 110 variables corresponding to the positive and negative terms to ascertain how the weight of the words corresponded to the opinion of the drug Erlotinib. As mentioned in Section II, each word from the list appeared at least ten times. This achieved a uniform measurement set while eliminating statistically insignificant outliers. Much of the users’ posts converged on three areas of the map. We checked the respective nodes’ correlation with their weight vectors’ values corresponding to positive or negative words to define the positive and negative areas of the map.

The user opinion of Erlotinib was overall satisfactory, with Table III summarizing the satisfaction/dissatisfaction below:

TABLE III
USER OPINION OF ERLOTINIB

Satisfaction                           Dissatisfaction
70 percent                             30 percent

BREAKDOWN OF USER OPINION

Fully Satisfied (23)                   Full Dissatisfaction (4)
Satisfied Despite Side Effects (37)    Dissatisfaction because of Side Effects (20)
Satisfied Despite Costs (10)           Dissatisfaction because of Costs (6)

According to the chart, and from our readings of both the user posts and the SOM, the most pressing concern from both camps was the side effects, which are extensively documented in the medical literature [41]–[46]. The costs of the drug were another matter of concern (albeit limited).

We then proceeded to identify influential users. Our modeling approach initially yielded a single loosely connected network, linking all users within the forum. Subsequent module identification using the methods described in Section II-F yielded an optimal partitioning containing five densely connected modules. We varied our scale parameter within the interval t ∈ [0, 2] in 0.1 increments, as suggested by [28]. Varying the scale parameter resulted in a set of partitions ranging from modules based on single individual users (for scale parameter t = 0), to large modules (for values of t close to the upper limit of the interval). The optimal partition (maximizing the quality measure in (1)) was obtained for t = 1.

On the Cancerforums.net message board, ten users among the authors of the 920 posts were identified as information brokers, as shown in Fig. 5(a)–(e) below.

Densities of the retrieved modules range from 0.2 to 0.6. These density values were within the interval of density values generally noted in social networks (towards the upper limit), thus confirming our network modeling approach [47], [48].
Information brokers were identified following the procedure described in Sections II-G–H. Further scrutinizing these users and their containing modules, we confirmed their connections were the densest. A thorough reading of these ten users’ posts throughout the threads they started and participated in revealed that they were informative and actively interacting with users across many threads. Other members sought out these ten posters for their wisdom and experience. Their forum ‘behavior’ has confirmed to us that these users were the premier information brokers of the drug Erlotinib on the Cancerforums.net forum.

Fig. 5. Ten users were identified as information brokers on the Cancerforums.net Forum. Modules in parts a)–e) show where these ten users reside in the forum.

TABLE IV
SIDE-EFFECT FREQUENCY AND LOCATION IN SELECTED MODULES

Module 1 (A)    ‘rash’ (p-value < 0.01), ‘itch’ (p-value < 0.05)
Module 2 (B)    ‘rash’ (p-value < 0.05)
Module 5 (E)    ‘rash’ (p-value < 0.01)

In the last part of our analysis, we investigated which modules were significantly involved in discussing specific side effects. As described in Section II-I, retrieved modules were enriched with the TF-IDF scores corresponding to the side-effect wordlist vectors. For each module and each side-effect scores sample, t-tests were performed to assess the significant difference between the in-module sample and the overall forum population scores. Rash and itching were identified as the side effect terms with significantly higher scores in Modules 1, 2, and 5 when compared to the overall scores population in the forum, as described in Table IV. This reflects the fact that users grouped within these modules repeatedly discussed these side effects in their posts. This was confirmed by subsequent scrutiny of the respective posts.
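The comparison of in-module scores against the overall forum scores can be illustrated with a t statistic computed from the standard library alone. The text does not say which variant of the t-test was used, so the unequal-variance (Welch) form below is an assumption, as are the function name and the toy scores. The sketch returns only the statistic, not a p-value.

```python
import math
from statistics import mean, variance

def welch_t(sample, population):
    """Welch's t statistic for the difference in means of two score samples."""
    m1, m2 = mean(sample), mean(population)
    v1, v2 = variance(sample), variance(population)   # sample variances
    n1, n2 = len(sample), len(population)
    return (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)

# Hypothetical TF-IDF scores of the term 'rash'
module_scores = [0.8, 0.9, 1.0, 0.7]
forum_scores = [0.1, 0.2, 0.0, 0.3, 0.2, 0.1]
t_stat = welch_t(module_scores, forum_scores)
```

A large positive statistic indicates that the term scores higher inside the module than in the forum as a whole. A library routine such as `scipy.stats.ttest_ind(a, b, equal_var=False)` would additionally return the p-value of the kind reported in Table IV.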
A literature search confirmed that rash and itching are indeed two of the most common side-effects of Erlotinib, with as much as 70% of the patients affected, as indicated by clinical studies [44]–[46].

IV. DISCUSSION

We converted a forum focused on oncology into weighted vectors to measure consumer thoughts on the drug Erlotinib using positive and negative terms alongside another list containing the side effects. Our methods were able to investigate positive and negative sentiment on lung cancer treatment using the drug by mapping the large dimensional data onto a lower dimensional space using the SOM. Most of the user data was clustered to the area of the map linked to positive sentiment, thus reflecting the general positive view of the users. Subsequent network based modeling of the forum yielded interesting insights on the underlying information exchange among users. Modules of strongly interacting users were identified using a multiscale community detection method described in [28]. By overlaying these modules with content-based information in the form of word-frequency scores retrieved from user posts, we were able to identify information brokers which seem to play important roles in shaping the information content of the forum. Additionally, we were able to identify potential side effects consistently discussed by groups of users. Such an approach could be used to raise red flags in future clinical surveillance operations, as well as highlighting various other treatment related issues. The results have opened new possibilities for developing advanced solutions, as well as revealing challenges in developing such solutions.

The consensus on Erlotinib depends on individual patient experience. Social media, by its nature, will bring together different individuals with different experiences and viewpoints. We sifted through the data to find positive and negative sentiment, which was later confirmed by research that emerged regarding Erlotinib’s effectiveness and side effects.
Future studies will require more up-to-date information for a clearer picture of user feedback on drugs and services.

Future solutions will require more advanced detection of intersocial dynamics and its effects on the members: such interests of study may include rankings, ‘likes’ of posts, and friendships. Further emphasis on context posting will require formal language dictionaries that include medical terms for specific diseases, and informal language terms (‘slang’) to clarify posts. Finally, different platforms will allow up-to-date information on the status of the drug in case one social platform ceases to discuss the drug. Another solution can look at multiple wordlists that can include multiple treatments that, when combined with contextual posting and medical lexical dictionaries, can pinpoint the source (or multiple sources) of user satisfaction (or dissatisfaction), which can open the door towards mapping consumer sentiment of multidrug therapies for advanced diseases.

The combined solutions can open new avenues of postmarketing surveillance research as companies seek real-time, ‘intelligent’ data on their products and services to remain competitive. This solution can be envisioned on future medical devices that can serve as a postmarketing feedback loop that consumers can use to express their satisfaction (or dissatisfaction) directly to the company. The company benefits from real-time feedback that can then be used to assess if there are any problems and rapidly address such problems.

Social media can open the door for the health care sector in addressing cost reduction, product and service optimization, and patient care.

Neighbor Similarity Trust against Sybil Attack in P2P E-Commerce

05/08/201902/07/2019 by admin

In this paper, we present a distributed structured approach to defending against the Sybil attack. Our approach is based on the neighbor similarity trust relationship among neighbor peers. Given a P2P e-commerce trust relationship based on interest, the transactions among peers are flexible, as each peer can decide to trade with another peer at any time. A peer doesn’t have to consult others in a group unless a recommendation is needed. This approach shows the advantage of exploiting the similarity trust relationship among peers, in which the peers are able to monitor each other.

Our contribution in this paper is threefold:

1) We propose SybilTrust, which can identify Sybil peers and protect honest peers from Sybil attack. The Sybil peers can have their trust canceled and be dismissed from a group.

2) Based on the group infrastructure in P2P e-commerce, each neighbor is connected to the peers by the success of the transactions it makes or by its trust evaluation level. A peer can only be recognized as a neighbor if its trust level is sustained above a threshold value.

3) SybilTrust enables neighbor peers to carry recommendation identifiers among the peers in a group. This ensures that the group detection algorithms that identify Sybil attack peers are efficient and scalable in large P2P e-commerce networks.

GOAL OF THE PROJECT:

The goal of trust systems is to ensure that honest peers are accurately identified as trustworthy and Sybil peers as untrustworthy. To unify terminology, we call all identities created by malicious users Sybil peers. In a P2P e-commerce application scenario, most of the trust considerations depend on the historical factors of the peers. The influence of Sybil identities can be reduced based on the historical behavior and recommendations from other peers. For example, a peer can give a positive recommendation to a peer that is later discovered to be a Sybil or malicious peer. Detecting such recommendations can diminish the influence of Sybil identities and hence reduce Sybil attacks. A peer which has been giving dishonest recommendations will have its trust level reduced. If its trust level falls to a certain threshold, the peer can be expelled from the group. Each peer has an identity, which is either honest or Sybil.
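The trust reduction and expulsion rule described above can be sketched as a small bookkeeping class. This is not the SybilTrust algorithm itself, whose details are not given here. The class name `TrustGroup`, the fixed penalty, the starting trust value, and the threshold are all hypothetical choices made for the example.

```python
class TrustGroup:
    """Tracks a trust level per peer and expels peers that fall to the threshold."""

    def __init__(self, threshold=0.5, penalty=0.25, initial_trust=1.0):
        # All three values are illustrative assumptions, not taken from the paper
        self.threshold = threshold
        self.penalty = penalty
        self.initial_trust = initial_trust
        self.trust = {}

    def join(self, peer):
        self.trust[peer] = self.initial_trust

    def report_dishonest_recommendation(self, peer):
        """Lower the peer's trust; return True if the peer was expelled."""
        if peer not in self.trust:
            return False
        self.trust[peer] = max(0.0, self.trust[peer] - self.penalty)
        if self.trust[peer] <= self.threshold:
            del self.trust[peer]     # expelled from the group
            return True
        return False

group = TrustGroup()
group.join("peer_a")
first = group.report_dishonest_recommendation("peer_a")    # trust drops to 0.75
second = group.report_dishonest_recommendation("peer_a")   # trust drops to 0.5
```

With these example values a peer survives one dishonest recommendation and is expelled after the second, since its trust then reaches the threshold.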

A Sybil identity can be an identity owned by a malicious user, a bribed or stolen identity, or a fake identity obtained through a Sybil attack. These Sybil attack peers are employed to target honest peers and hence subvert the system. In a Sybil attack, a single malicious user creates a large number of peer identities called sybils. These sybils are used to launch security attacks, both at the application level and at the overlay level. At the application level, sybils can target other honest peers while transacting with them, whereas at the overlay level, sybils can disrupt the services offered by the overlay layer, such as routing, data storage, and lookup. In trust systems, colluding Sybil peers may artificially increase a (malicious) peer’s rating (e.g., eBay).

1.2 INTRODUCTION:

P2P networks range from communication systems like email and instant messaging to collaborative content rating, recommendation, and delivery systems such as YouTube, Gnutella, Facebook, Digg, and BitTorrent. They allow any user to join the system easily at the expense of trust, with very little validation control. P2P overlay networks are known for their many desired attributes, such as openness, anonymity, decentralized nature, self-organization, scalability, and fault tolerance. Each peer plays the dual role of client and server, meaning that each has its own control. All the resources utilized in the P2P infrastructure are contributed by the peers themselves, unlike traditional methods where a central authority exercises control. Peers can collude and carry out all sorts of malicious activities in open-access distributed systems. These malicious behaviors lead to service quality degradation and monetary loss among business partners. Peers are vulnerable to exploitation because the system is open and creating new identities costs almost nothing. The peer identities are then utilized to influence the behavior of the system.

However, if a single defective entity can present multiple identities, it can control a substantial fraction of the system, thereby undermining the redundancy. The number of identities that an attacker can generate depends on the attacker’s resources such as bandwidth, memory, and computational power. The goal of trust systems is to ensure that honest peers are accurately identified as trustworthy and Sybil peers as untrustworthy. To unify terminology, we call all identities created by malicious users Sybil peers. In a P2P e-commerce application scenario, most of the trust considerations depend on the historical factors of the peers. The influence of Sybil identities can be reduced based on the historical behavior and recommendations from other peers. For example, a peer may give a positive recommendation to a peer which is later discovered to be a Sybil or malicious peer. Accounting for such recommendations can diminish the influence of Sybil identities and hence reduce Sybil attacks. A peer which has been giving dishonest recommendations will have its trust level reduced. If its trust level reaches a certain threshold, the peer can be expelled from the group.

Each peer has an identity, which is either honest or Sybil. A Sybil identity can be an identity owned by a malicious user, a bribed or stolen identity, or a fake identity obtained through a Sybil attack. These Sybil attack peers are employed to target honest peers and hence subvert the system. In a Sybil attack, a single malicious user creates a large number of peer identities called sybils. These sybils are used to launch security attacks at both the application level and the overlay level. At the application level, sybils can target other honest peers while transacting with them, whereas at the overlay level, sybils can disrupt the services offered by the overlay layer, such as routing, data storage, and lookup. In trust systems, colluding Sybil peers may artificially increase a (malicious) peer’s rating (e.g., eBay). Systems like Credence rely on a trusted central authority to prevent maliciousness.

Defending against Sybil attack is quite a challenging task. A peer can pretend to be trusted with a hidden motive. The peer can pollute the system with bogus information, which interferes with genuine business transactions and the functioning of the systems. This must be prevented to protect the honest peers. The link between an honest peer and a Sybil peer is known as an attack edge. As each edge represents a human-established trust relationship, it is difficult for the adversary to introduce an excessive number of attack edges. The only known promising defense against Sybil attack is to use social networks to perform user admission control and limit the number of bogus identities admitted to a system. A link between two peers in a social network represents a real-world trust relationship between users. In addition, authentication-based mechanisms are used to verify the identities of the peers using shared encryption keys or location information.
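To make the notion of an attack edge concrete, the following sketch counts the links that join an honest peer to a Sybil peer. It assumes the Sybil set is already known, which is only true in a simulation or after detection; the peer names and the edge list are hypothetical.

```python
from typing import Iterable, Set, Tuple

Edge = Tuple[str, str]


def attack_edges(edges: Iterable[Edge], sybil_peers: Set[str]) -> Set[Edge]:
    """Return the edges with exactly one endpoint in the Sybil set."""
    # An edge is an attack edge when one endpoint is honest and the other is Sybil.
    return {(u, v) for u, v in edges if (u in sybil_peers) != (v in sybil_peers)}


edges = [("a", "b"), ("b", "s1"), ("s1", "s2")]
print(attack_edges(edges, sybil_peers={"s1", "s2"}))  # {('b', 's1')}
```

Edges between two Sybil peers are cheap for the adversary to create and are not counted; only the honest-to-Sybil links are limited by real-world trust.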

1.3 LITERATURE SURVEY:

KEEP YOUR FRIENDS CLOSE: INCORPORATING TRUST INTO SOCIAL NETWORK-BASED SYBIL DEFENSES

AUTHOR: A. Mohaisen, N. Hopper, and Y. Kim

PUBLISH: Proc. IEEE Int. Conf. Comput. Commun., 2011, pp. 1–9.

EXPLANATION:

Social network-based Sybil defenses exploit the algorithmic properties of social graphs to infer the extent to which an arbitrary node in such a graph should be trusted. However, these systems do not consider the different amounts of trust represented by different graphs, and different levels of trust between nodes, though trust is a crucial requirement in these systems. For instance, co-authors in an academic collaboration graph are trusted in a different manner than social friends. Furthermore, some social friends are more trusted than others. However, previous designs for social network-based Sybil defenses have not considered the inherent trust properties of the graphs they use. In this paper we introduce several designs to tune the performance of Sybil defenses by accounting for differential trust in social graphs and modeling these trust values by biasing random walks performed on these graphs. Surprisingly, we find that the cost function, the required length of random walks to accept all honest nodes with overwhelming probability, is much greater in graphs with high trust values, such as co-author graphs, than in graphs with low trust values such as online social networks. We show that this behavior is due to the community structure in high-trust graphs, requiring longer walks to traverse multiple communities. Furthermore, we show that our proposed designs to account for trust, while increasing the cost function of graphs with low trust value, decrease the advantage of the attacker.
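As a rough illustration of what biasing a random walk by trust can mean, the sketch below picks each next hop with probability proportional to an edge trust weight. This is a generic weighted random walk, not one of the specific designs from the cited paper; the graph, the weights, and the function name are assumptions.

```python
import random
from typing import Dict, List

Graph = Dict[str, Dict[str, float]]  # node -> {neighbor: trust weight}


def biased_random_walk(graph: Graph, start: str, length: int, rng: random.Random) -> List[str]:
    """Walk `length` steps, choosing each neighbor in proportion to its trust weight."""
    path = [start]
    node = start
    for _ in range(length):
        neighbors = graph.get(node, {})
        if not neighbors:
            break  # dead end: stop the walk early
        candidates = list(neighbors)
        weights = [neighbors[n] for n in candidates]  # assumed positive
        node = rng.choices(candidates, weights=weights, k=1)[0]
        path.append(node)
    return path


# Hypothetical graph: b trusts a far more than it trusts c.
graph = {"a": {"b": 1.0}, "b": {"a": 0.9, "c": 0.1}, "c": {"b": 1.0}}
print(biased_random_walk(graph, "a", 5, random.Random(1)))
```

With these weights the walk tends to stay on the high-trust edge between a and b, which hints at why walks in strongly clustered, high-trust graphs need more steps to cross communities.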

FOOTPRINT: DETECTING SYBIL ATTACKS IN URBAN VEHICULAR NETWORKS

AUTHOR: S. Chang, Y. Qi, H. Zhu, J. Zhao, and X. Shen

PUBLISH: IEEE Trans. Parallel Distrib. Syst., vol. 23, no. 6, pp. 1103–1114, Jun. 2012.

EXPLANATION:

In urban vehicular networks, where privacy, especially the location privacy of anonymous vehicles, is a major concern, anonymous verification of vehicles is indispensable. Consequently, an attacker who succeeds in forging multiple hostile identities can easily launch a Sybil attack, gaining a disproportionately large influence. In this paper, we propose a novel Sybil attack detection mechanism, Footprint, using the trajectories of vehicles for identification while still preserving their location privacy. More specifically, when a vehicle approaches a road-side unit (RSU), it actively demands an authorized message from the RSU as the proof of the appearance time at this RSU. We design a location-hidden authorized message generation scheme for two objectives: first, RSU signatures on messages are signer ambiguous so that the RSU location information is concealed from the resulting authorized message; second, two authorized messages signed by the same RSU within the same given period of time (temporarily linkable) are recognizable so that they can be used for identification. With the temporal limitation on the linkability of two authorized messages, authorized messages used for long-term identification are prohibited. With this scheme, vehicles can generate a location-hidden trajectory for location-privacy-preserved identification by collecting a consecutive series of authorized messages. Utilizing social relationship among trajectories according to the similarity definition of two trajectories, Footprint can recognize and therefore dismiss “communities” of Sybil trajectories. Rigorous security analysis and extensive trace-driven simulations demonstrate the efficacy of Footprint.

SYBILLIMIT: A NEAR-OPTIMAL SOCIAL NETWORK DEFENSE AGAINST SYBIL ATTACK

AUTHOR: H. Yu, P. Gibbons, M. Kaminsky, and F. Xiao

PUBLISH: IEEE/ACM Trans. Netw., vol. 18, no. 3, pp. 3–17, Jun. 2010.

EXPLANATION:

Decentralized distributed systems such as peer-to-peer systems are particularly vulnerable to sybil attacks, where a malicious user pretends to have multiple identities (called sybil nodes). Without a trusted central authority, defending against sybil attacks is quite challenging. Among the small number of decentralized approaches, our recent SybilGuard protocol [H. Yu et al., 2006] leverages a key insight on social networks to bound the number of sybil nodes accepted. Although its direction is promising, SybilGuard can allow a large number of sybil nodes to be accepted. Furthermore, SybilGuard assumes that social networks are fast mixing, which has never been confirmed in the real world. This paper presents the novel SybilLimit protocol that leverages the same insight as SybilGuard but offers dramatically improved and near-optimal guarantees. The number of sybil nodes accepted is reduced by a factor of Θ(√n), or around 200 times in our experiments for a million-node system. We further prove that SybilLimit’s guarantee is at most a log n factor away from optimal, when considering approaches based on fast-mixing social networks. Finally, based on three large-scale real-world social networks, we provide the first evidence that real-world social networks are indeed fast mixing. This validates the fundamental assumption behind SybilLimit’s and SybilGuard’s approach.

CHAPTER 2

2.0 SYSTEM ANALYSIS

2.1 EXISTING SYSTEM:

Existing work on Sybil attack makes use of social networks to eliminate Sybil attack, and the findings are based on preventing Sybil identities. In this paper, we propose the use of neighbor similarity trust in a group P2P e-commerce based on interest relationships, to eliminate maliciousness among the peers. This is referred to as SybilTrust. In SybilTrust, the peers in the interest-based group infrastructure have a neighbor similarity trust between each other, hence they are able to prevent Sybil attack. SybilTrust gives a better relationship in e-commerce transactions as the peers create a link between peer neighbors. This provides an important avenue for peers to advertise their products to other interested peers and to learn of new market destinations and contacts as well. In addition, the group enables a peer to join the P2P e-commerce network and makes forging identities more difficult.

Peers use self-certifying identifiers that are exchanged when they initially come into contact. These can be used as public keys to verify digital signatures on the messages sent by their neighbors. We note that all communications between peers are digitally signed. In this kind of relationship, we use neighbors as our point of reference to address Sybil attack. In a group, whatever admission policy we set, there are honest, malicious, and Sybil peers who are authenticated by an admission control mechanism to join the group. More honest peers are admitted compared to malicious peers, where the trust association is aimed at positive results. The knowledge of the graph may reside in a single party, or be distributed across all users.
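One common way to make an identifier self-certifying is to derive it from a hash of the peer's public key, so that anyone holding the key can check that it belongs to the identifier. The sketch below assumes that construction (SHA-256 over the raw key bytes); the source does not state how SybilTrust derives its identifiers, and signature verification itself is left out because it needs a cryptography library outside the standard library.

```python
import hashlib
import hmac


def self_certifying_id(public_key: bytes) -> str:
    """Derive an identifier from the public key bytes (assumed construction)."""
    return hashlib.sha256(public_key).hexdigest()


def key_matches_id(public_key: bytes, claimed_id: str) -> bool:
    """Check that a presented public key really belongs to the claimed identifier."""
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(self_certifying_id(public_key), claimed_id)


# Placeholder bytes standing in for a real encoded public key.
neighbor_key = b"example-public-key-bytes"
neighbor_id = self_certifying_id(neighbor_key)
print(key_matches_id(neighbor_key, neighbor_id))      # True
print(key_matches_id(b"some-other-key", neighbor_id))  # False
```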

2.1.0 DISADVANTAGES:

If a peer trades with very few unsuccessful transactions, we can deduce the peer is a Sybil peer. This is supported by our approach, which proposes that peers existing in a group have six types of keys.

The keys which exist mostly are pairwise keys supported by the group keys. We also note that if an honest group has a link with another group which has Sybil peers, the Sybil group tends to have information which is not complete.

Fake users can enter easily.
This enables Sybil attacks.

2.2 PROPOSED SYSTEM:

In this paper, we assume there are three kinds of peers in the system: legitimate peers, malicious peers, and Sybil peers. Each malicious peer cheats its neighbors by creating multiple identities, referred to as Sybil peers. In this paper, P2P e-commerce communities are organized in several groups. A group can be either open or restrictive depending on the interest of the peers. We investigate the peers belonging to a certain interest group. In each group, there is a group leader who is responsible for coordinating the activities in the group.

The principal building block of the SybilTrust approach is the identifier distribution process. In this approach, all the peers with similar behavior in a group can be used as identifier sources. They can send identifiers to others as the system regulates. If a peer sends fewer or more identifiers than regulated, the system may contain a Sybil attack peer. This information can be broadcast to the rest of the peers in the group. When peers join a group, they acquire different identities in reference to the group. Each peer has neighbors in the group and outside the group. Sybil attack peers forged by the same malicious peer have the same set of physical neighbors that the malicious peer has.
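The check on how many identifiers a peer sends can be sketched as follows. The regulated quota, the optional tolerance, and the function name are hypothetical; the source only says that sending fewer or more than the system regulates is suspicious.

```python
from typing import Dict, List


def flag_identifier_senders(sent_counts: Dict[str, int], expected: int, tolerance: int = 0) -> List[str]:
    """Return peers whose number of sent identifiers deviates from the regulated quota."""
    return sorted(
        pid for pid, sent in sent_counts.items()
        if abs(sent - expected) > tolerance
    )


# Hypothetical counts observed in one round; the regulated quota is 5.
sent_counts = {"p1": 5, "p2": 9, "p3": 2, "p4": 5}
print(flag_identifier_senders(sent_counts, expected=5))  # ['p2', 'p3']
```

The flagged list is what would then be broadcast to the rest of the group for further evaluation.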

Each neighbor is connected to the peers by the success of the transactions it makes or by its trust evaluation level. To detect a Sybil attack, in which a peer can have several identities, a peer is evaluated with reference to its trustworthiness and its similarity to the neighbors. If the neighbors do not have the same trust data as the concerned peer, including its position, it can be detected that the peer has multiple identities and is cheating.
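The observation that identities forged by one malicious peer share the same set of physical neighbors suggests a simple grouping test, sketched below. It is a simplified illustration, not the SybilTrust detection algorithm itself: it compares neighbor sets only and ignores trust data and position, and all names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


def suspect_sybil_groups(neighbors: Dict[str, Set[str]]) -> List[List[str]]:
    """Group identities that report exactly the same set of physical neighbors."""
    groups = defaultdict(list)
    for identity, nbrs in neighbors.items():
        groups[frozenset(nbrs)].append(identity)
    # Only groups with more than one identity are suspicious.
    return sorted(sorted(ids) for ids in groups.values() if len(ids) > 1)


# A1 and A2 are two identities of the same peer, so their neighbor sets coincide.
neighbors = {
    "A1": {"x", "y"},
    "A2": {"y", "x"},
    "B": {"x", "z"},
}
print(suspect_sybil_groups(neighbors))  # [['A1', 'A2']]
```

A match is only a reason for closer evaluation, since two honest peers can occasionally share the same neighbors.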

2.2.0 ADVANTAGES:

Our perception is that the attacker controls a number of neighbor similarity peers, whereby a randomly chosen identifier source is relatively “far away” from most Sybil attack peer relationships. Every peer uses a “reversed” routing table. The source peer will always send some information to the peers which have neighbor similarity trust. However, if they do not reply, it can blacklist them. If they do reply and the source is overwhelmed by the overhead of such replies, then the adversary is effectively launching a DoS attack against the source. This enables two peers to propagate their public keys and IP addresses backward along the route to learn about the peers.

It is helpful in finding Sybil attacks.
It is used to find fake user IDs.
It is feasible to limit the number of attack edges in online social networks by relationship rating.

2.3 HARDWARE & SOFTWARE REQUIREMENTS:

2.3.0 HARDWARE REQUIREMENT:

- Processor – Pentium IV
- Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Key Board – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA

2.3.1 SOFTWARE REQUIREMENTS:

Operating System : Windows XP or Win7
Front End : JAVA JDK 1.7
Script : JavaScript
Tools : Netbeans 7
Document : MS-Office 2007

CHAPTER 3

3.0 SYSTEM DESIGNS

3.1 ARCHITECTURE DIAGRAM:

3.2 DATAFLOW DIAGRAM:

LEVEL 0: Source, Neighbor Nodes

LEVEL 1: P2P Sybil Trust Mode, Send Data Request

LEVEL 2: Data Receive, P2P ACK, Active Attack (Malicious Node), Send Data Request

LEVEL 3:

3.3 UML DIAGRAMS

3.3.0 USE CASE DIAGRAM:

SERVER CLIENT

3.3.1 CLASS DIAGRAM:

3.3.2 SEQUENCE DIAGRAM:

3.4 ACTIVITY DIAGRAM:

CHAPTER 4

4.0 IMPLEMENTATION:

In this paper, P2P e-commerce communities are organized in several groups. A group can be either open or restrictive depending on the interest of the peers. We investigate the peers belonging to a certain interest group. In each group, there is a group leader who is responsible for coordinating the activities in the group. When peers join a group, they acquire different identities in reference to the group. Each peer has neighbors in the group and outside the group. Sybil attack peers forged by the same malicious peer have the same set of physical neighbors that the malicious peer has. Each neighbor is connected to the peers by the success of the transactions it makes or by its trust evaluation level. To detect a Sybil attack, in which a peer can have several identities, a peer is evaluated with reference to its trustworthiness and its similarity to the neighbors. If the neighbors do not have the same trust data as the concerned peer, including its position, it can be detected that the peer has multiple identities and is cheating. The method of detection of Sybil attack is depicted in Fig. 2. A1 and A2 refer to the same peer but with different identities.

In our approach, the identifiers are only propagated by the peers who exhibit neighbor similarity trust. Our perception is that the attacker controls a number of neighbor similarity peers, whereby a randomly chosen identifier source is relatively “far away” from most Sybil attack peer relationships. Every peer uses a “reversed” routing table. The source peer will always send some information to the peers which have neighbor similarity trust. However, if they do not reply, it can blacklist them. If they do reply and the source is overwhelmed by the overhead of such replies, then the adversary is effectively launching a DoS attack against the source. This enables two peers to propagate their public keys and IP addresses backward along the route to learn about the peers. SybilTrust proposes that an honest peer should not have an excessive number of neighbors. The neighbors we refer to should be member peers existing in a group. The restriction helps to bound the number of peers against any additional attack among the neighbors. If there are too many neighbors, SybilTrust will (internally) only use a subset of the peer’s edges while ignoring all others. Following Liben-Nowell and Kleinberg, we define the attributes of the given pair of peers as the intersection of the sets of similar products.
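The pairwise similarity based on the intersection of product sets, together with the cap on the number of neighbors, can be sketched as below. Scoring by the size of the intersection and keeping the most similar candidates are assumptions; the source does not say which subset of edges SybilTrust retains.

```python
from typing import Dict, List, Set


def similarity(products_a: Set[str], products_b: Set[str]) -> int:
    """Similarity of two peers as the size of the intersection of their product sets."""
    return len(products_a & products_b)


def capped_neighbors(own_products: Set[str], candidates: Dict[str, Set[str]], max_neighbors: int) -> List[str]:
    """Keep at most `max_neighbors` candidates, preferring the most similar ones."""
    ranked = sorted(
        candidates,
        key=lambda pid: (-similarity(own_products, candidates[pid]), pid),  # ties broken by id
    )
    return ranked[:max_neighbors]


own = {"books", "phones", "laptops"}
candidates = {
    "p1": {"books", "phones"},
    "p2": {"shoes"},
    "p3": {"books", "phones", "laptops"},
}
print(capped_neighbors(own, candidates, max_neighbors=2))  # ['p3', 'p1']
```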

4.1 MODULES:

SIMILARITY TRUST RELATIONSHIP:

NEIGHBOR SIMILARITY TRUST:

DETECTION OF SYBIL ATTACK:

SECURITY AND PERFORMANCE:

4.2 MODULES DESCRIPTION:

SIMILARITY TRUST RELATIONSHIP:

We focus on the active attacks in P2P e-commerce. When a peer is compromised, all of its information will be extracted. In our work, we have proposed the use of SybilTrust, which is based on the neighbor similarity relationship of the peers. SybilTrust is efficient and scalable to group P2P e-commerce networks. Sybil attack peers may attempt to compromise the edges or the peers of the group P2P e-commerce network. The Sybil attack peers can then execute further malicious actions in the network. The threat being addressed is active identity attacks: as peers continuously carry out transactions with one another, the aim is to show that each controller admits only honest peers.

Our method assumes that the controller undergoes synchronization to prove whether or not the peers which acted as distributors of identifiers had similarity. If a peer never had similarity, the peer is assumed to have been a Sybil attack peer. A pairing method is used to generate an expander graph that achieves its expansion factor with high probability. Every pair of neighbor peers shares a unique symmetric secret key (the edge key), established out of band, for authenticating each other. Peers may deliberately cause Byzantine faults in which their multiple identities and incorrect behavior end up undetected.
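Authentication with a shared edge key can be illustrated with a message authentication code. The sketch assumes HMAC-SHA256 and a 32-byte key; the source only states that each pair of neighbors shares a symmetric key established out of band, so the algorithm choice and function names are illustrative.

```python
import hashlib
import hmac
import secrets


def new_edge_key() -> bytes:
    """Generate a fresh symmetric key for one pair of neighbors."""
    return secrets.token_bytes(32)


def tag_message(edge_key: bytes, message: bytes) -> bytes:
    """Compute an authentication tag over the message with the shared edge key."""
    return hmac.new(edge_key, message, hashlib.sha256).digest()


def verify_message(edge_key: bytes, message: bytes, tag: bytes) -> bool:
    """Accept the message only if the tag was produced with the same edge key."""
    return hmac.compare_digest(tag_message(edge_key, message), tag)


edge_key = new_edge_key()                 # shared by exactly two neighbors
msg = b"recommendation: peer p7 trust=0.8"
tag = tag_message(edge_key, msg)
print(verify_message(edge_key, msg, tag))             # True
print(verify_message(new_edge_key(), msg, tag))       # False: wrong key
```

Because the key is unique per edge, a peer that presents a second identity cannot reuse a tag from another edge to pass as an established neighbor.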

The Sybil attack peers can create more non-existent links. The protocols and services for P2P, such as routing protocols, must operate efficiently regardless of the group size. In neighbor similarity trust, peers must be self-healing in order to recover automatically from any state. Sybil attack can defeat the replication and fragmentation performed in distributed hash tables. Geographic routing in P2P is another routing mechanism which can be compromised by Sybil peers.

NEIGHBOR SIMILARITY TRUST:

We present a Sybil identification algorithm that operates on neighbor similarity trust. The directed graph has edges and vertices. In our work, we assume V is the set of peers and E is the set of edges. The edges in a neighbor similarity graph include attack edges, which are safeguarded against Sybil attacks. A peer u and a peer v can trade regardless of whether either is a Sybil peer. Within a group, a comparison can be made to determine the number of peers which trade with a given peer. If the peer trades with very few unsuccessful transactions, we can deduce the peer is a Sybil peer. This is supported by our approach, which proposes that a peer existing in a group has six types of keys. The keys which exist mostly are pairwise keys supported by the group keys. We also note that if an honest group has a link with another group which has Sybil peers, the Sybil group tends to have information which is not complete. Our algorithm adaptively tests the suspected peer while maintaining the neighbor similarity trust connection based on time.

DETECTION OF SYBIL ATTACK:

In a Sybil attack, a malicious peer must try to present multiple distinct identities. This can be achieved either by generating legal identities or by impersonating other normal peers. Some peers may launch arbitrary attacks to interfere with P2P e-commerce operations or the normal functioning of the network. An attacker can succeed in launching a Sybil attack through:

- Heterogeneous configuration: in this case, malicious peers can have more communication and computation resources than the honest peers.

- Message manipulation: the attacker can eavesdrop on nearby communications with other parties. This means an attacker obtains and interpolates the information needed to impersonate others.

Major attacks in P2P e-commerce can be classified as passive and active attacks.

- Passive attack: the attacker listens to incoming and outgoing messages in order to infer the relevant information from the transmitted recommendations, i.e., eavesdropping, but does not harm the system. A peer can be in passive mode and later in active mode.

- Active attack: when a malicious peer receives a recommendation for forwarding, it can modify it, or when requested to provide recommendations on another peer, it can inflate them or bad-mouth that peer. Bad-mouthing is a situation where a malicious peer colludes with other malicious peers to take revenge on an honest peer. In the Sybil attack, a malicious peer generates a large number of identities and uses them together to disrupt normal operation.

SECURITY AND PERFORMANCE:

We evaluate the performance of the proposed SybilTrust. We measure two metrics, namely, non-trustworthy rate and detection rate. Non-trustworthy rate is the ratio of the number of honest peers which are erroneously marked as Sybil/malicious peers to the total number of honest peers. Detection rate is the proportion of detected Sybil/malicious peers to the total number of Sybil/malicious peers.

Communication cost: The trust level is sent with the recommendation feedback from one peer to another. If a peer is compromised, the information is broadcast to all peers as the trust level is revoked.

Computation cost: The SybilTrust approach is efficient in the computation of polynomial evaluation. The calculation of the trust level evaluation is based on a pseudo-random function (PRF), which is a deterministic function.
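The two evaluation metrics, non-trustworthy rate and detection rate, follow directly from their definitions. The sketch below assumes the ground-truth sets of honest and Sybil/malicious peers are known, as they would be in a simulation; the peer names are hypothetical.

```python
from typing import Set


def non_trustworthy_rate(honest: Set[str], flagged: Set[str]) -> float:
    """Fraction of honest peers that were erroneously marked as Sybil/malicious."""
    if not honest:
        raise ValueError("the set of honest peers must not be empty")
    return len(honest & flagged) / len(honest)


def detection_rate(sybil: Set[str], flagged: Set[str]) -> float:
    """Fraction of Sybil/malicious peers that were actually detected."""
    if not sybil:
        raise ValueError("the set of Sybil/malicious peers must not be empty")
    return len(sybil & flagged) / len(sybil)


honest = {"h1", "h2", "h3", "h4"}
sybil = {"s1", "s2"}
flagged = {"h1", "s1", "s2"}      # output of some detector
print(non_trustworthy_rate(honest, flagged))  # 0.25
print(detection_rate(sybil, flagged))         # 1.0
```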

In our simulation, we use the C# .NET tool. Each honest and malicious peer interacts with a random number of peers drawn from a uniform distribution. All the peers are restricted to the group. In our approach, the P2P e-commerce community has a total of 3 different categories of interest. The transaction interactions between peers with similar interest can be defined as successful or unsuccessful, expressed as positive or negative respectively. The impact of the first two parameters on the performance of the mechanism is evaluated; the percentage of requests replied to is chosen randomly by each malicious peer. Transactions are carried out with 10 to 40 percent malicious peers.
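A population matching this description could be generated as in the sketch below. It is written in Python rather than the C# .NET tool used in the source, and the upper bound on partners, the seed, and the function name are assumptions added to make the example reproducible.

```python
import random
from typing import Dict


def build_population(n_peers: int, malicious_fraction: float, n_categories: int = 3,
                     max_partners: int = 10, seed: int = 0) -> Dict[str, dict]:
    """Create peers with a malicious flag, an interest category, and random partners."""
    if n_peers < 2:
        raise ValueError("need at least two peers so that each one can have a partner")
    rng = random.Random(seed)
    ids = [f"p{i}" for i in range(n_peers)]
    malicious = set(rng.sample(ids, round(n_peers * malicious_fraction)))
    peers = {}
    for pid in ids:
        others = [q for q in ids if q != pid]   # all partners stay inside the group
        # Number of partners drawn from a uniform distribution.
        k = rng.randint(1, min(max_partners, len(others)))
        peers[pid] = {
            "malicious": pid in malicious,
            "category": rng.randrange(n_categories),
            "partners": rng.sample(others, k),
        }
    return peers


population = build_population(n_peers=40, malicious_fraction=0.25)
print(sum(p["malicious"] for p in population.values()))  # 10
```

Running the same function with `malicious_fraction` set between 0.10 and 0.40 would cover the range of malicious peers mentioned above.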

Our SybilTrust approach detects more malicious peers compared to Eigen Trust and Eigen Group Trust [26], as shown in Fig. 4. Fig. 4 shows the detection rates of the P2P network as the number of malicious peers increases. When the number of deployed peers is small, e.g., 40 peers, the chance that no peers are around a malicious peer is high. Fig. 4 illustrates the variation of non-trustworthy rates for different numbers of honest peers as the number of malicious peers increases. The non-trustworthy rate increases as the numbers of honest peers and malicious peers increase. The reason is that when there are more malicious peers, the number of target groups is larger. Moreover, this is because the neighbor relationship is used to categorize peers in the proposed approach. The number of target groups also increases when the number of honest peers is higher. As a result, the honest peers are examined more times, and the chance that an honest peer is erroneously determined to be a Sybil/malicious peer increases, although more Sybil attack peers can also be identified.

Fig. 4 displays the detection rate when the reply rate of each malicious peer is the same. The detection rate does not decrease when the reply rate is more than 80 percent, because of the enhancement.

The enhancement could still be found even when a malicious peer replies to almost all of its Sybil attack peer requests. Furthermore, the detection rate is higher as the number of malicious peers grows, which means the proposed mechanism is able to resist the Sybil attack from more malicious peers. The detection rate is still more than 80 percent in the sparse network, and it reaches 95 percent when the number of legitimate nodes is 300. This is also because the number of target groups increases as the number of malicious peers increases and the honest peers are examined more times. Therefore, the rate at which an honest peer is erroneously identified as a Sybil/malicious peer also increases.

CHAPTER 5

5.0 SYSTEM STUDY:

5.1 FEASIBILITY STUDY:

Three key considerations involved in the feasibility analysis are

ECONOMICAL FEASIBILITY
TECHNICAL FEASIBILITY
SOCIAL FEASIBILITY

5.1.1 ECONOMICAL FEASIBILITY:

5.1.2 TECHNICAL FEASIBILITY:

5.1.3 SOCIAL FEASIBILITY:

5.2 SYSTEM TESTING:

Testing is a process of checking whether the developed system is working according to the original objectives and requirements. It is a set of activities that can be planned in advance and conducted systematically. Testing is vital to the success of the system. System testing makes the logical assumption that if all the parts of the system are correct, the overall goal will be successfully achieved. Inadequate testing or no testing leads to errors that may not appear until many months later. This creates two problems: the time lag between the cause and the appearance of the problem, and the effect of the system errors on the files and records within the system. A small system error can conceivably explode into a much larger problem. Effective testing early in the process translates directly into long-term cost savings from a reduced number of errors. Another reason for system testing is its utility as a user-oriented vehicle before implementation. The best programs are worthless if they do not produce the correct outputs.

5.2.1 UNIT TESTING:

Description	Expected result
Test for application window properties.	All the properties of the windows are to be properly aligned and displayed.
Test for mouse operations.	All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions.

5.2.2 FUNCTIONAL TESTING:

Description	Expected result
Test for all modules.	All peers should communicate in the group.
Test for various peers in a distributed network framework, as it displays all users available in the group.	The result after execution should be accurate.

5.2.3 NON-FUNCTIONAL TESTING:

Load testing
Performance testing
Usability testing
Reliability testing
Security testing

5.2.4 LOAD TESTING:

Description	Expected result
It is necessary to ascertain that the application behaves correctly under load when a ‘Server busy’ response is received.	Should designate another active node as a server.

5.2.5 PERFORMANCE TESTING:

Description	Expected result
This is required to ensure that the application performs adequately, having the capability to handle many peers, delivering its results in the expected time and using an acceptable level of resources; it is an aspect of operational management.	Should handle large input values and produce accurate results in the expected time.

5.2.6 RELIABILITY TESTING:

Description	Expected result
This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in providing the application.	In case of failure of the server, an alternate server should take over the job.

5.2.7 SECURITY TESTING:

Description	Expected result
Checking that the user identification is authenticated.	In case of failure, the user should not be connected to the framework.
Check whether group keys in a tree are shared by all peers.	Peers in the same group should know the group key.

5.2.8 WHITE BOX TESTING:

Description	Expected result
Exercise all logical decisions on their true and false sides.	All the logical decisions must be valid.
Execute all loops at their boundaries and within their operational bounds.	All the loops must be finite.
Exercise internal data structures to ensure their validity.	All the data structures must be valid.

5.2.9 BLACK BOX TESTING:

Description	Expected result
To check for incorrect or missing functions.	All the functions must be valid.
To check for interface errors.	The entire interface must function normally.
To check for errors in data structures or external database access.	Database updates and retrieval must be performed correctly.
To check for initialization and termination errors.	All the functions and data structures must be initialized properly and terminated normally.

All the above system testing strategies are carried out, as the development, documentation, and institutionalization of the proposed goals and related policies are essential.

CHAPTER 7

7.0 SOFTWARE SPECIFICATION:

7.1 FEATURES OF .NET:

Microsoft .NET is a set of Microsoft software technologies for rapidly building and integrating XML Web services, Microsoft Windows-based applications, and Web solutions. The .NET Framework is a language-neutral platform for writing programs that can easily and securely interoperate. There’s no language barrier with .NET: there are numerous languages available to the developer, including Managed C++, C#, Visual Basic, and JScript.

The .NET framework provides the foundation for components to interact seamlessly, whether locally or remotely on different platforms. It standardizes common data types and communications protocols so that components created in different languages can easily interoperate.

“.NET” is also the collective name given to various software components built upon the .NET platform. These will be both products (Visual Studio.NET and Windows.NET Server, for instance) and services (like Passport, .NET My Services, and so on).

7.2 THE .NET FRAMEWORK

The .NET Framework has two main parts:

1. The Common Language Runtime (CLR).

2. A hierarchical set of class libraries.

The CLR is described as the “execution engine” of .NET. It provides the environment within which programs run. The most important features are:

- Conversion from a low-level assembler-style language, called Intermediate Language (IL), into code native to the platform being executed on.
- Memory management, notably including garbage collection.
- Checking and enforcing security restrictions on the running code.
- Loading and executing programs, with version control and other such features.

The following features of the .NET framework are also worth description:

Managed Code

Managed code is code that targets .NET and contains certain extra information – “metadata” – to describe itself. Whilst both managed and unmanaged code can run in the runtime, only managed code contains the information that allows the CLR to guarantee, for instance, safe execution and interoperability.

Managed Data

With Managed Code comes Managed Data. The CLR provides memory allocation and deallocation facilities, and garbage collection. Some .NET languages use Managed Data by default, such as C#, Visual Basic.NET and JScript.NET, whereas others, namely C++, do not. Targeting the CLR can, depending on the language you’re using, impose certain constraints on the features available. As with managed and unmanaged code, one can have both managed and unmanaged data in .NET applications: unmanaged data is not garbage collected but is instead looked after by unmanaged code.

Common Type System

The CLR uses the Common Type System (CTS) to strictly enforce type safety. This ensures that all classes are compatible with each other, by describing types in a common way. The CTS defines how types work within the runtime, which enables types in one language to interoperate with types in another language, including cross-language exception handling. As well as ensuring that types are only used in appropriate ways, the runtime also ensures that code doesn’t attempt to access memory that hasn’t been allocated to it.

Common Language Specification

The CLR provides built-in support for language interoperability. To ensure that you can develop managed code that can be fully used by developers using any programming language, a set of language features and rules for using them called the Common Language Specification (CLS) has been defined. Components that follow these rules and expose only CLS features are considered CLS-compliant.

7.3 THE CLASS LIBRARY

.NET provides a single-rooted hierarchy of classes, containing over 7000 types. The root of the namespace is called System; this contains basic types like Byte, Double, Boolean, and String, as well as Object. All objects derive from System.Object. As well as objects, there are value types. Value types can be allocated on the stack, which can provide useful flexibility. There are also efficient means of converting value types to object types if and when necessary.

The set of classes is pretty comprehensive, providing collections, file, screen, and network I/O, threading, and so on, as well as XML and database connectivity.

The class library is subdivided into a number of sets (or namespaces), each providing distinct areas of functionality, with dependencies between the namespaces kept to a minimum.

7.4 LANGUAGES SUPPORTED BY .NET

The multi-language capability of the .NET Framework and Visual Studio .NET enables developers to use their existing programming skills to build all types of applications and XML Web services. The .NET framework supports new versions of Microsoft’s old favorites Visual Basic and C++ (as VB.NET and Managed C++), but there are also a number of new additions to the family.

Visual Basic .NET has been updated to include many new and improved language features that make it a powerful object-oriented programming language. These features include inheritance, interfaces, and overloading, among others. Visual Basic also now supports structured exception handling, custom attributes, and multithreading.

Visual Basic .NET is also CLS compliant, which means that any CLS-compliant language can use the classes, objects, and components you create in Visual Basic .NET.

Managed Extensions for C++ and attributed programming are just some of the enhancements made to the C++ language. Managed Extensions simplify the task of migrating existing C++ applications to the new .NET Framework.

C# is Microsoft’s new language. It’s a C-style language that is essentially “C++ for Rapid Application Development”. Unlike other languages, its specification is just the grammar of the language. It has no standard library of its own, and instead has been designed with the intention of using the .NET libraries as its own.

Microsoft Visual J# .NET provides the easiest transition for Java-language developers into the world of XML Web Services and dramatically improves the interoperability of Java-language programs with existing software written in a variety of other programming languages.

ActiveState has created Visual Perl and Visual Python, which enable .NET-aware applications to be built in either Perl or Python. Both products can be integrated into the Visual Studio .NET environment. Visual Perl includes support for ActiveState’s Perl Dev Kit.

Other languages for which .NET compilers are available include

FORTRAN
COBOL
Eiffel

Fig. 1: .NET Framework layers, from top to bottom:

- ASP.NET (XML Web Services) and Windows Forms
- Base Class Libraries
- Common Language Runtime
- Operating System

C#.NET is compliant with the CLS (Common Language Specification) and supports structured exception handling. The CLS is a set of rules and constructs that are supported by the CLR (Common Language Runtime). The CLR is the runtime environment provided by the .NET Framework; it manages the execution of the code and also makes the development process easier by providing services.

Because C#.NET is a CLS-compliant language, any objects, classes, or components that are created in C#.NET can be used in any other CLS-compliant language. In addition, we can use objects, classes, and components created in other CLS-compliant languages in C#.NET. The use of the CLS ensures complete interoperability among applications, regardless of the languages used to create them.

CONSTRUCTORS AND DESTRUCTORS:

Constructors are used to initialize objects, whereas destructors are used to release the resources allocated to an object before it is destroyed. In C#.NET, a destructor is declared with the ~ClassName() syntax, which the compiler translates into an override of the protected Finalize method. The destructor performs the tasks that must be completed when an object is destroyed. It is called automatically by the garbage collector when the object is reclaimed and cannot be called explicitly from user code.

GARBAGE COLLECTION

Garbage Collection is another new feature in C#.NET. The .NET Framework monitors allocated resources, such as objects and variables. In addition, the .NET Framework automatically releases memory for reuse by destroying objects that are no longer in use.

In C#.NET, the garbage collector checks for the objects that are not currently in use by applications. When the garbage collector comes across an object that is marked for garbage collection, it releases the memory occupied by the object.

OVERLOADING

Overloading is another feature in C#. Overloading enables us to define multiple methods with the same name, where each method has a different parameter list. Besides methods, we can overload constructors, indexers, and operators in a class.

MULTITHREADING:

C#.NET also supports multithreading. An application that supports multithreading can handle multiple tasks simultaneously. We can use multithreading to decrease the time taken by an application to respond to user interaction.

STRUCTURED EXCEPTION HANDLING

C#.NET supports structured exception handling, which enables us to detect and handle errors at runtime. In C#.NET, we use try…catch…finally statements to create exception handlers. Using try…catch…finally statements, we can create robust and effective exception handlers that improve the reliability of our application.
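The Try…Catch…Finally control flow described above can be sketched as follows. As an assumption made only for illustration, the sketch uses Python's try/except/finally, which follows the same three-phase pattern as the C# statement; `parse_port` and the `log` list are hypothetical names that do not come from the project.

```python
def parse_port(text, log):
    """Convert text to a TCP port number, recording each phase in `log`."""
    try:
        port = int(text)              # may raise ValueError for non-numeric text
        if not 0 < port < 65536:
            raise ValueError("port out of range")
        log.append("parsed")
        return port
    except ValueError:
        # The handler runs only when the protected block raised ValueError.
        log.append("handled")
        return None
    finally:
        # The cleanup phase runs on both the success path and the error path.
        log.append("cleanup")
```

The point of the pattern is that the cleanup step runs whether or not an error occurred, so resources are released on every path.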

7.5 THE .NET FRAMEWORK

The .NET Framework is a new computing platform that simplifies application development in the highly distributed environment of the Internet.

OBJECTIVES OF .NET FRAMEWORK

1. To provide a consistent object-oriented programming environment whether object code is stored and executed locally, executed locally but Internet-distributed, or executed remotely.

2. To provide a code-execution environment that minimizes software deployment conflicts and guarantees safe execution of code.

3. To provide a code-execution environment that eliminates the performance problems of scripted or interpreted environments.

There are different types of application, such as Windows-based applications and Web-based applications.

7.6 FEATURES OF SQL-SERVER

The OLAP Services feature available in SQL Server version 7.0 is now called SQL Server 2000 Analysis Services. The term OLAP Services has been replaced with the term Analysis Services. Analysis Services also includes a new data mining component. The Repository component available in SQL Server version 7.0 is now called Microsoft SQL Server 2000 Meta Data Services. References to the component now use the term Meta Data Services. The term repository is used only in reference to the repository engine within Meta Data Services

A SQL Server database consists of the following types of objects:

1. TABLE

2. QUERY

3. FORM

4. REPORT

5. MACRO

7.7 TABLE:

A database is a collection of data about a specific topic.

VIEWS OF TABLE:

We can work with a table in two views:

1. Design View

2. Datasheet View

Design View

To build or modify the structure of a table, we work in the table design view. There we can specify what kind of data each field will hold.

Datasheet View

To add, edit, or analyze the data itself, we work in the table’s datasheet view.

QUERY:

A query is a question that is asked of the data. Access gathers data that answers the question from one or more tables. The data that makes up the answer is either a dynaset (which can be edited) or a snapshot (which cannot be edited). Each time we run a query, we get the latest information in the dynaset. Access either displays the dynaset or snapshot for us to view or performs an action on it, such as deleting or updating.

CHAPTER 7

APPENDIX

7.1 SAMPLE SOURCE CODE

7.2 SAMPLE OUTPUT

CHAPTER 8

8.0 CONCLUSION AND FUTURE:

We presented SybilTrust, a defense against Sybil attack in P2P e-commerce. Compared to other approaches, our approach is based on neighborhood similarity trust in a group P2P e-commerce community. This approach exploits the relationship between peers in a neighborhood setting. Our results on real-world P2P e-commerce confirmed the fast-mixing property and hence validated the fundamental assumption behind SybilGuard’s approach. We also described defense types such as key validation, distribution, and position verification. These methods can be applied simultaneously with neighbor similarity trust, which gives a better defense mechanism. For future work, we intend to implement SybilTrust within the context of peers which exist in many groups. Neighbor similarity trust helps to weed out the Sybil peers and isolate maliciousness to specific Sybil peer groups rather than allow attacks in honest groups with all honest peers.

Maximizing P2P File Access Availability in Mobile Ad Hoc Networks though Replication for Efficient Fi

05/08/201902/07/2019 by admin

File sharing applications in mobile ad hoc networks (MANETs) have attracted more and more attention in recent years. The efficiency of file querying suffers from the distinctive properties of such networks including node mobility and limited communication range and resource. An intuitive method to alleviate this problem is to create file replicas in the network. However, despite the efforts on file replication, no research has focused on the global optimal replica creation with minimum average querying delay.

Specifically, current file replication protocols in mobile ad hoc networks have two shortcomings. First, they lack a rule to allocate limited resources to different files in order to minimize the average querying delay. Second, they simply consider storage as available resources for replicas, but neglect the fact that the file holders’ frequency of meeting other nodes also plays an important role in determining file availability. Actually, a node that has a higher meeting frequency with others provides higher availability to its files. This becomes even more evident in sparsely distributed MANETs, in which nodes meet disruptively.

In this paper, we introduce a new concept of resource for file replication, which considers both node storage and node meeting ability. We theoretically study the influence of resource allocation on the average querying delay and derive an optimal file replication rule (OFRR) that allocates resources to each file based on its popularity and size. We then propose a file replication protocol based on the rule, which approximates the minimum global querying delay in a fully distributed manner. Our experiment and simulation results show the superior performance of the proposed protocol in comparison with other representative replication protocols.

1.2 INTRODUCTION

With the increasing popularity of mobile devices, e.g., smartphones and laptops, we envision the future of MANETs consisting of these mobile devices. By MANETs, we refer to both normal MANETs and disconnected MANETs, also known as delay tolerant networks (DTNs). The former has a relatively dense node distribution in an area while the latter has sparsely distributed nodes that meet each other opportunistically. On the other hand, the emergence of mobile file sharing applications motivates the investigation of peer-to-peer (P2P) file sharing over such MANETs. The local P2P file sharing model provides three advantages. First, it enables file sharing when no base stations are available (e.g., in rural areas). Second, with the P2P architecture, the bottleneck on overloaded servers in current client-server based file sharing systems can be avoided. Third, it exploits otherwise wasted peer-to-peer communication opportunities among mobile nodes. As a result, nodes can freely and unobtrusively access and share files in the distributed MANET environment, which can possibly support interesting applications.

For example, mobile nodes can share files based on users’ proximity in the same building or in a local community. Tourists can share their travel experiences or emergency information with other tourists through digital devices directly even when no base station is available in remote areas. Drivers can share road information through the vehicle-to-vehicle communication. However, the distinctive properties of MANETs, i.e., node mobility, limited communication range and resource, have rendered many difficulties in realizing such a P2P file sharing system. For example, file searching turns out to be difficult since nodes in MANETs move around freely and can exchange information only when they are within the communication range. Broadcasting can quickly discover files, but it leads to the broadcast storm problem with high energy consumption.

Probabilistic routing and file discovery protocols avoid broadcasting by forwarding a query to a node with higher probability of meeting the destination. But the opportunistic encountering of nodes in MANETs makes file searching and retrieval non-deterministic. File replication is an effective way to enhance file availability and reduce file querying delay. It creates replicas for a file to improve its probability of being encountered by requests. Unfortunately, it is impractical and inefficient to enable every node to hold the replicas of all files in the system considering limited node resources. Also, file querying delay is always a main concern in a file sharing system. Users often desire to receive their requested files quickly no matter whether the files are popular or not. Thus, a critical issue is raised for further investigation: how to allocate the limited resource in the network to different files for replication so that the overall average file querying delay is minimized? Recently, a number of file replication protocols have been proposed for MANETs. In these protocols, each individual node replicates files it frequently queries or a group of nodes create one replica for each file they frequently query. In the former, redundant replicas are easily created in the system, thereby wasting resources.

In the latter, though redundant replicas are reduced by group-based cooperation, neighboring nodes may separate from each other due to node mobility, leading to large query delay. There are also some works addressing content caching in disconnected MANETs/DTNs for efficient data retrieval or message routing. They basically cache data that are frequently queried on places that are visited frequently by mobile nodes. Both categories of replication methods fail to thoroughly consider that a node’s mobility affects the availability of its files. In spite of these efforts, current file replication protocols lack a rule to allocate limited resources to files for replica creation in order to achieve the minimum average querying delay, i.e., global search efficiency optimization under limited resources. They simply consider storage as the resource for replicas, but neglect that a node’s frequency of meeting other nodes (meeting ability for short) also influences the availability of its files. Files in a node with a higher meeting ability have higher availability.

1.3 LITERATURE SURVEY

CONTACT DURATION AWARE DATA REPLICATION IN DELAY TOLERANT NETWORKS

AUTHOR: X. Zhuo, Q. Li, W. Gao, G. Cao, and Y. Dai

PUBLISH: Proc. IEEE 19th Int’l Conf. Network Protocols (ICNP), 2011.

EXPLANATION:

The recent popularization of hand-held mobile devices, such as smartphones, enables the inter-connectivity among mobile users without the support of Internet infrastructure. When mobile users move and contact each other opportunistically, they form a Delay Tolerant Network (DTN), which can be exploited to share data among them. Data replication is one of the common techniques for such data sharing. However, the unstable network topology and limited contact duration in DTNs make it difficult to directly apply traditional data replication schemes. Although there are a few existing studies on data replication in DTNs, they generally ignore the contact duration limits. In this paper, we recognize the deficiency of existing data replication schemes which treat the complete data item as the replication unit, and propose to replicate data at the packet level. We analytically formulate the contact duration aware data replication problem and give a centralized solution to better utilize the limited storage buffers and the contact opportunities. We further propose a practical contact Duration Aware Replication Algorithm (DARA) which operates in a fully distributed manner and reduces the computational complexity. Extensive simulations on both synthetic and realistic traces show that our distributed scheme achieves close-to-optimal performance, and outperforms other existing replication schemes.

SOCIAL-BASED COOPERATIVE CACHING IN DTNS: A CONTACT DURATION AWARE APPROACH

AUTHOR: X. Zhuo, Q. Li, G. Cao, Y. Dai, B.K. Szymanski, and T.L. Porta,

PUBLISH: Proc. IEEE Eighth Int’l Conf. Mobile Adhoc and Sensor Systems (MASS), 2011.

EXPLANATION:

Data access is an important issue in Delay Tolerant Networks (DTNs), and a common technique to improve the performance of data access is cooperative caching. However, due to the unpredictable node mobility in DTNs, traditional caching schemes cannot be directly applied. In this paper, we propose DAC, a novel caching protocol adaptive to the challenging environment of DTNs. Specifically, we exploit the social community structure to combat the unstable network topology in DTNs. We propose a new centrality metric to evaluate the caching capability of each node within a community, and solutions based on this metric are proposed to determine where to cache. More importantly, we consider the impact of the contact duration limitation on cooperative caching, which has been ignored by the existing works. We prove that the marginal caching benefit that a node can provide diminishes when more data is cached. We derive an adaptive caching bound for each mobile node according to its specific contact patterns with others, to limit the amount of data it caches. In this way, both the storage space and the contact opportunities are better utilized. To mitigate the coupon collector’s problem, network coding techniques are used to further improve the caching efficiency. Extensive trace-driven simulations show that our cooperative caching protocol can significantly improve the performance of data access in DTNs.

SEDUM: EXPLOITING SOCIAL NETWORKS IN UTILITY-BASED DISTRIBUTED ROUTING FOR DTNS

AUTHOR: Z. Li and H. Shen

PUBLISH: IEEE Trans. Computers, vol. 62, no. 1, pp. 83-97, Jan. 2012.

EXPLANATION:

Current probabilistic forwarding methods consider only node contact frequency when calculating the utility and neglect the influence of contact duration on the throughput, though both contact frequency and contact duration reflect the node movement pattern in a social network. In this paper, we theoretically prove that considering both factors leads to higher throughput than considering only contact frequency. To fully exploit a social network for high throughput and low routing delay, we propose a Social network oriented and duration utility-based distributed multicopy routing protocol (SEDUM) for DTNs. SEDUM is distinguished by three features. First, it considers both contact frequency and duration in node movement patterns of social networks. Second, it uses multicopy routing and can discover the minimum number of copies of a message to achieve a desired routing delay. Third, it has an effective buffer management mechanism to increase throughput and decrease routing delay. Theoretical analysis and simulation results show that SEDUM provides high throughput and low routing delay compared to existing routing approaches. The results conform to our expectation that considering both contact frequency and duration for delivery utility in routing can achieve higher throughput than considering only contact frequency, especially in a highly dynamic environment with large routing messages.

CHAPTER 2

2.0 SYSTEM ANALYSIS

2.1 EXISTING SYSTEM:

This work focuses on Delay Tolerant Networks (DTNs) in a social network environment. DTNs do not have a complete path from a source to a destination most of the time. Previous data routing approaches in DTNs are primarily based on either flooding or single-copy routing. However, these methods incur either high overhead due to excessive transmissions or long delays due to suboptimal choices for relay nodes. Probabilistic forwarding that forwards a message to a node with a higher delivery utility enhances single-copy routing.

File sharing applications in mobile ad hoc networks (MANETs) have attracted increasing attention, but the efficiency of file querying suffers from the distinctive properties of MANETs, including node mobility and limited communication range and resources. An intuitive method to alleviate this problem is to create file replicas in the network. However, despite the efforts on file replication, no research has focused on globally optimal replica sharing with minimum average querying delay. Communication links between mobile nodes are transient, and network maintenance overhead is a major performance bottleneck for data transmission. Low node density makes it difficult to establish an end-to-end connection, which prevents a continuous end-to-end path between a source and a destination.

DTNs were originally designed for communication in outer space, but the technology is now directly accessible from our pockets. Taking into account both the characteristics of MANETs and the requirements of P2P file sharing, an application-layer overlay network is used. We port a DTN-type solution into an infrastructure-less environment like MANETs and leverage peer mobility to reach data in other disconnected networks. This is done by implementing an asynchronous communication model, store-delegate-and-forward, like DTNs, where a peer can delegate unaccomplished file download or query tasks to special peers. To improve data transmission performance while reducing communication overhead, we select these special peers by the expectation of encountering them again in the future and assign them different download starting points in the file.

2.1.1 DISADVANTAGES:

Limited communication range and resource have rendered many difficulties in realizing such a P2P file sharing system. For example, file searching turns out to be difficult since nodes in MANETs move around freely and can exchange information only when they are within the communication range.

Another disadvantage is a lack of transparency: a received URL explicitly points to a particular data replica, so the browser becomes aware of the switching between different machines.

In terms of scalability, clients always contact the same single service machine, which becomes a bottleneck as the number of clients increases.

2.2 PROPOSED SYSTEM:

We propose a distributed file replication protocol that can approximately realize the optimal file replication rule (OFRR) with the two mobility models in a distributed manner. Since the OFRR in the two mobility models (i.e., Equations (22) and (28)) has the same form, we present the protocol in this section without indicating the specific mobility model. We first introduce the challenges in realizing the OFRR and our solutions. We then propose a replication protocol to realize the OFRR and analyze the effect of the protocol.

We propose the priority competition and split file replication protocol (PCS). We first introduce how a node retrieves the parameters needed in PCS and then present the details of PCS. We then briefly prove the effectiveness of PCS. We refer to the process in which a node tries to copy a file to its neighbors as one round of replica distribution. Recall that when a replica is created for a file with priority P, the two copies will replicate files with priority P/2 in the next round. This means that the creation of replicas will not increase the overall P of the file. Also, after each round, the priority value of each file or replica is updated based on the received requests for the file.

Then, though some replicas may be deleted in the competition, the total amount of requests for the file remains stable, making the sum of the Ps of all replicas and the original file roughly equal to the overall priority value of the file. Then, we can regard the replicas of a file as an entity that competes for available resource in the system with accumulated priority P in each round. Therefore, in each round of replica distribution, based on our design of PCS, the overall probability of creating a replica for an original file
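The priority-splitting step described in this section can be illustrated with a minimal sketch. It covers only the halving of P when a replica is created; the competition between files and the request-based priority update are deliberately omitted. The `Replica` class, the node names, and the starting priority of 8.0 are hypothetical values chosen for illustration and are not taken from the PCS specification.

```python
from dataclasses import dataclass

@dataclass
class Replica:
    file_id: str
    holder: str
    priority: float

def split_replica(source, new_holder):
    """Create a replica on new_holder; each of the two copies keeps P/2."""
    half = source.priority / 2
    source.priority = half
    return Replica(source.file_id, new_holder, half)

def total_priority(replicas, file_id):
    """Sum of the priorities of all copies of one file."""
    return sum(r.priority for r in replicas if r.file_id == file_id)

replicas = [Replica("f1", "n0", 8.0)]
# Two rounds of replica distribution: every existing copy creates one replica.
for _ in range(2):
    created = [split_replica(r, f"n{len(replicas) + i}")
               for i, r in enumerate(list(replicas))]
    replicas.extend(created)
```

After two rounds there are four copies with priority 2.0 each, so the accumulated priority of the file is still 8.0, which matches the statement that creating replicas does not increase the overall P.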

2.2.1 ADVANTAGES:

The community-based mobility model has been used in content dissemination or routing algorithms for disconnected MANETs/DTNs to depict node mobility. In this model, the entire test area is split into different sub-areas, denoted as caves. Each cave holds one community.

In the random waypoint (RWP) model, we can assume that the inter-meeting time among nodes follows an exponential distribution. Then, the probability of meeting a node is independent of the previously encountered node. Therefore, we define the meeting ability of a node as the average number of nodes it meets in a unit time and use it to investigate the optimal file replication.
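The definition of meeting ability can be made concrete with a small sketch. Assuming exponentially distributed inter-meeting times with a hypothetical rate of 2 meetings per unit time, the code draws meeting times and estimates the meeting ability as the number of meetings divided by the observation time. The function names, the rate, and the fixed seed are illustrative assumptions.

```python
import random

def simulate_meetings(rate, duration, seed=0):
    """Return meeting times in [0, duration) with exponential inter-meeting gaps."""
    rng = random.Random(seed)
    times = []
    t = rng.expovariate(rate)
    while t < duration:
        times.append(t)
        t += rng.expovariate(rate)
    return times

def meeting_ability(meeting_times, duration):
    """Average number of nodes met per unit time."""
    return len(meeting_times) / duration

times = simulate_meetings(rate=2.0, duration=5000.0)
estimate = meeting_ability(times, 5000.0)
```

Over a long observation window the estimate should be close to the underlying rate, which is why a node's observed meeting count per unit time can serve as its meeting ability.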

For PCS, we used two routing protocols in the experiments. We first used the Static Wait protocol in the GENI experiment, in which each query stays on the source node waiting for the destination. We then used a probabilistic routing protocol (PROPHET) in which a node routes requests to the neighbor with the highest meeting ability.
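The two forwarding decisions described above can be sketched as next-hop functions. This is a simplified illustration, not the PROPHET protocol itself: the function names and the `ability` dictionary are hypothetical, and the rule that a request is forwarded only when the best neighbor has a higher meeting ability than the current node is an added assumption.

```python
def static_wait_next_hop(current, neighbors, ability):
    """Static Wait: the query stays on the node that currently holds it."""
    return current

def highest_ability_next_hop(current, neighbors, ability):
    """Forward to the neighbor with the highest meeting ability.

    Assumption: the request moves only if that neighbor's meeting ability
    exceeds that of the current holder; otherwise it stays where it is.
    """
    best = max(neighbors, key=lambda n: ability[n], default=current)
    return best if ability[best] > ability[current] else current

# Hypothetical meeting abilities (nodes met per unit time).
ability = {"a": 1.0, "b": 3.0, "c": 2.0}
```

With these values, a request held by node "a" with neighbors "b" and "c" moves to "b", while a request already held by "b" stays put.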

2.3 HARDWARE & SOFTWARE REQUIREMENTS:

2.3.1 HARDWARE REQUIREMENT:

- Processor – Pentium IV
- Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Key Board – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA

2.3.2 SOFTWARE REQUIREMENTS:

Operating System : Windows XP or Win7
Front End : JAVA JDK 1.7
Tools : Netbeans 7
Script : JavaScript
Document : MS-Office 2007

CHAPTER 3

3.0 SYSTEM DESIGN:

Data Flow Diagram / Use Case Diagram / Flow Diagram:

The DFD is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on these data, and the output data generated by the system.

The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components. These components are the system process, the data used by the process, an external entity that interacts with the system and the information flows in the system.

DFD shows how the information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.

A DFD may be used to represent a system at any level of abstraction. A DFD may be partitioned into levels that represent increasing information flow and functional detail.

NOTATION:

SOURCE OR DESTINATION OF DATA:

External sources or destinations, which may be people or organizations or other entities

DATA STORE:

Here the data referenced by a process is stored and retrieved.

PROCESS:

People, procedures, or devices that produce data. The physical component is not identified.

DATA FLOW:

Data moves in a specific direction from an origin to a destination. The data flow is a “packet” of data.

MODELING RULES:

There are several common modeling rules when creating DFDs:

All processes must have at least one data flow in and one data flow out.
All processes should modify the incoming data, producing new forms of outgoing data.
Each data store must be involved with at least one data flow.
Each external entity must be involved with at least one data flow.
A data flow must be attached to at least one process.
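The structural rules above can be checked mechanically. The following sketch assumes a DFD is given as sets of process, data store, and external entity names plus a list of (source, target) flows; the function name `check_dfd` and this representation are illustrative assumptions. The rule that a process should modify its incoming data concerns semantics and is not checked here.

```python
def check_dfd(processes, stores, externals, flows):
    """Return a list of rule violations for a DFD given as (source, target) flows."""
    problems = []
    # Every process needs at least one incoming and one outgoing data flow.
    for p in processes:
        if not any(dst == p for _, dst in flows):
            problems.append(f"process {p} has no incoming flow")
        if not any(src == p for src, _ in flows):
            problems.append(f"process {p} has no outgoing flow")
    # Every data store and external entity must take part in some data flow.
    for kind, nodes in (("data store", stores), ("external entity", externals)):
        for n in nodes:
            if not any(n in flow for flow in flows):
                problems.append(f"{kind} {n} has no data flow")
    # Every data flow must be attached to at least one process.
    for src, dst in flows:
        if src not in processes and dst not in processes:
            problems.append(f"flow {src}->{dst} is not attached to a process")
    return problems
```

For example, a diagram with flows Node -> Replicate -> FileStore passes all checks, whereas a direct flow from Node to FileStore is reported because it bypasses every process.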

3.1 ARCHITECTURE DIAGRAM

3.2 DATAFLOW DIAGRAM

UML DIAGRAMS:

3.2 USE CASE DIAGRAM:

3.3 CLASS DIAGRAM:

3.4 SEQUENCE DIAGRAM:

3.5 ACTIVITY DIAGRAM:

CHAPTER 4

4.0 IMPLEMENTATION:

OFRR PROTOCOL:

4.1 ALGORITHM

PSEUDO-CODE FOR PCS ALGORITHM:

4.2 MODULES:

DELAY TOLERANT NETWORKS (DTNS):

P2P FILE SHARING IN MANETS:

MANETS WITH RWP MODEL:

DISTRIBUTED FILE REPLICATION:

EXPERIMENTAL RESULTS:

REPLICA COST:

REPLICA DISTRIBUTION:

AVERAGE DELAY:

4.3 MODULE DESCRIPTION:

DELAY TOLERANT NETWORKS (DTNS):

P2P FILE SHARING IN MANETS:

MANETS WITH RWP MODEL:

DISTRIBUTED FILE REPLICATION:

EXPERIMENTAL RESULTS:

REPLICA COST:

REPLICA DISTRIBUTION:

AVERAGE DELAY:

CHAPTER 8

8.1 CONCLUSION & FUTURE:

In this paper, we investigated the problem of how to allocate limited resources for file replication for the purpose of global optimal file searching efficiency in MANETs. Unlike previous protocols that only consider storage as resources, we also consider file holder’s ability to meet nodes as available resources since it also affects the availability of files on the node. We first theoretically analyzed the influence of replica distribution on the average querying delay under constrained available resources with two mobility models, and then derived an optimal replication rule that can allocate resources to file replicas with minimal average querying delay.

Finally, we designed the priority competition and split replication protocol (PCS) that realizes the optimal replication rule in a fully distributed manner. Extensive experiments on the GENI testbed, NS-2, and an event-driven simulator with real traces and synthesized mobility confirm both the correctness of our theoretical analysis and the effectiveness of PCS in MANETs. In this study, we focus on a static set of files in the network. In our future work, we will theoretically analyze a more complex environment including file dynamics (file addition and deletion, file timeout) and dynamic node querying patterns.

k-Nearest Neighbor Classification over Semantically Secure Encrypted Relational Data

05/08/201902/07/2019 by admin

k-Nearest Neighbor Classification overSemantically Secure Encrypted Relational DataBharath K. Samanthula, Member, IEEE, Yousef Elmehdwi, and Wei Jiang, Member, IEEEAbstract—Data Mining has wide applications in many areas such as banking, medicine, scientific research and among governmentagencies. Classification is one of the commonly used tasks in data mining applications. For the past decade, due to the rise of variousprivacy issues, many theoretical and practical solutions to the classification problem have been proposed under different securitymodels. However, with the recent popularity of cloud computing, users now have the opportunity to outsource their data, in encryptedform, as well as the data mining tasks to the cloud. Since the data on the cloud is in encrypted form, existing privacy-preservingclassification techniques are not applicable. In this paper, we focus on solving the classification problem over encrypted data. Inparticular, we propose a secure k-NN classifier over encrypted data in the cloud. The proposed protocol protects the confidentiality ofdata, privacy of user’s input query, and hides the data access patterns. To the best of our knowledge, our work is the first to develop asecure k-NN classifier over encrypted data under the semi-honest model. Also, we empirically analyze the efficiency of our proposedprotocol using a real-world dataset under different parameter settings.Index Terms—Security, k-NN classifier, outsourced databases, encryptionÇ1 INTRODUCTIONRECENTLY, the cloud computing paradigm [1] is revolutionizingthe organizations’ way of operating their dataparticularly in the way they store, access and process data.As an emerging computing paradigm, cloud computingattracts many organizations to consider seriously regardingcloud potential in terms of its cost-efficiency, flexibility, andoffload of administrative overhead. Most often, organizationsdelegate their computational operations in addition totheir data to the cloud. 
Despite the tremendous advantages that the cloud offers, privacy and security issues in the cloud are preventing companies from utilizing those advantages. When data are highly sensitive, they need to be encrypted before being outsourced to the cloud. However, when data are encrypted, irrespective of the underlying encryption scheme, performing any data mining task without ever decrypting the data becomes very challenging. There are other privacy concerns, demonstrated by the following example.

Example 1. Suppose an insurance company outsourced its encrypted customer database and relevant data mining tasks to a cloud. When an agent from the company wants to determine the risk level of a potential new customer, the agent can use a classification method to determine the risk level of the customer. First, the agent needs to generate a data record q for the customer containing certain personal information of the customer, e.g., credit score, age, marital status, etc. Then this record can be sent to the cloud, and the cloud will compute the class label for q. Nevertheless, since q contains sensitive information, to protect the customer's privacy, q should be encrypted before sending it to the cloud.

The above example shows that data mining over encrypted data (denoted by DMED) on a cloud also needs to protect a user's record when the record is a part of a data mining process. Moreover, the cloud can also derive useful and sensitive information about the actual data items by observing the data access patterns even if the data are encrypted [2], [3].
Therefore, the privacy/security requirements of the DMED problem on a cloud are threefold: (1) confidentiality of the encrypted data, (2) confidentiality of a user's query record, and (3) hiding data access patterns.

Existing work on privacy-preserving data mining (PPDM) (either perturbation or secure multi-party computation (SMC) based approaches) cannot solve the DMED problem. Perturbed data do not possess semantic security, so data perturbation techniques cannot be used to encrypt highly sensitive data. Also, the perturbed data do not produce very accurate data mining results. The secure multi-party computation based approach assumes data are distributed and not encrypted at each participating party. In addition, many intermediate computations are performed based on non-encrypted data. As a result, in this paper, we propose novel methods to effectively solve the DMED problem assuming that the encrypted data are outsourced to a cloud. Specifically, we focus on the classification problem since it is one of the most common data mining tasks. Because each classification technique has its own advantages, to be concrete, this paper concentrates on executing the k-nearest neighbor classification method over encrypted data in the cloud computing environment.

• B.K. Samanthula is with the Department of Computer Science, Purdue University, 305 N. University Street, West Lafayette, IN 47907. E-mail: bsamanth@purdue.edu.
• Y. Elmehdwi and W. Jiang are with the Department of Computer Science, Missouri University of Science and Technology, 310 CS Building, 500 West 15th St., Rolla, MO 65409. E-mail: {ymez76, wjiang}@mst.edu.

Manuscript received 23 Oct. 2013; revised 10 Sept. 2014; accepted 29 Sept. 2014. Date of publication 19 Oct. 2014; date of current version 27 Mar. 2015. Recommended for acceptance by G. Miklau. For information on obtaining reprints of this article, please send e-mail to: reprints@ieee.org, and reference the Digital Object Identifier below. Digital Object Identifier no. 10.1109/TKDE.2014.2364027

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 27, NO. 5, MAY 2015, p. 1261. 1041-4347 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

1.1 Problem Definition

Suppose Alice owns a database $D$ of $n$ records $t_1, \ldots, t_n$ and $m + 1$ attributes. Let $t_{i,j}$ denote the $j$th attribute value of record $t_i$. Initially, Alice encrypts her database attribute-wise, that is, she computes $E_{pk}(t_{i,j})$, for $1 \le i \le n$ and $1 \le j \le m + 1$, where column $(m + 1)$ contains the class labels. We assume that the underlying encryption scheme is semantically secure [4]. Let the encrypted database be denoted by $D'$. We assume that Alice outsources $D'$ as well as the future classification process to the cloud.

Let Bob be an authorized user who wants to classify his input record $q = \langle q_1, \ldots, q_m \rangle$ by applying the k-NN classification method based on $D'$. We refer to such a process as privacy-preserving k-NN (PPkNN) classification over encrypted data in the cloud. Formally, we define the PPkNN protocol as:

$\mathrm{PPkNN}(D', q) \rightarrow c_q,$

where $c_q$ denotes the class label for $q$ after applying the k-NN classification method on $D'$ and $q$.

1.2 Our Contributions

In this paper, we propose a novel PPkNN protocol, a secure k-NN classifier over semantically secure encrypted data. In our protocol, once the encrypted data are outsourced to the cloud, Alice does not participate in any computations. Therefore, no information is revealed to Alice. In addition, our protocol meets the following privacy requirements:

• Contents of $D$ or any intermediate results should not be revealed to the cloud.
• Bob's query $q$ should not be revealed to the cloud.
• $c_q$ should be revealed only to Bob. Also, no other information should be revealed to Bob.
• Data access patterns, such as the records corresponding to the k-nearest neighbors of $q$, should not be revealed to Bob and the cloud (to prevent any inference attacks).

We emphasize that the intermediate results seen by the cloud in our protocol are either newly generated randomized encryptions or random numbers. Thus, which data records correspond to the k-nearest neighbors and the output class label are not known to the cloud. In addition, after sending his encrypted query record to the cloud, Bob is not involved in any computations. Hence, data access patterns are further protected from Bob (see Section 5 for more details).

The rest of the paper is organized as follows. We discuss the existing related work and some concepts as a background in Section 2. A set of privacy-preserving protocols and their possible implementations are provided in Section 3. The formal security proofs for the mentioned privacy-preserving primitives are provided in Section 4. The proposed PPkNN protocol is explained in detail in Section 5. Section 6 discusses the performance of the proposed protocol under different parameter settings. We conclude the paper along with future work in Section 7.

2 RELATED WORK AND BACKGROUND

Due to space limitations, here we briefly review the existing related work and provide some definitions as a background. Please refer to our technical report [5] for a more elaborate discussion of related work and background.

At first, it seems that fully homomorphic cryptosystems (e.g., [6]) can solve the DMED problem since they allow a third party (that hosts the encrypted data) to execute arbitrary functions over encrypted data without ever decrypting them. However, we stress that such techniques are very expensive and their usage in practical applications has yet to be explored.
For example, it was shown in [7] that even for weak security parameters one "bootstrapping" operation of the homomorphic operation would take at least 30 seconds on a high performance machine.

It is possible to use the existing secret sharing techniques in SMC, such as Shamir's scheme [8], to develop a PPkNN protocol. However, our work is different from the secret sharing based solution in the following aspect. Solutions based on secret sharing schemes require at least three parties whereas our work requires only two parties. For example, the constructions based on Sharemind [9], a well-known SMC framework which is based on a secret sharing scheme, assume that the number of participating parties is three. Thus, our work is orthogonal to Sharemind and other secret sharing based schemes.

2.1 Privacy-Preserving Data Mining

Agrawal and Srikant [10] and Lindell and Pinkas [11] were the first to introduce the notion of privacy preservation in data mining applications. The existing PPDM techniques can broadly be classified into two categories: (i) data perturbation and (ii) data distribution. Agrawal and Srikant [10] proposed the first data perturbation technique to build a decision-tree classifier, and many other methods were proposed later (e.g., [12], [13], [14]). However, as mentioned earlier in Section 1, data perturbation techniques are not applicable to semantically secure encrypted data. Also, they do not produce accurate data mining results due to the addition of statistical noise to the data. On the other hand, Lindell and Pinkas [11] proposed the first decision tree classifier under the two-party setting assuming the data were distributed between the parties. Since then much work has been published using SMC techniques (e.g., [15], [16], [17]). We claim that the PPkNN problem cannot be solved using the data distribution techniques since the data in our case are encrypted and not distributed in plaintext among multiple parties.
For the same reasons, we also do not consider secure k-NN methods in which the data are distributed between two parties (e.g., [18]).

2.2 Query Processing over Encrypted Data

Various techniques related to query processing over encrypted data have been proposed, e.g., [19], [20], [21]. However, we observe that PPkNN is a more complex problem than the execution of simple kNN queries over encrypted data [22], [23]. First, the intermediate k-nearest neighbors in the classification process should not be disclosed to the cloud or any users. We emphasize that the recent method in [23] reveals the k-nearest neighbors to the user. Second, even if we know the k-nearest neighbors, it is still very difficult to find the majority class label among these neighbors since they are encrypted in the first place to prevent the cloud from learning sensitive information. Third, the existing work did not address the access pattern issue, which is a crucial privacy requirement from the user's perspective.

In our most recent work [24], we proposed a novel secure k-nearest neighbor query protocol over encrypted data that protects data confidentiality and the user's query privacy, and hides data access patterns. However, as mentioned above, PPkNN is a more complex problem and it cannot be solved directly using the existing secure k-nearest neighbor techniques over encrypted data. Therefore, in this paper, we extend our previous work in [24] and provide a new solution to the PPkNN classifier problem over encrypted data.

More specifically, this paper is different from our preliminary work [24] in the following four aspects. First, in this paper, we introduce new security primitives, namely secure minimum (SMIN), secure minimum out of n numbers ($\mathrm{SMIN}_n$), and secure frequency (SF), and propose new solutions for them. Second, the work in [24] did not provide any formal security analysis of the underlying sub-protocols.
On the other hand, this paper provides formal security proofs of the underlying sub-protocols as well as the PPkNN protocol under the semi-honest model. Additionally, we discuss various techniques through which the proposed PPkNN protocol can possibly be extended to a protocol that is secure under the malicious setting. Third, our preliminary work in [24] addresses only the secure kNN query, which is similar to Stage 1 of PPkNN. However, Stage 2 in PPkNN is entirely new. Finally, our empirical analyses in Section 6 are based on a real dataset whereas the results in [24] are based on a simulated dataset. Furthermore, new experimental results are included in this paper.

2.3 Threat Model

We adopt the security definitions in the literature of secure multi-party computation [25], [26]. There are three common adversarial models under SMC: semi-honest, covert, and malicious. In this paper, to develop secure and efficient protocols, we assume that parties are semi-honest. Briefly, the following definition captures the properties of a secure protocol under the semi-honest model [27], [28].

Definition 1. Let $a_i$ be the input of party $P_i$, $\Pi_i(\pi)$ be $P_i$'s execution image of the protocol $\pi$, and $b_i$ be the output for party $P_i$ computed from $\pi$. Then, $\pi$ is secure if $\Pi_i(\pi)$ can be simulated from $a_i$ and $b_i$ such that the distribution of the simulated image is computationally indistinguishable from $\Pi_i(\pi)$.

In the above definition, an execution image generally includes the input, the output, and the messages communicated during an execution of a protocol. To prove that a protocol is secure under the semi-honest model, we generally need to show that the execution image of a protocol does not leak any information regarding the private inputs of participating parties [28].

2.4 Paillier Cryptosystem

The Paillier cryptosystem is an additive homomorphic and probabilistic public-key encryption scheme whose security is based on the Decisional Composite Residuosity Assumption [4].
Let $E_{pk}$ be the encryption function with public key $pk$ given by $(N, g)$, where $N$ is a product of two large primes of similar bit length and $g$ is a generator in $\mathbb{Z}^*_{N^2}$. Also, let $D_{sk}$ be the decryption function with secret key $sk$. For any given two plaintexts $a, b \in \mathbb{Z}_N$, the Paillier encryption scheme exhibits the following properties:

(1) Homomorphic addition.
    $D_{sk}(E_{pk}(a + b)) = D_{sk}(E_{pk}(a) * E_{pk}(b) \bmod N^2)$.
(2) Homomorphic multiplication.
    $D_{sk}(E_{pk}(a * b)) = D_{sk}(E_{pk}(a)^b \bmod N^2)$.
(3) Semantic security. The encryption scheme is semantically secure [28], [29]. Briefly, given a set of ciphertexts, an adversary cannot deduce any additional information about the plaintext(s).

For succinctness, we drop the $\bmod\ N^2$ term during homomorphic operations in the rest of this paper.

3 PRIVACY-PRESERVING PRIMITIVES

Here we present a set of generic sub-protocols that will be used in constructing our proposed k-NN protocol in Section 5. All of the protocols below are considered under the two-party semi-honest setting. In particular, we assume the existence of two semi-honest parties $P_1$ and $P_2$ such that the Paillier secret key $sk$ is known only to $P_2$ whereas $pk$ is public.

• Secure multiplication (SM). This protocol considers $P_1$ with input $(E_{pk}(a), E_{pk}(b))$ and outputs $E_{pk}(a * b)$ to $P_1$, where $a$ and $b$ are not known to $P_1$ and $P_2$. During this process, no information regarding $a$ and $b$ is revealed to $P_1$ and $P_2$.
• Secure squared Euclidean distance (SSED). In this protocol, $P_1$ with input $(E_{pk}(X), E_{pk}(Y))$ and $P_2$ with $sk$ securely compute the encryption of the squared Euclidean distance between vectors $X$ and $Y$. Here $X$ and $Y$ are $m$-dimensional vectors where $E_{pk}(X) = \langle E_{pk}(x_1), \ldots, E_{pk}(x_m) \rangle$ and $E_{pk}(Y) = \langle E_{pk}(y_1), \ldots, E_{pk}(y_m) \rangle$. The output $E_{pk}(|X - Y|^2)$ will be known only to $P_1$.
• Secure bit-decomposition (SBD). Here $P_1$ with input $E_{pk}(z)$ and $P_2$ securely compute the encryptions of the individual bits of $z$, where $0 \le z < 2^l$. The output $[z] = \langle E_{pk}(z_1), \ldots, E_{pk}(z_l) \rangle$ is known only to $P_1$. Here $z_1$ and $z_l$ are the most and least significant bits of integer $z$, respectively.
• Secure minimum (SMIN). In this protocol, $P_1$ holds private input $(u', v')$ and $P_2$ holds $sk$, where $u' = ([u], E_{pk}(s_u))$ and $v' = ([v], E_{pk}(s_v))$. Here $s_u$ (resp., $s_v$) denotes the secret associated with $u$ (resp., $v$). The goal of SMIN is for $P_1$ and $P_2$ to jointly compute the encryptions of the individual bits of the minimum of $u$ and $v$. In addition, they compute $E_{pk}(s_{\min(u,v)})$. That is, the output is $([\min(u, v)], E_{pk}(s_{\min(u,v)}))$, which will be known only to $P_1$. During this protocol, no information regarding the contents of $u$, $v$, $s_u$, and $s_v$ is revealed to $P_1$ and $P_2$.
• Secure minimum out of n numbers ($\mathrm{SMIN}_n$). In this protocol, we consider $P_1$ with $n$ encrypted vectors $([d_1], \ldots, [d_n])$ along with their respective encrypted secrets and $P_2$ with $sk$. Here $[d_i] = \langle E_{pk}(d_{i,1}), \ldots, E_{pk}(d_{i,l}) \rangle$, where $d_{i,1}$ and $d_{i,l}$ are the most and least significant bits of integer $d_i$ respectively, for $1 \le i \le n$. The secret of $d_i$ is given by $s_{d_i}$. $P_1$ and $P_2$ jointly compute $[\min(d_1, \ldots, d_n)]$. In addition, they compute $E_{pk}(s_{\min(d_1,\ldots,d_n)})$. At the end of this protocol, the output $([\min(d_1, \ldots, d_n)], E_{pk}(s_{\min(d_1,\ldots,d_n)}))$ is known only to $P_1$. During $\mathrm{SMIN}_n$, no information regarding any of the $d_i$'s and their secrets is revealed to $P_1$ and $P_2$.
• Secure Bit-OR (SBOR). $P_1$ with input $(E_{pk}(o_1), E_{pk}(o_2))$ and $P_2$ securely compute $E_{pk}(o_1 \vee o_2)$, where $o_1$ and $o_2$ are two bits. The output $E_{pk}(o_1 \vee o_2)$ is known only to $P_1$.
• Secure frequency (SF). Here $P_1$ with private input $(\langle E_{pk}(c_1), \ldots, E_{pk}(c_w) \rangle, \langle E_{pk}(c'_1), \ldots, E_{pk}(c'_k) \rangle)$ and $P_2$ securely compute the encryption of the frequency of $c_j$, denoted by $f(c_j)$, in the list $\langle c'_1, \ldots, c'_k \rangle$, for $1 \le j \le w$. Here we explicitly assume that the $c_j$'s are unique and $c'_i \in \{c_1, \ldots, c_w\}$, for $1 \le i \le k$. The output $\langle E_{pk}(f(c_1)), \ldots, E_{pk}(f(c_w)) \rangle$ will be known only to $P_1$. During the SF protocol, no information regarding $c'_i$, $c_j$, and $f(c_j)$ is revealed to $P_1$ and $P_2$, for $1 \le i \le k$ and $1 \le j \le w$.

Now we either propose a new solution or refer to the most efficient known implementation for each of the above protocols. First of all, efficient solutions to SM, SSED, SBD, and SBOR were discussed in [24]. Therefore, in this paper, we discuss the SMIN, $\mathrm{SMIN}_n$, and SF problems in detail and propose new solutions to each one of them.

Secure minimum. In this protocol, we assume that $P_1$ holds private input $(u', v')$ and $P_2$ holds $sk$, where $u' = ([u], E_{pk}(s_u))$ and $v' = ([v], E_{pk}(s_v))$. Here $s_u$ and $s_v$ denote the secrets corresponding to $u$ and $v$, respectively. The main goal of SMIN is to securely compute the encryptions of the individual bits of $\min(u, v)$, denoted by $[\min(u, v)]$. Here $[u] = \langle E_{pk}(u_1), \ldots, E_{pk}(u_l) \rangle$ and $[v] = \langle E_{pk}(v_1), \ldots, E_{pk}(v_l) \rangle$, where $u_1$ (resp., $v_1$) and $u_l$ (resp., $v_l$) are the most and least significant bits of $u$ (resp., $v$), respectively. In addition, they compute $E_{pk}(s_{\min(u,v)})$, the encryption of the secret corresponding to the minimum value between $u$ and $v$. At the end of SMIN, the output $([\min(u, v)], E_{pk}(s_{\min(u,v)}))$ is known only to $P_1$.

We assume that $0 \le u, v < 2^l$ and propose a novel SMIN protocol. Our solution to SMIN is mainly motivated by the work of [24]. Precisely, the basic idea of the proposed SMIN protocol is for $P_1$ to randomly choose the functionality $F$ (by flipping a coin), where $F$ is either $u > v$ or $v > u$, and to obliviously execute $F$ with $P_2$. Since $F$ is randomly chosen and known only to $P_1$, the result of the functionality $F$ is oblivious to $P_2$. Based on the comparison result and the chosen $F$, $P_1$ computes $[\min(u, v)]$ and $E_{pk}(s_{\min(u,v)})$ locally using homomorphic properties.

Algorithm 1. $\mathrm{SMIN}(u', v') \rightarrow ([\min(u, v)], E_{pk}(s_{\min(u,v)}))$
Require: $P_1$ has $u' = ([u], E_{pk}(s_u))$ and $v' = ([v], E_{pk}(s_v))$, where $0 \le u, v < 2^l$; $P_2$ has $sk$
1: $P_1$:
   (a) Randomly choose the functionality $F$
   (b) for $i = 1$ to $l$ do:
       • $E_{pk}(u_i * v_i) \leftarrow \mathrm{SM}(E_{pk}(u_i), E_{pk}(v_i))$
       • $T_i \leftarrow E_{pk}(u_i \oplus v_i)$
       • $H_i \leftarrow H_{i-1}^{r_i} * T_i$; $r_i \in_R \mathbb{Z}_N$ and $H_0 = E_{pk}(0)$
       • $\Phi_i \leftarrow E_{pk}(-1) * H_i$
       • if $F: u > v$ then:
             $W_i \leftarrow E_{pk}(u_i) * E_{pk}(u_i * v_i)^{N-1}$
             $\Gamma_i \leftarrow E_{pk}(v_i - u_i) * E_{pk}(\hat{r}_i)$; $\hat{r}_i \in_R \mathbb{Z}_N$
         else
             $W_i \leftarrow E_{pk}(v_i) * E_{pk}(u_i * v_i)^{N-1}$
             $\Gamma_i \leftarrow E_{pk}(u_i - v_i) * E_{pk}(\hat{r}_i)$; $\hat{r}_i \in_R \mathbb{Z}_N$
       • $L_i \leftarrow W_i * \Phi_i^{r'_i}$; $r'_i \in_R \mathbb{Z}_N$
   (c) if $F: u > v$ then: $\delta \leftarrow E_{pk}(s_v - s_u) * E_{pk}(\bar{r})$
       else $\delta \leftarrow E_{pk}(s_u - s_v) * E_{pk}(\bar{r})$, where $\bar{r} \in_R \mathbb{Z}_N$
   (d) $\Gamma' \leftarrow \pi_1(\Gamma)$ and $L' \leftarrow \pi_2(L)$
   (e) Send $\delta$, $\Gamma'$, and $L'$ to $P_2$
2: $P_2$:
   (a) Receive $\delta$, $\Gamma'$, and $L'$ from $P_1$
   (b) Decryption: $M_i \leftarrow D_{sk}(L'_i)$, for $1 \le i \le l$
   (c) if $\exists\, j$ such that $M_j = 1$ then $\alpha \leftarrow 1$
       else $\alpha \leftarrow 0$
   (d) if $\alpha = 0$ then:
       • $M'_i \leftarrow E_{pk}(0)$, for $1 \le i \le l$
       • $\delta' \leftarrow E_{pk}(0)$
       else
       • $M'_i \leftarrow \Gamma'_i * r^N$, where $r \in_R \mathbb{Z}_N$ and is different for $1 \le i \le l$
       • $\delta' \leftarrow \delta * r_\delta^N$, where $r_\delta \in_R \mathbb{Z}_N$
   (e) Send $M'$, $E_{pk}(\alpha)$, and $\delta'$ to $P_1$
3: $P_1$:
   (a) Receive $M'$, $E_{pk}(\alpha)$, and $\delta'$ from $P_2$
   (b) $\tilde{M} \leftarrow \pi_1^{-1}(M')$ and $\theta \leftarrow \delta' * E_{pk}(\alpha)^{N-\bar{r}}$
   (c) $\lambda_i \leftarrow \tilde{M}_i * E_{pk}(\alpha)^{N-\hat{r}_i}$, for $1 \le i \le l$
   (d) if $F: u > v$ then:
       • $E_{pk}(s_{\min(u,v)}) \leftarrow E_{pk}(s_u) * \theta$
       • $E_{pk}(\min(u, v)_i) \leftarrow E_{pk}(u_i) * \lambda_i$, for $1 \le i \le l$
       else
       • $E_{pk}(s_{\min(u,v)}) \leftarrow E_{pk}(s_v) * \theta$
       • $E_{pk}(\min(u, v)_i) \leftarrow E_{pk}(v_i) * \lambda_i$, for $1 \le i \le l$

The overall steps involved in the SMIN protocol are shown in Algorithm 1. To start with, $P_1$ initially chooses the functionality $F$ as either $u > v$ or $v > u$ randomly. Then, using the SM protocol, $P_1$ computes $E_{pk}(u_i * v_i)$ with the help of $P_2$, for $1 \le i \le l$. After this, the protocol has the following key steps, performed by $P_1$ locally, for $1 \le i \le l$:

• Compute the encrypted bit-wise XOR between the bits $u_i$ and $v_i$ using the following formulation (see Footnote 1):
  $T_i = E_{pk}(u_i) * E_{pk}(v_i) * E_{pk}(u_i * v_i)^{N-2}$
• Compute an encrypted vector $H$ by preserving the first occurrence of $E_{pk}(1)$ (if there exists one) in $T$ by initializing $H_0 = E_{pk}(0)$. The rest of the entries of $H$ are computed as $H_i = H_{i-1}^{r_i} * T_i$. We emphasize that at most one of the entries in $H$ is $E_{pk}(1)$ and the remaining entries are encryptions of either 0 or a random number.
• Then, $P_1$ computes $\Phi_i = E_{pk}(-1) * H_i$. Note that "$-1$" is equivalent to "$N - 1$" under $\mathbb{Z}_N$. From the above discussion, it is clear that $\Phi_i = E_{pk}(0)$ at most once since $H_i$ is equal to $E_{pk}(1)$ at most once. Also, if $\Phi_j = E_{pk}(0)$, then index $j$ is the position at which the bits of $u$ and $v$ differ first (starting from the most significant bit position).

Footnote 1. In general, for any two given bits $o_1$ and $o_2$, the property $o_1 \oplus o_2 = o_1 + o_2 - 2(o_1 * o_2)$ always holds.

Now, depending on $F$, $P_1$ creates two encrypted vectors $W$ and $\Gamma$ as follows, for $1 \le i \le l$:

• If $F: u > v$, compute
  $W_i = E_{pk}(u_i * (1 - v_i))$,
  $\Gamma_i = E_{pk}(v_i - u_i) * E_{pk}(\hat{r}_i) = E_{pk}(v_i - u_i + \hat{r}_i)$.
• If $F: v > u$, compute
  $W_i = E_{pk}(v_i * (1 - u_i))$,
  $\Gamma_i = E_{pk}(u_i - v_i) * E_{pk}(\hat{r}_i) = E_{pk}(u_i - v_i + \hat{r}_i)$,

where $\hat{r}_i$ is a random number (hereafter denoted by $\in_R$) in $\mathbb{Z}_N$. The observation is that if $F: u > v$, then $W_i = E_{pk}(1)$ iff $u_i > v_i$, and $W_i = E_{pk}(0)$ otherwise. Similarly, when $F: v > u$, we have $W_i = E_{pk}(1)$ iff $v_i > u_i$, and $W_i = E_{pk}(0)$ otherwise. Also, depending on $F$, $\Gamma_i$ stores the encryption of the randomized difference between $u_i$ and $v_i$, which will be used in later computations.

After this, $P_1$ computes $L$ by combining $\Phi$ and $W$. More precisely, $P_1$ computes $L_i = W_i * \Phi_i^{r'_i}$, where $r'_i$ is a random number in $\mathbb{Z}_N$. The observation here is that if there exists an index $j$ such that $\Phi_j = E_{pk}(0)$, denoting the first flip in the bits of $u$ and $v$, then $W_j$ stores the corresponding desired information, i.e., whether $u_j > v_j$ or $v_j > u_j$, in encrypted form. In addition, depending on $F$, $P_1$ computes the encryption of the randomized difference between $s_u$ and $s_v$ and stores it in $\delta$. Specifically, if $F: u > v$, then $\delta = E_{pk}(s_v - s_u + \bar{r})$. Otherwise, $\delta = E_{pk}(s_u - s_v + \bar{r})$, where $\bar{r} \in_R \mathbb{Z}_N$.

After this, $P_1$ permutes the encrypted vectors $\Gamma$ and $L$ using two random permutation functions $\pi_1$ and $\pi_2$. Specifically, $P_1$ computes $\Gamma' = \pi_1(\Gamma)$ and $L' = \pi_2(L)$, and sends them along with $\delta$ to $P_2$. Upon receiving them, $P_2$ decrypts $L'$ component-wise to get $M_i = D_{sk}(L'_i)$, for $1 \le i \le l$, and checks for index $j$.
That is, if $M_j = 1$, then $P_2$ sets $\alpha$ to 1, otherwise sets it to 0. In addition, $P_2$ computes a new encrypted vector $M'$ depending on the value of $\alpha$. Precisely, if $\alpha = 0$, then $M'_i = E_{pk}(0)$, for $1 \le i \le l$. Here $E_{pk}(0)$ is different for each $i$. On the other hand, when $\alpha = 1$, $P_2$ sets $M'_i$ to the re-randomized value of $\Gamma'_i$. That is, $M'_i = \Gamma'_i * r^N$, where the term $r^N$ comes from re-randomization and $r \in_R \mathbb{Z}_N$ should be different for each $i$. Furthermore, $P_2$ computes $\delta' = E_{pk}(0)$ if $\alpha = 0$. However, when $\alpha = 1$, $P_2$ sets $\delta'$ to $\delta * r_\delta^N$, where $r_\delta$ is a random number in $\mathbb{Z}_N$. Then, $P_2$ sends $M'$, $E_{pk}(\alpha)$, and $\delta'$ to $P_1$. After receiving $M'$, $E_{pk}(\alpha)$, and $\delta'$, $P_1$ computes the inverse permutation of $M'$ as $\tilde{M} = \pi_1^{-1}(M')$. Then, $P_1$ performs the following homomorphic operations to compute the encryption of the $i$th bit of $\min(u, v)$, i.e., $E_{pk}(\min(u, v)_i)$, for $1 \le i \le l$:

• Remove the randomness from $\tilde{M}_i$ by computing
  $\lambda_i = \tilde{M}_i * E_{pk}(\alpha)^{N-\hat{r}_i}$
• If $F: u > v$, compute $E_{pk}(\min(u, v)_i) = E_{pk}(u_i) * \lambda_i = E_{pk}(u_i + \alpha * (v_i - u_i))$. Otherwise, compute $E_{pk}(\min(u, v)_i) = E_{pk}(v_i) * \lambda_i = E_{pk}(v_i + \alpha * (u_i - v_i))$.

Also, depending on $F$, $P_1$ computes $E_{pk}(s_{\min(u,v)})$ as follows. If $F: u > v$, $P_1$ computes $E_{pk}(s_{\min(u,v)}) = E_{pk}(s_u) * \theta$, where $\theta = \delta' * E_{pk}(\alpha)^{N-\bar{r}}$. Otherwise, he/she computes $E_{pk}(s_{\min(u,v)}) = E_{pk}(s_v) * \theta$.

In the SMIN protocol, one main observation (upon which we can also justify the correctness of the final output) is that if $F: u > v$, then $\min(u, v)_i = (1 - \alpha) * u_i + \alpha * v_i$ always holds, for $1 \le i \le l$. On the other hand, if $F: v > u$, then $\min(u, v)_i = \alpha * u_i + (1 - \alpha) * v_i$ always holds. Similar conclusions can be drawn for $s_{\min(u,v)}$. We emphasize that using similar formulations one can also design a SMAX protocol to compute $[\max(u, v)]$ and $E_{pk}(s_{\max(u,v)})$. Also, we stress that there can be multiple secrets of $u$ and $v$ that can be fed as input (in encrypted form) to SMIN and SMAX. For example, let $s^1_u$ and $s^2_u$ (resp., $s^1_v$ and $s^2_v$) be two secrets associated with $u$ (resp., $v$). Then the SMIN protocol takes $([u], E_{pk}(s^1_u), E_{pk}(s^2_u))$ and $([v], E_{pk}(s^1_v), E_{pk}(s^2_v))$ as $P_1$'s input and outputs $[\min(u, v)]$, $E_{pk}(s^1_{\min(u,v)})$, and $E_{pk}(s^2_{\min(u,v)})$ to $P_1$.

Example 2. For simplicity, consider that $u = 55$, $v = 58$, and $l = 6$. Let $s_u$ and $s_v$ be the secrets associated with $u$ and $v$, respectively. Assume that $P_1$ holds $([55], E_{pk}(s_u))$ and $([58], E_{pk}(s_v))$. In addition, we assume that $P_1$'s random permutation functions are as given below. Without loss of generality, suppose $P_1$ chooses the functionality $F: v > u$. Then, various intermediate results based on the SMIN protocol are as shown in Table 1. Following from Table 1, we observe that:

• At most one of the entries in $H$ is $E_{pk}(1)$, namely $H_3$, and the remaining entries are encryptions of either 0 or a random number in $\mathbb{Z}_N$.
• Index $j = 3$ is the first position at which the corresponding bits of $u$ and $v$ differ.
• $\Phi_3 = E_{pk}(0)$ since $H_3$ is equal to $E_{pk}(1)$. Also, since $M_5 = 1$, $P_2$ sets $\alpha$ to 1.
• $E_{pk}(s_{\min(u,v)}) = E_{pk}(\alpha * s_u + (1 - \alpha) * s_v) = E_{pk}(s_u)$.

At the end, only $P_1$ knows $[\min(u, v)] = [u] = [55]$ and $E_{pk}(s_{\min(u,v)}) = E_{pk}(s_u)$.

TABLE 1. P1 chooses F as v > u, where u = 55 and v = 58

[u]  [v]  W_i  Γ_i    G_i (XOR of u_i, v_i)  H_i  Φ_i  L_i  Γ'_i   L'_i  M_i  λ_i  min_i
1    1    0    r      0                      0    −1   r    1+r    r     r    0    1
1    1    0    r      0                      0    −1   r    r      r     r    0    1
0    1    1    −1+r   1                      1    0    1    1+r    r     r    −1   0
1    0    0    1+r    1                      r    r    r    −1+r   r     r    1    1
1    1    0    r      0                      r    r    r    r      1     1    0    1
1    0    0    1+r    1                      r    r    r    r      r     r    1    1

All column values are in encrypted form except the M_i column. Also, r ∈_R Z_N is different for each row and column.

P1's random permutation functions:

i      = 1 2 3 4 5 6
         ↓ ↓ ↓ ↓ ↓ ↓
π_1(i) = 6 5 4 3 2 1
π_2(i) = 2 1 5 6 3 4

Secure minimum out of n numbers. Consider $P_1$ with private input $([d_1], \ldots, [d_n])$ along with their encrypted secrets and $P_2$ with $sk$, where $0 \le d_i < 2^l$ and $[d_i] = \langle E_{pk}(d_{i,1}), \ldots, E_{pk}(d_{i,l}) \rangle$, for $1 \le i \le n$. Here the encrypted secret of $d_i$ is denoted by $E_{pk}(s_{d_i})$, for $1 \le i \le n$. The main goal of the $\mathrm{SMIN}_n$ protocol is to compute $[\min(d_1, \ldots, d_n)] = [d_{\min}]$ without revealing any information about the $d_i$'s to $P_1$ and $P_2$. In addition, they compute the encryption of the secret corresponding to the global minimum, denoted by $E_{pk}(s_{d_{\min}})$. Here we construct a new $\mathrm{SMIN}_n$ protocol by utilizing SMIN as the building block. The proposed $\mathrm{SMIN}_n$ protocol is an iterative approach and it computes the desired output in a hierarchical fashion. In each iteration, the minimum between a pair of values and the secret corresponding to the minimum value are computed (in encrypted form) and fed as input to the next iteration, thus generating a binary execution tree in a bottom-up fashion. At the end, only $P_1$ knows the final result $[d_{\min}]$ and $E_{pk}(s_{d_{\min}})$.

Algorithm 2. $\mathrm{SMIN}_n(([d_1], E_{pk}(s_{d_1})), \ldots, ([d_n], E_{pk}(s_{d_n}))) \rightarrow ([d_{\min}], E_{pk}(s_{d_{\min}}))$
Require: $P_1$ has $(([d_1], E_{pk}(s_{d_1})), \ldots, ([d_n], E_{pk}(s_{d_n})))$; $P_2$ has $sk$
1: $P_1$:
   (a) $[d'_i] \leftarrow [d_i]$ and $s'_i \leftarrow E_{pk}(s_{d_i})$, for $1 \le i \le n$
   (b) $num \leftarrow n$
2: for $i = 1$ to $\lceil \log_2 n \rceil$:
   (a) for $1 \le j \le \lfloor num/2 \rfloor$:
       • if $i = 1$ then:
             $([d'_{2j-1}], s'_{2j-1}) \leftarrow \mathrm{SMIN}(x, y)$, where $x = ([d'_{2j-1}], s'_{2j-1})$ and $y = ([d'_{2j}], s'_{2j})$
             $[d'_{2j}] \leftarrow 0$ and $s'_{2j} \leftarrow 0$
         else
             $([d'_{2^i(j-1)+1}], s'_{2^i(j-1)+1}) \leftarrow \mathrm{SMIN}(x, y)$, where $x = ([d'_{2^i(j-1)+1}], s'_{2^i(j-1)+1})$ and $y = ([d'_{2^i j-1}], s'_{2^i j-1})$
             $[d'_{2^i j-1}] \leftarrow 0$ and $s'_{2^i j-1} \leftarrow 0$
   (b) $num \leftarrow \lceil num/2 \rceil$
3: $P_1$: $[d_{\min}] \leftarrow [d'_1]$ and $E_{pk}(s_{d_{\min}}) \leftarrow s'_1$

The overall steps involved in the proposed $\mathrm{SMIN}_n$ protocol are highlighted in Algorithm 2. Initially, $P_1$ assigns $[d_i]$ and $E_{pk}(s_{d_i})$ to a temporary vector $[d'_i]$ and variable $s'_i$, for $1 \le i \le n$, respectively. Also, he/she creates a global variable $num$ and initializes it to $n$, where $num$ represents the number of (non-zero) vectors involved in each iteration. Since the $\mathrm{SMIN}_n$ protocol executes in a binary tree hierarchy (bottom-up fashion), we have $\lceil \log_2 n \rceil$ iterations, and in each iteration, the number of vectors involved varies. In the first iteration (i.e., $i = 1$), $P_1$ with private input $(([d'_{2j-1}], s'_{2j-1}), ([d'_{2j}], s'_{2j}))$ and $P_2$ with $sk$ engage in the SMIN protocol, for $1 \le j \le \lfloor num/2 \rfloor$.
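The pairwise, bottom-up reduction that Algorithm 2 performs can be sketched on plaintext values. This is only an illustration under stated assumptions: `smin_plain` and `sminn_plain` are hypothetical names, the comparison is done in the clear instead of through the encrypted SMIN sub-protocol, and the surviving slots are tracked in an explicit list rather than through the index arithmetic printed in Algorithm 2.

```python
from math import ceil, log2

def smin_plain(x, y):
    """Plaintext stand-in for SMIN: each argument is a (value, secret) pair."""
    return x if x[0] <= y[0] else y

def sminn_plain(items):
    """Tournament-style minimum over (value, secret) pairs.

    Returns the winning pair and the number of rounds executed.
    """
    slots = list(items)                 # plays the role of d'_1 .. d'_n
    alive = list(range(len(slots)))     # slots that still hold a candidate
    rounds = ceil(log2(len(slots)))     # ceil(log2 n) iterations
    for _ in range(rounds):
        survivors = []
        # Pair up neighbouring live slots: floor(num / 2) comparisons.
        for j in range(0, len(alive) - 1, 2):
            a, b = alive[j], alive[j + 1]
            slots[a] = smin_plain(slots[a], slots[b])
            slots[b] = None             # the paper resets this slot to 0
            survivors.append(a)
        if len(alive) % 2 == 1:         # an unpaired slot advances unchanged
            survivors.append(alive[-1])
        alive = survivors               # num becomes ceil(num / 2)
    return slots[0], rounds
```

For six inputs, the sketch runs three rounds with 3, 1, and 1 comparisons, and the winner ends up in the first slot, mirroring how the final result is read from $[d'_1]$ and $s'_1$.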
At the end of the first iteration, only $P_1$ knows $[\min(d'_{2j-1}, d'_{2j})]$ and $s'_{\min(d'_{2j-1}, d'_{2j})}$, and nothing is revealed to $P_2$, for $1 \le j \le \lfloor num/2 \rfloor$. Also, $P_1$ stores the results $[\min(d'_{2j-1}, d'_{2j})]$ and $s'_{\min(d'_{2j-1}, d'_{2j})}$ in $[d'_{2j-1}]$ and $s'_{2j-1}$, respectively. In addition, $P_1$ updates the values of $[d'_{2j}]$ and $s'_{2j}$ to 0 and $num$ to $\lceil num/2 \rceil$.

During the $i$th iteration, only the non-zero vectors (along with the corresponding encrypted secrets) are involved in SMIN, for $2 \le i \le \lceil \log_2 n \rceil$. For example, during the second iteration (i.e., $i = 2$), only $([d'_1], s'_1)$, $([d'_3], s'_3)$, and so on are involved. Note that in each iteration, the output is revealed only to $P_1$ and $num$ is updated to $\lceil num/2 \rceil$. At the end of $\mathrm{SMIN}_n$, $P_1$ assigns the final encrypted binary vector of the global minimum value, i.e., $[\min(d_1, \ldots, d_n)]$, which is stored in $[d'_1]$, to $[d_{\min}]$. Also, $P_1$ assigns $s'_1$ to $E_{pk}(s_{d_{\min}})$.

Example 3. Suppose $P_1$ holds $\langle [d_1], \ldots, [d_6] \rangle$ (i.e., $n = 6$). For simplicity, here we assume that there are no secrets associated with the $d_i$'s. Then, based on the $\mathrm{SMIN}_n$ protocol, the binary execution tree (in a bottom-up fashion) to compute $[\min(d_1, \ldots, d_6)]$ is shown in Fig. 1. Note that, initially, $[d'_i] = [d_i]$.

Secure frequency. Let us consider a situation where $P_1$ holds private input $(\langle E_{pk}(c_1), \ldots, E_{pk}(c_w) \rangle, \langle E_{pk}(c'_1), \ldots, E_{pk}(c'_k) \rangle)$ and $P_2$ holds the secret key $sk$. The goal of the SF protocol is to securely compute $E_{pk}(f(c_j))$, for $1 \le j \le w$. Here $f(c_j)$ denotes the number of times element $c_j$ occurs (i.e., its frequency) in the list $\langle c'_1, \ldots, c'_k \rangle$. We explicitly assume that $c'_i \in \{c_1, \ldots, c_w\}$, for $1 \le i \le k$.

The output $\langle E_{pk}(f(c_1)), \ldots, E_{pk}(f(c_w)) \rangle$ is revealed only to $P_1$. During the SF protocol, neither $c'_i$ nor $c_j$ is revealed to $P_1$ and $P_2$. Also, $f(c_j)$ is kept private from both $P_1$ and $P_2$, for $1 \le i \le k$ and $1 \le j \le w$.

The overall steps involved in the proposed SF protocol are shown in Algorithm 3. To start with, $P_1$ initially computes an encrypted vector $S_i$ such that $S_{i,j} = E_{pk}(c_j - c'_i)$, for $1 \le j \le w$.
Then, $P_1$ randomizes $S_i$ component-wise to get $S'_{i,j} = E_{pk}(r_{i,j} * (c_j - c'_i))$, where $r_{i,j}$ is a random number in $\mathbb{Z}_N$. After this, for $1 \le i \le k$, $P_1$ randomly permutes $S'_i$ component-wise using a random permutation function $\pi_i$ (known only to $P_1$). The output $Z_i \leftarrow \pi_i(S'_i)$ is sent to $P_2$. Upon receiving it, $P_2$ decrypts $Z_i$ component-wise, computes a vector $u_i$, and proceeds as follows:

• If $D_{sk}(Z_{i,j}) = 0$, then $u_{i,j}$ is set to 1. Otherwise, $u_{i,j}$ is set to 0.
• The observation is that, since $c'_i \in \{c_1, \ldots, c_w\}$, exactly one of the entries in vector $Z_i$ is an encryption of 0 and the rest are encryptions of random numbers. This further implies that exactly one of the decrypted values of $Z_i$ is 0 and the rest are random numbers. Precisely, if $u_{i,j} = 1$, then $c'_i = c_{\pi_i^{-1}(j)}$.
• Compute $U_{i,j} = E_{pk}(u_{i,j})$ and send it to $P_1$, for $1 \le i \le k$ and $1 \le j \le w$.

Then, $P_1$ performs a row-wise inverse permutation on it to get $V_i = \pi_i^{-1}(U_i)$, for $1 \le i \le k$. Finally, $P_1$ computes $E_{pk}(f(c_j)) = \prod_{i=1}^{k} V_{i,j}$ locally, for $1 \le j \le w$.

Fig. 1. Binary execution tree for n = 6 based on $\mathrm{SMIN}_n$.

Algorithm 3. $\mathrm{SF}(\Lambda, \Lambda') \rightarrow \langle E_{pk}(f(c_1)), \ldots, E_{pk}(f(c_w)) \rangle$
Require: $P_1$ has $\Lambda = \langle E_{pk}(c_1), \ldots, E_{pk}(c_w) \rangle$, $\Lambda' = \langle E_{pk}(c'_1), \ldots, E_{pk}(c'_k) \rangle$ and $\langle \pi_1, \ldots, \pi_k \rangle$; $P_2$ has $sk$
1: $P_1$:
   (a) for $i = 1$ to $k$ do:
       • $T_i \leftarrow E_{pk}(c'_i)^{N-1}$
       • for $j = 1$ to $w$ do:
             $S_{i,j} \leftarrow E_{pk}(c_j) * T_i$
             $S'_{i,j} \leftarrow S_{i,j}^{r_{i,j}}$, where $r_{i,j} \in_R \mathbb{Z}_N$
       • $Z_i \leftarrow \pi_i(S'_i)$
   (b) Send $Z$ to $P_2$
2: $P_2$:
   (a) Receive $Z$ from $P_1$
   (b) for $i = 1$ to $k$ do:
       • for $j = 1$ to $w$ do:
             if $D_{sk}(Z_{i,j}) = 0$ then $u_{i,j} \leftarrow 1$
             else $u_{i,j} \leftarrow 0$
             $U_{i,j} \leftarrow E_{pk}(u_{i,j})$
   (c) Send $U$ to $P_1$
3: $P_1$:
   (a) Receive $U$ from $P_2$
   (b) $V_i \leftarrow \pi_i^{-1}(U_i)$, for $1 \le i \le k$
   (c) $E_{pk}(f(c_j)) \leftarrow \prod_{i=1}^{k} V_{i,j}$, for $1 \le j \le w$

4 SECURITY ANALYSIS OF PRIVACY-PRESERVING PRIMITIVES UNDER THE SEMI-HONEST MODEL

First of all, we emphasize that the outputs in the above mentioned protocols are always in encrypted format, and are known only to $P_1$.
Also, all the intermediate results revealed to P2 are either random or pseudo-random.

Since the proposed SMIN protocol (which is used as a sub-routine in SMIN$_n$) is more complex than the other protocols mentioned above, and due to space limitations, we provide its security proof rather than proofs for each protocol. Therefore, here we only include a formal security proof for the SMIN protocol based on the standard simulation argument [28]. Nevertheless, we stress that similar proof strategies can be used to show that the other protocols are secure under the semi-honest model. For completeness, we provide the security proofs for the other protocols in our technical report [5].

4.1 Proof of Security for SMIN

As mentioned in Section 2.3, to formally prove that SMIN is secure [28] under the semi-honest model, we need to show that the simulated image of SMIN is computationally indistinguishable from the actual execution image of SMIN.

An execution image generally includes the messages exchanged and the information computed from these messages. Therefore, according to Algorithm 1, let the execution image of P2 be denoted by $\Pi_{P_2}(\mathrm{SMIN})$, given by

$\{\langle \delta, s + r \bmod N \rangle, \langle \Gamma'_i, \mu_i + \hat{r}_i \bmod N \rangle, \langle L'_i, \alpha \rangle\}$.

Observe that $s + r \bmod N$ and $\mu_i + \hat{r}_i \bmod N$ are derived upon decrypting $\delta$ and $\Gamma'_i$, for $1 \le i \le l$, respectively. Note that the modulo operator is implicit in the decryption function. Also, P2 receives $L'$ from P1; let $\alpha$ denote the (oblivious) comparison result computed from $L'$. Without loss of generality, suppose the simulated image of P2 is $\Pi^S_{P_2}(\mathrm{SMIN})$, given by

$\{\langle \delta^*, r^* \rangle, \langle s'_{1,i}, s'_{2,i} \rangle, \langle s'_{3,i}, \alpha' \rangle \mid \text{for } 1 \le i \le l\}$.

Here $\delta^*$, $s'_{1,i}$ and $s'_{3,i}$ are randomly generated from $\mathbb{Z}_{N^2}$, whereas $r^*$ and $s'_{2,i}$ are randomly generated from $\mathbb{Z}_N$. In addition, $\alpha'$ denotes a random bit. Since $E_{pk}$ is a semantically secure encryption scheme with resulting ciphertext size less than $N^2$, $\delta$ is computationally indistinguishable from $\delta^*$. Similarly, $\Gamma'_i$ and $L'_i$ are computationally indistinguishable from $s'_{1,i}$ and $s'_{3,i}$, respectively. Also, as $r$ and $\hat{r}_i$ are randomly generated from $\mathbb{Z}_N$, $s + r \bmod N$ and $\mu_i + \hat{r}_i \bmod N$ are computationally indistinguishable from $r^*$ and $s'_{2,i}$, respectively. Furthermore, because the functionality is randomly chosen by P1 (at step 1(a) of Algorithm 1), $\alpha$ is either 0 or 1 with equal probability. Thus, $\alpha$ is computationally indistinguishable from $\alpha'$. Combining all these results, we can conclude that $\Pi_{P_2}(\mathrm{SMIN})$ is computationally indistinguishable from $\Pi^S_{P_2}(\mathrm{SMIN})$ based on Definition 1. This implies that during the execution of SMIN, P2 does not learn any information regarding $u$, $v$, $s_u$, $s_v$ and the actual comparison result. Intuitively speaking, the information P2 has during an execution of SMIN is either random or pseudo-random, so this information does not disclose anything regarding $u$, $v$, $s_u$ and $s_v$. Additionally, as $F$ is known only to P1, the actual comparison result is oblivious to P2.

On the other hand, the execution image of P1, denoted by $\Pi_{P_1}(\mathrm{SMIN})$, is given by

$\Pi_{P_1}(\mathrm{SMIN}) = \{M'_i, E_{pk}(\alpha), \delta' \mid \text{for } 1 \le i \le l\}$.

$M'_i$ and $\delta'$ are encrypted values, which are random in $\mathbb{Z}_{N^2}$, received from P2 (at step 3(a) of Algorithm 1). Let the simulated image of P1 be $\Pi^S_{P_1}(\mathrm{SMIN})$, where

$\Pi^S_{P_1}(\mathrm{SMIN}) = \{s'_{4,i}, b', b'' \mid \text{for } 1 \le i \le l\}$.

The values $s'_{4,i}$, $b'$ and $b''$ are randomly generated from $\mathbb{Z}_{N^2}$. Since $E_{pk}$ is a semantically secure encryption scheme with resulting ciphertext size less than $N^2$, it follows that $M'_i$, $E_{pk}(\alpha)$ and $\delta'$ are computationally indistinguishable from $s'_{4,i}$, $b'$ and $b''$, respectively. Therefore, $\Pi_{P_1}(\mathrm{SMIN})$ is computationally indistinguishable from $\Pi^S_{P_1}(\mathrm{SMIN})$ based on Definition 1. As a result, P1 cannot learn any information regarding $u$, $v$, $s_u$, $s_v$ and the comparison result during the execution of SMIN. Putting everything together, we claim that the proposed SMIN protocol is secure under the semi-honest model (according to Definition 1).

5 THE PROPOSED PPKNN PROTOCOL

In this section, we propose a novel privacy-preserving k-NN classification protocol, denoted by PPkNN, which is constructed using the protocols discussed in Section 3 as building blocks. As mentioned earlier, we assume that Alice's database consists of $n$ records, denoted by $D = \langle t_1, \ldots, t_n \rangle$, and $m + 1$ attributes, where $t_{i,j}$ denotes the $j$th attribute value of record $t_i$. Initially, Alice encrypts her database attribute-wise, that is, she computes $E_{pk}(t_{i,j})$, for $1 \le i \le n$ and $1 \le j \le m + 1$, where column $(m + 1)$ contains the class labels. Let the encrypted database be denoted by $D'$. We assume that Alice outsources $D'$ as well as the future classification process to the cloud. Without loss of generality, we assume that all attribute values and their Euclidean distances lie in $[0, 2^l)$. In addition, let $w$ denote the number of unique class labels in $D$.

In our problem setting, we assume the existence of two non-colluding semi-honest cloud service providers, denoted by C1 and C2, which together form a federated cloud. Under this setting, Alice outsources her encrypted database $D'$ to C1 and the secret key $sk$ to C2. Here it is possible for the data owner Alice to replace C2 with her private server. However, if Alice has a private server, we can argue that there is no need for data outsourcing from Alice's point of view. The main purpose of using C2 can be motivated by the following two reasons. (i) With limited computing resources and technical expertise, it is in the best interest of Alice to completely outsource her data management and operational tasks to a cloud.
For example, Alice may want to access her data and analytical results using a smart phone or any device with very limited computing capability. (ii) Suppose Bob wants to keep his input query and access patterns private from Alice. In this case, if Alice uses a private server, then she has to perform the computations assumed by C2, which negates the very purpose of outsourcing the encrypted data to C1.

In general, whether Alice uses a private server or cloud service provider C2 depends on her resources. In our problem setting, we prefer to use C2 as this avoids the above mentioned disadvantages (i.e., in the case of Alice using a private server) altogether. In our solution, after outsourcing encrypted data to the cloud, Alice does not participate in any future computations.

The goal of the PPkNN protocol is to classify users' query records using $D'$ in a privacy-preserving manner. Consider an authorized user Bob who wants to classify his query record $q = \langle q_1, \ldots, q_m \rangle$ based on $D'$ in C1. The proposed PPkNN protocol mainly consists of the following two stages:

- Stage 1: Secure Retrieval of k-Nearest Neighbors (SRkNN). In this stage, Bob initially sends his query $q$ (in encrypted form) to C1. After this, C1 and C2 engage in a set of sub-protocols to securely retrieve (in encrypted form) the class labels corresponding to the k-nearest neighbors of the input query $q$. At the end of this step, the encrypted class labels of the k-nearest neighbors are known only to C1.
- Stage 2: Secure Computation of Majority Class (SCMC$_k$). Following Stage 1, C1 and C2 jointly compute the class label with a majority vote among the k-nearest neighbors of $q$. At the end of this step, only Bob knows the class label corresponding to his input query record $q$.

The main steps involved in the proposed PPkNN protocol are shown in Algorithm 4. We now explain each of the two stages in PPkNN in detail.

Algorithm 4. PPkNN$(D', q) \rightarrow c_q$
Require: C1 has $D'$ and $\pi$; C2 has $sk$; Bob has $q$
1: Bob:
   (a). Compute $E_{pk}(q_j)$, for $1 \le j \le m$
   (b). Send $E_{pk}(q) = \langle E_{pk}(q_1), \ldots, E_{pk}(q_m) \rangle$ to C1
2: C1 and C2:
   (a). C1 receives $E_{pk}(q)$ from Bob
   (b). for $i = 1$ to $n$ do:
        - $E_{pk}(d_i) \leftarrow \mathrm{SSED}(E_{pk}(q), E_{pk}(t_i))$
        - $[d_i] \leftarrow \mathrm{SBD}(E_{pk}(d_i))$
3: for $s = 1$ to $k$ do:
   (a). C1 and C2:
        - $([d_{\min}], E_{pk}(I), E_{pk}(c')) \leftarrow \mathrm{SMIN}_n(\theta_1, \ldots, \theta_n)$, where $\theta_i = ([d_i], E_{pk}(I_{t_i}), E_{pk}(t_{i,m+1}))$
        - $E_{pk}(c'_s) \leftarrow E_{pk}(c')$
   (b). C1:
        - $\Delta \leftarrow E_{pk}(I)^{N-1}$
        - for $i = 1$ to $n$ do:
             $\tau_i \leftarrow E_{pk}(i) * \Delta$
             $\tau'_i \leftarrow \tau_i^{r_i}$, where $r_i \in_R \mathbb{Z}_N$
        - $\beta \leftarrow \pi(\tau')$; send $\beta$ to C2
   (c). C2:
        - $\beta'_i \leftarrow D_{sk}(\beta_i)$, for $1 \le i \le n$
        - Compute $U'$, for $1 \le i \le n$:
             if $\beta'_i = 0$, then $U'_i = E_{pk}(1)$
             otherwise, $U'_i = E_{pk}(0)$
        - Send $U'$ to C1
   (d). C1: $V \leftarrow \pi^{-1}(U')$
   (e). C1 and C2, for $1 \le i \le n$ and $1 \le \gamma \le l$:
        - $E_{pk}(d_{i,\gamma}) \leftarrow \mathrm{SBOR}(V_i, E_{pk}(d_{i,\gamma}))$
4: SCMC$_k(E_{pk}(c'_1), \ldots, E_{pk}(c'_k))$

5.1 Stage 1: Secure Retrieval of k-Nearest Neighbors

During Stage 1, Bob initially encrypts his query $q$ attribute-wise, that is, he computes $E_{pk}(q) = \langle E_{pk}(q_1), \ldots, E_{pk}(q_m) \rangle$ and sends it to C1. The main steps involved in Stage 1 are shown as steps 1 to 3 in Algorithm 4. Upon receiving $E_{pk}(q)$, C1 with private input $(E_{pk}(q), E_{pk}(t_i))$ and C2 with the secret key $sk$ jointly engage in the SSED protocol. Here $E_{pk}(t_i) = \langle E_{pk}(t_{i,1}), \ldots, E_{pk}(t_{i,m}) \rangle$, for $1 \le i \le n$. The output of this step, denoted by $E_{pk}(d_i)$, is the encryption of the squared Euclidean distance between $q$ and $t_i$, i.e., $d_i = |q - t_i|^2$. As mentioned earlier, $E_{pk}(d_i)$ is known only to C1, for $1 \le i \le n$. We emphasize that computing the exact Euclidean distance between encrypted vectors is hard to achieve as it involves a square root. However, in our problem, it is sufficient to compare the squared Euclidean distances as squaring preserves relative ordering. Then, C1 with input $E_{pk}(d_i)$ and C2 securely compute the encryptions of the individual bits of $d_i$ using the SBD protocol. Note that the output $[d_i] = \langle E_{pk}(d_{i,1}), \ldots, E_{pk}(d_{i,l}) \rangle$ is known only to C1, where $d_{i,1}$ and $d_{i,l}$ are the most and least significant bits of $d_i$, for $1 \le i \le n$, respectively.

After this, C1 and C2 compute the encryptions of the class labels corresponding to the k-nearest neighbors of $q$ in an iterative manner. More specifically, they compute $E_{pk}(c'_1)$ in the first iteration, $E_{pk}(c'_2)$ in the second iteration, and so on. Here $c'_s$ denotes the class label of the $s$th nearest neighbor to $q$, for $1 \le s \le k$. At the end of $k$ iterations, only C1 knows $\langle E_{pk}(c'_1), \ldots, E_{pk}(c'_k) \rangle$. To start with, consider the first iteration. C1 and C2 jointly compute the encryptions of the individual bits of the minimum value among $d_1, \ldots, d_n$ and the encryptions of the location and class label corresponding to $d_{\min}$ using the SMIN$_n$ protocol. That is, C1 with input $(\theta_1, \ldots, \theta_n)$ and C2 with $sk$ compute $([d_{\min}], E_{pk}(I), E_{pk}(c'))$, where $\theta_i = ([d_i], E_{pk}(I_{t_i}), E_{pk}(t_{i,m+1}))$, for $1 \le i \le n$. Here $d_{\min}$ denotes the minimum value among $d_1, \ldots, d_n$; $I_{t_i}$ and $t_{i,m+1}$ denote the unique identifier and class label corresponding to the data record $t_i$, respectively. Specifically, $(I_{t_i}, t_{i,m+1})$ is the secret information associated with $t_i$. For simplicity, this paper assumes $I_{t_i} = i$. In the output, $I$ and $c'$ denote the index and class label corresponding to $d_{\min}$. The output $([d_{\min}], E_{pk}(I), E_{pk}(c'))$ is known only to C1. Now, C1 performs the following operations locally:

- Assign $E_{pk}(c')$ to $E_{pk}(c'_1)$. Remember that, according to the SMIN$_n$ protocol, $c'$ is equivalent to the class label of the data record that corresponds to $d_{\min}$. Thus, it is the same as the class label of the nearest neighbor to $q$.
- Compute the encryption of the difference between $I$ and $i$, where $1 \le i \le n$. That is, C1 computes $\tau_i = E_{pk}(i) * E_{pk}(I)^{N-1} = E_{pk}(i - I)$, for $1 \le i \le n$.
- Randomize $\tau_i$ to get $\tau'_i = \tau_i^{r_i} = E_{pk}(r_i * (i - I))$, where $r_i$ is a random number in $\mathbb{Z}_N$. Note that $\tau'_i$ is an encryption of either 0 or a random number, for $1 \le i \le n$. Also, it is worth noting that exactly one of the entries in $\tau'$ is an encryption of 0 (which happens iff $i = I$) and the rest are encryptions of random numbers. Permute $\tau'$ using a random permutation function $\pi$ (known only to C1) to get $\beta = \pi(\tau')$ and send it to C2.

Upon receiving $\beta$, C2 decrypts it component-wise to get $\beta'_i = D_{sk}(\beta_i)$, for $1 \le i \le n$. After this, C2 computes an encrypted vector $U'$ of length $n$ such that $U'_i = E_{pk}(1)$ if $\beta'_i = 0$, and $E_{pk}(0)$ otherwise. Since exactly one of the entries in $\tau'$ is an encryption of 0, exactly one of the entries in $U'$ is an encryption of 1 and the rest are encryptions of 0's. It is important to note that if $\beta'_k = 0$, then $\pi^{-1}(k)$ is the index of the data record that corresponds to $d_{\min}$. Then, C2 sends $U'$ to C1. After receiving $U'$, C1 performs the inverse permutation on it to get $V = \pi^{-1}(U')$. Note that exactly one of the entries in $V$ is $E_{pk}(1)$ and the remaining entries are encryptions of 0's. In addition, if $V_i = E_{pk}(1)$, then $t_i$ is the nearest tuple to $q$. However, C1 and C2 do not know which entry in $V$ corresponds to $E_{pk}(1)$.

Finally, C1 updates the distance vectors $[d_i]$ for the following reason:

- It is important to note that the first nearest tuple to $q$ should be obliviously excluded from further computations. However, since C1 does not know the record corresponding to $E_{pk}(c'_1)$, we need to obliviously eliminate the possibility of choosing this record again in later iterations. For this, C1 obliviously updates the distance corresponding to $E_{pk}(c'_1)$ to the maximum value, i.e., $2^l - 1$. More specifically, C1 updates the distance vectors with the help of C2 using the SBOR protocol as below, for $1 \le i \le n$ and $1 \le \gamma \le l$:

  $E_{pk}(d_{i,\gamma}) = \mathrm{SBOR}(V_i, E_{pk}(d_{i,\gamma}))$.

  Note that when $V_i = E_{pk}(1)$, the corresponding distance vector $d_i$ is set to the maximum value. That is, in this case, $[d_i] = \langle E_{pk}(1), \ldots, E_{pk}(1) \rangle$. On the other hand, when $V_i = E_{pk}(0)$, the OR operation has no effect on the corresponding encrypted distance vector.

The above process is repeated for $k$ iterations, and in each iteration the $[d_i]$ corresponding to the currently chosen label is set to the maximum value. However, C1 and C2 do not know which $[d_i]$ is updated. In iteration $s$, $E_{pk}(c'_s)$ is known only to C1. At the end of Stage 1, C1 has $\langle E_{pk}(c'_1), \ldots, E_{pk}(c'_k) \rangle$, the list of encrypted class labels of the k-nearest neighbors to the query $q$.

5.2 Stage 2: Secure Computation of Majority Class

Without loss of generality, let us assume that Alice's dataset $D$ consists of $w$ unique class labels denoted by $c = \langle c_1, \ldots, c_w \rangle$. We assume that Alice outsources her list of encrypted classes to C1. That is, Alice outsources $\langle E_{pk}(c_1), \ldots, E_{pk}(c_w) \rangle$ to C1 along with her encrypted database $D'$ during the data outsourcing step. Note that, for security reasons, Alice may add dummy categories to the list to hide the number of class labels, i.e., $w$, from C1 and C2. However, for simplicity, we assume that Alice does not add any dummy categories to $c$.

During Stage 2, C1 with private inputs $L = \langle E_{pk}(c_1), \ldots, E_{pk}(c_w) \rangle$ and $L' = \langle E_{pk}(c'_1), \ldots, E_{pk}(c'_k) \rangle$, and C2 with $sk$ securely compute $E_{pk}(c_q)$. Here $c_q$ denotes the majority class label among $c'_1, \ldots, c'_k$. At the end of Stage 2, only Bob knows the class label $c_q$.

Algorithm 5. SCMC$_k(E_{pk}(c'_1), \ldots, E_{pk}(c'_k)) \rightarrow c_q$
Require: $\langle E_{pk}(c_1), \ldots, E_{pk}(c_w) \rangle$ and $\langle E_{pk}(c'_1), \ldots, E_{pk}(c'_k) \rangle$ are known only to C1; $sk$ is known only to C2
1: C1 and C2:
   (a). $\langle E_{pk}(f(c_1)), \ldots, E_{pk}(f(c_w)) \rangle \leftarrow \mathrm{SF}(L, L')$, where $L = \langle E_{pk}(c_1), \ldots, E_{pk}(c_w) \rangle$, $L' = \langle E_{pk}(c'_1), \ldots, E_{pk}(c'_k) \rangle$
   (b). for $i = 1$ to $w$ do:
        - $[f(c_i)] \leftarrow \mathrm{SBD}(E_{pk}(f(c_i)))$
   (c). $([f_{\max}], E_{pk}(c_q)) \leftarrow \mathrm{SMAX}_w(\psi_1, \ldots, \psi_w)$, where $\psi_i = ([f(c_i)], E_{pk}(c_i))$, for $1 \le i \le w$
2: C1:
   (a). $\gamma_q \leftarrow E_{pk}(c_q) * E_{pk}(r_q)$, where $r_q \in_R \mathbb{Z}_N$
   (b). Send $\gamma_q$ to C2 and $r_q$ to Bob
3: C2:
   (a). Receive $\gamma_q$ from C1
   (b). $\gamma'_q \leftarrow D_{sk}(\gamma_q)$; send $\gamma'_q$ to Bob
4: Bob:
   (a). Receive $r_q$ from C1 and $\gamma'_q$ from C2
   (b). $c_q \leftarrow \gamma'_q - r_q \bmod N$

The overall steps involved in Stage 2 are shown in Algorithm 5. To start with, C1 and C2 jointly compute the encrypted frequencies of each class label using the k-nearest set as input. That is, they compute $E_{pk}(f(c_i))$ using $(L, L')$ as C1's input to the secure frequency (SF) protocol, for $1 \le i \le w$. The output $\langle E_{pk}(f(c_1)), \ldots, E_{pk}(f(c_w)) \rangle$ is known only to C1. Then, C1 with $E_{pk}(f(c_i))$ and C2 with $sk$ engage in the secure bit-decomposition protocol to compute $[f(c_i)]$, that is, the vector of encryptions of the individual bits of $f(c_i)$, for $1 \le i \le w$. After this, C1 and C2 jointly engage in the SMAX$_w$ protocol. Briefly, SMAX$_w$ utilizes the sub-routine SMAX to eventually compute $([f_{\max}], E_{pk}(c_q))$ in an iterative fashion. Here $[f_{\max}] = [\max(f(c_1), \ldots, f(c_w))]$ and $c_q$ denotes the majority class out of $L'$. At the end, the output $([f_{\max}], E_{pk}(c_q))$ is known only to C1. After this, C1 computes $\gamma_q = E_{pk}(c_q + r_q)$, where $r_q$ is a random number in $\mathbb{Z}_N$ known only to C1. Then, C1 sends $\gamma_q$ to C2 and $r_q$ to Bob. Upon receiving $\gamma_q$, C2 decrypts it to get the randomized majority class label $\gamma'_q = D_{sk}(\gamma_q)$ and sends it to Bob. Finally, upon receiving $r_q$ from C1 and $\gamma'_q$ from C2, Bob computes the output class label corresponding to $q$ as $c_q = \gamma'_q - r_q \bmod N$.

5.3 Security Analysis of PPkNN under the Semi-Honest Model

First of all, we stress that due to the encryption of $q$ and the semantic security of the Paillier cryptosystem, Bob's input query $q$ is protected from Alice, C1 and C2 in our PPkNN protocol. Apart from guaranteeing query privacy, the goal of PPkNN is to protect data confidentiality and hide data access patterns.

In this paper, to prove a protocol's security under the semi-honest model, we adopt the well-known security definitions from the SMC literature. More specifically, as mentioned in Section 2.3, we adopt security proofs based on the standard simulation paradigm [28].
For presentation purposes, we provide formal security proofs (under the semi-honest model) for Stages 1 and 2 of PPkNN separately. Note that the outputs returned by each sub-protocol are in encrypted form and known only to C1.

5.3.1 Proof of Security for Stage 1

As mentioned earlier, the computations involved in Stage 1 of PPkNN are given as steps 1 to 3 in Algorithm 4. For simplicity, we consider the messages exchanged between C1 and C2 in a single iteration (a similar analysis can be deduced for the other iterations).

According to Algorithm 4, the execution image of C2 is given by $\Pi_{C_2}(\mathrm{PPkNN}) = \{\langle \beta_i, \beta'_i \rangle \mid \text{for } 1 \le i \le n\}$, where $\beta_i$ is an encrypted value which is random in $\mathbb{Z}_{N^2}$. Also, $\beta'_i$ is derived upon decrypting $\beta_i$ by C2. Remember that exactly one of the entries in $\beta'$ is 0 and the rest are random numbers in $\mathbb{Z}_N$. Without loss of generality, let the simulated image of C2 be given by $\Pi^S_{C_2}(\mathrm{PPkNN}) = \{\langle a'_{1,i}, a'_{2,i} \rangle \mid \text{for } 1 \le i \le n\}$. Here $a'_{1,i}$ is randomly generated from $\mathbb{Z}_{N^2}$ and the vector $a'_2$ is randomly generated in such a way that exactly one of the entries is 0 and the rest are random numbers in $\mathbb{Z}_N$. Since $E_{pk}$ is a semantically secure encryption scheme with resulting ciphertext size less than $N^2$, we claim that $\beta_i$ is computationally indistinguishable from $a'_{1,i}$. In addition, since the random permutation function $\pi$ is known only to C1, $\beta'$ is a random vector of exactly one 0 and random numbers in $\mathbb{Z}_N$. Thus, $\beta'$ is computationally indistinguishable from $a'_2$. By combining the above results, we can conclude that $\Pi_{C_2}(\mathrm{PPkNN})$ is computationally indistinguishable from $\Pi^S_{C_2}(\mathrm{PPkNN})$. This implies that C2 does not learn anything during the execution of Stage 1.

On the other hand, the execution image of C1 is given by $\Pi_{C_1}(\mathrm{PPkNN}) = \{U'\}$, where $U'$ is an encrypted value sent by C2 (at step 3(c) of Algorithm 4). Let the simulated image of C1 in Stage 1 be $\Pi^S_{C_1}(\mathrm{PPkNN}) = \{a'\}$. Here the value of $a'$ is randomly generated from $\mathbb{Z}_{N^2}$. Since $E_{pk}$ is a semantically secure encryption scheme with resulting ciphertexts in $\mathbb{Z}_{N^2}$, we claim that $U'$ is computationally indistinguishable from $a'$. This implies that $\Pi_{C_1}(\mathrm{PPkNN})$ is computationally indistinguishable from $\Pi^S_{C_1}(\mathrm{PPkNN})$. Hence, C1 cannot learn anything during the execution of Stage 1 in PPkNN. Combining all these results, it is clear that Stage 1 is secure under the semi-honest model.

It is worth pointing out that, in each iteration, C1 and C2 do not know which data record belongs to the current global minimum. Thus, data access patterns are protected from both C1 and C2. Informally speaking, at step 3(c) of Algorithm 4, a component-wise decryption of $\beta$ reveals to C2 the tuple that satisfies the current global minimum distance. However, due to the random permutation by C1, C2 cannot trace back to the corresponding data record. Also, note that the decryption operations on vector $\beta$ by C2 result in exactly one 0, and the rest of the results are random numbers in $\mathbb{Z}_N$. Similarly, since $U'$ is an encrypted vector, C1 cannot know which tuple corresponds to the current global minimum distance.

5.3.2 Security Proof for Stage 2

In a similar fashion, we can formally prove that Stage 2 of PPkNN is secure under the semi-honest model. Briefly, since the sub-protocols SF, SBD, and SMAX$_w$ are secure, no information is revealed to C2. Also, the operations performed by C1 are entirely on encrypted data and thus no information is revealed to C1.

Furthermore, the output data of Stage 1 which are passed as input to Stage 2 are in encrypted format. Therefore, the sequential composition of the two stages leads to our PPkNN protocol, and we claim it to be secure under the semi-honest model according to the Composition Theorem [28]. In particular, based on the above discussion, it is clear that the proposed PPkNN protocol protects the confidentiality of the data and the user's input query, and also hides data access patterns from Alice, C1, and C2.
Note that Alice does not participate in any computations of PPkNN.

5.4 Security under the Malicious Model

The next step is to extend our PPkNN protocol into a secure protocol under the malicious model. Under the malicious model, an adversary (i.e., either C1 or C2) can arbitrarily deviate from the protocol to gain some advantage (e.g., learning additional information about inputs) over the other party. The deviations include, as an example, C1 (acting as a malicious adversary) instantiating the PPkNN protocol with modified inputs (say $E_{pk}(q')$ and $E_{pk}(t'_i)$) and aborting the protocol after gaining partial information. However, in PPkNN, it is worth pointing out that neither C1 nor C2 knows the results of Stages 1 and 2. In addition, all the intermediate results are either random or pseudo-random values. Thus, even when an adversary modifies the intermediate computations, he/she cannot gain any additional information. Nevertheless, as mentioned above, the adversary can change the intermediate data or perform computations incorrectly before sending them to the honest party, which may eventually result in the wrong output. Therefore, we need to ensure that all the computations performed and messages sent by each party are correct.

Remember that the main goal of SMC is to ensure that the honest parties get the correct result and to protect their private input data from the malicious parties. Therefore, under the two-party SMC scenario, if both parties are malicious, there is no point in developing or adopting an SMC protocol in the first place. In the SMC literature [30], it is the norm that at most one party can be malicious under the two-party scenario. When only one of the parties is malicious, the standard way of preventing the malicious party from misbehaving is to let the honest party validate the other party's work using zero-knowledge proofs [31]. However, checking the validity of operations at each step of PPkNN can significantly increase the cost.

An alternative approach, as proposed in [32], is to instantiate two independent executions of the PPkNN protocol by swapping the roles of the two parties in each execution. At the end of the individual executions, each party receives the output in encrypted form. This is followed by an equality test on their outputs. More specifically, let $E_{pk_1}(c_{q,1})$ and $E_{pk_2}(c_{q,2})$ be the outputs received by C1 and C2 respectively, where $pk_1$ and $pk_2$ are their respective public keys. Note that the outputs in our case are in encrypted format and the corresponding ciphertexts (resulting from the two executions) are under two different public key domains. Therefore, we stress that the equality test based on the additive homomorphic encryption properties which was used in [32] is not applicable to our problem. Nevertheless, C1 and C2 can perform the equality test based on the traditional garbled-circuit technique [33].

5.5 Complexity Analysis

The total computation complexity of Stage 1 is bounded by $O(n * (l + m + k * l * \log_2 n))$ encryptions and exponentiations. On the other hand, the total computation complexity of Stage 2 is bounded by $O(w * (l + k + l * \log_2 w))$ encryptions and exponentiations. Due to space limitations, we refer the reader to [5] for a detailed complexity analysis of PPkNN. In general, as $w \ll n$, the computation cost of Stage 1 should be significantly higher than that of Stage 2. This observation is further justified by our empirical results given in the next section.

6 EMPIRICAL RESULTS

In this section, we discuss some experiments demonstrating the performance of our PPkNN protocol under different parameter settings. We used the Paillier cryptosystem [4] as the underlying additive homomorphic encryption scheme and implemented the proposed PPkNN protocol in C.
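As a reminder of what "additive homomorphic" buys the protocols above, the following toy sketch shows the three Paillier identities they rely on (ciphertext product for addition, exponentiation for scalar multiplication, and the $N - 1$ exponent for subtraction), followed by the masked reveal of steps 2 to 4 in Algorithm 5. It is written in Python rather than the authors' C, uses tiny insecure primes, and the helper names `keygen`, `encrypt` and `decrypt` are assumptions for illustration only.

```python
import math
import random


def keygen(p=293, q=433):
    """Toy Paillier key pair from two small primes (insecure, illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # valid for generator g = n + 1; needs Python 3.8+
    return n, (lam, mu)


def encrypt(n, m):
    """E_pk(m) = (1 + n)^m * r^n mod n^2, with r coprime to n."""
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (1 + (m % n) * n) * pow(r, n, n2) % n2


def decrypt(n, sk, c):
    """D_sk(c) = L(c^lam mod n^2) * mu mod n, where L(x) = (x - 1) / n."""
    lam, mu = sk
    return (pow(c, lam, n * n) - 1) // n * mu % n


n, sk = keygen()
n2 = n * n
a, b = 17, 5
ea, eb = encrypt(n, a), encrypt(n, b)

total = decrypt(n, sk, ea * eb % n2)                 # E(a) * E(b)       -> a + b
scaled = decrypt(n, sk, pow(ea, 3, n2))              # E(a)^3            -> 3 * a
diff = decrypt(n, sk, ea * pow(eb, n - 1, n2) % n2)  # E(a) * E(b)^(n-1) -> a - b

# Masked reveal, as in steps 2 to 4 of Algorithm 5.
c_q = 3                                          # majority label (held encrypted by C1)
r_q = random.randrange(n)                        # C1's random mask, sent to Bob
gamma = encrypt(n, c_q) * encrypt(n, r_q) % n2   # C1 sends this to C2
gamma_dec = decrypt(n, sk, gamma)                # C2 sees only c_q + r_q mod n
recovered = (gamma_dec - r_q) % n                # Bob removes the mask

print(total, scaled, diff, recovered)  # 22 51 12 3
```

The subtraction identity is the one used to form $E_{pk}(c_j - c'_i)$ in SF and $E_{pk}(i - I)$ in Stage 1; a negative difference wraps around modulo $n$.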
Various experiments were conducted on a Linux machine with an Intel Xeon Six-Core CPU 3.07 GHz processor and 12 GB RAM running Ubuntu 12.04 LTS. To the best of our knowledge, our work is the first effort to develop a secure k-NN classifier under the semi-honest model. There is no existing work to compare with our approach. Hence, we evaluate the performance of our PPkNN protocol under different parameter settings.

6.1 Dataset and Experimental Setup

For our experiments, we used the Car Evaluation dataset from the UCI KDD archive [34]. It consists of 1,728 records (i.e., $n = 1{,}728$) and six attributes (i.e., $m = 6$). Also, there is a separate class attribute and the dataset is categorized into four different classes (i.e., $w = 4$). We encrypted this dataset attribute-wise, using Paillier encryption whose key size is varied in our experiments, and the encrypted data were stored on our machine. Based on our PPkNN protocol, we then executed a random query over this encrypted data. For the rest of this section, we do not discuss the performance of Alice since it is a one-time cost. Instead, we evaluate and analyze the performance of the two stages in PPkNN separately.

6.2 Performance of PPkNN

We first evaluated the computation costs of Stage 1 in PPkNN for a varying number of k-nearest neighbors. Also, the Paillier encryption key size $K$ is either 512 or 1,024 bits. The results are shown in Fig. 2a. For $K = 512$ bits, the computation cost of Stage 1 varies from 9.98 to 46.16 minutes when $k$ is changed from 5 to 25, respectively. On the other hand, when $K = 1{,}024$ bits, the computation cost of Stage 1 varies from 66.97 to 309.98 minutes when $k$ is changed from 5 to 25, respectively. In either case, we observed that the cost of Stage 1 grows almost linearly with $k$. For any given $k$, we identified that the cost of Stage 1 increases by almost a factor of 7 whenever $K$ is doubled. E.g., when $k = 10$, Stage 1 took 19.06 and 127.72 minutes to generate the encrypted class labels of the 10 nearest neighbors under $K = 512$ and 1,024 bits, respectively. Moreover, when $k = 5$, we observe that around 66.29 percent of the cost in Stage 1 is due to SMIN$_n$, which is initiated $k$ times in PPkNN (once in each iteration). Also, the share of the cost incurred by SMIN$_n$ increases from 66.29 to 71.66 percent when $k$ is increased from 5 to 25.

Fig. 2. Computation costs of PPkNN for varying number of k nearest neighbors and encryption key size $K$.

We now evaluate the computation costs of Stage 2 for varying $k$ and $K$. As shown in Fig. 2b, for $K = 512$ bits, the computation time for Stage 2 to generate the final class label corresponding to the input query varies from 0.118 to 0.285 seconds when $k$ is changed from 5 to 25. On the other hand, for $K = 1{,}024$ bits, Stage 2 took 0.789 and 1.89 seconds when $k = 5$ and 25, respectively. The low computation costs of Stage 2 are due to SMAX$_w$, which incurs significantly fewer computations than SMIN$_n$ in Stage 1. This further justifies our theoretical analysis in Section 5.5. Note that, in our dataset, $w = 4$ and $n = 1{,}728$. As in Stage 1, for any given $k$, the computation time of Stage 2 increases by almost a factor of 7 whenever $K$ is doubled. E.g., when $k = 10$, the computation time of Stage 2 varies from 0.175 to 1.158 seconds when the encryption key size $K$ is changed from 512 to 1,024 bits. As shown in Fig. 2b, a similar trend can be observed for other values of $k$ and $K$.

It is clear that the computation cost of Stage 1 is significantly higher than that of Stage 2 in PPkNN. Specifically, we observed that the computation time of Stage 1 accounts for at least 99 percent of the total time in PPkNN.
For example, when $k = 10$ and $K = 512$ bits, the computation costs of Stages 1 and 2 are 19.06 minutes and 0.175 seconds, respectively. Under this scenario, the cost of Stage 1 is 99.98 percent of the total cost of PPkNN. We also observed that the total computation time of PPkNN grows almost linearly with $n$ and $k$.

6.3 Performance Improvement of PPkNN

We now discuss two different ways to boost the efficiency of Stage 1 (as the performance of PPkNN depends primarily on Stage 1) and empirically analyze their efficiency gains. First, we observe that some of the computations in Stage 1 can be pre-computed. For example, encryptions of random numbers, 0s and 1s can be pre-computed (by the corresponding parties) in the offline phase. As a result, the online computation cost of Stage 1 (denoted by SRkNN$_o$) is expected to improve. To see the actual efficiency gains of such a strategy, we computed the costs of SRkNN$_o$ and compared them with the costs of Stage 1 without an offline phase (simply denoted by SRkNN); the results for $K = 1{,}024$ bits are shown in Fig. 2c. Irrespective of the value of $k$, we observed that SRkNN$_o$ is around 33 percent faster than SRkNN. E.g., when $k = 10$, the computation costs of SRkNN$_o$ and SRkNN are 84.47 and 127.72 minutes, respectively (boosting the online running time of Stage 1 by 33.86 percent).

Our second approach to improving the performance of Stage 1 is parallelism. Since operations on data records are independent of one another, we claim that most computations in Stage 1 can be parallelized. To empirically evaluate this claim, we implemented a parallel version of Stage 1 (denoted by SRkNN$_p$) using OpenMP programming and compared its cost with the cost of SRkNN (i.e., the serial version of Stage 1). The results for $K = 1{,}024$ bits are shown in Fig. 2c. The computation cost of SRkNN$_p$ varies from 12.02 to 55.5 minutes when $k$ is changed from 5 to 25. We observe that SRkNN$_p$ is almost six times more efficient than SRkNN.
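The ratios quoted in Sections 6.2 and 6.3 can be rechecked directly from the reported timings. The short Python sketch below does only that arithmetic; the numbers are copied from the text, while the variable names are ours and purely illustrative.

```python
# Reported timings copied from Section 6 (minutes unless noted).
stage1 = {
    512: {5: 9.98, 10: 19.06, 25: 46.16},
    1024: {5: 66.97, 10: 127.72, 25: 309.98},
}
stage2_seconds_k10_512 = 0.175
srknn_o_k10_1024 = 84.47                 # Stage 1 with offline pre-computation
srknn_p_1024 = {5: 12.02, 25: 55.5}      # Stage 1 with OpenMP parallelism

# Cost growth when the key size K doubles from 512 to 1,024 bits.
key_factor = {k: stage1[1024][k] / stage1[512][k] for k in (5, 10, 25)}

# Share of Stage 1 in the total cost for k = 10, K = 512 bits.
stage1_seconds = stage1[512][10] * 60
stage1_share = 100 * stage1_seconds / (stage1_seconds + stage2_seconds_k10_512)

# Online gain from pre-computation for k = 10, K = 1,024 bits.
offline_gain = 100 * (1 - srknn_o_k10_1024 / stage1[1024][10])

# Speedup of the parallel version over the serial one, K = 1,024 bits.
speedup = {k: stage1[1024][k] / srknn_p_1024[k] for k in (5, 25)}

print({k: round(v, 2) for k, v in key_factor.items()})
print(round(stage1_share, 2))   # 99.98
print(round(offline_gain, 2))   # 33.86
print({k: round(v, 2) for k, v in speedup.items()})
```

On these figures, doubling $K$ multiplies the Stage 1 cost by roughly 6.7, and the parallel version is between 5.5 and 5.6 times faster than the serial one, which is what the text rounds to "almost a factor of 7" and "almost six times".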
This is because our machine has six cores andthus computations can be run in parallel on six separatethreads. Based on the above discussions, it is clear that efficiencyof Stage 1 can indeed be improved significantly usingparallelism.On the other hand, Bob’s computation cost in PPkNNis mainly due to the encryption of his input query. In ourdataset, Bob’s computation cost is 4 and 17 millisecondswhen K is 512 and 1,024 bits, respectively. It is apparentthat PPkNN is very efficient from Bob’s computationalperspective which is especially beneficial when he issuesqueries from a resource-constrained device (such asmobile phone and PDA).6.3.1 A Note on PracticalityOur PPkNN protocol is not very efficient without utilizingparallelization. However, ours is the first work to propose aPPkNN solution that is secure under the semi-honestmodel. Due to rising demands for data mining as a servicein cloud, we believe that our work will be very helpful tothe cloud community to stimulate further research alongthat direction. Hopefully, more practical solutions toPPkNN will be developed (either by optimizing our protocolor investigating alternative approaches) in the nearfuture.7 CONCLUSIONS AND FUTURE WORKTo protect user privacy, various privacy-preserving classificationtechniques have been proposed over the past decade.The existing techniques are not applicable to outsourceddatabase environments where the data resides in encryptedform on a third-party server. This paper proposed a novelprivacy-preserving k-NN classification protocol overencrypted data in the cloud. Our protocol protects the confidentialityof the data, user’s input query, and hides the dataaccess patterns. We also evaluated the performance of ourprotocol under different parameter settings.Since improving the efficiency of SMINn is an importantfirst step for improving the performance of our PPkNN protocol,we plan to investigate alternative and more efficientsolutions to the SMINn problem in our future work. 
Also, we will investigate and extend our research to other classification algorithms.

ACKNOWLEDGMENTS

The authors wish to thank the anonymous reviewers for their invaluable feedback and suggestions. This work has been partially supported by the US National Science Foundation under grant CNS-1011984.

TA 1273

Innovative Schemes for Resource Allocation in the Cloud for Media Streaming Applications

05/08/201902/07/2019 by admin

ABSTRACT:

Media streaming applications have recently attracted a large number of users on the Internet. With the advent of these bandwidth-intensive applications, it is economically inefficient to provide streaming distribution with guaranteed QoS relying only on central resources at a media content provider. Cloud computing offers an elastic infrastructure that media content providers (e.g., Video on Demand (VoD) providers) can use to obtain streaming resources that match the demand. Media content providers are charged for the amount of resources allocated (reserved) in the cloud. Most of the existing cloud providers employ a pricing model for the reserved resources that is based on non-linear time-discount tariffs (e.g., Amazon CloudFront and Amazon EC2). Such a pricing scheme offers discount rates depending non-linearly on the period of time during which the resources are reserved in the cloud. In this case, an open problem is to decide on both the right amount of resources reserved in the cloud and their reservation time such that the financial cost to the media content provider is minimized. We propose a simple, easy-to-implement algorithm for resource reservation that maximally exploits discounted rates offered in the tariffs, while ensuring that sufficient resources are reserved in the cloud. Based on the prediction of demand for streaming capacity, our algorithm is carefully designed to reduce the risk of making wrong resource allocation decisions. The results of our numerical evaluations and simulations show that the proposed algorithm significantly reduces the monetary cost of resource allocations in the cloud as compared to other conventional schemes.

Index Terms: Media streaming, cloud computing, non-linear pricing models, network economics

1 INTRODUCTION

Media streaming applications have recently attracted a large number of users on the Internet. In 2010, the number of video streams served increased 38.8 percent to 24.92 billion as compared to 2009 [1].
This huge demand creates a burden on centralized data centers at media content providers such as Video-on-Demand (VoD) providers to sustain the required QoS guarantees [2]. The problem becomes more critical with the increasing demand for the higher bit rates required by the higher-definition video quality that consumers increasingly desire. In this paper, we explore new approaches that mitigate the cost of streaming distribution on media content providers using cloud computing.

A media content provider needs to equip its data-center with an over-provisioned (excessive) amount of resources in order to meet the strict QoS requirements of streaming traffic. Since it is possible to anticipate the size of usage peaks for streaming capacity on a daily, weekly, monthly, and yearly basis, a media content provider can make long-term investments in infrastructure (e.g., bandwidth and computing capacities) to target the expected usage peak. However, this causes economic inefficiency problems in view of flash-crowd events. Since data-centers of a media content provider are equipped with resources that target the peak expected demand, most servers in a typical data-center of a media content provider are only used at about 30 percent of their capacity [3]. Hence, a huge amount of capacity at the servers will be idle most of the time, which is highly wasteful and inefficient.

Cloud computing creates the possibility for media content providers to convert the upfront infrastructure investment to operating expenses charged by cloud providers (e.g., Netflix moved its streaming servers to Amazon Web Services (AWS) [4], [5]). Instead of buying over-provisioned servers and building private data-centres, media content providers can use computing and bandwidth resources of cloud service providers. Hence, a media content provider can be viewed as a re-seller of cloud resources, where it pays the cloud service provider for the streaming resources (bandwidth) served from the cloud directly to clients of the media content provider.
This paradigm reduces the expenses of media content providers in terms of purchase and maintenance of over-provisioned resources at their data-centres.

In the cloud, the amount of allocated resources can be changed adaptively at a fine granularity, which is commonly referred to as auto-scaling. The auto-scaling ability of the cloud enhances resource utilization by matching the supply with the demand. So far, CPU and memory are the common resources offered by the cloud providers (e.g., Amazon EC2 [6]). However, recently, streaming resources (bandwidth) have become a feature offered by many cloud providers to users with intensive bandwidth demand (e.g., Amazon CloudFront and Octoshape) [5], [7], [8], [9].

A. Alasaad and H.M. Behairy are with the National Center for Electronics, Communications, and Photonics, King Abdulaziz City for Science and Technology, Riyadh, Saudi Arabia. E-mail: {alasaad, hbehairy}@kacst.edu.sa.
K. Shafiee and V.C.M. Leung are with the Department of Electrical and Computer Engineering, University of British Columbia, Vancouver, BC, Canada. E-mail: {kshafiee, vleung}@ece.ubc.ca.
Manuscript received 7 Nov. 2013; revised 23 Jan. 2014; accepted 24 Mar. 2014. Date of publication 10 Apr. 2014; date of current version 6 Mar. 2015. Recommended for acceptance by H. Wu.
For information on obtaining reprints of this article, please send e-mail to: reprints@ieee.org, and reference the Digital Object Identifier below.
Digital Object Identifier no. 10.1109/TPDS.2014.2316827
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 26, NO. 4, APRIL 2015
1045-9219 © 2014 IEEE.
Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

The delay-sensitive nature of media streaming traffic poses unique challenges due to the need for guaranteed throughput (i.e., a download rate no smaller than the video playback rate) in order to enable users to smoothly watch video content on-line. Hence, the media content provider needs to allocate streaming resources in the cloud such that the demand for streaming capacity can be sustained at any instant of time.

The common type of resource provisioning plan that is offered by cloud providers is referred to as the on-demand plan. This plan allows the media content provider to purchase resources as needed. The pricing model that cloud providers employ for the on-demand plan is pay-per-use. Another type of streaming resource provisioning plan that is offered by many cloud providers is based on resource reservation. With the reservation plan, the media content provider allocates (reserves) resources in advance and is charged before the resources are utilized (upon receipt of the request by the cloud provider, i.e., prepaid resources). The reserved streaming resources are basically the bandwidth (streaming data-rate) that the cloud provider guarantees to deliver to clients of the media content provider (content viewers) according to the required QoS.

In general, the prices (tariffs) of the reservation plan are cheaper than those of the on-demand plan (i.e., time discount rates are only offered to the reserved (prepaid) resources). We consider a pricing model for resource reservation in the cloud that is based on non-linear time-discount tariffs. In such a pricing scheme, the cloud service provider offers higher discount rates to the resources reserved in the cloud for longer times.
Such a pricing scheme enables a cloud service provider to better utilize its abundantly available resources because it encourages consumers to reserve resources in the cloud for longer times. This pricing scheme is currently being used by many cloud providers [10]. See for example the pricing of virtual machines (VM) in the reservation phase defined by Amazon EC2 in February 2010. In this case, an open problem is to decide on both the optimum amount of resources reserved in the cloud (i.e., the prepaid allocated resources) and the optimum period of time during which those resources are reserved such that the monetary cost to the media content provider is minimized. In order for a media content provider to address this problem, prediction of future demand for streaming capacity is required to help with the resource reservation planning. Many methods have been proposed in prior works to predict the demand for streaming capacity [11], [12], [13], [14].

Our main contribution in this paper is a practical, easy-to-implement Prediction-Based Resource Allocation algorithm (PBRA) that minimizes the monetary cost of resource reservation in the cloud by maximally exploiting discounted rates offered in the tariffs, while ensuring that sufficient resources are reserved in the cloud with some level of confidence in a probabilistic sense. We first describe the system model. We formulate the problem based on the prediction of future demand for streaming capacity (Section 3). We then describe the design of our proposed algorithm for solving the problem (Section 4).

The results of our numerical evaluations and simulations show that the proposed algorithms significantly reduce the monetary cost of resource allocations in the cloud as compared to other conventional schemes.

2 RELATED WORK

The prediction of CPU utilization and user access demand for web-based applications has been extensively studied in the literature.
A prediction method for upcoming CPU utilization demand patterns, based on neural networks and linear regression, has been proposed and is of interest in e-commerce applications [15]. Y. Lee et al. proposed a prediction method based on radial basis function (RBF) networks to predict user access demand for web services in web-based applications [16]. Although demand prediction for CPU utilization and web applications has been studied for a relatively long period of time, the prediction of demand for media streaming has gained popularity more recently [11], [12], [13], [14]. The access behaviour of users in peer-to-peer (P2P) streaming was predicted in [11] with time-series analysis techniques using non-stationary time-series models. Time-series prediction based on wavelet analysis was studied in [12]. In [13], principal component analysis is employed by the authors to extract the access pattern of streaming users. Although most of the above studies predict the average streaming capacity demand, a few papers have also studied the volatility of the capacity demand, i.e., the demand variance at any future point in time, which yields more accurate risk factors [14]. The prediction of streaming bandwidth demand is outside the scope of this paper. In this work, we formulate the problem assuming a given probability distribution function for the prediction of future demand for streaming bandwidth.

In addition to demand prediction for resource reservation, other relevant studies have addressed the appropriate joint reservation of bandwidth resources on multiple cloud service providers with the purpose of maximizing bandwidth utilization [12], [14]. In [17], an adaptive resource provisioning scheme is presented that optimizes the bandwidth utilization while satisfying the required levels of QoS. Maximization of bandwidth utilization in turn helps cloud service providers reduce their expenses and maximize their revenues.
In [18], an optimization framework for making dynamic resource allocation decisions under risky and uncertain operating environments was developed to maximize revenue while reducing operating costs. This framework considered multiple client QoS classes under uncertainty of workloads.

Recently, streaming resources (e.g., bandwidth) have become a feature offered by many cloud providers to content providers with intensive bandwidth demand. The streaming of media content to content viewers located in different geographical regions at a guaranteed data-rate is a part of the service offered by the cloud provider. The common way of implementing this service in the cloud is by having multiple data-centres inside the networks of the access connection providers (e.g., Internet Service Providers, ISPs) located at appropriate geographical locations (Fig. 1) [5], [19], [20]. Cloud service providers may need to negotiate contracts with a number of ISPs to co-locate their servers in the networks of those ISPs. In this regard, another group of papers has focused on studying different types of contracts between cloud service providers and ISPs with the purpose of minimizing the expenses of cloud providers [21]. However, an interesting design approach is to look at the resource reservation problem from the viewpoint of content providers. Obviously, content providers are more interested in minimizing their costs, i.e., the amount of money that they are charged directly by cloud providers.

To the best of our knowledge, very few studies have investigated the problem of optimizing resource reservation with the objective of minimizing the monetary costs for content providers. A good example is presented in [22], wherein a resource reservation optimization problem was formulated to minimize the costs of content providers, so-called cloud consumers, using a stochastic programming model.
In the process of problem formulation, uncertain demand and uncertain cloud providers' resource prices are considered. In contrast, the optimization problem formulated in our work takes into account a given probability distribution function obtained from the aforementioned studies for the prediction of media streaming demand. Furthermore, the problem of cost minimization is addressed by utilizing the discounted rates offered in the non-linear tariffs. To the best of our knowledge, none of the previous papers has investigated the problem of cost minimization for media content providers in terms of monetary expenses by taking into account both the penalties caused by over-provisioned or under-provisioned reserved resources and the advance purchase of resources at cloud providers for just the right period of time.

3 SYSTEM MODEL AND PROBLEM FORMULATION

The system model that we advocate in this paper for media streaming using cloud computing consists of the following components (Fig. 1).

• Demand forecasting module, which predicts the demand for streaming capacity for every video channel during a future period of time.
• Cloud broker, which is responsible on behalf of the media content provider for both allocating the appropriate amount of resources in the cloud and reserving the time over which the required resources are allocated. Given the demand prediction, the broker implements our proposed algorithm to make decisions on resource allocations in the cloud. Both the demand forecasting module and the cloud broker are located at the media content provider site.
• Cloud provider, which provides the streaming resources and delivers streaming traffic directly to media viewers.

In this paper, we consider the case wherein the cloud provider charges media content providers for the reserved resources according to the period of time during which the resources are reserved in the cloud.
In this case, the cloud provider offers higher discount rates to the resources reserved in the cloud for longer times.

Non-linear time-discount is a very popular pricing model. Non-linear tariffs are those with marginal rates varying with quantity purchased and time rented. Time discount rates are available in purchasing most types of goods. Products or services with time usage (e.g., rental cars, rental real-estate, loans, long distance telephone cards, photocopiers) are typically offered with a variety of plans (pricing schemes) depending on the period of time the product is consumed (reserved). It has been shown that such pricing schemes enable sellers to increase their revenues [23]. Many cloud providers also use such a pricing scheme [10]. See for example the pricing of virtual machines in the reservation phase defined by Amazon EC2 in February 2010. An example of tariffs using such a pricing scheme is shown in Fig. 2. We can see that the tariff is a function of both units of allocated resources and reservation time.

We observe the following dilemma: how can the media content provider reserve sufficient resources in the cloud, based on the prediction of future streaming demand, such that no resource wastage is incurred, while QoS for the actual (real) streaming traffic is maintained with some level of confidence ($\eta$) in a probabilistic sense? Moreover, how can the media content provider utilize the non-linear tariffs (time discount rates offered to the reserved (prepaid) resources) to minimize its monetary cost?

Consider a video channel offered by a media content provider. Let $D(t)$ be the actual demand for streaming capacity of the video channel at an instant of time $t$, measured as the number of users that stream the channel at time $t$ multiplied by the data rate required for every downloading user to meet QoS guarantees. It has been shown that $D(t)$ is a random process that follows a log-normal distribution with mean $E[D(t)]$ and variance ($\sigma$) characterized in [11] and [14], respectively.

Fig. 1. System model.

Fig. 2. An example of tariffs as a function of allocated resources and reservation time.

We denote the amount of streaming bandwidth that the media content provider allocates in the cloud at any time instant $t$ by $\mathrm{Alloc}(t)$. Since $D(t)$ is a random process, the media content provider needs to maintain reserved resources in the cloud $\mathrm{Alloc}(t)$ such that at any instant of time,

$$\mathrm{Probability}\big(D(t) \le \mathrm{Alloc}(t)\big) \ge \eta, \tag{1}$$

where $\eta$ is a pre-determined threshold (level of confidence). Note that a higher $\eta$ means a higher degree of confidence, in a probabilistic sense, that the reserved resources in the cloud $\mathrm{Alloc}(t)$ meet the QoS guarantees for the actual streaming traffic at any future time instant $t$. However, increasing $\eta$ increases the probability of wastage of reserved bandwidth (i.e., over-subscribed cost). Hence, proper selection of $\eta$ is necessary. We shall propose an algorithm that determines the best value of $\eta$ in Section 5. In this section, our objective is to find the right amount of reserved resources and their corresponding reservation time such that the monetary cost required for streaming a video content (channel) is minimized given the constraint in Eq. (1).

4 ALGORITHM DESIGN

We summarize the assumptions that we use in our analysis as follows:

1) We assume that upon receipt of the resource allocation request from the media content provider by the cloud provider, the required resources are immediately allocated in the cloud, i.e., updating the cloud configuration and launching instances in cloud data-centres incurs no delay.
2) Since the only resource that we consider in this work is bandwidth, it would be important to delve into the relation between the cloud provider and content delivery networks (CDN).
However, we assume that the provisioning of media content to media viewers (clients of the media content provider) located in different geographical regions at a guaranteed data-rate is a part of the service offered by the cloud provider. The common way of implementing this service in the cloud is by having multiple data-centres inside the networks of the access connection providers (e.g., ISPs) located at appropriate geographical locations (Fig. 1) [5], [19], [20].
3) We assume that the media content provider is charged for the reserved resources in the cloud upon making the request for resource reservation (i.e., prepaid resources); therefore, the media content provider cannot revoke, cancel, or change a request for resource reservation previously submitted to the cloud.
4) In clouds, tariffs (prices of different amounts of reserved resources in $ per unit of reservation time) are often given in a tabular form. Therefore, the cloud service provider requires a minimum reservation time for any allocated resources, and only allows discrete levels (categories) of the amount of allocated resources in the cloud. See for example the reservation phase in the Amazon CloudFront resource provisioning plans [7].

We take into account the aforementioned constraints and propose a practical, easy-to-implement algorithm for resource reservation in the cloud, such that the financial cost to the media content provider is minimized.

Suppose that the media content provider can predict the demand for streaming capacity of a video channel (i.e., the statistical expected value of the demand $E[D(t)]$ is known) over a future period of time $L$ using one of the methods in [11], [12], [13], [14]. The content provider reserves resources in the cloud according to the predicted demand. The proposed algorithm is based on time-slots with varied durations (sizes). In every time-slot, the media content provider makes a decision to reserve an amount of resources in the cloud.
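The tabular tariffs of assumption 4 can be pictured with a small lookup sketch. Everything in it is invented for illustration: the tier boundaries, the allowed levels, and the rates in `RATES` are hypothetical and are not taken from Fig. 2 or from any real provider's price list.

```python
from bisect import bisect_left, bisect_right

# Hypothetical tariff table; none of these numbers come from the paper.
MIN_RESERVATION = 1                 # minimum reservation time (time units)
DURATION_TIERS = [1, 4, 8]          # window length at which each discount tier starts
LEVELS = [1, 2, 3]                  # allowed amounts of reserved resources (units)
RATES = {                           # $ per time unit, one column per duration tier
    1: [11.0, 10.0, 9.0],
    2: [20.0, 17.0, 14.0],
    3: [28.0, 23.0, 18.0],
}


def round_up_level(required_units):
    """Return the smallest allowed level that covers the required amount."""
    i = bisect_left(LEVELS, required_units)
    if i == len(LEVELS):
        raise ValueError("requirement exceeds the largest level in the table")
    return LEVELS[i]


def tariff(window, level):
    """Look up the price per time unit for `level` units reserved for `window`."""
    if window < MIN_RESERVATION:
        raise ValueError("window is shorter than the minimum reservation time")
    tier = bisect_right(DURATION_TIERS, window) - 1   # last tier that has started
    return RATES[level][tier]
```

A table of this shape makes both constraints explicit: the amount is rounded up to an allowed level before the lookup, and longer windows fall into cheaper columns.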
Both the amount of resources to be reserved and the period of time over which the reservation is made (duration of time-slots) vary from one time-slot to another, and are determined in our algorithm to yield the minimum overall monetary cost (Fig. 3).

We alternatively call a time-slot a window, and denote the window size (duration of the time-slot) by $w$. Since the actual demand varies during a window, while allocated resources in the cloud remain the same for the entire window (according to the third assumption above), the algorithm needs to reserve resources in every window $j$ that are sufficient to handle the maximum predicted demand for streaming capacity during that window with some probabilistic level of confidence $\eta$.

We denote the amount of reserved resources in window $j$ by $\mathrm{Alloc}_j$. Since the decision on the amount of reserved resources is affected by wrong predictions of future streaming demand, our on-line algorithm is carefully designed to obtain accurate demand prediction (by enabling a mechanism that continuously updates the demand forecast module according to the actual demand received at the media content provider over time) in order to reduce the risk of making wrong resource reservation decisions (Fig. 1).

We denote the monetary cost of the reserved resources during window $j$ by $\mathrm{Cost}(w_j, \mathrm{Alloc}_j)$, which can be computed as

$$\mathrm{Cost}(w_j, \mathrm{Alloc}_j) = \mathrm{tariff}(w_j, \mathrm{Alloc}_j) \times w_j, \tag{2}$$

where $\mathrm{tariff}(w_j, \mathrm{Alloc}_j)$ represents the price (in $ per time unit) charged by the cloud provider for the amount of resources $\mathrm{Alloc}_j$ reserved for a period of time (window size) $w_j$. Note that the values of tariff and Cost in any window $j$ depend on both the amount of allocated resources ($\mathrm{Alloc}_j$) and the period of time over which resources are reserved ($w_j$).

Fig. 3. PBRA algorithm design.

Also note that the algorithm runs on-the-fly.
More specifically, the demand forecast module predicts streaming capacity demand in the upcoming period of time $L$ and feeds this information to our algorithm. The algorithm, upon receiving the demand prediction, computes the right size of window $j$ (i.e., $w_j^*$) and the right amount of reserved resources in window $j$ (i.e., $\mathrm{Alloc}_j^*$), such that the cost of the reserved resources during window $j$ (i.e., $\mathrm{Cost}(w_j, \mathrm{Alloc}_j)$ in (2)) is minimized; or equivalently, the discounted rates offered in the tariffs are maximally utilized.

Hence, the objective of our algorithm is to minimize $\mathrm{Cost}(w_j, \mathrm{Alloc}_j)\ \forall j$, subject to

$$\mathrm{Probability}\big(D(t) \le \mathrm{Alloc}(t)\big) \ge \eta, \quad \forall\, t \in L.$$

In other words, our objective is to minimize the monetary cost of reserved resources such that the amount of reserved resources at any instant of time is guaranteed to meet the actual demand with probabilistic confidence equal to $\eta$. As we have discussed earlier, $D(t)$ is a random process that follows a log-normal distribution with mean $E[D(t)]$ and variance ($\sigma$) characterized in [11] and [14], respectively. Thus, using the constraint above, and for any window size $w_j$, we can compute the minimum amount of required reserved resources during window $j$ ($\mathrm{Alloc}_j$) by solving the following formula for $\mathrm{Alloc}_j$:

$$\int_{0}^{\mathrm{Alloc}_j} \frac{1}{x\,\sigma\sqrt{2\pi}}\; e^{-\frac{1}{2}\left(\frac{\ln(x) - \mu_{\max}}{\sigma}\right)^{2}}\, dx = \eta, \tag{3}$$

where $\mu_{\max}$ is the maximum value of the predicted streaming demand during window $j$ (i.e., $\mu_{\max} = \max\, E[D(t)]\ \forall\, t \in w_j$). Note that Equation (3) follows from the log-normal probability distribution of the demand for streaming capacity.

As we have discussed earlier, the cloud service provider often requires a minimum reservation time for any allocated resources ($w_{\min}$), and only allows discrete levels (categories) of reservation times for any amount of allocated resources in the cloud.
We therefore assume that any reservation time required at the cloud has to be an integer multiple of $w_{\min}$ (i.e., $w_j = k \times w_{\min}$, where $k$ is a positive integer). Thus, the algorithm employs a trial window ($w_h$) to assist in making the optimum decision on the size of every window $j$. In particular, for every window $j$, the algorithm starts an iteration process with a trial window of size $w_h = w_{\min}$, and computes the cost rate $X_h = \mathrm{tariff}(w_h, \mathrm{Alloc}_h)$, where $h$ is the iteration index and $\mathrm{Alloc}_h$ is computed by solving Eq. (3) for Alloc.

Recall that due to the time discount rates offered in the tariffs, increasing the time during which the allocated resources are reserved may lead to a lower monetary cost (higher discounted rate) for the media content provider (Fig. 2). However, increasing the window size (time-slot) significantly may also result in a high over-provisioning (over-subscribed) cost, as the media content provider has to allocate resources in the cloud that meet the highest demand during the window period. Thus, in order to recognize whether the cost is decreasing or increasing with increasing window size, the trial window size ($w_h$) is increased by one $w_{\min}$ unit in every iteration (i.e., $w_{h+1} = w_h + w_{\min}$) and the cost rate of this new trial window size is computed ($X_{h+1}$). The algorithm keeps increasing the trial window size until $w_h = L$ in order to scan the entire period of time over which the demand was predicted ($L$) (Fig. 3), and finds the value of $w_h$ that yields the minimum cost; that is the optimum size of window $j$ ($w_j^*$). Since $L$ is the period of time over which the future demand is predicted, $w_{\min} \le w_j^* \le L$.

During every window, the media content provider receives the real (actual) streaming demand for the video channel, which may be different from the predicted demand. According to the actual demand, the demand forecast module updates its prediction and feeds the algorithm with a newly predicted demand for another future period of time $L$ (Fig. 1).
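The trial-window iteration for a single window can be sketched as follows. This is an illustrative reading of the description above, not the authors' Algorithm 1. It takes Eq. (3) literally, so $\mu_{\max}$ is used as the location parameter of $\ln D$ and the quantile has the closed form $\exp(\mu_{\max} + \sigma z_\eta)$; the names `min_alloc`, `best_window`, and `toy_tariff`, the integer resource levels, and all numbers are assumptions made for the example.

```python
import math
from statistics import NormalDist


def min_alloc(mu_max, sigma, eta):
    """Solve Eq. (3) for Alloc: the eta-quantile of the log-normal demand."""
    return math.exp(mu_max + sigma * NormalDist().inv_cdf(eta))


def best_window(forecast_mu, sigma, eta, tariff, w_min=1):
    """Scan trial windows w_h = h * w_min and keep the lowest cost rate X_h.

    forecast_mu[i] is the predicted log-scale demand parameter for the i-th
    slot of length w_min, so the horizon is L = len(forecast_mu) * w_min.
    Returns (w_star, alloc_star, cost), where cost follows Eq. (2).
    """
    best = None
    for h in range(1, len(forecast_mu) + 1):
        w_h = h * w_min
        mu_max = max(forecast_mu[:h])            # peak prediction inside the trial window
        alloc_h = math.ceil(min_alloc(mu_max, sigma, eta))  # round up to an integer level
        x_h = tariff(w_h, alloc_h)               # cost rate in $ per time unit
        if best is None or x_h < best[0]:        # ties keep the shorter window
            best = (x_h, w_h, alloc_h)
    x_star, w_star, alloc_star = best
    return w_star, alloc_star, x_star * w_star


def toy_tariff(window, alloc):
    """Hypothetical non-linear time discount: the rate falls as the window grows."""
    return 10.0 * alloc / window ** 0.3


demand = [0.6, 0.7, 0.8, 2.5, 2.6]               # invented forecast, one value per slot
w, alloc, cost = best_window([math.log(d) for d in demand], 0.3, 0.75, toy_tariff)
print(w, alloc, round(cost, 2))
```

With this invented forecast the search stops extending the window just before the demand jump, because covering the jump would force a higher resource level whose extra price outweighs the longer-term discount.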
The algorithm, upon receiving the updated demand prediction, computes the optimum size of the next window, reserves the optimum resources in the next window, and so on. The pseudo code for the proposed algorithm is shown in Algorithm 1. In order to further clarify the operation of the proposed algorithm (which we call the Prediction-Based Resource Allocation algorithm), an example is given in the following.

4.1 Example: Finding the Right Amount of Reserved Resources in Window j and Their Reservation Time

Consider the normalized predicted streaming demand given in Fig. 4 for a future period of time $L = 12$. Let $w_{\min} = 1$ and let $\eta = 0.75$. Assume that the amount of reserved resources in the cloud can only take integer numbers of units of resources (i.e., the cloud provider applies certain levels (categories) to the amount of allowed reserved resources, $\mathrm{Alloc}(t) \in \{1, 2, 3, \ldots\}$).

For the given predicted demand, our algorithm finds the optimum size of every window $j$ and the optimum amount of reserved resources in window $j$ as follows. The algorithm starts iterations to determine the size of the first window (i.e., $w_{j=1}$). In the first iteration ($h = 1$), $w_{h=1} = 1$, and we can see that the maximum predicted demand when $w_{h=1} = 1$ is 0.63 (Fig. 4). Thus, we have $\mu_{\max}^{h} = 0.63$. Using Eq. (3), we have $\mathrm{Alloc}_{h=1} = 0.81$. Since the cloud allows only discrete levels for reserved resources, $\mathrm{Alloc}_{h=1}$ must be rounded up to the nearest value allowed in the cloud. Thus, $\mathrm{Alloc}_{h=1} = 1$. Using the tariff functions shown in Fig. 2, we have the cost rate $X_h = \mathrm{tariff}(w_{h=1} = 1, \mathrm{Alloc}_{h=1} = 1) = 11$. The iterations continue until $w_h = L$.

We summarize the results of all iterations $h$ performed for window $j = 1$ using our proposed algorithm in Table 1. From the table, we can see that the minimum value of the cost rate $X_h$ occurs when $h^* = 10$.
Hence, the optimum window size is $w^*_{j=1} = w_{h=10} = 10$, and the optimum amount of reserved resources during window $j = 1$ is $\mathrm{Alloc}^*_{j=1} = \mathrm{Alloc}_{h=10} = 2$. Similarly, we can find the optimum window size and optimum amount of resources in the next window ($j = 2$) given an updated prediction of the demand in another period of future time $L$.

5 HYBRID APPROACH FOR RESOURCE PROVISIONING

In this section, we consider the case wherein the cloud provider offers two different types of streaming resource provisioning plans: the reservation plan and the on-demand plan. With the reservation plan, the media content provider reserves resources in advance and is charged before the resources are utilized (upon receipt of the request at the cloud provider, i.e., prepaid resources). With the on-demand plan, the media content provider allocates streaming resources as needed. Pricing in the on-demand plan is on a pay-per-use basis. In general, the prices (tariffs) of the reservation plan are cheaper than those of the on-demand plan (i.e., time discount rates are only offered to the reserved (prepaid) resources). Amazon CloudFront [7], Amazon EC2 [6], GoGrid [24], MS Azure, OpSource, and Terremark are examples of cloud providers which offer Infrastructure-as-a-Service (IaaS) with both plans [10].

When the media content provider only uses the resource reservation plan, the under-provisioning problem can occur if the reserved (prepaid) resources are unable to fully meet the actual demand due to highly fluctuating demand or prediction mismatch. Also, the over-provisioning problem can occur if the reserved (prepaid) resources exceed the actual demand, in which case part of the reserved resources is wasted. However, when the cloud provider offers both the reservation plan and the on-demand plan, the media content provider can allocate resources in the cloud more efficiently.
In particular, the media content provider can use the reservation plan to benefit from the time-discounted rate, while using the on-demand plan to dynamically allocate streaming resources to its clients at the moment when the resources reserved using the reservation plan are unable to meet the actual demand and extra resources are needed to cover fluctuating and unpredictable demand (e.g., a flash crowd). We call this approach hybrid resource provisioning. This hybrid approach eliminates both the over-provisioning (over-subscribed) cost and the under-provisioning problem that may occur when using the reservation plan only.

In this hybrid resource provisioning approach, the tradeoff between the amount of resources allocated using the on-demand plan and the amount of resources allocated using the reservation plan needs to be adjusted so that the hybrid approach performs optimally. In this section, we propose an algorithm for this hybrid resource provisioning approach that maximally benefits from the time-discounted rate offered in the resource reservation plan, while eliminating any over-provisioning cost of reserved resources, such that the overall monetary cost of resource allocations in the cloud (including both the reserved resources and the on-demand resources) is minimized.

Fig. 4. An example of predicted demand over a period of future time L = 12.

TABLE 1. Example: Summary of Results for Iterations Executed for Window j = 1

As we have described in the previous section (Section 4), the cost of allocated resources using the reservation plan depends on the parameter $\eta$. We referred to $\eta$ as the level of confidence. We have shown that using a higher value of $\eta$ results in a higher amount of reserved resources in the cloud, and vice-versa. However, increasing the value of $\eta$ for the reserved resources may lead to the over-provisioning problem, while decreasing the value of $\eta$ may lead to the under-provisioning problem.
Since pricing of resource allocation in the on-demand plan is higher than in the reservation plan, one may erroneously believe that increasing the value of $h$ would always reduce the overall monetary cost, since the portion of reserved (discounted) resources in the cloud is increased. However, reserving too many resources (i.e., using a high value of $h$ for the reserved (prepaid) resources) may be far from optimal because it may significantly increase the over-provisioning (over-subscribed) cost. Hence, this hybrid approach requires that the content provider select the right value of $h$ for the reserved resources. Our proposed algorithm in this section computes the optimum value of $h$ ($h^*$) that yields the minimum overall monetary cost of resource allocations in the cloud (both reserved and on-demand resources) when the media content provider uses this hybrid resource provisioning approach.

Let us again assume that the media content provider can predict the demand for a future period of time $L$. Let $C_{hybrid}$ be the price that the media content provider expects to pay to the cloud provider for all streaming resources allocated in the cloud using the hybrid approach (i.e., $C_{hybrid}$ is the statistical mean of the cost). We can see that $C_{hybrid}$ is the summation of two terms: the price charged for the reserved resources in every window $j$ using the reservation plan (denoted by $C_j^{RSV}$), and the expected cost of resources allocated in the cloud during every window $j$ using the on-demand plan (denoted by $C_j^{OD}$). Hence,

$$C_{hybrid} = \sum_j \left( C_j^{RSV} + C_j^{OD} \right). \tag{4}$$

Let $Alloc_j^{RSV}$ be the amount of reserved resources in window $j$, and let $Alloc_j^{OD}$ be the amount of on-demand resources allocated in window $j$. Let $\mathrm{tariff}(w_j^{RSV}, Alloc_j^{RSV})$ be the tariff charged for the reserved resources in window $j$, and let $\mathrm{tariff}(Alloc_j^{OD})$ be the tariff charged for the on-demand resources in window $j$.
Note that the cost rate of the resource reservation plan, $\mathrm{tariff}(w_j^{RSV}, Alloc_j^{RSV})$, depends on both $w_j^{RSV}$ and $Alloc_j^{RSV}$, while $\mathrm{tariff}(Alloc_j^{OD})$ depends only on the amount of allocated resources $Alloc_j^{OD}$. This is because no time discount rate is offered for the on-demand resources.

Let $x$ be a random variable representing the demand for streaming capacity at any instant of time during window $j$, and let $f(x)$ be the probability density function of $x$. Note that when the amount of reserved resources in window $j$ ($Alloc_j^{RSV}$) is known, $C_j^{OD}$ can be computed by considering the event $Alloc_j^{RSV} < x < \infty$. This is because when $x < Alloc_j^{RSV}$, the amount of reserved resources in the cloud is sufficient to handle the actual streaming demand and there is no need to allocate extra resources using the on-demand plan. Thus, we can compute the cost of reserved resources in window $j$ (in $) as

$$C_j^{RSV} = w_j \cdot \mathrm{tariff}(w_j^{RSV}, Alloc_j^{RSV}), \tag{5}$$

and consequently the expected (statistical mean) cost of the on-demand resources in window $j$ can be computed as

$$C_j^{OD} = w_j \cdot \int_{Alloc_j^{RSV}}^{\infty} f(x) \cdot \mathrm{tariff}(x - Alloc_j^{RSV}) \, dx. \tag{6}$$

We shall consider a log-normal probability distribution $f(x)$, as discussed earlier [11], [14]. Thus, $f(x)$ in Eq. (6) can be written as

$$f(x) = \frac{1}{x \cdot \sigma \sqrt{2\pi}} \, e^{-\frac{1}{2} \cdot \left( \frac{\ln(x) - \mu_{max}}{\sigma} \right)^2}.$$

As we have described in Section 4, the right amount of reserved resources in window $j$ ($Alloc_j^{RSV}$) can be determined given the parameter $h$. Thus, $C_{hybrid}$ in Eq. (4) is a function of the parameter $h$ only. Our objective is to minimize $C_{hybrid}$ in Eq. (4), or equivalently to determine the value of $h$ that minimizes the overall cost of allocated resources using the hybrid approach. It is straightforward to show that $C_{hybrid}$ is convex with respect to $h$.
Thus, in order to minimize $C_{hybrid}$, we need to find the optimum value of $h$ ($h^*$) using Equations (5) and (6).

We can see that $h^*$ can be easily solved numerically for every window $j$ if tariff functions are given (i.e., $\mathrm{tariff}(t, Alloc^{RSV}(t))$ and $\mathrm{tariff}(Alloc^{OD}(t))$ for any duration of resource allocation). However, as we have discussed earlier, tariffs are often given in a tabular form. Moreover, the cloud service provider often requires a minimum reservation time for any allocated resources, and only allows discrete levels (categories) of allocated resources in the cloud. We take into account those constraints and propose an efficient heuristic algorithm for this hybrid resource provisioning approach. The pseudo code of the proposed algorithm is shown in Algorithm 2.

The algorithm works as follows. Suppose that $h$ takes discrete values, and the total number of possible values of $h$ is $S$. For every window $j$, the iteration process described in Algorithm 1 is performed for every value of $h$ in order to compute both the right amount of reserved resources ($Alloc_j^{RSV}$) and the right time over which these resources are reserved ($w_j^{RSV}$). When the amount of reserved resources in window $j$ is determined, the amount of extra resources that must be allocated using the on-demand plan in order to fulfil the predicted streaming demand can be easily computed as $Alloc_j^{OD} = \mu_{max} - Alloc_j^{RSV}$, where $\mu_{max}$ is the maximum value of the predicted streaming demand during window $j$. Thus, the total corresponding cost rate of allocated resources in window $j$ is computed as $X_h = \mathrm{tariff}(w_j^{RSV}, Alloc_j^{RSV}) + \mathrm{tariff}(Alloc_j^{OD})$, where $h$ is the iteration index. The iteration process continues, and out of all values of $X_h$ computed for different values of $h$, the algorithm finds $h^*$ corresponding to the minimum value. The algorithm is repeated for every window.

We can see that the complexity of the proposed algorithm (measured in terms of the number of iterations required for every window) is $O\left(\frac{L}{w_{min}} \cdot S\right)$.
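The per-window selection of $h$ described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the paper's Algorithm 2: `reserve_for` is a hypothetical stand-in for the Algorithm 1 iteration (not reproduced here), and the function names, the tariff callables, and all numbers in the example are invented for the sketch.

```python
from typing import Callable, Iterable, Tuple


def select_confidence(
    candidates: Iterable[float],
    reserve_for: Callable[[float], Tuple[float, float]],
    tariff_rsv: Callable[[float, float], float],
    tariff_od: Callable[[float], float],
    peak_demand: float,
) -> Tuple[float, float]:
    """Return (h_star, cost_rate) minimizing X_h over the candidate values of h.

    reserve_for(h) is assumed to return (w_rsv, alloc_rsv) for one window,
    i.e. the reservation time and reserved amount chosen for that h.
    """
    best = None
    for h in candidates:
        w_rsv, alloc_rsv = reserve_for(h)
        # Extra resources bought on demand: peak predicted demand minus the
        # reserved amount (clamped at zero, an assumption of this sketch).
        alloc_od = max(peak_demand - alloc_rsv, 0.0)
        # X_h = tariff(w_rsv, alloc_rsv) + tariff(alloc_od)
        cost_rate = tariff_rsv(w_rsv, alloc_rsv) + tariff_od(alloc_od)
        if best is None or cost_rate < best[1]:
            best = (h, cost_rate)
    if best is None:
        raise ValueError("candidates must contain at least one value of h")
    return best


if __name__ == "__main__":
    # Hypothetical inputs: three candidate values of h and flat per-unit tariffs.
    reserved_amount = {0.7: 2.0, 0.8: 3.0, 0.9: 5.0}
    h_star, cost = select_confidence(
        candidates=reserved_amount,
        reserve_for=lambda h: (10.0, reserved_amount[h]),
        tariff_rsv=lambda w, alloc: 4.0 * alloc,  # discounted reserved rate
        tariff_od=lambda alloc: 6.0 * alloc,      # undiscounted on-demand rate
        peak_demand=4.0,
    )
    print(h_star, cost)
```

The loop body runs once per candidate value, so for a single window the work grows linearly with the number of candidates $S$.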
Increasing the size of $S$ therefore increases the complexity of the algorithm, but also increases its accuracy. However, the complexity of our algorithm scales linearly with the size of the input ($S$), which means that our algorithm executes efficiently.

ALASAAD ET AL.: INNOVATIVE SCHEMES FOR RESOURCE ALLOCATION IN THE CLOUD FOR MEDIA STREAMING APPLICATIONS 1027

6 PERFORMANCE EVALUATION

We first analytically derive a demand prediction function that we shall use in our performance evaluations (Section 6.1). We then investigate the performance of our simple "on-line" Prediction-Based Resource Allocation (PBRA) algorithm proposed for reserving resources in the cloud, in terms of both the monetary cost of reserved resources in the cloud and complexity (CPU time) (Section 6.2). We then compare the performance of PBRA against two other schemes: a fixed window size resource reservation scheme and a pay-as-you-go resource allocation scheme (Section 6.2.2). Finally, we evaluate the performance of our hybrid resource allocation algorithm, proposed for the case when the cloud provider offers two streaming resource provisioning plans (reservation and on-demand), and show that our algorithm significantly reduces the overall cost of resource allocation (Section 6.3).

6.1 Demand Model

As we have discussed so far, prediction of the future demand for streaming capacity is required in order for the media content provider (e.g., VoD) to optimally reserve resources in the cloud. In this section, we use a special case of the demand in which the function of expected (mean) future streaming demand for a video channel (i.e., $E[D(t)]$) can be easily formulated analytically.
Specifically, we assume that all media streaming demand for a video channel available at a local VoD provider is generated from users located in a single private network (e.g., users in a college or office campus).

What distinguishes the evolution of interest in a media content among users of a private network from the Internet is that users in a private network are often socially connected (e.g., friends/colleagues in a social network). Those users form a community and share similar interests. Thus, the demand for a media content grows quickly in the private network as interested users contact others (by either broadcasting the knowledge about the existence of the media content to their friends in the social network, e.g., Facebook, or using email-group broadcast) and make them interested. However, the interest (demand) tapers off when a certain cumulative level of interest among users of the private network is reached. For example, a student in a class of 100 students can spread the knowledge about a video content to his classmates. If the popularity of this content among students in the class is 0.2, the demand increases quickly over time as interested users contact others, but tapers off when all potentially interested students in the class (20 students) have become interested in the content and viewed it. When all 20 students finish viewing the video content, the lifetime of that content in this community network expires.

We analytically characterize this viral evolution of interest in a media content among users of a private network. Let us assume that the number of friends to whom a user is connected in a social network (the node's degree) at any instant of time is on average $N$.
Let us further assume that a user who receives the notification about the existence of the content gets interested with probability $p$ and re-broadcasts the notification, in turn, to his friends on the social network, where $p$ is the expected popularity of the content among users of the private network. We further assume that users who receive multiple notifications for the same content do not rebroadcast the message.

If the social network graph is fully connected (i.e., a notification about the existence of the content reaches all users in the private network), we can then use the fluid-flow model to write the evolution of interest in a media content as

$$\frac{dI(t)}{dt} = I(t) \left[ p \left( N - g(t) \cdot N \right) \right],$$

where $I(t)$ is the total number of users interested in the content at time $t$ (cumulative interest). The term $g(t) \cdot N$ accounts for the fraction of $N$ users who have received multiple notifications by time instant $t$, with $g(t) := I(t) / N_T$, where $N_T$ is the potential number of users in the network who will ultimately become interested in the content ($N_T = 100$ in Fig. 5), i.e., $N_T$ is the maximum expected level of cumulative interest in the content in the private network.

The above formula is a Bernoulli differential equation with a quadratic term in $I(t)$ and can be solved as

$$I(t) = \frac{N_T \cdot I(0)}{I(0) + \left( N_T - I(0) \right) e^{-p \cdot N \cdot t}}, \tag{7}$$

where $I(0)$ is the number of interested users at time $t = 0$. We note that $I(t)$ has an S-shape (Fig. 5). It shows that the number of interested users increases quickly when the content becomes available, and that this growth then gradually slows and tapers off once the level of interest approaches $N_T$. This is similar to the demand function that was obtained using word-of-mouth spread of information by interested users (Bass model).
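As a quick numerical check of Eq. (7), the closed form can be evaluated and compared against the right-hand side of the fluid-flow equation. The sketch below is illustrative only: the function names are invented here, and apart from $N_T = 100$ (the value quoted for Fig. 5) the parameter values are assumptions.

```python
import math


def interest(t, n_total, i0, p, n_friends):
    """Cumulative number of interested users I(t), closed form of Eq. (7)."""
    decay = math.exp(-p * n_friends * t)
    return n_total * i0 / (i0 + (n_total - i0) * decay)


def interest_growth(i, n_total, p, n_friends):
    """Right-hand side of the fluid-flow model: I * p * (N - g * N), g = I / N_T."""
    g = i / n_total
    return i * p * (n_friends - g * n_friends)


if __name__ == "__main__":
    # Assumed values: I(0) = 1, p = 0.2, N = 5; N_T = 100 as quoted for Fig. 5.
    n_total, i0, p, n_friends = 100.0, 1.0, 0.2, 5.0
    for t in (0.0, 2.0, 4.0, 6.0, 8.0, 10.0):
        level = interest(t, n_total, i0, p, n_friends)
        rate = interest_growth(level, n_total, p, n_friends)
        print(f"t={t:4.1f}  I(t)={level:7.2f}  dI/dt={rate:6.2f}")
```

The printed growth rate rises while $I(t)$ is below $N_T / 2$ and falls afterwards, which is the S-shape referred to in the text.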
Similar interest evolution was also observed when measuring user interest in a video file on a YouTube server [25], and when measuring user interest in popular videos hosted on a university infrastructure (CoralCDN) [26].

Given the evolution of interest in a media content $I(t)$ in Eq. (7), we can now use the fluid-flow model to write the rate at which downloading users are completely served (finish downloading the media content) as

$$\frac{dS(t)}{dt} = \mu_Q \cdot \left[ I(t) - S(t) \right],$$

where $\mu_Q$ is the required QoS streaming rate for every downloading user (measured in bits/second), and $S(t)$ is the number of completely served users at time instant $t$. The above differential equation can be easily solved for $S(t)$. Hence, the expected value of demand for streaming capacity of the content at any time $t$ (measured in bits/second) is

$$E[D(t)] = \frac{dS(t)}{dt} = \mu_Q \cdot \left[ I(t) - S(t) \right]. \tag{8}$$

6.2 Evaluation of the Algorithm (PBRA) Proposed for Reserving Resources in the Cloud

The algorithm that we evaluate in this section is the first algorithm, proposed in Section 4 for resource reservation in the cloud. We used time-discount rates similar to those used in the pricing model employed by Amazon EC2 [6] in order to derive the tariff functions that we used in our evaluations. Those tariffs are non-linear functions of both the amount of reserved resources and the reservation time. An example of a tariff function that we used in our evaluations, for 3 units of reserved resources, is depicted in Fig. 6. Note that time discounts are given to the reserved resources. For example, we can see that if the media content provider wants to reserve (prepaid purchase) 3 units of streaming resources for 6 time units, then the tariff is $13 per unit of reserved time, whereas the tariff is $14.25 if the same amount of resources is reserved for only 1 time unit. We consider a log-normal probability distribution of the demand for streaming capacity with mean (i.e., predicted demand $E[D(t)]$) computed by Eq. (8) for $I(t)$ given in Fig. 5, $\mu_Q = 1$, and a variance of 3.

6.2.1 Performance versus Complexity

As we have discussed in Section 4, our proposed algorithm (PBRA) employs a trial window $w_{try}$ whose size takes values that are multiples of $w_{min}$, where $w_{min}$ can be defined as the granularity of the resource allocation in the cloud (i.e., it is the minimum reservation time that the cloud provider requires for any amount of resources reserved in the cloud), and it is measured in units of time. To investigate the impact of the value of $w_{min}$ on the performance of our algorithm, we compared the financial cost of media streaming when using our algorithm for varied sizes of $w_{min}$ at $h = 0.75$. To plot the comparison figure, we computed the ratio of the overall cost of resource reservation for every value of $w_{min}$ to the overall cost when using $w_{min} = 1$ (i.e., normalized cost) (Fig. 7).

Fig. 5. The evolution of interest in the video channel.

Fig. 6. A tariff function for units of reserved resources equal to 3.

Fig. 7. Performance versus complexity of the PBRA algorithm for resource reservation in the cloud.

The results show that the algorithm provides the least cost of resource allocation in the cloud when $w_{min} = 1$. Hence, we can see that the finer the granularity of resource allocation in the cloud (i.e., the smaller the value of $w_{min}$), the better the performance of our algorithm. The better performance, however, comes at higher algorithm complexity, where complexity is measured in terms of the total number of iterations ($h$). We can see that $h$ is higher for smaller $w_{min}$ (Fig. 7). However, even for the highest number of iterations (when $w_{min} = 1$), the total CPU time was only 1.02 seconds using an Intel(R) Core(TM)2 Quad CPU @2.82 GHz. If we compare this execution time with the period of time over which the algorithm is operating, $0 \le t\,(\text{sec}) \le 1{,}000$ (Fig. 5), we can see that our algorithm executes very efficiently.

6.2.2 Comparison with Other Resource Provisioning Algorithms

Recall that our proposed algorithm for resource reservation in the cloud (PBRA) is based on windows with variable sizes (i.e., variable time slots as shown in Fig. 3). The size of every window and the amount of reserved resources in every window are determined so as to minimize the financial cost to the media content provider. We evaluate the performance of our PBRA algorithm against two other resource provisioning schemes: the fixed window size scheme (denoted by fixed-reserve-time), and the pay-as-you-go resource allocation scheme, which is widely used in the clouds (denoted by pay-as-you-go). The fixed window size scheme is based on resource reservation wherein all time slots (windows) are of the same size (i.e., $w_j$ is the same $\forall j$). The pay-as-you-go scheme is based on on-demand resource allocation wherein resources are allocated as needed. The price of reserved resources is less than that of on-demand resources, since time-discounted rates are given only to the reserved resources.

We computed the overall financial cost when using each of the above schemes for resource allocation in the cloud. To plot the comparison figure, we computed the ratio of the overall cost for every value of $w_{min}$ to the cost when using our PBRA algorithm with $w_{min} = 1$ (normalized cost) (Fig. 8). In the case of fixed-reserve-time, we set $w_j$ always fixed as $w_j = w_{min}\ \forall j$, and $w_j = 10$. We can see that PBRA outperforms the fixed-reserve-time scheme for all values of $w_{min}$. This is because PBRA selects window sizes according to the predicted demand such that the right amount of resources is reserved in the cloud, which maximally benefits from the time-discount rates in the tariffs and ensures that reserved resources meet the actual demand without incurring wastage.
PBRA also outperforms the pay-as-you-go scheme because it maximally benefits from the time-discounted rates given to the reserved resources, while no discount is given to resources allocated using the on-demand scheme.

6.2.3 Impact of Different Probability Distributions of the Demand

In the next set of evaluations, we considered three log-normal probability distribution functions for the demand with the same mean but varied variances. The mean of all log-normal distributions, $E[D(t)]$, is given in Eq. (8), where $I(t)$ is given in Fig. 5 and $\mu_Q = 1$, while the variances of the log-normal distributions were set to 3, 6, and 8.

The stochastic effect of demand on the cost of reserved resources using PBRA is shown in Table 2 for $h = 0.75$. We observe that the overall resource reservation cost increases as the variance of the log-normal distribution increases. This is because larger variance means a higher likelihood that the reserved resources in the cloud do not meet the actual demand. Consequently, more reserved resources are required in the cloud to meet the actual demand given a certain probabilistic confidence $h$, which results in a higher cost for resource reservation in the cloud.

6.3 Evaluation of the Hybrid Approach for Resource Allocation in the Cloud

In this section, we evaluate the performance of our hybrid resource allocation algorithm proposed in Section 5. Our hybrid approach enables the media content provider to efficiently allocate resources in the cloud using both the reservation resource provisioning plan and the on-demand resource provisioning plan offered by the cloud provider.

As we have discussed in Section 5, the right value of parameter $h$ has to be determined for this hybrid approach to perform optimally. To investigate the impact of different values of $h$ on the performance of the hybrid approach, we considered continuous non-linear tariffs that are functions of both the allocated resources and the reservation time.
We used time-discount rates similar to those used in the pricing model employed by Amazon EC2 [6] in order to derive the tariff functions that we used in our evaluations. Time discount rates are offered only for reserved resources, while no time discount rates are offered for resources allocated using the on-demand plan. An example of a tariff function that we used in our evaluations, for 3 units of allocated resources, is depicted in Fig. 6. Referring to Fig. 6, if the average number of units of resources allocated in the cloud for 6 time units using the on-demand plan is 3, then the cost is 15 × 6 = $90; whereas if the media content provider reserves (prepaid purchase) the same amount of resources for 6 time units using the reservation plan, then the price charged is only 13 × 6 = $78.

Fig. 8. Performance comparisons.

TABLE 2. Media Streaming Cost Given Different Probability Distributions of the Demand (in $)

In the next set of simulations, we consider a demand with mean $E[D(t)]$ given in Eq. (8), where $I(t)$ is given in Fig. 5, $\mu_Q = 1$, and a variance of 3. Recall that our hybrid approach selects the right value of $h$ in every window. In every window $j$, different values of $h$ are tested to select the one that yields the least overall cost. Table 3 shows the cost of resources allocated using both the reservation plan and the on-demand plan when $j = 7$ (corresponding to $t = 650$), which results from using our hybrid algorithm. We observe that when $h$ increases, the cost of the resources allocated using the reservation plan increases, while the cost of resources allocated using the on-demand plan decreases. This is because a higher amount of reserved resources is required in the cloud for higher $h$ and, consequently, a smaller amount of on-demand resources is needed.
We also observe that when $h$ increases from 0.75 to 0.8 the overall cost (i.e., the cost of both reservation and on-demand resources) decreases, whereas when $h$ increases beyond 0.8 the overall cost increases. This is because the over-subscribed (over-provisioning) cost of the reserved resources becomes very high when $h > 0.8$. We can see that the optimum value of $h$ (i.e., the value of $h$ that yields the least overall cost) when $j = 7$ is about 0.8.

To get a sense of how the optimal selection of the value of $h$ can significantly reduce the overall monetary cost to the media content provider when using this hybrid streaming resource provisioning approach, let us compare the total cost when using our hybrid resource allocation algorithm at $j = 7$ against two cases: the case when the media content provider uses the on-demand plan only (pay-as-you-go), and the case when the media content provider uses the reservation plan only (fixed-reserve-time). We observed that the cost of our hybrid approach when $h^* = 0.8$ is $45,833, while the cost of allocated resources in the case of pay-as-you-go is fixed at about $52,000 (it does not depend on the value of $h$), and the cost of allocated resources in the case of fixed-reserve-time when $h = 0.8$ is about $48,000 (Fig. 9). Hence, our algorithm reduces the cost by about $6,200 compared to pay-as-you-go (i.e., about 12 percent cost saving), and reduces the cost by about $2,200 compared to fixed-reserve-time (i.e., 4.5 percent cost saving). We note here that the cost was computed for only one video channel. However, a media content provider generally offers hundreds of video channels to its clients. Therefore, the overall cost saving using our proposed algorithm can be significantly high for the large number of video channels offered by the media content provider.

7 CONCLUSION AND FUTURE WORK

This paper studies the problem of resource allocations in the cloud for media streaming applications.
We have considered non-linear time-discount tariffs that a cloud provider charges for resources reserved in the cloud. We have proposed algorithms that optimally determine both the amount of reserved resources in the cloud and their reservation time, based on prediction of future demand for streaming capacity, such that the financial cost to the media content provider is minimized. The proposed algorithms exploit the time-discounted rates in the tariffs, while ensuring that sufficient resources are reserved in the cloud without incurring wastage. We have evaluated the performance of our algorithms numerically and using simulations. The results show that our algorithms adjust the tradeoff between resources reserved in the cloud and resources allocated on-demand.

In future work, we shall perform experimental measurements to characterize the streaming demand in the Internet and develop our own demand forecasting module. We shall also investigate the case of multiple cloud providers and consider the market competition when allocating resources in the clouds.

ACKNOWLEDGMENTS

This work was supported by the National Center of Electronics, Communication, and Photonics at King Abdulaziz City for Science and Technology (Saudi Arabia). This paper was based in part on a paper that appeared in the proceedings of IEEE Globecom 2012.

TABLE 3. Media Streaming Cost Using Two Resource Allocation Plans Provided by the Cloud (Hybrid Resource Provisioning Approach) (in $)

Fig. 9. Hybrid approach performance comparisons.

Improving Web Navigation Usability by Comparing Actual and Anticipated Usage

05/08/201902/07/2019 by admin

We present a new method to identify navigation related Web usability problems based on comparing actual and anticipated usage patterns. The actual usage patterns can be extracted from Web server logs routinely recorded for operational websites by first processing the log data to identify users, user sessions, and user task-oriented transactions, and then applying a usage mining algorithm to discover patterns among actual usage paths. The anticipated usage, including information about both the path and time required for user-oriented tasks, is captured by our ideal user interactive path models constructed by cognitive experts based on their cognition of user behavior.

The comparison is performed via the mechanism of test MY SQL for checking results and identifying user navigation difficulties. The deviation data produced from this comparison can help us discover usability issues and suggest corrective actions to improve usability. A software tool was developed to automate a significant part of the activities involved. With an experiment on a small service-oriented website, we identified usability problems, which were cross-validated by domain experts, and quantified usability improvement by the higher task success rate and lower time and effort for given tasks after suggested corrections were implemented. This case study provides an initial validation of the applicability and effectiveness of our method.

1.2 INTRODUCTION

As the World Wide Web becomes prevalent today, building and ensuring easy-to-use Web systems is becoming a core competency for business survival. Usability is defined as the effectiveness, efficiency, and satisfaction with which specific users can complete specific tasks in a particular environment. Three basic Web design principles, i.e., structural firmness, functional convenience, and presentational delight, were identified to help improve users’ online experience. Structural firmness relates primarily to the characteristics that influence the website security and performance. Functional convenience refers to the availability of convenient characteristics, such as a site’s ease of use and ease of navigation, that help users’ interaction with the interface. Presentational delight refers to the website characteristics that stimulate users’ senses. Usability engineering provides methods for measuring usability and for addressing usability issues. Heuristic evaluation by experts and user-centered testing are typically used to identify usability issues and to ensure satisfactory usability.

However, significant challenges exist, including 1) accuracy of problem identification, due to false alarms common in expert evaluation; 2) unrealistic evaluation of usability, due to differences between the testing environment and the actual usage environment; and 3) increased cost, due to the prolonged evolution and maintenance cycles typical for many Web applications. On the other hand, log data routinely kept at Web servers represent actual usage. Such data have been used for usage-based testing and quality assurance, and also for understanding user behavior and guiding user interface design.

Server-side logs can be automatically generated by Web servers, with each entry corresponding to a user request. By analyzing these logs, Web workload was characterized and used to suggest performance enhancements for Internet Web servers. Because of the vastly uneven Web traffic, massive user population, and diverse usage environment, coverage-based testing is insufficient to ensure the quality of Web applications. Therefore, server-side logs have been used to construct Web usage models for usage-based Web testing or to automatically generate test cases accordingly to improve test efficiency.
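As a minimal sketch of how such log entries can be turned into per-user navigation paths, the Python snippet below parses lines in the Common Log Format and groups requests into sessions. The log format, the use of the client host as the user identifier, the 30-minute inactivity timeout, and the function names are all assumptions made for illustration; they are not taken from the method described here. Entries are assumed to arrive in chronological order.

```python
import re
from datetime import datetime, timedelta

# Common Log Format: host ident user [time] "METHOD path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+'
)


def parse_line(line):
    """Return (host, timestamp, path) for a log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    timestamp = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    return match.group("host"), timestamp, match.group("path")


def sessionize(lines, timeout=timedelta(minutes=30)):
    """Group requested paths into sessions per host, split on inactivity gaps."""
    sessions = {}   # host -> list of sessions, each a list of paths
    last_seen = {}  # host -> timestamp of that host's previous request
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # skip malformed entries
        host, timestamp, path = parsed
        if host not in last_seen or timestamp - last_seen[host] > timeout:
            sessions.setdefault(host, []).append([])  # start a new session
        sessions[host][-1].append(path)
        last_seen[host] = timestamp
    return sessions
```

Each resulting session is an ordered list of requested pages, which is the kind of actual usage path that a pattern mining step could then work on.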

1.3 LITERATURE SURVEY

WEB USABILITY PROBE: A TOOL FOR SUPPORTING REMOTE USABILITY EVALUATION OF WEB SITES

PUBLICATION: Human-Computer Interaction – INTERACT 2011. New York, NY, USA: Springer, 2011, pp. 349–357.

AUTHORS: T. Carta, F. Paternò, and V. F. D. Santana

EXPLANATION:

Usability evaluation of Web sites is still a difficult and time-consuming task, often performed manually. This paper presents a tool that supports remote usability evaluation of Web sites. The tool considers client-side data on user interactions and JavaScript events. In addition, it allows the definition of custom events, giving evaluators the flexibility to add specific events to be detected and considered in the evaluation. The tool supports evaluation of any Web site by exploiting a proxy-based architecture and enables the evaluator to perform a comparison between actual user behavior and an optimal sequence of actions.

SUPPORTING ACTIVITY MODELLING FROM ACTIVITY TRACES

PUBLICATION: Expert Syst., vol. 29, no. 3, pp. 261–275, 2012.

AUTHORS: O. L. Georgeon, A. Mille, T. Bellet, B. Mathern, and F. E. Ritter

EXPLANATION:

We present a new method and tool for activity modelling through qualitative sequential data analysis. In particular, we address the question of constructing a symbolic abstract representation of an activity from an activity trace. We use knowledge engineering techniques to help the analyst build ontology of the activity, that is, a set of symbols and hierarchical semantics that supports the construction of activity models. The ontology construction is pragmatic, evolutionist and driven by the analyst in accordance with their modelling goals and their research questions. Our tool helps the analyst define transformation rules to process the raw trace into abstract traces based on the ontology. The analyst visualizes the abstract traces and iteratively tests the ontology, the transformation rules and the visualization format to confirm the models of activity. With this tool and this method, we found innovative ways to represent a car-driving activity at different levels of abstraction from activity traces collected from an instrumented vehicle. As examples, we report two new strategies of lane changing on motorways that we have found and modelled with this approach.

TOOLS FOR REMOTE USABILITY EVALUATION OF WEB APPLICATIONS THROUGH BROWSER LOGS AND TASK MODELS

PUBLICATION: Behavior Res. Methods, Instrum., Comput., vol. 35, no. 3, pp. 369–378, 2003

AUTHORS: L. Paganelli and F. Paternò

EXPLANATION:

The dissemination of Web applications is extensive and still growing. The great penetration of Web sites raises a number of challenges for usability evaluators. Video-based analysis can be rather expensive and may provide limited results. In this article, we discuss what information can be provided by automatic tools able to process the information contained in browser logs and task models. To this end, we present a tool that can be used to compare log files of user behavior with the task model representing the actual Web site design, in order to identify where users’ interactions deviate from those envisioned by the system design.

CHAPTER 2

2.0 SYSTEM ANALYSIS

2.1 EXISTING SYSTEM:

Usability has long been addressed and discussed in previous studies, yet when people navigate the Web they still often encounter a number of usability issues. This is also due to the fact that Web surfers often decide on the spur of the moment what to do and whether to continue to navigate in a Web site. Usability evaluation is thus an important phase in the deployment of Web applications. For this purpose, automatic tools are very useful for gathering larger amounts of usability data and supporting their analysis.

Remote evaluation implies that users and evaluators are separated in time and/or space. This is important in order to analyse users in their daily environments and decreases the costs of the evaluation without requiring the use of specific laboratories and asking the users to move. In addition, tools for remote Web usability evaluation should be sufficiently general so that they can be used to analyse user behaviour even when using various browsers or applications developed using different toolkits. We prefer logging on the client-side in order to be able to capture any user-generated events, which can provide useful hints regarding possible usability problems.

Existing approaches have been used to support usability evaluation. An example was WebRemUsine, which was a tool for remote usability evaluation of Web applications through browser logs and task models. Propp and Forbrig have used task models for supporting usability evaluation of a different type of application: cooperative behaviour of people interacting in smart environments. In a different use of models, other authors discuss how task models can enhance visualization of the usability test log. In our case we do not require the effort of developing models to apply our tool. We only require that the designer provides an example of optimal use associated with each of the relevant tasks. The tool will then compare the logs of actual use with the optimal log in order to identify deviations, which may indicate potential usability problems.

2.1.1 DISADVANTAGES:

One earlier Web navigation study used a logger to collect data from a user test session on a Web interface prototype running on a PDA simulator, in order to evaluate different types of Web navigation tools and identify the best one for small display devices.

Users were asked to find the answer to specific questions using different types of navigation tools to move from one page to another. A database was used to store users’ actions, but they logged only the answer given by the user to each specific question. Moreover they stored separately every term searched by the user by means of the internal search tool.

Client-side data logging faces several challenges, including how to identify the elements that users are interacting with, how to manage element identification when the page changes dynamically, and how to manage data logging when users move from one page to another. The following are some of the solutions we adopted in order to deal with these issues.

2.2 PROPOSED SYSTEM:

We propose a new method to identify navigation related usability problems by comparing Web usage patterns extracted from server logs against anticipated usage represented in some cognitive user models (RQ2). Fig. 1 shows the architecture of our method. It includes three major modules: Usage Pattern Extraction, IUIP Modeling, and Usability Problem Identification. First, we extract actual navigation paths from server logs and discover patterns for some typical events. In parallel, we construct IUIP models for the same events. IUIP models are based on the cognition of user behavior and can represent anticipated paths for specific user-oriented tasks.

Our IUIP models are based on the cognitive models surveyed in Section II, particularly the ACT-R model. Due to the complexity of ACT-R model development and the low-level rule-based programming language it relies on, we constructed our own cognitive architecture and supporting tool based on the ideas from ACT-R. In general, user behavior patterns can be traced as a sequence of states and transitions, so our IUIP consists of a number of states and transitions. For a particular goal, a sequence of related operation rules can be specified for a series of transitions. Our IUIP model specifies both the path and the benchmark interactive time (no more than a maximum time) for some specific states (pages). The benchmark time can first be specified based on general rules for common types of Web pages. Humans usually try to complete their tasks in the most efficient manner by attempting to maximize their returns while minimizing the cost.

Typically, experts and novices will have different task performance. Novices need to learn task-specific knowledge while performing the task, but experts can complete the task in the most efficient manner. Based on this cognitive mechanism, IUIP models need to be constructed individually for novices and experts. With automated tool support for a significant part of the activities involved, our method is cost-effective. It would be particularly valuable in two common situations: where an adequate number of actual users cannot be involved in testing, and where cognitive experts are in short supply. Server logs in our method represent real users’ operations in natural working conditions, and our IUIP models injected with human behavior cognition represent part of cognitive experts’ work. We are currently integrating these modeling and analysis tools into a tool suite that supports measurement, analysis, and overall quality improvement for Web applications.

2.2.1 ADVANTAGES:

1) Logical deviation calculation:

a) When the path choice anticipated by the IUIP model is available but not selected, a single deviation is counted.

b) Sum up all the above deviations over all the selected user transactions for each page.

2) Temporal deviation calculation:

a) When a user spends more time at a specific page than the benchmark specified for the corresponding state in the IUIP model, a single deviation is counted.

b) Sum up all the above deviations over all the selected user transactions for each page.

The successive pages related to furniture categories are grouped into a dashed box. The pages with deviations and the unanticipated follow up pages below them are marked with solid rectangular boxes. Those unanticipated follow up pages will not be used themselves for deviation calculations to avoid double counting.

2.3 HARDWARE & SOFTWARE REQUIREMENTS:

2.3.1 HARDWARE REQUIREMENT:

- Processor – Pentium IV
- Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Key Board – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA

2.3.2 SOFTWARE REQUIREMENTS:

Operating System : Windows XP or Win7
Front End : JAVA JDK 1.7
Back End : MySQL Server
Server : Apache Tomcat Server
Script : JSP Script
Document : MS-Office 2007

CHAPTER 3

3.0 SYSTEM DESIGN:

Data Flow Diagram / Use Case Diagram / Flow Diagram:

The DFD is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on these data, and the output data generated by the system.

The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components. These components are the system process, the data used by the process, an external entity that interacts with the system and the information flows in the system.

DFD shows how the information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.

A DFD may be used to represent a system at any level of abstraction. A DFD may be partitioned into levels that represent increasing information flow and functional detail.

NOTATION:

SOURCE OR DESTINATION OF DATA:

External sources or destinations, which may be people or organizations or other entities

DATA STORE:

Here the data referenced by a process is stored and retrieved.

PROCESS:

People, procedures, or devices that produce data. The physical component is not identified.

DATA FLOW:

Data moves in a specific direction from an origin to a destination. The data flow is a “packet” of data.

MODELING RULES:

There are several common modeling rules when creating DFDs:

All processes must have at least one data flow in and one data flow out.
All processes should modify the incoming data, producing new forms of outgoing data.
Each data store must be involved with at least one data flow.
Each external entity must be involved with at least one data flow.
A data flow must be attached to at least one process.

3.1 ARCHITECTURE DIAGRAM

3.2 DATAFLOW DIAGRAM

UML DIAGRAMS:

3.2 USE CASE DIAGRAM:

3.3 CLASS DIAGRAM:

3.4 SEQUENCE DIAGRAM:

3.5 ACTIVITY DIAGRAM:

CHAPTER 4

4.0 IMPLEMENTATION:

IUIP MODELS:

Our IUIP model specifies both the path and the benchmark interactive time (no more than a maximum time) for some specific states (pages). The benchmark time can first be specified based on general rules for common types of Web pages. For example, human factors guidelines specify the upper bound for the response time to mitigate the risk that users will lose interest in a website. Humans usually try to complete their tasks in the most efficient manner by attempting to maximize their returns while minimizing the cost. Typically, experts and novices will have different task performance. Novices need to learn task-specific knowledge while performing the task, but experts can complete the task in the most efficient manner. Based on this cognitive mechanism, IUIP models need to be constructed individually for novices and experts. They are built by cognitive experts, who draw on their domain expertise and their knowledge of different users’ interactive behavior.

We can adapt the durations by performing iterative tests with different users. Diagrammatic notation methods and tools are often used to support interaction modeling and task performance evaluation. To support IUIP model construction and reuse, we used C++ and XML to develop our IUIP modeling tool based on the open-source visual diagram software DIA. DIA allows users to draw customized diagrams, such as UML, data flow, and other diagrams. Existing shapes and lines in DIA form part of the graphic notations in our IUIP models. New ones can be easily added by writing simple XML files. The operations, operation rules, and computation rules can be embedded into the graphic notations with an XML schema we defined to form our IUIP symbols. Currently, about 20 IUIP symbols have been created to represent typical Web interactions. IUIP symbols used in subsequent examples are explained at the bottom of the corresponding figure. Cognitive experts can use our IUIP modeling tool to develop various IUIP models for different Web applications.

The actual users’ navigation trails we extracted from the aggregated trail tree are compared against corresponding IUIP models automatically. This comparison will yield a set of deviations between the two. We can identify some common problems of actual users’ interaction with the Web application by focusing on deviations that occur frequently. Combined with expertise in product internal and contextual information, our results can also help identify the root causes of some usability problems existing in the Web design. Based on logical choices made and time spent by users at each page, the calculation of deviations between actual users’ usage patterns and IUIP can be divided into two parts:

1) Logical deviation calculation:

a) When the path choice anticipated by the IUIP model is available but not selected, a single deviation is counted.

b) Sum up all the above deviations over all the selected user transactions for each page.

2) Temporal deviation calculation:

a) When a user spends more time at a specific page than the benchmark specified for the corresponding state in the IUIP model, a single deviation is counted.

b) Sum up all the above deviations over all the selected user transactions for each page.
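The two counting rules above can be sketched in Java as follows. This is only an illustration under stated assumptions, not the tool's actual implementation: the class `DeviationCounter`, the `PageVisit` type, and all of its fields are hypothetical names introduced here, and each visit is assumed to already carry the path choice and benchmark time taken from the matching IUIP state.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: counts logical and temporal deviations per page. */
class DeviationCounter {

    /** One page visit within a selected user transaction (illustrative fields). */
    static class PageVisit {
        final String page;                  // page (state) being visited
        final String chosenNext;            // page the user actually moved to
        final String anticipatedNext;       // next page anticipated by the IUIP model
        final boolean anticipatedAvailable; // was the anticipated choice offered?
        final double secondsSpent;          // observed time on the page
        final double benchmarkSeconds;      // maximum time specified for the state

        PageVisit(String page, String chosenNext, String anticipatedNext,
                  boolean anticipatedAvailable, double secondsSpent,
                  double benchmarkSeconds) {
            this.page = page;
            this.chosenNext = chosenNext;
            this.anticipatedNext = anticipatedNext;
            this.anticipatedAvailable = anticipatedAvailable;
            this.secondsSpent = secondsSpent;
            this.benchmarkSeconds = benchmarkSeconds;
        }
    }

    /** Rule 1: one deviation when the anticipated choice was available but not selected. */
    static Map<String, Integer> logicalDeviations(List<PageVisit> visits) {
        Map<String, Integer> perPage = new TreeMap<String, Integer>();
        for (PageVisit v : visits) {
            boolean deviated = v.anticipatedAvailable
                    && !v.anticipatedNext.equals(v.chosenNext);
            add(perPage, v.page, deviated ? 1 : 0);
        }
        return perPage;
    }

    /** Rule 2: one deviation when the time on a page exceeds the benchmark. */
    static Map<String, Integer> temporalDeviations(List<PageVisit> visits) {
        Map<String, Integer> perPage = new TreeMap<String, Integer>();
        for (PageVisit v : visits) {
            add(perPage, v.page, v.secondsSpent > v.benchmarkSeconds ? 1 : 0);
        }
        return perPage;
    }

    /** Sums deviations over all visits of the same page. */
    private static void add(Map<String, Integer> perPage, String page, int amount) {
        Integer old = perPage.get(page);
        perPage.put(page, (old == null ? 0 : old) + amount);
    }
}
```

Pages with no deviation are kept in the result with a count of zero, so both maps cover the same set of pages and can be compared side by side.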

The IUIP model for the task “First Selection” is shown on the top. The corresponding user Trail 7, a part of a trail tree extracted from log data, is presented under it. The node in the tree is annotated with the number of users having reached the node across the same trail prefix. The successive pages related to furniture categories are grouped into a dashed box. The pages with deviations and the unanticipated follow up pages below them are marked with solid rectangular boxes. Those unanticipated followup pages will not be used themselves for deviation calculations to avoid double counting.

4.1 ALGORITHM

TRAIL TREE USAGE MINING ALGORITHM

The transactions identified from each user session form a collection of paths. We use the trie data structure to merge the paths along common prefixes. A trie, or a prefix tree, is an ordered tree used to store an associative array where the keys are usually strings. All the descendants of a node have a common prefix of the string associated with that node. The root is associated with the empty string. We adapted the trie algorithm to construct a tree structure that also captures user visit frequencies, which is called a trail tree in our work. In a trail tree, a complete path from the root to a leaf node is called a trail.

The leaf nodes of the trail tree are also annotated with the trail names. The transaction paths extracted from the Web server log are shown in the table to its left, together with path occurrence frequencies. Paths 1, 4, and 5 have the common first node a; therefore, they were merged together. For the second node of this subtree, Paths 1 and 4 both accessed Page b; therefore, the two paths were combined at Node b. Finally, Paths 1 and 4 were merged into a single trail, Trail 1, although Path 1 terminates at Node e. By the same method, the other paths can be integrated into the trail tree. The number at each edge indicates the number of users reaching the next node across the same trail prefix.
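The merging of paths along common prefixes can be sketched as a small trie in Java. This is a minimal sketch under assumptions, not the authors' code: the class `TrailTree`, its `Node` type, and the method names are hypothetical, and each path is assumed to arrive as a list of page names together with its occurrence frequency.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a trail tree: a trie that also keeps visit counts. */
class TrailTree {

    static class Node {
        final String page;
        int visits; // users reaching this node across the same trail prefix
        final Map<String, Node> children = new LinkedHashMap<String, Node>();
        Node(String page) { this.page = page; }
    }

    final Node root = new Node(""); // the root stands for the empty prefix

    /** Merges one transaction path, observed `frequency` times, into the tree. */
    void addPath(List<String> path, int frequency) {
        Node current = root;
        current.visits += frequency;
        for (String page : path) {
            Node child = current.children.get(page);
            if (child == null) {
                child = new Node(page);
                current.children.put(page, child);
            }
            child.visits += frequency;
            current = child;
        }
    }

    /** Returns the visit count of the node reached by `prefix`, or 0 if absent. */
    int visitsAt(List<String> prefix) {
        Node current = root;
        for (String page : prefix) {
            current = current.children.get(page);
            if (current == null) return 0;
        }
        return current.visits;
    }

    /** Lists all trails, i.e. complete root-to-leaf paths. */
    List<List<String>> trails() {
        List<List<String>> out = new ArrayList<List<String>>();
        collect(root, new ArrayList<String>(), out);
        return out;
    }

    private void collect(Node node, List<String> prefix, List<List<String>> out) {
        if (node.children.isEmpty()) {
            if (!prefix.isEmpty()) out.add(new ArrayList<String>(prefix));
            return;
        }
        for (Node child : node.children.values()) {
            prefix.add(child.page);
            collect(child, prefix, out);
            prefix.remove(prefix.size() - 1); // backtrack before the next sibling
        }
    }
}
```

With this structure, a path that ends at an inner node (such as a path ending at Node e while another continues past it) does not create its own trail; it only adds to the counts along the shared prefix.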

Based on the aggregated trail tree, further mining can be performed for some “interesting” pattern discovery. Typically, good mining results require close interaction with human experts to specify the characteristics that make navigation patterns interesting. In our method, we focus on the paths which are used by a sufficient number of users to finish a specific task. The paths can be initially prioritized by their usage frequencies and selected by using a threshold specified by the experts. Application-domain knowledge and contextual information, such as criticality of specific tasks, user privileges, etc., can also be used to identify “interesting” patterns. For the FG 2009 website, we extracted 30 trails each for Tasks 1, 2, and 3, and 5 trails for Task 4.

4.2 MODULES:

COGNITIVE USER MODEL:

WEB SERVER USER LOG:

USAGE PATTERN EXTRACTION:

USABILITY MEASURING:

4.3 MODULE DESCRIPTION:

COGNITIVE USER MODEL:

There is a growing need to incorporate insights from cognitive science about the mechanisms, strengths, and limits of human perception and cognition in order to understand the human factors involved in user interface design. Understanding the various constraints on cognition (e.g., system complexity) and the mechanisms and patterns of strategy selection can help human factor engineers develop solutions and apply technologies that are better suited to human abilities.

Commonly used cognitive models include GOMS, EPIC, and ACT-R. The GOMS model consists of Goals, Operators, Methods, and Selection rules. As the high-level architecture, GOMS describes behavior and defines interactions as a static sequence of human actions. As the low-level cognitive architecture, EPIC (Executive-Process/Interactive Control) and ACT-R (Adaptive Control of Thought-Rational) can be taken as the specific implementation of the high-level architecture.

They provide detailed information about how to simulate human processing and cognition. An important feature of these low-level cognitive architectures is that they are all implemented as computer programming systems so that cognitive models may be specified, executed, and their outputs (e.g., error rates and response latencies) compared with human performance data.

WEB SERVER USER LOG:

Server logs have also been used by organizations to learn about the usability of their products. For example, search queries can be extracted from server logs to discover user information needs for usability task analysis. There are many advantages to using server logs for usability studies. Logs can provide insight into real users performing actual tasks in natural working conditions versus in an artificial setting of a lab. Logs also represent the activities of many users over a long period of time versus the small sample of users in a short time span in typical lab testing. Data preparation techniques and algorithms can be used to process the raw Web server logs, and then mining can be performed to discover users’ visitation patterns for further usability analysis.

For example, organizations can mine server-side logs to predict users’ behavior and context to satisfy users’ needs. Users’ revisitation patterns can be discovered by mining server logs to develop guidelines for browser history mechanisms that can be used to reduce users’ cognitive and physical effort. Client-side logs can capture accurate, comprehensive usage data for usability analysis, because they allow low-level user interaction events such as keystrokes and mouse movements to be recorded.

For example, using these client-side data, the evaluator can accurately measure time spent on particular tasks or pages as well as study the use of “back” button and user click streams. Such data are often used with task based approaches and models for usability analysis by comparing discrepancies between the designer’s anticipation and a user’s actual behavior. However, the evaluator must program the UI, modify Web pages, or use an instrumented browser with plug-in tools or a special proxy server to collect such data.

USAGE PATTERN EXTRACTION:

Web server logs are our data source. Each entry in a log contains the IP address of the originating host, the timestamp, the requested Web page, the referrer, the user agent and other data. Typically, the raw data need to be preprocessed and converted into user sessions and transactions to extract usage patterns.
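As an illustration of reading the fields listed above, the following Java sketch parses one log line. It assumes the Apache Combined Log Format, which the text does not state; the class `AccessLogEntry`, its field names, and the sample log line in the usage note are hypothetical.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: one parsed entry of a Combined Log Format access log. */
class AccessLogEntry {
    // host ident user [timestamp] "METHOD path protocol" status bytes "referrer" "agent"
    private static final Pattern COMBINED = Pattern.compile(
        "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"(\\S+) (\\S+)[^\"]*\" "
        + "(\\d{3}) (\\S+) \"([^\"]*)\" \"([^\"]*)\"$");

    final String host;      // IP address of the originating host
    final long timeMillis;  // request timestamp as epoch milliseconds
    final String page;      // requested Web page (path and query string)
    final String referrer;  // referrer field, "-" when absent
    final String userAgent; // user agent string

    private AccessLogEntry(String host, long timeMillis, String page,
                           String referrer, String userAgent) {
        this.host = host;
        this.timeMillis = timeMillis;
        this.page = page;
        this.referrer = referrer;
        this.userAgent = userAgent;
    }

    /** Returns the parsed entry, or null when the line does not match the format. */
    static AccessLogEntry parse(String line) {
        Matcher m = COMBINED.matcher(line);
        if (!m.matches()) {
            return null;
        }
        // SimpleDateFormat is not thread-safe, so a new instance is created per call.
        SimpleDateFormat format =
            new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH);
        try {
            long time = format.parse(m.group(2)).getTime();
            return new AccessLogEntry(m.group(1), time, m.group(4),
                                      m.group(7), m.group(8));
        } catch (ParseException e) {
            return null; // unreadable timestamp: treat the line as malformed
        }
    }
}
```

For example, `AccessLogEntry.parse` applied to a line such as `192.0.2.10 - - [10/Oct/2009:13:55:36 +0000] "GET /index.php HTTP/1.1" 200 2326 "-" "Mozilla/5.0"` would yield the host `192.0.2.10` and the page `/index.php`. Malformed lines return `null` so that a caller can skip them during data cleaning.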

The data preparation and preprocessing include the following domain-dependent tasks.

1) Data cleaning: This task is usually site-specific and involves removing extraneous references to style files, graphics, or sound files that may not be important for the purpose of our analysis.

2) User identification: The remaining entries are grouped by individual users. Because no user authentication and cookie information is available in most server logs, we used the combination of IP, user agent, and referrer fields to identify unique users.

3) User session identification: The activity record of each user is segmented into sessions, with each representing a single visit to a site. Without additional authentication information from users and without mechanisms such as embedded session IDs, one must rely on heuristics for session identification. For example, we set an elapsed time of 15 min between two successive page accesses as a threshold to partition a user activity record into different sessions.

4) Path completion: Client or proxy side caching can often result in missing access references to some pages that have been cached. These missing references can often be heuristically inferred from the knowledge of site topology and referrer information, along with temporal information from server logs.

These tasks are time consuming and computationally intensive, but essential to the successful discovery of usage patterns.
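Two of the preprocessing steps above, data cleaning and session identification, can be sketched in Java as follows. This is a minimal sketch under assumptions: the class `LogPreprocessing`, the list of skipped file extensions, and the decision that a gap of exactly 15 min stays in the same session are all choices made here for illustration, not details given in the text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch of data cleaning and session identification. */
class LogPreprocessing {

    /** Assumed extensions for style, graphics, and sound files to discard. */
    private static final String[] SKIPPED =
        { ".css", ".js", ".png", ".jpg", ".gif", ".ico", ".wav", ".mp3" };

    /** 15 min threshold between two successive page accesses, in milliseconds. */
    static final long THRESHOLD_MILLIS = 15L * 60L * 1000L;

    /** Data cleaning: true when the requested path looks like a page, not an asset. */
    static boolean isPageRequest(String path) {
        int query = path.indexOf('?');
        String bare = (query >= 0 ? path.substring(0, query) : path)
            .toLowerCase(Locale.ENGLISH);
        for (String extension : SKIPPED) {
            if (bare.endsWith(extension)) {
                return false;
            }
        }
        return true;
    }

    /**
     * Session identification: splits one user's access times (epoch milliseconds,
     * sorted ascending) wherever the gap exceeds the threshold.
     */
    static List<List<Long>> splitSessions(List<Long> sortedAccessTimes) {
        List<List<Long>> sessions = new ArrayList<List<Long>>();
        List<Long> current = null;
        long previous = 0L;
        for (long time : sortedAccessTimes) {
            if (current == null || time - previous > THRESHOLD_MILLIS) {
                current = new ArrayList<Long>(); // gap too large: start a new session
                sessions.add(current);
            }
            current.add(time);
            previous = time;
        }
        return sessions;
    }
}
```

The access times passed to `splitSessions` are expected to belong to a single user, for instance after grouping entries by the combination of IP address, user agent, and referrer described in step 2.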

We developed a tool to automate all these tasks except part of path completion. For path completion, the designers or developers first need to manually discover the rules of missing references based on site structure, referrer, and other heuristic information. Once the repeated patterns are identified, this work can be automatically carried out. Our tool can work with server logs of different Web applications by modifying the related parameters in the configuration file. The processed log data are stored into a database for further use.

USABILITY MEASURING:

We now describe the specific results from applying our method to the FG 2009 website. We collected Web server access log data for the first three days after its deployment. The server log includes above 500 entries. After preprocessing the raw log data using our tool, we identified 58 unique users and 81 sessions. Then, we constructed four event models for four typical tasks. We extracted 95 trails for these tasks. Meanwhile, a designer with three years of GUI design experience and an expert with five years of experience in human factors practice for the Web constructed four IUIP models for the same tasks based on their cognition of users’ interactive behavior. By checking the extracted usage patterns against the four IUIP models, we obtained the logical and temporal deviations shown in Tables I and II and identified 17 usability issues or potential usability problems. Some usability issues were identified by both logical and temporal deviation analyses. Next, we further analyze these deviations for usability problem identification and improvement.

In Table I, 16 deviations took place in the page “index.php.” The unanticipated followup page is the page “login.php,” followed by the page “index.php?f=t” (login failure). Further reviewing the index page, we found that the page design is too simplistic: No instruction was provided to help users to login or register. We inferred that some users with limited online shopping experience were trying to use their regular email addresses and passwords to log in to the FG 2009 website.

We also found some structure design issues. For example, we observed that some users repeatedly visited the page “Selection Rules.” It is likely that when the users were not permitted to select any furniture in some categories (the FG website limited each user to select one piece of furniture under each category), they had to go to the page “Selection Rules” to find the reasons. To reduce these redundant operations and improve user experience, the help function for selection rules should be redesigned to make it more convenient for users to consult.

CHAPTER 5

5.0 SYSTEM STUDY:

5.1 FEASIBILITY STUDY:

Three key considerations involved in the feasibility analysis are

ECONOMICAL FEASIBILITY
TECHNICAL FEASIBILITY
SOCIAL FEASIBILITY

5.1.1 ECONOMICAL FEASIBILITY:

5.1.2 TECHNICAL FEASIBILITY

5.1.3 SOCIAL FEASIBILITY:

5.2 SYSTEM TESTING:

5.2.1 UNIT TESTING:

Description	Expected result
Test for application window properties.	All the properties of the windows are to be properly aligned and displayed.
Test for mouse operations.	All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions.

5.1.2 FUNCTIONAL TESTING:

Description	Expected result
Test for all modules.	All peers should communicate in the group.
Test for various peers in a distributed network framework, displaying all users available in the group.	The result after execution should be accurate.

5.1.3 NON-FUNCTIONAL TESTING:

Load testing
Performance testing
Usability testing
Reliability testing
Security testing

5.1.4 LOAD TESTING:

Description	Expected result
It is necessary to ascertain that the application behaves correctly under loads when ‘Server busy’ response is received.	Should designate another active node as a Server.

5.1.5 PERFORMANCE TESTING:

Description	Expected result
This is required to ensure that the application performs adequately, having the capability to handle many peers, delivering its results in the expected time, and using an acceptable level of resources; it is an aspect of operational management.	Should handle large input values and produce accurate results in the expected time.

5.1.6 RELIABILITY TESTING:

Description	Expected result
This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in providing the application.	In case of failure of the server, an alternate server should take over the job.

5.1.7 SECURITY TESTING:

Description	Expected result
Checking that the user identification is authenticated.	In case of failure, the user should not be connected to the framework.
Check whether group keys in a tree are shared by all peers.	The peers should know group key in the same group.

5.1.8 WHITE BOX TESTING:

Description	Expected result
Exercise all logical decisions on their true and false sides.	All the logical decisions must be valid.
Execute all loops at their boundaries and within their operational bounds.	All the loops must be finite.
Exercise internal data structures to ensure their validity.	All the data structures must be valid.

5.1.9 BLACK BOX TESTING:

Description	Expected result
To check for incorrect or missing functions.	All the functions must be valid.
To check for interface errors.	The entire interface must function normally.
To check for errors in data structures or external database access.	Database updates and retrieval must be performed correctly.
To check for initialization and termination errors.	All the functions and data structures must be initialized properly and terminated normally.

All the above system testing strategies are carried out, as the development, documentation, and institutionalization of the proposed goals and related policies are essential.

CHAPTER 6

6.0 SOFTWARE DESCRIPTION:

6.1 JAVA TECHNOLOGY:

Java technology is both a programming language and a platform.

The Java Programming Language

The Java programming language is a high-level language that can be characterized by all of the following buzzwords:

- Simple
- Architecture neutral
- Object oriented
- Portable
- Distributed
- High performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure

6.2 THE JAVA PLATFORM:

The Java platform has two components:

The Java Virtual Machine (Java VM)
The Java Application Programming Interface (Java API)

You’ve already been introduced to the Java VM. It’s the base for the Java platform and is ported onto various hardware-based platforms.

The following figure depicts a program that’s running on the Java platform. As the figure shows, the Java API and the virtual machine insulate the program from the hardware.

6.3 WHAT CAN JAVA TECHNOLOGY DO?

The essentials: Objects, strings, threads, numbers, input and output, data structures, system properties, date and time, and so on.
Applets: The set of conventions used by applets.
Networking: URLs, TCP (Transmission Control Protocol), UDP (User Datagram Protocol) sockets, and IP (Internet Protocol) addresses.
Internationalization: Help for writing programs that can be localized for users worldwide. Programs can automatically adapt to specific locales and be displayed in the appropriate language.
Security: Both low level and high level, including electronic signatures, public and private key management, access control, and certificates.
Software components: Known as JavaBeans™, can plug into existing component architectures.
Object serialization: Allows lightweight persistence and communication via Remote Method Invocation (RMI).
Java Database Connectivity (JDBC™): Provides uniform access to a wide range of relational databases.

The Java platform also has APIs for 2D and 3D graphics, accessibility, servers, collaboration, telephony, speech, animation, and more. The following figure depicts what is included in the Java 2 SDK.

6.4 HOW WILL JAVA TECHNOLOGY CHANGE MY LIFE?

Get started quickly: Although the Java programming language is a powerful object-oriented language, it’s easy to learn, especially for programmers already familiar with C or C++.
Write less code: Comparisons of program metrics (class counts, method counts, and so on) suggest that a program written in the Java programming language can be four times smaller than the same program in C++.
Write better code: The Java programming language encourages good coding practices, and its garbage collection helps you avoid memory leaks. Its object orientation, its JavaBeans component architecture, and its wide-ranging, easily extendible API let you reuse other people’s tested code and introduce fewer bugs.
Develop programs more quickly: Your development time may be as much as twice as fast versus writing the same program in C++. Why? You write fewer lines of code and it is a simpler programming language than C++.
Avoid platform dependencies with 100% Pure Java: You can keep your program portable by avoiding the use of libraries written in other languages. The 100% Pure Java™ Product Certification Program has a repository of historical process manuals, white papers, brochures, and similar materials online.
Write once, run anywhere: Because 100% Pure Java programs are compiled into machine-independent byte codes, they run consistently on any Java platform.
Distribute software more easily: You can upgrade applets easily from a central server. Applets take advantage of the feature of allowing new classes to be loaded “on the fly,” without recompiling the entire program.

6.5 ODBC:

6.6 JDBC:

JDBC was announced in March of 1996. It was released for a 90 day public review that ended June 8, 1996. Because of user input, the final JDBC v1.0 specification was released soon after.

6.7 JDBC Goals:

The goals that were set for JDBC are important. They will give you some insight as to why certain classes and functionalities behave the way they do. The eight design goals for JDBC are as follows:

SQL Level API

SQL Conformance

JDBC must be implementable on top of common database interfaces

Provide a Java interface that is consistent with the rest of the Java system

Because of Java’s acceptance in the user community thus far, the designers feel that they should not stray from the current design of the core Java system.

Keep it simple

Use strong, static typing wherever possible

Strong typing allows more error checking to be done at compile time; as a result, fewer errors appear at runtime.

Keep the common cases simple

Finally, we decided to proceed with the implementation using Java Networking.

And for dynamically updating the cache table we go for MS Access database.

Java is two things: a programming language and a platform.

Java is a high-level programming language that is all of the following

- Simple
- Architecture-neutral
- Object-oriented
- Portable
- Distributed
- High-performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure

Compilation happens just once; interpretation occurs each time the program is executed. The figure illustrates how this works.

[Figure: Java Program, Compilers, Interpreter, My Program]

6.7 NETWORKING TCP/IP STACK:

The TCP/IP stack is shorter than the OSI one:

TCP is a connection-oriented protocol; UDP (User Datagram Protocol) is a connectionless protocol.

IP datagrams:

UDP:

UDP is also connectionless and unreliable. What it adds to IP is a checksum for the contents of the datagram and port numbers. These are used to give a client/server model – see later.

TCP:

TCP supplies logic to give a reliable connection-oriented protocol above IP. It provides a virtual circuit that two processes can use to communicate.

Internet addresses

In order to use a service, you must be able to find it. The Internet uses an address scheme for machines so that they can be located. The address is a 32 bit integer which gives the IP address.

Network address:

Class A uses 8 bits for the network address with 24 bits left over for other addressing. Class B uses 16-bit network addressing. Class C uses 24-bit network addressing. Class D addresses are reserved for multicast groups and are not split into network and host parts.

Subnet address:

Internally, the UNIX network is divided into sub networks. Building 11 is currently on one sub network and uses 10-bit addressing, allowing 1024 different hosts.

Host address:

8 bits are finally used for host addresses within our subnet. This places a limit of 256 machines that can be on the subnet.

Total address:

The 32 bit address is usually written as 4 integers separated by dots.
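The relation between the 32-bit integer and its dotted form can be sketched in Java as follows. The class `Ipv4` and its method names are hypothetical, introduced only for this illustration; the sketch covers 32-bit (IPv4) addresses only.

```java
/** Hypothetical sketch: converts between a 32-bit address and dotted notation. */
class Ipv4 {

    /** Writes the four bytes of the address, most significant first. */
    static String toDotted(int address) {
        return ((address >>> 24) & 0xFF) + "."
             + ((address >>> 16) & 0xFF) + "."
             + ((address >>> 8) & 0xFF) + "."
             + (address & 0xFF);
    }

    /** Packs four dot-separated integers (0 to 255 each) into one 32-bit value. */
    static int fromDotted(String dotted) {
        String[] parts = dotted.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("expected 4 parts: " + dotted);
        }
        int value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("part out of range: " + part);
            }
            value = (value << 8) | octet; // shift earlier bytes up, append this one
        }
        return value;
    }
}
```

Because Java's `int` is signed, addresses whose first part is 128 or higher are stored as negative numbers; the unsigned shift `>>>` in `toDotted` keeps the conversion correct in that case.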

Port addresses

Sockets:

#include <sys/types.h>

#include <sys/socket.h>

int socket(int family, int type, int protocol);

6.8 JFREE CHART:

JFreeChart is a free 100% Java chart library that makes it easy for developers to display professional quality charts in their applications. JFreeChart’s extensive feature set includes:

A consistent and well-documented API, supporting a wide range of chart types;

A flexible design that is easy to extend, and targets both server-side and client-side applications;

Support for many output types, including Swing components, image files (including PNG and JPEG), and vector graphics file formats (including PDF, EPS and SVG);

JFreeChart is “open source” or, more specifically, free software. It is distributed under the terms of the GNU Lesser General Public Licence (LGPL), which permits use in proprietary applications.

6.8.1. Map Visualizations:

6.8.2. Time Series Chart Interactivity

6.8.3. Dashboards

6.8.4. Property Editors

CHAPTER 8

8.1 CONCLUSION:

We have developed a new method for the identification and improvement of navigation-related Web usability problems by checking extracted usage patterns against cognitive user models. As demonstrated by our case study, our method can identify areas with usability issues to help improve the usability of Web systems. Once a website is operational, our method can be continuously applied and drive ongoing refinements. In contrast with traditional software products and systems, Web based applications have shortened development cycles and prolonged maintenance cycles. Our method can contribute significantly to continuous usability improvement over these prolonged maintenance cycles. The usability improvement in successive iterations can be quantified by the progressively better effectiveness (higher task completion rate) and efficiency (less time for given tasks).

Our method is not intended to and cannot replace heuristic usability evaluation by experts and user-centered usability testing. It complements these traditional usability practices and can be incorporated into an integrated strategy for Web usability assurance. With automated tool support for a significant part of the activities involved, our method is cost-effective. It would be particularly valuable in the two common situations, where an adequate number of actual users cannot be involved in testing and cognitive experts are in short supply. Server logs in our method represent real users’ operations in natural working conditions, and our IUIP models injected with human behavior cognition represent part of cognitive experts’ work. We are currently integrating these modeling and analysis tools into a tool suite that supports measurement, analysis, and overall quality improvement for Web applications.

8.2 FUTURE ENHANCEMENT:

In the future, we plan to carry out validation studies with large-scale Web applications. We also plan to explore additional approaches to discover Web usage patterns and related usability problems that generalize to other interesting domains. For example, we have already started exploring deviation calculation and analysis at the trail level instead of at the individual page level. Such analyses might be more meaningful and yield more interesting results for Web applications with complex structure and operation sequences. Our IUIP modeling architecture and supporting tools also need to be further enhanced and optimized for more complex tasks. We will also further expand our usability research to cover more usability aspects to improve Web users’ overall satisfaction.