Real-Time Big Data Analytical Architecture for Remote Sensing Application

05/08/201902/07/2019 by admin

Real-time remote sensing Big Data involves far more than it seems at first, and extracting useful information from it efficiently poses major computational challenges in analyzing, aggregating, and storing data that are collected remotely. In view of these factors, there is a need for a system architecture that supports both real-time and offline data processing. In this paper, we propose a real-time Big Data analytical architecture for remote sensing satellite applications.

The proposed architecture comprises three main units:

1) Remote sensing Big Data acquisition unit (RSDU);

2) Data processing unit (DPU); and

3) Data analysis decision unit (DADU).

First, RSDU acquires data from the satellite and sends them to the Base Station, where initial processing takes place. Second, DPU plays a vital role in the architecture by providing filtration, load balancing, and parallel processing for efficient handling of real-time Big Data. Third, DADU is the upper-layer unit of the proposed architecture, which is responsible for compilation, storage of the results, and generation of decisions based on the results received from DPU.

1.2 INTRODUCTION:

Recently, interest in Big Data and its analysis has risen considerably, driven mainly by a large number of research challenges closely tied to real applications, such as modeling, processing, querying, mining, and distributing large-scale repositories. The term “Big Data” classifies specific kinds of data sets comprising unstructured data, which reside in the data layer of technical computing applications and the Web. The data stored in the underlying layer of all these technical computing application scenarios share some characteristics: 1) large-scale data, which refers to the size of the data and of the data warehouse; 2) scalability issues, which refer to the application’s likelihood of running at large scale (e.g., Big Data); 3) support for the extraction-transformation-loading (ETL) process from low-level, raw data to reasonably well-structured data; and 4) development of simple, interpretable analytics over Big Data warehouses with a view to delivering intelligent and meaningful knowledge from them.

Big Data are usually generated by online transactions, video/audio, email, clicks, logs, posts, social network data, scientific data, remote access sensory data, mobile phones, and their applications. These data accumulate in databases that grow extraordinarily and become difficult to capture, form, store, manage, share, process, analyze, and visualize with typical database software tools. Advances in Big Data sensing and computer technology are revolutionizing the way remote data are collected, processed, analyzed, and managed. In particular, the most recently designed sensors used in earth and planetary observatory systems generate continuous streams of data. Moreover, much work has been done in various fields of remote sensing satellite image data, such as change detection, gradient-based edge detection, region-similarity-based edge detection, and intensity gradient techniques for efficient intraprediction.

In this paper, we refer to the high-speed continuous stream of data or high-volume offline data as “Big Data,” which leads us to a new world of challenges. Transforming such remotely sensed data into scientific understanding is a critical task. Given the rate at which the volume of remote access data is increasing, a number of individual users as well as organizations now demand an efficient mechanism to collect, process, analyze, and store these data and their resources. Big Data analysis is a more challenging task than locating, identifying, understanding, and citing data. With large-scale data, all of this has to happen in an automated manner, since it requires diverse data structures as well as semantics to be expressed in a computer-readable format.

However, even when analyzing simple data consisting of a single data set, a decision is required on how to design the database, since there may be alternative ways to store the same information. In such conditions, a given design might have an advantage over others for certain processes and possible drawbacks for other purposes. To address these needs, relational database vendors provide various analytical platforms. These platforms come in various shapes, from software-only products to analytical services that run in third-party hosted environments. In remote access networks, the data sources, such as sensors, can produce an overwhelming amount of raw data.

We refer to this as the first step, i.e., data acquisition, in which much of the data are of no interest and can be filtered or compressed by orders of magnitude. The challenge in using such filters is to ensure that they do not discard useful information. For instance, when considering news reports, is it adequate to keep only the information that mentions a company name? Alternatively, do we need the entire report, or simply a small piece around the mentioned name? The second challenge is the automatic generation of accurate metadata that describe the composition of the data and the way they were collected and analyzed. Such metadata are hard to analyze, since we may need to know the source of each datum in remote access.

1.3 LITERATURE SURVEY:

BIG DATA AND CLOUD COMPUTING: CURRENT STATE AND FUTURE OPPORTUNITIES

AUTHOR: D. Agrawal, S. Das, and A. E. Abbadi

PUBLISH: Proc. Int. Conf. Extending Database Technol. (EDBT), 2011, pp. 530–533.

EXPLANATION:

Scalable database management systems (DBMS), both for update-intensive application workloads and for decision support systems for descriptive and deep analytics, are a critical part of the cloud infrastructure and play an important role in ensuring the smooth transition of applications from traditional enterprise infrastructures to next-generation cloud infrastructures. Though scalable data management has been a vision for more than three decades and much research has focused on large-scale data management in the traditional enterprise setting, cloud computing brings its own set of novel challenges that must be addressed to ensure the success of data management solutions in the cloud environment. This tutorial presents an organized picture of the challenges faced by application developers and DBMS designers in developing and deploying internet-scale applications. Our background study encompasses both classes of systems: (i) for supporting update-heavy applications, and (ii) for ad hoc analytics and decision support. We then focus on providing an in-depth analysis of systems for supporting update-intensive web applications and provide a survey of the state of the art in this domain. We crystallize the design choices made by some successful large-scale database management systems, analyze the application demands and access patterns, and enumerate the desiderata for a cloud-bound DBMS.

CHANGE DETECTION IN SYNTHETIC APERTURE RADAR IMAGE BASED ON FUZZY ACTIVE CONTOUR MODELS AND GENETIC ALGORITHMS

AUTHOR: J. Shi, J. Wu, A. Paul, L. Jiao, and M. Gong

PUBLISH: Math. Prob. Eng., vol. 2014, 15 pp., Apr. 2014.

EXPLANATION:

This paper presents an unsupervised change detection approach for synthetic aperture radar images based on a fuzzy active contour model and a genetic algorithm. The aim is to partition the difference image, which is generated from multitemporal satellite images, into changed and unchanged regions. A fuzzy technique is an appropriate approach to analyze the difference image, where regions are not always statistically homogeneous. Since interval type-2 fuzzy sets are well suited for modeling various uncertainties in comparison to traditional fuzzy sets, they are combined with active contour methodology for properly modeling uncertainties in the difference image. The interval type-2 fuzzy active contour model is designed to provide preliminary analysis of the difference image by generating intermediate change detection masks. Each intermediate change detection mask has a cost value. A genetic algorithm is employed to find the final change detection mask with the minimum cost value by evolving the realization of intermediate change detection masks. Experimental results on real synthetic aperture radar images demonstrate that change detection results obtained by the improved fuzzy active contour model exhibit less error than previous approaches.

A BIG DATA ARCHITECTURE FOR LARGE SCALE SECURITY MONITORING

AUTHOR: S. Marchal, X. Jiang, R. State, and T. Engel

PUBLISH: Proc. IEEE Int. Congr. Big Data, 2014, pp. 56–63.

EXPLANATION:

Network traffic is a rich source of information for security monitoring. However, the increasing volume of data to process raises issues, rendering holistic analysis of network traffic difficult. In this paper we propose a solution to cope with the tremendous amount of data to analyse for security monitoring purposes. We introduce an architecture dedicated to security monitoring of local enterprise networks. The application domain of such a system is mainly network intrusion detection and prevention, but it can be used as well for forensic analysis. This architecture integrates two systems, one dedicated to scalable distributed data storage and management and the other dedicated to data exploitation. DNS data, NetFlow records, HTTP traffic and honeypot data are mined and correlated in a distributed system that leverages state-of-the-art big data solutions. Data correlation schemes are proposed and their performance is evaluated against several well-known big data frameworks including Hadoop and Spark.

CHAPTER 2

2.0 SYSTEM ANALYSIS

2.1 EXISTING SYSTEM:

Existing methods are often inapplicable on standard computers, where it is not desirable or possible to load the entire image into memory before doing any processing. In this situation, it is necessary to load only part of the image and process it before saving the result to disk and proceeding to the next part. This corresponds to the concept of on-the-flow processing. Remote sensing processing can be seen as a chain of steps, each of which is generally independent of the following ones and generally focuses on a particular domain. For example, the image can be radiometrically corrected to compensate for atmospheric effects and indices computed before an object extraction based on these indices takes place.

The typical processing chain will process the whole image for each step, returning the final result after everything is done. For some processing chains, iterations between the different steps are required to find the correct set of parameters. Due to the variability of satellite images and the variety of the tasks that need to be performed, fully automated tasks are rare. Humans are still an important part of the loop. These concepts are linked in the sense that both rely on the ability to process only one part of the data.

In the case of simple algorithms, this is quite easy: the input is just split into different non-overlapping pieces that are processed one by one. But most algorithms do consider the neighborhood of each pixel. As a consequence, in most cases, the data will have to be split into partially overlapping pieces. The objective is to obtain the same result as the original algorithm as if the processing was done in one go. Depending on the algorithm, this is unfortunately not always possible.
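The overlap requirement can be illustrated with a small sketch. It assumes a 3x3 mean filter with clamped borders as the neighborhood algorithm and an image held as a list of rows; the function names `box_filter` and `filter_in_strips` are hypothetical and chosen only for this illustration. Each strip is read with one extra row (a "halo") on each side, filtered, and only its own rows are kept, so the result matches processing the whole image in one go.

```python
def box_filter(rows):
    """3x3 mean filter over a list of rows; border pixels are repeated (clamped)."""
    h, w = len(rows), len(rows[0])
    out = []
    for y in range(h):
        out_row = []
        for x in range(w):
            total = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    total += rows[yy][xx]
            out_row.append(total / 9)
        out.append(out_row)
    return out


def filter_in_strips(image, strip_height, halo=1):
    """Filter the image strip by strip, reading `halo` extra rows on each side."""
    h = len(image)
    result = []
    for start in range(0, h, strip_height):
        stop = min(start + strip_height, h)
        lo = max(start - halo, 0)   # first row actually loaded
        hi = min(stop + halo, h)    # one past the last row actually loaded
        filtered = box_filter(image[lo:hi])
        # Keep only the rows that belong to this strip, dropping the halo rows.
        result.extend(filtered[start - lo:start - lo + (stop - start)])
    return result


# Synthetic 10x8 test image (values are arbitrary).
image = [[(x * 7 + y * 13) % 50 for x in range(8)] for y in range(10)]
assert filter_in_strips(image, strip_height=3) == box_filter(image)
```

With `halo=0` the strips no longer overlap and the rows at strip boundaries come out differently, which is the failure mode the overlapping split avoids.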

2.1.1 DISADVANTAGES:

A reader that loads the image, or part of the image in memory from the file on disk;

A filter which carries out a local processing that does not require access to neighboring pixels (a simple threshold for example), the processing can happen on CPU or GPU;

A filter that requires the value of neighboring pixels to compute the value of a given pixel (a convolution filter is a typical example), the processing can happen on CPU or GPU;

A writer to output the resulting image in memory into a file on disk, note that the file could be written in several steps. We will illustrate in this example how it is possible to compute part of the image in the whole pipeline, incurring only minimal computation overhead.

2.2 PROPOSED SYSTEM:

We present a remote sensing Big Data analytical architecture, which is used to analyze real-time as well as offline data. At first, the data are remotely preprocessed so that they are readable by machines. Afterward, this useful information is transmitted to the Earth Base Station for further data processing. The Earth Base Station performs two types of processing: processing of real-time data and processing of offline data. In the case of offline data, the data are transmitted to an offline data-storage device, which allows the data to be used later. The real-time data, by contrast, are transmitted directly to the filtration and load balancer server, where a filtration algorithm is employed to extract the useful information from the Big Data.

The load balancer, in turn, balances the processing power by distributing the real-time data equally among the servers. The filtration and load-balancing server thus not only filters the data and balances the load, but also enhances system efficiency. The filtered data are then processed by the parallel servers and sent to the data aggregation unit (which can store the processed data in the result storage device if required) for comparison purposes by the decision and analyzing server. The proposed architecture accepts remote access sensory data as well as direct access network data (e.g., GPRS, 3G, xDSL, or WAN). The proposed architecture and the algorithms are implemented and applied to remote sensing earth observatory data.

The proposed architecture is capable of dividing, load balancing, and parallel processing of only the useful data. It therefore analyzes real-time remote sensing Big Data from an earth observatory system efficiently. Furthermore, the proposed architecture can store incoming raw data in order to perform offline analysis on large stored dumps when required. Finally, a detailed analysis of remotely sensed earth observatory Big Data for land and sea area is provided using .NET. In addition, various algorithms are proposed for each level of RSDU, DPU, and DADU to detect land as well as sea area and to illustrate the working of the architecture.

2.2.1 ADVANTAGES:

The proposed architecture processes high-speed, large-volume real-time remote sensing image data. It works on both DPU and DADU by taking data from the satellite as input.

Using our architecture for offline as well as online traffic, we perform a simple analysis on remote sensing earth observatory data. We assume that the data are big in nature and difficult for a single server to handle.

The data are continuously coming from a satellite with high speed. Hence, special algorithms are needed to process, analyze, and make a decision from that Big Data. Here, in this section, we analyze remote sensing data for finding land, sea, or ice area.

We have used the proposed architecture to perform the analysis and have proposed algorithms for handling, processing, analyzing, and decision-making for remote sensing Big Data images.

2.3 HARDWARE & SOFTWARE REQUIREMENTS:

2.3.1 HARDWARE REQUIREMENT:

- Processor – Pentium IV
- Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Key Board – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA

2.3.2 SOFTWARE REQUIREMENTS:

Operating System : Windows XP or Win7
Front End : Microsoft Visual Studio .NET 2008
Script : C# Script
Back End : MS-SQL Server 2005
Document : MS-Office 2007

CHAPTER 3

3.0 SYSTEM DESIGN:

Data Flow Diagram / Use Case Diagram / Flow Diagram:

The DFD is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing steps carried out on these data, and the output data generated by the system.

The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components. These components are the system process, the data used by the process, an external entity that interacts with the system and the information flows in the system.

DFD shows how the information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.

A DFD may be used to represent a system at any level of abstraction. A DFD may be partitioned into levels that represent increasing information flow and functional detail.

NOTATION:

SOURCE OR DESTINATION OF DATA:

External sources or destinations, which may be people or organizations or other entities

DATA STORE:

Here the data referenced by a process is stored and retrieved.

PROCESS:

People, procedures, or devices that produce data; the physical component is not identified.

DATA FLOW:

Data moves in a specific direction from an origin to a destination. The data flow is a “packet” of data.

MODELING RULES:

There are several common modeling rules when creating DFDs:

All processes must have at least one data flow in and one data flow out.
All processes should modify the incoming data, producing new forms of outgoing data.
Each data store must be involved with at least one data flow.
Each external entity must be involved with at least one data flow.
A data flow must be attached to at least one process.
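The structural rules above can be checked mechanically. The sketch below is only an illustration: it assumes a DFD is given as sets of node names plus a list of (source, target) flows, and the function name `check_dfd` and the example node names are hypothetical. The rule that a process should modify its incoming data is semantic and is not checked here.

```python
def check_dfd(processes, stores, entities, flows):
    """Return a list of rule violations for a DFD described by node sets and flows."""
    problems = []
    sources = {s for s, _ in flows}
    targets = {t for _, t in flows}

    # Every process needs at least one incoming and one outgoing flow.
    for p in sorted(processes):
        if p not in targets:
            problems.append(f"process '{p}' has no incoming flow")
        if p not in sources:
            problems.append(f"process '{p}' has no outgoing flow")

    # Every data store and external entity must take part in some flow.
    for kind, names in (("data store", stores), ("external entity", entities)):
        for n in sorted(names):
            if n not in sources and n not in targets:
                problems.append(f"{kind} '{n}' is not connected to any flow")

    # Every flow must touch at least one process.
    for s, t in flows:
        if s not in processes and t not in processes:
            problems.append(f"flow {s} -> {t} is not attached to a process")
    return problems


# One possible reading of the proposed architecture as a DFD (illustrative only).
processes = {"FLBS", "Processing server", "Aggregation server", "Decision server"}
stores = {"Result storage"}
entities = {"Satellite", "Application"}
flows = [
    ("Satellite", "FLBS"),
    ("FLBS", "Processing server"),
    ("Processing server", "Aggregation server"),
    ("Aggregation server", "Result storage"),
    ("Aggregation server", "Decision server"),
    ("Decision server", "Application"),
]
print(check_dfd(processes, stores, entities, flows))
```

Adding a flow that runs directly from the external entity to the data store, such as `("Satellite", "Result storage")`, is reported as a violation because no process is attached to it.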

3.1 ARCHITECTURE DIAGRAM:

3.2 DATAFLOW DIAGRAM:

UML DIAGRAMS:

3.2 USE CASE DIAGRAM:

3.3 CLASS DIAGRAM:

3.4 SEQUENCE DIAGRAM:

3.5 ACTIVITY DIAGRAM:

CHAPTER 4

4.0 IMPLEMENTATION:

Big Data, like cloud computing, covers diverse technologies. The input of Big Data comes from social networks (Facebook, Twitter, LinkedIn, etc.), Web servers, satellite imagery, sensory data, banking transactions, etc. Despite the very recent emergence of Big Data architectures in scientific applications, numerous efforts toward Big Data analytics architectures can already be found in the literature. In addition to these, we propose a remote sensing Big Data architecture to analyze Big Data in an efficient manner, as shown in Fig. 1. Fig. 1 delineates n satellites that obtain earth observatory Big Data images with sensors or conventional cameras, through which scenes are recorded using radiation. Special techniques are applied to process and interpret remote sensing imagery for the purpose of producing conventional maps, thematic maps, resource surveys, etc. We have divided the remote sensing Big Data architecture into three units: RSDU, DPU, and DADU.

In healthcare scenarios, medical practitioners gather massive volumes of data about patients, medical history, medications, and other details. These data are accumulated in drug-manufacturing companies. The nature of these data is very complex, and sometimes the practitioners are unable to relate them to other information, which results in important information being missed. Employing advanced analytic techniques to organize and extract useful information from Big Data leads to personalized medication, and such techniques also give insight into the hereditary causes of disease.

4.1 ALGORITHM:

The filtration and load balancer algorithm (FLBA) takes a satellite data set or product, filters it, divides it into segments, and performs load balancing.

The processing algorithm calculates results for different parameters against each incoming block and sends them to the next level. In step 1, the mean, the standard deviation (SD), the absolute difference, and the number of values greater than the maximum threshold are calculated. In the next step, the results are transmitted to the aggregation server.
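A minimal sketch of the per-block calculation follows. The text does not define "absolute difference" precisely, so this example assumes the mean absolute difference of each value from the block mean, and it assumes the population form of the SD; the function name `block_statistics` and the sample values are hypothetical.

```python
import math


def block_statistics(block, max_threshold):
    """Compute the per-block parameters named in the text for one block of values."""
    n = len(block)
    mean = sum(block) / n
    # Population standard deviation (assumption: the text does not say which form).
    sd = math.sqrt(sum((v - mean) ** 2 for v in block) / n)
    # Assumption: "absolute difference" is taken as the mean absolute deviation.
    abs_diff = sum(abs(v - mean) for v in block) / n
    above = sum(1 for v in block if v > max_threshold)
    return {"mean": mean, "sd": sd, "abs_diff": abs_diff, "above_threshold": above}


stats = block_statistics([2, 4, 4, 4, 5, 5, 7, 9], max_threshold=6)
print(stats)  # mean 5.0, sd 2.0, abs_diff 1.5, above_threshold 2
```

Each processing server would compute such a dictionary for every block it receives and forward it, together with the block index, to the aggregation server.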

The aggregation and compilation algorithm (ACA) collects the results from each processing server against each block Bi and then combines, organizes, and stores these results in an RDBMS database.
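The aggregation step can be sketched as follows. This example uses an in-memory SQLite database from the Python standard library purely as a stand-in for the RDBMS (the project itself lists MS-SQL Server as the back end); the table name `block_result`, its columns, and the sample values are hypothetical.

```python
import sqlite3


def aggregate(results):
    """Store per-block results, given as (block_id, mean, sd) tuples, in one table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE block_result (block_id INTEGER PRIMARY KEY, mean REAL, sd REAL)"
    )
    conn.executemany("INSERT INTO block_result VALUES (?, ?, ?)", results)
    conn.commit()
    return conn


# Results may arrive out of order because the processing servers run in parallel.
incoming = [(2, 40.0, 3.5), (0, 12.5, 1.0), (1, 13.0, 1.2)]
conn = aggregate(incoming)
rows = conn.execute(
    "SELECT block_id, mean, sd FROM block_result ORDER BY block_id"
).fetchall()
print(rows)  # rows ordered by block id, ready for the decision-making server
```

Keying the table on the block index lets the decision-making server read the results in image order regardless of the order in which they arrived.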

4.2 MODULES:

DATA ANALYSIS DECISION UNIT (DADU):

DATA PROCESSING UNIT (DPU):

REMOTE SENSING APPLICATION RSDU:

FINDINGS AND DISCUSSION:

ALGORITHM DESIGN AND TESTING:

4.3 MODULE DESCRIPTION:

DATA PROCESSING UNIT (DPU):

In the data processing unit (DPU), the filtration and load balancer server (FLBS) has two basic responsibilities: filtration of data and load balancing of processing power. Filtration identifies the data that are useful for analysis; it lets only this useful information through, whereas the rest of the data are blocked and discarded. This enhances the performance of the whole proposed system. The load-balancing part of the server divides the filtered data into parts and assigns them to various processing servers. The filtration and load-balancing algorithm varies from analysis to analysis; e.g., if only sea wave and temperature data need to be analyzed, the measurements of these data are extracted by the filter and segmented into parts.
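The two responsibilities can be sketched in a few lines. This is an illustration under stated assumptions, not the project's implementation: records are assumed to be dictionaries, filtering is assumed to mean keeping only the requested fields, and blocks are assigned to servers round-robin. The function name `filter_and_balance` and the field names are hypothetical.

```python
def filter_and_balance(records, wanted_fields, block_size, n_servers):
    """Keep only the wanted fields, cut the data into blocks, assign blocks round-robin."""
    # Filtration: drop every field that is not needed for this analysis.
    filtered = [{k: r[k] for k in wanted_fields} for r in records]
    # Segmentation into fixed-size blocks (the last block may be shorter).
    blocks = [filtered[i:i + block_size] for i in range(0, len(filtered), block_size)]
    # Load balancing: block i goes to server i modulo the number of servers.
    assignment = {server: [] for server in range(n_servers)}
    for index, block in enumerate(blocks):
        assignment[index % n_servers].append((index, block))
    return assignment


records = [{"sea_wave": i, "temperature": 20 + i, "wind": 5} for i in range(10)]
plan = filter_and_balance(records, ("sea_wave", "temperature"), block_size=3, n_servers=2)
for server, work in plan.items():
    print(server, [index for index, _ in work])
```

Round-robin assignment gives each server an equal share only when blocks cost roughly the same to process; a queue from which idle servers pull work would be an alternative when they do not.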

Each processing server has its own algorithm implementation for processing the incoming segment of data from FLBS. Each processing server performs statistical calculations, measurements, and other mathematical or logical tasks to generate intermediate results against each segment of data. Since these servers perform their tasks independently and in parallel, the performance of the proposed system is greatly enhanced, and the results against each segment are generated in real time. The results generated by each server are then sent to the aggregation server for compilation, organization, and storage for further processing.
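The independent, parallel handling of segments can be imitated on one machine. In this sketch a thread pool stands in for the separate processing servers, which is an assumption made only to keep the example self-contained; the per-segment task is reduced to a mean, and the names `process_block` and `process_in_parallel` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def process_block(indexed_block):
    """Compute an intermediate result (here just the mean) for one segment."""
    index, values = indexed_block
    return index, sum(values) / len(values)


def process_in_parallel(blocks, n_workers=4):
    """Process every block independently and collect the results by block index."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return dict(pool.map(process_block, enumerate(blocks)))


results = process_in_parallel([[1, 2, 3], [10, 20], [5]])
print(results)  # {0: 2.0, 1: 15.0, 2: 5.0}
```

Because no block depends on another, the work can be spread over any number of workers without changing the results, which is the property the architecture relies on.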

DATA ANALYSIS DECISION UNIT (DADU):

DADU contains three major portions: the aggregation and compilation server, the results storage server(s), and the decision-making server. When the results are ready for compilation, the processing servers in DPU send the partial results to the aggregation and compilation server, since these partial results are not yet in an organized and compiled form. There is therefore a need to aggregate the related results and organize them into a proper form for further processing and storage. In the proposed architecture, the aggregation and compilation server is supported by various algorithms that compile, organize, store, and transmit the results. Again, the algorithm varies from requirement to requirement and depends on the analysis needs. The aggregation server stores the compiled and organized results in the results storage so that any server can use and process them at any time.

The aggregation server also sends a copy of the results to the decision-making server, which processes them to make decisions. The decision-making server is supported by decision algorithms, which query the results for different things and then make various decisions (e.g., in our analysis, we analyze land, sea, and ice, whereas other findings such as fires, storms, tsunamis, and earthquakes can also be detected). The decision algorithm must be robust and correct enough to efficiently produce results that reveal hidden information and support decisions. The decision part of the architecture is significant, since any small error in decision-making can degrade the efficiency of the whole analysis. DADU finally displays or broadcasts the decisions so that any application can use them in real time for its own purposes. The applications can be any business software, general-purpose community software, or other social networks that need those findings (i.e., decision-making).
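A threshold-based decision step might look like the sketch below. The rule itself is an assumption made for illustration: it supposes that land blocks show a higher mean and a higher SD than sea blocks, and the threshold values are arbitrary. The function name `classify_block` is hypothetical, and real thresholds and rules would have to come from the analysis of actual products.

```python
def classify_block(stats, mean_threshold, sd_threshold):
    """Label one block by comparing its aggregated parameters with thresholds."""
    high_mean = stats["mean"] > mean_threshold
    high_sd = stats["sd"] > sd_threshold
    if high_mean and high_sd:
        return "land"       # assumption: both parameters high indicates land
    if not high_mean and not high_sd:
        return "sea"        # assumption: both parameters low indicates sea
    return "undecided"      # the parameters disagree, so no decision is made


print(classify_block({"mean": 120.0, "sd": 30.0}, mean_threshold=100, sd_threshold=20))
print(classify_block({"mean": 40.0, "sd": 5.0}, mean_threshold=100, sd_threshold=20))
```

Returning an explicit "undecided" label rather than forcing a choice reflects the point made above that a small decision error can degrade the whole analysis.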

REMOTE SENSING APPLICATION RSDU:

Remote sensing promotes the expansion of the earth observatory system as a cost-effective parallel data acquisition system to satisfy specific computational requirements. The Earth and Space Science Society originally approved this solution as the standard for parallel processing in this particular context. However, with improved Big Data acquisition, it was soon recognized that traditional data processing technologies could not provide sufficient power for processing such kind of data. Parallel processing was therefore needed to analyze the massive volume of data efficiently. For that reason, the proposed RSDU is introduced in the remote sensing Big Data architecture; it gathers the data from various satellites around the globe. It is possible that the received raw data are distorted by scattering and absorption by various atmospheric gases and dust particles. We assume that the satellite can correct the erroneous data.

However, the raw data must be converted into image format by the remote sensing satellite. For effective data analysis, the satellite also preprocesses the data in many situations to integrate data from different sources, which not only decreases storage cost but also improves analysis accuracy. The data must be corrected in different ways to remove distortions caused by the motion of the platform relative to the earth, platform attitude, earth curvature, nonuniformity of illumination, variations in sensor characteristics, etc. The data are then transmitted to the Earth Base Station for further processing using a direct communication link. We divide the data processing procedure into two steps: real-time Big Data processing and offline Big Data processing. In the case of offline data processing, the Earth Base Station transmits the data to the data center for storage. These data are then used for future analyses. In real-time data processing, however, the data are transmitted directly to the filtration and load balancer server (FLBS), since storing the incoming real-time data would degrade the performance of real-time processing.

FINDINGS AND DISCUSSION:

Preprocessed and formatted data from satellite contains all or some of the following parts depending on the product.

1) Main product header (MPH): It includes the product’s basic information, i.e., id, measurement and sensing time, orbit information, etc.

2) Specific product header (SPH): It contains information specific to each product or product group, i.e., number of data set descriptors (DSD), directory of remaining data sets in the file, etc.

3) Annotation data sets (ADS): It contains information on quality, time-tagged processing parameters, geolocation tie points, solar angles, etc.

4) Global annotation data sets (GADs): It contains scaling factors, offsets, calibration information, etc.

5) Measurement data set (MDS): It contains measurements or graphical parameters calculated from the measurements, including the quality flag and the time tag of the measurement. The image data are also stored in this part and are the main element of our analysis.

The MPH and SPH data are in ASCII format, whereas all the other data sets are in binary format. MDS, ADS, and GADs each consist of a sequence of records, with one or more fields of data in each record. In our case, the MDS contains a number of records, and each record contains a number of fields. Each record of the MDS corresponds to one row of the satellite image, which is our main focus during analysis.
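Reading fixed-length binary records of this kind can be sketched with the standard `struct` module. The record layout used here is entirely hypothetical and is not the actual product format: each record is assumed to hold a 4-byte time tag, a 1-byte quality flag, and `width` unsigned 16-bit pixel values, all big-endian. The function name `parse_mds` is likewise an assumption.

```python
import struct


def parse_mds(data, width):
    """Split a byte string into records, one record per image row (hypothetical layout)."""
    record = struct.Struct(f">IB{width}H")  # time tag, quality flag, pixel values
    if len(data) % record.size != 0:
        raise ValueError("data length is not a whole number of records")
    rows = []
    for offset in range(0, len(data), record.size):
        time_tag, quality, *pixels = record.unpack_from(data, offset)
        rows.append({"time": time_tag, "quality": quality, "pixels": pixels})
    return rows


# Two synthetic records of three pixels each.
raw = struct.pack(">IB3H", 100, 0, 1, 2, 3) + struct.pack(">IB3H", 101, 1, 4, 5, 6)
for row in parse_mds(raw, width=3):
    print(row)
```

Since each record maps to one image row, a filtration step can select rows or fields at this stage before any block is handed to a processing server.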

ALGORITHM DESIGN AND TESTING:

Our algorithms are designed to process high-speed, large-volume real-time remote sensing image data using the proposed architecture. They work on both DPU and DADU, taking data from the satellite as input to identify land and sea areas in the data set. The set contains four simple algorithms, i.e., algorithm I, algorithm II, algorithm III, and algorithm IV, which run on the filtration and load balancer, the processing servers, the aggregation server, and the decision-making server, respectively.

Algorithm I, i.e., the filtration and load balancer algorithm (FLBA), runs on the filtration and load balancer to keep only the required data and discard all other information. It also provides load balancing by dividing the data into fixed-size blocks and sending them to the processing servers, i.e., one or more distinct blocks to each server. This filtration, dividing, and load-balancing task speeds up performance by ignoring unnecessary data and by enabling parallel processing.

Algorithm II, i.e., the processing and calculation algorithm (PCA), processes the filtered data and is implemented on each processing server. It calculates various parameters that are used in the decision-making process. The results of these calculations are then sent to the aggregation server for further processing.

Algorithm III, i.e., the aggregation and compilation algorithm (ACA), stores, compiles, and organizes the results, which can be used by the decision-making server for land and sea area detection.

Algorithm IV, i.e., the decision-making algorithm (DMA), identifies land area and sea area by comparing the parameter results received from the aggregation server with threshold values.

CHAPTER 5

5.0 SYSTEM STUDY:

5.1 FEASIBILITY STUDY:

The feasibility of the project is analyzed in this phase, and a business proposal is put forth with a very general plan for the project and some cost estimates. During system analysis, the feasibility study of the proposed system is carried out. This is to ensure that the proposed system is not a burden to the company. For feasibility analysis, some understanding of the major requirements for the system is essential.

Three key considerations involved in the feasibility analysis are

ECONOMICAL FEASIBILITY
TECHNICAL FEASIBILITY
SOCIAL FEASIBILITY

5.1.1 ECONOMICAL FEASIBILITY:

This study is carried out to check the economic impact that the system will have on the organization. The amount of funds that the company can put into the research and development of the system is limited, so the expenditures must be justified. The developed system is well within the budget, which was achieved because most of the technologies used are freely available. Only the customized products had to be purchased.

5.1.2 TECHNICAL FEASIBILITY:

This study is carried out to check the technical feasibility, that is, the technical requirements of the system. Any system developed must not place a high demand on the available technical resources, since this would in turn place high demands on the client. The developed system must have modest requirements, as only minimal or no changes are required for implementing this system.

5.1.3 SOCIAL FEASIBILITY:

This aspect of the study checks the level of acceptance of the system by the user. This includes the process of training the user to use the system efficiently. The user must not feel threatened by the system, but must instead accept it as a necessity. The level of acceptance by the users depends solely on the methods that are employed to educate the user about the system and to make him familiar with it. His level of confidence must be raised so that he is also able to offer constructive criticism, which is welcomed, as he is the final user of the system.

5.2 SYSTEM TESTING:

Testing is the process of checking whether the developed system works according to the original objectives and requirements. It is a set of activities that can be planned in advance and conducted systematically. Testing is vital to the success of the system. System testing makes the logical assumption that if all the parts of the system are correct, the overall goal will be successfully achieved. Inadequate testing, or no testing at all, leads to errors that may not appear until many months later. This creates two problems: the time lag between the cause and the appearance of the problem, and the effect of the system errors on the files and records within the system. A small system error can conceivably explode into a much larger problem. Effective testing early in the process translates directly into long-term cost savings from a reduced number of errors. Another reason for system testing is its utility as a user-oriented vehicle before implementation. The best programs are worthless if they do not produce the correct outputs.

5.2.1 UNIT TESTING:

A program represents the logical elements of a system. For a program to run satisfactorily, it must compile and test data correctly and tie in properly with other programs. Achieving an error-free program is the responsibility of the programmer. Program testing checks for two types of errors: syntax and logic. A syntax error is a program statement that violates one or more rules of the language in which it is written. An improperly defined field dimension or an omitted keyword is a common syntax error. These errors are reported through error messages generated by the computer. For logic errors, the programmer must examine the output carefully.

UNIT TESTING:

Description	Expected result
Test for application window properties.	All the properties of the windows are to be properly aligned and displayed.
Test for mouse operations.	All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions.
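
As a minimal sketch of how a unit test exposes a logic error that the compiler cannot catch, the Java class below checks a small helper against hand-computed values. The class name `AverageUnitTest` and the method `average` are illustrative assumptions and are not part of the project code.

```java
// Hypothetical example: the names below are illustrative, not from the project.
final class AverageUnitTest {

    // Unit under test: arithmetic mean of an integer array.
    // A syntax error here would be rejected by the compiler; a logic error
    // (for example, the integer division "sum / values.length") would compile
    // and can only be found by examining the output.
    static double average(int[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("values must not be empty");
        }
        long sum = 0;
        for (int v : values) {
            sum += v;
        }
        return (double) sum / values.length;
    }

    // Compares the actual result with the expected one and reports a mismatch.
    static void check(String name, double expected, double actual) {
        if (Math.abs(expected - actual) > 1e-9) {
            throw new AssertionError(name + ": expected " + expected + " but got " + actual);
        }
    }

    public static void main(String[] args) {
        check("single value", 4.0, average(new int[] {4}));
        check("fractional mean", 2.5, average(new int[] {1, 2, 3, 4}));
        check("negative values", -1.0, average(new int[] {-3, 1}));
        System.out.println("all unit checks passed");
    }
}
```

The "fractional mean" case is the one that would fail if the division were carried out on integers, which is why expected values are worked out by hand before the test is run.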

5.2.2 FUNCTIONAL TESTING:

Functional testing of an application is used to prove that the application delivers correct results, using enough inputs to give an adequate level of confidence that it will work correctly for all sets of inputs. The functional testing will need to prove that the application works for each client type and that the personalization functions work correctly. When a program is tested, the actual output is compared with the expected output. When there is a discrepancy, the sequence of instructions must be traced to determine the problem. The process is facilitated by breaking the program into self-contained portions, each of which can be checked at certain key points. The idea is to compare program values against desk-calculated values to isolate the problems.

FUNCTIONAL TESTING:

Description	Expected result
Test for all modules.	All peers should communicate in the group.
Test for various peers in a distributed network framework, displaying all users available in the group.	Execution should give the accurate result.
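
The idea of checking self-contained portions at key points against desk-calculated values can be sketched as follows. The two-stage program (`filter`, then `total`) and the sample readings are assumptions made up for illustration; they are not taken from the project.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical two-stage program used only to illustrate checkpoint comparison.
final class CheckpointFunctionalTest {

    // Stage 1: keep only readings at or above the threshold.
    static List<Integer> filter(List<Integer> readings, int threshold) {
        List<Integer> kept = new ArrayList<Integer>();
        for (int r : readings) {
            if (r >= threshold) {
                kept.add(r);
            }
        }
        return kept;
    }

    // Stage 2: sum the readings that survived the filter.
    static int total(List<Integer> readings) {
        int sum = 0;
        for (int r : readings) {
            sum += r;
        }
        return sum;
    }

    // Names the checkpoint whose value diverges from the desk-calculated one.
    static void expectEqual(String checkpoint, Object expected, Object actual) {
        if (!expected.equals(actual)) {
            throw new AssertionError(checkpoint + ": expected " + expected + ", got " + actual);
        }
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(3, 10, 7, 12);

        // Desk calculation: readings >= 10 are 10 and 12, and 10 + 12 = 22.
        List<Integer> kept = filter(input, 10);
        expectEqual("after filter", Arrays.asList(10, 12), kept);
        expectEqual("after total", 22, total(kept));
        System.out.println("all checkpoints match");
    }
}
```

Because each stage is checked separately, a discrepancy is reported at the first checkpoint that diverges, which narrows down where the sequence of instructions has to be traced.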

5.2.3 NON-FUNCTIONAL TESTING:

Non-functional software testing encompasses a rich spectrum of testing strategies, describing the expected results for every test case. It uses symbolic analysis techniques. This testing is used to check that an application will work in the operational environment. Non-functional testing includes:

Load testing
Performance testing
Usability testing
Reliability testing
Security testing

5.2.4 LOAD TESTING:

An important tool for implementing system tests is a load generator. A load generator is essential for testing quality requirements such as performance and stress. A load can be a real load, that is, the system can be tested under real usage by having actual users connected to it. They will generate the test input data for the system test.

LOAD TESTING:

Description	Expected result
It is necessary to ascertain that the application behaves correctly under load when a ‘Server busy’ response is received.	Should designate another active node as a Server.
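
A synthetic load generator for the test case above might look like the following sketch. It assumes simulated nodes with a fixed capacity; the node capacities, the thread count, and all class and method names are hypothetical and only stand in for the real servers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical load generator; node capacities are assumptions.
final class LoadGeneratorSketch {

    // A simulated node that accepts a fixed number of requests, then reports busy.
    static final class Node {
        private final AtomicInteger remaining;

        Node(int capacity) {
            this.remaining = new AtomicInteger(capacity);
        }

        // Returns true if the request was accepted, false for "Server busy".
        boolean tryHandle() {
            return remaining.getAndDecrement() > 0;
        }
    }

    // Sends one request, falling back to the next active node on "Server busy".
    // Returns the index of the serving node, or nodes.size() if all were busy.
    static int send(List<Node> nodes) {
        for (int i = 0; i < nodes.size(); i++) {
            if (nodes.get(i).tryHandle()) {
                return i;
            }
        }
        return nodes.size();
    }

    // Fires the requests from a thread pool and counts how many each node served.
    static int[] run(final List<Node> nodes, int requests, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        int[] served = new int[nodes.size() + 1]; // last slot counts rejections
        try {
            List<Future<Integer>> results = new ArrayList<Future<Integer>>();
            for (int i = 0; i < requests; i++) {
                results.add(pool.submit(new Callable<Integer>() {
                    public Integer call() {
                        return send(nodes);
                    }
                }));
            }
            for (Future<Integer> f : results) {
                served[f.get()]++;
            }
        } catch (Exception e) {
            throw new IllegalStateException("load run failed", e);
        } finally {
            pool.shutdown();
        }
        return served;
    }

    public static void main(String[] args) {
        List<Node> nodes = new ArrayList<Node>();
        nodes.add(new Node(60));  // primary
        nodes.add(new Node(100)); // backup
        int[] served = run(nodes, 100, 8);
        System.out.println("primary=" + served[0] + " backup=" + served[1]
                + " rejected=" + served[2]);
    }
}
```

With these assumed capacities, the primary node accepts exactly 60 of the 100 requests and the remaining 40 are taken over by the backup node, which is the behavior the expected result in the table asks for.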

5.2.5 PERFORMANCE TESTING:

Performance tests are used to determine the broadly defined performance of the software system, such as the execution time associated with various parts of the code, response time, and device utilization. The intent of this testing is to identify weak points of the software system and quantify its shortcomings.

PERFORMANCE TESTING:

Description	Expected result
This is required to assure that an application performs adequately, having the capability to handle many peers, delivering its results in the expected time and using an acceptable level of resources. It is an aspect of operational management.	Should handle large input values and produce accurate results in the expected time.
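
Execution time for one part of the code can be measured with a small harness such as the sketch below. The workload (`sumTo`), the number of runs, and the two-second budget are assumptions chosen for illustration; a real performance test would time the project's own operations.

```java
// Hypothetical timing harness; the workload and time budget are assumptions.
final class TimingSketch {

    // Workload under test: sums the integers 1..n.
    static long sumTo(int n) {
        long sum = 0;
        for (int i = 1; i <= n; i++) {
            sum += i;
        }
        return sum;
    }

    // Runs the workload several times and returns the best elapsed time in
    // nanoseconds, checking on every run that the result is still accurate.
    static long bestNanos(int n, int runs) {
        long best = Long.MAX_VALUE;
        for (int r = 0; r < runs; r++) {
            long start = System.nanoTime();
            long result = sumTo(n);
            long elapsed = System.nanoTime() - start;
            if (result != (long) n * (n + 1) / 2) {
                throw new AssertionError("wrong result for n=" + n);
            }
            best = Math.min(best, elapsed);
        }
        return best;
    }

    public static void main(String[] args) {
        int n = 5000000;                // large input value
        long budgetNanos = 2000000000L; // assumed budget: 2 seconds
        long nanos = bestNanos(n, 5);
        System.out.println("best run: " + (nanos / 1000) + " microseconds");
        System.out.println(nanos <= budgetNanos ? "within budget" : "over budget");
    }
}
```

Taking the best of several runs reduces the influence of warm-up and scheduling noise; the accuracy check inside the loop covers the "accurate result" half of the expected result.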

5.2.6 RELIABILITY TESTING:

Software reliability is the ability of a system or component to perform its required functions under stated conditions for a specified period of time, and it is ensured by this testing. Reliability can be expressed as the ability of the software to reveal defects under testing conditions, according to the specified requirements. It is the probability that a software system will operate without failure under given conditions for a given time interval, and it focuses on the behavior of the software element. It forms a part of software quality control.

RELIABILITY TESTING:

Description	Expected result
This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in providing the application.	In case of failure of the server, an alternate server should take over the job.

5.2.7 SECURITY TESTING:

Security testing evaluates system characteristics that relate to the availability, integrity, and confidentiality of the system data and services. Users and clients should be encouraged to make sure their security needs are clearly known at requirements time, so that the security issues can be addressed by the designers and testers.

SECURITY TESTING:

Description	Expected result
Checking that the user identification is authenticated.	In case of failure, the user should not be connected to the framework.
Check whether group keys in a tree are shared by all peers.	The peers should know group key in the same group.

5.2.8 WHITE BOX TESTING:

White box testing, sometimes called glass-box testing, is a test case design method that uses the control structure of the procedural design to derive test cases. It focuses on the inner structure of the software to be tested.

WHITE BOX TESTING:

Description	Expected result
Exercise all logical decisions on their true and false sides.	All the logical decisions must be valid.
Execute all loops at their boundaries and within their operational bounds.	All the loops must be finite.
Exercise internal data structures to ensure their validity.	All the data structures must be valid.
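
The first two rows of the table can be illustrated with a unit that contains one decision and one loop. The method `countAbove` and its inputs are hypothetical; the point of the sketch is only that each test case is chosen from the control structure rather than from the specification.

```java
// Hypothetical unit; chosen because it has exactly one decision and one loop.
final class WhiteBoxSketch {

    // Counts the values strictly greater than the limit.
    static int countAbove(int[] values, int limit) {
        int count = 0;
        for (int i = 0; i < values.length; i++) { // loop: zero, one, or many passes
            if (values[i] > limit) {               // decision: true and false sides
                count++;
            }
        }
        return count;
    }

    // Reports which path through the code produced an unexpected result.
    static void expect(int expected, int actual, String path) {
        if (expected != actual) {
            throw new AssertionError(path + ": expected " + expected + ", got " + actual);
        }
    }

    public static void main(String[] args) {
        expect(0, countAbove(new int[] {}, 5), "loop skipped entirely");
        expect(1, countAbove(new int[] {6}, 5), "one pass, decision true");
        expect(0, countAbove(new int[] {5}, 5), "one pass, decision false at the boundary");
        expect(2, countAbove(new int[] {1, 9, 5, 7}, 5), "many passes, both sides taken");
        System.out.println("all paths exercised");
    }
}
```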

5.2.9 BLACK BOX TESTING:

Black box testing, also called behavioral testing, focuses on the functional requirements of the software. That is, black box testing enables the software engineer to derive sets of input conditions that will fully exercise all functional requirements for a program. Black box testing is not an alternative to white box techniques. Rather, it is a complementary approach that is likely to uncover a different class of errors than white box methods. Black box testing attempts to find errors by focusing on the inputs, outputs, and principal function of a software module. The starting point of black box testing is either a specification or code. The contents of the box are hidden, and the stimulated software should produce the desired results.

BLACK BOX TESTING:

Description	Expected result
To check for incorrect or missing functions.	All the functions must be valid.
To check for interface errors.	The entire interface must function normally.
To check for errors in data structures or external database access.	Database update and retrieval must complete successfully.
To check for initialization and termination errors.	All the functions and data structures must be initialized properly and terminated normally.

All the above system testing strategies are carried out because the development, documentation, and institutionalization of the proposed goals and related policies are essential.

CHAPTER 7

7.0 SOFTWARE SPECIFICATION:

7.1 FEATURES OF .NET:

Microsoft .NET is a set of Microsoft software technologies for rapidly building and integrating XML Web services, Microsoft Windows-based applications, and Web solutions. The .NET Framework is a language-neutral platform for writing programs that can easily and securely interoperate. There is no language barrier with .NET: numerous languages are available to the developer, including Managed C++, C#, Visual Basic, and JScript.

The .NET framework provides the foundation for components to interact seamlessly, whether locally or remotely on different platforms. It standardizes common data types and communications protocols so that components created in different languages can easily interoperate.

“.NET” is also the collective name given to various software components built upon the .NET platform. These will be both products (Visual Studio.NET and Windows.NET Server, for instance) and services (like Passport, .NET My Services, and so on).

7.2 THE .NET FRAMEWORK

The .NET Framework has two main parts:

1. The Common Language Runtime (CLR).

2. A hierarchical set of class libraries.

The CLR is described as the “execution engine” of .NET. It provides the environment within which programs run. The most important features are:

Conversion from a low-level assembler-style language, called Intermediate Language (IL), into code native to the platform being executed on.
Memory management, notably including garbage collection.
Checking and enforcing security restrictions on the running code.
Loading and executing programs, with version control and other such features.

The following features of the .NET framework are also worth describing:

Managed Code

Managed code is code that targets .NET and contains certain extra information, called “metadata”, to describe itself. Whilst both managed and unmanaged code can run in the runtime, only managed code contains the information that allows the CLR to guarantee, for instance, safe execution and interoperability.

Managed Data

With Managed Code comes Managed Data. The CLR provides memory allocation and deallocation facilities, and garbage collection. Some .NET languages use Managed Data by default, such as C#, Visual Basic.NET and JScript.NET, whereas others, namely C++, do not. Targeting the CLR can, depending on the language you’re using, impose certain constraints on the features available. As with managed and unmanaged code, one can have both managed and unmanaged data in .NET applications: unmanaged data is not garbage collected but is instead looked after by unmanaged code.

Common Type System

The CLR uses the Common Type System (CTS) to strictly enforce type safety. This ensures that all classes are compatible with each other, by describing types in a common way. The CTS defines how types work within the runtime, which enables types in one language to interoperate with types in another language, including cross-language exception handling. As well as ensuring that types are only used in appropriate ways, the runtime also ensures that code doesn’t attempt to access memory that hasn’t been allocated to it.

Common Language Specification

The CLR provides built-in support for language interoperability. To ensure that you can develop managed code that can be fully used by developers using any programming language, a set of language features and rules for using them called the Common Language Specification (CLS) has been defined. Components that follow these rules and expose only CLS features are considered CLS-compliant.

7.3 THE CLASS LIBRARY

.NET provides a single-rooted hierarchy of classes, containing over 7000 types. The root of the namespace is called System; this contains basic types like Byte, Double, Boolean, and String, as well as Object. All objects derive from System.Object. As well as objects, there are value types. Value types can be allocated on the stack, which can provide useful flexibility. There are also efficient means of converting value types to object types if and when necessary.

The set of classes is pretty comprehensive, providing collections, file, screen, and network I/O, threading, and so on, as well as XML and database connectivity.

The class library is subdivided into a number of sets (or namespaces), each providing distinct areas of functionality, with dependencies between the namespaces kept to a minimum.

7.4 LANGUAGES SUPPORTED BY .NET

The multi-language capability of the .NET Framework and Visual Studio .NET enables developers to use their existing programming skills to build all types of applications and XML Web services. The .NET framework supports new versions of Microsoft’s old favorites Visual Basic and C++ (as VB.NET and Managed C++), but there are also a number of new additions to the family.

Visual Basic .NET has been updated to include many new and improved language features that make it a powerful object-oriented programming language. These features include inheritance, interfaces, and overloading, among others. Visual Basic now also supports structured exception handling, custom attributes, and multithreading.

Visual Basic .NET is also CLS compliant, which means that any CLS-compliant language can use the classes, objects, and components you create in Visual Basic .NET.

Managed Extensions for C++ and attributed programming are just some of the enhancements made to the C++ language. Managed Extensions simplify the task of migrating existing C++ applications to the new .NET Framework.

C# is Microsoft’s new language. It’s a C-style language that is essentially “C++ for Rapid Application Development”. Unlike other languages, its specification is just the grammar of the language. It has no standard library of its own, and instead has been designed with the intention of using the .NET libraries as its own.

Microsoft Visual J# .NET provides the easiest transition for Java-language developers into the world of XML Web Services and dramatically improves the interoperability of Java-language programs with existing software written in a variety of other programming languages.

ActiveState has created Visual Perl and Visual Python, which enable .NET-aware applications to be built in either Perl or Python. Both products can be integrated into the Visual Studio .NET environment. Visual Perl includes support for ActiveState’s Perl Dev Kit.

Other languages for which .NET compilers are available include

FORTRAN
COBOL
Eiffel

Fig. 1: .NET Framework layers, from top to bottom: ASP.NET, XML Web Services, and Windows Forms; Base Class Libraries; Common Language Runtime; Operating System.

C#.NET is also compliant with the CLS (Common Language Specification) and supports structured exception handling. The CLS is a set of rules and constructs that are supported by the CLR (Common Language Runtime). The CLR is the runtime environment provided by the .NET Framework; it manages the execution of the code and also makes the development process easier by providing services.

C#.NET is a CLS-compliant language. Any objects, classes, or components that are created in C#.NET can be used in any other CLS-compliant language. In addition, we can use objects, classes, and components created in other CLS-compliant languages in C#.NET. The use of the CLS ensures complete interoperability among applications, regardless of the languages used to create them.

CONSTRUCTORS AND DESTRUCTORS:

Constructors are used to initialize objects, whereas destructors are used to destroy them. In other words, destructors are used to release the resources allocated to the object. In C#.NET, a destructor is declared with the ~ClassName syntax, which the compiler translates into an override of the Finalize method. The finalizer completes the tasks that must be performed when an object is destroyed. It is called automatically by the garbage collector when the object is destroyed, and it cannot be called explicitly from application code.

GARBAGE COLLECTION

Garbage Collection is another new feature in C#.NET. The .NET Framework monitors allocated resources, such as objects and variables. In addition, the .NET Framework automatically releases memory for reuse by destroying objects that are no longer in use.

In C#.NET, the garbage collector checks for the objects that are not currently in use by applications. When the garbage collector comes across an object that is marked for garbage collection, it releases the memory occupied by the object.

OVERLOADING

Overloading is another feature in C#. Overloading enables us to define multiple methods with the same name, where each method has a different set of parameters. Besides using overloading for methods, we can use it for constructors and indexers in a class.
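
Overloading works in essentially the same way in Java, the other platform listed in this report's software requirements, so the idea can be sketched there. The class below is a hypothetical illustration, not project code; the equivalent C# declarations would differ mainly in naming conventions.

```java
// Java analogue of method and constructor overloading; names are illustrative.
final class OverloadingSketch {

    final double width;
    final double height;

    // Overloaded constructors: a square needs one side, a rectangle needs two.
    OverloadingSketch(double side) {
        this(side, side);
    }

    OverloadingSketch(double width, double height) {
        this.width = width;
        this.height = height;
    }

    double area() {
        return width * height;
    }

    // Overloaded methods: same name, different parameter types.
    // The compiler selects the version that matches the argument.
    static String describe(int value) {
        return "int " + value;
    }

    static String describe(double value) {
        return "double " + value;
    }

    static String describe(String value) {
        return "text " + value;
    }

    public static void main(String[] args) {
        System.out.println(new OverloadingSketch(3).area());    // square
        System.out.println(new OverloadingSketch(2, 5).area()); // rectangle
        System.out.println(describe(7));
        System.out.println(describe(2.5));
        System.out.println(describe("seven"));
    }
}
```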

MULTITHREADING:

C#.NET also supports multithreading. An application that supports multithreading can handle multiple tasks simultaneously. We can use multithreading to decrease the time taken by an application to respond to user interaction.

STRUCTURED EXCEPTION HANDLING

C#.NET supports structured exception handling, which enables us to detect and handle errors at runtime. In C#.NET, we use try…catch…finally statements to create exception handlers. Using try…catch…finally statements, we can create robust and effective exception handlers that improve the reliability of our application.
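
Java, the other platform listed in this report's software requirements, has the same try, catch, and finally construct, so the control flow can be sketched there. The method `parseReading` and the `cleanups` counter are hypothetical names used only for this illustration.

```java
// Java analogue of a structured exception handler; names are illustrative.
final class ExceptionHandlingSketch {

    // Counts how often the finally block has run.
    static int cleanups = 0;

    // Parses a reading, falling back to a default value on malformed input.
    static int parseReading(String text, int fallback) {
        try {
            return Integer.parseInt(text.trim());
        } catch (NumberFormatException e) {
            // Handle the error here instead of letting it end the program.
            return fallback;
        } finally {
            // Runs on both the normal path and the error path.
            cleanups++;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseReading(" 42 ", -1)); // well-formed input
        System.out.println(parseReading("4x", -1));   // malformed input
        System.out.println("finally ran " + cleanups + " times");
    }
}
```

The finally block is the place for work that must happen whether or not an exception was thrown, such as closing a file or a database connection.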

7.5 THE .NET FRAMEWORK

The .NET Framework is a new computing platform that simplifies application development in the highly distributed environment of the Internet.

OBJECTIVES OF .NET FRAMEWORK

1. To provide a consistent object-oriented programming environment, whether object code is stored and executed locally, executed locally but Internet-distributed, or executed remotely.

2. To provide a code-execution environment that minimizes software deployment conflicts and guarantees safe execution of code.

3. To provide a code-execution environment that eliminates the performance problems of scripted or interpreted environments.

There are different types of application, such as Windows-based applications and Web-based applications.

7.6 FEATURES OF SQL-SERVER

The OLAP Services feature available in SQL Server version 7.0 is now called SQL Server 2000 Analysis Services. The term OLAP Services has been replaced with the term Analysis Services. Analysis Services also includes a new data mining component. The Repository component available in SQL Server version 7.0 is now called Microsoft SQL Server 2000 Meta Data Services. References to the component now use the term Meta Data Services. The term repository is used only in reference to the repository engine within Meta Data Services.

A SQL Server database consists of the following types of objects:

1. TABLE

2. QUERY

3. FORM

4. REPORT

5. MACRO

7.7 TABLE:

A table is a collection of data about a specific topic.

VIEWS OF TABLE:

We can work with a table in two views:

1. Design View

2. Datasheet View

Design View

To build or modify the structure of a table, we work in the table design view. Here we can specify what kind of data will be held.

Datasheet View

To add, edit, or analyze the data itself, we work in the table’s datasheet view.

QUERY:

A query is a question asked of the data. Access gathers the data that answers the question from one or more tables. The data that make up the answer are either a dynaset (which can be edited) or a snapshot (which cannot be edited). Each time we run a query, we get the latest information in the dynaset. Access either displays the dynaset or snapshot for us to view, or performs an action on it, such as deleting or updating.

CHAPTER 7

7.0 APPENDIX

7.1 SAMPLE SCREEN SHOTS:

7.2 SAMPLE SOURCE CODE:

CHAPTER 8

8.1 CONCLUSION AND FUTURE:

In this paper, we proposed an architecture for real-time Big Data analysis for remote sensing applications. The architecture efficiently processes and analyzes real-time and offline remote sensing Big Data for decision-making. The proposed architecture is composed of three major units: 1) RSDU; 2) DPU; and 3) DADU. These units implement algorithms for each level of the architecture depending on the required analysis. The proposed real-time Big Data architecture is generic (application independent) and can be used for any type of remote sensing Big Data analysis. Furthermore, filtering, dividing, and parallel processing are applied only to useful information, and all other data are discarded. These capabilities make the architecture a better choice for real-time remote sensing Big Data analysis.

The algorithms proposed in this paper for each unit and subunit are used to analyze remote sensing data sets, which helps in better understanding land and sea areas. The proposed architecture welcomes researchers and organizations for any type of remote sensing Big Data analysis by developing algorithms for each level of the architecture depending on their analysis requirements. For future work, we are planning to extend the proposed architecture to make it compatible with Big Data analysis for all applications, e.g., sensors and social networking. We are also planning to use the proposed architecture to perform complex analysis on earth observatory data for decision-making in real time, such as earthquake prediction, tsunami prediction, and fire detection.

Rank-Based Similarity Search Reducing the Dimensional Dependence

05/08/201902/07/2019 by admin

This paper introduces a data structure for k-NN search, the Rank Cover Tree (RCT), whose pruning tests rely solely on the comparison of similarity values; other properties of the underlying space, such as the triangle inequality, are not employed. Objects are selected according to their ranks with respect to the query object, allowing much tighter control on the overall execution costs. A formal theoretical analysis shows that with very high probability, the RCT returns a correct query result in time that depends very competitively on a measure of the intrinsic dimensionality of the data set. The experimental results for the RCT show that non-metric pruning strategies for similarity search can be practical even when the representational dimension of the data is extremely high. They also show that the RCT is capable of meeting or exceeding the level of performance of state-of-the-art methods that make use of metric pruning or other selection tests involving numerical constraints on distance values.

1.2 INTRODUCTION

Of the fundamental operations employed in data mining tasks such as classification, cluster analysis, and anomaly detection, perhaps the most widely encountered is that of similarity search. Similarity search is the foundation of k-nearest-neighbor (k-NN) classification, which often produces competitively low error rates in practice, particularly when the number of classes is large. The error rate of nearest-neighbor classification has been shown to be ‘asymptotically optimal’ as the training set size increases. For clustering, many of the most effective and popular strategies require the determination of neighbor sets based at a substantial proportion of the data set objects; examples include hierarchical (agglomerative) methods. Content-based filtering methods for recommender systems and anomaly detection methods also commonly make use of k-NN techniques, either through the direct use of k-NN search or by means of k-NN cluster analysis.
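
To make the role of similarity search in k-NN classification concrete, the following is a minimal brute-force sketch: it ranks all training points by distance to the query and takes a majority vote among the k nearest. It is not the Rank Cover Tree; the Euclidean distance, the toy data, and all names are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

// Brute-force k-NN classifier; a baseline sketch, not the RCT index.
final class KnnSketch {

    // Squared Euclidean distance (assumed similarity measure).
    static double sqDist(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Ranks every point by distance to the query, then takes a majority vote
    // among the k nearest. Ties go to the label that reaches the top count first.
    static String classify(final double[][] points, String[] labels,
                           final double[] query, int k) {
        Integer[] order = new Integer[points.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, new Comparator<Integer>() {
            public int compare(Integer a, Integer b) {
                return Double.compare(sqDist(points[a], query), sqDist(points[b], query));
            }
        });
        Map<String, Integer> votes = new HashMap<String, Integer>();
        String best = null;
        int bestVotes = 0;
        for (int i = 0; i < Math.min(k, order.length); i++) {
            String label = labels[order[i]];
            Integer seen = votes.get(label);
            int count = (seen == null) ? 1 : seen + 1;
            votes.put(label, count);
            if (count > bestVotes) {
                bestVotes = count;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 1}, {1, 0}, {5, 5}, {5, 6}, {6, 5}};
        String[] labels = {"A", "A", "A", "B", "B", "B"};
        System.out.println(classify(points, labels, new double[] {0.5, 0.5}, 3));
        System.out.println(classify(points, labels, new double[] {5.2, 5.1}, 3));
    }
}
```

Every query in this sketch examines the whole data set, which is the cost that similarity search indices aim to avoid.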

A very popular density-based measure, the Local Outlier Factor (LOF), relies heavily on k-NN set computation to determine the relative density of the data in the vicinity of the test point [8]. For data mining applications based on similarity search, data objects are typically modeled as feature vectors of attributes for which some measure of similarity is defined. Motivated at least in part by the impact of similarity search on problems in data mining, machine learning, pattern recognition, and statistics, the design and analysis of scalable and effective similarity search structures has been the subject of intensive research for many decades. Until relatively recently, most data structures for similarity search targeted low-dimensional real vector space representations and the Euclidean or other Lp distance metrics.

However, many public and commercial data sets available today are more naturally represented as vectors spanning many hundreds or thousands of feature attributes that can be real or integer-valued, ordinal or categorical, or even a mixture of these types. This has spurred the development of search structures for more general metric spaces, such as the MultiVantage-Point Tree, the Geometric Near-neighbor Access Tree (GNAT), Spatial Approximation Tree (SAT), the M-tree, and (more recently) the Cover Tree (CT). Despite their various advantages, spatial and metric search structures are both limited by an effect often referred to as the curse of dimensionality.

One way in which the curse may manifest itself is in a tendency of distances to concentrate strongly around their mean values as the dimension increases. Consequently, most pairwise distances become difficult to distinguish, and the triangle inequality can no longer be effectively used to eliminate candidates from consideration along search paths. Evidence suggests that when the representational dimension of feature vectors is high (roughly 20 or more), traditional similarity search accesses an unacceptably high proportion of the data elements, unless the underlying data distribution has special properties. Even though the local neighborhood information employed by data mining applications is useful and meaningful, high data dimensionality tends to make this local information very expensive to obtain.
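
The concentration effect can be observed with a small experiment. The sketch below assumes points drawn uniformly from the unit cube and Euclidean distance, and reports the relative contrast (max - min) / min of the distances from one random query to the other points; the sample size, seed, and names are arbitrary choices made for illustration.

```java
import java.util.Random;

// Illustrative experiment on distance concentration; all parameters are assumptions.
final class DistanceConcentrationSketch {

    // Draws a point uniformly at random from the unit cube.
    static double[] randomPoint(Random rng, int dim) {
        double[] p = new double[dim];
        for (int i = 0; i < dim; i++) {
            p[i] = rng.nextDouble();
        }
        return p;
    }

    // Euclidean distance between two points of equal dimension.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Relative contrast (max - min) / min of the distances from a random
    // query to n random points in the given dimension.
    static double relativeContrast(int dim, int n, long seed) {
        Random rng = new Random(seed);
        double[] query = randomPoint(rng, dim);
        double min = Double.MAX_VALUE;
        double max = 0.0;
        for (int i = 0; i < n; i++) {
            double d = distance(query, randomPoint(rng, dim));
            min = Math.min(min, d);
            max = Math.max(max, d);
        }
        return (max - min) / min;
    }

    public static void main(String[] args) {
        int[] dims = {2, 20, 200, 2000};
        for (int dim : dims) {
            System.out.println("dim=" + dim + " relative contrast="
                    + relativeContrast(dim, 200, 42L));
        }
    }
}
```

As the dimension grows, the nearest and farthest points end up at nearly the same distance from the query, so the contrast shrinks and distance-based pruning has little to work with.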

The performance of similarity search indices depends crucially on the way in which they use similarity information for the identification and selection of objects relevant to the query. Virtually all existing indices make use of numerical constraints for pruning and selection. Such constraints include the triangle inequality (a linear constraint on three distance values), other bounding surfaces defined in terms of distance (such as hypercubes or hyperspheres), range queries involving approximation factors as in Locality-Sensitive Hashing (LSH) or absolute quantities as additive distance terms. One serious drawback of such operations based on numerical constraints such as the triangle inequality or distance ranges is that the number of objects actually examined can be highly variable, so much so that the overall execution time cannot be easily predicted.

For similarity search, researchers and practitioners have investigated practical methods for speeding up the computation of neighborhood information at the expense of accuracy. For data mining applications, the approaches considered have included feature sampling for local outlier detection, data sampling for clustering, and approximate similarity search for k-NN classification. Examples of fast approximate similarity search indices include the BD-Tree, a widely recognized benchmark for approximate k-NN search; it makes use of splitting rules and early termination to improve upon the performance of the basic KD-Tree. One of the most popular methods for indexing, Locality-Sensitive Hashing, can also achieve good practical search performance for range queries by managing parameters that influence a tradeoff between accuracy and time.

1.3 LITERATURE SURVEY

THE RELEVANT SET CORRELATION MODEL FOR DATA CLUSTERING

AUTHOR:

PUBLISH:

EXPLANATION:

AUTHOR:

PUBLISH:

EXPLANATION:

AUTHOR:

PUBLISH:

EXPLANATION:

CHAPTER 2

2.0 SYSTEM ANALYSIS:

2.1 EXISTING SYSTEM:

2.1.1 DISADVANTAGES:

2.2 PROPOSED SYSTEM:

2.2.1 ADVANTAGES:

2.3 HARDWARE & SOFTWARE REQUIREMENTS:

2.3.1 HARDWARE REQUIREMENT:

- Processor – Pentium IV
- Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Key Board – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA

2.3.2 SOFTWARE REQUIREMENTS:

JAVA

Operating System : Windows XP or Win7
Front End : JAVA JDK 1.7
Back End : MYSQL Server
Server : Apache Tomcat Server
Script : JSP Script
Document : MS-Office 2007

.NET

Operating System : Windows XP or Win7
Front End : Microsoft Visual Studio .NET 2008
Script : C# Script
Back End : MS-SQL Server 2005
Document : MS-Office 2007

CHAPTER 3

3.0 SYSTEM DESIGN:

Data Flow Diagram / Use Case Diagram / Flow Diagram:

The DFD is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on these data, and the output data generated by the system.

The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components. These components are the system process, the data used by the process, an external entity that interacts with the system and the information flows in the system.

DFD shows how the information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.

A DFD may be used to represent a system at any level of abstraction. It may be partitioned into levels that represent increasing information flow and functional detail.

NOTATION:

SOURCE OR DESTINATION OF DATA:

External sources or destinations, which may be people, organizations, or other entities.

DATA STORE:

Here the data referenced by a process is stored and retrieved.

PROCESS:

People, procedures, or devices that produce data. The physical component is not identified.

DATA FLOW:

Data moves in a specific direction from an origin to a destination. The data flow is a “packet” of data.

MODELING RULES:

There are several common modeling rules when creating DFDs:

All processes must have at least one data flow in and one data flow out.
All processes should modify the incoming data, producing new forms of outgoing data.
Each data store must be involved with at least one data flow.
Each external entity must be involved with at least one data flow.
A data flow must be attached to at least one process.
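
The structural rules above can be checked mechanically. The sketch below assumes a DFD represented as sets of named processes, data stores, and external entities plus a list of directed flows; this representation and all names are hypothetical. The second rule (processes should modify their data) concerns meaning rather than structure and is not checked.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical checker for the structural DFD modeling rules.
final class DfdRuleChecker {

    // A directed data flow between two named elements.
    static final class Flow {
        final String from;
        final String to;

        Flow(String from, String to) {
            this.from = from;
            this.to = to;
        }
    }

    // Returns one message per violated rule; an empty list means the diagram passes.
    static List<String> validate(Set<String> processes, Set<String> stores,
                                 Set<String> entities, List<Flow> flows) {
        List<String> problems = new ArrayList<String>();
        Set<String> hasIn = new HashSet<String>();
        Set<String> hasOut = new HashSet<String>();
        for (Flow f : flows) {
            hasOut.add(f.from);
            hasIn.add(f.to);
            // A data flow must be attached to at least one process.
            if (!processes.contains(f.from) && !processes.contains(f.to)) {
                problems.add("flow " + f.from + " -> " + f.to + " touches no process");
            }
        }
        // Every process needs at least one flow in and one flow out.
        for (String p : processes) {
            if (!hasIn.contains(p)) {
                problems.add("process " + p + " has no incoming flow");
            }
            if (!hasOut.contains(p)) {
                problems.add("process " + p + " has no outgoing flow");
            }
        }
        // Every data store and external entity must take part in some flow.
        for (String s : stores) {
            if (!hasIn.contains(s) && !hasOut.contains(s)) {
                problems.add("data store " + s + " has no flow");
            }
        }
        for (String e : entities) {
            if (!hasIn.contains(e) && !hasOut.contains(e)) {
                problems.add("external entity " + e + " has no flow");
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        Set<String> processes = new HashSet<String>(Arrays.asList("Filter"));
        Set<String> stores = new HashSet<String>(Arrays.asList("Archive"));
        Set<String> entities = new HashSet<String>(Arrays.asList("Sensor"));
        List<Flow> flows = Arrays.asList(new Flow("Sensor", "Filter"),
                                         new Flow("Filter", "Archive"));
        System.out.println(validate(processes, stores, entities, flows));
    }
}
```

For the small diagram in `main`, every rule is satisfied and the returned list is empty; connecting the entity directly to the data store instead would produce three messages.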

3.1 ARCHITECTURE DIAGRAM

3.2 DATAFLOW DIAGRAM

UML DIAGRAMS:

3.2 USE CASE DIAGRAM:

3.3 CLASS DIAGRAM:

3.4 SEQUENCE DIAGRAM:

3.5 ACTIVITY DIAGRAM:

CHAPTER 4

4.0 IMPLEMENTATION:

4.1 ALGORITHM

4.2 MODULES:

4.3 MODULE DESCRIPTION:

CHAPTER 5

5.0 SYSTEM STUDY:

5.1 FEASIBILITY STUDY:

Three key considerations involved in the feasibility analysis are

ECONOMICAL FEASIBILITY
TECHNICAL FEASIBILITY
SOCIAL FEASIBILITY

5.1.1 ECONOMICAL FEASIBILITY:

5.1.2 TECHNICAL FEASIBILITY

5.1.3 SOCIAL FEASIBILITY:

5.2 SYSTEM TESTING:

Inadequate testing creates two problems: the time lag between the cause and the appearance of the problem, and the effect of the system errors on the files and records within the system. A small system error can conceivably explode into a much larger problem. Effective testing early in the process translates directly into long-term cost savings from a reduced number of errors. Another reason for system testing is its utility as a user-oriented vehicle before implementation. The best programs are worthless if they do not produce the correct outputs.

5.2.1 UNIT TESTING:

Description	Expected result
Test for application window properties.	All the properties of the windows are to be properly aligned and displayed.
Test for mouse operations.	All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions.

5.2.2 FUNCTIONAL TESTING:

Description	Expected result
Test for all modules.	All peers should communicate in the group.
Test for various peers in a distributed network framework, displaying all users available in the group.	Execution should give the accurate result.

5.2.3 NON-FUNCTIONAL TESTING:

Load testing
Performance testing
Usability testing
Reliability testing
Security testing

5.2.4 LOAD TESTING:

Description	Expected result
It is necessary to ascertain that the application behaves correctly under load when a ‘Server busy’ response is received.	Should designate another active node as a Server.

5.2.5 PERFORMANCE TESTING:

Description	Expected result
This is required to assure that an application performs adequately, having the capability to handle many peers, delivering its results in the expected time and using an acceptable level of resources. It is an aspect of operational management.	Should handle large input values and produce accurate results in the expected time.

5.2.6 RELIABILITY TESTING:

Description	Expected result
This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in providing the application.	In case of failure of the server, an alternate server should take over the job.

5.2.7 SECURITY TESTING:

Description	Expected result
Check that the user identification is authenticated.	In case of failure, the user should not be connected to the framework.
Check whether group keys in a tree are shared by all peers.	All peers in the same group should know the group key.

5.2.8 WHITE BOX TESTING:

Description	Expected result
Exercise all logical decisions on their true and false sides.	All the logical decisions must be valid.
Execute all loops at their boundaries and within their operational bounds.	All the loops must be finite.
Exercise internal data structures to ensure their validity.	All the data structures must be valid.

5.2.9 BLACK BOX TESTING:

Description	Expected result
To check for incorrect or missing functions.	All the functions must be valid.
To check for interface errors.	The entire interface must function normally.
To check for errors in data structures or external database access.	Database updates and retrieval must complete correctly.
To check for initialization and termination errors.	All the functions and data structures must be initialized properly and terminated normally.

All of the above system testing strategies are carried out, as the development, documentation and institutionalization of the proposed goals and related policies are essential.

CHAPTER 6

6.0 SOFTWARE DESCRIPTION:

6.1 JAVA TECHNOLOGY:

Java technology is both a programming language and a platform.

The Java Programming Language

The Java programming language is a high-level language that can be characterized by all of the following buzzwords:

- Simple
- Architecture neutral
- Object oriented
- Portable
- Distributed
- High performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure

With most programming languages, you either compile or interpret a program so that you can run it on your computer. The Java programming language is unusual in that a program is both compiled and interpreted. With the compiler, first you translate a program into an intermediate language called Java byte codes —the platform-independent codes interpreted by the interpreter on the Java platform. The interpreter parses and runs each Java byte code instruction on the computer. Compilation happens just once; interpretation occurs each time the program is executed. The following figure illustrates how this works.

You can think of Java byte codes as the machine code instructions for the Java Virtual Machine (Java VM). Every Java interpreter, whether it’s a development tool or a Web browser that can run applets, is an implementation of the Java VM. Java byte codes help make “write once, run anywhere” possible. You can compile your program into byte codes on any platform that has a Java compiler. The byte codes can then be run on any implementation of the Java VM. That means that as long as a computer has a Java VM, the same program written in the Java programming language can run on Windows 2000, a Solaris workstation, or on an iMac.
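As a small illustration of the compile-once, run-anywhere idea, the sketch below can be compiled a single time with `javac PlatformInfo.java` and the resulting class file run with `java PlatformInfo` on any machine that has a Java VM. The class name `PlatformInfo` is made up for this example; the two system properties it reads are standard ones.

```java
// Hypothetical demo class; the name PlatformInfo is an assumption for illustration.
class PlatformInfo {
    // Builds a one-line description of the Java VM and host operating system.
    static String describe() {
        return "Running on " + System.getProperty("java.vm.name")
                + " / " + System.getProperty("os.name");
    }

    public static void main(String[] args) {
        // The same byte codes print a different line on each platform.
        System.out.println(describe());
    }
}
```

The output differs from machine to machine, but the compiled byte codes do not need to change.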

6.2 THE JAVA PLATFORM:

A platform is the hardware or software environment in which a program runs. We’ve already mentioned some of the most popular platforms like Windows 2000, Linux, Solaris, and MacOS. Most platforms can be described as a combination of the operating system and hardware. The Java platform differs from most other platforms in that it’s a software-only platform that runs on top of other hardware-based platforms.

The Java platform has two components:

The Java Virtual Machine (Java VM)
The Java Application Programming Interface (Java API)

You’ve already been introduced to the Java VM. It’s the base for the Java platform and is ported onto various hardware-based platforms.

The Java API is a large collection of ready-made software components that provide many useful capabilities, such as graphical user interface (GUI) widgets. The Java API is grouped into libraries of related classes and interfaces; these libraries are known as packages. The next section, What Can Java Technology Do?, highlights what functionality some of the packages in the Java API provide.

The following figure depicts a program that’s running on the Java platform. As the figure shows, the Java API and the virtual machine insulate the program from the hardware.

Native code is code that, once compiled, runs on a specific hardware platform. As a platform-independent environment, the Java platform can be a bit slower than native code. However, smart compilers, well-tuned interpreters, and just-in-time byte code compilers can bring performance close to that of native code without threatening portability.

6.3 WHAT CAN JAVA TECHNOLOGY DO?

The most common types of programs written in the Java programming language are applets and applications. If you’ve surfed the Web, you’re probably already familiar with applets. An applet is a program that adheres to certain conventions that allow it to run within a Java-enabled browser.

However, the Java programming language is not just for writing cute, entertaining applets for the Web. The general-purpose, high-level Java programming language is also a powerful software platform. Using the generous API, you can write many types of programs.

An application is a standalone program that runs directly on the Java platform. A special kind of application known as a server serves and supports clients on a network. Examples of servers are Web servers, proxy servers, mail servers, and print servers. Another specialized program is a servlet.

A servlet can almost be thought of as an applet that runs on the server side. Java Servlets are a popular choice for building interactive web applications, replacing the use of CGI scripts. Servlets are similar to applets in that they are runtime extensions of applications. Instead of working in browsers, though, servlets run within Java Web servers, configuring or tailoring the server.

How does the API support all these kinds of programs? It does so with packages of software components that provide a wide range of functionality. Every full implementation of the Java platform gives you the following features:

The essentials: Objects, strings, threads, numbers, input and output, data structures, system properties, date and time, and so on.
Applets: The set of conventions used by applets.
Networking: URLs, TCP (Transmission Control Protocol), UDP (User Datagram Protocol) sockets, and IP (Internet Protocol) addresses.
Internationalization: Help for writing programs that can be localized for users worldwide. Programs can automatically adapt to specific locales and be displayed in the appropriate language.
Security: Both low level and high level, including electronic signatures, public and private key management, access control, and certificates.
Software components: Known as JavaBeans™, can plug into existing component architectures.
Object serialization: Allows lightweight persistence and communication via Remote Method Invocation (RMI).
Java Database Connectivity (JDBC™): Provides uniform access to a wide range of relational databases.
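To make one of these features concrete, here is a minimal sketch of object serialization using the standard `java.io` classes. The `Reading` class and its fields are hypothetical and exist only for this illustration; the sketch writes an object to a byte array and reads a copy back, which is the same mechanism that lightweight persistence builds on.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical value type used only for this illustration.
class Reading implements Serializable {
    private static final long serialVersionUID = 1L;
    final String sensor;
    final double value;

    Reading(String sensor, double value) {
        this.sensor = sensor;
        this.value = value;
    }
}

class SerializationDemo {
    // Serializes the object into memory, then deserializes a copy from those bytes.
    static Reading roundTrip(Reading original) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(buffer.toByteArray()))) {
                return (Reading) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("serialization round trip failed", e);
        }
    }

    public static void main(String[] args) {
        Reading copy = roundTrip(new Reading("thermal", 21.5));
        System.out.println(copy.sensor + " = " + copy.value);
    }
}
```

Replacing the in-memory buffer with a file stream or a socket stream would give persistence or network transfer, respectively.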

The Java platform also has APIs for 2D and 3D graphics, accessibility, servers, collaboration, telephony, speech, animation, and more. The following figure depicts what is included in the Java 2 SDK.

6.4 HOW WILL JAVA TECHNOLOGY CHANGE MY LIFE?

We can’t promise you fame, fortune, or even a job if you learn the Java programming language. Still, it is likely to make your programs better and requires less effort than other languages. We believe that Java technology will help you do the following:

Get started quickly: Although the Java programming language is a powerful object-oriented language, it’s easy to learn, especially for programmers already familiar with C or C++.
Write less code: Comparisons of program metrics (class counts, method counts, and so on) suggest that a program written in the Java programming language can be four times smaller than the same program in C++.
Write better code: The Java programming language encourages good coding practices, and its garbage collection helps you avoid memory leaks. Its object orientation, its JavaBeans component architecture, and its wide-ranging, easily extendible API let you reuse other people’s tested code and introduce fewer bugs.
Develop programs more quickly: Development may be as much as twice as fast as writing the same program in C++. Why? You write fewer lines of code, and Java is a simpler programming language than C++.
Avoid platform dependencies with 100% Pure Java: You can keep your program portable by avoiding the use of libraries written in other languages. The 100% Pure Java™ Product Certification Program has a repository of historical process manuals, white papers, brochures, and similar materials online.
Write once, run anywhere: Because 100% Pure Java programs are compiled into machine-independent byte codes, they run consistently on any Java platform.
Distribute software more easily: You can upgrade applets easily from a central server. Applets take advantage of the feature of allowing new classes to be loaded “on the fly,” without recompiling the entire program.

6.5 ODBC:

Microsoft Open Database Connectivity (ODBC) is a standard programming interface for application developers and database systems providers. Before ODBC became a de facto standard for Windows programs to interface with database systems, programmers had to use proprietary languages for each database they wanted to connect to. Now, ODBC has made the choice of the database system almost irrelevant from a coding perspective, which is as it should be. Application developers have much more important things to worry about than the syntax that is needed to port their program from one database to another when business needs suddenly change.

Through the ODBC Administrator in Control Panel, you can specify the particular database that is associated with a data source that an ODBC application program is written to use. Think of an ODBC data source as a door with a name on it. Each door will lead you to a particular database. For example, the data source named Sales Figures might be a SQL Server database, whereas the Accounts Payable data source could refer to an Access database. The physical database referred to by a data source can reside anywhere on the LAN.

The ODBC system files are not installed on your system by Windows 95. Rather, they are installed when you set up a separate database application, such as SQL Server Client or Visual Basic 4.0. When the ODBC icon is installed in Control Panel, it uses a file called ODBCINST.DLL. It is also possible to administer your ODBC data sources through a stand-alone program called ODBCADM.EXE. There is a 16-bit and a 32-bit version of this program, and each maintains a separate list of ODBC data sources.

From a programming perspective, the beauty of ODBC is that the application can be written to use the same set of function calls to interface with any data source, regardless of the database vendor. The source code of the application doesn’t change whether it talks to Oracle or SQL Server. We only mention these two as an example. There are ODBC drivers available for several dozen popular database systems. Even Excel spreadsheets and plain text files can be turned into data sources. The operating system uses the Registry information written by ODBC Administrator to determine which low-level ODBC drivers are needed to talk to the data source (such as the interface to Oracle or SQL Server). The loading of the ODBC drivers is transparent to the ODBC application program. In a client/server environment, the ODBC API even handles many of the network issues for the application programmer.

The advantages of this scheme are so numerous that you are probably thinking there must be some catch. The only disadvantage of ODBC is that it isn’t as efficient as talking directly to the native database interface. ODBC has had many detractors make the charge that it is too slow. Microsoft has always claimed that the critical factor in performance is the quality of the driver software that is used. In our humble opinion, this is true. The availability of good ODBC drivers has improved a great deal recently. And anyway, the criticism about performance is somewhat analogous to those who said that compilers would never match the speed of pure assembly language. Maybe not, but the compiler (or ODBC) gives you the opportunity to write cleaner programs, which means you finish sooner. Meanwhile, computers get faster every year.

6.6 JDBC:

In an effort to set an independent database standard API for Java, Sun Microsystems developed Java Database Connectivity, or JDBC. JDBC offers a generic SQL database access mechanism that provides a consistent interface to a variety of RDBMSs. This consistent interface is achieved through the use of “plug-in” database connectivity modules, or drivers. If a database vendor wishes to have JDBC support, the vendor must provide the driver for each platform that the database and Java run on.

To gain a wider acceptance of JDBC, Sun based JDBC’s framework on ODBC. As you discovered earlier in this chapter, ODBC has widespread support on a variety of platforms. Basing JDBC on ODBC will allow vendors to bring JDBC drivers to market much faster than developing a completely new connectivity solution.

JDBC was announced in March of 1996. It was released for a 90 day public review that ended June 8, 1996. Because of user input, the final JDBC v1.0 specification was released soon after.

The remainder of this section will cover enough information about JDBC for you to know what it is about and how to use it effectively. This is by no means a complete overview of JDBC. That would fill an entire book.

6.7 JDBC Goals:

Few software packages are designed without goals in mind. JDBC is one whose many goals drove the development of the API. These goals, in conjunction with early reviewer feedback, have finalized the JDBC class library into a solid framework for building database applications in Java.

The goals that were set for JDBC are important. They will give you some insight as to why certain classes and functionalities behave the way they do. The design goals for JDBC are as follows:

SQL Level API

The designers felt that their main goal was to define a SQL interface for Java. Although not the lowest database interface level possible, it is at a low enough level for higher-level tools and APIs to be created. Conversely, it is at a high enough level for application programmers to use it confidently. Attaining this goal allows for future tool vendors to “generate” JDBC code and to hide many of JDBC’s complexities from the end user.

SQL Conformance

SQL syntax varies as you move from database vendor to database vendor. In an effort to support a wide variety of vendors, JDBC will allow any query statement to be passed through it to the underlying database driver. This allows the connectivity module to handle non-standard functionality in a manner that is suitable for its users.

JDBC must be implementable on top of common database interfaces

The JDBC SQL API must “sit” on top of other common SQL level APIs. This goal allows JDBC to use existing ODBC level drivers by the use of a software interface. This interface would translate JDBC calls to ODBC and vice versa.

Provide a Java interface that is consistent with the rest of the Java system

Because of Java’s acceptance in the user community thus far, the designers feel that they should not stray from the current design of the core Java system.

Keep it simple

This goal probably appears in all software design goal listings. JDBC is no exception. Sun felt that the design of JDBC should be very simple, allowing for only one method of completing a task per mechanism. Allowing duplicate functionality only serves to confuse the users of the API.

Use strong, static typing wherever possible

Strong typing allows more error checking to be done at compile time, so fewer errors appear at runtime.

Keep the common cases simple

Because the usual SQL calls used by the programmer are, more often than not, simple SELECTs, INSERTs, DELETEs and UPDATEs, these queries should be simple to perform with JDBC. However, more complex SQL statements should also be possible.
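The "common cases" goal can be illustrated with a short sketch of a parameterized SELECT using the standard `java.sql` interfaces. The table and column names (`cache_table`, `entry_key`, `entry_value`) and the class name `JdbcSketch` are assumptions made up for this example, and a JDBC driver for your database must be on the class path for the connection to succeed.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

class JdbcSketch {
    // Runs a simple parameterized SELECT and returns the first matching value, or null.
    // The table and column names are hypothetical.
    static String lookup(Connection conn, String key) throws SQLException {
        String sql = "SELECT entry_value FROM cache_table WHERE entry_key = ?";
        try (PreparedStatement stmt = conn.prepareStatement(sql)) {
            stmt.setString(1, key);
            try (ResultSet rows = stmt.executeQuery()) {
                return rows.next() ? rows.getString(1) : null;
            }
        }
    }

    // Reports whether a connection can be opened for the given JDBC URL.
    static boolean canConnect(String url) {
        try (Connection conn = DriverManager.getConnection(url)) {
            return true;
        } catch (SQLException e) {
            return false; // no suitable driver, or the database refused the connection
        }
    }

    public static void main(String[] args) throws SQLException {
        if (args.length < 2) {
            System.out.println("usage: java JdbcSketch <jdbc-url> <key>");
            return;
        }
        try (Connection conn = DriverManager.getConnection(args[0])) {
            System.out.println(lookup(conn, args[1]));
        }
    }
}
```

Only the JDBC URL selects the driver; the query code itself stays the same from one database vendor to another.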

Finally, we decided to proceed with the implementation using Java networking.

For dynamically updating the cache table, we use an MS Access database.


6.7 NETWORKING TCP/IP STACK:

The TCP/IP stack is shorter than the OSI one:

TCP is a connection-oriented protocol; UDP (User Datagram Protocol) is a connectionless protocol.

IP datagrams:

The IP layer provides a connectionless and unreliable delivery system. It considers each datagram independently of the others. Any association between datagrams must be supplied by the higher layers. The IP layer supplies a checksum that covers its own header. The header includes the source and destination addresses. The IP layer handles routing through an Internet. It is also responsible for breaking up large datagrams into smaller ones for transmission and reassembling them at the other end.
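The checksum mentioned above is the 16-bit one's-complement Internet checksum. The following is a minimal sketch of that computation over an arbitrary byte array; it does not parse a real IP header, and the class name `InternetChecksum` is made up for this example.

```java
class InternetChecksum {
    // Computes the 16-bit one's-complement checksum over the given bytes.
    static int checksum(byte[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i += 2) {
            int high = data[i] & 0xFF;
            // An odd trailing byte is padded with a zero low byte.
            int low = (i + 1 < data.length) ? (data[i + 1] & 0xFF) : 0;
            sum += (high << 8) | low;
        }
        // Fold any carries above 16 bits back into the low 16 bits.
        while ((sum >> 16) != 0) {
            sum = (sum & 0xFFFF) + (sum >> 16);
        }
        return (int) (~sum & 0xFFFF);
    }

    public static void main(String[] args) {
        byte[] data = {0x00, 0x01, (byte) 0xF2, 0x03,
                       (byte) 0xF4, (byte) 0xF5, (byte) 0xF6, (byte) 0xF7};
        System.out.println(String.format("%04x", checksum(data))); // prints 220d
    }
}
```

A receiver can run the same routine over the data with the checksum appended; a result of zero indicates that no error was detected.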

UDP:

UDP is also connectionless and unreliable. What it adds to IP is a checksum for the contents of the datagram and port numbers. These are used to give a client/server model – see later.

TCP:

TCP supplies logic to give a reliable connection-oriented protocol above IP. It provides a virtual circuit that two processes can use to communicate.

Internet addresses

In order to use a service, you must be able to find it. The Internet uses an address scheme for machines so that they can be located. The address is a 32 bit integer which gives the IP address.

Network address:

Class A uses 8 bits for the network address with 24 bits left over for other addressing. Class B uses 16 bit network addressing. Class C uses 24 bit network addressing. Class D addresses are reserved for multicast groups and are not split into network and host parts.

Subnet address:

Internally, the UNIX network is divided into sub networks. Building 11 is currently on one sub network and uses 10-bit addressing, allowing 1024 different hosts.

Host address:

8 bits are finally used for host addresses within our subnet. This places a limit of 256 machines that can be on the subnet.

Total address:

The 32 bit address is usually written as 4 integers separated by dots.
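The relationship between the dotted form and the 32 bit integer, and between the first octet and the address class, can be sketched as follows. The class and method names are made up for this example; Java has no unsigned 32 bit type, so the value is held in a `long`.

```java
class Ipv4Address {
    // Packs a dotted-quad string such as "192.168.1.10" into a 32-bit value.
    static long toLong(String dotted) {
        String[] parts = dotted.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("expected 4 octets: " + dotted);
        }
        long value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("octet out of range: " + part);
            }
            value = (value << 8) | octet; // shift earlier octets up, append this one
        }
        return value;
    }

    // Returns the classful address class, decided by the leading bits of the first octet.
    static char addressClass(String dotted) {
        int first = (int) (toLong(dotted) >>> 24);
        if (first < 128) return 'A'; // leading bit 0
        if (first < 192) return 'B'; // leading bits 10
        if (first < 224) return 'C'; // leading bits 110
        if (first < 240) return 'D'; // leading bits 1110 (multicast)
        return 'E';                  // leading bits 1111 (reserved)
    }

    public static void main(String[] args) {
        System.out.println(toLong("192.168.1.10"));       // prints 3232235786
        System.out.println(addressClass("192.168.1.10")); // prints C
    }
}
```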

Port addresses

A service exists on a host, and is identified by its port. This is a 16 bit number. To send a message to a server, you send it to the port for that service of the host that it is running on. This is not location transparency! Certain of these ports are “well known”.

Sockets:

A socket is a data structure maintained by the system to handle network connections. A socket is created using the call socket. It returns an integer that is like a file descriptor. In fact, under Windows, this handle can be used with the ReadFile and WriteFile functions.

#include <sys/types.h>

#include <sys/socket.h>

int socket(int family, int type, int protocol);

Here “family” will be AF_INET for IP communications, protocol will be zero, and type will depend on whether TCP or UDP is used. Two processes wishing to communicate over a network create a socket each. These are similar to two ends of a pipe – but the actual pipe does not yet exist.
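In Java, the same idea is wrapped by the `java.net.ServerSocket` and `java.net.Socket` classes, which create TCP sockets. The sketch below is a hedged illustration rather than part of the project code: it starts a one-shot echo server on a free loopback port, connects a client socket to it, and returns the echoed line. The class name `LoopbackEcho` and its helper methods are made up for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

class LoopbackEcho {
    static BufferedReader reader(Socket s) throws IOException {
        return new BufferedReader(
                new InputStreamReader(s.getInputStream(), StandardCharsets.UTF_8));
    }

    static PrintWriter writer(Socket s) throws IOException {
        // The second argument enables auto-flush on println.
        return new PrintWriter(
                new OutputStreamWriter(s.getOutputStream(), StandardCharsets.UTF_8), true);
    }

    // Sends one line to a one-shot echo server on the loopback interface
    // and returns the line that comes back.
    static String echoOnce(String message) {
        // Port 0 asks the system to pick any free port.
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            Thread serverSide = new Thread(() -> {
                try (Socket peer = server.accept()) {
                    writer(peer).println(reader(peer).readLine());
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            serverSide.start();
            try (Socket client = new Socket(InetAddress.getLoopbackAddress(),
                                            server.getLocalPort())) {
                writer(client).println(message);
                return reader(client).readLine();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(echoOnce("hello"));
    }
}
```

Each side holds its own socket, and the connection between them exists only after the server accepts the client's connect request.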

6.8 JFREE CHART:

JFreeChart is a free 100% Java chart library that makes it easy for developers to display professional quality charts in their applications. JFreeChart’s extensive feature set includes:

A consistent and well-documented API, supporting a wide range of chart types;

A flexible design that is easy to extend, and targets both server-side and client-side applications;

Support for many output types, including Swing components, image files (including PNG and JPEG), and vector graphics file formats (including PDF, EPS and SVG);

JFreeChart is “open source” or, more specifically, free software. It is distributed under the terms of the GNU Lesser General Public Licence (LGPL), which permits use in proprietary applications.

6.8.1. Map Visualizations:

Charts showing values that relate to geographical areas. Some examples include: (a) population density in each state of the United States, (b) income per capita for each country in Europe, (c) life expectancy in each country of the world. The tasks in this project include: Sourcing freely redistributable vector outlines for the countries of the world, states/provinces in particular countries (USA in particular, but also other areas);

Creating an appropriate dataset interface (plus default implementation), a renderer, and integrating this with the existing XYPlot class in JFreeChart; testing, documenting, testing some more, documenting some more.

6.8.2. Time Series Chart Interactivity

Implement a new (to JFreeChart) feature for interactive time series charts — to display a separate control that shows a small version of ALL the time series data, with a sliding “view” rectangle that allows you to select the subset of the time series data to display in the main chart.

6.8.3. Dashboards

There is currently a lot of interest in dashboard displays. Create a flexible dashboard mechanism that supports a subset of JFreeChart chart types (dials, pies, thermometers, bars, and lines/time series) that can be delivered easily via both Java Web Start and an applet.

6.8.4. Property Editors

The property editor mechanism in JFreeChart only handles a small subset of the properties that can be set for charts. Extend (or reimplement) this mechanism to provide greater end-user control over the appearance of the charts.

CHAPTER 7

7.0 APPENDIX

7.1 SAMPLE SCREEN SHOTS:

7.2 SAMPLE SOURCE CODE:

CHAPTER 8

8.1 CONCLUSION

8.2 FUTURE ENHANCEMENT:

PSMPA Patient Self-Controllable and Multi-Level Privacy-Preserving Cooperative Authentication in Dist

05/08/201902/07/2019 by admin

The Distributed m-healthcare cloud computing system considerably facilitates secure and efficient patient treatment for medical consultation by sharing personal health information among the healthcare providers. However, it also brings about the challenge of keeping both the data confidentiality and the patients’ identity privacy simultaneously. Many existing access control and anonymous authentication schemes cannot be straightforwardly exploited. To solve the problem, a novel authorized accessible privacy model (AAPM) is established. Patients can authorize physicians by setting an access tree supporting flexible threshold predicates.

Based on our new technique of attribute-based designated verifier signature, a patient self-controllable multi-level privacy-preserving cooperative authentication scheme (PSMPA) realizing three levels of security and privacy requirement in the distributed m-healthcare cloud computing system is proposed. The directly authorized physicians, the indirectly authorized physicians and the unauthorized persons in medical consultation can respectively decipher the personal health information and/or verify patients’ identities by satisfying the access tree with their own attribute sets.
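The notion of an access tree with threshold predicates can be illustrated with a small, non-cryptographic sketch. It only models how an attribute set is checked against the tree; it is not the PSMPA scheme itself, and the class name `AccessNode` and the attribute names are hypothetical. Each gate requires at least k of its n children to be satisfied, so an AND gate is n-of-n and an OR gate is 1-of-n.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, non-cryptographic model of an access tree with threshold gates.
class AccessNode {
    final String attribute;          // set for leaves, null for gates
    final int threshold;             // for gates: how many children must be satisfied
    final List<AccessNode> children; // empty for leaves

    private AccessNode(String attribute, int threshold, List<AccessNode> children) {
        this.attribute = attribute;
        this.threshold = threshold;
        this.children = children;
    }

    static AccessNode leaf(String attribute) {
        return new AccessNode(attribute, 1, Collections.<AccessNode>emptyList());
    }

    static AccessNode gate(int threshold, AccessNode... children) {
        return new AccessNode(null, threshold, Arrays.asList(children));
    }

    // A leaf is satisfied when the attribute is held; a gate is satisfied when
    // at least `threshold` of its children are satisfied.
    boolean satisfiedBy(Set<String> attributes) {
        if (attribute != null) {
            return attributes.contains(attribute);
        }
        int satisfied = 0;
        for (AccessNode child : children) {
            if (child.satisfiedBy(attributes)) {
                satisfied++;
            }
        }
        return satisfied >= threshold;
    }

    public static void main(String[] args) {
        // 2-of-3 policy over made-up attributes.
        AccessNode policy = gate(2, leaf("cardiologist"), leaf("hospital-A"), leaf("senior"));
        Set<String> physician = new HashSet<>(Arrays.asList("cardiologist", "hospital-A"));
        System.out.println(policy.satisfiedBy(physician)); // prints true
    }
}
```

In the actual scheme, satisfying the tree is tied to cryptographic keys and signatures rather than a plain set lookup.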

1.2 INTRODUCTION:

Distributed m-healthcare cloud computing systems have been increasingly adopted worldwide, including in European Commission activities, under the US Health Insurance Portability and Accountability Act (HIPAA), and by many other governments, for efficient and high-quality medical treatment. In m-healthcare social networks, personal health information is always shared among patients located in respective social communities suffering from the same disease for mutual support, and across distributed healthcare providers (HPs) equipped with their own cloud servers for medical consultation. However, it also brings about a series of challenges, especially how to ensure the security and privacy of the patients’ personal health information against various attacks in the wireless communication channel, such as eavesdropping and tampering. As to the security facet, one of the main issues is access control of patients’ personal health information; namely, only the authorized physicians or institutions can recover the patients’ personal health information during data sharing in the distributed m-healthcare cloud computing system. In practice, most patients are concerned about the confidentiality of their personal health information, since any kind of unauthorized collection and disclosure is likely to cause them trouble.

Therefore, in distributed m-healthcare cloud computing systems, which part of the patients’ personal health information should be shared and which physicians it should be shared with have become two intractable problems demanding urgent solutions. Various research results focusing on them have emerged. A fine-grained distributed data access control scheme has been proposed using the technique of attribute-based encryption (ABE). A rendezvous-based access control method provides access privilege if and only if the patient and the physician meet in the physical world. Recently, a patient-centric and fine-grained data access control scheme in multi-owner settings was constructed for securing personal health records in cloud computing. However, it mainly focuses on the central cloud computing system, which is not sufficient for efficiently processing the increasing volume of personal health information in an m-healthcare cloud computing system.

Moreover, it is not enough to guarantee only the data confidentiality of the patient’s personal health information in the honest-but-curious cloud server model, since the frequent communication between a patient and a professional physician can lead the adversary to conclude, with high probability, that the patient is suffering from a specific disease. Unfortunately, the problem of how to protect both the patients’ data confidentiality and identity privacy in the distributed m-healthcare cloud computing scenario under the malicious model was left untouched.

In this paper, we consider simultaneously achieving data confidentiality and identity privacy with high efficiency. As described in Fig. 1, in distributed m-healthcare cloud computing systems, all the members can be classified into three categories. The directly authorized physicians, with green labels, in the local healthcare provider are authorized by the patients and can both access the patient’s personal health information and verify the patient’s identity. The indirectly authorized physicians, with yellow labels, in the remote healthcare providers are authorized by the directly authorized physicians for medical consultation or some research purposes (since they are not authorized by the patients, we use the term ‘indirectly authorized’). They can only access the personal health information, but not the patient’s identity. For the unauthorized persons, with red labels, nothing can be obtained. By extending the techniques of attribute based access control and designated verifier signatures (DVS) on de-identified health information

1.3 LITERATURE SURVEY

SECURING PERSONAL HEALTH RECORDS IN CLOUD COMPUTING: PATIENT-CENTRIC AND FINE-GRAINED DATA ACCESS CONTROL IN MULTI-OWNER SETTINGS

AUTHOR: M. Li, S. Yu, K. Ren, and W. Lou

PUBLISH: Proc. 6th Int. ICST Conf. Security Privacy Comm. Netw., 2010, pp. 89–106.

EXPLANATION:

Online personal health record (PHR) enables patients to manage their own medical records in a centralized way, which greatly facilitates the storage, access and sharing of personal health data. With the emergence of cloud computing, it is attractive for the PHR service providers to shift their PHR applications and storage into the cloud, in order to enjoy the elastic resources and reduce the operational cost. However, by storing PHRs in the cloud, the patients lose physical control over their personal health data, which makes it necessary for each patient to encrypt her PHR data before uploading to the cloud servers. Under encryption, it is challenging to achieve fine-grained access control to PHR data in a scalable and efficient way. For each patient, the PHR data should be encrypted so that it is scalable with the number of users having access. Also, since there are multiple owners (patients) in a PHR system and every owner would encrypt her PHR files using a different set of cryptographic keys, it is important to reduce the key distribution complexity in such multi-owner settings. Existing cryptographic enforced access control schemes are mostly designed for the single-owner scenarios. In this paper, we propose a novel framework for access control to PHRs within cloud computing environment. To enable fine-grained and scalable access control for PHRs, we leverage attribute based encryption (ABE) techniques to encrypt each patient’s PHR data. To reduce the key distribution complexity, we divide the system into multiple security domains, where each domain manages only a subset of the users. In this way, each patient has full control over her own privacy, and the key management complexity is reduced dramatically.

PRIVACY AND EMERGENCY RESPONSE IN E-HEALTHCARE LEVERAGING WIRELESS BODY SENSOR NETWORKS

AUTHOR: J. Sun, Y. Fang, and X. Zhu

PUBLISH: IEEE Wireless Commun., vol. 17, no. 1, pp. 66–73, Feb. 2010.

EXPLANATION:

Electronic healthcare is becoming a vital part of our living environment and exhibits advantages over paper-based legacy systems. Privacy is the foremost concern of patients and the biggest impediment to e-healthcare deployment. In addressing privacy issues, conflicts from the functional requirements must be taken into account. One such requirement is efficient and effective response to medical emergencies. In this article, we provide detailed discussions on the privacy and security issues in e-healthcare systems and viable techniques for these issues. Furthermore, we demonstrate the design challenge in the fulfillment of conflicting goals through an exemplary scenario, where the wireless body sensor network is leveraged, and a sound solution is proposed to overcome the conflict.

HCPP: CRYPTOGRAPHY BASED SECURE EHR SYSTEM FOR PATIENT PRIVACY AND EMERGENCY HEALTHCARE

AUTHOR: J. Sun, X. Zhu, C. Zhang, and Y. Fang

PUBLISH: Proc. 31st Int. Conf. Distrib. Comput. Syst., 2011, pp. 373–382.

EXPLANATION:

Privacy concern is arguably the major barrier that hinders the deployment of electronic health record (EHR) systems which are considered more efficient, less error-prone, and of higher availability compared to traditional paper record systems. Patients are unwilling to accept the EHR system unless their protected health information (PHI) containing highly confidential data is guaranteed proper use and disclosure, which cannot be easily achieved without patients’ control over their own PHI. However, cautions must be taken to handle emergencies in which the patient may be physically incompetent to retrieve the controlled PHI for emergency treatment. In this paper, we propose a secure EHR system, HCPP (Healthcare system for Patient Privacy), based on cryptographic constructions and existing wireless network infrastructures, to provide privacy protection to patients under any circumstances while enabling timely PHI retrieval for life-saving treatment in emergency situations. Furthermore, our HCPP system restricts PHI access to authorized (not arbitrary) physicians, who can be traced and held accountable if the accessed PHI is found improperly disclosed. Last but not least, HCPP leverages wireless network access to support efficient and private storage/retrieval of PHI, which underlies a secure and feasible EHR system.

CHAPTER 2

2.0 SYSTEM ANALYSIS

2.1 EXISTING SYSTEM:

In the existing framework, data confidentiality is important, but guaranteeing only the confidentiality of the patient’s personal health information is not enough in the honest-but-curious cloud server model: frequent communication between a patient and a professional physician can lead an adversary to conclude, with high probability, that the patient is suffering from a specific disease. Unfortunately, the problem of how to protect both the patients’ data confidentiality and identity privacy in the distributed m-healthcare cloud computing scenario under the malicious model was left untouched.

Patients are unwilling to accept the EHR system unless their protected health information (PHI) containing highly confidential data is guaranteed proper use and disclosure, which cannot be easily achieved without patients’ control over their own PHI. However, caution must be taken to handle emergencies in which the patient may be physically unable to retrieve the controlled PHI for emergency treatment. An existing secure EHR system, HCPP (Healthcare system for Patient Privacy), is based on cryptographic constructions and existing wireless network infrastructures, and aims to provide privacy protection to patients under any circumstances while enabling timely PHI retrieval for life-saving treatment in emergency situations.

2.1.1 DISADVANTAGES:

Existing e-healthcare applications rely on real-time, continuous vital-sign monitoring to give immediate alerts of changes in patient status. However, the wireless body area network (WBAN) operates in environments that are openly accessible to many people, such as hospitals or medical organizations, and that can therefore also accommodate attackers. The open wireless channel leaves the data prone to eavesdropping, modification, and injection. Many kinds of security threats exist, such as unauthenticated or unauthorized access, message disclosure, message modification, denial of service, node capture and compromised nodes, and routing attacks. Among these, two kinds of threats play the leading role: threats from device compromise and threats from network dynamics.

Security problems are on the rise. In particular, the privacy of communication over the Internet can be attacked in a number of ways. Collecting, transmitting, and processing personal data online poses a severe threat to privacy. Wherever Internet-based services are used, the lack of privacy in network communication is a major topic of public debate. The problem is far more significant in the modern medical environment, as e-healthcare networks are implemented and developed. Such networks typically link general practitioners, hospitals, and social centers at a national or international scale, so the private information they carry is at great risk of being leaked.

Data confidentiality is low.
Data redundancy is high.
There is a violation in data security.

2.2 PROPOSED SYSTEM:

We presented a new pseudonymization architecture for protecting privacy in e-health (PIPE) that integrates pseudonymization of medical data, identity management, and obfuscation of metadata with anonymous authentication to prevent disclosure attacks and statistical analysis, and suggested a secure mechanism guaranteeing anonymity and privacy in both the transfer and the storage of personal health information at a central m-healthcare cloud server.

We proposed an anonymous authentication of membership in dynamic groups. However, since the anonymous authentication schemes mentioned above are built on public key infrastructure (PKI), the need for an online certificate authority (CA) and for one separate public key encryption of each symmetric data-encryption key k at the portal of the authorized physicians makes the overhead of the construction grow linearly with the size of the group. Furthermore, the anonymity level depends on the size of the anonymity set, which makes the anonymous authentication impractical in settings where patients are sparsely distributed.

In this paper, the security and anonymity level of our proposed construction is significantly enhanced by associating it with the underlying Gap Bilinear Diffie-Hellman (GBDH) problem and the number of patients’ attributes, in order to deal with privacy leakage in scenarios where patients are sparsely distributed. More significantly, when the patient does not know which physician at the healthcare provider specializes in treating his illness, the best option is for the patient to encrypt his own PHI under a specified access policy rather than to assign each physician a secret key. As a result, the authorized physicians whose attribute sets satisfy the access policy can recover the PHI, and access control management also becomes more efficient.

2.2.1 ADVANTAGES:

An earlier approach provides patient-centric, fine-grained data access control using attribute-based encryption (ABE) to secure personal health records in cloud computing, but without privacy-preserving authentication. For comparison, achieving the same functions as PSMPA can be regarded as combining ABE with a designated verifier signature (DVS). The computational complexity of PSMPA remains constant regardless of the number of directly authorized physicians and is nearly half that of the combined ABE and DVS construction supporting flexible predicates.

The communication cost of PSMPA also remains constant: it is almost half that of the combination construction and independent of the number of attributes d. Although the storage overhead of PSMPA is slightly higher than that of the combination construction, it is independent of the number of directly authorized physicians and is significantly better than that of traditional DVS, whose computational, communication, and storage overheads all increase linearly with the number of directly authorized physicians.

M-healthcare system is fully controlled and secured with encryption standards.
There is no data loss and data redundancy.
System provides full protection for patient’s data and their attributes.

2.3 HARDWARE & SOFTWARE REQUIREMENTS:

2.3.1 HARDWARE REQUIREMENT:

- Processor – Pentium IV
- Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Key Board – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA

2.3.2 SOFTWARE REQUIREMENTS:

Operating System : Windows XP or Win7
Front End : Microsoft Visual Studio .NET 2008
Script : C# Script
Back End : MS-SQL Server 2005
Document : MS-Office 2007

CHAPTER 3

3.0 SYSTEM DESIGN:

Data Flow Diagram / Use Case Diagram / Flow Diagram:

The DFD is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on this data, and the output data generated by the system.

The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components. These components are the system process, the data used by the process, an external entity that interacts with the system and the information flows in the system.

DFD shows how the information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.

A DFD may be used to represent a system at any level of abstraction. A DFD may be partitioned into levels that represent increasing information flow and functional detail.

NOTATION:

SOURCE OR DESTINATION OF DATA:

External sources or destinations, which may be people or organizations or other entities

DATA STORE:

Here the data referenced by a process is stored and retrieved.

PROCESS:

People, procedures, or devices that produce data. The physical component is not identified.

DATA FLOW:

Data moves in a specific direction from an origin to a destination. The data flow is a “packet” of data.

MODELING RULES:

There are several common modeling rules when creating DFDs:

All processes must have at least one data flow in and one data flow out.
All processes should modify the incoming data, producing new forms of outgoing data.
Each data store must be involved with at least one data flow.
Each external entity must be involved with at least one data flow.
A data flow must be attached to at least one process.
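
The structural rules above lend themselves to a mechanical check. The following is a minimal Python sketch, not part of the original project: the `DFD` class, the `check_dfd` function, and the example node names are all hypothetical. The rule that every process should modify its incoming data is semantic rather than structural, so it is not checked here.

```python
from dataclasses import dataclass


@dataclass
class DFD:
    """Hypothetical container for a data flow diagram."""
    processes: set
    stores: set
    externals: set
    flows: list  # list of (source, target) name pairs


def check_dfd(dfd):
    """Return a list of rule violations (empty if the diagram passes)."""
    problems = []
    # Rule: every process has at least one data flow in and one out.
    for p in sorted(dfd.processes):
        has_in = any(dst == p for _, dst in dfd.flows)
        has_out = any(src == p for src, _ in dfd.flows)
        if not (has_in and has_out):
            problems.append(f"process '{p}' needs an inflow and an outflow")
    # Rule: every data store and external entity takes part in a flow.
    for node in sorted(dfd.stores | dfd.externals):
        if not any(node in flow for flow in dfd.flows):
            problems.append(f"'{node}' is not involved in any data flow")
    # Rule: every data flow is attached to at least one process.
    for src, dst in dfd.flows:
        if src not in dfd.processes and dst not in dfd.processes:
            problems.append(f"flow {src} -> {dst} is not attached to a process")
    return problems


good = DFD(
    processes={"Authenticate"},
    stores={"PHI store"},
    externals={"Patient"},
    flows=[("Patient", "Authenticate"), ("Authenticate", "PHI store")],
)
bad = DFD(
    processes={"Authenticate"},
    stores={"PHI store"},
    externals={"Patient"},
    flows=[("Patient", "PHI store")],  # bypasses the process entirely
)
print(check_dfd(good))
print(check_dfd(bad))
```

In the second diagram the only flow runs directly from an external entity to a data store, so the check reports both the isolated process and the unattached flow.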

3.1 ARCHITECTURE DIAGRAM

3.2 DATAFLOW DIAGRAM:

UML DIAGRAMS:

3.2 USE CASE DIAGRAM:

3.3 CLASS DIAGRAM:

3.4 SEQUENCE DIAGRAM:

3.5 ACTIVITY DIAGRAM:

CHAPTER 4

4.0 IMPLEMENTATION:

In our implementation, we choose the MIRACL library for simulating cryptographic operations using Microsoft C/C++ compilers, with parameters chosen according to the standards of the Pairing-Based Crypto library to achieve security comparable to 1,024-bit RSA.

For comparison, we consider the patient-centric and fine-grained data access control using ABE to secure personal health records in cloud computing [30], which lacks privacy-preserving authentication. Achieving the same functions as PSMPA can be regarded as combining ABE with DVS. The computational complexity of PSMPA remains constant regardless of the number of directly authorized physicians and is nearly half that of the combined ABE and DVS construction supporting flexible predicates. Fig. 5 illustrates that the communication cost of PSMPA also remains constant, almost half that of the combination construction and independent of the number of attributes d. Fig. 6 shows that although the storage overhead of PSMPA is slightly higher than that of the combination construction, it is independent of the number of directly authorized physicians and is significantly better than that of traditional DVS, whose computational, communication, and storage overheads all increase linearly with the number of directly authorized physicians.

The results also show that the computational and communication overheads of the combination construction decrease slightly faster than those of PSMPA as the threshold k increases; however, even when k reaches its maximum value, equal to d, these overheads are still much higher than those of PSMPA. In the comparison between our scheme and anonymous authentication based on PKI, the storage, communication, and computational overheads with respect to N and k are identical to those of DVS, since, to realize the same identity privacy, every construction assigns a public/private key pair to each directly authorized physician, and the number of signature operations is also linear in the number of physicians, independent of the threshold k.

The simulation results show that our PSMPA adapts better to the distributed m-healthcare cloud computing system than previous schemes, especially in improving the efficiency of the energy-constrained mobile devices (the data sinks).

4.1 ALGORITHM

Attribute-Based Designated Verifier Signature Scheme:

We propose a patient self-controllable and multi-level privacy-preserving cooperative authentication scheme based on ADVS to realize three levels of security and privacy requirements in the distributed m-healthcare cloud computing system. It mainly consists of the following five algorithms: Setup, Key Extraction, Sign, Verify, and Transcript Simulation Generation. Denote the universe of attributes as U.

4.2 MODULES:

E-HEALTHCARE SYSTEM FRAMEWORK:

AUTHORIZED ACCESSIBLE PRIVACY MODEL:

SECURITY VERIFICATION:

PERFORMANCE EVALUATION:

4.3 MODULE DESCRIPTION:

E-healthcare System Framework:

The e-healthcare system consists of three components: body area networks (BANs), wireless transmission networks, and the healthcare providers equipped with their own cloud servers. The patient’s personal health information is securely transmitted to the healthcare provider for the authorized physicians to access and perform medical treatment. A unique characteristic of distributed m-healthcare cloud computing systems is that all the personal health information can be shared among patients suffering from the same disease for mutual support, or among the authorized physicians in distributed healthcare providers and medical research institutions for medical consultation.

Authorized accessible privacy model:

Multi-level privacy-preserving cooperative authentication is established to allow the patients to authorize corresponding privileges to different kinds of physicians located in distributed healthcare providers by setting an access tree supporting flexible threshold predicates. We propose a novel authorized accessible privacy model for distributed m-healthcare cloud computing systems, which consists of the following two components: an attribute-based designated verifier signature scheme (ADVS) and the corresponding adversary model.
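
To make the idea of an access tree with threshold predicates concrete, the following Python sketch evaluates whether a physician's attribute set satisfies a patient-defined policy. It is only an illustration of the predicate logic: the `Leaf` and `Gate` classes, the `satisfies` function, and the attribute names are hypothetical, and no cryptography is involved. In the actual scheme the predicate is enforced cryptographically rather than by a plain check like this.

```python
from dataclasses import dataclass
from typing import List, Set, Union


@dataclass
class Leaf:
    """A leaf of the access tree: a single required attribute."""
    attribute: str


@dataclass
class Gate:
    """A k-of-n threshold gate: at least `threshold` children must hold."""
    threshold: int
    children: List["Node"]


Node = Union[Leaf, Gate]


def satisfies(node: Node, attributes: Set[str]) -> bool:
    """Return True if the attribute set satisfies the (sub)tree at `node`."""
    if isinstance(node, Leaf):
        return node.attribute in attributes
    satisfied = sum(1 for child in node.children if satisfies(child, attributes))
    return satisfied >= node.threshold


# Hypothetical policy: (cardiologist AND hospital_A)
#                      OR any 2 of {emergency_team, senior, on_duty}.
policy = Gate(1, [
    Gate(2, [Leaf("cardiologist"), Leaf("hospital_A")]),
    Gate(2, [Leaf("emergency_team"), Leaf("senior"), Leaf("on_duty")]),
])

print(satisfies(policy, {"cardiologist", "hospital_A"}))  # True
print(satisfies(policy, {"cardiologist", "on_duty"}))     # False
print(satisfies(policy, {"senior", "on_duty"}))           # True
```

An AND gate is the special case where the threshold equals the number of children, and an OR gate is the case where the threshold is 1, which is why a single threshold gate type is enough to express flexible predicates.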

Security Verification:

The security and anonymity level of our proposed construction is significantly enhanced by associating it with the underlying Gap Bilinear Diffie-Hellman (GBDH) problem and the number of patients’ attributes, in order to deal with privacy leakage in scenarios where patients are sparsely distributed. More significantly, when the patient does not know which physician at the healthcare provider specializes in treating his illness, the best option is for the patient to encrypt his own PHI under a specified access policy rather than to assign each physician a secret key. As a result, the authorized physicians whose attribute sets satisfy the access policy can recover the PHI, and access control management also becomes more efficient.

Performance Evaluation:

We evaluate the efficiency of PSMPA in terms of storage overhead, computational complexity, and communication cost. For comparison, we consider patient-centric and fine-grained data access control using ABE to secure personal health records in cloud computing without privacy-preserving authentication. To achieve the same security, our construction performs more efficiently than the traditional designated verifier signature applied to all the directly authorized physicians, whose overheads are linear in the number of directly authorized physicians.

CHAPTER 5

5.0 SYSTEM STUDY:

5.1 FEASIBILITY STUDY:

Three key considerations involved in the feasibility analysis are

ECONOMICAL FEASIBILITY
TECHNICAL FEASIBILITY
SOCIAL FEASIBILITY

5.1.1 ECONOMICAL FEASIBILITY:

5.1.2 TECHNICAL FEASIBILITY:

5.1.3 SOCIAL FEASIBILITY:

5.2 SYSTEM TESTING:

5.2.1 UNIT TESTING:

Description	Expected result
Test for application window properties.	All the properties of the windows are to be properly aligned and displayed.
Test for mouse operations.	All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions.

5.2.2 FUNCTIONAL TESTING:

Description	Expected result
Test for all modules.	All peers should communicate in the group.
Test various peers in a distributed network framework and check that all users available in the group are displayed.	The result after execution should be accurate.

5.2.3 NON-FUNCTIONAL TESTING:

Load testing
Performance testing
Usability testing
Reliability testing
Security testing

5.2.4 LOAD TESTING:

Description	Expected result
It is necessary to ascertain that the application behaves correctly under load when a ‘Server busy’ response is received.	Should designate another active node as the server.

5.2.5 PERFORMANCE TESTING:

Description	Expected result
This is required to assure that an application performs adequately, having the capability to handle many peers, delivering its results in the expected time, and using an acceptable level of resources; it is an aspect of operational management.	Should handle large input values and produce accurate results in the expected time.

5.2.6 RELIABILITY TESTING:

Description	Expected result
This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in providing the application.	In case of failure of the server, an alternate server should take over the job.

5.2.7 SECURITY TESTING:

Description	Expected result
Checking that the user identification is authenticated.	In case of failure, the user should not be connected to the framework.
Check whether group keys in a tree are shared by all peers.	The peers should know group key in the same group.

5.2.8 WHITE BOX TESTING:

Description	Expected result
Exercise all logical decisions on their true and false sides.	All the logical decisions must be valid.
Execute all loops at their boundaries and within their operational bounds.	All the loops must be finite.
Exercise internal data structures to ensure their validity.	All the data structures must be valid.

5.2.9 BLACK BOX TESTING:

Description	Expected result
To check for incorrect or missing functions.	All the functions must be valid.
To check for interface errors.	The entire interface must function normally.
To check for errors in data structures or external database access.	Database updates and retrievals must be performed correctly.
To check for initialization and termination errors.	All the functions and data structures must be initialized properly and terminated normally.

All of the above system testing strategies are carried out, as the development, documentation, and institutionalization of the proposed goals and related policies are essential.

CHAPTER 7

7.0 SOFTWARE SPECIFICATION:

7.1 FEATURES OF .NET:

7.2 THE .NET FRAMEWORK

The .NET Framework has two main parts:

1. The Common Language Runtime (CLR).

2. A hierarchical set of class libraries.

The CLR is described as the “execution engine” of .NET. It provides the environment within which programs run. The most important features are

Conversion from a low-level assembler-style language, called Intermediate Language (IL), into code native to the platform being executed on.
Memory management, notably including garbage collection.
Checking and enforcing security restrictions on the running code.
Loading and executing programs, with version control and other such features.
The following features of the .NET framework are also worth description:

Managed Code

Managed Data

Common Type System

Common Language Specification

7.3 THE CLASS LIBRARY

The set of classes is pretty comprehensive, providing collections, file, screen, and network I/O, threading, and so on, as well as XML and database connectivity.

The class library is subdivided into a number of sets (or namespaces), each providing distinct areas of functionality, with dependencies between the namespaces kept to a minimum.

7.4 LANGUAGES SUPPORTED BY .NET

Visual Basic .NET is also CLS compliant, which means that any CLS-compliant language can use the classes, objects, and components you create in Visual Basic .NET.

Other languages for which .NET compilers are available include

FORTRAN
COBOL
Eiffel

ASP.NET XML WEB SERVICES	Windows Forms
Base Class Libraries
Common Language Runtime
Operating System

Fig. 1: .NET Framework

CONSTRUCTORS AND DESTRUCTORS:

GARBAGE COLLECTION

OVERLOADING

MULTITHREADING:

STRUCTURED EXCEPTION HANDLING

7.5 THE .NET FRAMEWORK

The .NET Framework is a new computing platform that simplifies application development in the highly distributed environment of the Internet.

OBJECTIVES OF .NET FRAMEWORK

1. To provide a consistent object-oriented programming environment whether object code is stored and executed locally, executed locally but Internet-distributed, or executed remotely.

2. To provide a code-execution environment that minimizes software deployment and guarantees safe execution of code.

3. To eliminate performance problems.

There are different types of application, such as Windows-based applications and Web-based applications.

7.6 FEATURES OF SQL-SERVER

A SQL-SERVER database consists of several types of objects.

They are:

1. TABLE

2. QUERY

3. FORM

4. REPORT

5. MACRO

7.7 TABLE:

A database is a collection of data about a specific topic.

VIEWS OF TABLE:

We can work with a table in two views:

1. Design View

2. Datasheet View

Design View

To build or modify the structure of a table, we work in the table design view. There we can specify what kind of data each field will hold.

Datasheet View

To add, edit, or analyze the data itself, we work in the table’s datasheet view mode.

QUERY:

CHAPTER 7

7.0 APPENDIX

7.1 SAMPLE SCREEN SHOTS:

7.2 SAMPLE SOURCE CODE:

CHAPTER 8

8.1 CONCLUSION AND FUTURE ENHANCEMENT:

In this paper, a novel authorized accessible privacy model and a patient self-controllable multi-level privacy preserving cooperative authentication scheme realizing three different levels of security and privacy requirement in the distributed m-healthcare cloud computing system are proposed, followed by the formal security proof and efficiency evaluations which illustrate our PSMPA can resist various kinds of malicious attacks and far outperforms previous schemes in terms of storage, computational and communication overhead.

Privacy-Preserving Detection of Sensitive Data Exposure

05/08/201902/07/2019 by admin

Statistics from security firms, research institutions, and government organizations show that the number of data-leak instances has grown rapidly in recent years. Among various data-leak cases, human mistakes are one of the main causes of data loss. There exist solutions that detect inadvertent sensitive data leaks caused by human mistakes and provide alerts for organizations. A common approach is to screen content in storage and transmission for exposed sensitive information. Such an approach usually requires the detection operation to be conducted in secrecy. However, this secrecy requirement is challenging to satisfy in practice, as detection servers may be compromised or outsourced.

In this paper, we present a privacy-preserving data-leak detection (DLD) solution to solve the issue, in which a special set of sensitive data digests is used in detection. The advantage of our method is that it enables the data owner to safely delegate the detection operation to a semihonest provider without revealing the sensitive data to the provider. We describe how Internet service providers can offer their customers DLD as an add-on service with strong privacy guarantees. The evaluation results show that our method can support accurate detection with a very small number of false alarms under various data-leak scenarios.

1.2 INTRODUCTION

According to a report from Risk Based Security (RBS), the number of leaked sensitive data records has increased dramatically during the last few years, i.e., from 412 million in 2012 to 822 million in 2013. Deliberately planned attacks, inadvertent leaks (e.g., forwarding confidential emails to unclassified email accounts), and human mistakes (e.g., assigning the wrong privilege) lead to most of the data-leak incidents. Detecting and preventing data leaks requires a set of complementary solutions, which may include data-leak detection, data confinement, stealthy malware detection and policy enforcement.

Network data-leak detection (DLD) typically performs deep packet inspection (DPI) and searches for any occurrences of sensitive data patterns. DPI is a technique to analyze payloads of IP/TCP packets for inspecting application layer data, e.g., HTTP header/content. Alerts are triggered when the amount of sensitive data found in traffic passes a threshold. The detection system can be deployed on a router or integrated into existing network intrusion detection systems (NIDS). Straightforward realizations of data-leak detection require the plaintext sensitive data.
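
The straightforward plaintext approach described above can be sketched in a few lines of Python. This is an illustrative baseline only, not the detection system of the paper: the function names, the 8-byte n-gram length, the 0.5 threshold, and the sample payloads are assumptions made for the example. Note that the detector must hold the sensitive data in plaintext, which is exactly the weakness discussed next.

```python
def shingles(data: bytes, n: int = 8) -> set:
    """All overlapping n-byte substrings (n-grams) of the data."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def sensitivity(payload: bytes, sensitive: bytes, n: int = 8) -> float:
    """Fraction of the sensitive data's n-grams that occur in the payload."""
    wanted = shingles(sensitive, n)
    if not wanted:
        return 0.0
    return len(wanted & shingles(payload, n)) / len(wanted)


def should_alert(payload: bytes, sensitive: bytes,
                 threshold: float = 0.5, n: int = 8) -> bool:
    """Raise an alert when the amount of sensitive data reaches the threshold."""
    return sensitivity(payload, sensitive, n) >= threshold


# Hypothetical sensitive record and application-layer payloads.
secret = b"SSN 123-45-6789 belongs to Alice"
leaky = b"POST /form HTTP/1.1\r\n\r\nnote=SSN 123-45-6789 belongs to Alice&x=1"
clean = b"GET /index.html HTTP/1.1"

print(should_alert(leaky, secret))   # True
print(should_alert(clean, secret))   # False
```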

However, this requirement is undesirable, as it may threaten the confidentiality of the sensitive information. If a detection system is compromised, then it may expose the plaintext sensitive data (in memory). In addition, the data owner may need to outsource the data-leak detection to providers, but may be unwilling to reveal the plaintext sensitive data to them. Therefore, one needs new data-leak detection solutions that allow the providers to scan content for leaks without learning the sensitive information.

In this paper, we propose a data-leak detection solution which can be outsourced and be deployed in a semihonest detection environment. We design, implement, and evaluate our fuzzy fingerprint technique that enhances data privacy during data-leak detection operations. Our approach is based on a fast and practical one-way computation on the sensitive data (SSN records, classified documents, sensitive emails, etc.). It enables the data owner to securely delegate the content-inspection task to DLD providers without exposing the sensitive data. Using our detection method, the DLD provider, who is modeled as an honest-but-curious (aka semi-honest) adversary, can only gain limited knowledge about the sensitive data from either the released digests, or the content being inspected. Using our techniques, an Internet service provider (ISP) can perform detection on its customers’ traffic securely and provide data-leak detection as an add-on service for its customers. In another scenario, individuals can mark their own sensitive data and ask the administrator of their local network to detect data leaks for them.

In our detection procedure, the data owner computes a special set of digests or fingerprints from the sensitive data and then discloses only a small amount of them to the DLD provider. The DLD provider computes fingerprints from network traffic and identifies potential leaks in them. To prevent the DLD provider from gathering exact knowledge about the sensitive data, the collection of potential leaks is composed of real leaks and noises. It is the data owner, who post-processes the potential leaks sent back by the DLD provider and determines whether there is any real data leak.
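
The three-step procedure above (owner releases digests, provider reports potential leaks, owner post-processes) can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: truncated SHA-256 is used as a stand-in fingerprint function, fuzzification is modeled by releasing only the high bits of each fingerprint, and all function names, the shingle length, and the number of hidden bits are hypothetical. Partial disclosure (sampling only some of the digests) is omitted.

```python
import hashlib


def fingerprint(shingle: bytes) -> int:
    """32-bit one-way digest of a shingle (stand-in: truncated SHA-256)."""
    return int.from_bytes(hashlib.sha256(shingle).digest()[:4], "big")


def fingerprints(data: bytes, n: int = 8) -> set:
    """Fingerprints of all overlapping n-byte shingles of the data."""
    return {fingerprint(data[i:i + n]) for i in range(len(data) - n + 1)}


def fuzzify(fps: set, fuzzy_bits: int) -> set:
    """Owner side: hide the low bits so each released digest matches many values."""
    return {fp >> fuzzy_bits for fp in fps}


def provider_scan(traffic: bytes, released: set, fuzzy_bits: int, n: int = 8) -> set:
    """Provider side: report traffic fingerprints whose high bits were released.

    The result mixes real leaks with noise, because unrelated shingles can
    share the same high bits.
    """
    return {fp for fp in fingerprints(traffic, n) if (fp >> fuzzy_bits) in released}


def owner_postprocess(candidates: set, exact_fps: set) -> set:
    """Owner side: keep only candidates that match an exact fingerprint."""
    return candidates & exact_fps


# Hypothetical sensitive record and network traffic.
secret = b"patient-record-0042: diagnosis confidential"
exact = fingerprints(secret)
released = fuzzify(exact, fuzzy_bits=12)

traffic = b"hello " + secret + b" bye"
candidates = provider_scan(traffic, released, fuzzy_bits=12)
real_leaks = owner_postprocess(candidates, exact)
print(len(real_leaks) == len(exact))  # every sensitive shingle was leaked
```

Increasing `fuzzy_bits` in this sketch makes the released digests less precise, so the provider learns less but returns more noise for the owner to filter out.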

Our contributions are summarized as follows.

1) We describe a privacy-preserving data-leak detection model for preventing inadvertent data leak in network traffic. Our model supports detection operation delegation and ISPs can provide data-leak detection as an add-on service to their customers using our model. We design, implement, and evaluate an efficient technique, fuzzy fingerprint, for privacy-preserving data-leak detection. Fuzzy fingerprints are special sensitive data digests prepared by the data owner for release to the DLD provider.

2) We implement our detection system and perform extensive experimental evaluation on the 2.6 GB Enron dataset, the Internet surfing traffic of 20 users, and 5 simulated real-world data-leak scenarios to measure its privacy guarantee, detection rate, and efficiency. Our results indicate high accuracy achieved by our underlying scheme with a very low false positive rate. Our results also show that the detection accuracy does not degrade much when only partial (sampled) sensitive-data digests are used. In addition, we give an empirical analysis of our fuzzification as well as of the fairness of fingerprint partial disclosure.

1.3 LITERATURE SURVEY

PRIVACY-AWARE COLLABORATIVE SPAM FILTERING

AUTHORS: K. Li, Z. Zhong, and L. Ramaswamy

PUBLISH: IEEE Trans. Parallel Distrib. Syst., vol. 20, no. 5, pp. 725–739, May 2009.

EXPLANATION:

While the concept of collaboration provides a natural defense against massive spam e-mails directed at large numbers of recipients, designing effective collaborative anti-spam systems raises several important research challenges. First and foremost, since e-mails may contain confidential information, any collaborative anti-spam approach has to guarantee strong privacy protection to the participating entities. Second, the continuously evolving nature of spam demands that the collaborative techniques be resilient to various kinds of camouflage attacks. Third, the collaboration has to be lightweight, efficient, and scalable. Toward addressing these challenges, this paper presents ALPACAS, a privacy-aware framework for collaborative spam filtering. In designing the ALPACAS framework, we make two unique contributions. The first is a feature-preserving message transformation technique that is highly resilient against the latest kinds of spam attacks. The second is a privacy-preserving protocol that provides enhanced privacy guarantees to the participating entities. Our experimental results on a real e-mail data set show that the proposed framework provides a 10-fold improvement in the false negative rate over the Bayesian-based Bogofilter when faced with one of the recent kinds of spam attacks. Further, privacy breaches are extremely rare. This demonstrates the strong privacy protection provided by the ALPACAS system.

DATA LEAK DETECTION AS A SERVICE: CHALLENGES AND SOLUTIONS

AUTHORS: X. Shu and D. Yao

PUBLISH: Proc. 8th Int. Conf. Secur. Privacy Commun. Netw., 2012, pp. 222–240

EXPLANATION:

We describe a network-based data-leak detection (DLD) technique, the main feature of which is that the detection does not require the data owner to reveal the content of the sensitive data. Instead, only a small amount of specialized digests is needed. Our technique, referred to as the fuzzy fingerprint, can be used to detect accidental data leaks due to human errors or application flaws. The privacy-preserving feature of our algorithms minimizes the exposure of sensitive data and enables the data owner to safely delegate the detection to others. We describe how cloud providers can offer their customers data-leak detection as an add-on service with strong privacy guarantees. We perform extensive experimental evaluation on the privacy, efficiency, accuracy, and noise tolerance of our techniques. Our evaluation results under various data-leak scenarios and setups show that our method can support accurate detection with a very small number of false alarms, even when the presentation of the data has been transformed. They also indicate that the detection accuracy does not degrade when partial digests are used. We further provide a quantifiable method to measure the privacy guarantee offered by our fuzzy fingerprint framework.

QUANTIFYING INFORMATION LEAKS IN OUTBOUND WEB TRAFFIC

AUTHORS: K. Borders and A. Prakash

PUBLISH: Proc. 30th IEEE Symp. Secur. Privacy, May 2009, pp. 129–140.

EXPLANATION:

As the Internet grows and network bandwidth continues to increase, administrators are faced with the task of keeping confidential information from leaving their networks. Today’s network traffic is so voluminous that manual inspection would be unreasonably expensive. In response, researchers have created data loss prevention systems that check outgoing traffic for known confidential information. These systems stop naive adversaries from leaking data, but are fundamentally unable to identify encrypted or obfuscated information leaks. What remains is a high-capacity pipe for tunneling data to the Internet. We present an approach for quantifying information leak capacity in network traffic. Instead of trying to detect the presence of sensitive data, an impossible task in the general case, our goal is to measure and constrain its maximum volume. We take advantage of the insight that most network traffic is repeated or determined by external information, such as protocol specifications or messages sent by a server. By filtering this data, we can isolate and quantify true information flowing from a computer. In this paper, we present measurement algorithms for the Hypertext Transfer Protocol (HTTP), the main protocol for Web browsing. When applied to real Web browsing traffic, the algorithms were able to discount 98.5% of measured bytes and effectively isolate information leaks.

CHAPTER 2

2.0 SYSTEM ANALYSIS

2.1 EXISTING SYSTEM:

In existing systems, detecting and preventing data leaks requires a set of complementary solutions, which may include data-leak detection, data confinement, stealthy malware detection, and policy enforcement.

Network data-leak detection (DLD) typically performs deep packet inspection (DPI) and searches for any occurrences of sensitive data patterns. DPI is a technique to analyze payloads of IP/TCP packets for inspecting application layer data, e.g., HTTP header/content.

Alerts are triggered when the amount of sensitive data found in traffic passes a threshold. The detection system can be deployed on a router or integrated into existing network intrusion detection systems (NIDS).
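The threshold rule described above can be illustrated with a minimal sketch. It assumes the detector holds the sensitive patterns in plaintext (the very requirement criticized in the next paragraph) and counts how many payload bytes they cover. The function names and the byte-count threshold are hypothetical and are not taken from any particular DPI product.

```python
from typing import Sequence


def count_sensitive_bytes(payload: bytes, patterns: Sequence[bytes]) -> int:
    """Count payload bytes covered by at least one sensitive pattern."""
    covered = bytearray(len(payload))  # 1 marks a byte inside some match
    for pattern in patterns:
        if not pattern:
            continue
        start = payload.find(pattern)
        while start != -1:
            covered[start:start + len(pattern)] = b"\x01" * len(pattern)
            start = payload.find(pattern, start + 1)
    return sum(covered)


def should_alert(payload: bytes, patterns: Sequence[bytes], threshold: int) -> bool:
    """Raise an alert when the amount of sensitive data passes the threshold."""
    return count_sensitive_bytes(payload, patterns) > threshold


# Hypothetical application-layer payload and sensitive patterns.
payload = b"POST /submit ssn=123-45-6789&note=hello"
patterns = [b"123-45-6789", b"45-6789"]
print(count_sensitive_bytes(payload, patterns), should_alert(payload, patterns, 8))
```

Overlapping matches are counted once, so a short pattern contained in a longer one does not inflate the total.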

Straightforward realizations of data-leak detection require the plaintext sensitive data. However, this requirement is undesirable, as it may threaten the confidentiality of the sensitive information. If a detection system is compromised, then it may expose the plaintext sensitive data (in memory).

In addition, the data owner may need to outsource the data-leak detection to providers, but may be unwilling to reveal the plaintext sensitive data to them. Therefore, one needs new data-leak detection solutions that allow the providers to scan content for leaks without learning the sensitive information.

2.1.1 DISADVANTAGES:

As the Internet grows and network bandwidth continues to increase, administrators are faced with the task of keeping confidential information from leaving their networks. In response, researchers have created data loss prevention systems that check outgoing traffic for known confidential information.

These systems stop naive adversaries from leaking data, but are fundamentally unable to identify encrypted or obfuscated information leaks. What remains is a high-capacity pipe for tunneling data to the Internet.

The existing approach quantifies information leak capacity in network traffic. Instead of trying to detect the presence of sensitive data (an impossible task in the general case), its goal is to measure and constrain the maximum volume of leaked data.

It relies on the insight that most network traffic is repeated or determined by external information, such as protocol specifications or messages sent by a server. By filtering this data, it can isolate and quantify the true information flowing from a computer.

2.2 PROPOSED SYSTEM:

We propose a data-leak detection solution that can be outsourced and deployed in a semi-honest detection environment. We design, implement, and evaluate our fuzzy fingerprint technique, which enhances data privacy during data-leak detection operations.

Our approach is based on a fast and practical one-way computation on the sensitive data (SSN records, classified documents, sensitive emails, etc.). It enables the data owner to securely delegate the content-inspection task to DLD providers without exposing the sensitive data.

Using our detection method, the DLD provider, who is modeled as an honest-but-curious (aka semi-honest) adversary, can only gain limited knowledge about the sensitive data from either the released digests or the content being inspected. Using our techniques, an Internet service provider (ISP) can perform detection on its customers’ traffic securely and provide data-leak detection as an add-on service for its customers. In another scenario, individuals can mark their own sensitive data and ask the administrator of their local network to detect data leaks for them.

In our detection procedure, the data owner computes a special set of digests or fingerprints from the sensitive data and then discloses only a small number of them to the DLD provider. The DLD provider computes fingerprints from network traffic and identifies potential leaks in them.

To prevent the DLD provider from gathering exact knowledge about the sensitive data, the collection of potential leaks is composed of real leaks and noises. It is the data owner, who post-processes the potential leaks sent back by the DLD provider and determines whether there is any real data leak.

2.2.1 ADVANTAGES:

We describe a privacy-preserving data-leak detection model for preventing inadvertent data leaks in network traffic. Our model supports delegation of the detection operation, so ISPs can provide data-leak detection as an add-on service to their customers.

We design, implement, and evaluate an efficient technique, fuzzy fingerprint, for privacy-preserving data-leak detection. Fuzzy fingerprints are special sensitive data digests prepared by the data owner for release to the DLD provider.

We implement our detection system and perform an extensive experimental evaluation on the Internet surfing traffic of 20 users, as well as on 5 simulated real-world data-leak scenarios, to measure its privacy guarantee, detection rate, and efficiency.

Our results indicate that the underlying scheme achieves high accuracy with a very low false positive rate. They also show that the detection accuracy does not degrade much when only partial (sampled) sensitive-data digests are used. We also give an empirical analysis of our fuzzification and of the fairness of partial fingerprint disclosure.

2.3 HARDWARE & SOFTWARE REQUIREMENTS:

2.3.1 HARDWARE REQUIREMENT:

- Processor – Pentium IV
- Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Key Board – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA

2.3.2 SOFTWARE REQUIREMENTS:

Operating System : Windows XP or Win7
Front End : Microsoft Visual Studio .NET
Back End : MS-SQL Server
Server : ASP .NET Web Server
Script : C# Script
Document : MS-Office 2007

CHAPTER 3

3.0 SYSTEM DESIGN:

Data Flow Diagram / Use Case Diagram / Flow Diagram:

The DFD is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on these data, and the output data generated by the system.

The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components. These components are the system process, the data used by the process, an external entity that interacts with the system and the information flows in the system.

DFD shows how the information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.

A DFD may be used to represent a system at any level of abstraction. It may be partitioned into levels that represent increasing information flow and functional detail.

NOTATION:

SOURCE OR DESTINATION OF DATA:

External sources or destinations, which may be people or organizations or other entities

DATA SOURCE:

Here the data referenced by a process is stored and retrieved.

PROCESS:

People, procedures, or devices that produce data; the physical component is not identified.

DATA FLOW:

Data moves in a specific direction from an origin to a destination. The data flow is a “packet” of data.

MODELING RULES:

There are several common modeling rules when creating DFDs:

All processes must have at least one data flow in and one data flow out.
All processes should modify the incoming data, producing new forms of outgoing data.
Each data store must be involved with at least one data flow.
Each external entity must be involved with at least one data flow.
A data flow must be attached to at least one process.

3.1 DATAFLOW DIAGRAM

UML DIAGRAMS:

3.2 USE CASE DIAGRAM:

3.3 CLASS DIAGRAM:

3.4 SEQUENCE DIAGRAM:

3.5 ACTIVITY DIAGRAM:

CHAPTER 4

4.0 IMPLEMENTATION:

FUZZY FINGERPRINT METHOD AND PROTOCOL

We describe the technical details of our fuzzy fingerprint mechanism in this section. The DLD provider obtains digests of sensitive data from the data owner. The data owner uses a sliding window and the Rabin fingerprint algorithm to generate short and hard-to-reverse (i.e., one-way) digests through the fast polynomial modulus operation. The sliding window generates small fragments of the processed data (sensitive data or network traffic), which preserves the local features of the data and provides the noise tolerance property. Rabin fingerprints are computed as polynomial modulus operations, and can be implemented with fast XOR, shift, and table look-up operations.

The Rabin fingerprint algorithm has a unique min-wise independence property, which supports fast random fingerprints selection (in uniform distribution) for partial fingerprints disclosure. The shingle-and-fingerprint process is defined as follows. A sliding window is used to generate q-grams on an input binary string first. The fingerprints of q-grams are then computed. A shingle (q-gram) is a fixed-size sequence of contiguous bytes. For example, the 3-gram shingle set of string abcdefgh consists of six elements {abc, bcd, cde, def, efg, fgh}. Local feature preservation is accomplished through the use of shingles. Therefore, our approach can tolerate sensitive data modification to some extent, e.g., inserted tags, small amount of character substitution, and lightly reformatted data.
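The shingle-and-fingerprint process can be sketched as follows. This is a bit-by-bit illustration and not the table-driven implementation mentioned above. The modulus is the CRC-32 generator polynomial, used here only as an assumed 33-bit placeholder; a deployment would choose its own irreducible polynomial. The names shingles and rabin_fingerprint are hypothetical.

```python
POLY = 0x104C11DB7  # assumed 33-bit modulus (degree 32), placeholder choice
DEGREE = 32


def shingles(data: bytes, q: int) -> list:
    """Slide a window of q bytes over the input and return every q-gram."""
    return [data[i:i + q] for i in range(len(data) - q + 1)]


def rabin_fingerprint(data: bytes, poly: int = POLY, degree: int = DEGREE) -> int:
    """Treat data as a polynomial over GF(2) and reduce it modulo poly."""
    fp = 0
    for byte in data:
        for bit in range(7, -1, -1):
            fp = (fp << 1) | ((byte >> bit) & 1)  # shift in the next coefficient
            if fp >> degree:                      # degree reached: subtract (XOR) modulus
                fp ^= poly
    return fp


# The 3-gram example from the text, followed by one fingerprint per shingle.
grams = shingles(b"abcdefgh", 3)
print(grams)
print([hex(rabin_fingerprint(g)) for g in grams])
```

Because the reduction uses only shifts and XORs, the fingerprint is linear over GF(2): for equal-length inputs, the fingerprint of their XOR equals the XOR of their fingerprints.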

From the detection perspective, a straightforward method is for the DLD provider to raise an alert if any sensitive fingerprint matches the fingerprints from the traffic. However, this approach has a privacy issue. If there is a data leak, there is a match between two fingerprints from sensitive data and network traffic. Then, the DLD provider learns the corresponding shingle, as it knows the content of the packet. Therefore, the central challenge is to prevent the DLD provider from learning the sensitive values even in data-leak scenarios, while allowing the provider to carry out the traffic inspection.

We propose an efficient technique to address this problem. The main idea is to relax the comparison criteria by strategically introducing matching instances on the DLD provider’s side without increasing false alarms for the data owner. Specifically, i) the data owner perturbs the sensitive-data fingerprints before disclosing them to the DLD provider, and ii) the DLD provider detects leaking by a range-based comparison instead of the exact match. The range used in the comparison is pre-defined by the data owner and correlates to the perturbation procedure.
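One possible realization of this perturbation and range-based comparison is sketched below. The text above does not fix the details, so the sketch assumes that the data owner replaces the d least significant bits of each fingerprint with random bits, and that the provider compares only the remaining high-order bits. The names fuzzify, in_fuzzy_range, and d are hypothetical.

```python
import secrets


def fuzzify(fp: int, d: int) -> int:
    """Data owner: hide fp by randomizing its d least significant bits."""
    mask = (1 << d) - 1
    noise = secrets.randbits(d) if d else 0
    return (fp & ~mask) | noise


def in_fuzzy_range(traffic_fp: int, released_fp: int, d: int) -> bool:
    """DLD provider: range-based comparison that ignores the low d bits."""
    return (traffic_fp >> d) == (released_fp >> d)


sensitive_fp = 0xDEADBEEF          # hypothetical 32-bit fingerprint
released = fuzzify(sensitive_fp, 8)

print(in_fuzzy_range(sensitive_fp, released, 8))   # the real fingerprint is in range
print(in_fuzzy_range(0xDEADBE00, released, 8))     # so is a non-sensitive neighbour
print(2 ** 8)                                      # number of values sharing the range
```

Under this assumption, every released value stands for 2^d candidate fingerprints, so the provider sees matches without knowing which candidate is the sensitive one. The data owner, who knows the unperturbed fingerprints, can discard the extra matches exactly.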

4.2 MODULES:

NETWORK SECURITY PRIVACY:

SECURITY GOAL AND THREAT MODEL:

PRIVACY GOAL AND THREAT MODEL:

PRIVACY-ENHANCING DLD:

EXPERIMENTAL EVALUATION

4.3 MODULE DESCRIPTION:

NETWORK SECURITY PRIVACY:

Decoy network-accessible resources may be deployed in a network as surveillance and early-warning tools, since such resources are not normally accessed for legitimate purposes. Techniques used by attackers who attempt to compromise these decoy resources are studied during and after an attack to keep an eye on new exploitation techniques. Such analysis may be used to further tighten the security of the actual network being protected. A decoy can also direct an attacker’s attention away from legitimate servers: it encourages attackers to spend their time and energy on the decoy server while distracting their attention from the data on the real server. Similarly, a decoy network is a network set up with intentional vulnerabilities. Its purpose is also to invite attacks so that the attacker’s methods can be studied and that information can be used to increase network security.

SECURITY GOAL AND THREAT MODEL:

We categorize three causes for sensitive data to appear on the outbound traffic of an organization, including the legitimate data use by the employees.

• Case I Inadvertent data leak: The sensitive data is accidentally leaked in the outbound traffic by a legitimate user. This paper focuses on detecting this type of accidental data leak over supervised network channels. An inadvertent data leak may be due to human errors, such as forgetting to use encryption or carelessly forwarding an internal email and attachments to outsiders, or due to application flaws. A supervised network channel could be an unencrypted channel or an encrypted channel where the content in it can be extracted and checked by an authority. Such a channel is widely used for advanced NIDS where MITM (man-in-the-middle) SSL sessions are established instead of normal SSL sessions.

• Case II Malicious data leak: A rogue insider or a piece of stealthy software may steal sensitive personal or organizational data from a host. Because the malicious adversary can use strong private encryption, steganography, or covert channels to disable content-based traffic inspection, this type of leak is out of the scope of our network-based solution. Host-based defenses (such as detecting the infection onset) need to be deployed instead.

• Case III Legitimate and intended data transfer: The sensitive data is sent by a legitimate user intended for legitimate purposes. In this paper, we assume that the data owner is aware of legitimate data transfers and permits such transfers. So the data owner can tell whether a piece of sensitive data in the network traffic is a leak using legitimate data transfer policies.

PRIVACY GOAL AND THREAT MODEL:

To prevent the DLD provider from gaining knowledge of sensitive data during the detection process, we need to set up a privacy goal that is complementary to the security goal above. We model the DLD provider as a semi-honest adversary, who follows our protocol to carry out the operations, but may attempt to gain knowledge about the sensitive data of the data owner. Our privacy goal is defined as follows. The DLD provider is given digests of sensitive data from the data owner and the content of network traffic to be examined. The DLD provider should not find out the exact value of a piece of sensitive data with a probability greater than 1/K, where K is an integer representing the number of all possible sensitive-data candidates that can be inferred by the DLD provider. We present a privacy-preserving DLD model with a new fuzzy fingerprint mechanism to improve the data protection against a semi-honest DLD provider. We generate digests of sensitive data through a one-way function, and then hide the sensitive values among other non-sensitive values via fuzzification. The privacy guarantee of such an approach is much higher than 1/K when there is no leak in traffic, because the adversary’s inference can only be gained through brute-force guesses. The traffic content is accessible by the DLD provider in plaintext. Therefore, in the event of a data leak, the DLD provider may learn sensitive information from the traffic, which is inevitable for all deep packet inspection approaches. Our solution confines the maximal amount of information learned during the detection and provides a quantitative guarantee for data privacy.

PRIVACY-ENHANCING DLD:

Our privacy-preserving data-leak detection method supports practical data-leak detection as a service and minimizes the knowledge that a DLD provider may gain during the process. Fig. 1 lists the six operations executed by the data owner and the DLD provider in our protocol. They include PREPROCESS run by the data owner to prepare the digests of sensitive data, RELEASE for the data owner to send the digests to the DLD provider, MONITOR and DETECT for the DLD provider to collect outgoing traffic of the organization, compute digests of traffic content, and identify potential leaks, REPORT for the DLD provider to return data-leak alerts to the data owner where there may be false positives (i.e., false alarms), and POSTPROCESS for the data owner to pinpoint true data-leak instances. Details are presented in the next section. The protocol is based on strategically computing data similarity, specifically the quantitative similarity between the sensitive information and the observed network traffic. High similarity indicates potential data leak. For data-leak detection, the ability to tolerate a certain degree of data transformation in traffic is important. We refer to this property as noise tolerance.
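The flow of the six operations can be sketched end to end. This is a simplified illustration under stated assumptions. zlib.crc32 stands in for the Rabin fingerprint, perturbation randomizes the low D bits of each fingerprint, and RELEASE and REPORT are modeled as plain value passing. The function names, the fuzzy length D, and the sample data are all hypothetical.

```python
import random
import zlib

Q = 8  # shingle size in bytes
D = 4  # assumed fuzzy length: number of low bits hidden from the provider


def fingerprints(data: bytes) -> set:
    """Fingerprint every Q-byte shingle (crc32 is a stand-in for Rabin)."""
    return {zlib.crc32(data[i:i + Q]) for i in range(len(data) - Q + 1)}


def preprocess(sensitive: bytes, rng: random.Random) -> set:
    """PREPROCESS (data owner): compute fingerprints and perturb the low D bits."""
    return {(f >> D << D) | rng.getrandbits(D) for f in fingerprints(sensitive)}


def detect(traffic: bytes, released: set) -> set:
    """MONITOR + DETECT (provider): keep traffic fingerprints in a released range."""
    prefixes = {f >> D for f in released}
    return {f for f in fingerprints(traffic) if f >> D in prefixes}


def postprocess(reported: set, sensitive: bytes) -> set:
    """POSTPROCESS (data owner): keep only exact matches, dropping the noise."""
    return reported & fingerprints(sensitive)


sensitive = b"SSN: 123-45-6789 belongs to Alice"
traffic = b"POST /form body=" + sensitive + b"&ok=1"

rng = random.Random(0)                    # a real deployment would use a secure RNG
released = preprocess(sensitive, rng)     # RELEASE: owner -> provider
reported = detect(traffic, released)      # REPORT: provider -> owner
real_leaks = postprocess(reported, sensitive)
print(len(released), len(reported), len(real_leaks))
```

Only the data owner holds the unperturbed fingerprints, so only the owner can separate true leaks from the false positives that the provider reports.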

Our key idea for fast and noise-tolerant comparison is the design and use of a set of local features that are representatives of local data patterns, e.g., when byte b2 appears in the sensitive data, it is usually surrounded by bytes b1 and b3 forming a local pattern b1, b2, b3. Local features preserve data patterns even when modifications (insertion, deletion, and substitution) are made to parts of the data. For example, if a byte b4 is inserted after b3, the local pattern b1, b2, b3 is retained though the global pattern (e.g., a hash of the entire document) is destroyed. To achieve the privacy goal, the data owner generates a special type of digests, which we call fuzzy fingerprints. Intuitively, the purpose of fuzzy fingerprints is to hide the true sensitive data in a crowd. It prevents the DLD provider from learning its exact value. We describe the technical details next.
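The difference between local and global features can be seen in a small sketch. It compares the fraction of shingles that survive a one-byte insertion with a hash of the whole input. The containment measure and the function names are illustrative assumptions, not the similarity function of the protocol.

```python
import hashlib


def shingle_set(data: bytes, q: int = 3) -> set:
    """Local features: the set of all q-byte shingles of the input."""
    return {data[i:i + q] for i in range(len(data) - q + 1)}


def containment(sensitive: bytes, traffic: bytes, q: int = 3) -> float:
    """Fraction of the sensitive shingles that also appear in the traffic."""
    s = shingle_set(sensitive, q)
    if not s:
        return 0.0
    return len(s & shingle_set(traffic, q)) / len(s)


original = b"abcdefgh"
modified = b"abcXdefgh"  # one byte inserted after "abc"

# Global feature: a hash of the entire input changes completely.
print(hashlib.sha256(original).digest() == hashlib.sha256(modified).digest())
# Local features: most shingles are unaffected by the insertion.
print(containment(original, modified))
```

In this example, four of the six original shingles (abc, def, efg, fgh) are still present after the insertion, while the whole-document hash no longer matches.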

EXPERIMENTAL EVALUATION:

Our data-leak detection solution can be outsourced and deployed in a semi-honest detection environment, using a fuzzy fingerprint technique that enhances data privacy during data-leak detection operations. Our approach is based on a fast and practical one-way computation on the sensitive data (SSN records, classified documents, sensitive emails, etc.). It enables the data owner to securely delegate the content-inspection task to DLD providers without exposing the sensitive data. Using our detection method, the DLD provider, who is modeled as an honest-but-curious (aka semi-honest) adversary, can only gain limited knowledge about the sensitive data from either the released digests or the content being inspected.

Using our techniques, an Internet service provider (ISP) can perform detection on its customers’ traffic securely and provide data-leak detection as an add-on service for its customers. We implement our fuzzy fingerprint framework in Python, including packet collection, shingling, Rabin fingerprinting, as well as partial disclosure and fingerprint filter extensions. The Rabin fingerprint is based on cyclic redundancy code (CRC). We use a padding scheme to handle small inputs. In all experiments, the shingles are 8 bytes long and the fingerprints are 32 bits long (33-bit irreducible polynomials in the Rabin fingerprint).

We set up a networking environment in VirtualBox, and make a scenario where the sensitive data is leaked from a local network to the Internet. Multiple users’ hosts (Windows 7) are put in the local network, which connect to the Internet via a gateway (Fedora). Multiple servers (HTTP, FTP, etc.) and an attacker-controlled host are put on the Internet side. The gateway dumps the network traffic and sends it to a DLD server/provider (Linux). Using the sensitive-data fingerprints defined by the users in the local network, the DLD server performs off-line data-leak detection.

CHAPTER 5

5.0 SYSTEM STUDY:

5.1 FEASIBILITY STUDY:

Three key considerations involved in the feasibility analysis are

ECONOMICAL FEASIBILITY
TECHNICAL FEASIBILITY
SOCIAL FEASIBILITY

5.1.1 ECONOMICAL FEASIBILITY:

5.1.2 TECHNICAL FEASIBILITY:

5.1.3 SOCIAL FEASIBILITY:

5.2 SYSTEM TESTING:

5.2.1 UNIT TESTING:

Description	Expected result
Test for application window properties.	All the properties of the windows are to be properly aligned and displayed.
Test for mouse operations.	All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions.

5.2.2 FUNCTIONAL TESTING:

Description	Expected result
Test for all modules.	All peers should communicate in the group.
Test for various peers in a distributed network framework; it should display all users available in the group.	The execution should give the accurate result.

5.2.3 NON-FUNCTIONAL TESTING:

Load testing
Performance testing
Usability testing
Reliability testing
Security testing

5.2.4 LOAD TESTING:

Description	Expected result
It is necessary to ascertain that the application behaves correctly under loads when ‘Server busy’ response is received.	Should designate another active node as a Server.

5.2.5 PERFORMANCE TESTING:

Description	Expected result
This is required to assure that an application performs adequately, having the capability to handle many peers, delivering its results in the expected time, and using an acceptable level of resources; it is an aspect of operational management.	Should handle large input values and produce accurate results in the expected time.

5.2.6 RELIABILITY TESTING:

Description	Expected result
This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in providing the application.	In case of failure of the server, an alternate server should take over the job.

5.2.7 SECURITY TESTING:

Description	Expected result
Checking that the user identification is authenticated.	In case of failure, the user should not be connected to the framework.
Check whether group keys in a tree are shared by all peers.	The peers should know group key in the same group.

5.2.8 WHITE BOX TESTING:

Description	Expected result
Exercise all logical decisions on their true and false sides.	All the logical decisions must be valid.
Execute all loops at their boundaries and within their operational bounds.	All the loops must be finite.
Exercise internal data structures to ensure their validity.	All the data structures must be valid.

5.2.9 BLACK BOX TESTING:

Description	Expected result
To check for incorrect or missing functions.	All the functions must be valid.
To check for interface errors.	The entire interface must function normally.
To check for errors in data structures or external database access.	Database update and retrieval must be performed correctly.
To check for initialization and termination errors.	All the functions and data structures must be initialized properly and terminated normally.

All of the above system testing strategies are carried out, as the development, documentation, and institutionalization of the proposed goals and related policies are essential.

CHAPTER 7

7.0 SOFTWARE SPECIFICATION:

7.1 FEATURES OF .NET:

7.2 THE .NET FRAMEWORK

The .NET Framework has two main parts:

1. The Common Language Runtime (CLR).

2. A hierarchical set of class libraries.

The CLR is described as the “execution engine” of .NET. It provides the environment within which programs run. The most important features are

Conversion from a low-level assembler-style language, called Intermediate Language (IL), into code native to the platform being executed on.
Memory management, notably including garbage collection.
Checking and enforcing security restrictions on the running code.
Loading and executing programs, with version control and other such features.
The following features of the .NET framework are also worth description:

Managed Code

Managed Data

Common Type System

Common Language Specification

7.3 THE CLASS LIBRARY

The set of classes is pretty comprehensive, providing collections, file, screen, and network I/O, threading, and so on, as well as XML and database connectivity.

The class library is subdivided into a number of sets (or namespaces), each providing distinct areas of functionality, with dependencies between the namespaces kept to a minimum.

7.4 LANGUAGES SUPPORTED BY .NET

Visual Basic .NET is also CLS compliant, which means that any CLS-compliant language can use the classes, objects, and components you create in Visual Basic .NET.

Other languages for which .NET compilers are available include

FORTRAN
COBOL
Eiffel

ASP.NET XML WEB SERVICES	Windows Forms
Base Class Libraries
Common Language Runtime
Operating System

Fig1 .Net Framework

CONSTRUCTORS AND DESTRUCTORS:

GARBAGE COLLECTION

OVERLOADING

MULTITHREADING:

STRUCTURED EXCEPTION HANDLING

7.5 THE .NET FRAMEWORK

The .NET Framework is a new computing platform that simplifies application development in the highly distributed environment of the Internet.

OBJECTIVES OF .NET FRAMEWORK

1. To provide a consistent object-oriented programming environment, whether object code is stored and executed locally, executed locally but Internet-distributed, or executed remotely.

2. To provide a code-execution environment that minimizes software deployment conflicts and guarantees safe execution of code.

3. To eliminate performance problems.

There are different types of application, such as Windows-based applications and Web-based applications.

7.6 FEATURES OF SQL-SERVER

A SQL-SERVER database consists of the following types of objects:

1. TABLE

2. QUERY

3. FORM

4. REPORT

5. MACRO

7.7 TABLE:

A table is a collection of data about a specific topic.

VIEWS OF TABLE:

We can work with a table in two views:

1. Design View

2. Datasheet View

Design View

To build or modify the structure of a table, we work in the table design view. There we can specify what kind of data will be held.

Datasheet View

To add, edit, or analyze the data itself, we work in the table's datasheet view mode.

QUERY:

CHAPTER 7

APPENDIX

7.1 SAMPLE SOURCE CODE

7.2 SAMPLE OUTPUT

CHAPTER 8

8.1 CONCLUSION

Our fuzzy fingerprint method differs from these solutions and enables its adopter to provide data-leak detection as a service. Using our approach, the customer or data owner does not need to fully trust the DLD provider. A Bloom filter is a space-saving data structure for set membership tests, and it is used in network security starting from the network layer. The fuzzy Bloom filter constructs a special Bloom filter that probabilistically sets the corresponding filter bits to 1’s. Although it was designed to support a resource-sufficient routing scheme, it is a potential privacy-preserving technique. We do not invent a variant of the Bloom filter for our fuzzy fingerprint, and our fuzzification process is separate from the membership test. The advantage of separating fingerprint fuzzification from the membership test is that it is flexible to test whether the fingerprint is sensitive with or without fuzzification.
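For readers unfamiliar with the data structure, a plain (non-fuzzy) Bloom filter membership test can be sketched as follows. This is a generic textbook construction, not the fuzzy Bloom filter variant and not part of our fingerprint scheme. The class name, the filter size, and the use of SHA-256 to derive bit positions are assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Space-saving set membership test with possible false positives."""

    def __init__(self, m_bits: int = 1024, k: int = 3):
        self.m = m_bits                          # number of bits in the filter
        self.k = k                               # number of hash functions
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Derive k bit positions by salting SHA-256 with the hash index.
        for i in range(self.k):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: bytes) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: bytes) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


bf = BloomFilter()
for fingerprint in (b"\x04\xc1\x1d\xb7", b"\xde\xad\xbe\xef"):
    bf.add(fingerprint)
print(b"\xde\xad\xbe\xef" in bf)  # an added item is always reported as present
```

A Bloom filter never produces false negatives. An item that was not added may occasionally be reported as present, at a rate that depends on the filter size, the number of hash functions, and the number of stored items.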

Besides our fuzzy fingerprint solution for data-leak detection, there are other privacy-preserving techniques invented for specific processes, e.g., data matching, or for general purpose use, e.g., secure multi-party computation (SMC). SMC is a cryptographic mechanism that supports a wide range of fundamental arithmetic, set, and string operations, as well as complex functions such as knapsack computation, automated trouble-shooting, network event statistics, private information retrieval, genomic computation, private database query, private join operations, and distributed data mining. The provable privacy guarantees offered by SMC come at a cost in terms of computational complexity and realization difficulty. The advantage of our approach is its concision and efficiency.

8.2 FUTURE ENHANCEMENT:

We proposed fuzzy fingerprint, a privacy-preserving data-leak detection model, and presented its realization. Using special digests, the exposure of the sensitive data is kept to a minimum during the detection. We have conducted extensive experiments to validate the accuracy, privacy, and efficiency of our solutions. For future work, we plan to focus on designing a host-assisted mechanism for complete data-leak detection in large-scale organizations.

Panda Public Auditing for Shared Data with Efficient User Revocation in the Cloud

05/08/201902/07/2019 by admin

ABSTRACT:

With data storage and sharing services in the cloud, users can easily modify and share data as a group. To ensure shared data integrity can be verified publicly, users in the group need to compute signatures on all the blocks in shared data. Different blocks in shared data are generally signed by different users due to data modifications performed by different users. For security reasons, once a user is revoked from the group, the blocks which were previously signed by this revoked user must be re-signed by an existing user. The straightforward method, which allows an existing user to download the corresponding part of shared data and re-sign it during user revocation, is inefficient due to the large size of shared data in the cloud. In this paper, we propose a novel public auditing mechanism for the integrity of shared data with efficient user revocation in mind. By utilizing the idea of proxy re-signatures, we allow the cloud to re-sign blocks on behalf of existing users during user revocation, so that existing users do not need to download and re-sign blocks by themselves. In addition, a public verifier is always able to audit the integrity of shared data without retrieving the entire data from the cloud, even if some part of shared data has been re-signed by the cloud. Moreover, our mechanism is able to support batch auditing by verifying multiple auditing tasks simultaneously. Experimental results show that our mechanism can significantly improve the efficiency of user revocation.

INTRODUCTION

With data storage and sharing services (such as Dropbox and Google Drive) provided by the cloud, people can easily work together as a group by sharing data with each other. More specifically, once a user creates shared data in the cloud, every user in the group is able to not only access and modify shared data, but also share the latest version of the shared data with the rest of the group. Although cloud providers promise a more secure and reliable environment to the users, the integrity of data in the cloud may still be compromised, due to the existence of hardware/software failures and human errors.

To protect the integrity of data in the cloud, a number of mechanisms have been proposed. In these mechanisms, a signature is attached to each block in data, and the integrity of data relies on the correctness of all the signatures. One of the most significant and common features of these mechanisms is to allow a public verifier to efficiently check data integrity in the cloud without downloading the entire data, referred to as public auditing (or denoted as Provable Data Possession). This public verifier could be a client who would like to utilize cloud data for particular purposes (e.g., search, computation, data mining, etc.) or a third-party auditor (TPA) who is able to provide verification services on data integrity to users. Most of the previous works focus on auditing the integrity of personal data. Different from these works, several recent works focus on how to preserve identity privacy from public verifiers when auditing the integrity of shared data. Unfortunately, none of the above mechanisms considers the efficiency of user revocation when auditing the correctness of shared data in the cloud.

With shared data, once a user modifies a block, she also needs to compute a new signature for the modified block. Due to the modifications from different users, different blocks are signed by different users. For security reasons, when a user leaves the group or misbehaves, this user must be revoked from the group. As a result, this revoked user should no longer be able to access and modify shared data, and the signatures generated by this revoked user are no longer valid to the group. Therefore, although the content of shared data is not changed during user revocation, the blocks, which were previously signed by the revoked user, still need to be re-signed by an existing user in the group. As a result, the integrity of the entire data can still be verified with the public keys of existing users only.

Since shared data is outsourced to the cloud and users no longer store it on local devices, a straightforward method to re-compute these signatures during user revocation is to ask an existing user to first download the blocks previously signed by the revoked user, verify the correctness of these blocks, then re-sign these blocks, and finally upload the new signatures to the cloud. However, this straightforward method may cost the existing user a huge amount of communication and computation resources by downloading and verifying blocks, and by re-computing and uploading signatures, especially when the number of re-signed blocks is quite large or the membership of the group is frequently changing. To make matters worse, existing users may access their data sharing services provided by the cloud with resource-limited devices, such as mobile phones, which further prevents existing users from maintaining the correctness of shared data efficiently during user revocation.

Clearly, if the cloud could possess each user’s private key, it can easily finish the re-signing task for existing users without asking them to download and re-sign blocks. However, since the cloud is not in the same trusted domain with each user in the group, outsourcing every user’s private key to the cloud would introduce significant security issues. Another important problem we need to consider is that the re-computation of any signature during user revocation should not affect the most attractive property of public auditing — auditing data integrity publicly without retrieving the entire data. Therefore, how to efficiently reduce the significant burden to existing users introduced by user revocation, and still allow a public verifier to check the integrity of shared data without downloading the entire data from the cloud, is a challenging task.

In this paper, we propose Panda, a novel public auditing mechanism for the integrity of shared data with efficient user revocation in the cloud. In our mechanism, by utilizing the idea of proxy re-signatures, once a user in the group is revoked, the cloud is able to re-sign the blocks that were signed by the revoked user, using a re-signing key. As a result, the efficiency of user revocation can be significantly improved, and the computation and communication resources of existing users can be saved. Meanwhile, the cloud, which is not in the same trusted domain as each user, is only able to convert a signature of the revoked user into a signature of an existing user on the same block; it cannot sign arbitrary blocks on behalf of either the revoked user or an existing user. By designing a new proxy re-signature scheme with properties that traditional proxy re-signatures do not have, our mechanism is always able to check the integrity of shared data without retrieving the entire data from the cloud.
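The algebra behind converting one user's signature into another's can be sketched with a toy example. This is not the Panda scheme: it uses tiny, insecure parameters, omits the pairing-based public verification a real scheme needs, and computes the re-signing key directly from both secret keys, whereas a real protocol would derive it without revealing either key to the cloud. All names and constants below are assumptions made for illustration.

```python
import hashlib

# Toy parameters, far too small for real use: P = 2*Q + 1, both prime.
P = 2039
Q = 1019


def hash_to_group(message: bytes) -> int:
    """Map a message into the order-Q subgroup of squares modulo P."""
    digest = int.from_bytes(hashlib.sha256(message).digest(), "big")
    return pow(1 + digest % (P - 1), 2, P)


def sign(secret_key: int, message: bytes) -> int:
    """Toy signature: H(m)^sk mod P."""
    return pow(hash_to_group(message), secret_key, P)


def resign_key(sk_from: int, sk_to: int) -> int:
    """Re-signing key rk = sk_to / sk_from mod Q (requires Python 3.8+)."""
    return (sk_to * pow(sk_from, -1, Q)) % Q


def re_sign(signature: int, rk: int) -> int:
    """Convert a signature using only rk, without either secret key."""
    return pow(signature, rk, P)


sk_revoked, sk_existing = 123, 456
rk = resign_key(sk_revoked, sk_existing)
block = b"block-7"
converted = re_sign(sign(sk_revoked, block), rk)
# The converted value equals what the existing user would have produced.
matches = converted == sign(sk_existing, block)
```

Because the holder of `rk` only raises an existing signature to a power, it can convert signatures on blocks that were already signed but cannot produce a signature on a new block by itself.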

LITERATURE SURVEY

PUBLIC AUDITING FOR SHARED DATA WITH EFFICIENT USER REVOCATION IN THE CLOUD

PUBLICATION: B. Wang, B. Li, and H. Li, in the Proceedings of IEEE INFOCOM 2013, 2013, pp. 2904–2912.

With data storage and sharing services in the cloud, users can easily modify and share data as a group. To ensure shared data integrity can be verified publicly, users in the group need to compute signatures on all the blocks in shared data. Different blocks in shared data are generally signed by different users due to data modifications performed by different users. For security reasons, once a user is revoked from the group, the blocks which were previously signed by this revoked user must be re-signed by an existing user. The straightforward method, which allows an existing user to download the corresponding part of shared data and re-sign it during user revocation, is inefficient due to the large size of shared data in the cloud. In this paper, we propose a novel public auditing mechanism for the integrity of shared data with efficient user revocation in mind. By utilizing the idea of proxy re-signatures, we allow the cloud to re-sign blocks on behalf of existing users during user revocation, so that existing users do not need to download and re-sign blocks by themselves. In addition, a public verifier is always able to audit the integrity of shared data without retrieving the entire data from the cloud, even if some part of shared data has been re-signed by the cloud. Moreover, our mechanism is able to support batch auditing by verifying multiple auditing tasks simultaneously. Experimental results show that our mechanism can significantly improve the efficiency of user revocation.

A VIEW OF CLOUD COMPUTING

PUBLICATION: M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. H. Katz, A. Konwinski, G. Lee, D. A. Patterson, A. Rabkin, I. Stoica, and M. Zaharia, Communications of the ACM, vol. 53, no. 4, pp. 50–58, April 2010.

Cloud computing, the long-held dream of computing as a utility, has the potential to transform a large part of the IT industry, making software even more attractive as a service and shaping the way IT hardware is designed and purchased. Developers with innovative ideas for new Internet services no longer require the large capital outlays in hardware to deploy their service or the human expense to operate it. They need not be concerned about overprovisioning for a service whose popularity does not meet their predictions, thus wasting costly resources, or underprovisioning for one that becomes wildly popular, thus missing potential customers and revenue. Moreover, companies with large batch-oriented tasks can get results as quickly as their programs can scale, since using 1,000 servers for one hour costs no more than using one server for 1,000 hours. This elasticity of resources, without paying a premium for large scale, is unprecedented in the history of IT.

PROVABLE DATA POSSESSION AT UNTRUSTED STORES

PUBLICATION: G. Ateniese, R. Burns, R. Curtmola, J. Herring, L. Kissner, Z. Peterson, and D. Song, in the Proceedings of ACM CCS 2007, 2007, pp. 598–610.

We introduce a model for provable data possession (PDP) that allows a client that has stored data at an untrusted server to verify that the server possesses the original data without retrieving it. The model generates probabilistic proofs of possession by sampling random sets of blocks from the server, which drastically reduces I/O costs. The client maintains a constant amount of metadata to verify the proof. The challenge/response protocol transmits a small, constant amount of data, which minimizes network communication. Thus, the PDP model for remote data checking supports large data sets in widely-distributed storage systems. We present two provably-secure PDP schemes that are more efficient than previous solutions, even when compared with schemes that achieve weaker guarantees. In particular, the overhead at the server is low (or even constant), as opposed to linear in the size of the data. Experiments using our implementation verify the practicality of PDP and reveal that the performance of PDP is bounded by disk I/O and not by cryptographic computation.
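Why sampling a random set of blocks is enough can be seen from a short calculation. If a file has n blocks, t of them are corrupted, and the verifier challenges c blocks chosen uniformly without replacement, the chance of hitting at least one corrupted block is 1 - C(n-t, c) / C(n, c). The sketch below evaluates that expression; the function name and the example numbers are illustrative assumptions, not figures taken from this report.

```python
from math import comb


def detection_probability(total_blocks: int, corrupted: int, sampled: int) -> float:
    """Chance that a random sample of blocks hits at least one corrupted block."""
    if sampled > total_blocks - corrupted:
        return 1.0  # the sample cannot avoid every corrupted block
    miss = comb(total_blocks - corrupted, sampled) / comb(total_blocks, sampled)
    return 1.0 - miss


# Example: 10,000 blocks, 1% corrupted, 460 blocks challenged.
p = detection_probability(10_000, 100, 460)
```

With these example numbers the detection probability exceeds 0.99, even though fewer than 5% of the blocks are read, which illustrates how sampling reduces I/O.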

COMPACT PROOFS OF RETRIEVABILITY

PUBLICATION: H. Shacham and B. Waters, in the Proceedings of ASIACRYPT 2008. Springer-Verlag, 2008, pp. 90–107.

In a proof-of-retrievability system, a data storage center must prove to a verifier that he is actually storing all of a client’s data. The central challenge is to build systems that are both efficient and provably secure; that is, it should be possible to extract the client’s data from any prover that passes a verification check. In this paper, we give the first proof-of-retrievability schemes with full proofs of security against arbitrary adversaries in the strongest model, that of Juels and Kaliski. Our first scheme, built from BLS signatures and secure in the random oracle model, features a proof-of-retrievability protocol in which the client’s query and server’s response are both extremely short. This scheme allows public verifiability: anyone can act as a verifier, not just the file owner. Our second scheme, which builds on pseudorandom functions (PRFs) and is secure in the standard model, allows only private verification. It features a proof-of-retrievability protocol with an even shorter server’s response than our first scheme, but the client’s query is long. Both schemes rely on homomorphic properties to aggregate a proof into one small authenticator value.

CHAPTER 2

2.0 SYSTEM ANALYSIS

2.1 EXISTING SYSTEM:

In the existing system, files uploaded to the cloud are not signed by the user at each upload, so the integrity of shared data cannot be verified. Moreover, since the cloud is not in the same trusted domain as each user in the group, outsourcing every user’s private key to the cloud would introduce significant security issues.

2.1.1 DISADVANTAGES:

2.2 PROPOSED SYSTEM:

In our proposed system, we assume that the cloud may lie to verifiers about the incorrectness of shared data in order to protect the reputation of its data services and avoid losing money on them. In addition, we assume there is no collusion between the cloud and any user during the design of our mechanism. Generally, the incorrectness of shared data under the above semi-trusted model can be introduced by hardware/software failures or human errors in the cloud. Considering these factors, users do not fully trust the cloud with the integrity of shared data.

2.2.1 ADVANTAGES:

1. Blocking of user accounts

2. Security question

3. Login with a secret key each time

2.3 HARDWARE & SOFTWARE REQUIREMENTS:

2.3.1 HARDWARE REQUIREMENT:

- Processor – Pentium IV
- Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Key Board – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA

2.3.2 SOFTWARE REQUIREMENTS:

Operating System : Windows XP
Front End : Microsoft Visual Studio .NET 2008
Back End : MS-SQL Server 2005
Document : MS-Office 2007

CHAPTER 3

3.0 SYSTEM DESIGN:

Data Flow Diagram / Use Case Diagram / Flow Diagram:

The DFD is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on these data, and the output data generated by the system.

The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components. These components are the system process, the data used by the process, an external entity that interacts with the system and the information flows in the system.

DFD shows how the information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.

A DFD may be used to represent a system at any level of abstraction. A DFD may be partitioned into levels that represent increasing information flow and functional detail.

NOTATION:

SOURCE OR DESTINATION OF DATA:

External sources or destinations, which may be people or organizations or other entities

DATA STORE:

Here the data referenced by a process is stored and retrieved.

PROCESS:

People, procedures or devices that produce data. The physical component is not identified.

DATA FLOW:

Data moves in a specific direction from an origin to a destination. The data flow is a “packet” of data.

MODELING RULES:

There are several common modeling rules when creating DFDs:

All processes must have at least one data flow in and one data flow out.
All processes should modify the incoming data, producing new forms of outgoing data.
Each data store must be involved with at least one data flow.
Each external entity must be involved with at least one data flow.
A data flow must be attached to at least one process.
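The structural rules above can be checked mechanically. The following is a minimal sketch under assumed names (`Diagram`, `check_rules`, and the example node labels are all hypothetical); it covers the rules about flows and attachments, while the rule that a process should transform its input cannot be verified from structure alone.

```python
from dataclasses import dataclass, field


@dataclass
class Diagram:
    processes: set = field(default_factory=set)
    stores: set = field(default_factory=set)
    externals: set = field(default_factory=set)
    flows: list = field(default_factory=list)  # (source, destination) pairs


def check_rules(d: Diagram) -> list:
    """Return a list of human-readable rule violations (empty if none)."""
    problems = []
    sources = {src for src, _ in d.flows}
    targets = {dst for _, dst in d.flows}
    # Every process needs at least one flow in and one flow out.
    for p in sorted(d.processes):
        if p not in targets:
            problems.append(f"process {p!r} has no incoming flow")
        if p not in sources:
            problems.append(f"process {p!r} has no outgoing flow")
    # Every data store and external entity must touch at least one flow.
    for node in sorted(d.stores | d.externals):
        if node not in sources and node not in targets:
            problems.append(f"{node!r} is not attached to any flow")
    # Every flow must be attached to at least one process.
    for src, dst in d.flows:
        if src not in d.processes and dst not in d.processes:
            problems.append(f"flow {src!r} -> {dst!r} touches no process")
    return problems


diagram = Diagram(
    processes={"Verify file"},
    stores={"Cloud storage"},
    externals={"User", "Auditor"},
    flows=[
        ("User", "Verify file"),
        ("Verify file", "Cloud storage"),
        ("Auditor", "Cloud storage"),  # violates the last rule
    ],
)
problems = check_rules(diagram)
```

In this example the only reported problem is the flow that runs directly from an external entity to a data store without passing through a process.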

3.1 BLOCK DIAGRAM

3.2 DATAFLOW DIAGRAM

UML DIAGRAMS:

3.3 USE CASE DIAGRAM:

3.4 CLASS DIAGRAM:

3.5 SEQUENCE DIAGRAM:

3.6 ACTIVITY DIAGRAM:

CHAPTER 4

4.0 IMPLEMENTATION:

4.1 ALGORITHM

4.2 MODULES:

1. USER MODULE:

Registration

File Upload

Download

Reupload

Unblock module

2. AUDITOR MODULE:

File Verification module

View File

3. ADMIN MODULE:

View Files

Block user

4.3 MODULE DESCRIPTION:

USER MODULE:

Registration:

In this module, each user registers his or her details in order to use files. Only registered users are able to log in to the cloud server.

File Upload:

In this module, the user uploads a block of files to the cloud, encrypted with his or her secret key. This ensures that the files are protected from unauthorized users.

Download:

This module allows the user to download a blocked user's file, decrypt the downloaded data with his or her secret key, verify the data, and re-upload the block of the file to the cloud server with encryption. This ensures that the files are protected from unauthorized users.

Reupload:

This module allows the user to re-upload the downloaded files of a blocked user to the cloud server after re-signing them; that is, the files are uploaded with a new signature (a new secret key) and encrypted, to protect the data from unauthorized users.

Unblock Module:

This module allows a user to unblock his or her account by answering the security question. The account is unlocked only if the answer matches the one provided at the time of registration.
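The unblock check can be sketched as follows. This is an assumption about how such a check might be written, not the project's actual code: the `Account` class, the answer normalization, and the choice of a salted PBKDF2 hash are illustrative.

```python
import hashlib
import hmac
import os


def hash_answer(answer: str, salt: bytes) -> bytes:
    """Derive a salted hash of a normalized security answer."""
    normalized = answer.strip().lower().encode("utf-8")
    return hashlib.pbkdf2_hmac("sha256", normalized, salt, 100_000)


class Account:
    def __init__(self, security_answer: str):
        # Store only a salted hash of the answer given at registration.
        self.salt = os.urandom(16)
        self.answer_hash = hash_answer(security_answer, self.salt)
        self.blocked = False

    def unblock(self, supplied_answer: str) -> bool:
        """Unblock only if the supplied answer matches the registered one."""
        candidate = hash_answer(supplied_answer, self.salt)
        matched = hmac.compare_digest(candidate, self.answer_hash)
        if matched:
            self.blocked = False
        return matched


account = Account("Blue")
account.blocked = True
wrong = account.unblock("red")       # account stays blocked
right = account.unblock("  blue ")   # matches after normalization
```

Storing a hash rather than the plain answer, and comparing in constant time, are design choices of this sketch rather than requirements stated in the module description.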

AUDITOR MODULE:

File Verification module:

The public verifier is able to correctly check the integrity of shared data. The public verifier can audit the integrity of shared data without retrieving the entire data from the cloud, even if some blocks in shared data have been re-signed by the cloud.

View File:

In this module, the public auditor views all details of uploads, downloads, blocked users, and re-uploads.

ADMIN MODULE:

View Files:

In this module, the admin views all details of uploads, downloads, blocked users, and re-uploads.

Block User:

In this module, the admin blocks the account of a misbehaving user to protect the integrity of shared data.

CHAPTER 5

5.0 SYSTEM STUDY:

5.1 FEASIBILITY STUDY:

Three key considerations involved in the feasibility analysis are

ECONOMICAL FEASIBILITY
TECHNICAL FEASIBILITY
SOCIAL FEASIBILITY

5.1.1 ECONOMICAL FEASIBILITY:

5.1.2 TECHNICAL FEASIBILITY:

This study is carried out to check the technical feasibility, that is, the technical requirements of the system. Any system developed must not place a high demand on the available technical resources, as this would in turn place high demands on the client. The developed system must have modest requirements, as only minimal or no changes are required to implement it.

5.1.3 SOCIAL FEASIBILITY:

5.2 SYSTEM TESTING:

5.2.1 UNIT TESTING:

Description	Expected result
Test for application window properties.	All the properties of the windows are to be properly aligned and displayed.
Test for mouse operations.	All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions.

5.2.2 FUNCTIONAL TESTING:

Description	Expected result
Test for all modules.	All peers should communicate in the group.
Test for various peers in a distributed network framework, checking that all users available in the group are displayed.	The result after execution should be accurate.

5.2.3 NON-FUNCTIONAL TESTING:

Load testing
Performance testing
Usability testing
Reliability testing
Security testing

5.2.4 LOAD TESTING:

Description	Expected result
It is necessary to ascertain that the application behaves correctly under loads when ‘Server busy’ response is received.	Should designate another active node as a Server.

5.2.5 PERFORMANCE TESTING:

Description	Expected result
This is required to ensure that an application performs adequately: that it can handle many peers, deliver its results in the expected time, and use an acceptable level of resources. It is an aspect of operational management.	Should handle large input values and produce accurate results in the expected time.

5.2.6 RELIABILITY TESTING:

Description	Expected result
This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in providing the application.	In case of failure of the server, an alternate server should take over the job.

5.2.7 SECURITY TESTING:

Description	Expected result
Checking that the user identification is authenticated.	In case of failure, the user should not be connected to the framework.
Check whether group keys in a tree are shared by all peers.	Peers in the same group should know the group key.

5.2.8 WHITE BOX TESTING:

Description	Expected result
Exercise all logical decisions on their true and false sides.	All the logical decisions must be valid.
Execute all loops at their boundaries and within their operational bounds.	All the loops must be finite.
Exercise internal data structures to ensure their validity.	All the data structures must be valid.

5.2.9 BLACK BOX TESTING:

Description	Expected result
To check for incorrect or missing functions.	All the functions must be valid.
To check for interface errors.	The entire interface must function normally.
To check for errors in data structures or external database access.	Database updates and retrieval must complete correctly.
To check for initialization and termination errors.	All the functions and data structures must be initialized properly and terminated normally.

All of the above system testing strategies are carried out, as the development, documentation, and institutionalization of the proposed goals and related policies are essential.

CHAPTER 6

6.0 SOFTWARE SPECIFICATION:

6.1 FEATURES OF .NET:

6.2 THE .NET FRAMEWORK

The .NET Framework has two main parts:

1. The Common Language Runtime (CLR).

2. A hierarchical set of class libraries.

The CLR is described as the “execution engine” of .NET. It provides the environment within which programs run. The most important features are

Conversion from a low-level assembler-style language, called Intermediate Language (IL), into code native to the platform being executed on.
Memory management, notably including garbage collection.
Checking and enforcing security restrictions on the running code.
Loading and executing programs, with version control and other such features.
The following features of the .NET framework are also worth description:

Managed Code

Managed Data

Common Type System

Common Language Specification

6.3 THE CLASS LIBRARY

The set of classes is pretty comprehensive, providing collections, file, screen, and network I/O, threading, and so on, as well as XML and database connectivity.

The class library is subdivided into a number of sets (or namespaces), each providing distinct areas of functionality, with dependencies between the namespaces kept to a minimum.

6.4 LANGUAGES SUPPORTED BY .NET

Visual Basic .NET is also CLS compliant, which means that any CLS-compliant language can use the classes, objects, and components you create in Visual Basic .NET.

Other languages for which .NET compilers are available include

FORTRAN
COBOL
Eiffel

Fig. 1: .NET Framework layers, from top to bottom: ASP.NET and XML Web Services alongside Windows Forms; Base Class Libraries; Common Language Runtime; Operating System.

CONSTRUCTORS AND DESTRUCTORS:

GARBAGE COLLECTION

OVERLOADING

MULTITHREADING:

STRUCTURED EXCEPTION HANDLING

6.5 THE .NET FRAMEWORK

The .NET Framework is a new computing platform that simplifies application development in the highly distributed environment of the Internet.

OBJECTIVES OF .NET FRAMEWORK

1. To provide a consistent object-oriented programming environment whether object code is stored and executed locally, executed locally but Internet-distributed, or executed remotely.

2. To provide a code-execution environment that minimizes software deployment and guarantees safe execution of code.

3. To provide a code-execution environment that eliminates the performance problems of scripted or interpreted environments.

There are different types of application, such as Windows-based applications and Web-based applications.

6.6 FEATURES OF SQL-SERVER

A SQL-SERVER database consists of the following types of objects:

1. TABLE

2. QUERY

3. FORM

4. REPORT

5. MACRO

TABLE:

A table is a collection of data about a specific topic.

VIEWS OF TABLE:

We can work with a table in two views:

1. Design View

2. Datasheet View

Design View

To build or modify the structure of a table, we work in the table design view. We can specify what kind of data will be held.

Datasheet View

To add, edit, or analyze the data itself, we work in the table's datasheet view mode.

QUERY:

CHAPTER 7

APPENDIX

7.1 SAMPLE SOURCE CODE

7.2 SAMPLE OUTPUT

CHAPTER 8

8.1 CONCLUSION

In this paper, we proposed a new public auditing mechanism for shared data with efficient user revocation in the cloud. When a user in the group is revoked, we allow the semi-trusted cloud to re-sign blocks that were signed by the revoked user with proxy re-signatures. Experimental results show that the cloud can improve the efficiency of user revocation, and existing users in the group can save a significant amount of computation and communication resources during user revocation.

CHAPTER 9

Neighbor Similarity Trust against Sybil Attack in P2P E-Commerce

05/08/201902/07/2019 by admin

In this paper, we present a distributed structured approach to Sybil attack. This is derived from the fact that our approach is based on the neighbor similarity trust relationship among the neighbor peers. Given a P2P e-commerce trust relationship based on interest, the transactions among peers are flexible as each peer can decide to trade with another peer any time. A peer doesn’t have to consult others in a group unless a recommendation is needed. This approach shows the advantage in exploiting the similarity trust relationship among peers in which the peers are able to monitor each other.

Our contribution in this paper is threefold:

1) We propose SybilTrust that can identify and protect honest peers from Sybil attack. The Sybil peers can have their trust canceled and dismissed from a group.

2) Based on the group infrastructure in P2P e-commerce, each neighbor is connected to the peers by the success of the transactions it makes or the trust evaluation level. A peer can only be recognized as a neighbor depending on whether or not trust level is sustained over a threshold value.

3) SybilTrust enables neighbor peers to carry recommendation identifiers among the peers in a group. This ensures that the group detection algorithms used to identify Sybil attack peers are efficient and scalable in large P2P e-commerce networks.
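The neighbor rule in the second contribution can be illustrated with a small sketch. The trust measure used here (the fraction of successful transactions), the inclusive comparison against the threshold, and the names `PeerRecord` and `neighbors` are assumptions made for illustration; the text above does not fix a particular formula.

```python
from dataclasses import dataclass


@dataclass
class PeerRecord:
    successful: int = 0
    failed: int = 0

    def trust(self) -> float:
        """Fraction of successful transactions (0.0 when none are recorded)."""
        total = self.successful + self.failed
        return self.successful / total if total else 0.0


def neighbors(records: dict, threshold: float) -> set:
    """Peers whose trust level is at or above the threshold."""
    return {peer for peer, rec in records.items() if rec.trust() >= threshold}


records = {
    "p1": PeerRecord(successful=9, failed=1),
    "p2": PeerRecord(successful=2, failed=8),
    "p3": PeerRecord(),  # no transaction history yet
}
recognized = neighbors(records, threshold=0.7)
```

Under these assumed numbers only `p1` is recognized as a neighbor; a peer with no history starts below the threshold and has to earn recognition through transactions.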

GOAL OF THE PROJECT:

The goal of trust systems is to ensure that honest peers are accurately identified as trustworthy and Sybil peers as untrustworthy. To unify terminology, we call all identities created by malicious users Sybil peers. In a P2P e-commerce application scenario, most of the trust considerations depend on the historical factors of the peers. The influence of Sybil identities can be reduced based on the historical behavior and recommendations from other peers. For example, a peer can give a positive recommendation to a peer that is later discovered to be a Sybil or malicious peer. Taking this into account can diminish the influence of Sybil identities and hence reduce Sybil attacks. A peer which has been giving dishonest recommendations will have its trust level reduced. If it reaches a certain threshold level, the peer can be expelled from the group. Each peer has an identity, which is either honest or Sybil.

A Sybil identity can be an identity owned by a malicious user, or it can be a bribed/stolen identity, or it can be a fake identity obtained through a Sybil attack. These Sybil attack peers are employed to target honest peers and hence subvert the system. In a Sybil attack, a single malicious user creates a large number of peer identities called sybils. These sybils are used to launch security attacks, both at the application level and at the overlay level. At the application level, sybils can target other honest peers while transacting with them, whereas at the overlay level, sybils can disrupt the services offered by the overlay layer, such as routing, data storage, and lookup. In trust systems, colluding Sybil peers may artificially increase a (malicious) peer’s rating (e.g., on eBay).

1.2 INTRODUCTION:

P2P networks range from communication systems like email and instant messaging to collaborative content rating, recommendation, and delivery systems such as YouTube, Gnutella, Facebook, Digg, and BitTorrent. They allow any user to join the system easily at the expense of trust, with very little validation control. P2P overlay networks are known for their many desired attributes, such as openness, anonymity, decentralized nature, self-organization, scalability, and fault tolerance. Each peer plays the dual role of client and server, meaning that each has its own control. All the resources utilized in the P2P infrastructure are contributed by the peers themselves, unlike traditional methods where a central authority is in control. Peers can collude and carry out all sorts of malicious activities in open-access distributed systems. These malicious behaviors lead to service quality degradation and monetary loss among business partners. Peers are vulnerable to exploitation because creating new identities is open and has near-zero cost. The peer identities are then utilized to influence the behavior of the system.

However, if a single defective entity can present multiple identities, it can control a substantial fraction of the system, thereby undermining the redundancy. The number of identities that an attacker can generate depends on the attacker’s resources such as bandwidth, memory, and computational power. The goal of trust systems is to ensure that honest peers are accurately identified as trustworthy and Sybil peers as untrustworthy. To unify terminology, we call all identities created by malicious users Sybil peers. In a P2P e-commerce application scenario, most of the trust considerations depend on the historical factors of the peers. The influence of Sybil identities can be reduced based on the historical behavior and recommendations from other peers. For example, a peer can give a positive recommendation to a peer that is later discovered to be a Sybil or malicious peer. Taking this into account can diminish the influence of Sybil identities and hence reduce Sybil attacks. A peer which has been giving dishonest recommendations will have its trust level reduced. If it reaches a certain threshold level, the peer can be expelled from the group.
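The penalty-and-expulsion rule described above can be sketched as follows. The fixed penalty per dishonest recommendation, the initial trust of 1.0, and the `Group` class are hypothetical choices for illustration; the text does not specify how much trust is lost or where the threshold lies.

```python
class Group:
    def __init__(self, expel_below: float, penalty: float):
        self.expel_below = expel_below  # trust level below which a peer is expelled
        self.penalty = penalty          # trust lost per dishonest recommendation
        self.trust = {}                 # peer id -> current trust level

    def join(self, peer: str, initial_trust: float = 1.0) -> None:
        self.trust[peer] = initial_trust

    def report_dishonest_recommendation(self, recommender: str) -> bool:
        """Lower the recommender's trust; return True if it was expelled."""
        if recommender not in self.trust:
            return False
        self.trust[recommender] -= self.penalty
        if self.trust[recommender] < self.expel_below:
            del self.trust[recommender]
            return True
        return False


group = Group(expel_below=0.5, penalty=0.25)
group.join("peer-a")
outcomes = [group.report_dishonest_recommendation("peer-a") for _ in range(3)]
```

With these assumed values the peer survives two dishonest recommendations and is expelled on the third, when its trust level drops below the threshold.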

Each peer has an identity, which is either honest or Sybil. A Sybil identity can be an identity owned by a malicious user, or it can be a bribed/stolen identity, or it can be a fake identity obtained through a Sybil attack. These Sybil attack peers are employed to target honest peers and hence subvert the system. In a Sybil attack, a single malicious user creates a large number of peer identities called sybils. These sybils are used to launch security attacks, both at the application level and at the overlay level. At the application level, sybils can target other honest peers while transacting with them, whereas at the overlay level, sybils can disrupt the services offered by the overlay layer, such as routing, data storage, and lookup. In trust systems, colluding Sybil peers may artificially increase a (malicious) peer’s rating (e.g., on eBay). Systems like Credence rely on a trusted central authority to prevent maliciousness.

Defending against Sybil attacks is quite a challenging task. A peer can pretend to be trusted while having a hidden motive. The peer can pollute the system with bogus information, which interferes with genuine business transactions and the functioning of the system. This must be prevented to protect the honest peers. The link between an honest peer and a Sybil peer is known as an attack edge. As each edge involved resembles a human-established trust relationship, it is difficult for the adversary to introduce an excessive number of attack edges. The only known promising defense against Sybil attacks is to use social networks to perform user admission control and limit the number of bogus identities admitted to a system. A social network link between two peers represents a real-world trust relationship between users. In addition, authentication-based mechanisms are used to verify the identities of the peers using shared encryption keys or location information.

1.3 LITERATURE SURVEY:

KEEP YOUR FRIENDS CLOSE: INCORPORATING TRUST INTO SOCIAL NETWORK-BASED SYBIL DEFENSES

AUTHOR: A. Mohaisen, N. Hopper, and Y. Kim

PUBLISH: Proc. IEEE Int. Conf. Comput. Commun., 2011, pp. 1–9.

EXPLANATION:

Social network-based Sybil defenses exploit the algorithmic properties of social graphs to infer the extent to which an arbitrary node in such a graph should be trusted. However, these systems do not consider the different amounts of trust represented by different graphs, or the different levels of trust between nodes, even though trust is a crucial requirement in these systems. For instance, co-authors in an academic collaboration graph are trusted in a different manner than social friends. Furthermore, some social friends are more trusted than others. However, previous designs for social network-based Sybil defenses have not considered the inherent trust properties of the graphs they use. In this paper we introduce several designs to tune the performance of Sybil defenses by accounting for differential trust in social graphs and modeling these trust values by biasing random walks performed on these graphs. Surprisingly, we find that the cost function, the required length of random walks to accept all honest nodes with overwhelming probability, is much greater in graphs with high trust values, such as co-author graphs, than in graphs with low trust values such as online social networks. We show that this behavior is due to the community structure in high-trust graphs, which requires longer walks to traverse multiple communities. Furthermore, we show that our proposed designs to account for trust, while increasing the cost function of graphs with low trust values, decrease the advantage of the attacker.

FOOTPRINT: DETECTING SYBIL ATTACKS IN URBAN VEHICULAR NETWORKS

AUTHOR: S. Chang, Y. Qi, H. Zhu, J. Zhao, and X. Shen

PUBLISH: IEEE Trans. Parallel Distrib. Syst., vol. 23, no. 6, pp. 1103–1114, Jun. 2012.

EXPLANATION:

In urban vehicular networks, where privacy, especially the location privacy of anonymous vehicles, is a major concern, anonymous verification of vehicles is indispensable. Consequently, an attacker who succeeds in forging multiple hostile identities can easily launch a Sybil attack, gaining a disproportionately large influence. In this paper, we propose a novel Sybil attack detection mechanism, Footprint, using the trajectories of vehicles for identification while still preserving their location privacy. More specifically, when a vehicle approaches a road-side unit (RSU), it actively demands an authorized message from the RSU as the proof of the appearance time at this RSU. We design a location-hidden authorized message generation scheme for two objectives: first, RSU signatures on messages are signer ambiguous so that the RSU location information is concealed from the resulting authorized message; second, two authorized messages signed by the same RSU within the same given period of time (temporarily linkable) are recognizable so that they can be used for identification. With the temporal limitation on the linkability of two authorized messages, authorized messages used for long-term identification are prohibited. With this scheme, vehicles can generate a location-hidden trajectory for location-privacy-preserved identification by collecting a consecutive series of authorized messages. Utilizing social relationship among trajectories according to the similarity definition of two trajectories, Footprint can recognize and therefore dismiss “communities” of Sybil trajectories. Rigorous security analysis and extensive trace-driven simulations demonstrate the efficacy of Footprint.

SYBILLIMIT: A NEAR-OPTIMAL SOCIAL NETWORK DEFENSE AGAINST SYBIL ATTACK

AUTHOR: H. Yu, P. Gibbons, M. Kaminsky, and F. Xiao

PUBLISH: IEEE/ACM Trans. Netw., vol. 18, no. 3, pp. 3–17, Jun. 2010.

EXPLANATION:

Decentralized distributed systems such as peer-to-peer systems are particularly vulnerable to sybil attacks, where a malicious user pretends to have multiple identities (called sybil nodes). Without a trusted central authority, defending against sybil attacks is quite challenging. Among the small number of decentralized approaches, our recent SybilGuard protocol [H. Yu et al., 2006] leverages a key insight on social networks to bound the number of sybil nodes accepted. Although its direction is promising, SybilGuard can allow a large number of sybil nodes to be accepted. Furthermore, SybilGuard assumes that social networks are fast mixing, which has never been confirmed in the real world. This paper presents the novel SybilLimit protocol that leverages the same insight as SybilGuard but offers dramatically improved and near-optimal guarantees. The number of sybil nodes accepted is reduced by a factor of Θ(√n), or around 200 times in our experiments for a million-node system. We further prove that SybilLimit’s guarantee is at most a log n factor away from optimal, when considering approaches based on fast-mixing social networks. Finally, based on three large-scale real-world social networks, we provide the first evidence that real-world social networks are indeed fast mixing. This validates the fundamental assumption behind SybilLimit’s and SybilGuard’s approach.

CHAPTER 2

2.0 SYSTEM ANALYSIS

2.1 EXISTING SYSTEM:

Existing work on the Sybil attack uses social networks to eliminate the attack, and its findings are based on preventing Sybil identities. In this paper, we propose the use of neighbor similarity trust in a group P2P e-commerce system based on interest relationships, to eliminate maliciousness among the peers. This is referred to as SybilTrust. In SybilTrust, peers in the interest-based group infrastructure have neighbor similarity trust between each other, and hence are able to prevent Sybil attacks. SybilTrust gives a better relationship in e-commerce transactions, as the peers create links between peer neighbors. This provides an important avenue for peers to advertise their products to other interested peers and to learn about new market destinations and contacts. In addition, the group enables a peer to join the P2P e-commerce network and makes identity forgery more difficult.

Peers use self-certifying identifiers that are exchanged when they initially come into contact. These can be used as public keys to verify digital signatures on the messages sent by their neighbors. Note that all communications between peers are digitally signed. In this kind of relationship, we use neighbors as our point of reference to address the Sybil attack. In a group, whatever admission policy we set, there are honest, malicious, and Sybil peers who are authenticated by an admission control mechanism to join the group. More honest peers are admitted than malicious peers, where the trust association is aimed at positive results. The knowledge of the graph may reside in a single party, or be distributed across all users.
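The idea of a self-certifying identifier can be illustrated with a minimal sketch. It assumes, purely for illustration, that the identifier is the SHA-256 digest of the peer's encoded public key; the source does not state which derivation is used, and the function names and placeholder key bytes below are hypothetical.

```python
import hashlib
import hmac


def self_certifying_id(public_key: bytes) -> str:
    """Derive the identifier from the key itself (SHA-256 is an assumed choice)."""
    return hashlib.sha256(public_key).hexdigest()


def key_matches_id(identifier: str, public_key: bytes) -> bool:
    """Check that a presented public key really belongs to the claimed identifier."""
    return hmac.compare_digest(identifier, self_certifying_id(public_key))


# Placeholder bytes standing in for real encoded public keys.
alice_key = b"alice-public-key-bytes"
mallory_key = b"mallory-public-key-bytes"

alice_id = self_certifying_id(alice_key)
print(key_matches_id(alice_id, alice_key))    # the genuine key is accepted
print(key_matches_id(alice_id, mallory_key))  # a different key is rejected
```

Because the identifier is bound to the key, a neighbor needs no central authority to confirm that a key belongs to the identifier it was given; verifying the signatures themselves would additionally require a public-key signature library, which is outside this sketch.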

2.1.0 DISADVANTAGES:

If a peer trades with very few unsuccessful transactions, we can deduce that the peer is a Sybil peer. This is supported by our approach, which proposes that peers existing in a group have six types of keys.

The keys that exist mostly are pairwise keys supported by the group keys. We also note that if an honest group has a link with another group that has Sybil peers, the Sybil group tends to have incomplete information.

Fake users can enter the network easily.
This makes Sybil attacks possible.

2.2 PROPOSED SYSTEM:

In this paper, we assume there are three kinds of peers in the system: legitimate peers, malicious peers, and Sybil peers. Each malicious peer cheats its neighbors by creating multiple identities, referred to as Sybil peers. In this paper, P2P e-commerce communities are organized in several groups. A group can be either open or restrictive depending on the interest of the peers. We investigate the peers belonging to a certain interest group. In each group, there is a group leader who is responsible for coordinating the activities in the group.

The principal building block of the SybilTrust approach is the identifier distribution process. In this approach, all the peers with similar behavior in a group can be used as identifier sources. They can send identifiers to others as the system regulates. If a peer sends fewer or more identifiers than the system regulates, the system may contain a Sybil attack peer. This information can be broadcast to the rest of the peers in the group. When peers join a group, they acquire different identities in reference to the group. Each peer has neighbors in the group and outside the group. Sybil attack peers forged by the same malicious peer have the same set of physical neighbors as that malicious peer.
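The check on the number of identifiers a peer sends can be sketched as follows. This is an illustration only: the regulated count, the peer names, and the function name are assumptions, and the source does not say how the system sets or enforces the regulated number.

```python
def irregular_senders(sent_counts, regulated_count):
    """Return peers whose number of sent identifiers differs from the regulated number."""
    return sorted(peer for peer, sent in sent_counts.items() if sent != regulated_count)


# Hypothetical observation: identifiers sent by each peer in one round.
sent_counts = {"p1": 3, "p2": 3, "p3": 7, "p4": 1}
suspects = irregular_senders(sent_counts, regulated_count=3)
print(suspects)  # peers that sent more or fewer identifiers than regulated
```

A flagged peer is only a candidate for further evaluation; the list is what would be broadcast to the rest of the group.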

Each neighbor is connected to the peers by the success of the transactions it makes or by the trust evaluation level. To detect the Sybil attack, where a peer can have different identities, a peer is evaluated in reference to its trustworthiness and its similarity to the neighbors. If the neighbors do not have the same trust data as the concerned peer, including its position, it can be detected that the peer has multiple identities and is cheating.
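One observation made above, that Sybil identities forged by the same malicious peer share that peer's set of physical neighbors, suggests a simple grouping step. The sketch below is an assumption-laden illustration rather than the SybilTrust algorithm itself: it only groups identities with identical neighbor sets, and the identity names are hypothetical.

```python
from collections import defaultdict


def suspected_sybil_groups(neighbors):
    """Group identities whose physical neighbor sets are identical."""
    groups = defaultdict(list)
    for identity, nbrs in neighbors.items():
        groups[frozenset(nbrs)].append(identity)
    # Only sets shared by more than one identity are suspicious.
    return sorted(sorted(ids) for ids in groups.values() if len(ids) > 1)


# A1 and A2 are two identities of the same malicious peer; D is unrelated.
observed = {
    "A1": {"B", "C"},
    "A2": {"B", "C"},
    "D": {"B", "E"},
}
print(suspected_sybil_groups(observed))
```

Honest peers can occasionally share a neighbor set as well, so the output should be treated as a list of candidates to be checked against trust data, not as a verdict.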

2.2.0 ADVANTAGES:

Our perception is that the attacker controls a number of neighbor similarity peers, whereby a randomly chosen identifier source is relatively “far away” from most Sybil attack peer relationships. Every peer uses a “reversed” routing table. The source peer will always send some information to the peers which have neighbor similarity trust. However, if they do not reply, it can blacklist them. If they do reply and the source is overwhelmed by the overhead of such replies, then the adversary is effectively launching a DoS attack. Notice that the adversary can launch a DoS attack against the source. This enables two peers to propagate their public keys and IP addresses backward along the route to learn about the peers.

It helps to find Sybil attacks.
It is used to find fake user IDs.
It is feasible to limit the number of attack edges in online social networks by relationship rating.

2.3 HARDWARE & SOFTWARE REQUIREMENTS:

2.3.0 HARDWARE REQUIREMENT:

- Processor – Pentium IV
- Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Key Board – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA

2.3.1 SOFTWARE REQUIREMENTS:

Operating System : Windows XP or Win7
Front End : Microsoft Visual Studio .NET
Script : C# Script
Document : MS-Office 2007

CHAPTER 3

3.0 SYSTEM DESIGNS

3.1 ARCHITECTURE DIAGRAM:

3.2 DATAFLOW DIAGRAM:

LEVEL 0:

[Diagram labels: Source; Neighbor Nodes]

LEVEL 1:

[Diagram labels: P2P Sybil Trust Mode; Send Data Request]

LEVEL 2:

[Diagram labels: Data Receive; P2P ACK; Active Attack (Malicious Node); Send Data Request]

LEVEL 3:

3.3 UML DIAGRAMS

3.3.0 USECASE DIAGRAM:

SERVER CLIENT

3.3.1 CLASS DIAGRAM:

3.3.2 SEQUENCE DIAGRAM:

3.4 ACTIVITY DIAGRAM:

CHAPTER 4

4.0 IMPLEMENTATION:

In this paper, P2P e-commerce communities are organized in several groups. A group can be either open or restrictive depending on the interest of the peers. We investigate the peers belonging to a certain interest group. In each group, there is a group leader who is responsible for coordinating the activities in the group. When peers join a group, they acquire different identities in reference to the group. Each peer has neighbors in the group and outside the group. Sybil attack peers forged by the same malicious peer have the same set of physical neighbors as that malicious peer. Each neighbor is connected to the peers by the success of the transactions it makes or by the trust evaluation level. To detect the Sybil attack, where a peer can have different identities, a peer is evaluated in reference to its trustworthiness and its similarity to the neighbors. If the neighbors do not have the same trust data as the concerned peer, including its position, it can be detected that the peer has multiple identities and is cheating. The method of detection of the Sybil attack is depicted in Fig. 2. A1 and A2 refer to the same peer but with different identities.

In our approach, the identifiers are only propagated by the peers who exhibit neighbor similarity trust. Our perception is that the attacker controls a number of neighbor similarity peers, whereby a randomly chosen identifier source is relatively “far away” from most Sybil attack peer relationships. Every peer uses a “reversed” routing table. The source peer will always send some information to the peers which have neighbor similarity trust. However, if they do not reply, it can blacklist them. If they do reply and the source is overwhelmed by the overhead of such replies, then the adversary is effectively launching a DoS attack. Notice that the adversary can launch a DoS attack against the source. This enables two peers to propagate their public keys and IP addresses backward along the route to learn about the peers. SybilTrust proposes that an honest peer should not have an excessive number of neighbors. The neighbors we refer to should be member peers existing in a group. The restriction helps to bound the number of peers against any additional attack among the neighbors. If there are too many neighbors, SybilTrust will (internally) only use a subset of the peer’s edges while ignoring all others. Following Liben-Nowell and Kleinberg, we define the attributes of the given pair of peers as the intersection of the sets of similar products.
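The pairwise attribute defined above (the intersection of two peers' product sets) and the cap on the number of neighbors can be sketched together. The source does not say which subset of edges SybilTrust keeps when a peer has too many neighbors; keeping the most similar ones is an assumption made here for illustration, as are the function names and sample data.

```python
def similarity(products_u, products_v):
    """Size of the intersection of two peers' product sets (common-items score)."""
    return len(products_u & products_v)


def select_neighbors(peer, candidates, products, max_neighbors):
    """Keep at most max_neighbors candidates, preferring the most similar ones.

    Ranking by similarity is an assumed policy; ties are broken by peer name
    so that the result is deterministic.
    """
    ranked = sorted(
        candidates,
        key=lambda v: (-similarity(products[peer], products[v]), v),
    )
    return ranked[:max_neighbors]


# Hypothetical product interests for one peer "u" and three candidate neighbors.
products = {
    "u": {"books", "music", "games"},
    "a": {"books", "music"},
    "b": {"games"},
    "c": {"cars"},
}
kept = select_neighbors("u", ["a", "b", "c"], products, max_neighbors=2)
print(kept)
```

Capping the neighbor list limits how many edges any single peer, honest or not, can contribute, which is the purpose of the restriction described in the text.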

4.1 MODULES:

SIMILARITY TRUST RELATIONSHIP:

NEIGHBOR SIMILARITY TRUST:

DETECTION OF SYBIL ATTACK:

SECURITY AND PERFORMANCE:

4.2 MODULES DESCRIPTION:

SIMILARITY TRUST RELATIONSHIP:

We focus on active attacks in P2P e-commerce. When a peer is compromised, all of its information can be extracted. In our work, we propose the use of SybilTrust, which is based on the neighbor similarity relationship of the peers. SybilTrust is efficient and scalable for group P2P e-commerce networks. Sybil attack peers may attempt to compromise the edges or the peers of the group P2P e-commerce network. The Sybil attack peers can then execute further malicious actions in the network. The threat being addressed is active identity attacks, as peers continuously perform transactions with one another to show that each controller admitted only honest peers.

Our method assumes that the controller undergoes synchronization to prove whether the peers which acted as distributors of identifiers had similarity or not. If a peer never had similarity, the peer is assumed to have been a Sybil attack peer. A pairing method is used to generate an expander graph whose expansion factor holds with high probability. Every pair of neighbor peers shares a unique symmetric secret key (the edge key), established out of band, for authenticating each other. Peers may deliberately cause Byzantine faults in which their multiple identities and incorrect behavior end up undetected.
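The role of the edge key can be illustrated with a message authentication code. The sketch below assumes HMAC-SHA256 as the authentication primitive and a randomly generated 32-byte key standing in for one agreed out of band; the source specifies neither, and the function names and message contents are hypothetical.

```python
import hashlib
import hmac
import secrets


def tag_message(edge_key: bytes, message: bytes) -> bytes:
    """Compute an authentication tag over the message with the shared edge key."""
    return hmac.new(edge_key, message, hashlib.sha256).digest()


def verify_message(edge_key: bytes, message: bytes, tag: bytes) -> bool:
    """Accept the message only if the tag matches (constant-time comparison)."""
    return hmac.compare_digest(tag_message(edge_key, message), tag)


# Stand-in for a key the two neighbors agreed on out of band.
edge_key = secrets.token_bytes(32)
message = b"offer: item 42"
tag = tag_message(edge_key, message)

print(verify_message(edge_key, message, tag))            # genuine neighbor
print(verify_message(edge_key, b"offer: item 43", tag))  # tampered message
```

Only the two peers holding the edge key can produce a valid tag, so a third identity cannot impersonate either end of the edge without that key.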

The Sybil attack peers can create more non-existent links. The protocols and services for P2P, such as routing protocols, must operate efficiently regardless of the group size. In the neighbor similarity trust, peers must be self-healing in order to recover automatically from any state. A Sybil attack can defeat replication and fragmentation performed in distributed hash tables. Geographic routing in P2P is another routing mechanism that can be compromised by Sybil peers.

NEIGHBOR SIMILARITY TRUST:

We present a Sybil identification algorithm that operates within a neighbor similarity trust setting. We model the system as a directed graph in which V is the set of peers and E is the set of edges. The edges in a neighbor similarity relationship include attack edges, which are safeguarded against Sybil attacks. A peer u and a peer v can trade whether or not one of them is a Sybil peer. Within a group, a comparison can be made to determine the number of peers that trade with a given peer. If the peer trades with very few unsuccessful transactions, we can deduce that the peer is a Sybil peer. This is supported by our approach, which proposes that a peer existing in a group has six types of keys. The keys that exist mostly are pairwise keys supported by the group keys. We also note that if an honest group has a link with another group that has Sybil peers, the Sybil group tends to have incomplete information. Our algorithm adaptively tests the suspected peer while maintaining the neighbor similarity trust connection based on time.

DETECTION OF SYBIL ATTACK:

In a Sybil attack, a malicious peer must try to present multiple distinct identities. This can be achieved either by generating legal identities or by impersonating other normal peers. Some peers may launch arbitrary attacks to interfere with P2P e-commerce operations or with the normal functioning of the network. An attacker can succeed in launching a Sybil attack by:

- Heterogeneous configuration: in this case, malicious peers can have more communication and computation resources than the honest peers.

- Message manipulation: the attacker can eavesdrop on nearby communications with other parties. This means an attacker obtains and interpolates the information needed to impersonate others. Major attacks in P2P e-commerce can be classified as passive and active attacks.

- Passive attack: the attacker listens to incoming and outgoing messages in order to infer the relevant information from the transmitted recommendations, i.e., eavesdropping, but does not harm the system. A peer can be in passive mode and later in active mode.

- Active attack: when a malicious peer receives a recommendation for forwarding, it can modify it; when requested to provide recommendations on another peer, it can inflate them or bad-mouth that peer. Bad-mouthing is a situation where a malicious peer colludes with other malicious peers to retaliate against an honest peer. In the Sybil attack, a malicious peer generates a large number of identities and uses them together to disrupt normal operation.

SECURITY AND PERFORMANCE:

We evaluate the performance of the proposed SybilTrust. We measure two metrics, namely, non-trustworthy rate and detection rate. Non-trustworthy rate is the ratio of the number of honest peers which are erroneously marked as Sybil/malicious peers to the total number of honest peers. Detection rate is the proportion of detected Sybil/malicious peers to the total number of Sybil/malicious peers. Communication cost: the trust level is sent with the recommendation feedback from one peer to another. If a peer is compromised, the information is broadcast to all peers as the trust level is revoked. Computation cost: the SybilTrust approach is efficient in the computation of polynomial evaluation. The calculation of the trust level evaluation is based on a pseudo-random function (PRF). A PRF is a deterministic function.
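The two metrics defined above translate directly into set arithmetic. The following is a minimal sketch with made-up peer names and a made-up detector output; only the two ratio definitions come from the text.

```python
def non_trustworthy_rate(honest, flagged):
    """Fraction of honest peers that were erroneously flagged as Sybil/malicious."""
    if not honest:
        raise ValueError("the set of honest peers must not be empty")
    return len(honest & flagged) / len(honest)


def detection_rate(malicious, flagged):
    """Fraction of Sybil/malicious peers that were flagged."""
    if not malicious:
        raise ValueError("the set of malicious peers must not be empty")
    return len(malicious & flagged) / len(malicious)


# Hypothetical ground truth and detector output.
honest = {"h1", "h2", "h3", "h4", "h5", "h6", "h7", "h8"}
malicious = {"m1", "m2", "m3", "m4"}
flagged = {"m1", "m2", "m3", "h1"}

print(non_trustworthy_rate(honest, flagged))  # 1 of 8 honest peers wrongly flagged
print(detection_rate(malicious, flagged))     # 3 of 4 malicious peers caught
```

The two rates pull in opposite directions: a detector that flags more peers tends to raise both, which is the trade-off the evaluation below discusses.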

In our simulation, we use the C# .NET tool. Each honest and malicious peer interacts with a random number of peers drawn from a uniform distribution. All the peers are restricted to the group. In our approach, the P2P e-commerce community has a total of 3 different categories of interest. The transaction interactions between peers with similar interests can be defined as successful or unsuccessful, expressed as positive or negative respectively. The impact of the first two parameters on the performance of the mechanism is evaluated; the percentage of requests to which a malicious peer replies is chosen randomly by each malicious peer. Transactions are run with 10 to 40 percent malicious peers.
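The simulation setup described above can be approximated with a short sketch. The original simulation was written with C# .NET and its code is not reproduced in the text, so everything below (the Python language, the function names, the population size of 100, and the seeds) is an assumption used only to make the setup concrete.

```python
import random


def build_population(n_peers, malicious_fraction, n_interests=3, seed=0):
    """Create peers, each with one interest category and a malicious flag."""
    rng = random.Random(seed)
    n_malicious = round(n_peers * malicious_fraction)
    return [
        {"id": i, "malicious": i < n_malicious, "interest": rng.randrange(n_interests)}
        for i in range(n_peers)
    ]


def pick_partners(peer, peers, rng):
    """Choose a uniformly random number of partners from the peer's own group."""
    group = [
        p["id"]
        for p in peers
        if p["interest"] == peer["interest"] and p["id"] != peer["id"]
    ]
    count = rng.randint(0, len(group))  # uniform over 0..len(group), inclusive
    return rng.sample(group, count)


peers = build_population(n_peers=100, malicious_fraction=0.4)
rng = random.Random(1)
partners = {p["id"]: pick_partners(p, peers, rng) for p in peers}
print(sum(p["malicious"] for p in peers), "malicious peers out of", len(peers))
```

Varying `malicious_fraction` between 0.1 and 0.4 would correspond to the 10 to 40 percent range mentioned above; marking transactions as successful or unsuccessful is left out of this sketch.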

Our SybilTrust approach detects more malicious peers compared to Eigen Trust and Eigen Group Trust [26], as shown in Fig. 4. Fig. 4 shows the detection rates of the P2P network as the number of malicious peers increases. When the number of deployed peers is small, e.g., 40 peers, the chance that no peers are around a malicious peer is high. Fig. 4 illustrates the variation of the non-trustworthy rates for different numbers of honest peers as the number of malicious peers increases. It is shown that the non-trustworthy rate increases as the numbers of honest peers and malicious peers increase. The reason is that when there are more malicious peers, the number of target groups is larger. Moreover, this is because the neighbor relationship is used to categorize peers in the proposed approach. The number of target groups also increases when the number of honest peers is higher. As a result, the honest peers are examined more times, and the chance that an honest peer is erroneously determined to be a Sybil/malicious peer increases, although more Sybil attack peers can also be identified. Fig. 4 displays the detection rate when the reply rate of each malicious peer is the same. The detection rate does not decrease when the reply rate is more than 80 percent, because of the enhancement.

The enhancement can still be observed even when a malicious peer replies to almost all of its Sybil attack peer requests. Furthermore, the detection rate is higher as the number of malicious peers grows, which means the proposed mechanism is able to resist the Sybil attack from more malicious peers. The detection rate is still more than 80 percent in the sparse network, and it reaches 95 percent when the number of legitimate nodes is 300. This is also because the number of target groups increases as the number of malicious peers increases, and the honest peers are examined more times. Therefore, the rate at which an honest peer is erroneously identified as a Sybil/malicious peer also increases.

CHAPTER 5

5.0 SYSTEM STUDY:

5.1 FEASIBILITY STUDY:

The three key considerations involved in the feasibility analysis are:

ECONOMICAL FEASIBILITY
TECHNICAL FEASIBILITY
SOCIAL FEASIBILITY

5.1.1 ECONOMICAL FEASIBILITY:

5.1.2 TECHNICAL FEASIBILITY:

5.1.3 SOCIAL FEASIBILITY:

5.2 SYSTEM TESTING:

5.2.1 UNIT TESTING:

UNIT TESTING:

Description	Expected result
Test for application window properties.	All the properties of the windows are to be properly aligned and displayed.
Test for mouse operations.	All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions.

5.2.2 FUNCTIONAL TESTING:

FUNCTIONAL TESTING:

Description	Expected result
Test for all modules.	All peers should communicate in the group.
Test for various peers in a distributed network framework, which should display all users available in the group.	The result after execution should be accurate.

5.2.3 NON-FUNCTIONAL TESTING:

Load testing
Performance testing
Usability testing
Reliability testing
Security testing

5.2.4 LOAD TESTING:

LOAD TESTING:

Description	Expected result
It is necessary to ascertain that the application behaves correctly under load when a ‘Server busy’ response is received.	Should designate another active node as a server.

5.2.5 PERFORMANCE TESTING:

PERFORMANCE TESTING:

Description	Expected result
This is required to assure that the application performs adequately: it can handle many peers, delivers its results in the expected time, and uses an acceptable level of resources. It is an aspect of operational management.	Should handle large input values and produce accurate results in the expected time.

5.2.6 RELIABILITY TESTING:

RELIABILITY TESTING:

Description	Expected result
This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in providing the application.	In case of failure of the server, an alternate server should take over the job.

5.2.7 SECURITY TESTING:

SECURITY TESTING:

Description	Expected result
Checking that the user identification is authenticated.	In case of failure, the user should not be connected to the framework.
Check whether group keys in a tree are shared by all peers.	The peers in the same group should know the group key.

5.2.8 WHITE BOX TESTING:

WHITE BOX TESTING:

Description	Expected result
Exercise all logical decisions on their true and false sides.	All the logical decisions must be valid.
Execute all loops at their boundaries and within their operational bounds.	All the loops must be finite.
Exercise internal data structures to ensure their validity.	All the data structures must be valid.

5.2.9 BLACK BOX TESTING:

BLACK BOX TESTING:

Description	Expected result
To check for incorrect or missing functions.	All the functions must be valid.
To check for interface errors.	The entire interface must function normally.
To check for errors in data structures or external database access.	Database update and retrieval must work correctly.
To check for initialization and termination errors.	All the functions and data structures must be initialized properly and terminated normally.

All of the above system testing strategies are carried out, as the development, documentation, and institutionalization of the proposed goals and related policies are essential.

CHAPTER 7

7.0 SOFTWARE SPECIFICATION:

7.1 FEATURES OF .NET:

7.2 THE .NET FRAMEWORK

The .NET Framework has two main parts:

1. The Common Language Runtime (CLR).

2. A hierarchical set of class libraries.

The CLR is described as the “execution engine” of .NET. It provides the environment within which programs run. Its most important features are:

Conversion from a low-level assembler-style language, called Intermediate Language (IL), into code native to the platform being executed on.
Memory management, notably including garbage collection.
Checking and enforcing security restrictions on the running code.
Loading and executing programs, with version control and other such features.
The following features of the .NET Framework are also worth describing:

Managed Code

Managed Data

Common Type System

Common Language Specification

7.3 THE CLASS LIBRARY

The set of classes is pretty comprehensive, providing collections, file, screen, and network I/O, threading, and so on, as well as XML and database connectivity.

The class library is subdivided into a number of sets (or namespaces), each providing distinct areas of functionality, with dependencies between the namespaces kept to a minimum.

7.4 LANGUAGES SUPPORTED BY .NET

Visual Basic .NET is also CLS compliant, which means that any CLS-compliant language can use the classes, objects, and components you create in Visual Basic .NET.

Other languages for which .NET compilers are available include

FORTRAN
COBOL
Eiffel

ASP.NET XML WEB SERVICES	Windows Forms
Base Class Libraries
Common Language Runtime
Operating System

Fig. 1. The .NET Framework

CONSTRUCTORS AND DESTRUCTORS:

GARBAGE COLLECTION

OVERLOADING

MULTITHREADING:

STRUCTURED EXCEPTION HANDLING

7.5 THE .NET FRAMEWORK

The .NET Framework is a new computing platform that simplifies application development in the highly distributed environment of the Internet.

OBJECTIVES OF .NET FRAMEWORK

1. To provide a consistent object-oriented programming environment, whether object code is stored and executed locally, executed locally but Internet-distributed, or executed remotely.

2. To provide a code-execution environment that minimizes software deployment conflicts and guarantees safe execution of code.

3. To eliminate performance problems.

There are different types of application, such as Windows-based applications and Web-based applications.

7.6 FEATURES OF SQL-SERVER

A SQL-SERVER database consists of six types of objects, including:

1. TABLE

2. QUERY

3. FORM

4. REPORT

5. MACRO

7.7 TABLE:

A database is a collection of data about a specific topic.

VIEWS OF TABLE:

We can work with a table in two views:

1. Design View

2. Datasheet View

Design View

To build or modify the structure of a table, we work in the table Design view. There we can specify what kind of data each field will hold.

Datasheet View

To add, edit, or analyze the data itself, we work in the table's Datasheet view.

QUERY:

CHAPTER 7

APPENDIX

7.1 SAMPLE SOURCE CODE

7.2 SAMPLE OUTPUT

CHAPTER 8

8.0 CONCLUSION AND FUTURE:

We presented SybilTrust, a defense against the Sybil attack in P2P e-commerce. Compared to other approaches, our approach is based on neighborhood similarity trust in a group P2P e-commerce community. This approach exploits the relationship between peers in a neighborhood setting. Our results on real-world P2P e-commerce confirmed the fast-mixing property and hence validated the fundamental assumption behind SybilGuard’s approach. We also described defense types such as key validation, distribution, and position verification. These methods can be applied simultaneously with neighbor similarity trust, which gives a better defense mechanism. For future work, we intend to implement SybilTrust within the context of peers which exist in many groups. Neighbor similarity trust helps to weed out the Sybil peers and isolate maliciousness to specific Sybil peer groups rather than allow attacks in honest groups with all honest peers.

Malware Propagation in Large-Scale Networks

05/08/201902/07/2019 by admin

Malware Propagation in Large-Scale NetworksAbstract—Malware is pervasive in networks, and poses a critical threat to network security. However, we have very limitedunderstanding of malware behavior in networks to date. In this paper, we investigate how malware propagates in networks from aglobal perspective. We formulate the problem, and establish a rigorous two layer epidemic model for malware propagation fromnetwork to network. Based on the proposed model, our analysis indicates that the distribution of a given malware follows exponentialdistribution, power law distribution with a short exponential tail, and power law distribution at its early, late and final stages, respectively.Extensive experiments have been performed through two real-world global scale malware data sets, and the results confirm ourtheoretical findings.Index Terms—Malware, propagation, modelling, power lawÇ1 INTRODUCTIONMALWARE are malicious software programs deployedby cyber attackers to compromise computer systemsby exploiting their security vulnerabilities. Motivated byextraordinary financial or political rewards, malware ownersare exhausting their energy to compromise as many networkedcomputers as they can in order to achieve theirmalicious goals. A compromised computer is called a bot,and all bots compromised by a malware form a botnet. Botnetshave become the attack engine of cyber attackers, andthey pose critical challenges to cyber defenders. In order tofight against cyber criminals, it is important for defenders tounderstand malware behavior, such as propagation ormembership recruitment patterns, the size of botnets, anddistribution of bots.To date, we do not have a solid understanding about thesize and distribution of malware or botnets. Researchershave employed various methods to measure the size of botnets,such as botnet infiltration [1], DNS redirection [3],external information [2]. These efforts indicate that the sizeof botnets varies from millions to a few thousand. 
There areno dominant principles to explain these variations. As aresult, researchers desperately desire effective models andexplanations for the chaos. Dagon et al. [3] revealed thattime zone has an obvious impact on the number of availablebots. Mieghem et al. [4] indicated that network topology hasan important impact on malware spreading through theirrigorous mathematical analysis. Recently, the emergence ofmobile malware, such as Cabir [5], Ikee [6], and Brador [7],further increases the difficulty level of our understandingon how they propagate. More details about mobile malwarecan be found at a recent survey paper [8]. To the best of ourknowledge, the best finding about malware distribution inlarge-scale networks comes from Chen and Ji [9]: the distributionis non-uniform. All this indicates that the research inthis field is in its early stage.The epidemic theory plays a leading role in malwarepropagation modelling. The current models for malwarespread fall in two categories: the epidemiology model andthe control theoretic model. The control system theorybased models try to detect and contain the spread of malware[10], [11]. The epidemiology models are more focusedon the number of compromised hosts and their distributions,and they have been explored extensively in the computerscience community [12], [13], [14]. Zou et al. [15] useda susceptible-infected (SI) model to predict the growth ofInternet worms at the early stage. Gao and Liu [16] recentlyemployed a susceptible-infected-recovered (SIR) model todescribe mobile virus propagation. One critical conditionfor the epidemic models is a large vulnerable populationbecause their principle is based on differential equations.More details of epidemic modelling can be find in [17]. Aspointed by Willinger et al. [18], the findings, which weextract from a set of observed data, usually reflect parts ofthe studied objects. 
It is more reliable to extract theoreticalresults from appropriate models with confirmation fromsufficient real world data set experiments. We practice thisprinciple in this study.In this paper, we study the distribution of malware interms of networks (e.g., autonomous systems (AS), ISPdomains, abstract networks of smartphones who share thesame vulnerabilities) at large scales. In this kind of setting,we have a sufficient volume of data at a large enough scaleto meet the requirements of the SI model. Different from the_ S. Yu is with the School of Information Technology, Deakin University,Burwood, Victoria 3125, Australia. E-mail: syu@deakin.edu.au._ G. Gu is with the Department of Computer Science and Engineering,Texas A&M University, College Station, TX 77843-3112.E-mail: guofei@cse.tamu.edu._ A. Barnawi is with the Faculty of Computing and IT, King AbdulazizUniversity, Jeddah, Saudi Arabia. E-mail: ambarnawi@kau.edu.sa._ S. Guo is with the School of Computer Science and Engineering, The Universityof Aizu, Aizuwakamatsu, Japan. E-mail: sguo@u-aizu.ac.jp._ I. Stojmenovic is with the School of Information Technology, DeakinUniversity, Australia; King Abdulaziz University, Jeddah, Saudi Arabia;and the School of EECS, University of Ottawa, Ottawa, ON K1N 6N5,Canada. E-mail: ivan@site.uottawa.ca.Manuscript received 1 Jan. 2014; revised 14 Apr. 2014; accepted 15 Apr. 2014.Date of publication 28 Apr. 2014; date of current version 1 Dec. 2014.Recommended for acceptance by F. Bonchi.For information on obtaining reprints of this article, please send e-mail to:reprints@ieee.org, and reference the Digital Object Identifier below.Digital Object Identifier no. 10.1109/TKDE.2014.2320725170 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 27, NO. 1, JANUARY 20151041-4347 _ 2014 IEEE. 
Personal use is permitted, but republication/redistribution requires IEEE permission.See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.traditional epidemic models, we break our model into twolayers. First of all, for a given time since the breakout of amalware, we calculate how many networks have been compromisedbased on the SI model. Second, for a compromisednetwork, we calculate how many hosts have beencompromised since the time that the network was compromised.With this two layer model in place, we can determinethe total number of compromised hosts and theirdistribution in terms of networks. Through our rigorousanalysis, we find that the distribution of a given malwarefollows an exponential distribution at its early stage, andobeys a power law distribution with a short exponential tailat its late stage, and finally converges to a power law distribution.We examine our theoretical findings through twolarge-scale real-world data sets: the Android based malware[19] and the Conficker [20]. The experimental resultsstrongly support our theoretical claims. To the best of ourknowledge, the proposed two layer epidemic model andthe findings are the first work in the field.Our contributions are summarized as follows._ We propose a two layer malware propagation modelto describe the development of a given malware atthe Internet level. Compared with the existing singlelayer epidemic models, the proposed model representsmalware propagation better in large-scalenetworks._ We find the malware distribution in terms of networksvaries from exponential to power law witha short exponential tail, and to power law distributionat its early, late, and final stage, respectively.These findings are first theoretically provedbased on the proposed model, and then confirmedby the experiments through the two large-scalereal-world data sets.The rest of the paper is structured as follows. Relatedwork is briefly listed in Section 2. 
We present the preliminariesfor the proposed model in Section 3. The studiedproblem is discussed in Section 4. A two layer malwarepropagation model is established in Section 5, and followedby a rigorous mathematical analysis in Section 6. Experimentsare conducted to confirm our findings in Section 7. InSection 8, we provide a further discussion about the study.Finally, we summarize the paper and present future workin Section 9.2 RELATED WORKThe basic story of malware is as follows. A malware programerwrites a program, called bot or agent, and theninstalls the bots at compromised computers on the Internetusing various network virus-like techniques. All ofhis bots form a botnet, which is controlled by its ownersto commit illegal tasks, such as launching DDoS attacks,sending spam emails, performing phishing activities, andcollecting sensitive information. There is a command andcontrol (C&C) server(s) to communicate with the bots andcollect data from bots. In order to disguise himself fromlegal forces, the botmaster changes the url of his C&C frequently,e.g., weekly. An excellent explanation about thiscan be found in [1].With the significant growing of smartphones, we havewitnessed an increasing number of mobile malware. Malwarewriters have develop many mobile malware in recentyears. Cabir [5] was developed in 2004, and was the firstmalware targeting on the Symbian operating system formobile devices. Moreover, it was also the first malwarepropagating via Bluetooth. Ikee [6] was the first mobile malwareagainst Apple iPhones, while Brador [7] was developedagainst Windows CE operating systems. The attackvictors for mobile malware are diverse, such as SMS, MMS,Bluetooth, WiFi, and Web browsing. Peng et al. [8] presentedthe short history of mobile malware since 2004, andsurveyed their propagation models.A direct method to count the number of bots is to use botnetinfiltration to count the bot IDs or IP addresses. Stone-Gross et al. 
[1] registered the URL of the Torpig botnet before the botmaster, and were therefore able to hijack the C&C server for ten days and collect about 70 GB of data from the bots of the Torpig botnet. They reported that the footprint of the Torpig botnet was 182,800, and that the median and average sizes of Torpig’s live population were 49,272 and 48,532, respectively. They found 49,294 new infections during the ten-day takeover. Their research also indicated that the live population fluctuates periodically as users switch between being online and offline. This issue was also tackled by Dagon et al. in [3].

Another method is to use DNS redirection. Dagon et al. [3] analyzed bots captured by honeypots, and then identified the C&C server using source code reverse engineering tools. They then manipulated the DNS entry related to a botnet’s IRC server, and redirected the DNS requests to a local sinkhole. They could therefore count the number of bots in the botnet. As discussed previously, their method counts the footprint of the botnet, which was 350,000 in their report.

In this paper, we use two large scale malware data sets for our experiments. Conficker is a well-known malware and one of the most recently widespread. Shin et al. [20] collected a data set of about 25 million Conficker victims from all over the world at different levels. At the same time, malware targeting Android based mobile systems has been developing quickly in recent years. Zhou and Jiang [19] collected a large data set of Android based malware.

In [2], Rajab et al. pointed out that it is inaccurate to count the unique IP addresses of bots because DHCP and NAT techniques are employed extensively on the Internet ([1] confirms this with the observation that 78.9 percent of the infected machines were behind a NAT, VPN, proxy, or firewall). They therefore proposed to examine the hits of DNS caches to find the lower bound of the size of a given botnet.

Rajab et al.
[21] reported that botnets can be categorized into two major genres in terms of membership recruitment: worm-like botnets and variable scanning botnets. The latter accounts for about 82 percent of the 192 IRC bots that they investigated, and is the more prevalent class seen currently. Such botnets usually perform localized and non-uniform scanning, and are difficult to track due to their intermittent and continuously changing behavior. The lifetime of bots is also reported as 25 minutes on average, with 90 percent of them staying for less than 50 minutes.

YU ET AL.: MALWARE PROPAGATION IN LARGE-SCALE NETWORKS 171

Malware propagation modelling has been extensively explored. Based on epidemiology research, Zou et al. [15] proposed a number of models for malware monitoring at the early stage. They pointed out that these kinds of model are appropriate for a system that consists of a large number of vulnerable hosts; in other words, the model is effective at the early stage of the outbreak of malware, and the accuracy of the model drops when the malware develops further. As a variant of the epidemic category, Sellke et al. [12] proposed a stochastic branching process model for characterizing the propagation of Internet worms, which especially focuses on the number of compromised computers against the number of worm scans, and presented a closed form expression for the relationship. Dagon et al. [3] extended the model of [15] by introducing time zone information $\alpha(t)$, and built a model to describe the impact of the diurnal effect on the number of live members of botnets.

The impact of side information on the spreading behavior of network viruses has also been explored. Ganesh et al. [22] thoroughly investigated the effect of network topology on the spread of epidemics.
By combining graph theory and a SIS (susceptible-infective-susceptible) model, they found that if the ratio of cure to infection rates is larger than the spectral radius of the graph of the studied network, then the average epidemic lifetime is of order $\log n$, where $n$ is the number of nodes. On the other hand, if the ratio is smaller than a generalization of the isoperimetric constant of the graph, then the average epidemic lifetime is of order $e^{n^{\alpha}}$, where $\alpha$ is a positive constant. Similarly, Mieghem et al. [4] applied the N-intertwined Markov chain model, an application of mean field theory, to analyze the spread of viruses in networks. They found that $\tau_c = 1/\lambda_{\max}(A)$, where $\tau_c$ is the sharp epidemic threshold, and $\lambda_{\max}(A)$ is the largest eigenvalue of the adjacency matrix $A$ of the studied network. Moreover, there have been many other methodologies to tackle the problem, such as game theory [23].

3 PRELIMINARIES

Preliminaries of epidemic modelling and complex networks are presented in this section, as this work is mainly based on these two fields. For the sake of convenience, we summarize the symbols that we use in this paper in Table 1.

3.1 Deterministic Epidemic Models

After nearly 100 years of development, epidemic models [17] have proved effective and appropriate for a system that possesses a large number of vulnerable hosts. In other words, they are suitable at a macro level. Zou et al. [15] demonstrated that they were suitable for the study of Internet based virus propagation at the early stage.

We note that there are many factors that impact malware propagation or botnet membership recruitment, such as network topology, recruitment frequency, and connection status of vulnerable hosts. All these factors contribute to the speed of malware propagation. Fortunately, we can fold all these factors into one parameter, the infection rate $\beta$ of epidemic theory. Therefore, in our study, let $N$ be the total number of vulnerable hosts of a large-scale network (e.g., the Internet) for a given malware.
There are two statuses for any one of the $N$ hosts, either infected or susceptible. Let $I(t)$ be the number of infected hosts at time $t$, then we have

$$\frac{dI(t)}{dt} = \beta(t)\,[N - R(t) - I(t) - Q(t)]\,I(t) - \frac{dR(t)}{dt}, \qquad (1)$$

where $R(t)$ and $Q(t)$ represent the number of removed hosts from the infected population and the number of removed hosts from the susceptible population at time $t$, respectively. The variable $\beta(t)$ is the infection rate at time $t$.

For our study, model (1) is more detailed than necessary, as we only expect to know the propagation and distribution of a given malware. As a result, we employ the following susceptible-infected model:

$$\frac{dI(t)}{dt} = \beta I(t)\,[N - I(t)], \qquad (2)$$

where the infection rate $\beta$ is a constant for a given malware for any network.

We note that the variable $t$ is continuous in models (1) and (2). In practice, we measure $I(t)$ at discrete time points. Therefore, $t = 0, 1, 2, \ldots$. We can interpret each time point as a new round of malware membership recruitment, such as vulnerable host scanning. As a result, we can transform model (2) into the discrete form as follows:

$$I(t) = (1 + \alpha\Delta)\,I(t-1) - \beta\Delta\,I(t-1)^2, \qquad (3)$$

where $t = 0, 1, 2, \ldots$; $\Delta$ is the unit of time; $I(0)$ is the initial number of infected hosts (we also call them seeds in this paper); and $\alpha = \beta N$, which represents the average number of vulnerable hosts that can be infected by one infected host per time unit.

In order to simplify our analysis, let $\Delta = 1$; it could be one second, one minute, one day, one month, or even one year, depending on the time scale in a given context. Hence, we have a simpler discrete form given by

$$I(t) = (1 + \alpha)\,I(t-1) - \beta\,(I(t-1))^2. \qquad (4)$$

Based on Equation (4), we define the increase of infected hosts for each time unit as follows:

$$\Delta I(t) \triangleq I(t) - I(t-1), \quad t = 1, 2, \ldots. \qquad (5)$$

To date, much research is confined to the “early stage” of an epidemic, such as [15]. Under the early stage condition, $I(t) \ll N$, and therefore $N - I(t) \approx N$.
As a result, a closed form solution is obtained as follows:

$$I(t) = I(0)\,e^{\beta N t}. \qquad (6)$$

When we take the ln operation on both sides of Equation (6), we have

$$\ln I(t) = \beta N t + \ln I(0). \qquad (7)$$

For a given vulnerable network, $\beta$, $N$ and $I(0)$ are constants; therefore, the graphical representation of Equation (7) is a straight line.

Based on the definition in Equation (5), we obtain the increase of new members of a malware at the early stage as

$$\Delta I(t) = (e^{\beta N} - 1)\,I(t-1) = (e^{\beta N} - 1)\,I(0)\,e^{\beta N (t-1)}. \qquad (8)$$

Taking the ln operation on both sides of (8), we have

$$\ln \Delta I(t) = \beta N (t-1) + \ln\big((e^{\beta N} - 1)\,I(0)\big). \qquad (9)$$

Similar to Equation (7), the graphical representation of Equation (9) is also a straight line. In other words, the number of recruited members for each round follows an exponential distribution at the early stage.

TABLE 1. Notations of Symbols in This Paper

172 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 27, NO. 1, JANUARY 2015

We have to note that it is hard to know in practice whether an epidemic is at its early stage or not. Moreover, there is no mathematical definition of the term early stage.

In epidemic models, the infection rate $\beta$ has a critical impact on the membership recruitment progress, and $\beta$ is usually a small positive number, such as 0.00084 for the worm Code Red [12]. For example, for a network with $N = 10{,}000$ vulnerable hosts, we show the recruited members under different infection rates in Fig. 1. From this diagram, we can see that the recruitment goes slowly when $\beta = 0.0001$; however, all vulnerable hosts have been compromised in less than 7 time units when $\beta = 0.0003$, and the recruitment progresses in an exponential fashion.

This reflects the malware propagation styles in practice. For malware based on “contact”, such as Bluetooth contacts, or viruses depending on emails to propagate, the infection rate is usually small, and it takes a long time to compromise a large number of vulnerable hosts in a given network.
On the other hand, for some malware which takes active actions for recruitment, such as vulnerable host scanning, it may take one or a few rounds of scanning to recruit all or a majority of the vulnerable hosts in a given network. We will apply this in the following analysis and performance evaluation.

3.2 Complex Networks

Research on complex networks has demonstrated that the number of hosts of networks follows the power law. People found that size distributions usually follow the power law, such as the population of cities in a country or personal income in a nation [24]. In terms of the Internet, researchers have also discovered many power law phenomena, such as the size distribution of web files [25]. Recent progress reported in [26] further demonstrated that the size of networks follows the power law.

The power law has two expression forms: the Pareto distribution and the Zipf distribution. For the same objects of the power law, we can use either of them to represent it. However, the Zipf expression is tidier than the Pareto expression. In this paper, we will use Zipf distributions to represent the power law. The Zipf expression is as follows:

$$\Pr\{x = i\} = \frac{C}{i^{\alpha}}, \qquad (10)$$

where $C$ is a constant, $\alpha$ is a positive parameter called the Zipf index, $\Pr\{x = i\}$ represents the probability of the $i$th ($i = 1, 2, \ldots$) largest object in terms of size, and $\sum_i \Pr\{x = i\} = 1$.

A more general form of the distribution is called the Zipf-Mandelbrot distribution [27], which is defined as follows:

$$\Pr\{x = i\} = \frac{C}{(i + q)^{\alpha}}, \qquad (11)$$

where the additional constant $q$ ($q \ge 0$) is called the plateau factor, which makes the probability of the highest ranked objects flat. The Zipf-Mandelbrot distribution becomes the Zipf distribution when $q = 0$.

Currently, the usual test for whether a distribution is a power law is to take the loglog plot of the data; we usually say it is a power law if the result shows a straight line. We have to note that this is not a rigorous method; however, it is widely applied in practice.
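The loglog test described above can be sketched in Python as follows. The function names `zipf_mandelbrot` and `loglog_slope`, and the parameter values (1,000 ranks, index 1.2, plateau factor 5), are illustrative assumptions rather than values from the paper. The sketch builds the distributions of Equations (10) and (11) and fits a least-squares line to ln Pr{x = i} against ln i.

```python
import math


def zipf_mandelbrot(num_ranks, a, q=0.0):
    """Pr{x = i} = C / (i + q)^a for i = 1..num_ranks; C normalises the sum to 1."""
    weights = [1.0 / (i + q) ** a for i in range(1, num_ranks + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def loglog_slope(probs):
    """Least-squares slope of ln Pr{x = i} against ln i (rank i starts at 1)."""
    xs = [math.log(i) for i in range(1, len(probs) + 1)]
    ys = [math.log(p) for p in probs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


pure = zipf_mandelbrot(1000, a=1.2)          # Equation (10), q = 0
flat = zipf_mandelbrot(1000, a=1.2, q=5.0)   # Equation (11), assumed q = 5

print(loglog_slope(pure))  # slope equals -a for a pure Zipf distribution
print(loglog_slope(flat))  # shallower, because the head of the curve is flattened
```

For a pure Zipf distribution the loglog points lie exactly on a line of slope −α. With a positive plateau factor the highest ranks flatten, so a single fitted line is only an approximation, which is one reason the straight-line test is not rigorous.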
Power law distributions enjoy one important property: they are scale free. We refer interested readers to [28] for more on the power law and its properties.

4 PROBLEM DESCRIPTION

In this section, we describe the malware propagation problem in general.

As shown in Fig. 2, we study the malware propagation issue at two levels, the Internet level and the network level. We note that at the network level, a network could be defined in many different ways: it could be an ISP domain, a country network, the group of a specific type of mobile device, and so on. At the Internet level, we treat every network of the network level as one element.

Fig. 1. The impact of the infection rate $\beta$ on the recruitment progress for a given vulnerable network with $N = 10{,}000$.

At the Internet level, we suppose there are $M$ networks, and each network is denoted as $L_i$ ($1 \le i \le M$). For any network $L_i$, we suppose it physically possesses $N_i$ hosts. Moreover, we suppose the probability that a host of $L_i$ is vulnerable is denoted as $p_i$ ($0 \le p_i \le 1$). In general, it is highly possible that $N_i \ne N_j$ and $p_i \ne p_j$ for $i \ne j$, $1 \le i, j \le M$. Moreover, due to differences in network topology, operating system, security investment and so on, the infection rates differ from network to network. We denote the infection rate of $L_i$ as $\beta_i$. Similarly, it is highly possible that $\beta_i \ne \beta_j$ for $i \ne j$, $1 \le i, j \le M$.

Consider any given network $L_i$ with $p_i \cdot N_i$ vulnerable hosts and infection rate $\beta_i$. We suppose the malware propagation starts at time 0. Based on Equation (4), we obtain the number of infected hosts, $I_i(t)$, of $L_i$ at time $t$ as follows:

$$I_i(t) = (1 + \alpha_i)\,I_i(t-1) - \beta_i\,(I_i(t-1))^2 = (1 + \beta_i p_i N_i)\,I_i(t-1) - \beta_i\,(I_i(t-1))^2. \qquad (12)$$

In this paper, we are interested in a global sense of malware propagation. We study the following question.

For a given time $t$ since the outbreak of a malware, what are the characteristics of the number of compromised hosts for each network in the view of the whole Internet?
In other words, we aim to find a function $F$ about $I_i(t)$ ($1 \le i \le M$), namely, the pattern of

$$F(I_1(t), I_2(t), \ldots, I_M(t)). \qquad (13)$$

For simplicity of presentation, we use $S(L_i, t)$ to replace $I_i(t)$ at the network level, and $I(t)$ is dedicated to the Internet level. Following Equation (12), for any network $L_i$ ($1 \le i \le M$), we have

$$S(L_i, t) = (1 + \beta_i p_i N_i)\,S(L_i, t-1) - \beta_i\,(S(L_i, t-1))^2. \qquad (14)$$

At the Internet level, we suppose there are $k_1, k_2, \ldots, k_t$ networks that have been compromised at each round for each time unit from 1 to $t$. Any $k_i$ ($1 \le i \le t$) is decided by Equation (4) as follows:

$$k_i = (1 + \beta_n M)\,I(i-1) - \beta_n\,(I(i-1))^2, \qquad (15)$$

where $M$ is the total number of networks over the Internet, and $\beta_n$ is the infection rate among networks. Moreover, suppose the number of seeds, $k_0$, is known.

At this time point $t$, the landscape of the compromised hosts in terms of networks is as follows:

$$\begin{aligned}
&\underbrace{S(L^{1}_{k_1}, t),\ S(L^{2}_{k_1}, t),\ \ldots,\ S(L^{k_1}_{k_1}, t)}_{k_1} \\
&\underbrace{S(L^{1}_{k_2}, t-1),\ S(L^{2}_{k_2}, t-1),\ \ldots,\ S(L^{k_2}_{k_2}, t-1)}_{k_2} \\
&\qquad\ldots \\
&\underbrace{S(L^{1}_{k_t}, 1),\ S(L^{2}_{k_t}, 1),\ \ldots,\ S(L^{k_t}_{k_t}, 1)}_{k_t},
\end{aligned} \qquad (16)$$

where $L^{j}_{k_i}$ represents the $j$th network that was compromised at round $i$.
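The landscape in (16) can be sketched in Python under some simplifying assumptions, all of them ours rather than the paper's: the per-round counts k_1, ..., k_t are supplied by the caller instead of being derived from Equation (15), every network starts from a single seed, one infection rate `beta` is shared by all networks, and sizes are clamped to the number of vulnerable hosts. The names `network_size` and `landscape` are hypothetical.

```python
def network_size(n_vulnerable, beta, seeds, rounds):
    """Iterate Equation (14) for one network with n_vulnerable = p_i * N_i hosts."""
    size = float(seeds)
    for _ in range(rounds):
        size = (1.0 + beta * n_vulnerable) * size - beta * size ** 2
        # Assumption: clamp to [0, n_vulnerable]; the raw map can overshoot.
        size = min(max(size, 0.0), float(n_vulnerable))
    return size


def landscape(k, vulnerable, beta, seeds=1):
    """Group compromised networks by recruitment round, as in (16).

    k[i] networks are compromised at round i + 1; by time t = len(k) each of
    them has progressed t - i rounds. `vulnerable` lists the vulnerable-host
    counts in the order in which the networks were compromised.
    """
    t = len(k)
    groups, start = [], 0
    for i, count in enumerate(k):
        members = vulnerable[start:start + count]
        groups.append([network_size(n, beta, seeds, t - i) for n in members])
        start += count
    return groups


# Toy input: two networks compromised at round 1, one at round 2.
groups = landscape([2, 1], [1000, 500, 200], beta=0.001)
total_hosts = sum(sum(group) for group in groups)
print(groups, total_hosts)
```

Each inner list corresponds to one row of (16): the first holds the k_1 networks that have progressed t rounds, the last holds the k_t networks that have progressed a single round. Summing all entries gives the Internet-level count of compromised hosts for this toy input.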
In other words, there are $k_1$ compromised networks, each of which has progressed $t$ time units; $k_2$ compromised networks, each of which has progressed $t-1$ time units; and so on, down to $k_t$ compromised networks, each of which has progressed 1 time unit.

It is natural to have the total number of compromised hosts at the Internet level as

$$\begin{aligned}
I(t) ={}& \underbrace{S(L^{1}_{k_1}, t) + S(L^{2}_{k_1}, t) + \cdots + S(L^{k_1}_{k_1}, t)}_{k_1} \\
&+ \underbrace{S(L^{1}_{k_2}, t-1) + \cdots + S(L^{k_2}_{k_2}, t-1)}_{k_2} \\
&+ \cdots \\
&+ \underbrace{S(L^{1}_{k_t}, 1) + S(L^{2}_{k_t}, 1) + \cdots + S(L^{k_t}_{k_t}, 1)}_{k_t}.
\end{aligned} \qquad (17)$$

Suppose $k_i$ ($i = 1, 2, \ldots$) follows one distribution with a probability distribution of $p_n$ ($n$ stands for number), and the size of a compromised network, $S(L_i, t)$, follows another probability distribution of $p_s$ ($s$ stands for size). Let $p_I$ be the probability distribution of $I(t)$ ($t = 0, 1, \ldots$). Based on Equation (17), we find $p_I$ is exactly the convolution of $p_n$ and $p_s$:

$$p_I = p_n \otimes p_s, \qquad (18)$$

where $\otimes$ is the convolution operation.

Our goal is to find a pattern of $p_I$ of Equation (18).

5 MALWARE PROPAGATION MODELLING

As shown in Fig. 2, we abstract the $M$ networks of the Internet into $M$ basic elements in our model. As a result, any two large networks, $L_i$ and $L_j$ ($i \ne j$), are similar to each other at this level. Therefore, we can model the studied problem as a homogeneous system. Namely, all the $M$ networks share the same vulnerability probability (denoted as $p$), and the same infection rate (denoted as $\beta$).
A simple way to obtain these two parameters is to use the means:

$$\begin{cases} p = \dfrac{1}{M}\displaystyle\sum_{i=1}^{M} p_i \\[2ex] \beta = \dfrac{1}{M}\displaystyle\sum_{i=1}^{M} \beta_i. \end{cases} \qquad (19)$$

Fig. 2. The system architecture of the studied malware propagation.

For any network $L_i$, let $\bar{N}_i$ be the total number of vulnerable hosts, then we have

$$\bar{N}_i = p \cdot N_i, \quad i = 1, 2, \ldots, M, \qquad (20)$$

where $N_i$ is the total number of computers of network $L_i$.

As discussed in Section 3, we know that $N_i$ ($i = 1, 2, \ldots, M$) follows the power law. As $p$ is a constant in Equation (20), $\bar{N}_i$ ($i = 1, 2, \ldots, M$) follows the power law as well.

Without loss of generality, let $L_i$ represent the $i$th network in terms of total vulnerable hosts ($\bar{N}_i$). Based on the Zipf distribution, if we randomly choose a network $X$, the probability that it is network $L_j$ is

$$\Pr\{X = L_j\} = p_z(j) = \frac{\bar{N}_j}{\sum_{i=1}^{M} \bar{N}_i} = \frac{C}{j^{\alpha}}. \qquad (21)$$

Equation (21) shows clearly that a network with a larger number of vulnerable hosts has a higher probability of being compromised.

Following Equation (17), at time $t$, we have $k_1 + k_2 + \cdots + k_t$ networks that have been compromised. Combining this with Equation (21), in general, we know the first round of recruitment takes the largest $k_1$ networks, the second round takes the $k_2$ largest networks among the remaining networks, and so on. We can therefore simplify Equation (17) as

$$\begin{aligned}
I(t) ={}& \sum_{j=1}^{k_1} S(\bar{N}_j, t)\,p_z(j) + \sum_{j=1}^{k_2} S(\bar{N}_{k_1+j}, t-1)\,p_z(k_1 + j) + \ldots \\
&+ \sum_{j=1}^{k_t} S(\bar{N}_{k_1+\cdots+k_{t-1}+j}, 1)\,p_z(k_1 + \cdots + k_{t-1} + j).
\end{aligned} \qquad (22)$$

From Equation (22), we know the total number of compromised hosts and their distribution in terms of networks for a given time point $t$.

6 ANALYSIS ON THE PROPOSED MALWARE PROPAGATION MODEL

In this section, we try to extract the pattern of $I(t)$ in terms of $S(L_i, t')$, or $p_I$ of Equation (18).

We make the following definitions before we proceed with the analysis.

1) Early stage. An early stage of the outbreak of a malware means only a small percentage of vulnerable hosts have been compromised, and the propagation follows exponential distributions.

2) Final stage.
The final stage of the propagation of a malware means that all vulnerable hosts of a given network have been compromised.

3) Late stage. A late stage means the time interval between the early stage and the final stage.

We note that much research is focused on the early stage, and we define the early stage to meet the pervasively accepted condition; we coin the other two terms for the convenience of our following discussion. Moreover, we set variable $T_e$ as the time point at which a malware’s progress transfers from its early stage to its late stage. In terms of mathematical expressions, we express the early, late and final stages as $0 \le t < T_e$, $T_e \le t < \infty$, and $t = \infty$, respectively.

Due to the complexity of Equation (22), it is difficult to obtain conclusions in a dynamic style. However, we are able to extract some conclusions under some special conditions.

Lemma 1. If distributions $p(x)$ and $q(x)$ follow exponential distributions, then $p(x) \otimes q(x)$ follows an exponential distribution as well.

Due to the space limitation, we skip the proof and refer interested readers to [29].

At the early stage of a malware outbreak, we are in a good position to obtain a clear conclusion.

Theorem 1. For large scale networks, such as the Internet, at the early stage of a malware propagation, the malware distribution in terms of networks follows exponential distributions.

Proof.
At a time point of the early stage ($0 \le t < T_e$) of a malware outbreak, following Equation (6), we obtain the number of compromised networks as

$$I(t) = I(0)\,e^{\beta_n M t}. \qquad (23)$$

It is clear that $I(t)$ follows an exponential distribution.

For any of the compromised networks, we suppose it has progressed $t'$ ($0 < t' \le t < T_e$) time units, and its size is

$$S(L_i, t') = I_i(0)\,e^{\beta \bar{N}_i t'}, \qquad (24)$$

where $\bar{N}_i$ is the number of vulnerable hosts of $L_i$.

Based on Equation (24), we find that the size of any compromised network follows an exponential distribution. As a result, all the sizes of compromised networks follow exponential distributions at the early stage.

Based on Lemma 1, we obtain that the malware distribution in terms of networks follows exponential distributions at its early stage. □

Moreover, we can obtain a concrete conclusion about the propagation of malware at the final stage.

Theorem 2. For large scale networks, such as the Internet, at the final stage ($t = \infty$) of a malware propagation, the malware distribution in terms of networks follows the power law distribution.

Proof. At the final stage, all vulnerable hosts have been compromised, namely,

$$S(L_i, \infty) = \bar{N}_i, \quad i = 1, 2, \ldots, M.$$

Based on our previous discussion, we know $\bar{N}_i$ ($i = 1, 2, \ldots, M$) follows the power law. As a result, the theorem holds. □

Now, we move our study to the late stage of malware propagation.

Theorem 3. For large scale networks, such as the Internet, at the late stage ($T_e \le t < \infty$) of a malware outbreak, the malware distribution includes two parts: a dominant power law body and a short exponential tail.

Proof. Suppose a malware propagation has progressed for $t$ ($t \gg T_e$) time units. Let $t' = t - T_e$. If we separate all the compromised $I(t)$ hosts by time point $t'$, we have two groups of compromised hosts.

Following Theorem 2, as $t' \gg T_e$, the compromised hosts before $t'$ follow the power law. At the same time, all the networks compromised after $t'$ are still in their early stage.
Therefore, these recently compromised networks follow exponential distributions.

Now, we need to prove that the networks compromised after time point $t'$ are at the tail of the distribution. First of all, for a given network $L_i$, for $t_1 > t_2$, we have

$$S(L_i, t_1) \ge S(L_i, t_2). \qquad (25)$$

For two networks, $L_i$ and $L_j$, if $N_i \ge N_j$, then $L_i$ should be compromised earlier than $L_j$. Combining this with (25), we know the later compromised networks usually lie at the tail of the distribution.

Due to the fact that $t' \gg T_e$, the length of the exponential tail is much shorter than the length of the main body of the distribution. □

7 PERFORMANCE EVALUATION

In this section, we examine our theoretical analysis through two well-known large-scale malware: Android malware and Conficker. Android malware is a recent, fast developing and dominant smartphone based malware [19]. Different from Android malware, the Conficker worm is an Internet based state-of-the-art botnet [20]. Both data sets have been widely used by the community.

From the Android malware data set, we have an overview of the malware development from August 2010 to October 2011. There are 1,260 samples in total from 49 different Android malware in the data set. A given Android malware program only focuses on one or a number of specific vulnerabilities. Therefore, all smartphones that share these vulnerabilities form a specific network for that Android malware. In other words, there are 49 networks in the data set, and it is reasonable that the population of each network is huge. We sort the malware subclasses according to their size (number of samples in the data set), and present them in a loglog format in Fig. 3; the diagram is roughly a straight line. In other words, we can say that the Android malware distribution in terms of networks follows the power law.

We now examine the growth pattern of the total number of compromised hosts of Android malware against time, namely, the pattern of $I(t)$. We extract the data from the data set and present it in Table 2.
We further transform the data into a graph as shown in Fig. 4. It shows that the member recruitment of Android malware follows an exponential distribution nicely during the 15 month time interval. We have to note that our experiments also indicate that this data does not fit the power law (we do not show them here due to space limitation).

In Fig. 4, we match a straight line to the real data through the least squares method. Based on the data, we can estimate that the number of seeds ($I(0)$) is 10, and $\alpha = 0.2349$. Following our previous discussion, we infer that the propagation of Android malware was in its early stage. This is reasonable, as the size of each Android vulnerable network is huge and the infection rate is quite low (the infection is basically based on contacts).

Fig. 3. The probability distribution of Android malware in terms of networks.

TABLE 2. The Number of Different Android Malware against Time (Months) in 2010-2011

Fig. 4. The growth of total compromised hosts by Android malware against time from August 2010 to October 2011.

We also collected a large data set of Conficker from various aspects. Due to the space limitation, we can only present a few of them here to examine our theoretical analysis.

First of all, we treat ASs (autonomous systems) as networks in the Internet. In general, ASs are large scale elements of the Internet. A few key statistics from the data set are listed in Table 3. We present the data in a loglog format in Fig. 5, which indicates that the distribution does follow the power law.

A unique feature of the power law is the scale free property. In order to examine this feature, we measure the compromised hosts in terms of domain names at three different domain levels: the top level, level 1, and level 2, respectively. Some statistics of this experiment are listed in Table 4.

Once again, we present the data in a loglog format in Figs. 6a, 6b and 6c, respectively.
The diagrams show that the main bodies of the three scale measures are roughly straight lines. In other words, they all fall into power law distributions. We note that the flat head in Fig. 6 can be explained through a Zipf-Mandelbrot distribution. Therefore, Theorem 2 holds.

In order to examine whether the tails are exponential, we take the smallest six data points from each tail of the three levels. It is reasonable to say that they are the networks compromised at the last 6 time units; the details are listed in Table 5 (we note that $t = 1$ is the sixth last time point, and $t = 6$ is the last time point).

When we present the data of Table 5 in a graph as shown in Fig. 7, we find that they fit an exponential distribution very well, especially for the level 2 and level 3 domain name cases. This experiment confirms our claim in Theorem 3.

TABLE 3. Statistics for Conficker Distribution in Terms of ASs

Fig. 5. Power law distribution of Conficker in terms of autonomous networks.

TABLE 4. Statistics for Conficker Distribution in Terms of Domain Names at the Three Top Levels

Fig. 6. Power law distribution of Conficker botnet in the top three levels of domain names.

8 FURTHER DISCUSSION

In this paper, we have explored the problem of malware distribution in large-scale networks. There are many directions that could be further explored. We list some important ones as follows.

1) The dynamics of the late stage. We have found that the main body of the malware distribution follows the power law with a short exponential tail at the late stage. It would be valuable to explore the mathematical mechanism by which the propagation leads to such mixed distributions.

2) The transition from exponential distribution to power law distribution. It is necessary to investigate when and how a malware distribution moves from an exponential distribution to the power law. In other words, how can we clearly define the transition point between the early stage and the late stage?

3) Multiple layer modelling. We employ the fluid model in both layers in our study, as both layers are sufficiently large and meet the conditions for the modelling methods. In order to improve the accuracy of malware propagation modelling, we may extend our work to $n$ ($n > 2$) layers. In another scenario, we may expect to model a malware distribution for middle size networks, e.g., an ISP network with many subnetworks. In these cases, the conditions for the fluid model may not hold. Therefore, we need to seek suitable models to address the problem.

4) Epidemic model for the proposed two layer model. In this paper, we use the SI model, which is the simplest for epidemic analysis. More practical models, e.g., SIS or SIR, could be chosen to serve the same problem.

5) Distribution of multiple coexisting malware in networks. In reality, multiple malware may coexist in the same networks. Because different malware focus on different vulnerabilities, the distributions of different malware should not be the same. It is challenging and interesting to establish mathematical models for multiple malware distribution in terms of networks.

9 SUMMARY AND FUTURE WORK

In this paper, we thoroughly explore the problem of malware distribution in large-scale networks. The solution to this problem is much desired by cyber defenders, as the network security community does not yet have solid answers. Different from previous modelling methods, we propose a two layer epidemic model: the upper layer focuses on the networks of a large scale network, for example, domains of the Internet; the lower layer focuses on the hosts of a given network. This two layer model improves the accuracy compared with the available single layer epidemic models in malware modelling.
Moreover, the proposed two layer model offers us the distribution of malware in terms of the low layer networks.

We perform a restricted analysis based on the proposed model, and obtain three conclusions: the distribution for a given malware in terms of networks follows an exponential distribution, a power law distribution with a short exponential tail, and a power law distribution, at its early, late, and final stage, respectively. In order to examine our theoretical findings, we have conducted extensive experiments based on two real-world large-scale malware, and the results confirm our theoretical claims.

In regards to future work, we will first further investigate the dynamics of the late stage. More details of the findings remain to be studied, such as the length of the exponential tail of a power law distribution at the late stage. Second, defenders may care more about their own network, e.g., the distribution of a given malware in their ISP domains, where the conditions for the two layer model may not hold. We need to seek appropriate models to address this problem. Finally, we are interested in studying the distribution of multiple malware on large-scale networks, as we only focus on one malware in this paper. We believe the multiple malware case is not a simple linear extension of the single malware one.

ACKNOWLEDGMENTS

Dr Yu’s work is partially supported by the National Natural Science Foundation of China (grant No. 61379041), Prof. Stojmenovic’s work is partially supported by NSERC Canada Discovery grant (grant No. 41801-2010), and KAU Distinguished Scientists Program.

Shui Yu (M’05-SM’12) received the BEng and MEng degrees from the University of Electronic Science and Technology of China, Chengdu, P. R. China, in 1993 and 1999, respectively, and the PhD degree from Deakin University, Victoria, Australia, in 2004. He is currently a senior lecturer with the School of Information Technology, Deakin University, Victoria, Australia.
He has published nearly 100 peer review papers, including top journals and top conferences, such as IEEE TPDS, IEEE TIFS, IEEE TFS, IEEE TMC, and IEEE INFOCOM. His research interests include networking theory, network security, and mathematical modeling. He actively serves his research communities in various roles, which include the editorial boards of the IEEE Transactions on Parallel and Distributed Systems, IEEE Communications Surveys and Tutorials, and IEEE Access, IEEE INFOCOM TPC member 2012-2015, symposium co-chair of IEEE ICC 2014 and IEEE ICNC 2013-2015, and many different roles in international conference organizing committees. He is a senior member of the IEEE, and a member of the AAAS.

Guofei Gu (S’06-M’08) received the PhD degree in computer science from the College of Computing, Georgia Institute of Technology. He is an assistant professor in the Department of Computer Science and Engineering, Texas A&M University (TAMU), College Station, TX. His research interests are in network and system security, such as malware analysis, detection, defense, intrusion and anomaly detection, and web and social networking security. He is currently directing the Secure Communication and Computer Systems (SUCCESS) Laboratory at TAMU. He received the 2010 National Science Foundation (NSF) Career Award and was a corecipient of the 2010 IEEE Symposium on Security and Privacy (Oakland 10) Best Student Paper Award. He is a member of the IEEE.

Ahmed Barnawi received the PhD degree from the University of Bradford, United Kingdom in 2006. He is an associate professor at the Faculty of Computing and IT, King Abdulaziz University, Jeddah, Saudi Arabia, where he has worked since 2007. He was a visiting professor at the University of Calgary in 2009. His research areas are cellular and mobile communications, mobile ad hoc and sensor networks, cognitive radio networks and security. He received three strategic research grants and registered two patents in the US.
He is a member of the IEEE.

Song Guo (M’02-SM’11) received the PhD degree in computer science from the University of Ottawa, Canada in 2005. He is currently a senior associate professor at the School of Computer Science and Engineering, the University of Aizu, Japan. His research interests are mainly in the areas of protocol design and performance analysis for reliable, energy-efficient, and cost effective communications in wireless networks. He is an associate editor of the IEEE Transactions on Parallel and Distributed Systems and an editor of Wireless Communications and Mobile Computing. He is a senior member of the IEEE and the ACM.

Ivan Stojmenovic was editor-in-chief of the IEEE Transactions on Parallel and Distributed Systems (2010-3), and is founder of three journals. He is editor of the IEEE Transactions on Computers, IEEE Network, IEEE Transactions on Cloud Computing, and ACM Wireless Networks and steering committee member of the IEEE Transactions on Emergent Topics in Computing. He is on Thomson Reuters list of Highly Cited Researchers from 2013, has top h-index in Canada for mathematics and statistics, and has more than 15,000 citations. He received five Best Paper Awards. He is a fellow of the IEEE, Canadian Academy of Engineering and Academia Europaea. He has received the Humboldt Research Award.

Lossless and Reversible Data Hiding in Encrypted Images with Public Key Cryptography

05/08/201902/07/2019 by admin

Abstract—This paper proposes lossless, reversible, and combined data hiding schemes for ciphertext images encrypted by public key cryptosystems with probabilistic and homomorphic properties. In the lossless scheme, the ciphertext pixels are replaced with new values to embed the additional data into several LSB-planes of ciphertext pixels by multi-layer wet paper coding. Then, the embedded data can be directly extracted from the encrypted domain, and the data embedding operation does not affect the decryption of the original plaintext image. In the reversible scheme, a preprocessing step is employed to shrink the image histogram before image encryption, so that the modification of encrypted images for data embedding will not cause any pixel oversaturation in the plaintext domain. Although a slight distortion is introduced, the embedded data can be extracted and the original image can be recovered from the directly decrypted image. Due to the compatibility between the lossless and reversible schemes, the two kinds of data embedding can be performed simultaneously in an encrypted image. With the combined technique, a receiver may extract a part of the embedded data before decryption, and extract another part of the embedded data and recover the original plaintext image after decryption.

Index Terms—reversible data hiding, lossless data hiding, image encryption

I. INTRODUCTION

Encryption and data hiding are two effective means of data protection. While encryption techniques convert plaintext content into unreadable ciphertext, data hiding techniques embed additional data into cover media by introducing slight modifications. In some distortion-unacceptable scenarios, data hiding may be performed in a lossless or reversible manner. Although the terms “lossless” and “reversible” have the same meaning in a set of previous references, we distinguish them in this work.

We say a data hiding method is lossless if the display of the cover signal containing embedded data is the same as that of the original cover even though the cover data have been modified for data embedding. For example, in [1], the pixels with the most used color in a palette image are assigned to some unused color indices for carrying the additional data, and these indices are redirected to the most used color. This way, although the indices of these pixels are altered, the actual colors of the pixels are kept unchanged. On the other hand, we say a data hiding method is reversible if the original cover content can be perfectly recovered from the cover version containing embedded data even though a slight distortion has been introduced in the data embedding procedure. A number of mechanisms, such as difference expansion [2], histogram shift [3] and lossless compression [4], have been employed to develop reversible data hiding techniques for digital images. Recently, several good prediction approaches [5] and optimal transition probability under a payload-distortion criterion [6, 7] have been introduced to improve the performance of reversible data hiding.

The combination of data hiding and encryption has been studied in recent years. In some works, data hiding and encryption are combined in a simple manner. For example, a part of the cover data is used for carrying additional data and the remaining data are encrypted for privacy protection [8, 9]. Alternatively, the additional data are embedded into a data space that is invariant to encryption operations [10]. In another type of work, data embedding is performed in the encrypted domain, and an authorized receiver can recover the original plaintext cover image and extract the embedded data. This technique is termed reversible data hiding in encrypted images (RDHEI). In some scenarios, for securely sharing secret images, a content owner may encrypt the images before transmission, and an inferior assistant or a channel administrator hopes to append some additional messages, such as the origin information, image notations or authentication data, within the encrypted images though he does not know the image content. For example, when medical images have been encrypted for protecting patient privacy, a database administrator may aim to embed the personal information into the corresponding encrypted images. Here, it is desirable that the original content can be recovered without any error after decryption and retrieval of the additional message at the receiver side. In [11], the original image is encrypted by an exclusive-or operation with pseudo-random bits, and then the additional data are embedded by flipping a part of the least significant bits (LSB) of the encrypted image. By exploiting the spatial correlation in natural images, the embedded data and the original content can be retrieved at the receiver side. The performance of RDHEI can be further improved by introducing an implementation order [12] or a flipping ratio [13]. In [14], each additional bit is embedded into a block of data encrypted by the Advanced Encryption Standard (AES). When a receiver decrypts the encrypted image containing additional data, however, the quality of the decrypted image is significantly degraded due to the disturbance of the additional data. In [15], the data-hider compresses the LSB of the encrypted image to generate a sparse space for carrying the additional data. Since only the LSB is changed in the data embedding phase, the quality of the directly decrypted image is satisfactory. A reversible data hiding scheme for encrypted JPEG images is also presented in [16]. In [17], a sparse data space for accommodating additional data is directly created by compressing the encrypted data. If the creation of the sparse data space or the compression is implemented before encryption, a better performance can be achieved [18, 19].

Xinpeng Zhang, Jing Long, Zichi Wang, and Hang Cheng. 1051-8215 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCSVT.2015.2433194, IEEE Transactions on Circuits and Systems for Video Technology

While the additional data are embedded into images encrypted with a symmetric cryptosystem in the above-mentioned RDHEI methods, an RDHEI method with a public key cryptosystem is proposed in [20]. Although the computational complexity is higher, the sender and the receiver no longer need to establish a secret key through a secure channel. With the method in [20], each pixel is divided into two parts, an even integer and a bit, and the two parts are each encrypted using the Paillier mechanism [21]. Then, the ciphertext values of the second parts of two adjacent pixels are modified to accommodate an additional bit. Due to the homomorphic property of the cryptosystem, the embedded bit can be extracted by comparing the corresponding decrypted values on the receiver side. In fact, the homomorphic property may be further exploited to implement signal processing in the encrypted domain [22, 23, 24]. For recovering the original plaintext image, an inverse operation to retrieve the second part of each pixel in the plaintext domain is required, and then the two decrypted parts of each pixel should be reorganized as a pixel.

This paper proposes lossless, reversible, and combined data hiding schemes for public-key-encrypted images by exploiting the probabilistic and homomorphic properties of cryptosystems. With these schemes, the pixel division/reorganization is avoided and the encryption/decryption is performed on the cover pixels directly, so that the amount of encrypted data and the computational complexity are lowered. In the lossless scheme, due to the probabilistic property, although the data of the encrypted image are modified for data embedding, a direct decryption can still result in the original plaintext image while the embedded data can be extracted in the encrypted domain. In the reversible scheme, a histogram shrink is realized before encryption so that the modification of the encrypted image for data embedding does not cause any pixel oversaturation in the plaintext domain. Although the data embedding in the encrypted domain may result in a slight distortion in the plaintext domain due to the homomorphic property, the embedded data can be extracted and the original content can be recovered from the directly decrypted image. Furthermore, the data embedding operations of the lossless and the reversible schemes can be simultaneously performed in an encrypted image. With the combined technique, a receiver may extract a part of the embedded data before decryption, and extract another part of the embedded data and recover the original plaintext image after decryption.

II. LOSSLESS DATA HIDING SCHEME

In this section, a lossless data hiding scheme for public-key-encrypted images is proposed. There are three parties in the scheme: an image provider, a data-hider, and a receiver. With a cryptosystem possessing the probabilistic property, the image provider encrypts each pixel of the original plaintext image using the public key of the receiver, and a data-hider who does not know the original image can modify the ciphertext pixel values to embed some additional data into the encrypted image by multi-layer wet paper coding, under the condition that the decrypted values of the new and original ciphertext pixel values must be the same. When having the encrypted image containing the additional data, a receiver knowing the data hiding key may extract the embedded data, while a receiver with the private key of the cryptosystem may perform decryption to retrieve the original plaintext image. In other words, the embedded data can be extracted in the encrypted domain, and cannot be extracted after decryption since the decrypted image would be the same as the original plaintext image due to the probabilistic property. That also means the data embedding does not affect the decryption of the plaintext image. The sketch of the lossless data hiding scheme is shown in Figure 1.

Figure 1. Sketch of lossless data hiding scheme for public-key-encrypted images. (Diagram blocks: image provider: original image, image encryption, encrypted image; data-hider: additional data, data embedding, encrypted image containing embedded data; receiver: data extraction yielding the additional data, decryption yielding the original image.)

A. Image encryption

In this phase, the image provider encrypts a plaintext image using the public key of probabilistic cryptosystem pk. For each pixel value m(i, j) where (i, j) indicates the pixel position, the image provider calculates its ciphertext value,

(1) $c(i,j) = E\left[p_k,\ m(i,j),\ r(i,j)\right]$

where E is the encryption operation and r(i, j) is a random value. Then, the image provider collects the ciphertext values of all pixels to form an encrypted image.

Actually, the proposed scheme is compatible with various probabilistic public-key cryptosystems, such as the Paillier [21] and Damgard-Jurik [25] cryptosystems. With the Paillier cryptosystem [21], for two large primes p and q, calculate n = p⋅q and λ = lcm(p−1, q−1), where lcm means the least common multiple. Here, it should hold that gcd(n, (p−1)⋅(q−1)) = 1, where gcd means the greatest common divisor. The public key is composed of n and a randomly selected integer g in $\mathbb{Z}^*_{n^2}$, while the private key is composed of λ and

(2) $\mu = \left(L\left(g^{\lambda} \bmod n^2\right)\right)^{-1} \bmod n$

where

(3) $L(x) = \dfrac{x-1}{n}$

In this case, (1) implies

(4) $c(i,j) = g^{m(i,j)} \cdot \left(r(i,j)\right)^{n} \bmod n^2$

where r(i, j) is a random integer in $\mathbb{Z}^*_n$. The plaintext pixel value can be obtained using the private key,

(5) $m(i,j) = L\left(c(i,j)^{\lambda} \bmod n^2\right) \cdot \mu \bmod n$

As a generalization of the Paillier cryptosystem, the Damgard-Jurik cryptosystem [25] can also be used to encrypt the plaintext image. Here, the public key is composed of n and an element g in $\mathbb{Z}^*_{n^{s+1}}$ such that $g = (1+n)^{j} \cdot x \bmod n^{s+1}$ for a known j relatively prime to n and x belonging to a group isomorphic to $\mathbb{Z}^*_n$, and we may choose d as the private key when meeting $d \bmod n \in \mathbb{Z}^*_n$ and $d = 0 \bmod \lambda$. Then, the encryption in (1) can be rewritten as

(6) $c(i,j) = g^{m(i,j)} \cdot \left(r(i,j)\right)^{n^s} \bmod n^{s+1}$

where r(i, j) is a random integer in $\mathbb{Z}^*_{n^{s+1}}$. By applying a recursive version of Paillier decryption, the plaintext value can be obtained from the ciphertext value using the private key. Note that, because of the probabilistic property of the two cryptosystems, the same gray values at different positions may correspond to different ciphertext values.
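
As a rough illustration of (1) to (5), the following Python sketch implements textbook Paillier encryption and decryption of a single pixel value. The toy primes, the choice g = n + 1, and all function names are assumptions made for readability rather than details taken from the paper; a real deployment would need primes of cryptographic size.

```python
import math
import random

def L(x, n):
    # Eq. (3): L(x) = (x - 1) / n (exact for the values it is applied to)
    return (x - 1) // n

def keygen(p, q):
    n = p * q
    assert math.gcd(n, (p - 1) * (q - 1)) == 1
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    g = n + 1  # assumption: a common valid choice of g in Z*_{n^2}
    mu = pow(L(pow(g, lam, n * n), n), -1, n)          # Eq. (2)
    return (n, g), (lam, mu)

def random_unit(n):
    # Draw r from Z*_n, i.e. an integer in [1, n) coprime to n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            return r

def encrypt(pk, m):
    n, g = pk
    r = random_unit(n)
    return pow(g, m, n * n) * pow(r, n, n * n) % (n * n)  # Eq. (4)

def decrypt(pk, sk, c):
    n, _ = pk
    lam, mu = sk
    return L(pow(c, lam, n * n), n) * mu % n               # Eq. (5)

pk, sk = keygen(1009, 1013)  # toy primes, far too small for real use
c1, c2 = encrypt(pk, 200), encrypt(pk, 200)
print(decrypt(pk, sk, c1), decrypt(pk, sk, c2))
```

Both ciphertexts decrypt to the same gray value even though they are produced with independent random values r(i, j), which is the probabilistic property the lossless scheme relies on.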

B. Data embedding

When having the encrypted image, the data-hider may embed some additional data into it in a lossless manner. The pixels in the encrypted image are reorganized as a sequence according to the data hiding key. For each encrypted pixel, the data-hider selects a random integer r’(i, j) in Z*n and calculates

(7) $c'(i,j) = c(i,j) \cdot \left(r'(i,j)\right)^{n} \bmod n^2$

if the Paillier cryptosystem is used for image encryption, while the data-hider selects a random integer r’(i, j) in $\mathbb{Z}^*_{n^{s+1}}$ and calculates

(8) $c'(i,j) = c(i,j) \cdot \left(r'(i,j)\right)^{n^s} \bmod n^{s+1}$

if the Damgard-Jurik cryptosystem is used for image encryption. We denote the binary representations of c(i, j) and c’(i, j) as $b_k(i,j)$ and $b'_k(i,j)$, respectively,

(9) $b_k(i,j) = \left\lfloor c(i,j) / 2^{k-1} \right\rfloor \bmod 2, \quad k = 1, 2, \ldots$

(10) $b'_k(i,j) = \left\lfloor c'(i,j) / 2^{k-1} \right\rfloor \bmod 2, \quad k = 1, 2, \ldots$

Clearly, the probability of $b_k(i,j) = b'_k(i,j)$ (k = 1, 2, …) is 1/2. We also define the sets

(11) $S_1 = \{(i,j) \mid b_1(i,j) \neq b'_1(i,j)\}$
$S_2 = \{(i,j) \mid b_2(i,j) \neq b'_2(i,j),\ b_1(i,j) = b'_1(i,j)\}$
$S_K = \{(i,j) \mid b_K(i,j) \neq b'_K(i,j),\ b_k(i,j) = b'_k(i,j),\ k = 1, 2, \ldots, K-1\}$

By viewing the k-th LSB of encrypted pixels as a wet paper channel (WPC) [26] and the k-th LSB in Sk as “dry” elements of the wet paper channel, the data-hider may employ the wet paper coding [26] to embed the additional data by replacing a part of c(i, j) with c’(i, j). The details will be given in the following.

Considering the first LSB, if c(i, j) are replaced with c’(i, j), the first LSB in S1 would be flipped and the rest first LSB would be unchanged. So, the first LSB of the encrypted pixels can be regarded as a WPC, which includes changeable (dry) elements and unchangeable (wet) elements. In other words, the first LSB in S1 are dry elements and the rest first LSB are wet positions. By using the wet paper coding [26], one can represent on average Nd bits by only flipping a part of dry elements where Nd is the number of dry elements. In this scenario, the data-hider may flip the dry elements by replacing c(i, j) with c’(i, j). Denoting the number of pixels in the image as N, the data-hider may embed on average N/2 bits in the first LSB-layer using wet paper coding.
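
To make the wet paper channel more concrete, here is a minimal Python sketch of one common way to realize wet paper coding: sender and receiver share a pseudo-random binary matrix D, and the sender solves a linear system over GF(2) so that only dry bits are flipped and D applied to the resulting bits equals the message. The matrix construction from a seed, the assumption that the receiver knows the message length, and all names are illustrative choices, not details from the paper or from [26].

```python
import random

def solve_gf2(rows, rhs, n_unknowns):
    """Solve A x = rhs over GF(2). Each row is an int bitmask (bit t = unknown t).
    Returns one solution as a list of bits, or None if the system is unsolvable."""
    rows, rhs = list(rows), list(rhs)
    pivots, r = [], 0
    for col in range(n_unknowns):
        if r == len(rows):
            break
        hit = next((i for i in range(r, len(rows)) if (rows[i] >> col) & 1), None)
        if hit is None:
            continue
        rows[r], rows[hit] = rows[hit], rows[r]
        rhs[r], rhs[hit] = rhs[hit], rhs[r]
        for i in range(len(rows)):
            if i != r and (rows[i] >> col) & 1:
                rows[i] ^= rows[r]
                rhs[i] ^= rhs[r]
        pivots.append(col)
        r += 1
    if any(rhs[i] for i in range(r, len(rows))):
        return None  # an all-zero row with a nonzero right-hand side
    x = [0] * n_unknowns
    for i, col in enumerate(pivots):
        x[col] = rhs[i]  # free unknowns are left at 0 (no flip)
    return x

def parity_matrix(key, q, n):
    rng = random.Random(key)  # assumption: D is derived from the shared key
    return [[rng.getrandbits(1) for _ in range(n)] for _ in range(q)]

def syndrome(D, bits):
    return [sum(d & b for d, b in zip(row, bits)) & 1 for row in D]

def wpc_embed(bits, dry, message, key):
    """Flip only positions listed in `dry` so that D * bits' = message (mod 2)."""
    D = parity_matrix(key, len(message), len(bits))
    rhs = [s ^ m for s, m in zip(syndrome(D, bits), message)]
    rows = [sum(row[j] << t for t, j in enumerate(dry)) for row in D]
    flips = solve_gf2(rows, rhs, len(dry))
    if flips is None:
        return None
    out = list(bits)
    for t, j in enumerate(dry):
        out[j] ^= flips[t]
    return out

def wpc_extract(bits, q, key):
    # The receiver does not need to know which positions were dry
    return syndrome(parity_matrix(key, q, len(bits)), bits)
```

In the lossless scheme, a flip of the dry bit at position (i, j) would correspond to replacing c(i, j) with c’(i, j). The sketch returns None when the random system happens to be unsolvable, in which case a sender could retry with a shorter message or a different matrix.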

Considering the second LSB (SLSB) layer, we call the SLSB in S2 as dry elements and the rest SLSB as wet elements. Note that the first LSB of ciphertext pixels in S1 have been determined by replacing c(i, j) with c’(i, j) or keeping c(i, j) unchanged in the first LSB-layer embedding, meaning that the SLSB in S1 are unchangeable in the second layer. Then, the data-hider may flip a part of SLSB in S2 by replacing c(i, j) with c’(i, j) to embed on average N/4 bits using wet paper coding.

Similarly, in the k-th LSB layer, the data-hider may flip a part of the k-th LSB in $S_k$ to embed on average $N/2^k$ bits. When the data embedding is implemented in K layers, a total of $N \cdot (1 - 1/2^K)$ bits, on average, are embedded. That implies the embedding rate, the ratio between the number of embedded bits and the number of pixels in the cover image, is approximately $(1 - 1/2^K)$, so the upper bound of the embedding rate is 1 bit per pixel. The next subsection will show that, although a part of c(i, j) is replaced with c’(i, j), the original plaintext image can still be obtained by decryption.
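
The layer assignment in (11) and the capacity estimate above can be sketched in a few lines of Python. The helper names are hypothetical, and the integers passed in merely stand in for ciphertext values c(i, j) and c’(i, j).

```python
def dry_layer(c, c_new, K):
    """Return k if the pixel belongs to S_k (lowest differing bit among the
    first K LSBs of c and c'), or None if the first K LSBs all agree."""
    diff = c ^ c_new
    for k in range(1, K + 1):
        if (diff >> (k - 1)) & 1:
            return k
    return None

def expected_capacity(N, K):
    # Sum of N / 2^k for k = 1..K, which equals N * (1 - 1 / 2^K)
    return sum(N / 2 ** k for k in range(1, K + 1))

print(dry_layer(0b1010, 0b1000, K=3))      # second LSB is the lowest differing bit
print(expected_capacity(512 * 512, K=3))   # average payload for a 512x512 image
```

With K = 3 the estimate is 7/8 of a bit per pixel, which shows how quickly the rate approaches the 1 bit per pixel bound as layers are added.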

C. Data extraction and image decryption

After receiving an encrypted image containing the additional data, if the receiver knows the data-hiding key, he may calculate the k-th LSB of encrypted pixels, and then extract the embedded data from the K LSB-layers using wet paper coding. On the other hand, if the receiver knows the private key of the used cryptosystem, he may perform decryption to obtain the original plaintext image. When Paillier cryptosystem is used, Equation (4) implies

(12) $c(i,j) = g^{m(i,j)} \cdot \left(r(i,j)\right)^{n} + \alpha \cdot n^2$

where α is an integer. By substituting (12) into (7), there is

(13) $c'(i,j) = g^{m(i,j)} \cdot \left(r(i,j) \cdot r'(i,j)\right)^{n} \bmod n^2$

Since r(i, j)⋅r’(i, j) can be viewed as another random integer in $\mathbb{Z}^*_n$, the decryption on c’(i, j) will result in the plaintext value,

(14) $m(i,j) = L\left(c'(i,j)^{\lambda} \bmod n^2\right) \cdot \mu \bmod n$

Similarly, when the Damgard-Jurik cryptosystem is used,

(15) $c'(i,j) = g^{m(i,j)} \cdot \left(r(i,j) \cdot r'(i,j)\right)^{n^s} \bmod n^{s+1}$

The decryption on c’(i, j) will also result in the plaintext value. In other words, the replacement of ciphertext pixel values for data embedding does not affect the decryption result.
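
The identity behind (13) can be checked numerically without any private key: re-randomizing a Paillier ciphertext gives exactly the ciphertext that the combined random value r⋅r’ would have produced for the same plaintext. The snippet below is a small sanity check under assumed toy parameters (small primes, g = n + 1, fixed random values); none of these values come from the paper.

```python
n = 1009 * 1013          # toy modulus (assumption), far too small for real use
n2 = n * n
g = n + 1                # assumption: simple valid choice of g

def encrypt_with(m, r):
    # Eq. (4) with an explicitly chosen random value r
    return pow(g, m, n2) * pow(r, n, n2) % n2

m, r, r_new = 137, 12345, 67890
c = encrypt_with(m, r)
c_new = c * pow(r_new, n, n2) % n2             # Eq. (7): re-randomization
same = c_new == encrypt_with(m, r * r_new % n)  # Eq. (13)
print(same, c != c_new)
```

Because c’(i, j) is itself a valid encryption of m(i, j), decrypting it with (14) returns the original pixel, while its bits differ from those of c(i, j) and can therefore carry data.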

III. REVERSIBLE DATA HIDING SCHEME

This section proposes a reversible data hiding scheme for public-key-encrypted images. In the reversible scheme, a preprocessing is employed to shrink the image histogram, and then each pixel is encrypted with additive homomorphic cryptosystem by the image provider. When having the encrypted image, the data-hider modifies the ciphertext pixel values to embed a bit-sequence generated from the additional data and error-correction codes. Due to the homomorphic property, the modification in encrypted domain will result in slight increase/decrease on plaintext pixel values, implying that a decryption can be implemented to obtain an image similar to the original plaintext image on receiver side. Because of the histogram shrink before encryption, the data embedding operation does not cause any overflow/underflow in the directly decrypted image. Then, the original plaintext image can be recovered and the embedded additional data can be extracted from the directly decrypted image. Note that the data-extraction and content-recovery of the reversible scheme are performed in plaintext domain, while the data extraction of the previous lossless scheme is performed in encrypted domain and the content recovery is needless. The sketch of reversible data hiding scheme is given in Figure 2.

Figure 2. Sketch of reversible data hiding scheme for public-key-encrypted images. (Diagram blocks: image provider: original image, histogram shrink, image encryption, encrypted image; data-hider: additional data, data embedding, encrypted image containing embedded data; receiver: decryption, data extraction & image recovery yielding the additional data and the original image.)

A. Histogram shrink and image encryption

In the reversible scheme, a small integer δ shared by the image provider, the data-hider and the receiver will be used, and its value will be discussed later. Denote the number of pixels in the original plaintext image with gray value v as hv, implying

(16) $\sum_{v=0}^{255} h_v = N$

where N is the number of all pixels in the image. The image provider collects the pixels with gray values in [0, δ+1], and represents their values as a binary stream BS1. When an efficient lossless source coding is used, the length of BS1 is

(17) $l_1 \approx \left(\sum_{v=0}^{\delta+1} h_v\right) \cdot H\left(\dfrac{h_0}{\sum_{v=0}^{\delta+1} h_v},\ \dfrac{h_1}{\sum_{v=0}^{\delta+1} h_v},\ \ldots,\ \dfrac{h_{\delta+1}}{\sum_{v=0}^{\delta+1} h_v}\right)$

where H(⋅) is the entropy function. The image provider also collects the pixels with gray values in [255−δ, 255], and represents their values as a binary stream BS2 with a length l2. Similarly,

(18) $l_2 \approx \left(\sum_{v=255-\delta}^{255} h_v\right) \cdot H\left(\dfrac{h_{255-\delta}}{\sum_{v=255-\delta}^{255} h_v},\ \dfrac{h_{255-\delta+1}}{\sum_{v=255-\delta}^{255} h_v},\ \ldots,\ \dfrac{h_{255}}{\sum_{v=255-\delta}^{255} h_v}\right)$

Then, the gray values of all pixels are enforced into [δ+1, 255−δ],

(19) $m_S(i,j) = \begin{cases} 255-\delta, & \text{if } m(i,j) \geq 255-\delta \\ m(i,j), & \text{if } \delta+1 < m(i,j) < 255-\delta \\ \delta+1, & \text{if } m(i,j) \leq \delta+1 \end{cases}$

Denoting the new histogram as $h'_v$, there must be

(20) $h'_v = \begin{cases} 0, & v \leq \delta \\ \sum_{u=0}^{\delta+1} h_u, & v = \delta+1 \\ h_v, & \delta+1 < v < 255-\delta \\ \sum_{u=255-\delta}^{255} h_u, & v = 255-\delta \\ 0, & v > 255-\delta \end{cases}$

The image provider finds the peak of the new histogram,

(21) $V = \arg\max_{\delta+1 \leq v \leq 255-\delta} h'_v$

The image provider also divides all pixels into two sets: the first set including (N−8) pixels and the second set including the rest 8 pixels, and maps each bit of BS1, BS2 and the LSB of pixels in the second set to a pixel in the first set with gray value V. Since the gray values close to extreme black/white are rare, there is

(22) $h'_V \geq l_1 + l_2 + 16$

when δ is not too large. In this case, the mapping operation is feasible. Here, 8 pixels in the second set cannot be used to carry BS1/BS2 since their LSB should be used to carry the value of V, while 8 pixels in the first set cannot be used to carry BS1/BS2 since their LSB should be used to carry the original LSB of the second set. So, a total of 16 pixels cannot be used for carrying BS1/BS2, which is the reason for the value 16 in (22). The experimental result on 1000 natural images shows that (22) always holds when δ is less than 15. So, we recommend the parameter δ < 15. Then, a histogram shift operation is made,

(23) $m_T(i,j) = \begin{cases} m_S(i,j), & \text{if } m_S(i,j) > V \\ V, & \text{if } m_S(i,j) = V \text{ and the corresponding bit is 0} \\ V-1, & \text{if } m_S(i,j) = V \text{ and the corresponding bit is 1} \\ m_S(i,j)-1, & \text{if } m_S(i,j) < V \end{cases}$

In other words, BS1, BS2 and the LSB of pixels in the second set are carried by the pixels in the first set. After this, the image provider represents the value of V as 8 bits and maps them to the pixels in the second set in a one-to-one manner. Then, the values of pixels in the second set are modified as follows,

(24) $m_T(i,j) = \begin{cases} m_S(i,j), & \text{if the LSB of } m_S(i,j) \text{ is the same as the corresponding bit} \\ m_S(i,j)-1, & \text{if the LSB of } m_S(i,j) \text{ differs from the corresponding bit} \end{cases}$

That means the value of V is embedded into the LSB of the second set. This way, all pixel values must fall into [δ, 255−δ].
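
The histogram shrink of (19) and the histogram shift of (23) can be sketched on a flat list of gray values as follows. This is a simplified illustration under stated assumptions: the side information is kept as raw value lists instead of losslessly coded streams BS1/BS2, the 8-pixel second set that carries V is ignored, peak pixels without a corresponding bit are treated as carrying 0, and all function names are hypothetical.

```python
from collections import Counter

def shrink(pixels, delta):
    """Eq. (19): clamp gray values into [delta+1, 255-delta] and keep the
    original values of the clamped pixels as side information."""
    lo, hi = delta + 1, 255 - delta
    low_side = [m for m in pixels if m <= lo]    # raw content behind BS1
    high_side = [m for m in pixels if m >= hi]   # raw content behind BS2
    shrunk = [min(max(m, lo), hi) for m in pixels]
    return shrunk, low_side, high_side

def peak_value(shrunk):
    # Eq. (21): most frequent gray value of the shrunk image
    return Counter(shrunk).most_common(1)[0][0]

def shift_and_embed(shrunk, V, bits):
    """Eq. (23): values below V move down by one, values above V stay,
    and each pixel equal to V carries one bit (V for 0, V-1 for 1)."""
    bit_iter = iter(bits)
    shifted = []
    for m in shrunk:
        if m > V:
            shifted.append(m)
        elif m < V:
            shifted.append(m - 1)
        else:
            shifted.append(V - next(bit_iter, 0))  # assumption: pad with 0
    return shifted

pixels = [0, 3, 100, 100, 100, 254, 120]
shrunk, low_side, high_side = shrink(pixels, delta=2)
V = peak_value(shrunk)
shifted = shift_and_embed(shrunk, V, [1, 0, 1])
print(shrunk, V, shifted)
```

After both steps every value lies in [δ, 255−δ], so a later change of ±δ in the plaintext domain cannot leave the range [0, 255].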

At last, the image provider encrypts all pixels using a public key cryptosystem with additive homomorphic property, such as Paillier and Damgard-Jurik cryptosystems. When Paillier cryptosystem is used, the ciphertext pixel is

(25) $c(i,j) = g^{m_T(i,j)} \cdot \left(r(i,j)\right)^{n} \bmod n^2$

And, when the Damgard-Jurik cryptosystem is used, the ciphertext pixel is

(26) $c(i,j) = g^{m_T(i,j)} \cdot \left(r(i,j)\right)^{n^s} \bmod n^{s+1}$

Then, the ciphertext values of all pixels are collected to form an encrypted image.

B. Data embedding

With the encrypted image, the data-hider divides the ciphertext pixels into two sets: Set A including c(i, j) with odd value of (i+j), and Set B including c(i, j) with even value of (i+j). Without loss of generality, we suppose the pixel number in Set A is N/2. Then, the data-hider employs error-correction codes to expand the additional data into a bit-sequence with length N/2, and maps the bits in the coded bit-sequence to the ciphertext pixels in Set A in a one-to-one manner, which is determined by the data-hiding key. When the Paillier cryptosystem is used, if the bit is 0, the corresponding ciphertext pixel is modified as

(27) $c'(i,j) = c(i,j) \cdot g^{n-\delta} \cdot \left(r'(i,j)\right)^{n} \bmod n^2$

where r’(i, j) is an integer randomly selected in $\mathbb{Z}^*_n$. If the bit is 1, the corresponding ciphertext pixel is modified as

(28) $c'(i,j) = c(i,j) \cdot g^{\delta} \cdot \left(r'(i,j)\right)^{n} \bmod n^2$

When the Damgard-Jurik cryptosystem is used, if the bit is 0, the corresponding ciphertext pixel is modified as

(29) $c'(i,j) = c(i,j) \cdot g^{n^s-\delta} \cdot \left(r'(i,j)\right)^{n^s} \bmod n^{s+1}$

where r’(i, j) is an integer randomly selected in $\mathbb{Z}^*_{n^{s+1}}$. If the bit is 1, the corresponding ciphertext pixel is modified as

(30) $c'(i,j) = c(i,j) \cdot g^{\delta} \cdot \left(r'(i,j)\right)^{n^s} \bmod n^{s+1}$

This way, an encrypted image containing additional data is produced. Note that the additional data are embedded into Set A. Although the pixels in Set B may provide side information of the pixel-values in Set A, which will be used for data extraction, the pixel-values in Set A are difficult to be precisely obtained on receiver side, leading to possible errors in directly extracted data. So, the error-correction coding mechanism is employed here to ensure successful data extraction and perfect image recovery.
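
The effect of (27) and (28) on the plaintext can be illustrated with a compact Paillier sketch: multiplying the ciphertext by g^δ or g^(n−δ) raises or lowers the decrypted pixel by δ. The toy primes, the choice g = n + 1, and the helper names are assumptions for illustration only, and the error-correction coding step is omitted.

```python
import math
import random

p, q = 1009, 1013                      # toy primes (assumption), not secure
n = p * q
n2 = n * n
g = n + 1                              # assumption: simple valid choice of g
lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
mu = pow((pow(g, lam, n2) - 1) // n, -1, n)

def random_unit():
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            return r

def encrypt(m):
    return pow(g, m, n2) * pow(random_unit(), n, n2) % n2

def decrypt(c):
    return (pow(c, lam, n2) - 1) // n * mu % n

def embed_bit(c, bit, delta):
    # Eq. (28) for bit 1 (adds delta), Eq. (27) for bit 0 (subtracts delta)
    exponent = delta if bit == 1 else n - delta
    return c * pow(g, exponent, n2) * pow(random_unit(), n, n2) % n2

c = encrypt(100)
print(decrypt(embed_bit(c, 1, delta=4)), decrypt(embed_bit(c, 0, delta=4)))
```

The data-hider never sees the plaintext value 100; the shift by ±δ happens entirely through operations on the ciphertext and becomes visible only after decryption.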

C. Image decryption, data extraction and content recovery

After receiving an encrypted image containing additional data, the receiver firstly performs decryption using his private key. We denote the decrypted pixels as m’(i, j). Due to the homomorphic property, the decrypted pixel values in Set A meet

(31) $m'(i,j) = \begin{cases} m_T(i,j) + \delta, & \text{if the corresponding bit is 1} \\ m_T(i,j) - \delta, & \text{if the corresponding bit is 0} \end{cases}$

On the other hand, the decrypted pixel values in Set B are just mT(i, j) since their ciphertext values are unchanged in data embedding phase. When δ is small, the decrypted image is perceptually similar to the original plaintext image.

Then, the receiver with the data-hiding key can extract the embedded data from the directly decrypted image. He estimates the pixel values in Set A using their neighbors,

(32) $\bar{m}_T(i,j) = \dfrac{m_T(i-1,j) + m_T(i,j-1) + m_T(i+1,j) + m_T(i,j+1)}{4}$

and obtains an estimated version of the coded bit-sequence by comparing the decrypted and estimated pixel values in Set A. That means the coded bit is estimated as 0 if $\bar{m}_T(i,j) > m'(i,j)$ or as 1 if $\bar{m}_T(i,j) \leq m'(i,j)$. With the estimate of the coded bit-sequence, the receiver may employ the error-correction method to retrieve the original coded bit-sequence and the embedded additional data. Note that, with a larger δ, the error rate in the estimate of coded bits would be lower, so that more additional data can be embedded while still ensuring successful error correction and data extraction. In other words, a smaller δ would result in a higher error rate in the estimate of coded bits, so that the error correction may be unsuccessful when excessive payload is embedded. That means the embedding capacity of the reversible data hiding scheme depends on the value of δ.
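
The neighbor-based estimate of (32) and the resulting bit decision can be sketched as below. The sketch assumes the decrypted image is a list of rows, handles only interior Set A pixels (the treatment of border pixels is not specified in the text), and leaves out the key-dependent ordering and the error-correction decoding; the function names are hypothetical.

```python
def interior_set_a(height, width):
    # Set A: positions with odd (i + j); only pixels with four neighbors
    return [(i, j) for i in range(1, height - 1)
            for j in range(1, width - 1) if (i + j) % 2 == 1]

def estimate_bits(dec, positions):
    """Estimate each coded bit by comparing the decrypted pixel m'(i, j)
    with the mean of its four Set B neighbors, as in Eq. (32)."""
    bits = []
    for i, j in positions:
        est = (dec[i - 1][j] + dec[i][j - 1] + dec[i + 1][j] + dec[i][j + 1]) / 4
        bits.append(0 if est > dec[i][j] else 1)
    return bits

dec = [[100] * 4 for _ in range(3)]   # flat toy image, all Set B values are 100
dec[1][2] = 104                        # Set A pixel raised by delta = 4 (bit 1)
positions = interior_set_a(3, 4)
print(positions, estimate_bits(dec, positions))
```

On real images the neighbor mean only approximates the true pixel value, which is why some estimated bits are wrong and the error-correction code is needed.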

After retrieving the original coded bit-sequence and the embedded additional data, the original plaintext image may be further recovered. For the pixels in Set A, mT(i, j) are retrieved according to the coded bit-sequence,

(33) $m_T(i,j) = \begin{cases} m'(i,j) - \delta, & \text{if the corresponding bit is 1} \\ m'(i,j) + \delta, & \text{if the corresponding bit is 0} \end{cases}$

For the pixels in Set B, as mentioned above, mT(i, j) are just m’(i, j). Then, the receiver divides all mT(i, j) into two sets: the first one including (N−8) pixels and the second one including the remaining 8 pixels. The receiver may obtain the value of V from the LSB in the second set, and retrieve mS(i, j) of the first set,

(34) $m_S(i,j) = \begin{cases} m_T(i,j), & \text{if } m_T(i,j) > V \\ V, & \text{if } m_T(i,j) = V \text{ or } V-1 \\ m_T(i,j)+1, & \text{if } m_T(i,j) < V-1 \end{cases}$

Meanwhile, the receiver extracts a bit 0 from a pixel with mT(i, j) = V and a bit 1 from a pixel with mT(i, j) = V−1. After decomposing the extracted data into BS1, BS2 and the LSB of mS(i, j) in the second set, the receiver retrieves mS(i, j) of the second set,

(35) $m_S(i,j) = \begin{cases} m_T(i,j), & \text{if the LSB of } m_T(i,j) \text{ and } m_S(i,j) \text{ are the same} \\ m_T(i,j)+1, & \text{if the LSB of } m_T(i,j) \text{ and } m_S(i,j) \text{ are different} \end{cases}$

Collect all pixels with mS(i, j) = δ+1, and, according to BS1, recover their original values within [0, δ+1]. Similarly, the original values of pixels with mS(i, j) = 255−δ are recovered within [255−δ, 255] according to BS2. This way, the original plaintext image is recovered.
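
The receiver-side inversion in (34) and the final restoration of the clamped values can be sketched on a flat list of pixels as follows. As a simplifying assumption, the side information is passed in as raw lists of the original low-end and high-end values (in scan order) rather than decoded from BS1 and BS2, the 8-pixel second set is ignored, and the function names are hypothetical.

```python
def shift_extract(shifted, V):
    """Eq. (34): undo the histogram shift and read one bit from every
    pixel equal to V (bit 0) or V-1 (bit 1)."""
    restored, bits = [], []
    for m in shifted:
        if m > V:
            restored.append(m)
        elif m == V:
            restored.append(V)
            bits.append(0)
        elif m == V - 1:
            restored.append(V)
            bits.append(1)
        else:
            restored.append(m + 1)
    return restored, bits

def unshrink(restored, delta, low_side, high_side):
    """Put back the original values of pixels that were clamped to
    delta+1 or 255-delta, consuming the side information in scan order."""
    lo, hi = delta + 1, 255 - delta
    low, high = iter(low_side), iter(high_side)
    return [next(low) if m == lo else next(high) if m == hi else m
            for m in restored]

m_T = [2, 2, 99, 100, 99, 253, 120]    # toy values after shrink and shift
m_S, bits = shift_extract(m_T, V=100)
original = unshrink(m_S, delta=2, low_side=[0, 3], high_side=[254])
print(m_S, bits, original)
```

Since every step is inverted exactly, the recovered list matches the original gray values, which is the sense in which the scheme is reversible.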

IV. COMBINED DATA HIDING SCHEME

As described in Sections 2 and 3, a lossless and a reversible data hiding scheme for public-key-encrypted images are proposed. In both schemes, the data embedding operations are performed in the encrypted domain. On the other hand, the data extraction procedures of the two schemes are very different. With the lossless scheme, data embedding does not affect the plaintext content and data extraction is also performed in the encrypted domain. With the reversible scheme, there is slight distortion in the directly decrypted image caused by data embedding, and data extraction and image recovery must be performed in the plaintext domain. That implies that, on the receiver side, the additional data embedded by the lossless scheme cannot be extracted after decryption, while the additional data embedded by the reversible scheme cannot be extracted before decryption. In this section, we combine the lossless and reversible schemes to construct a new scheme, in which data extraction in either of the two domains is feasible. That means additional data for various purposes may be embedded into an encrypted image, and one part of the additional data can be extracted before decryption and another part can be extracted after decryption.

In the combined scheme, the image provider performs histogram shrink and image encryption as described in Subsection 3.A. Upon receiving the encrypted image, the data-hider may embed the first part of additional data using the method described in Subsection 3.B. Denoting the ciphertext pixel values containing the first part of additional data as c’(i, j), the data-hider calculates

$$c''(i,j) = c'(i,j) \cdot \bigl(r''(i,j)\bigr)^{n} \bmod n^{2} \tag{36}$$

or

$$c''(i,j) = c'(i,j) \cdot \bigl(r''(i,j)\bigr)^{n^{s}} \bmod n^{s+1} \tag{37}$$

where r”(i, j) are randomly selected in $Z^*_n$ or $Z^*_{n^{s+1}}$ for the Paillier and Damgard-Jurik cryptosystems, respectively. Then, he may employ wet paper coding in several LSB-planes of ciphertext pixel values to embed the second part of additional data by replacing a part of c’(i, j) with c”(i, j). In other words, the method described in Subsection 2.B is used to embed the second part of additional data. On the receiver side, the receiver first extracts the second part of additional data from the LSB-planes in the encrypted domain. Then, after decryption with his private key, he extracts the first part of additional data and recovers the original plaintext image from the directly decrypted image as described in Subsection 3.C. The sketch of the combined scheme is shown in Figure 3. Note that, since the reversibly embedded data should be extracted in the plaintext domain and the lossless embedding does not affect the decrypted result, the lossless embedding should be implemented after the reversible embedding in the combined scheme.
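The reason the replacement of c’(i, j) by c”(i, j) is lossless can be checked on a toy Paillier instance: multiplying a ciphertext by the n-th power of a random unit, as in Eq. (36), changes the ciphertext but not the decrypted value. The sketch below assumes g = n + 1 and uses tiny primes purely for illustration; the parameter choices and helper names are not taken from the paper, and the modular inverse via `pow(x, -1, n)` needs Python 3.8 or later.

```python
from math import gcd

# Toy Paillier parameters (far too small for real use; illustration only).
p, q = 11, 13
n = p * q
n2 = n * n
g = n + 1                                      # common simplifying choice, assumed here
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)   # lcm(p-1, q-1)
mu = pow(lam, -1, n)                           # valid because g = n + 1


def encrypt(m, r):
    """Paillier encryption c = g^m * r^n mod n^2, with r coprime to n."""
    return (pow(g, m, n2) * pow(r, n, n2)) % n2


def decrypt(c):
    """Paillier decryption m = L(c^lam mod n^2) * mu mod n, with L(x) = (x-1)/n."""
    return ((pow(c, lam, n2) - 1) // n) * mu % n


def rerandomize(c, r2):
    """Eq. (36): multiply by r2^n mod n^2; the plaintext is unchanged."""
    return (c * pow(r2, n, n2)) % n2


c1 = encrypt(97, r=5)        # hypothetical pixel value 97
c2 = rerandomize(c1, r2=7)   # a different ciphertext for the same pixel
```

Because `c1` and `c2` decrypt to the same pixel value, the data-hider is free to choose between them according to the LSBs required by the wet paper code.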

[Figure 3: block diagram of the combined scheme. Image provider: original image, histogram shrink, image encryption, encrypted image. Data-hider: reversible data embedding (additional data 1), then lossless data embedding (additional data 2), giving the encrypted image containing embedded data. Receiver: data extraction in encrypted domain (additional data 2), decryption, then data extraction & image recovery in plaintext domain (additional data 1 and the original image).]

Figure 3. Sketch of combined scheme

V. EXPERIMENTAL RESULTS

Four gray images sized 512×512, Lena, Man, Plane and Crowd, shown in Figure 4, and 50 natural gray images sized 1920×2560, which contain landscape and people, were used as the original plaintext covers in the experiment. With the lossless scheme, all pixels in the cover images were first encrypted using the Paillier cryptosystem, and then the additional data were embedded into the LSB-planes of ciphertext pixel-values using multi-layer wet paper coding as in Subsection 2.B. Table 1 lists the average value of embedding rates when K LSB-planes were used for carrying the additional data in the 54 encrypted images. In fact, the average embedding rate is very close to (1−1/2^K). On the receiver side, the embedded data can be extracted from the encrypted domain. Also, the original plaintext images can be retrieved by direct decryption. In other words, when the decryption was performed on the encrypted images containing additional data, the original plaintext images were obtained.

With the reversible scheme, all pixels were encrypted after histogram shrink as in Subsection 3.A. Then, half of the ciphertext pixels were modified to carry the additional data as in Subsection 3.B, and after decryption, we implemented the data extraction and image recovery in the plaintext domain. Here, low-density parity-check (LDPC) coding was used to expand the additional data as a bit-sequence in the data embedding phase, and to retrieve the coded bit-sequence and the embedded additional data on the receiver side. Although the error-correction mechanism was employed, an excessive payload may cause the failure of data extraction and image recovery. With a larger value of δ, a higher embedding capacity could be ensured, while a higher distortion would be introduced into the directly decrypted image. For instance, when using Lena as the cover and δ = 4, a total of 4.6×10^4 bits were embedded and the value of PSNR in the directly decrypted image was 40.3 dB. When using δ = 7, a total of 7.7×10^4 bits were embedded and the value of PSNR in the directly decrypted image was 36.3 dB. In both cases, the embedded additional data and the original plaintext image were extracted and recovered without any error. Figure 5 gives the two directly decrypted images. Figure 6 shows the rate-distortion curves generated from different cover images and various values of δ under the condition of successful data-extraction/image-recovery. The abscissa represents the pure embedding rate, and the ordinate is the PSNR value in the directly decrypted image. The rate-distortion curves on the four test images, Lena, Man, Plane and Crowd, are given in Figure 6. We also used 50 natural gray images sized 1920×2560 as the original plaintext covers, and calculated the average values of embedding rates and PSNR values, which are also shown as a curve marked by asterisks in the figure.
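To put the reported totals in perspective, the short sketch below converts them into bits per pixel for a 512×512 cover and shows the standard PSNR definition for 8-bit images. It assumes the reported totals can simply be divided by the pixel count; the helper names and the four-pixel example are hypothetical.

```python
import math


def embedding_rate_bpp(bits, width, height):
    """Embedded bits divided by the number of pixels (bits per pixel)."""
    return bits / (width * height)


def psnr(original, distorted, peak=255):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    mse = sum((a - b) ** 2 for a, b in zip(original, distorted)) / len(original)
    return math.inf if mse == 0 else 10 * math.log10(peak ** 2 / mse)


rate_delta4 = embedding_rate_bpp(4.6e4, 512, 512)   # about 0.175 bpp for delta = 4
rate_delta7 = embedding_rate_bpp(7.7e4, 512, 512)   # about 0.294 bpp for delta = 7

# Tiny hypothetical example: two of four pixels are off by 2, so the MSE is 2.
example_psnr = psnr([10, 20, 30, 40], [12, 18, 30, 40])
```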
Furthermore, Figure 7 compares the average rate-PSNR performance between the proposed reversible scheme with public-key cryptosystems and several previous methods with symmetric cryptosystems under a condition that the original plaintext image can be recovered without any error using the data-hiding and encryption keys. In [11] and [12], each block of encrypted image with given size is used to carry one additional bit. So, the embedding rates of the two works are fixed and low. With various parameters, we obtain the performance curves of the method in [15] and the proposed reversible scheme, which are shown in the figure. It can be seen that the proposed reversible scheme significantly outperforms the previous methods when the embedding rate is larger than 0.01 bpp. With the combined scheme, we implemented the histogram shrink operation with a value of parameter δ, and encrypted the

Joint Beamforming, Power and Channel Allocation in Multi-User and Multi-Channel Underlay MISO Cogniti

05/08/201902/07/2019 by admin

Efficient Top-k Retrieval on Massive Data

05/08/201902/07/2019 by admin

Abstract:

Top-k query is an important operation to return a set of interesting points in a potentially huge data space. This paper shows that the existing algorithms cannot process top-k queries on massive data efficiently, and proposes a novel table-scan-based T2S algorithm to efficiently compute top-k results on massive data. T2S first constructs the presorted table, whose tuples are arranged in the order of the round-robin retrieval on the sorted lists. T2S maintains only a fixed number of tuples to compute results. The early termination checking for T2S is presented in this paper, along with the analysis of scan depth. Selective retrieval is devised to skip the tuples in the presorted table which are not top-k results. The theoretical analysis proves that selective retrieval can reduce the number of retrieved tuples significantly. The construction and incremental-update/batch-processing methods for the data structures used are also proposed.

Introduction:

Top-k query is an important operation to return a set of interesting points from a potentially huge data space. In a top-k query, a ranking function F is provided to determine the score of each tuple, and the k tuples with the largest scores are returned. Due to its practical importance, top-k query has attracted extensive attention. This paper proposes a novel table-scan-based T2S algorithm (Top-k by Table Scan) to compute top-k results on massive data efficiently.
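As a point of reference, the top-k query itself (score every tuple with F and return the k best) can be written as a naive full scan. This is only the baseline definition, not T2S; the sample tuples and the choice of F as a plain sum are hypothetical.

```python
import heapq


def top_k(tuples, score, k):
    """Return the k tuples with the largest scores under the ranking function."""
    return heapq.nlargest(k, tuples, key=score)


rows = [(9, 1), (4, 8), (7, 7), (2, 3)]   # hypothetical two-attribute tuples
f = lambda t: t[0] + t[1]                 # hypothetical ranking function F
best = top_k(rows, f, 2)                  # [(7, 7), (4, 8)]
```

This baseline touches every tuple on every query, which is the cost that approaches such as T2S aim to avoid on massive data.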

The analysis of scan depth in T2S is also developed. Because the result size k is usually small and the vast majority of the tuples retrieved in the presorted table (PT) are not top-k results, this paper devises selective retrieval to skip the tuples in PT which are not query results. The theoretical analysis proves that selective retrieval can reduce the number of retrieved tuples significantly.

The construction and incremental-update/batch-processing methods for the data structures are proposed in this paper. Extensive experiments are conducted on synthetic and real-life data sets.

Existing System:

Due to its practical importance, top-k query has attracted extensive attention. The existing top-k algorithms can be classified into three types: index-based methods, view-based methods, and sorted-list-based methods. Index-based methods (or view-based methods) make use of pre-constructed indexes or views to process a top-k query.

Because a concrete index or view is constructed on a specific subset of attributes, a number of indexes or views exponential in the number of attributes would have to be built to cover the actual queries, which is prohibitively expensive. In practice, the indexes or views can only be built on a small and selective set of attribute combinations.

Sorted-list-based methods retrieve the sorted lists in a round-robin fashion, maintain the retrieved tuples, and update their lower-bound and upper-bound scores. When the kth largest lower-bound score is not less than the upper-bound scores of the other candidates, the k candidates with the largest lower-bound scores are the top-k results.

Sorted-list-based methods compute top-k results by retrieving the involved sorted lists and can naturally support the actual queries. However, this paper shows that the numbers of tuples retrieved and maintained in these methods increase exponentially with the number of attributes and polynomially with the number of tuples and the result size.

Disadvantages:

High computational overhead.
High data redundancy.
Time-consuming processing.

Problem Definition:

Ranking is a central part of many information retrieval problems, such as document retrieval, collaborative filtering, sentiment analysis, and computational advertising (online ad placement).

Training data consists of queries and the documents matching them, together with the relevance degree of each match. It may be prepared manually by human assessors (or raters, as Google calls them), who check results for some queries and determine the relevance of each result. It is not feasible to check the relevance of all documents, so typically a technique called pooling is used: only the top few documents retrieved by some existing ranking models are checked.

Typically, users expect a search query to complete in a short time (such as a few hundred milliseconds for web search), which makes it impossible to evaluate a complex ranking model on each document in the corpus, and so a two-phase scheme is used.

Proposed System:

The proposed system uses layered indexing to organize the tuples into multiple consecutive layers, so that the top-k results can be computed from at most k layers of tuples. A layer-based Pareto-Based Dominant Graph is also proposed to express the dominance relationship between records, and the top-k query is implemented as a graph traversal problem.

A dual-resolution layer structure is then proposed: a top-k query can be processed efficiently by traversing the dual-resolution layer through the relationships between tuples. The Hybrid-Layer Index is proposed, which integrates layer-level filtering and list-level filtering to significantly reduce the number of tuples retrieved in query processing. View-based algorithms are proposed to pre-construct the specified materialized views according to some ranking functions.

Given a top-k query, one or more optimal materialized views are selected to return the top-k results efficiently. LPTA+ is proposed to significantly improve the efficiency of the state-of-the-art LPTA algorithm. Because the materialized views are cached in memory, LPTA+ can reduce the iterative calling of the linear programming sub-procedure, thus greatly improving the efficiency over the LPTA algorithm. In practical applications, a concrete index (or view) is built on a specific subset of attributes. Due to the prohibitively expensive overhead of covering all attribute combinations, the indexes (or views) can only be built on a small and selective set of attribute combinations.

If the attribute combinations of the top-k query are fixed, index-based or view-based methods can provide superior performance. However, on massive data, users often issue ad-hoc queries. It is very likely that the indexes (or views) involved in the ad-hoc queries have not been built, so the practicability of these methods is greatly limited.

Correspondingly, T2S builds only the presorted table, on which a top-k query on any attribute combination can be handled. This reduces the space overhead significantly compared with index-based (or view-based) methods, and makes T2S practical.

Advantages:

The evaluation of an information retrieval system is the process of assessing how well a system meets the information needs of its users.
Traditional evaluation metrics, designed for Boolean retrieval or top-k retrieval, include precision and recall.
All common measures described here assume a ground truth notion of relevancy: every document is known to be either relevant or non-relevant to a particular query.
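Precision and recall, as mentioned above, can be computed directly from a retrieved set and a ground-truth relevant set. The sketch below is a minimal illustration with hypothetical document identifiers.

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved, recall = hits / relevant (0.0 for empty sets)."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: four documents returned, three documents truly relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
# Two hits (d2, d4): precision is 2/4 and recall is 2/3.
```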

Modules:

Multi-keyword ranked search:

To design search schemes which allow multi-keyword query and provide result similarity ranking for effective data retrieval, instead of returning undifferentiated results.

Privacy-preserving:

To prevent the cloud server from learning additional information from the data set and the index, and to meet privacy requirements. If the cloud server deduces any association between keywords and encrypted documents from the index, it may learn the major subject of a document, or even the content of a short document. Therefore, the searchable index should be constructed to prevent the cloud server from performing this kind of association attack.

Efficiency:

The above goals on functionality and privacy should be achieved with low communication and computation overhead. Assume the number of query keywords appearing in a document is xi; the final similarity score is then a linear function of xi, where the coefficient r is set as a positive random number. However, because the random factor εi is introduced as a part of the similarity score, the final search result obtained by sorting similarity scores may not be as accurate as that in the original scheme. For the sake of search accuracy, we can let εi follow a normal distribution, where the standard deviation functions as a flexible tradeoff parameter between search accuracy and security.
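The accuracy/security tradeoff described above can be illustrated with a small simulation in which each score is r times the keyword-match count plus a normally distributed random factor. The exact score formula is not given in the text, so the form `r * x + noise`, the function names, and the sample counts below are assumptions made only for illustration.

```python
import random


def noisy_scores(x, r, sigma, mu=0.0, seed=0):
    """Assumed score model: r * x_i plus a random factor drawn from N(mu, sigma)."""
    rng = random.Random(seed)
    return [r * xi + rng.gauss(mu, sigma) for xi in x]


def ranking(scores):
    """Document indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


x = [3, 1, 2, 5]                 # hypothetical keyword-match counts per document
exact = ranking(x)               # ranking without any random factor: [3, 0, 2, 1]

# With sigma = 0 the random factor vanishes and the exact ranking is preserved.
same = ranking(noisy_scores(x, r=2.0, sigma=0.0))

# A large sigma hides the true counts better but may reorder the results.
blurred = ranking(noisy_scores(x, r=2.0, sigma=10.0))
```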

Conclusion:

The proposed T2S algorithm was implemented successfully; it efficiently returns top-k results on massive data by sequentially scanning the presorted table, in which the tuples are arranged in the order of round-robin retrieval on the sorted lists. Only a fixed number of candidates needs to be maintained in T2S. This paper proposes early termination checking and the analysis of the scan depth. Selective retrieval is devised in T2S, and it is shown that most of the candidates in the presorted table can be skipped. The experimental results show that T2S significantly outperforms the existing algorithm.

Future Enhancement:

Future development of the multi-keyword ranked search scheme should explore checking the integrity of the rank order in the search result returned by the untrusted network server infrastructure.

Feature Enhancement:

A novel table-scan-based T2S algorithm was implemented successfully to compute top-k results on massive data efficiently. Given table T, T2S first presorts T to obtain table PT (Presorted Table), whose tuples are arranged in the order of the round-robin retrieval on the sorted lists. During its execution, T2S maintains only a fixed and small number of tuples to compute results. It is proved that T2S has the characteristic of early termination: it does not need to examine all tuples in PT to return results.
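The idea of a presorted table scanned with early termination can be sketched as follows. This is not the T2S algorithm itself: the text does not give its termination check or its selective retrieval, so the sketch substitutes the classic threshold bound for a monotone ranking function (an assumption), and all names and sample data are hypothetical.

```python
import heapq


def presort(table):
    """Arrange tuple indices in round-robin order over per-attribute sorted lists.

    Also records, after each round, how many tuples have been placed and the
    largest attribute values that any later tuple can still have.
    """
    m = len(table[0])
    lists = [sorted(range(len(table)), key=lambda i, j=j: -table[i][j])
             for j in range(m)]
    seen, order, bounds = set(), [], []
    for depth in range(len(table)):
        for j in range(m):
            i = lists[j][depth]
            if i not in seen:
                seen.add(i)
                order.append(i)
        caps = [table[lists[j][depth]][j] for j in range(m)]
        bounds.append((len(order), caps))
    return order, bounds


def top_k_scan(table, order, bounds, score, k):
    """Scan the presorted order, keeping only k candidates in a min-heap."""
    heap, start = [], 0
    for count, caps in bounds:
        for i in order[start:count]:
            item = (score(table[i]), i)
            if len(heap) < k:
                heapq.heappush(heap, item)
            elif item > heap[0]:
                heapq.heapreplace(heap, item)
        start = count
        # Assumed check: no unscanned tuple can beat the k-th best score.
        if len(heap) == k and heap[0][0] >= score(caps):
            break
    return sorted(heap, reverse=True), start


table = [(9, 1), (4, 8), (7, 7), (2, 3), (1, 2), (3, 1)]
order, bounds = presort(table)
result, scanned = top_k_scan(table, order, bounds, sum, 2)
# result holds (score, tuple index) pairs; scanned counts the tuples examined.
```

On this toy table the scan stops after four of the six tuples, which illustrates why only a small, fixed number of candidates has to be kept in memory.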

Effective Key Management in Dynamic Wireless Sensor Network

05/08/201902/07/2019 by admin

Data-Driven Composition for Service-Oriented Situational Web Applications

05/08/201902/07/2019 by admin

Continuous and Transparent User Identity Verification for Secure Internet Services

05/08/201902/07/2019 by admin

Continuous and Transparent User Identity Verification for Secure Internet Services

Andrea Ceccarelli, Leonardo Montecchi, Francesco Brancati, Paolo Lollini, Angelo Marguglio, and Andrea Bondavalli, Member, IEEE

Abstract: Session management in distributed Internet services is traditionally based on username and password, explicit logouts and mechanisms of user session expiration using classic timeouts. Emerging biometric solutions allow substituting username and password with biometric data during session establishment, but in such an approach still a single verification is deemed sufficient, and the identity of a user is considered immutable during the entire session. Additionally, the length of the session timeout may impact on the usability of the service and consequent client satisfaction. This paper explores promising alternatives offered by applying biometrics in the management of sessions. A secure protocol is defined for perpetual authentication through continuous user verification. The protocol determines adaptive timeouts based on the quality, frequency and type of biometric data transparently acquired from the user. The functional behavior of the protocol is illustrated through Matlab simulations, while model-based quantitative analysis is carried out to assess the ability of the protocol to contrast security attacks exercised by different kinds of attackers. Finally, the current prototype for PCs and Android smartphones is discussed.

Index Terms: Security, web servers, mobile environments, authentication

1 INTRODUCTION

Secure user authentication is fundamental in most of modern ICT systems. User authentication systems are traditionally based on pairs of username and password and verify the identity of the user only at login phase.
No checks are performed during working sessions, which are terminated by an explicit logout or expire after an idle activity period of the user.

Security of web-based applications is a serious concern, due to the recent increase in the frequency and complexity of cyber-attacks; biometric techniques [10] offer emerging solution for secure and trusted authentication, where username and password are replaced by biometric data. However, parallel to the spreading usage of biometric systems, the incentive in their misuse is also growing, especially considering their possible application in the financial and banking sectors [20], [11].

Such observations lead to arguing that a single authentication point and a single biometric data cannot guarantee a sufficient degree of security [5], [7]. In fact, similarly to traditional authentication processes which rely on username and password, biometric user authentication is typically formulated as a “single shot” [8], providing user verification only during login phase when one or more biometric traits may be required. Once the user’s identity has been verified, the system resources are available for a fixed period of time or until explicit logout from the user. This approach assumes that a single verification (at the beginning of the session) is sufficient, and that the identity of the user is constant during the whole session. For instance, we consider this simple scenario: a user has already logged into a security-critical service, and then the user leaves the PC unattended in the work area for a while. This problem is even trickier in the context of mobile devices, often used in public and crowded environments, where the device itself can be lost or forcibly stolen while the user session is active, allowing impostors to impersonate the user and access strictly personal data. In these scenarios, the services where the users are authenticated can be misused easily [8], [5].
A basic solution is to use very short session timeouts and periodically request the user to input his/her credentials over and over, but this is not a definitive solution and heavily penalizes the service usability and ultimately the satisfaction of users.

To timely detect misuses of computer resources and prevent that an unauthorized user maliciously replaces an authorized one, solutions based on multi-modal biometric continuous authentication [5] are proposed, turning user verification into a continuous process rather than a onetime occurrence [8]. To avoid that a single biometric trait is forged, biometrics authentication can rely on multiple biometrics traits. Finally, the use of biometric authentication allows credentials to be acquired transparently, i.e., without explicitly notifying the user or requiring his/her interaction, which is essential to guarantee better service usability. We present some examples of transparent acquisition of biometric data. Face can be acquired while the user is located in front of the camera, but not purposely for the acquisition of the biometric data; e.g., the user may be reading a textual SMS or watching a movie on the mobile phone. Voice can be acquired when the user speaks on the phone, or with other people nearby if the microphone always captures background. Keystroke data can be acquired whenever the user types on the keyboard, for example, when writing an SMS, chatting, or browsing on the Internet. This approach differentiates from traditional authentication processes, where username/password are requested only once at login time or explicitly required at confirmation steps; such traditional authentication approaches impair usability for enhanced security, and offer no solutions against forgery or stealing of passwords.

A. Ceccarelli, L. Montecchi, P. Lollini, and A. Bondavalli are with the Department of Mathematics and Informatics, University of Firenze, Viale Morgagni 65, 50134 Firenze, Italy. E-mail: {andrea.ceccarelli, leonardo.montecchi, paolo.lollini, bondavalli}@unifi.it.
F. Brancati is with Resiltech S.R.L., Piazza Iotti 25, 56025 Pontedera, Pisa, Italy. E-mail: francesco.brancati@resiltech.com.
A. Marguglio is with Engineering Ingegneria Informatica S.p.A., Viale Regione Siciliana 7275, 90146 Palermo, Italy. E-mail: angelo.marguglio@eng.it.
Manuscript received 12 Nov. 2012; revised 18 Dec. 2013; accepted 22 Dec. 2013. Date of publication 8 Jan. 2014; date of current version 15 May 2015. For information on obtaining reprints of this article, please send e-mail to: reprints@ieee.org, and reference the Digital Object Identifier below. Digital Object Identifier no. 10.1109/TDSC.2013.2297709
IEEE Transactions on Dependable and Secure Computing, vol. 12, no. 3, May/June 2015. 1545-5971 (c) 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This paper presents a new approach for user verification and session management that is applied in the context aware security by hierarchical multilevel architectures (CASHMA [1]) system for secure biometric authentication on the Internet. CASHMA is able to operate securely with any kind of web service, including services with high security demands as online banking services, and it is intended to be used from different client devices, e.g., smartphones, Desktop PCs or even biometric kiosks placed at the entrance of secure areas.
Depending on the preferences and requirements of the owner of the web service, the CASHMA authentication service can complement a traditional authentication service, or can replace it.

The approach we introduced in CASHMA for usable and highly secure user sessions is a continuous sequential (a single biometric modality at once is presented to the system [22]) multi-modal biometric authentication protocol, which adaptively computes and refreshes session timeouts on the basis of the trust put in the client. Such global trust is evaluated as a numeric value, computed by continuously evaluating the trust both in the user and the (biometric) subsystems used for acquiring biometric data. In the CASHMA context, each subsystem comprises all the hardware/software elements necessary to acquire and verify the authenticity of one biometric trait, including sensors, comparison algorithms and all the facilities for data transmission and management. Trust in the user is determined on the basis of frequency of updates of fresh biometric samples, while trust in each subsystem is computed on the basis of the quality and variety of sensors used for the acquisition of biometric samples, and on the risk of the subsystem to be intruded.

Exemplary runs carried out using Matlab are reported, and a quantitative model-based security analysis of the protocol is performed combining the stochastic activity networks (SANs [16]) and ADversary VIew Security Evaluation (ADVISE [12]) formalisms.

The driving principles behind our protocol were briefly discussed in the short paper [18], together with minor qualitative evaluations. This paper extends [18] both in the design and the evaluation parts, by providing an in-depth description of the protocol and presenting extensive qualitative and quantitative analysis.

The rest of the paper is organized as follows. Section 2 introduces the preliminaries to our work.
Section 3 illustrates the architecture of the CASHMA system, while Section 4 describes our continuous authentication protocol. Exemplary simulations of the protocol using Matlab are shown in Section 5, while Section 6 presents a quantitative model-based analysis of the security properties of the protocol. Section 7 presents the running prototype, while concluding remarks are in Section 8.

2 PRELIMINARIES

2.1 Continuous Authentication

A significant problem that continuous authentication aims to tackle is the possibility that the user device (smartphone, tablet, laptop, etc.) is used, stolen or forcibly taken after the user has already logged into a security-critical service, or that the communication channels or the biometric sensors are hacked.

In [7] a multi-modal biometric verification system is designed and developed to detect the physical presence of the user logged in a computer. The proposed approach assumes that first the user logs in using a strong authentication procedure, then a continuous verification process is started based on multi-modal biometric. Verification failure together with a conservative estimate of the time required to subvert the computer can automatically lock it up. Similarly, in [5] a multi-modal biometric verification system is presented, which continuously verifies the presence of a user working with a computer. If the verification fails, the system reacts by locking the computer and by delaying or freezing the user’s processes.

The work in [8] proposes a multi-modal biometric continuous authentication solution for local access to high-security systems as ATMs, where the raw data acquired are weighted in the user verification process, based on i) type of the biometric traits and ii) time, since different sensors are able to provide raw data with different timings.
Point ii) introduces the need of a temporal integration method which depends on the availability of past observations: based on the assumption that as time passes, the confidence in the acquired (aging) values decreases. The paper applies a degeneracy function that measures the uncertainty of the score computed by the verification function. In [22], despite the focus is not on continuous authentication, an automatic tuning of decision parameters (thresholds) for sequential multi-biometric score fusion is presented: the principle to achieve multimodality is to consider monomodal biometric subsystems sequentially.

In [3] a wearable authentication device (a wristband) is presented for a continuous user authentication and transparent login procedure in applications where users are nomadic. By wearing the authentication device, the user can login transparently through a wireless channel, and can transmit the authentication data to computers simply approaching them.

2.2 Quantitative Security Evaluation

Security assessment relied for several years on qualitative analyses only. Leaving aside experimental evaluation and data analysis [26], [25], model-based quantitative security assessment is still far from being an established technique despite being an active research area.

Specific formalisms for security evaluation have been introduced in literature, enabling to some extent the quantification of security. Attack trees are closely related to fault trees: they consider a security breach as a system failure, and describe sets of events that can lead to system failure in a combinatorial way [14]; they however do not consider the notion of time. Attack graphs [13] extend attack trees by introducing the notion of state, thus allowing more complex relations between attacks to be described.
Mission oriented risk and design analysis (MORDA) assesses system risk by calculating attack scores for a set of system attacks. The scores are based on adversary attack preferences and the impact of the attack on the system [23]. The recently introduced Adversary VIew Security Evaluation formalism [12] extends the attack graph concept with quantitative information and supports the definition of different attackers profiles.

In CASHMA assessment, the choice of ADVISE was mainly due to: i) its ability to model detailed adversary profiles, ii) the possibility to combine it with other stochastic formalisms as the Möbius multi-formalism [15], and iii) the ability to define ad-hoc metrics for the system we were targeting. This aspect is explored in Section 6.

2.3 Novelty of Our Approach

Our continuous authentication approach is grounded on transparent acquisition of biometric data and on adaptive timeout management on the basis of the trust posed in the user and in the different subsystems used for authentication. The user session is open and secure despite possible idle activity of the user, while potential misuses are detected by continuously confirming the presence of the proper user.

Our continuous authentication protocol significantly differs from the work we surveyed in the biometric field as it operates in a very different context. In fact, it is integrated in a distributed architecture to realize a secure and usable authentication service, and it supports security-critical web services accessible over the Internet.
We remark that although some very recent initiatives for multi-modal biometric authentication over the Internet exist (e.g., the BioID BaaS, Biometric Authentication as a Service, was presented in 2011 as the first multi-biometric authentication service based on Single Sign-On [4]), to the authors' knowledge none of these approaches supports continuous authentication.

Another major difference with works [5] and [7] is that our approach does not require that the reaction to a user verification mismatch be executed by the user device (e.g., the logout procedure); instead, it is transparently handled by the CASHMA authentication service and the web services, which apply their own reaction procedures.

The length of the session timeout in CASHMA is calculated according to the trust in the users and the biometric subsystems, and tailored to the security requirements of the service. This provides a tradeoff between usability and security. Although there are similarities with the overall objectives of the decay function in [8] and the approach for sequential multi-modal systems in [22], the reference systems are significantly different. Consequently, different requirements in terms of data availability, frequency, quality, and security threats lead to different solutions [27].

2.4 Basic Definitions

In this section we introduce the basic definitions that are adopted in this paper. Given n unimodal biometric subsystems $S_k$, with $k = 1, 2, \ldots, n$, that are able to decide independently on the authenticity of a user, the False Non-Match Rate, $FNMR_k$, is the proportion of genuine comparisons that result in false non-matches. A false non-match is the decision of non-match when comparing biometric samples that are from the same biometric source (i.e., a genuine comparison) [10]. It is the probability that the unimodal system $S_k$ wrongly rejects a legitimate user.
Conversely, the False Match Rate, $FMR_k$, is the probability that the unimodal subsystem $S_k$ makes a false match error [10], i.e., it wrongly decides that a non-legitimate user is instead a legitimate one (assuming a fault-free and attack-free operation). Obviously, a false match error in a unimodal system would lead to authenticating a non-legitimate user. To simplify the discussion without losing the general applicability of the approach, hereafter we consider that each sensor allows acquiring only one biometric trait; e.g., having n sensors means that at most n biometric traits are used in our sequential multimodal biometric system.

The subsystem trust level $m(S_k, t)$ is the probability that the unimodal subsystem $S_k$ at time t does not authenticate an impostor (a non-legitimate user), considering both the quality of the sensor (i.e., $FMR_k$) and the risk that the subsystem is intruded.

The user trust level $g(u, t)$ indicates the trust placed by the CASHMA authentication service in the user u at time t, i.e., the probability that the user u is a legitimate user just considering his behavior in terms of device utilization (e.g., time since last keystroke or other action) and the time since the last acquisition of biometric data.

The global trust level $trust(u, t)$ describes the belief that at time t the user u in the system is actually a legitimate user, considering the combination of all subsystem trust levels $m(S_k, t)$, $k = 1, \ldots, n$, and of the user trust level $g(u, t)$.

The trust threshold $g_{min}$ is a lower threshold on the global trust level required by a specific web service; if the resulting global trust level at time t is smaller than $g_{min}$ (i.e., $g(u, t) < g_{min}$), the user u is not allowed to access the service. Otherwise, if $g(u, t) \geq g_{min}$, the user u is authenticated and is granted access to the service.

3 THE CASHMA ARCHITECTURE

3.1 Overall View of the System

The overall system is composed of the CASHMA authentication service, the clients and the web services (Fig. 1), connected through communication channels.
Each communication channel in Fig. 1 implements specific security measures which are not discussed here for brevity.

Fig. 1. Overall view of the CASHMA architecture.

272 IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 12, NO. 3, MAY/JUNE 2015

The CASHMA authentication service includes: i) an authentication server, which interacts with the clients, ii) a set of high-performing computational servers that perform comparisons of biometric data for verification of the enrolled users, and iii) databases of templates that contain the biometric templates of the enrolled users (these are required for user authentication/verification).

The web services are the various services that use the CASHMA authentication service and demand the authentication of enrolled users from the CASHMA authentication server. These services are potentially any kind of Internet service or application with requirements on user authenticity. They have to be registered to the CASHMA authentication service, also expressing their trust threshold. If the web services adopt the continuous authentication protocol, during the registration process they shall agree with the CASHMA registration office on values for the parameters h, k and s used in Section 4.2.

Finally, by clients we mean the users' devices (laptop and desktop PCs, smartphones, tablets, etc.) that acquire the biometric data (the raw data) corresponding to the various biometric traits from the users, and transmit those data to the CASHMA authentication server as part of the authentication procedure towards the target web service. A client contains i) sensors to acquire the raw data, and ii) the CASHMA application which transmits the biometric data to the authentication server.
The CASHMA authentication server exploits such data to apply user authentication and successive verification procedures that compare the raw data with the stored biometric templates.

Transmitting raw data has been a design decision applied to the CASHMA system, to reduce to a minimum the dimension, intrusiveness and complexity of the application installed on the client device, although we are aware that the transmission of raw data may be restricted, for example, due to national legislation.

CASHMA includes countermeasures to protect the biometric data and to guarantee users' privacy, including policies and procedures for proper registration; protection of the acquired data during its transmission to the authentication and computational servers and its storage; and robustness improvement of the algorithm for biometric verification [24]. Privacy issues still exist due to the acquisition of data from the surrounding environment as, for example, voices of people near the CASHMA user, but are considered out of scope for this paper.

The continuous authentication protocol explored in this paper is independent of the selected architectural choices and can work with no differences if templates and feature sets are used instead of transmitting raw data, or independently of the set of adopted countermeasures.

3.2 Sample Application Scenario

CASHMA can authenticate to web services, ranging from services with strict security requirements such as online banking services to services with reduced security requirements such as forums or social networks. Additionally, it can grant access to physical secure areas such as a restricted zone in an airport, or a military zone (in such cases the authentication system can be supported by a biometric kiosk placed at the entrance of the secure area). We explain the usage of the CASHMA authentication service by discussing the sample application scenario in Fig. 2, where a user u wants to log into an online banking service using a smartphone.

It is required that the user and the web service are enrolled to the CASHMA authentication service. We assume that the user is using a smartphone where a CASHMA application is installed.

The smartphone contacts the online banking service, which replies requesting the client to contact the CASHMA authentication server and get an authentication certificate. Using the CASHMA application, the smartphone sends its unique identifier and biometric data to the authentication server for verification. The authentication server verifies the user identity, and grants access if: i) the user is enrolled in the CASHMA authentication service, ii) the user has rights to access the online banking service, and iii) the acquired biometric data match those stored in the templates database associated with the provided identifier. In case of successful user verification, the CASHMA authentication server releases an authentication certificate to the client, proving its identity to third parties, and includes a timeout that sets the maximum duration of the user session. The client presents this certificate to the web service, which verifies it and grants access to the client.

The CASHMA application operates to continuously maintain the session open: it transparently acquires biometric data from the user, and sends them to the CASHMA authentication server to get a new certificate. Such a certificate, which includes a new timeout, is forwarded to the web service to further extend the user session.

3.3 The CASHMA Certificate

In the following we present the information contained in the body of the CASHMA certificate transmitted to the client by the CASHMA authentication server, necessary to understand details of the protocol.

Time stamp and sequence number uniquely identify each certificate, and protect from replay attacks.

ID is the user ID, e.g., a number.

Decision represents the outcome of the verification procedure carried out on the server side.
It includes the expiration time of the session, dynamically assigned by the CASHMA authentication server. In fact, the global trust level and the session timeout are always computed considering the time instant in which the CASHMA application acquires the biometric data, to avoid potential problems related to unknown delays in communication and computation. Since such delays are not predictable, simply delivering a relative timeout value to the client is not feasible: the CASHMA server therefore provides the absolute instant of time at which the session should expire.

Fig. 2. Example scenario: accessing an online banking service using a smartphone.

4 THE CONTINUOUS AUTHENTICATION PROTOCOL

The continuous authentication protocol allows providing adaptive session timeouts to a web service to set up and maintain a secure session with a client. The timeout is adapted on the basis of the trust that the CASHMA authentication system puts in the biometric subsystems and in the user. Details on the mechanisms to compute the adaptive session timeout are presented in Section 4.2.

4.1 Description of the Protocol

The proposed protocol requires a sequential multi-modal biometric system composed of n unimodal biometric subsystems that are able to decide independently on the authenticity of a user. For example, these subsystems can be one subsystem for keystroke recognition and one for face recognition.

The idea behind the execution of the protocol is that the client continuously and transparently acquires and transmits evidence of the user identity to maintain access to a web service.
The main task of the proposed protocol is to create and then maintain the user session, adjusting the session timeout on the basis of the confidence that the identity of the user in the system is genuine.

The execution of the protocol is composed of two consecutive phases: the initial phase and the maintenance phase. The initial phase aims to authenticate the user into the system and establish the session with the web service. During the maintenance phase, the session timeout is adaptively updated when user identity verification is performed using fresh raw data provided by the client to the CASHMA authentication server. These two phases are detailed hereafter with the help of Figs. 3 and 4.

Initial phase. This phase is structured as follows:

• The user (the client) contacts the web service for a service request; the web service replies that a valid certificate from the CASHMA authentication service is required for authentication.

• Using the CASHMA application, the client contacts the CASHMA authentication server. The first step consists in acquiring and sending at time $t_0$ the data for the different biometric traits, specifically selected to perform a strong authentication procedure (step 1). The application explicitly indicates to the user the biometric traits to be provided and possible retries.

• The CASHMA authentication server analyzes the biometric data received and performs an authentication procedure. Two different possibilities arise here. If the user identity is not verified (the global trust level is below the trust threshold $g_{min}$), new or additional biometric data are requested (back to step 1) until the minimum trust threshold $g_{min}$ is reached. If instead the user identity is successfully verified, the CASHMA authentication server authenticates the user, computes an initial timeout of length $T_0$ for the user session, sets the expiration time at $T_0 + t_0$, creates the CASHMA certificate and sends it to the client (step 2).

• The client forwards the CASHMA certificate to the web service (step 3), coupling it with its request.

• The web service reads the certificate and authorizes the client to use the requested service (step 4) until time $t_0 + T_0$.

For clarity, steps 1-4 are represented in Fig. 3 for the case of successful user verification only.

Fig. 3. Initial phase in case of successful user authentication.

Maintenance phase. It is composed of three steps repeated iteratively:

• When at time $t_i$ the client application acquires fresh (new) raw data (corresponding to one biometric trait), it communicates them to the CASHMA authentication server (step 5). The biometric data can be acquired transparently to the user; the user may however decide to provide biometric data which are unlikely to be acquired in a transparent way (e.g., fingerprint). Finally, when the session timeout is about to expire, the client may explicitly notify the user that fresh biometric data are needed.

• The CASHMA authentication server receives the biometric data from the client and verifies the identity of the user. If verification is not successful, the user is marked as not legitimate, and consequently the CASHMA authentication server does not operate to refresh the session timeout. This does not imply that the user is cut off from the current session: if other biometric data are provided before the timeout expires, it is still possible to get a new certificate and refresh the timeout. If verification is successful, the CASHMA authentication server applies the algorithm detailed in Section 4.2 to adaptively compute a new timeout of length $T_i$ and the expiration time of the session at time $T_i + t_i$, and then it creates and sends a new certificate to the client (step 6).

• The client receives the certificate and forwards it to the web service; the web service reads the certificate and sets the session timeout to expire at time $t_i + T_i$ (step 7).

The steps of the maintenance phase are represented in Fig. 4 for the case of successful user verification (step 6b).

Fig. 4. Maintenance phase in case of successful user verification.

4.2 Trust Levels and Timeout Computation

The algorithm to evaluate the expiration time of the session executes iteratively on the CASHMA authentication server. It computes a new timeout, and consequently the expiration time, each time the CASHMA authentication server receives fresh biometric data from a user. Let us assume that the initial phase occurs at time $t_0$, when biometric data is acquired and transmitted by the CASHMA application of the user u, and that during the maintenance phase at time $t_i > t_0$, for any $i = 1, \ldots, m$, new biometric data is acquired by the CASHMA application of the user u (we assume these data are transmitted to the CASHMA authentication server and lead to successful verification, i.e., we are in the conditions of Fig. 4).
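The certificate exchange of Section 4.1 and the server-side iteration introduced above can be illustrated with a short Python sketch. It is only a sketch under stated assumptions, not the CASHMA implementation: the names `Certificate`, `AuthServer` and `WebService`, and the callables `verify` and `timeout_for`, are hypothetical, and biometric matching, the timeout algorithm and certificate signing are replaced by placeholders or omitted. It shows two points from the text: each certificate carries an absolute expiration time computed from the acquisition instant, and a failed verification does not close the session but simply fails to extend it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class Certificate:
    """Hypothetical certificate body (fields follow Section 3.3)."""
    timestamp: float        # acquisition time t_i of the biometric data
    sequence_number: int    # with the time stamp, guards against replay
    user_id: int
    expiration: float       # absolute expiration instant t_i + T_i


class AuthServer:
    """Stand-in for the CASHMA authentication server (assumed interface)."""

    def __init__(self,
                 verify: Callable[[int, object], bool],
                 timeout_for: Callable[[int, float], float]) -> None:
        self._verify = verify            # placeholder biometric matcher
        self._timeout_for = timeout_for  # placeholder for the Section 4.2 algorithm
        self._seq = 0

    def authenticate(self, user_id: int, sample: object,
                     t_acq: float) -> Optional[Certificate]:
        # Steps 2 and 6: issue a certificate only on successful verification.
        if not self._verify(user_id, sample):
            return None  # no refresh; an existing session runs to its timeout
        self._seq += 1
        timeout = self._timeout_for(user_id, t_acq)
        return Certificate(t_acq, self._seq, user_id, t_acq + timeout)


class WebService:
    """Stand-in for a web service that relies on the server's certificates."""

    def __init__(self) -> None:
        self._expiry: Dict[int, float] = {}
        self._last_seq: Dict[int, int] = {}

    def accept(self, cert: Certificate, now: float) -> bool:
        # Steps 4 and 7: reject replayed or expired certificates,
        # otherwise set (or extend) the session expiration.
        if cert.sequence_number <= self._last_seq.get(cert.user_id, 0):
            return False
        if now >= cert.expiration:
            return False
        self._last_seq[cert.user_id] = cert.sequence_number
        self._expiry[cert.user_id] = cert.expiration
        return True

    def session_open(self, user_id: int, now: float) -> bool:
        return now < self._expiry.get(user_id, float("-inf"))
```

Passing the matcher and the timeout computation in as callables keeps the sketch independent of any particular biometric subsystem or trust model.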
The steps of the algorithm described hereafter are executed.

To ease the readability of the notation, in the following the user u is often omitted; for example, $g(t_i) = g(u, t_i)$.

4.2.1 Computation of Trust in the Subsystems

The algorithm starts by computing the trust in the subsystems. Intuitively, the subsystem trust level could be simply set to the static value $m(S_k, t) = 1 - FMR(S_k)$ for each unimodal subsystem $S_k$ and any time t (we assume that information on the subsystems used, including their FMRs, is contained in a repository accessible by the CASHMA authentication server). Instead we apply a penalty function to calibrate the trust in the subsystems on the basis of their usage. Basically, in our approach the more a subsystem is used, the less it is trusted: to prevent a malicious user from needing to manipulate only one biometric trait (e.g., through sensor spoofing [10]) to stay authenticated to the online service, we decrease the trust in those subsystems which are repeatedly used to acquire the biometric data.

In the initial phase $m(S_k, t_0)$ is set to $1 - FMR(S_k)$ for each subsystem $S_k$ used. During the maintenance phase, a penalty function is associated with consecutive authentications performed using the same subsystem as follows:

$$penalty(x, h) = e^{x \cdot h},$$

where x is the number of consecutive authentication attempts using the same subsystem and $h > 0$ is a parameter used to tune the penalty function. This function increases exponentially; this means that using the same subsystem for several authentications heavily increases the penalty.

The computation of the penalty is the first step for the computation of the subsystem trust level.
If the same subsystem is used in consecutive authentications, the subsystem trust level is the product of i) the subsystem trust level $m(S_k, t_{i-1})$ computed in the previous execution of the algorithm, and ii) the inverse of the penalty function (the higher the penalty, the lower the subsystem trust level):

$$m(S_k, t_i) = m(S_k, t_{i-1}) \cdot (penalty(x, h))^{-1}.$$

Otherwise, if the subsystem is used for the first time or in non-consecutive user identity verification, $m(S_k, t_i)$ is set to $1 - FMR(S_k)$. This computation of the penalty is intuitive but fails if more than one subsystem is compromised (e.g., two fake biometric data can be provided in an alternate way). Other formulations that include the history of subsystem usage can be identified but are outside the scope of this paper.

4.2.2 Computation of Trust in the User

As time passes from the most recent user identity verification, the probability that an attacker has replaced the legitimate user increases, i.e., the level of trust in the user decreases. This leads us to model the user trust level through time using a function which is asymptotically decreasing towards zero. Among the possible models we selected the function in (1), which: i) asymptotically decreases towards zero; ii) yields $trust(t_{i-1})$ for $\Delta t_i = 0$; and iii) can be tuned with two parameters which control the delay (s) and the slope (k) with which the trust level decreases over time (Figs. 5 and 6). Different functions may be preferred under specific conditions or user requirements; in this paper we focus on introducing the protocol, which can also be realized with other functions.

During the initial phase, the user trust level is simply set to $g(t_0) = 1$. During the maintenance phase, the user trust level is computed for each received fresh biometric data. The user trust level at time $t_i$ is given by:

$$g(t_i) = \frac{\left(-\arctan\left((\Delta t_i - s) \cdot k\right) + \frac{\pi}{2}\right) \cdot trust(t_{i-1})}{-\arctan(-s \cdot k) + \frac{\pi}{2}}. \qquad (1)$$

Fig. 5. Evolution of the user trust level when k = [0.01, 0.05, 0.1] and s = 40.

Fig. 6. Evolution of the user trust level when k = 0.05 and s = [20, 40, 60].

The value $\Delta t_i = t_i - t_{i-1}$ is the time interval between two data transmissions; $trust(t_{i-1})$ instead is the global trust level computed in the previous iteration of the algorithm. Parameters k and s are introduced to tune the decreasing function: k impacts the inclination towards the falling inflection point, while s translates the inflection point horizontally, i.e., allows anticipating or delaying the decay.

Figs. 5 and 6 show the user trust level for different values of s and k. Note that s and k allow adapting the algorithm to different services: for example, services with strict security requirements such as banking services may adopt a high k value and a small s value to have a faster decrease of the user trust level. We also clarify that in Figs. 5 and 6 and in the rest of the paper, we intentionally avoid using measurement units for time quantities (e.g., seconds), since they depend upon the involved application and do not add significant value to the discussion.

4.2.3 Merging User Trust and Subsystems Trust: The Global Trust Level

The global trust level is finally computed by combining the user trust level with the subsystem trust level.

In the initial phase, multiple subsystems may be used to perform an initial strong authentication. Let n be the number of different subsystems; the global trust level is first computed during the initial phase as follows:

$$trust(t_0) = 1 - \prod_{k=1,\ldots,n} \left(1 - m(S_k, t_0)\right). \qquad (2)$$

Equation (2) includes the subsystem trust level of all subsystems used in the initial phase. We recall that for the first authentication $m(S_k, t_0)$ is set to $1 - FMR(S_k)$.
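The quantities defined so far (the penalty function, the subsystem trust update, the user trust level of Equation (1) and the initial global trust level of Equation (2)) can be written down directly. The following Python sketch is illustrative only: the function names are hypothetical, and the way x is counted (x = 0 on first or non-consecutive use, which resets the trust to 1 − FMR) is an assumption made here, since the text does not fix the counting convention.

```python
import math
from typing import Iterable


def penalty(x: int, h: float) -> float:
    """penalty(x, h) = e^(x * h), with h > 0."""
    return math.exp(x * h)


def subsystem_trust(prev_trust: float, fmr: float, x: int, h: float) -> float:
    """Subsystem trust m(S_k, t_i).

    x is the number of consecutive authentications already made with this
    subsystem; x == 0 (first or non-consecutive use) resets the trust to
    1 - FMR(S_k). How x is counted is an assumption of this sketch.
    """
    if x == 0:
        return 1.0 - fmr
    return prev_trust / penalty(x, h)


def user_trust(delta_t: float, prev_global_trust: float,
               k: float, s: float) -> float:
    """User trust g(t_i) from Equation (1)."""
    numerator = (-math.atan((delta_t - s) * k) + math.pi / 2) * prev_global_trust
    denominator = -math.atan(-s * k) + math.pi / 2
    return numerator / denominator


def initial_global_trust(subsystem_trusts: Iterable[float]) -> float:
    """Global trust trust(t_0) from Equation (2), the OR-rule."""
    return 1.0 - math.prod(1.0 - m for m in subsystem_trusts)
```

With delta_t = 0 the numerator and denominator of `user_trust` coincide up to the factor `prev_global_trust`, which reproduces property ii) of Equation (1).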
The different subsystem trust levels are combined adopting the OR-rule from [2], considering only the false acceptance rate: each subsystem proposes a score, and the combined score is more accurate than the score of each individual subsystem. The first authentication does not consider trust in the user behavior, and only weights the trust in the subsystems. The FNMR is not considered in this computation because it only impacts the reliability of the session, while the user trust level is intended only for security.

Instead, the global trust level in the maintenance phase is a linear combination of the user trust level and the subsystem trust level. Given the user trust level $g(t_i)$ and the subsystem trust level $m(S_k, t_i)$, the global trust level is computed again adopting the OR-rule from [2], this time with only two input values. The result is as follows:

$$trust(t_i) = 1 - (1 - g(t_i))(1 - m(S_k, t_i)) = g(t_i) + m(S_k, t_i) - g(t_i)\, m(S_k, t_i) = g(t_i) + (1 - g(t_i))\, m(S_k, t_i). \qquad (3)$$

4.2.4 Computation of the Session Timeout

The last step is the computation of the length $T_i$ of the session timeout. This value represents the time required by the global trust level to decrease to the trust threshold $g_{min}$ (if no more biometric data are received). Such a value can be determined by inverting the user trust level function (1) and solving it for $\Delta t_i$.

Starting from a given instant of time $t_i$, we consider $t_{i+1}$ as the instant of time at which the global trust level reaches the minimum threshold $g_{min}$, i.e., $g(t_{i+1}) = g_{min}$. The timeout is then given by $T_i = \Delta t_{i+1} = t_{i+1} - t_i$. To obtain a closed formula for such a value we first instantiated (1) for $i + 1$, i.e., we substituted $trust(t_{i-1})$ with $trust(t_i)$, $\Delta t_i$ with $T_i$, and $g(t_i)$ with $g_{min}$.

By solving for $T_i$, we finally obtain Equation (4), which allows the CASHMA service to dynamically compute the session timeout based on the current global trust level.
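As an illustration of the last two steps, the following Python sketch combines user and subsystem trust as in Equation (3) and computes the timeout by inverting Equation (1) with the user trust set to $g_{min}$, which gives $T_i = \tan\left(g_{min} (\arctan(-s \cdot k) - \pi/2) / trust(t_i) + \pi/2\right) / k + s$. The function names are hypothetical. The early return for a global trust level not above the threshold is an addition of this sketch: outside that range the tangent argument can leave the interval in which the inversion is valid, so the timeout is simply taken as zero.

```python
import math


def global_trust(g: float, m: float) -> float:
    """Maintenance-phase global trust, Equation (3): g + (1 - g) * m."""
    return g + (1.0 - g) * m


def session_timeout(trust: float, g_min: float, k: float, s: float) -> float:
    """Length T_i of the session timeout, obtained by inverting Equation (1).

    Returns 0 when the current global trust does not exceed the threshold
    (guard added in this sketch) or when the closed form is not positive.
    """
    if trust <= g_min:
        return 0.0
    angle = g_min * (math.atan(-s * k) - math.pi / 2) / trust + math.pi / 2
    timeout = math.tan(angle) / k + s
    return max(timeout, 0.0)
```

A quick consistency check is to feed the returned timeout back into Equation (1) as the elapsed time: the resulting user trust should equal $g_{min}$, and raising $g_{min}$ for the same trust level should shorten the timeout.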
The initial phase and the maintenance phase are computed in the same way: the length $T_i$ of the timeout at time $t_i$ for the user u is:

$$T_i = \begin{cases} \tan\left(\dfrac{g_{min} \cdot \left(\arctan(-s \cdot k) - \frac{\pi}{2}\right)}{trust(t_i)} + \dfrac{\pi}{2}\right) \cdot \dfrac{1}{k} + s & \text{if } T_i > 0 \\ 0 & \text{otherwise.} \end{cases} \qquad (4)$$

It is then trivial to set the expiration time of the certificate at $T_i + t_i$.

In Fig. 7 the length $T_i$ of the timeout for different values of $g_{min}$ is shown; the higher $g_{min}$ is, the higher are the security requirements of the web service, and consequently the shorter is the timeout.

Fig. 7. Timeout values for $g_{min} \in [0.1, 0.9]$, k = 0.05 and s = 40.

5 EXEMPLARY RUNS

This section reports Matlab executions of the protocol. Four different biometric traits acquired through four different subsystems are considered for biometric verification: voice, keystroke, fingerprint, and face.

We associate the following FMRs to each of them: 0.06 to the voice recognition system (vocal data is acquired through a microphone), 0.03 to the fingerprint recognition system (the involved sensor is a fingerprint reader; the corresponding biometric data are not acquired transparently but are explicitly provided by the user), 0.05 to the facial recognition system (the involved sensor is a camera), and 0.08 to keystroke recognition (a keyboard or a touch/tactile screen can be used for data acquisition). Note that the FMRs must be set on the basis of the sensors and technologies used. We also assume that the initial phase of the protocol needs only one raw data sample.

The first scenario, depicted in Fig. 8, is a simple but representative execution of the protocol: in 900 time units, the CASHMA authentication server receives 20 fresh biometric data samples from a user and performs successful verifications. The upper part of Fig. 8 shows the behavior of the user trust level (the continuous line) with the $g_{min}$ threshold (the dashed line) set to $g_{min} = 0.7$. In the lower graph the evolution of the session timeout is shown (it is the continuous line).
When the continuous line intersects the dashed line, the timeout expires. The time units are reported on the x-axis. The k and s parameters are set to k = 0.05 and s = 100. The first authentication is at time unit 112, followed by a second one at time unit 124. The global trust level after these first two authentications is 0.94. The corresponding session timeout is set to expire at time unit 213: if no fresh biometric data are received before time unit 213, the global trust level intersects the threshold $g_{min}$. Indeed, this actually happens: the session closes, and the global trust level is set to 0. The session remains closed until a new authentication at time unit 309 is performed. The rest of the experiment runs in a similar way.

Fig. 8. Global trust level (top) and session timeout (bottom) in a nominal scenario.

The next two runs provide two examples of how the threshold $g_{min}$ and the parameters k and s can be selected to meet the security requirements of the web service. We represent the execution of the protocol to authenticate to two web services with very different security requirements: the first with low security requirements, and the second with severe security requirements.

Fig. 9 describes the continuous authentication protocol for the first system. The required trust in the legitimacy of the user is consequently reduced; session availability and transparency to the user are favored. The protocol is tuned to maintain the session open with sparse authentications. Given $g_{min} = 0.6$, and parameters s = 200 and k = 0.005 set for a slow decrease of the user trust level, the plot in Fig. 9 contains 10 authentications in 1,000 time units, showing a single timeout expiration 190 time units after the first authentication.

Fig. 9. Global trust level and 10 authentications for a service with low security requirements.

Fig. 10 describes the continuous authentication protocol applied to a web service with severe security requirements. In this case, session security is preferred to session availability or transparency to the user: the protocol is tuned to maintain the session open only if biometric data are provided frequently and with sufficient alternation between the available biometric traits. Fig. 10 represents the global trust level of a session in which authentication data are provided 40 times in 1,000 time units using $g_{min} = 0.9$, and the parameters s = 90 and k = 0.003 set for rapid decrease. Maintaining the session open requires very frequent transmissions of biometric data for authentication. This comes at the cost of reduced usability, because a user who does not use the device continuously will most likely incur a timeout expiration.

Fig. 10. Global trust level and 40 authentications for a service with high security requirements.

6 SECURITY EVALUATION

A complete analysis of the CASHMA system was carried out during the CASHMA project [1], complementing traditional security analysis techniques with techniques for quantitative security evaluation. Qualitative security analysis, having the objective to identify threats to CASHMA and select countermeasures, was guided by general and accepted schemas of biometric attacks and attack points such as [9], [10], [11], [21]. A quantitative security analysis of the whole CASHMA system was also performed [6]. As this paper focuses on the continuous authentication protocol rather than the CASHMA architecture, we briefly summarize the main threats to the system identified within the project (Section 6.1), while the rest of this section (Section 6.2) focuses on the quantitative security assessment of the continuous authentication protocol.

6.1 Threats to the CASHMA System

Security threats to the CASHMA system have been analyzed both for the enrollment procedure (i.e., initial registration of a user within the system), and the authentication procedure itself. We report here only on authentication. The biometric system has been considered as decomposed in functions from [10]. For authentication, we considered collection of biometric traits, transmission of (raw) data, features extraction, matching function, template search and repository management, transmission of the matching score, decision function, and communication of the recognition result (accept/reject decision).

Several relevant threats exist for each function identified [9], [10], [11]. For brevity, we do not consider threats generic to ICT systems and not specific to biometrics (e.g., attacks aimed at Denial of Service, eavesdropping, man-in-the-middle, etc.). We thus mention the following. For the collection of biometric traits, we identified sensor spoofing and untrusted device, reuse of residuals to create fake biometric data, impersonation, mimicry and presentation of poor images (for face recognition). For the transmission of (raw) data, we selected fake digital biometric, where an attacker submits false digital biometric data. For the features extraction, we considered insertion of imposter data, component replacement, override of feature extraction (the attacker is able to interfere with the extraction of the feature set), and exploitation of vulnerabilities of the extraction algorithm. For the matching function, the attacks we considered are insertion of imposter data, component replacement, guessing, and manipulation of match scores. For template search and repository management, all attacks considered are generic for repositories and not specific to biometric systems. For the transmission of the matching score, we considered manipulation of match score.
For the decision function, we considered hill climbing (the attacker has access to the matching score, and iteratively submits modified data in an attempt to raise the resulting matching score), system parameter override/modification (the attacker has the possibility to change key parameters such as system tolerances in feature matching), component replacement, and decision manipulation. For the communication of the recognition result, we considered only attacks typical of Internet communications.

Countermeasures were selected appropriately for each function on the basis of the threats identified.

6.2 Quantitative Security Evaluation

6.2.1 Scenario and Measures of Interest

For the quantitative security evaluation of the proposed protocol we consider a mobile scenario, where a registered user uses the CASHMA service through a client installed on a mobile device like a laptop, a smartphone or a similar device. The user may therefore lose the device, or equivalently leave it unattended for a time long enough for attackers to compromise it and obtain authentication. Moreover, the user may lose control of the device (e.g., he/she may be forced to hand it over) while a session has already been established, thus reducing the effort needed by the attacker. In the considered scenario the system works with three biometric traits: voice, face, and fingerprint.

A security analysis of the first authentication performed to acquire the first certificate and open a secure session has been provided in [6]. We assume here that the attacker has already been able to perform the initial authentication (or to access an already established session), and we aim to evaluate how long he is able to keep the session alive, when varying the parameters of the continuous authentication algorithm and the characteristics of the attacker.
The measures of interest that we evaluate in this paper are the following: i) P_k(t): probability that the attacker is able to keep the session alive until the instant t, given that the session has been established at the instant t = 0; ii) T_k: mean time for which the attacker is able to keep the session alive.

Since most of the computation is performed server-side, we focus on attacks targeting the mobile device. In order to provide fresh biometric data, the attacker has to compromise one of the three biometric modalities. This can be accomplished in several ways; for example, by spoofing the biometric sensors (e.g., by submitting a recorded audio sample, or a picture of the legitimate user), or by exploiting cyber-vulnerabilities of the device (e.g., through a “reuse of residuals” attack [9]). We consider three kinds of abilities for attackers: spoofing, as the ability to perform sensor spoofing attacks; hacking, as the ability to perform cyber attacks; and lawfulness, as the degree to which the attacker is prepared to break the law.

The actual skills of the attacker influence the chance of a successful attack, and the time required to perform it. For example, having a high hacking skill reduces the time required to perform the attack, and also increases the success probability: an attacker having high technological skills may be able to compromise the system in such a way that the effort required to spoof sensors is reduced (e.g., by altering the data transmitted by the client device).

6.2.2 The ADVISE [12] Formalism

The analysis method supported by ADVISE relies on creating executable security models that can be solved using discrete-event simulation to provide quantitative metrics.
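
As a minimal sketch of how the two measures of interest defined in Section 6.2.1, P_k(t) and T_k, could be estimated from simulation output, the following Python snippet assumes that each simulation run yields a single number: the time for which the attacker kept the session alive. The function names and the sample durations are hypothetical; they are not taken from the paper or from the Möbius tool.

```python
from statistics import mean


def survival_probability(durations, t):
    """Empirical P_k(t): fraction of runs whose session is still alive at time t."""
    if not durations:
        raise ValueError("at least one simulated duration is required")
    return sum(1 for d in durations if d >= t) / len(durations)


def mean_session_time(durations):
    """Empirical T_k: mean time for which the attacker keeps the session alive."""
    return mean(durations)


# Hypothetical session durations (in time units) from four simulation runs.
samples = [60.0, 60.0, 120.0, 320.0]
p_at_100 = survival_probability(samples, 100.0)
t_k = mean_session_time(samples)
```

With these made-up samples, two of the four runs outlast t = 100, so the empirical P_k(100) is one half. A real study would use many more runs and report confidence intervals.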
One of the most significant features introduced by this formalism is the precise characterization of the attacker (the “adversary”) and the influence of its decisions on the final measures of interest.

The specification of an ADVISE model is composed of two parts: an Attack Execution Graph (AEG), describing how the adversary can attack the system, and an adversary profile, describing the characteristics of the attacker. An AEG is a particular kind of attack graph comprising different kinds of nodes: attack steps, access domains, knowledge items, attack skills, and attack goals. Attack steps describe the possible attacks that the adversary may attempt, while the other elements describe items that can be owned by attackers (e.g., intranet access). Each attack step requires a certain combination of such items to be held by the adversary; the set of items that have been achieved by the adversary defines the current state of the model. ADVISE attack steps also have additional properties, which allow creating executable models for quantitative analysis. The adversary profile defines the set of items that are initially owned by the adversary, as well as his proficiency in attack skills. The adversary starts without having reached any goal, and works towards them. Each attack goal is assigned a payoff value, which specifies the value that the adversary assigns to reaching that goal. Three weights define the relative preference of the adversary in: i) maximizing the payoff, ii) minimizing costs, or iii) minimizing the probability of being detected.
Finally, the planning horizon defines the number of steps in the future that the adversary is able to take into account for his decisions; this value can be thought of as modeling the “smartness” of the adversary.

The ADVISE execution algorithm evaluates the reachable states based on enabled attack steps, and selects the most appealing to the adversary based on the weights described above. The execution of the attack is then simulated, leading the model to a new state. Metrics are defined using reward structures [14]. By means of the Rep/Join composition formalism [15], ADVISE models can be composed with models expressed in other formalisms supported by the Möbius framework, and in particular with stochastic activity network (SAN) [16] models.

6.2.3 Modeling Approach

The model that is used for the analysis combines an ADVISE model, which takes into account the attackers’ behavior, and a SAN model, which models the evolution of trust over time due to the continuous authentication protocol. Both models include a set of parameters, which allow evaluating metrics under different conditions and performing sensitivity analysis. Protocol parameters used for the analysis are reported in the upper labels of Figs. 13 and 14; parameters describing attackers are shown in Table 1 and their values are discussed in Section 6.2.4.

ADVISE model. The AEG of the ADVISE model is composed of one attack goal, three attack steps, three attack skills, and five access domains. Its graphical representation is shown in Fig. 11, using the notation introduced in [12]. The only attack goal present in the model, “RenewSession”, represents the renewal of the session timeout by submitting fresh biometric data to the CASHMA server.

To reach its goal, the attacker has at its disposal three attack steps, each one representing the compromise of one of the three biometric traits: “Compromise_Voice”, “Compromise_Face”, and “Compromise_Fingerprint”. Each of them requires the “SessionOpen” access domain, which represents an already established session.
The three abilities of attackers are represented by three attack skills: “SpoofingSkill”, “HackSkill”, and “Lawfulness”.

The success probability of such attack steps is a combination of the spoofing skills of the attacker and the false non-match rate (FNMR) of the involved biometric subsystem. In fact, even if the attacker were able to perfectly mimic the user’s biometric trait, a reject would still be possible in case of a false non-match of the subsystem. For example, the success probability of the “Compromise_Voice” attack step is obtained as:

FNMR_Voice * (SpoofingSkill->Mark() / 1000.0),

where “FNMR_Voice” is the false non-match rate of the voice subsystem, and SpoofingSkill ranges from a minimum of 0 to a maximum of 1,000. It should be noted that the actual value assigned to the spoofing skill is a relative value, which also depends on the technological measures implemented to counter such attacks. Based on the skill value, the success probability ranges from 0 (spoofing is not possible) to the FNMR of the subsystem (the same probability of a non-match for a “genuine” user). The time required to perform the attack is exponentially distributed, and its rate also depends on the attacker’s skills.

When one of the three attack steps succeeds, the corresponding “OK_X” access domain is granted to the attacker. Owning one of such access domains means that the system has correctly recognized the biometric data, and that it is updating the global trust level; in this state all the attack steps are disabled. A successful execution of the attack steps also grants the attacker the “RenewSession” goal. The “LastSensor” access domain is used to record the last subsystem that has been used for authentication.

SAN model. The SAN model in Fig. 12 models the management of session timeout and its extension through the continuous authentication mechanism.
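
As an illustration of the attack steps of the ADVISE model described above, the following Python sketch transcribes the success-probability expression exactly as it is given in the text (the subsystem FNMR scaled by the spoofing skill over 1,000) and draws an exponentially distributed attack duration. The function names, the skill value of 500, and the rate of 0.1 are hypothetical; the text does not specify how the rate depends on the attacker's skills, so the rate is simply passed in. The FNMR of 0.06 for voice is the value reported in Section 6.2.5.

```python
import random


def attack_success_probability(fnmr, spoofing_skill):
    """Success probability as printed in the text: FNMR * (skill / 1000)."""
    if not 0.0 <= fnmr <= 1.0:
        raise ValueError("fnmr must lie in [0, 1]")
    if not 0 <= spoofing_skill <= 1000:
        raise ValueError("spoofing_skill must lie in [0, 1000]")
    return fnmr * (spoofing_skill / 1000.0)


def sample_attack_time(rate, rng):
    """Draw an exponentially distributed attack duration with the given rate."""
    return rng.expovariate(rate)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Voice subsystem (FNMR 0.06) attacked with a hypothetical mid-range skill.
p_voice = attack_success_probability(fnmr=0.06, spoofing_skill=500)
# Hypothetical rate; the dependence of the rate on skills is not modeled here.
duration = sample_attack_time(rate=0.1, rng=rng)
```

A skill of 0 yields probability 0 and a skill of 1,000 yields the FNMR itself, matching the range stated in the text.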
The evolution of trust level over time is modeled using the functions introduced in Section 4.2; it should be noted that the model introduced in this section can also be adapted to other functions that might be used for realizing the protocol.

Fig. 11. AEG of the ADVISE model used for security evaluations.

TABLE 1. Attackers and Their Characteristics

Fig. 12. SAN model for the continuous authentication mechanism.

Place “SessionOpen” is shared with the ADVISE model, and therefore it contains one token if the attacker has already established a session (i.e., it holds the “SessionOpen” access domain). The extended places “LastTime” and “LastTrust” are used to keep track of the last time at which the session timeout has been updated, and the corresponding global trust level. These values correspond, respectively, to the quantities t_0 and g(t_0) and can therefore be used to compute the current global trust level g(t). Whenever the session is renewed, the extended place “AuthScore” is updated with the global trust level P(S_k) of the subsystem that has been used to renew the session. The extended place “CurrentTimeout” is used to store the current session timeout, previously calculated at time t_0. The activity “Timeout” models the elapsing of the session timeout and it fires with a deterministic delay, which is given by the value contained in the extended place “CurrentTimeout”. Such activity is enabled only when the session is open (i.e., place “SessionOpen” contains one token). Places “OK_Voice”, “OK_Face”, and “OK_Fingerprint” are shared with the respective access domains in the ADVISE model.
Places “Voice_Consecutive”, “Face_Consecutive”, and “Fingerprint_Consecutive” are used to track the number of consecutive authentications performed using the same biometric subsystem; this information is used to evaluate the penalty function.

When place “OK_Voice” contains a token, the instantaneous activity “CalculateScore1” is enabled and fires; the output gate “OGSCoreVoice” then sets the marking of place “AuthScore” to the authentication score of the voice subsystem, possibly applying the penalty. The marking of “Voice_Consecutive” is then updated, while the count for the other two biometric traits is reset. Finally, a token is added in place “Update”, which enables the immediate activity “UpdateTrust”. The model has the same behavior for the other two biometric traits.

When the activity “UpdateTrust” fires, the gate “OGTrustUpdate” updates the user trust level, which is computed based on the values in places “LastTrust” and “LastTime”, using (1). Using (3), the current user trust level is then fused with the score of the authentication that is being processed, which has been stored in place “AuthScore”. Finally, the new timeout is computed using (4) and the result is stored in the extended place “CurrentTimeout”. The reactivation predicate of the activity “Timeout” forces the resample of its firing time, and the new session timeout value is therefore adopted.

Composed model. The ADVISE and SAN models are then composed using the Join formalism [15]. Places “SessionOpen”, “OK_Voice”, “OK_Face”, and “OK_Fingerprint” are shared with the corresponding access domains in the ADVISE model. The attack goal “RenewSession” is shared with place “RenewSession”.

6.2.4 Definition of Attackers

One of the main challenges in security analysis is the identification of possible human agents that could pose security threats to information systems. The work in [17] defined a Threat Agent Library (TAL) that provides a standardized set of agent definitions ranging from government spies to untrained employees.
TAL classifies agents based on their access, outcomes, limits, resources, skills, objectives, and visibility, defining qualitative levels to characterize the different properties of attackers. For example, to characterize the proficiency of attackers in skills, four levels are adopted: “none” (no proficiency), “minimal” (can use existing techniques), “operational” (can create new attacks within a narrow domain), and “adept” (broad expert in such technology). The “limits” dimension describes legal and ethical limits that may constrain the attacker. The “resources” dimension defines the organizational level at which an attacker operates, which in turn determines the amount of resources available to it for use in an attack. “Visibility” describes the extent to which the attacker intends to hide its identity or attacks.

Fig. 13. Effect of the continuous authentication mechanism on different attackers.

Fig. 14. Effect of varying the threshold g_min on the TMA attacker.

Agent threats in the TAL can be mapped to ADVISE adversary profiles with relatively low effort. The “access” attribute is reproduced by assigning different sets of access domains to the adversary; the “skills” attribute is mapped to one or more attack skills; the “resources” attribute can be used to set the weight assigned to reducing costs in the ADVISE model. Similarly, “visibility” is modeled by the weight that the adversary assigns to avoiding detection. The attributes “outcomes” and “objectives” are reproduced by attack goals, their payoff, and the weight assigned to maximizing the payoff. Finally, the “limits” attribute can be thought of as a specific attack skill describing the extent to which the attacker is prepared to break the law.
In this paper, it is represented by the “Lawfulness” attack skill.

In our work we have abstracted four macro-agents that summarize the agents identified in TAL, and we have mapped their characteristics to adversary profiles in the ADVISE formalism. To identify such macro-agents we first discarded those attributes that are not applicable to our scenario; then we aggregated into a single agent those attackers that, after this process, resulted in similar profiles. Indeed, it should be noted that not all the properties are applicable in our evaluation; most notably, “objectives” are the same for all the agents, i.e., extending the session timeout as much as possible. Similarly, “outcomes” is not addressed since it depends upon the application to which the CASHMA authentication service provides access. Moreover, in our work we consider hostile threat agents only (i.e., we do not consider agents 1, 2, and 3 in [17]), as opposed to non-hostile ones, which include, for example, the “Untrained Employee”.

The attributes of the four identified agents are summarized in Table 1. As discussed in [17], names serve only to identify agents; their characteristics should be derived from agent properties. “Adverse Organization” (ORG) represents an external attacker with government-level resources (e.g., a terrorist organization or an adverse nation-state entity), and having good proficiency in both “Hack” and “Spoofing” skills. It intends to keep its identity secret, although it does not intend to hide the attack itself. It does not have particular limits, and is prepared to use violence and commit major extra-legal actions. This attacker maps agents 6, 7, 10, 15, and 18 in [17].

“Technology Master Individual” (TMA) represents the attacker for which the term “hacker” is commonly used: an external individual having high technological skills, moderate/low resources, and a strong will to hide himself and his attacks.
This attacker maps agents 5, 8, 14, 16, and 21 in [17].

“Generic Individual” (GEN) is an external individual with low skills and resources, but high motivation (either rational or not) that may lead him to use violence. This kind of attacker does not take care to hide its actions. The GEN attacker maps agents 4, 13, 17, 19, and 20 in [17]. Finally, the “Insider” attacker (INS) is an internal attacker, having minimal skill proficiency and organization-level resources; it is prepared to commit only minimal extra-legal actions, and one of its main concerns is to avoid detection of himself or his attacks. This attacker maps agents 9, 11, and 12 in [17].

6.2.5 Evaluations

The composed model has been solved using the discrete-event simulator provided by the Möbius tool [15]. All the measures have been evaluated by collecting at least 100,000 samples, and using a relative confidence interval of ±1% with a confidence level of 99 percent. For consistency, the parameters of the decreasing functions are the same as in Fig. 10 (s = 90 and k = 0.003); the FMRs of the subsystems are also the same as those used in the simulations of Section 5 (voice: 0.06, fingerprint: 0.03, face: 0.05); for all subsystems, the FNMR has been assumed to be equal to its FMR.

Results in Fig. 13 show the effectiveness of the algorithm in countering the four attackers. The left part of the figure depicts the measure P_k(t), while T_k is shown in the right part. All the attackers maintain the session alive with probability 1 for about 60 time units. This delay is given by the initial session timeout, which depends upon the characteristics of the biometric subsystems, the decreasing function (1), and the threshold g_min. With the same parameters, a similar value was also obtained in the MATLAB simulations described in Section 5 (see Fig. 10): from the highest value of g(u,t), if no fresh biometric data is received, the global trust level reaches the threshold in slightly more than 50 time units.
By submitting fresh biometric data, all four attackers are able to renew the authentication and extend the session timeout. The extent to which they are able to maintain the session alive is based on their abilities and characteristics.

The GEN attacker has about a 40 percent probability of being able to renew the authentication and on average he is able to maintain the session for 80 time units. Moreover, after 300 time units he has been disconnected by the system with probability 1. The INS and ORG attackers are able to renew the session for 140 and 170 time units on average, respectively, due to their greater abilities in the spoofing skill. However, the most threatening agent is the TMA attacker, which has about a 90 percent chance of renewing the authentication and is able, on average, to extend its session up to 260 time units, which in this setup is more than four times the initial session timeout. Moreover, the probability that TMA is able to keep the session alive up to 300 time units is about 30 percent, i.e., on average once every three attempts the TMA attacker is able to extend the session beyond 300 time units, which is roughly five times the initial session timeout.

Possible countermeasures consist in the correct tuning of algorithm parameters based on the attackers to which the system is likely to be subject. As an example, Fig. 14 shows the impact of varying the threshold g_min on the two measures of interest, P_k(t) and T_k, with respect to the TMA attacker. Results in the figure show that increasing the threshold is an effective countermeasure to reduce the average time that the TMA attacker is able to keep the session alive. By progressively increasing g_min the measure T_k decreases considerably; this is due both to a reduced initial session timeout, and to the fact that the attacker has less time at his disposal to perform the required attack steps.
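
As a quick arithmetic check of the values reported above for Fig. 13, the following Python sketch relates each attacker's mean session time to the initial session timeout of about 60 time units. The variable names are hypothetical; the numbers are the approximate values quoted in the text, not new measurements.

```python
# Approximate initial session timeout quoted in the text, in time units.
INITIAL_TIMEOUT = 60

# Mean time T_k reported in the text for each attacker, in time units.
reported_mean_time = {"GEN": 80, "INS": 140, "ORG": 170, "TMA": 260}

# Express each mean session time as a multiple of the initial timeout.
ratio_to_initial = {
    name: t / INITIAL_TIMEOUT for name, t in reported_mean_time.items()
}

# The attacker with the longest mean session time.
most_threatening = max(reported_mean_time, key=reported_mean_time.get)

# The 300-time-unit mark used in the text, as a multiple of the initial timeout.
ratio_300 = 300 / INITIAL_TIMEOUT
```

The TMA ratio is above four and the 300-time-unit mark is five times the initial timeout, consistent with the multiples stated in the text.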
As shown in Fig. 14, by setting the threshold to 0.95, the probability that the TMA attacker is able to keep the session alive beyond 300 time units approaches zero, while it is over 30 percent when g_min is set to 0.9.

7 PROTOTYPE IMPLEMENTATION

The implementation of the CASHMA prototype includes face, voice, iris, fingerprint, and online dynamic handwritten signature as biometric traits for biometric kiosks and PCs/laptops, relying on on-board devices when available or pluggable accessories if needed. On smartphones only face and voice recognition are applied: iris recognition was discarded due to the difficulties in acquiring high-quality iris scans using the camera of commercial devices, and handwritten signature recognition is impractical on most smartphones available on the market today (larger displays are required). Finally, fingerprint recognition was discarded because few smartphones include a fingerprint reader. The selected biometric traits (face and voice) suit the need to be acquired transparently for the continuous authentication protocol described.

A prototype of the CASHMA architecture is currently available, providing mobile components to access a secured web application. The client is based on the Adobe Flash [19] technology: it is a specific client, written in Adobe ActionScript 3, able to access and control the on-board devices in order to acquire the raw data needed for biometric authentication. In the case of smartphones, the CASHMA client component is realized as a native Android application (using the Android SDK API 12). Tests were conducted on Samsung Galaxy S II, HTC Desire, HTC Desire HD, and HTC Sensation smartphones with Android 4.0.x. On average, from the executed tests, for the smartphones considered we achieved FMR = 2.58% for face recognition and FMR = 10% for voice.
The size of the biometric data acquired using the considered smartphones and exchanged is approximately 500 KB. As expected from such a limited size, the acquisition, compression, and transmission of these data using the mentioned smartphones did not raise issues of performance or communication bandwidth. In particular, the time required to establish a secure session and transmit the biometric data was deemed sufficiently short not to compromise the usability of the mobile device.

The authentication service runs on Apache Tomcat 6 servers and Postgres 8.4 databases. The web services are, instead, realized using the Jersey library (i.e., a JAX-RS/JSR311 Reference Implementation) for building RESTful web services.

Finally, the example application is a custom portal developed as a Rich Internet Application using the Sencha ExtJS 4 JavaScript framework, integrating different external online services (e.g., Gmail, Youtube, Twitter, Flickr) made accessible dynamically following the current trust value of the continuous authentication protocol.

8 CONCLUDING REMARKS

We exploited the novel possibility introduced by biometrics to define a protocol for continuous authentication that improves the security and usability of user sessions. The protocol computes adaptive timeouts on the basis of the trust posed in the user activity and in the quality and kind of biometric data acquired transparently through monitoring in background the user’s actions.

Some architectural design decisions of CASHMA are discussed here. First, the system exchanges raw data and not the features extracted from them or templates, while crypto-token approaches are not considered; as debated in Section 3.1, this is due to architectural decisions where the client is kept very simple. We remark that our proposed protocol works with no changes using features, templates, or raw data. Second, privacy concerns should be addressed considering national legislations.
At present, our prototype only performs some checks on face recognition, where only one face (the biggest one resulting from the face detection phase directly on the client device) is considered for identity verification and the others are deleted. Third, when data is acquired in an uncontrolled environment, the quality of biometric data could strongly depend on the surroundings. While performing a client-side quality analysis of the acquired data would be a reasonable approach to reduce the computational burden on the server, and is compatible with our objective of designing a protocol independent from quality ratings of images (we just consider a sensor trust), this goes against the CASHMA requirement of having a light client.

We now discuss the usability of our proposed protocol. In our approach, the client device uses part of its sensors extensively through time, and transmits data on the Internet. This introduces the problem of battery consumption, which has not been quantified in this paper: as discussed in Section 7, we developed and exercised a prototype to verify the feasibility of the approach, but a complete assessment of the solution through experimental evaluation is not reported. Also, the frequency of the acquisition of biometric data is fundamental for the protocol usage; if biometric data are acquired too sparingly, the protocol would be basically useless. This mostly depends on the profile of the client and consequently on his usage of the device. In summary, battery consumption and user profile may constitute limitations to our approach, which in the worst case may require narrowing the applicability of the solution to specific cases, for example, only when accessing specific websites and for a limited time window, or to grant access to restricted areas (see also the examples in Section 3.2).
This characterization has not been investigated in this paper and constitutes part of our future work.

It should be noted that the functions proposed for the evaluation of the session timeout are selected from a very large set of possible alternatives. Although in the literature we could not identify comparable functions used in very similar contexts, we acknowledge that different functions may be identified, compared, and preferred under specific conditions or user requirements; this analysis is left out as it goes beyond the scope of the paper, which is the introduction of the continuous authentication approach for Internet services.

ACKNOWLEDGMENTS

This work was partially supported by the Italian MIUR through the projects FIRB 2005 CASHMA (DM1621 18 July 2005) and PRIN 2010-3P34XC TENACE.