Real-Time Big Data Analytical Architecture for Remote Sensing Application
In today's era, there is a great deal more to real-time remote sensing Big Data than meets the eye, and extracting useful information efficiently confronts a system with major computational challenges, such as analyzing, aggregating, and storing remotely collected data. In view of the above-mentioned factors, there is a need for a system architecture that supports both real-time and offline data processing. In this paper, we propose a real-time Big Data analytical architecture for remote sensing satellite applications.
The proposed architecture comprises three main units:
1) Remote sensing Big Data acquisition unit (RSDU);
2) Data processing unit (DPU); and
3) Data analysis decision unit (DADU).
First, the RSDU acquires data from the satellite and sends it to the Base Station, where initial processing takes place. Second, the DPU plays a vital role in the architecture by providing efficient processing of real-time Big Data through filtration, load balancing, and parallel processing. Third, the DADU is the upper-layer unit of the proposed architecture, responsible for compilation, storage of the results, and generation of decisions based on the results received from the DPU.
1.2 INTRODUCTION:
Recently, a great deal of interest in the field of Big Data and its analysis has arisen, driven mainly by the extensive number of research challenges closely related to real applications, such as modeling, processing, querying, mining, and distributing large-scale repositories. The term "Big Data" classifies specific kinds of data sets comprising unstructured data, which reside in the data layer of technical computing applications and the Web. The data stored in the underlying layer of all these technical computing application scenarios have some precise characteristics in common, such as 1) large-scale data, which refers to the size of the data and the data warehouse; 2) scalability issues, which refer to the application's need to run at large scale (e.g., on Big Data); 3) support for the extraction, transformation, and loading (ETL) of low-level raw data into structured data up to a certain extent; and 4) the development of uncomplicated, interpretable analytics over Big Data warehouses with a view to delivering intelligent and meaningful knowledge from them.
Big Data are usually generated by online transactions, video/audio streams, email, click streams, logs, posts, social network data, scientific data, remote access sensory data, mobile phones, and their applications. These data accumulate in databases that grow extraordinarily and become difficult to capture, form, store, manage, share, process, analyze, and visualize with typical database software tools. Advancements in Big Data sensing and computer technology revolutionize the way remote data are collected, processed, analyzed, and managed. In particular, the most recently designed sensors used in earth and planetary observatory systems generate continuous streams of data. Moreover, a majority of the work has been done in various fields of remote sensing satellite image data, such as change detection, gradient-based edge detection, region-similarity-based edge detection, and intensity-gradient techniques for efficient intra-prediction.
In this paper, we refer to the high-speed continuous stream of data, or high-volume offline data, as "Big Data," which is leading us to a new world of challenges. The transformation of such remotely sensed data into scientific understanding is a critical task. Given the rate at which the volume of remote access data is increasing, a number of individual users as well as organizations now demand an efficient mechanism to collect, process, analyze, and store these data and their resources. Big Data analysis is a more challenging task than merely locating, identifying, understanding, and citing data. With large-scale data, all of this has to happen in a mechanized manner, since it requires diverse data structures as well as semantics to be articulated in a computer-readable format.
However, even when analyzing simple data with a single data set, a mechanism is required for deciding how to design the database, since there may be alternative ways to store the same information. A given design may have advantages over others for certain processes and drawbacks for other purposes. To address these needs, various analytical platforms have been provided by relational database vendors. These platforms come in various shapes, from software-only products to analytical services that run in third-party hosted environments. In remote access networks, data sources such as sensors can produce an overwhelming amount of raw data. This leads to the first step, i.e., data acquisition, in which much of the data are of no interest and can be filtered or compressed by orders of magnitude. The challenge is to define these filters in such a way that they do not discard useful information. For instance, when considering news reports, is it adequate to keep only the information mentioned alongside the company name? Alternatively, do we need the entire report, or simply a small piece of text around the mentioned name? The second challenge is the automatic generation of accurate metadata that describe the composition of the data and the way they were collected and analyzed. Such metadata are hard to analyze, since we may need to know the source of each data item in remote access.
1.3 LITERATURE SURVEY:
BIG DATA AND CLOUD COMPUTING: CURRENT STATE AND FUTURE OPPORTUNITIES
AUTHOR: D. Agrawal, S. Das, and A. E. Abbadi
PUBLISH: Proc. Int. Conf. Extending Database Technol. (EDBT), 2011, pp. 530–533.
EXPLANATION:
Scalable database management systems (DBMS)—both for
update intensive application workloads as well as decision support systems for descriptive
and deep analytics—are a critical part of the cloud infrastructure and play an
important role in ensuring the smooth transition of applications from the
traditional enterprise infrastructures to next generation cloud
infrastructures. Though scalable data management has been a vision for more
than three decades and much research has focussed on large scale data
management in traditional enterprise setting, cloud computing brings its own
set of novel challenges that must be addressed to ensure the success of data
management solutions in the cloud environment. This tutorial presents an
organized picture of the challenges faced by application developers and DBMS
designers in developing and deploying internet scale applications. Our
background study encompasses both classes of systems: (i) for supporting update
heavy applications, and (ii) for ad-hoc analytics and decision support. We then
focus on providing an in-depth analysis of systems for supporting update
intensive web applications and provide a survey of the state-of-the-art in this
domain. We crystallize the design choices made by some successful large-scale
database management systems, analyze the application demands and access
patterns, and enumerate the desiderata for a cloud-bound DBMS.
CHANGE DETECTION IN SYNTHETIC APERTURE RADAR IMAGE BASED ON FUZZY ACTIVE CONTOUR MODELS AND GENETIC ALGORITHMS
AUTHOR: J. Shi, J. Wu, A. Paul, L. Jiao, and M. Gong
PUBLISH: Math. Prob. Eng., vol. 2014, 15 pp., Apr. 2014.
EXPLANATION:
This paper presents an unsupervised
change detection approach for synthetic aperture radar images based on a fuzzy
active contour model and a genetic algorithm. The aim is to partition the
difference image which is generated from multitemporal satellite images into changed
and unchanged regions. Fuzzy technique is an appropriate approach to analyze
the difference image where regions are not always statistically homogeneous.
Since interval type-2 fuzzy sets are well-suited for modeling various
uncertainties in comparison to traditional fuzzy sets, they are combined with
active contour methodology for properly modeling uncertainties in the
difference image. The interval type-2 fuzzy active contour model is designed to
provide preliminary analysis of the difference image by generating intermediate
change detection masks. Each intermediate change detection mask has a cost
value. A genetic algorithm is employed to find the final change detection mask
with the minimum cost value by evolving the realization of intermediate change
detection masks. Experimental results on real synthetic aperture radar images
demonstrate that the change detection results obtained by the improved fuzzy
active contour model exhibit less error than previous approaches.
A BIG DATA ARCHITECTURE FOR LARGE SCALE SECURITY MONITORING
AUTHOR: S. Marchal, X. Jiang, R. State, and T. Engel
PUBLISH: Proc. IEEE Int. Congr. Big Data, 2014, pp. 56–63.
EXPLANATION:
Network traffic is a rich source of information for
security monitoring. However, the increasing volume of data to be treated raises
issues, rendering holistic analysis of network traffic difficult. In this paper
we propose a solution to cope with the tremendous amount of data to analyse for
security monitoring perspectives. We introduce an architecture dedicated to
security monitoring of local enterprise networks. The application domain of
such a system is mainly network intrusion detection and prevention, but can be
used as well for forensic analysis. This architecture integrates two systems,
one dedicated to scalable distributed data storage and management and the other
dedicated to data exploitation. DNS data, NetFlow records, HTTP traffic and
honeypot data are mined and correlated in a distributed system that leverages
state-of-the-art big data solutions. Data correlation schemes are proposed, and
their performance is evaluated against several well-known big data frameworks,
including Hadoop and Spark.
CHAPTER 2
2.0 SYSTEM ANALYSIS
2.1 EXISTING SYSTEM:
Existing methods are inapplicable on standard computers, where it is not desirable or possible to load the entire image into memory before doing any processing. In this situation, it is necessary to load only part of the image, process it, save the result to disk, and proceed to the next part. This corresponds to the concept of on-the-flow processing. Remote sensing processing can be seen as a chain of events or steps, each of which is generally independent of the following ones and generally focuses on a particular domain. For example, the image can be radiometrically corrected to compensate for atmospheric effects, and indices computed, before an object extraction based on these indices takes place.
The typical processing chain will process the whole image for each step, returning the final result after everything is done. For some processing chains, iterations between the different steps are required to find the correct set of parameters. Due to the variability of satellite images and the variety of the tasks that need to be performed, fully automated tasks are rare. Humans are still an important part of the loop. These concepts are linked in the sense that both rely on the ability to process only one part of the data.
In the case of simple algorithms, this is
quite easy: the input is just split into different non-overlapping pieces that
are processed one by one. But most algorithms do consider the neighborhood of
each pixel. As a consequence, in most cases, the data will have to be split
into partially overlapping pieces. The objective is to obtain the same result
as the original algorithm, as if the processing had been done in one go.
Depending on the algorithm, this is unfortunately not always possible.
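To make the overlap requirement concrete, here is a minimal Python sketch (not from the paper) that uses a 3-point moving average as a stand-in neighborhood filter; the tile size and margin are illustrative:

```python
def smooth(xs):
    """3-point moving average with edge replication (a stand-in for any
    neighborhood-based step such as a convolution filter)."""
    n = len(xs)
    padded = [xs[0]] + list(xs) + [xs[-1]]
    return [(padded[i] + padded[i + 1] + padded[i + 2]) / 3.0 for i in range(n)]

def smooth_in_tiles(xs, tile, margin=1):
    """Process `xs` in tiles of length `tile`, each read with `margin`
    extra samples on both sides, then trim the overlap so the output
    matches running `smooth` on the whole array in one go."""
    n = len(xs)
    out = []
    for start in range(0, n, tile):
        lo, hi = max(0, start - margin), min(n, start + tile + margin)
        piece = smooth(xs[lo:hi])       # process the padded tile
        keep = start - lo               # trim the overlap back off
        out.extend(piece[keep:keep + min(tile, n - start)])
    return out
```

With a margin at least as wide as the filter's reach, the tiled result is identical to processing the whole array in one pass; a neighborhood looking k pixels to each side would need margin = k.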
2.1.1 DISADVANTAGES:
- A reader that loads the image, or part of the image, into memory from the file on disk;
- A filter that carries out a local processing step that does not require access to neighboring pixels (a simple threshold, for example); the processing can happen on CPU or GPU;
- A filter that requires the values of neighboring pixels to compute the value of a given pixel (a convolution filter is a typical example); the processing can happen on CPU or GPU;
- A writer that outputs the resulting image from memory into a file on disk; note that the file may be written in several steps. This example illustrates how it is possible to compute part of the image through the whole pipeline while incurring only minimal computational overhead.
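The reader–filter–writer pipeline above can be sketched as follows. This is an illustrative minimal version for a pixel-local filter (no neighborhood access); all function names are invented for the example:

```python
def process_in_chunks(read_chunk, local_filter, write_chunk):
    """Pull one piece of the image at a time, apply a pixel-local
    operation, and write the result before fetching the next piece,
    so the whole image is never held in memory at once."""
    while True:
        chunk = read_chunk()            # None signals end of image
        if chunk is None:
            break
        write_chunk(local_filter(chunk))

# Example: clip pixel values at 100, chunk by chunk.
parts = iter([[1, 200, 30], [250, 5]])
written = []
process_in_chunks(lambda: next(parts, None),
                  lambda c: [min(v, 100) for v in c],
                  written.append)
# written is now [[1, 100, 30], [100, 5]]
```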
2.2 PROPOSED SYSTEM:
We present a remote sensing Big Data analytical architecture, which is used to analyze real-time as well as offline data. At first, the data are remotely preprocessed so that they become machine-readable. Afterward, this useful information is transmitted to the Earth Base Station for further data processing. The Earth Base Station performs two types of processing: processing of real-time data and processing of offline data. In the case of offline data, the data are transmitted to an offline data-storage device; the incorporation of this device enables later usage of the data. Real-time data, in contrast, are directly transmitted to the filtration and load balancer server, where a filtration algorithm extracts the useful information from the Big Data.
On the other hand, the load balancer balances the processing power by distributing the real-time data equally across the servers. The filtration and load-balancing server thus not only filters and balances the load but also enhances the system's efficiency. Furthermore, the filtered data are processed by the parallel servers and sent to the data aggregation unit (if required, the processed data can be stored in the result storage device) for comparison purposes by the decision and analyzing server. The proposed architecture accepts remote access sensory data as well as direct access network data (e.g., GPRS, 3G, xDSL, or WAN). The proposed architecture and algorithms are implemented by applying them to remote sensing earth observatory data.
The proposed architecture has the capability of dividing, load balancing, and parallel processing of only useful data. Thus, it efficiently analyzes real-time remote sensing Big Data from an earth observatory system. Furthermore, the proposed architecture has the capability of storing incoming raw data to perform offline analysis on large stored dumps, when required. Finally, a detailed analysis of remotely sensed earth observatory Big Data for land and sea areas is provided using .NET. In addition, various algorithms are proposed for each level of the RSDU, DPU, and DADU to detect land as well as sea areas and to elaborate the working of the architecture.
2.2.1 ADVANTAGES:
Our proposed architecture processes high-speed, large amounts of real-time remote sensory image data. It works on both the DPU and DADU, taking data from the satellite as input.
Our architecture handles offline as well as online traffic; we perform a simple analysis on remote sensing earth observatory data. We assume that the data are big in nature and difficult for a single server to handle.
The data arrive continuously from a satellite at high speed. Hence, special algorithms are needed to process and analyze that Big Data and to make decisions from it. In this section, we analyze remote sensing data for finding land, sea, or ice areas.
We have used the proposed architecture to perform
analysis and proposed an algorithm for handling, processing, analyzing, and
decision-making for remote sensing Big Data images using our proposed
architecture.
2.3 HARDWARE & SOFTWARE REQUIREMENTS:
2.3.1 HARDWARE REQUIREMENT:
- Processor – Pentium IV
- Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Keyboard – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA
2.3.2 SOFTWARE REQUIREMENTS:
- Operating System : Windows XP or Win7
- Front End : Microsoft Visual Studio .NET 2008
- Script : C# Script
- Back End : MS-SQL Server 2005
- Document : MS-Office 2007
CHAPTER 3
3.0 SYSTEM DESIGN:
Data Flow Diagram / Use Case Diagram / Flow Diagram:
- The DFD is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on these data, and the output data generated by the system.
- The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components: the system processes, the data used by the processes, the external entities that interact with the system, and the information flows in the system.
- The DFD shows how information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations applied as data move from input to output.
- A DFD may be used to represent a system at any level of abstraction, and may be partitioned into levels that represent increasing information flow and functional detail.
NOTATION:
SOURCE OR DESTINATION OF DATA:
External sources or destinations, which may be people or organizations or other entities
DATA SOURCE:
Here the data referenced by a process is stored and retrieved.
PROCESS:
People, procedures, or devices that produce data. The physical component is not identified.
DATA FLOW:
Data moves in a specific direction from an origin to a destination. The data flow is a “packet” of data.
There are several common modeling rules when creating DFDs:
- All processes must have at least one data flow in and one data flow out.
- All processes should modify the incoming data, producing new forms of outgoing data.
- Each data store must be involved with at least one data flow.
- Each external entity must be involved with at least one data flow.
- A data flow must be attached to at least one process.
3.1 ARCHITECTURE DIAGRAM:
3.2 DATAFLOW DIAGRAM:
UML DIAGRAMS:
3.3 USE CASE DIAGRAM:
3.4 CLASS DIAGRAM:
3.5 SEQUENCE DIAGRAM:
3.6 ACTIVITY DIAGRAM:
CHAPTER 4
4.0 IMPLEMENTATION:
Big Data covers diverse technologies, as does cloud computing. The input of Big Data comes from social networks (Facebook, Twitter, LinkedIn, etc.), Web servers, satellite imagery, sensory data, banking transactions, etc. Despite the very recent emergence of Big Data architecture in scientific applications, numerous efforts toward Big Data analytics architecture can already be found in the literature. Among numerous others, we propose a remote sensing Big Data architecture to analyze Big Data in an efficient manner, as shown in Fig. 1. Fig. 1 delineates n satellites that obtain earth observatory Big Data images with sensors or conventional cameras, through which sceneries are recorded using radiation. Special techniques are applied to process and interpret remote sensing imagery for the purpose of producing conventional maps, thematic maps, resource surveys, etc. We have divided the remote sensing Big Data architecture into three main units.
In healthcare scenarios, medical practitioners gather massive volumes of data about patients, medical histories, medications, and other details. The above-mentioned data are accumulated by drug-manufacturing companies. The nature of these data is very complex, and sometimes practitioners are unable to relate them to other information, which results in the loss of important information. Employing advanced analytic techniques for organizing and extracting useful information from such Big Data enables personalized medication, and advanced Big Data analytic techniques give insight into the hereditary causes of disease.
4.1 ALGORITHM:
This algorithm takes satellite data or products, filters and divides them into segments, and applies a load-balancing algorithm.
The processing algorithm calculates results for different parameters against each incoming block and sends them to the next level. In step 1, the mean, SD, absolute difference, and the number of values greater than the maximum threshold are calculated. In the next step, the results are transmitted to the aggregation server.
The ACA collects the results from each processing server for each block Bi and then combines, organizes, and stores these results in an RDBMS database.
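The step-1 block computation might look like the following sketch. Reading "absolute difference" as the mean absolute deviation from the block mean is our assumption, and the threshold value is illustrative:

```python
import math

def block_parameters(block, max_threshold):
    """Step-1 statistics for one incoming block: mean, standard
    deviation, absolute difference (here: mean absolute deviation from
    the mean, an assumed interpretation), and the count of values
    above the maximum threshold."""
    n = len(block)
    mean = sum(block) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in block) / n)
    abs_diff = sum(abs(v - mean) for v in block) / n
    above = sum(1 for v in block if v > max_threshold)
    return {"mean": mean, "sd": sd, "abs_diff": abs_diff, "above": above}
```

Each processing server would run this over its assigned blocks and forward the resulting dictionaries to the aggregation server.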
4.2 MODULES:
DATA ANALYSIS DECISION UNIT (DADU):
DATA PROCESSING UNIT (DPU):
REMOTE SENSING BIG DATA ACQUISITION UNIT (RSDU):
FINDINGS AND DISCUSSION:
ALGORITHM DESIGN AND TESTING:
4.3 MODULE DESCRIPTION:
DATA PROCESSING UNIT (DPU):
In the data processing unit (DPU), the filtration and load balancer server has two basic responsibilities: filtration of data and load balancing of processing power. Filtration identifies the data useful for analysis, since it only allows useful information through, whereas the rest of the data are blocked and discarded. Hence, it enhances the performance of the whole proposed system. Meanwhile, the load-balancing part of the server divides the whole filtered data set into parts and assigns them to various processing servers. The filtration and load-balancing algorithm varies from analysis to analysis; e.g., if there is only a need to analyze sea wave and temperature data, the measurements of these data are filtered out and segmented into parts.
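A minimal sketch of these two responsibilities, assuming a simple predicate for "useful" data, fixed-size blocks, and round-robin assignment (the paper's actual filtration and distribution rules vary by analysis):

```python
def filter_and_balance(records, is_useful, block_size, n_servers):
    """Keep only useful records, cut them into fixed-size blocks, and
    deal the blocks round-robin across the processing servers."""
    useful = [r for r in records if is_useful(r)]
    blocks = [useful[i:i + block_size]
              for i in range(0, len(useful), block_size)]
    assignment = {s: [] for s in range(n_servers)}
    for i, block in enumerate(blocks):
        assignment[i % n_servers].append(block)
    return assignment
```

For example, filtering the even values out of 0..9 into blocks of two for two servers yields `{0: [[0, 2], [8]], 1: [[4, 6]]}`.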
Each processing server has its own algorithm implementation for processing the incoming segment of data from the FLBS. Each processing server makes statistical calculations, takes any measurements, and performs other mathematical or logical tasks to generate intermediate results for each segment of data. Since these servers perform their tasks independently and in parallel, the performance of the proposed system is dramatically enhanced, and the results for each segment are generated in real time. The results generated by each server are then sent to the aggregation server for compilation, organization, and storage for further processing.
DATA ANALYSIS DECISION UNIT (DADU):
The DADU contains three major portions: the aggregation and compilation server, the result storage server(s), and the decision-making server. When the results are ready for compilation, the processing servers in the DPU send the partial results to the aggregation and compilation server; since these partial results are not in an organized and compiled form, there is a need to aggregate the related results, organize them into a proper form for further processing, and store them. In the proposed architecture, the aggregation and compilation server is supported by various algorithms that compile, organize, store, and transmit the results. Again, the algorithm varies from requirement to requirement and depends on the analysis needs. The aggregation server stores the compiled and organized results in the result storage so that any server can use them at any time.
The aggregation server also sends a copy of the results to the decision-making server, which processes them to make decisions. The decision-making server is supported by decision algorithms, which query different aspects of the results and then make various decisions (e.g., in our analysis we detect land, sea, and ice, whereas other findings such as fires, storms, tsunamis, and earthquakes could also be detected). The decision algorithm must be robust and correct enough to efficiently produce results, discover hidden patterns, and make decisions. The decision part of the architecture is significant, since any small error in decision-making can degrade the efficiency of the whole analysis. The DADU finally displays or broadcasts the decisions so that any application can utilize them in real time for its own development. The applications can be any business software, general-purpose community software, or other social networks that need those findings (i.e., decisions).
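A decision step of this shape can be sketched as a comparison of an aggregated parameter against threshold values; the parameter (a block's mean intensity) and the cutoffs below are purely illustrative, not calibrated values from the paper:

```python
def classify_block(aggregated_mean, sea_max=50.0, land_max=120.0):
    """Label one aggregated block by comparing its mean intensity with
    threshold values (illustrative thresholds, not calibrated ones)."""
    if aggregated_mean <= sea_max:
        return "sea"
    if aggregated_mean <= land_max:
        return "land"
    return "ice"
```

In a real deployment, the thresholds would have to be derived from the sensor's radiometric calibration rather than fixed constants.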
REMOTE SENSING BIG DATA ACQUISITION UNIT (RSDU):
Remote sensing promotes the expansion of the earth observatory system as a cost-effective parallel data acquisition system to satisfy specific computational requirements. The Earth and Space Science Society originally approved this solution as the standard for parallel processing for improved Big Data acquisition; however, it was soon recognized that traditional data processing technologies could not provide sufficient power for processing such data. Therefore, parallel processing of the massive volume of data was required to efficiently analyze the Big Data. For that reason, the proposed RSDU is introduced into the remote sensing Big Data architecture; it gathers data from as many satellites around the globe as possible. The received raw data may be distorted by scattering and absorption by various atmospheric gases and dust particles. We assume that the satellite can correct the erroneous data.
To convert the raw data into image format, the remote sensing satellite uses effective data analysis: it preprocesses the data in many situations to integrate data from different sources, which not only decreases storage cost but also improves analysis accuracy. The data must be corrected by different methods to remove distortions caused by the motion of the platform relative to the earth, platform attitude, earth curvature, nonuniformity of illumination, variations in sensor characteristics, etc. The data are then transmitted to the Earth Base Station for further processing over a direct communication link. We divide the data processing procedure into two steps: real-time Big Data processing and offline Big Data processing. In the case of offline data processing, the Earth Base Station transmits the data to the data center for storage, where they are used for future analyses. In real-time data processing, however, the data are directly transmitted to the filtration and load balancer server (FLBS), since storing incoming real-time data would degrade the performance of real-time processing.
FINDINGS AND DISCUSSION:
Preprocessed and formatted data from the satellite contain all or some of the following parts, depending on the product.
1) Main product header (MPH): includes the product's basic information, i.e., ID, measurement and sensing time, orbit information, etc.
2) Specific product header (SPH): contains information specific to each product or product group, i.e., the number of data set descriptors (DSD), a directory of the remaining data sets in the file, etc.
3) Annotation data sets (ADS): contain information on quality, time-tagged processing parameters, geolocation tie points, solar angles, etc.
4) Global annotation data sets (GADS): contain scaling factors, offsets, calibration information, etc.
5) Measurement data set (MDS): contains measurements or graphical parameters calculated from the measurement, including a quality flag and the time tag of the measurement. The image data are also stored in this part, which is the main element of our analysis.
The MPH and SPH data are in ASCII format, whereas all the other data sets are in binary format. The MDS, ADS, and GADS consist of sequences of records with one or more fields of data per record. In our case, the MDS contains a number of records, and each record contains a number of fields. Each record of the MDS corresponds to one row of the satellite image, which is our main focus during the analysis.
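Reading such a structure (fixed-size binary records, one record per image row) can be sketched with Python's struct module; the sample format (unsigned 16-bit little-endian) and the row width are assumptions for illustration, not the real product layout:

```python
import struct

def read_mds_rows(payload, row_width, sample_fmt="<H"):
    """Split a binary MDS-style body into rows: each record holds
    `row_width` fixed-size samples, and each record maps to one row
    of the satellite image."""
    sample_size = struct.calcsize(sample_fmt)
    record_len = row_width * sample_size
    rows = []
    for off in range(0, len(payload), record_len):
        record = payload[off:off + record_len]
        rows.append([v for (v,) in struct.iter_unpack(sample_fmt, record)])
    return rows
```

For example, a 6-sample payload with `row_width=3` decodes into two image rows of three values each.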
ALGORITHM DESIGN AND TESTING:
Our algorithms are proposed to process
high-speed, large amount of real-time remote sensory image data using our
proposed architecture. It works on both DPU and DADU by taking data from
satellite as input to identify land and sea area from the data set. The set of
algorithms contains four simple algorithms, i.e., algorithm I, algorithm II,
algorithm III, and algorithm IV that work on filtrations and load balancer,
processing servers, aggregation server, and on decision-making server,
respectively. Algorithm I, i.e., filtration and load balancer algorithm (FLBA)
works on filtration and load balancer to filter only the require data by
discarding all other information. It also provides load balancing by dividing
the data into fixed size blocks and sending them to the processing server,
i.e., one or more distinct blocks to each server. This filtration, dividing,
and load-balancing task speeds up our performance by neglecting unnecessary
data and by providing parallel processing. Algorithm II, i.e., processing and
calculation algorithm (PCA) processes filtered data and is implemented on each
processing server. It provides various parameter calculations that are used in
the decision-making process. The parameters calculations results are then sent
to aggregation server for further processing. Algorithm III, i.e., aggregation
and compilations algorithm (ACA) stores, compiles, and organizes the results,
which can be used by decision-making server for land and sea area detection.
Algorithm IV, i.e., decision-making algorithm (DMA) identifies land area and
sea area by comparing the parameters results, i.e., from aggregation servers,
with threshold values.
CHAPTER 5
5.0 SYSTEM STUDY:
5.1 FEASIBILITY STUDY:
The feasibility of the project is analyzed in this phase and business proposal is put forth with a very general plan for the project and some cost estimates. During system analysis the feasibility study of the proposed system is to be carried out. This is to ensure that the proposed system is not a burden to the company. For feasibility analysis, some understanding of the major requirements for the system is essential.
Three key considerations involved in the feasibility analysis are
- ECONOMICAL FEASIBILITY
- TECHNICAL FEASIBILITY
- SOCIAL FEASIBILITY
5.1.1 ECONOMICAL FEASIBILITY:
This study is carried out to check the economic impact that the system will have on the organization. The amount of funds that the company can pour into the research and development of the system is limited, so the expenditures must be justified. The developed system is well within the budget, which was achieved because most of the technologies used are freely available; only the customized products had to be purchased.
5.1.2 TECHNICAL FEASIBILITY:
This study is carried out to check the technical feasibility, that is, the technical requirements of the system. Any system developed must not place a high demand on the available technical resources, as this would lead to high demands being placed on the client. The developed system must have modest requirements, as only minimal or no changes are required to implement this system.
5.1.3 SOCIAL FEASIBILITY:
This aspect of the study checks the level of acceptance of the system by the user. This includes the process of training the user to use the system efficiently. The user must not feel threatened by the system, but must instead accept it as a necessity. The level of acceptance by the users depends solely on the methods that are employed to educate the user about the system and to make the user familiar with it. The user's level of confidence must be raised so that he or she is also able to offer some constructive criticism, which is welcomed, as he or she is the final user of the system.
5.2 SYSTEM TESTING:
Testing is the process of checking whether the developed system works according to the original objectives and requirements. It is a set of activities that can be planned in advance and conducted systematically. Testing is vital to the success of the system. System testing makes the logical assumption that if all the parts of the system are correct, the goal will be successfully achieved. Inadequate testing, or no testing at all, leads to errors that may not appear until many months later. This creates two problems: the time lag between the cause and the appearance of the problem, and the effect of system errors on the files and records within the system. A small system error can conceivably explode into a much larger problem. Effective testing early in the process translates directly into long-term cost savings from a reduced number of errors. Another reason for system testing is its utility as a user-oriented vehicle before implementation. The best programs are worthless if they do not produce the correct outputs.
5.2.1 UNIT TESTING:
A program represents the logical elements of a system. For a program to run satisfactorily, it must compile, process test data correctly, and tie in properly with other programs. Achieving an error-free program is the responsibility of the programmer. Program testing checks for two types of errors: syntax and logic. A syntax error is a program statement that violates one or more rules of the language in which it is written. An improperly defined field dimension or omitted keywords are common syntax errors. These errors are shown through error messages generated by the computer. For logic errors, the programmer must examine the output carefully.
UNIT TESTING:
Description | Expected result |
Test for application window properties. | All the properties of the windows are to be properly aligned and displayed. |
Test for mouse operations. | All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions. |
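The syntax/logic distinction above can be made concrete with a small sketch (the class, method names, and values are invented for illustration). The buggy method compiles cleanly, so only comparing its actual output against a desk-calculated expected value reveals the fault.

```java
// Hypothetical example: a logic error that the compiler cannot catch.
public class AverageDemo {

    // Buggy version: integer division truncates before the cast.
    static double averageBuggy(int a, int b) {
        return (double) ((a + b) / 2);   // (3 + 4) / 2 == 3, not 3.5
    }

    // Fixed version: divide by a double so no truncation occurs.
    static double averageFixed(int a, int b) {
        return (a + b) / 2.0;
    }

    public static void main(String[] args) {
        // Desk-calculated expected value: (3 + 4) / 2 = 3.5
        System.out.println("buggy: " + averageBuggy(3, 4)); // prints buggy: 3.0
        System.out.println("fixed: " + averageFixed(3, 4)); // prints fixed: 3.5
    }
}
```

A unit test that compares both results against the desk-calculated 3.5 immediately flags the buggy version.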
5.2.2 FUNCTIONAL TESTING:
Functional testing of an application is used to prove that the application delivers correct results, using enough inputs to give an adequate level of confidence that it will work correctly for all sets of inputs. The functional testing will need to prove that the application works for each client type and that the personalization functions work correctly. When a program is tested, the actual output is compared with the expected output. When there is a discrepancy, the sequence of instructions must be traced to determine the problem. The process is facilitated by breaking the program into self-contained portions, each of which can be checked at certain key points. The idea is to compare program values against desk-calculated values to isolate the problems.
FUNCTIONAL TESTING:
Description | Expected result |
Test for all modules. | All peers should communicate in the group. |
Test for various peers in a distributed network framework; it should display all users available in the group. | The result after execution should be accurate. |
5.2.3 NON-FUNCTIONAL TESTING:
Non-functional software testing encompasses a rich spectrum of testing strategies, describing the expected results for every test case. It uses symbolic analysis techniques. This testing is used to check that an application will work in the operational environment. Non-functional testing includes:
- Load testing
- Performance testing
- Usability testing
- Reliability testing
- Security testing
5.2.4 LOAD TESTING:
An important tool for implementing system tests is a load generator. A load generator is essential for testing quality requirements such as performance and stress. A load can be a real load; that is, the system can be put under real usage by having actual telephone users connected to it, who will generate test input data for the system test.
LOAD TESTING:
Description | Expected result |
It is necessary to ascertain that the application behaves correctly under loads when ‘Server busy’ response is received. | Should designate another active node as a Server. |
5.2.5 PERFORMANCE TESTING:
Performance tests are used to determine the broadly defined performance of the software system, such as the execution time associated with various parts of the code, response time, and device utilization. The intent of this testing is to identify weak points of the software system and quantify its shortcomings.
PERFORMANCE TESTING:
Description | Expected result |
This is required to assure that the application performs adequately, having the capability to handle many peers, delivering its results in the expected time and using an acceptable level of resources; it is an aspect of operational management. | Should handle large input values and produce accurate results in the expected time. |
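A minimal sketch of such a timing measurement, assuming a stand-in workload (the method name and input size are invented for illustration), records the execution time of one part of the code so weak points can be quantified:

```java
public class PerfDemo {

    // Stand-in workload: sum the first n integers.
    static long workload(int n) {
        long sum = 0;
        for (int i = 1; i <= n; i++) {
            sum += i;
        }
        return sum;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        long result = workload(1_000_000);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        // Report execution time alongside the result, as a performance test would.
        System.out.println("result=" + result + " elapsedMs=" + elapsedMs);
    }
}
```

In a real performance test the measured times would be compared against stated response-time requirements rather than merely printed.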
5.2.6 RELIABILITY TESTING:
Software reliability is the ability of a system or component to perform its required functions under stated conditions for a specified period of time, and it is ensured in this testing. Reliability can be expressed as the ability of the software to reveal defects under testing conditions, according to the specified requirements. It is the probability that a software system will operate without failure under given conditions for a given time interval, and it focuses on the behavior of the software element. This testing forms part of the work of the software quality control team.
RELIABILITY TESTING:
Description | Expected result |
This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in providing the application. | In case of failure of the server, an alternate server should take over the job. |
5.2.7 SECURITY TESTING:
Security testing evaluates
system characteristics that relate to the availability, integrity and
confidentiality of the system data and services. Users/Clients should be
encouraged to make sure their security needs are very clearly known at
requirements time, so that the security issues can be addressed by the
designers and testers.
SECURITY TESTING:
Description | Expected result |
Checking that the user identification is authenticated. | In case of failure, it should not be connected to the framework. |
Check whether group keys in a tree are shared by all peers. | The peers should know group key in the same group. |
5.2.8 WHITE BOX TESTING:
White box testing, sometimes called glass-box testing, is a test case design method that uses the control structure of the procedural design to derive test cases. White box testing focuses on the inner structure of the software to be tested.
WHITE BOX TESTING:
Description | Expected result |
Exercise all logical decisions on their true and false sides. | All the logical decisions must be valid. |
Execute all loops at their boundaries and within their operational bounds. | All the loops must be finite. |
Exercise internal data structures to ensure their validity. | All the data structures must be valid. |
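The table's criteria can be sketched as code (the methods below are hypothetical, invented for illustration): the calls in main exercise both the true and false sides of each logical decision, and the loop at its boundaries (zero iterations and a full pass).

```java
public class WhiteBoxDemo {

    // Hypothetical method under test: clamp a value into [lo, hi].
    static int clamp(int value, int lo, int hi) {
        if (value < lo) {        // decision 1: true/false sides both tested
            return lo;
        }
        if (value > hi) {        // decision 2: true/false sides both tested
            return hi;
        }
        return value;
    }

    // Loop whose boundaries the tests exercise: count items <= limit.
    static int countUpTo(int[] items, int limit) {
        int count = 0;
        for (int item : items) {
            if (item <= limit) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Exercise both sides of each decision.
        System.out.println(clamp(-5, 0, 10)); // below range -> 0
        System.out.println(clamp(15, 0, 10)); // above range -> 10
        System.out.println(clamp(5, 0, 10));  // in range    -> 5
        // Exercise the loop at its boundaries: empty input and a full pass.
        System.out.println(countUpTo(new int[] {}, 10));        // 0 iterations -> 0
        System.out.println(countUpTo(new int[] {1, 2, 3}, 10)); // all match    -> 3
    }
}
```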
5.2.9 BLACK BOX TESTING:
Black box testing, also called behavioral testing, focuses on the functional requirements of the software. That is, black box testing enables the software engineer to derive sets of input conditions that will fully exercise all functional requirements for a program. Black box testing is not an alternative to white box techniques. Rather, it is a complementary approach that is likely to uncover a different class of errors than white box methods. Black box testing attempts to find errors by focusing on the inputs, outputs, and principal function of a software module. The starting point of black box testing is either a specification or code. The contents of the box are hidden, and the stimulated software should produce the desired results.
BLACK BOX TESTING:
Description | Expected result |
To check for incorrect or missing functions. | All the functions must be valid. |
To check for interface errors. | The entire interface must function normally. |
To check for errors in data structures or external database access. | Database update and retrieval must function correctly. |
To check for initialization and termination errors. | All the functions and data structures must be initialized properly and terminated normally. |
All the above system testing strategies are carried out during development, as the documentation and institutionalization of the proposed goals and related policies are essential.
CHAPTER 6
6.0 SOFTWARE SPECIFICATION:
6.1 FEATURES OF .NET:
Microsoft .NET is a set of Microsoft software technologies for rapidly building and integrating XML Web services, Microsoft Windows-based applications, and Web solutions. The .NET Framework is a language-neutral platform for writing programs that can easily and securely interoperate. There is no language barrier with .NET: numerous languages are available to the developer, including Managed C++, C#, Visual Basic and JScript.
The .NET framework provides the foundation for components to interact seamlessly, whether locally or remotely on different platforms. It standardizes common data types and communications protocols so that components created in different languages can easily interoperate.
“.NET” is
also the collective name given to various software components built upon the
.NET platform. These will be both products (Visual Studio.NET and Windows.NET
Server, for instance) and services (like Passport, .NET My Services, and so
on).
6.2 THE .NET FRAMEWORK
The .NET Framework has two main parts:
1. The Common Language Runtime (CLR).
2. A hierarchical set of class libraries.
The CLR is described as the “execution engine” of .NET. It provides the environment within which programs run. The most important features are
- Conversion from a low-level assembler-style language, called Intermediate Language (IL), into code native to the platform being executed on.
- Memory management, notably including garbage collection.
- Checking and enforcing security restrictions on the running code.
- Loading and executing programs, with version control and other such features.
The following features of the .NET framework are also worth describing:
Managed Code
The code that targets .NET, and which contains certain extra information – “metadata” – to describe itself. While both managed and unmanaged code can run in the runtime, only managed code contains the information that allows the CLR to guarantee, for instance, safe execution and interoperability.
Managed Data
With managed code comes managed data. The CLR provides memory allocation and deallocation facilities, and garbage collection. Some .NET languages use managed data by default, such as C#, Visual Basic.NET and JScript.NET, whereas others, namely C++, do not. Targeting the CLR can, depending on the language you’re using, impose certain constraints on the features available. As with managed and unmanaged code, one can have both managed and unmanaged data in .NET applications – data that doesn’t get garbage collected but instead is looked after by unmanaged code.
Common Type System
The CLR uses the Common Type System (CTS) to strictly enforce type-safety. This ensures that all classes are compatible with each other, by describing types in a common way. The CTS defines how types work within the runtime, which enables types in one language to interoperate with types in another language, including cross-language exception handling. As well as ensuring that types are only used in appropriate ways, the runtime also ensures that code doesn’t attempt to access memory that hasn’t been allocated to it.
Common Language Specification
The CLR provides built-in support for language interoperability. To ensure that you can develop managed code that can be fully used by developers using any programming language, a set of language features and rules for using them called the Common Language Specification (CLS) has been defined. Components that follow these rules and expose only CLS features are considered CLS-compliant.
6.3 THE CLASS LIBRARY
.NET provides a single-rooted hierarchy of classes, containing over 7000 types. The root of the namespace is called System; this contains basic types like Byte, Double, Boolean, and String, as well as Object. All objects derive from System.Object. As well as objects, there are value types. Value types can be allocated on the stack, which can provide useful flexibility. There are also efficient means of converting value types to object types if and when necessary.
The set of classes is pretty comprehensive, providing collections, file, screen, and network I/O, threading, and so on, as well as XML and database connectivity.
The class library is subdivided into a number of sets (or namespaces), each providing distinct areas of functionality, with dependencies between the namespaces kept to a minimum.
6.4 LANGUAGES SUPPORTED BY .NET
The multi-language capability of the .NET Framework and Visual Studio .NET enables developers to use their existing programming skills to build all types of applications and XML Web services. The .NET framework supports new versions of Microsoft’s old favorites Visual Basic and C++ (as VB.NET and Managed C++), but there are also a number of new additions to the family.
Visual Basic .NET has been updated to include many new and improved language features that make it a powerful object-oriented programming language. These features include inheritance, interfaces, and overloading, among others. Visual Basic also now supports structured exception handling, custom attributes and also supports multi-threading.
Visual Basic .NET is also CLS compliant, which means that any CLS-compliant language can use the classes, objects, and components you create in Visual Basic .NET.
Managed Extensions for C++ and attributed programming are just some of the enhancements made to the C++ language. Managed Extensions simplify the task of migrating existing C++ applications to the new .NET Framework.
C# is Microsoft’s new language. It’s a C-style language that is essentially “C++ for Rapid Application Development”. Unlike other languages, its specification is just the grammar of the language. It has no standard library of its own, and instead has been designed with the intention of using the .NET libraries as its own.
Microsoft Visual J# .NET provides the easiest transition for Java-language developers into the world of XML Web Services and dramatically improves the interoperability of Java-language programs with existing software written in a variety of other programming languages.
Active State has created Visual Perl and Visual Python, which enable .NET-aware applications to be built in either Perl or Python. Both products can be integrated into the Visual Studio .NET environment. Visual Perl includes support for Active State’s Perl Dev Kit.
Other languages for which .NET compilers are available include
- FORTRAN
- COBOL
- Eiffel
ASP.NET (XML Web Services) | Windows Forms
Base Class Libraries
Common Language Runtime
Operating System
Fig 1. .NET Framework
C#.NET is also compliant with CLS (Common Language Specification) and supports structured exception handling. CLS is set of rules and constructs that are supported by the CLR (Common Language Runtime). CLR is the runtime environment provided by the .NET Framework; it manages the execution of the code and also makes the development process easier by providing services.
C#.NET is a CLS-compliant language. Any objects, classes, or components created in C#.NET can be used in any other CLS-compliant language. In addition, we can use objects, classes, and components created in other CLS-compliant languages in C#.NET. The use of the CLS ensures complete interoperability among applications, regardless of the languages used to create them.
CONSTRUCTORS AND DESTRUCTORS:
Constructors are used to initialize objects, whereas destructors are used to destroy them. In other words, destructors are used to release the resources allocated to the object. In C#.NET the Finalize procedure is available. The Finalize procedure is used to complete the tasks that must be performed when an object is destroyed. It is called automatically when an object is destroyed, and it can be called only from the class it belongs to or from derived classes.
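Since the project's requirements also list Java, the construct can be sketched there (the class and "resource" are invented for illustration): the constructor acquires and initializes the object's resource, and an explicit cleanup method plays the role the text assigns to the destructor/Finalize step.

```java
public class ResourceDemo {

    private boolean open;

    // Constructor: initializes the object and acquires its (simulated) resource.
    public ResourceDemo() {
        open = true;
    }

    public boolean isOpen() {
        return open;
    }

    // Cleanup step: releases the resource, as a destructor/finalizer would.
    public void close() {
        open = false;
    }

    public static void main(String[] args) {
        ResourceDemo r = new ResourceDemo(); // constructor runs
        System.out.println(r.isOpen());      // true: resource acquired
        r.close();                           // resource released
        System.out.println(r.isOpen());      // false
    }
}
```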
GARBAGE COLLECTION
Garbage Collection is another new feature in C#.NET. The .NET Framework monitors allocated resources, such as objects and variables. In addition, the .NET Framework automatically releases memory for reuse by destroying objects that are no longer in use.
In C#.NET, the garbage collector checks for the objects that are not currently in use by applications. When the garbage collector comes across an object that is marked for garbage collection, it releases the memory occupied by the object.
OVERLOADING
Overloading is another feature in C#. Overloading enables us
to define multiple procedures with the same name, where each procedure has a
different set of arguments. Besides using overloading for procedures, we can
use it for constructors and properties in a class.
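Overloading can be sketched in Java, which resolves overloads the same way (the methods below are invented for illustration): three procedures share the name area but differ in their argument lists, and the compiler selects the right one from the call.

```java
public class Area {

    // Overloaded methods: same name, different parameter lists.
    static int area(int side) {               // square
        return side * side;
    }

    static int area(int length, int width) {  // rectangle
        return length * width;
    }

    static double area(double radius) {       // circle
        return Math.PI * radius * radius;
    }

    public static void main(String[] args) {
        System.out.println(area(3));    // square of side 3 -> 9
        System.out.println(area(3, 4)); // 3 x 4 rectangle  -> 12
        System.out.println(area(1.0));  // unit circle      -> pi
    }
}
```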
MULTITHREADING:
C#.NET also supports multithreading. An application that supports multithreading can handle multiple tasks simultaneously; we can use multithreading to decrease the time taken by an application to respond to user interaction.
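A minimal Java sketch of the idea (class name and counts invented for illustration): two threads run the same task simultaneously, and synchronization keeps the shared counter correct.

```java
public class ThreadDemo {

    private static int counter = 0;

    // Synchronized so concurrent increments do not lose updates.
    private static synchronized void increment() {
        counter++;
    }

    // Runs two threads concurrently and returns the final count.
    static int runDemo() {
        counter = 0; // reset so repeated runs are deterministic
        Runnable task = () -> {
            for (int i = 0; i < 10_000; i++) {
                increment();
            }
        };
        Thread t1 = new Thread(task);
        Thread t2 = new Thread(task);
        t1.start();
        t2.start();
        try {
            t1.join(); // wait for both tasks to finish
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return counter;
    }

    public static void main(String[] args) {
        System.out.println(runDemo()); // 20000: no increments lost
    }
}
```

Without the synchronized keyword the two threads could interleave their read-modify-write steps and the final count would be unpredictable.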
STRUCTURED EXCEPTION HANDLING
C#.NET supports structured exception handling, which enables us to detect and handle errors at runtime. In C#.NET, we use try…catch…finally statements to create exception handlers. Using try…catch…finally statements, we can create robust and effective exception handlers to improve the reliability of our application.
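The try…catch…finally pattern can be sketched in Java, which shares the construct (the class and method names are invented for illustration): a runtime error is caught and handled with a fallback, and the finally block runs on both the normal and the error path.

```java
public class SafeParser {

    // Parse a number, falling back to a default on bad input.
    static int parseOrDefault(String text, int fallback) {
        try {
            return Integer.parseInt(text);      // may throw at runtime
        } catch (NumberFormatException e) {
            return fallback;                    // error detected and handled
        } finally {
            // Cleanup that must happen in both paths would go here.
        }
    }

    public static void main(String[] args) {
        System.out.println(parseOrDefault("42", 0));   // valid input  -> 42
        System.out.println(parseOrDefault("oops", 0)); // handled error -> 0
    }
}
```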
6.5 THE .NET FRAMEWORK
The .NET Framework is a new computing platform that simplifies application development in the highly distributed environment of the Internet.
OBJECTIVES OF .NET FRAMEWORK
1. To provide a consistent object-oriented programming environment whether object code is stored and executed locally, executed locally but Internet-distributed, or executed remotely.
2. To provide a code-execution environment that minimizes software deployment conflicts and guarantees safe execution of code.
3. To eliminate the performance problems of scripted or interpreted environments.
There are different types of applications, such as Windows-based applications and Web-based applications.
6.6 FEATURES OF SQL-SERVER
The OLAP Services feature available in SQL Server version 7.0 is now called SQL Server 2000 Analysis Services. The term OLAP Services has been replaced with the term Analysis Services. Analysis Services also includes a new data mining component. The Repository component available in SQL Server version 7.0 is now called Microsoft SQL Server 2000 Meta Data Services. References to the component now use the term Meta Data Services. The term repository is used only in reference to the repository engine within Meta Data Services.
A SQL Server database consists of the following types of objects:
1. TABLE
2. QUERY
3. FORM
4. REPORT
5. MACRO
6.7 TABLE:
A database is a collection of data about a specific topic.
VIEWS OF TABLE:
We can work with a table in two views:
1. Design View
2. Datasheet View
Design View
To build or modify the structure of a table, we work in the table design view. We can specify what kind of data each field will hold.
Datasheet View
To add, edit, or analyze the data itself, we work in the table's datasheet view mode.
QUERY:
A query is a question that is asked of the data. Access gathers the data that answers the question from one or more tables. The data that make up the answer is either a dynaset (which can be edited) or a snapshot (which cannot be edited). Each time we run the query, we get the latest information in the dynaset. Access either displays the dynaset or snapshot for us to view, or performs an action on it, such as deleting or updating.
CHAPTER 7
7.0 APPENDIX
7.1 SAMPLE SCREEN SHOTS:
7.2 SAMPLE SOURCE CODE:
CHAPTER 8
8.1 CONCLUSION AND FUTURE WORK:
In this paper, we proposed an architecture for real-time Big Data analysis for remote sensing applications. The architecture efficiently processes and analyzes real-time and offline remote sensing Big Data for decision-making. The proposed architecture is composed of three major units: 1) RSDU; 2) DPU; and 3) DADU. These units implement algorithms for each level of the architecture depending on the required analysis. The architecture for real-time Big Data analysis is generic (application independent), so it can be used for any type of remote sensing Big Data analysis. Furthermore, filtering, dividing, and parallel processing are applied to only the useful information, while all other extra data are discarded. These processes make the architecture a better choice for real-time remote sensing Big Data analysis.
The algorithms proposed in this paper for each unit and its subunits are used to analyze remote sensing data sets, which helps in better understanding land and sea areas. The proposed architecture welcomes researchers and organizations to perform any type of remote sensing Big Data analysis by developing algorithms for each level of the architecture depending on their analysis requirements. For future work, we plan to extend the proposed architecture to make it compatible with Big Data analysis for all applications, e.g., sensors and social networking. We also plan to use the proposed architecture to perform complex analysis on earth observatory data for decision-making in real time, such as earthquake prediction, tsunami prediction, fire detection, etc.
Rank-Based Similarity Search: Reducing the Dimensional Dependence
This paper introduces a data structure for k-NN
search, the Rank Cover Tree (RCT), whose pruning tests rely solely on the
comparison of similarity values; other properties of the underlying space, such
as the triangle inequality, are not employed. Objects are selected according to
their ranks with respect to the query object, allowing much tighter control on
the overall execution costs. A formal theoretical analysis shows that with very
high probability, the RCT returns a correct query result in time that depends
very competitively on a measure of the intrinsic dimensionality of the data
set. The experimental results for the RCT show that non-metric pruning
strategies for similarity search can be practical even when the
representational dimension of the data is extremely high. They also show that
the RCT is capable of meeting or exceeding the level of performance of
state-of-the-art methods that make use of metric pruning or other selection
tests involving numerical constraints on distance values.
1.2 INTRODUCTION
Of the fundamental operations employed in data mining tasks such as classification, cluster analysis, and anomaly detection, perhaps the most widely encountered is similarity search. Similarity search is the foundation of k-nearest-neighbor (k-NN) classification, which often produces competitively low error rates in practice, particularly when the number of classes is large. The error rate of nearest-neighbor classification has been shown to be ‘asymptotically optimal’ as the training set size increases. For clustering, many of the most effective and popular strategies require the determination of neighbor sets based at a substantial proportion of the data set objects, for example hierarchical (agglomerative) methods. Content-based filtering methods for recommender systems and anomaly detection methods also commonly make use of k-NN techniques, either through the direct use of k-NN search or by means of k-NN cluster analysis.
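As a concrete illustration of k-NN classification in its simplest form (the toy data, labels, and class name are invented), a minimal 1-NN classifier assigns each query the label of its nearest training point:

```java
public class NearestNeighbor {

    // Labeled 2-D training points: two from class "A", two from class "B".
    static double[][] points = { {0, 0}, {0, 1}, {5, 5}, {6, 5} };
    static String[] labels = { "A", "A", "B", "B" };

    static double squaredDistance(double[] p, double[] q) {
        double dx = p[0] - q[0];
        double dy = p[1] - q[1];
        return dx * dx + dy * dy;
    }

    // 1-NN classification: scan all points, keep the closest one's label.
    static String classify(double[] query) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < points.length; i++) {
            double d = squaredDistance(points[i], query);
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        return labels[best];
    }

    public static void main(String[] args) {
        System.out.println(classify(new double[] {1, 1})); // near the "A" cluster
        System.out.println(classify(new double[] {5, 4})); // near the "B" cluster
    }
}
```

The linear scan here is exactly the brute-force cost that index structures such as the RCT aim to avoid on large, high-dimensional data sets.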
A very popular density-based measure, the Local Outlier Factor (LOF), relies heavily on k-NN set computation to determine the relative density of the data in the vicinity of the test point [8]. For data mining applications based on similarity search, data objects are typically modeled as feature vectors of attributes for which some measure of similarity is defined. Motivated at least in part by the impact of similarity search on problems in data mining, machine learning, pattern recognition, and statistics, the design and analysis of scalable and effective similarity search structures has been the subject of intensive research for many decades. Until relatively recently, most data structures for similarity search targeted low-dimensional real vector space representations and the Euclidean or other Lp distance metrics.
However, many public and commercial data sets available today are more naturally represented as vectors spanning many hundreds or thousands of feature attributes that can be real or integer-valued, ordinal or categorical, or even a mixture of these types. This has spurred the development of search structures for more general metric spaces, such as the MultiVantage-Point Tree, the Geometric Near-neighbor Access Tree (GNAT), Spatial Approximation Tree (SAT), the M-tree, and (more recently) the Cover Tree (CT). Despite their various advantages, spatial and metric search structures are both limited by an effect often referred to as the curse of dimensionality.
One way in which the curse may manifest itself is in a tendency of distances to concentrate strongly around their mean values as the dimension increases. Consequently, most pairwise distances become difficult to distinguish, and the triangle inequality can no longer be effectively used to eliminate candidates from consideration along search paths. Evidence suggests that when the representational dimension of feature vectors is high (roughly 20 or more), traditional similarity search accesses an unacceptably high proportion of the data elements, unless the underlying data distribution has special properties. Even though the local neighborhood information employed by data mining applications is useful and meaningful, high data dimensionality tends to make this local information very expensive to obtain.
The performance of similarity search indices depends crucially on the way in which they use similarity information for the identification and selection of objects relevant to the query. Virtually all existing indices make use of numerical constraints for pruning and selection. Such constraints include the triangle inequality (a linear constraint on three distance values), other bounding surfaces defined in terms of distance (such as hypercubes or hyperspheres), range queries involving approximation factors as in Locality-Sensitive Hashing (LSH) or absolute quantities as additive distance terms. One serious drawback of such operations based on numerical constraints such as the triangle inequality or distance ranges is that the number of objects actually examined can be highly variable, so much so that the overall execution time cannot be easily predicted.
For similarity search, researchers and practitioners have investigated practical methods for speeding up the computation of neighborhood information at the expense of accuracy. For data mining applications, the approaches considered have included feature sampling for local outlier detection, data sampling for clustering, and approximate similarity search for k-NN classification. Examples of fast approximate similarity search indices include the BD-Tree, a widely recognized benchmark for approximate k-NN search; it makes use of splitting rules and early termination to improve upon the performance of the basic KD-Tree. One of the most popular methods for indexing, Locality-Sensitive Hashing, can also achieve good practical search performance for range queries by managing parameters that influence a tradeoff between accuracy and time.
1.3 LITERATURE SURVEY
THE RELEVANT SET CORRELATION MODEL FOR DATA CLUSTERING
AUTHOR:
PUBLISH:
EXPLANATION:
AUTHOR:
PUBLISH:
EXPLANATION:
AUTHOR:
PUBLISH:
EXPLANATION:
CHAPTER 2
2.0 SYSTEM ANALYSIS:
2.1 EXISTING SYSTEM:
2.1.1 DISADVANTAGES:
2.2 PROPOSED SYSTEM:
2.2.1 ADVANTAGES:
2.3 HARDWARE & SOFTWARE REQUIREMENTS:
2.3.1 HARDWARE REQUIREMENT:
- Processor – Pentium IV
- Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Key Board – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA
2.3.2 SOFTWARE REQUIREMENTS:
JAVA
- Operating System : Windows XP or Win7
- Front End : JAVA JDK 1.7
- Back End : MYSQL Server
- Server : Apache Tomcat Server
- Script : JSP Script
- Document : MS-Office 2007
.NET
- Operating System : Windows XP or Win7
- Front End : Microsoft Visual Studio .NET 2008
- Script : C# Script
- Back End : MS-SQL Server 2005
- Document : MS-Office 2007
CHAPTER 3
3.0 SYSTEM DESIGN:
Data Flow Diagram / Use Case Diagram / Flow Diagram:
- The DFD is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on these data, and the output data generated by the system.
- The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components. These components are the system process, the data used by the process, an external entity that interacts with the system and the information flows in the system.
- DFD shows how the information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.
- DFD is also known as bubble chart. A DFD may be used to represent a system at any level of abstraction. DFD may be partitioned into levels that represent increasing information flow and functional detail.
NOTATION:
SOURCE OR DESTINATION OF DATA:
External sources or destinations, which may be people or organizations or other entities
DATA SOURCE:
Here the data referenced by a process is stored and retrieved.
PROCESS:
People, procedures, or devices that produce data. The physical component is not identified.
DATA FLOW:
Data moves in a specific direction from an origin to a destination. The data flow is a “packet” of data.
There are several common modeling rules when creating DFDs:
- All processes must have at least one data flow in and one data flow out.
- All processes should modify the incoming data, producing new forms of outgoing data.
- Each data store must be involved with at least one data flow.
- Each external entity must be involved with at least one data flow.
- A data flow must be attached to at least one process.
3.1 ARCHITECTURE DIAGRAM
3.2 DATAFLOW DIAGRAM
UML DIAGRAMS:
3.3 USE CASE DIAGRAM:
3.4 CLASS DIAGRAM:
3.5 SEQUENCE DIAGRAM:
3.6 ACTIVITY DIAGRAM:
CHAPTER 4
4.0 IMPLEMENTATION:
4.1 ALGORITHM
4.2 MODULES:
4.3 MODULE DESCRIPTION:
CHAPTER 5
5.0 SYSTEM STUDY:
5.1 FEASIBILITY STUDY:
The feasibility of the project is analyzed in this phase, and a business proposal is put forth with a very general plan for the project and some cost estimates. During system analysis, the feasibility study of the proposed system is carried out. This is to ensure that the proposed system is not a burden to the company. For feasibility analysis, some understanding of the major requirements for the system is essential.
Three key considerations involved in the feasibility analysis are
- ECONOMICAL FEASIBILITY
- TECHNICAL FEASIBILITY
- SOCIAL FEASIBILITY
5.1.1 ECONOMICAL FEASIBILITY:
This study is carried out to check the economic impact that the system will have on the organization. The amount of funds that the company can pour into the research and development of the system is limited. The expenditures must be justified. Thus the developed system is well within the budget, and this was achieved because most of the technologies used are freely available. Only the customized products had to be purchased.
5.1.2 TECHNICAL FEASIBILITY
This study is carried out to check the technical feasibility, that is, the technical requirements of the system. Any system developed must not place a high demand on the available technical resources; otherwise, high demands will be placed on the client. The developed system must have modest requirements, so that only minimal or no changes are required for implementing it.
5.1.3 SOCIAL FEASIBILITY:
This aspect of the study checks the level of acceptance of the system by the user. This includes the process of training the user to use the system efficiently. The user must not feel threatened by the system, but must instead accept it as a necessity. The level of acceptance by the users depends largely on the methods that are employed to educate the user about the system and to make him familiar with it. His level of confidence must be raised so that he can also offer constructive criticism, which is welcomed, as he is the final user of the system.
5.2 SYSTEM TESTING:
Testing is a process of checking whether the developed system is working according to the original objectives and requirements. It is a set of activities that can be planned in advance and conducted systematically. Testing is vital to the success of the system. System testing makes a logical assumption that if all the parts of the system are correct, the global will be successfully achieved. In adequate testing if not testing leads to errors that may not appear even many months.
This creates two problems: the time lag between the cause and the appearance of the problem, and the effect of system errors on the files and records within the system. A small system error can conceivably grow into a much larger problem. Effective testing early in the process translates directly into long-term cost savings from a reduced number of errors. Another reason for system testing is its utility as a user-oriented vehicle before implementation. Even the best programs are worthless if they do not produce the correct outputs.
5.2.1 UNIT TESTING:
Description | Expected result |
Test for application window properties. | All the properties of the windows are to be properly aligned and displayed. |
Test for mouse operations. | All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions. |
A program represents the logical elements of a system. For a program to run satisfactorily, it must compile, process test data correctly, and tie in properly with other programs. Achieving an error-free program is the responsibility of the programmer. Program testing checks for two types of errors: syntax and logical. A syntax error is a program statement that violates one or more rules of the language in which it is written. An improperly defined field dimension or omitted keywords are common syntax errors. These errors are shown through error messages generated by the computer. For logic errors, the programmer must examine the output carefully.
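As an illustration of comparing actual output against expected output, here is a minimal unit-test sketch in plain Java (no testing framework); the add method is a hypothetical unit under test, not part of this project's code:

```java
// Minimal unit-test sketch in plain Java (no framework).
// The class under test and its add() method are hypothetical examples.
public class CalculatorTest {

    // Unit under test: a tiny logical element of a program.
    public static int add(int a, int b) {
        return a + b;
    }

    public static void main(String[] args) {
        // A unit test compares the actual result with the expected result.
        int actual = add(2, 3);
        int expected = 5;
        if (actual != expected) {
            throw new AssertionError("expected " + expected + " but got " + actual);
        }
        System.out.println("add(2, 3) test passed");
    }
}
```

A logic error in add (say, returning a - b) would compile cleanly but fail this comparison, which is exactly the class of error unit testing is meant to expose.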
5.2.2 FUNCTIONAL TESTING:
Functional testing of an application is used to prove that the application delivers correct results, using enough inputs to give an adequate level of confidence that it will work correctly for all sets of inputs. The functional testing will need to prove that the application works for each client type and that the personalization functions work correctly. When a program is tested, the actual output is compared with the expected output. When there is a discrepancy, the sequence of instructions must be traced to determine the problem. The process is facilitated by breaking the program into self-contained portions, each of which can be checked at certain key points. The idea is to compare program values against desk-calculated values to isolate the problems.
Description | Expected result |
Test for all modules. | All peers should communicate in the group. |
Test for various peers in a distributed network framework, checking that it displays all users available in the group. | The result after execution should give the accurate result. |
5.2.3 NON-FUNCTIONAL TESTING:
Non-functional software testing encompasses a rich spectrum of testing strategies, describing the expected results for every test case. It uses symbolic analysis techniques. This testing is used to check that an application will work in the operational environment. Non-functional testing includes:
- Load testing
- Performance testing
- Usability testing
- Reliability testing
- Security testing
5.2.4 LOAD TESTING:
An important tool for implementing system tests is a load generator. A load generator is essential for testing quality requirements such as performance and stress. A load can be a real load; that is, the system can be put under test to real usage by having actual telephone users connected to it, who will generate test input data for the system test.
Description | Expected result |
It is necessary to ascertain that the application behaves correctly under load, when a ‘Server busy’ response is received. | Should designate another active node as a Server. |
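The load-generator idea described above can be sketched in Java: a thread pool simulates many concurrent clients firing requests at a stand-in service, and the handled requests are counted. The handleRequest stub is purely illustrative and stands in for the real system under test:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// A toy load generator: several concurrent "clients" each fire a burst
// of requests at a placeholder service, and successful calls are counted.
public class LoadGenerator {

    static final AtomicInteger handled = new AtomicInteger();

    // Stand-in for the system under test.
    static void handleRequest() {
        handled.incrementAndGet();
    }

    public static int run(int clients, int requestsPerClient) throws InterruptedException {
        handled.set(0);
        ExecutorService pool = Executors.newFixedThreadPool(clients);
        for (int c = 0; c < clients; c++) {
            pool.submit(() -> {
                for (int r = 0; r < requestsPerClient; r++) handleRequest();
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return handled.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("handled " + run(8, 100) + " requests");
    }
}
```

In a real load test, handleRequest would be replaced by a call into the deployed application, and the counter by response-time and error-rate measurements.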
5.2.5 PERFORMANCE TESTING:
Performance tests are utilized in order to determine the widely defined performance of the software system such as execution time associated with various parts of the code, response time and device utilization. The intent of this testing is to identify weak points of the software system and quantify its shortcomings.
Description | Expected result |
This is required to assure that the application performs adequately, having the capability to handle many peers, delivering its results in the expected time and using an acceptable level of resources; it is an aspect of operational management. | Should handle large input values, and produce accurate results in the expected time. |
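A minimal sketch of measuring the execution time associated with a part of the code, using System.nanoTime(); the workload method is a hypothetical piece of code under measurement:

```java
// A minimal performance-measurement sketch: time a code section with
// System.nanoTime() and report the elapsed wall-clock time.
public class PerfTimer {

    // Hypothetical workload whose execution time we want to measure.
    public static long workload(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) sum += i;
        return sum;
    }

    public static long timeMillis(int n) {
        long start = System.nanoTime();
        workload(n);
        long elapsedNanos = System.nanoTime() - start;
        return elapsedNanos / 1_000_000;  // convert nanoseconds to milliseconds
    }

    public static void main(String[] args) {
        System.out.println("workload took " + timeMillis(1_000_000) + " ms");
    }
}
```

Repeating such measurements over different parts of the code is what lets a performance test identify the weak points of the system and quantify its shortcomings.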
5.2.6 RELIABILITY TESTING:
Software reliability is the ability of a system or component to perform its required functions under stated conditions for a specified period of time, and it is ensured in this testing. Reliability can be expressed as the ability of the software to reveal defects under testing conditions, according to the specified requirements. It is the probability that a software system will operate without failure under given conditions for a given time interval, and it focuses on the behavior of the software element. It forms a part of the work of the software quality control team.
Description | Expected result |
This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in providing the application. | In case of failure of the server, an alternate server should take over the job. |
5.2.7 SECURITY TESTING:
Security testing evaluates system characteristics that relate to the availability, integrity and confidentiality of the system data and services. Users/Clients should be encouraged to make sure their security needs are very clearly known at requirements time, so that the security issues can be addressed by the designers and testers.
Description | Expected result |
Checking that the user identification is authenticated. | In case of failure it should not be connected in the framework. |
Check whether group keys in a tree are shared by all peers. | The peers should know group key in the same group. |
5.2.8 WHITE BOX TESTING:
White box testing, sometimes called glass-box testing, is a test-case design method that uses the control structure of the procedural design to derive test cases. White box testing focuses on the inner structure of the software to be tested.
Description | Expected result |
Exercise all logical decisions on their true and false sides. | All the logical decisions must be valid. |
Execute all loops at their boundaries and within their operational bounds. | All the loops must be finite. |
Exercise internal data structures to ensure their validity. | All the data structures must be valid. |
5.2.9 BLACK BOX TESTING:
Black box testing, also called behavioral testing, focuses on the functional requirements of the software. That is, black box testing enables the software engineer to derive sets of input conditions that will fully exercise all functional requirements for a program. Black box testing is not an alternative to white box techniques; rather, it is a complementary approach that is likely to uncover a different class of errors than white box methods. Black box testing attempts to find errors by focusing on the inputs, outputs, and principal functions of a software module. The starting point of black box testing is either a specification or code. The contents of the box are hidden, and the stimulated software should produce the desired results.
Description | Expected result |
To check for incorrect or missing functions. | All the functions must be valid. |
To check for interface errors. | The entire interface must function normally. |
To check for errors in data structures or external database access. | The database update and retrieval must be done. |
To check for initialization and termination errors. | All the functions and data structures must be initialized properly and terminated normally. |
All the above system testing strategies are carried out during development, as the documentation and institutionalization of the proposed goals and related policies are essential.
CHAPTER 6
6.0 SOFTWARE DESCRIPTION:
6.1 JAVA TECHNOLOGY:
Java technology is both a programming language and a platform.
The Java Programming Language
The Java programming language is a high-level language that can be characterized by all of the following buzzwords:
- Simple
- Architecture neutral
- Object oriented
- Portable
- Distributed
- High performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure
With most programming languages, you either compile or interpret a program so that you can run it on your computer. The Java programming language is unusual in that a program is both compiled and interpreted. With the compiler, first you translate a program into an intermediate language called Java byte codes —the platform-independent codes interpreted by the interpreter on the Java platform. The interpreter parses and runs each Java byte code instruction on the computer. Compilation happens just once; interpretation occurs each time the program is executed. The following figure illustrates how this works.
You can think of Java byte codes as the machine code instructions for the Java Virtual Machine (Java VM). Every Java interpreter, whether it’s a development tool or a Web browser that can run applets, is an implementation of the Java VM. Java byte codes help make “write once, run anywhere” possible. You can compile your program into byte codes on any platform that has a Java compiler. The byte codes can then be run on any implementation of the Java VM. That means that as long as a computer has a Java VM, the same program written in the Java programming language can run on Windows 2000, a Solaris workstation, or on an iMac.
6.2 THE JAVA PLATFORM:
A platform is the hardware or software environment in which a program runs. We’ve already mentioned some of the most popular platforms like Windows 2000, Linux, Solaris, and MacOS. Most platforms can be described as a combination of the operating system and hardware. The Java platform differs from most other platforms in that it’s a software-only platform that runs on top of other hardware-based platforms.
The Java platform has two components:
- The Java Virtual Machine (Java VM)
- The Java Application Programming Interface (Java API)
You’ve already been introduced to the Java VM. It’s the base for the Java platform and is ported onto various hardware-based platforms.
The Java API is a large collection of ready-made software components that provide many useful capabilities, such as graphical user interface (GUI) widgets. The Java API is grouped into libraries of related classes and interfaces; these libraries are known as packages. The next section, What Can Java Technology Do?, highlights what functionality some of the packages in the Java API provide.
The following figure depicts a program that’s running on the Java platform. As the figure shows, the Java API and the virtual machine insulate the program from the hardware.
Native code is code that, after compilation, runs on a specific hardware platform. As a platform-independent environment, the Java platform can be a bit slower than native code. However, smart compilers, well-tuned interpreters, and just-in-time byte code compilers can bring performance close to that of native code without threatening portability.
6.3 WHAT CAN JAVA TECHNOLOGY DO?
The most common types of programs written in the Java programming language are applets and applications. If you’ve surfed the Web, you’re probably already familiar with applets. An applet is a program that adheres to certain conventions that allow it to run within a Java-enabled browser.
However, the Java programming language is not just for writing cute, entertaining applets for the Web. The general-purpose, high-level Java programming language is also a powerful software platform. Using the generous API, you can write many types of programs.
An application is a standalone program that runs directly on the Java platform. A special kind of application known as a server serves and supports clients on a network. Examples of servers are Web servers, proxy servers, mail servers, and print servers. Another specialized program is a servlet.
A servlet can almost be thought of as an applet that runs on the server side. Java Servlets are a popular choice for building interactive web applications, replacing the use of CGI scripts. Servlets are similar to applets in that they are runtime extensions of applications. Instead of working in browsers, though, servlets run within Java Web servers, configuring or tailoring the server.
How does the API support all these kinds of programs? It does so with packages of software components that provide a wide range of functionality. Every full implementation of the Java platform gives you the following features:
- The essentials: Objects, strings, threads, numbers, input and output, data structures, system properties, date and time, and so on.
- Applets: The set of conventions used by applets.
- Networking: URLs, TCP (Transmission Control Protocol) and UDP (User Datagram Protocol) sockets, and IP (Internet Protocol) addresses.
- Internationalization: Help for writing programs that can be localized for users worldwide. Programs can automatically adapt to specific locales and be displayed in the appropriate language.
- Security: Both low level and high level, including electronic signatures, public and private key management, access control, and certificates.
- Software components: Known as JavaBeans™, these can plug into existing component architectures.
- Object serialization: Allows lightweight persistence and communication via Remote Method Invocation (RMI).
- Java Database Connectivity (JDBC™): Provides uniform access to a wide range of relational databases.
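As one concrete example of the security features listed above, the following sketch uses the standard java.security API to sign a message with a private key and verify it with the matching public key:

```java
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.Signature;

// Sketch of the high-level security APIs: generate an RSA key pair,
// sign a message with the private key, verify with the public key.
public class SignatureDemo {

    public static boolean signAndVerify(byte[] message) throws GeneralSecurityException {
        KeyPairGenerator gen = KeyPairGenerator.getInstance("RSA");
        gen.initialize(2048);
        KeyPair pair = gen.generateKeyPair();

        // Sign the message with the private key.
        Signature signer = Signature.getInstance("SHA256withRSA");
        signer.initSign(pair.getPrivate());
        signer.update(message);
        byte[] sig = signer.sign();

        // Verify the signature with the public key.
        Signature verifier = Signature.getInstance("SHA256withRSA");
        verifier.initVerify(pair.getPublic());
        verifier.update(message);
        return verifier.verify(sig);
    }

    public static void main(String[] args) throws GeneralSecurityException {
        System.out.println("signature valid: " + signAndVerify("hello".getBytes()));
    }
}
```

The same Signature class covers the "electronic signatures" and "public and private key management" capabilities mentioned in the list.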
The Java platform also has APIs for 2D and 3D graphics, accessibility, servers, collaboration, telephony, speech, animation, and more. The following figure depicts what is included in the Java 2 SDK.
6.4 HOW WILL JAVA TECHNOLOGY CHANGE MY LIFE?
We can’t promise you fame, fortune, or even a job if you learn the Java programming language. Still, it is likely to make your programs better and to require less effort than other languages. We believe that Java technology will help you do the following:
- Get started quickly: Although the Java programming language is a powerful object-oriented language, it’s easy to learn, especially for programmers already familiar with C or C++.
- Write less code: Comparisons of program metrics (class counts, method counts, and so on) suggest that a program written in the Java programming language can be four times smaller than the same program in C++.
- Write better code: The Java programming language encourages good coding practices, and its garbage collection helps you avoid memory leaks. Its object orientation, its JavaBeans component architecture, and its wide-ranging, easily extendible API let you reuse other people’s tested code and introduce fewer bugs.
- Develop programs more quickly: Your development time may be as much as twice as fast as when writing the same program in C++. Why? You write fewer lines of code, and Java is a simpler programming language than C++.
- Avoid platform dependencies with 100% Pure Java: You can keep your program portable by avoiding the use of libraries written in other languages. The 100% Pure Java™ Product Certification Program has a repository of historical process manuals, white papers, brochures, and similar materials online.
- Write once, run anywhere: Because 100% Pure Java programs are compiled into machine-independent byte codes, they run consistently on any Java platform.
- Distribute software more easily: You can upgrade applets easily from a central server. Applets take advantage of the feature of allowing new classes to be loaded “on the fly,” without recompiling the entire program.
6.5 ODBC:
Microsoft Open Database Connectivity (ODBC) is a standard programming interface for application developers and database systems providers. Before ODBC became a de facto standard for Windows programs to interface with database systems, programmers had to use proprietary languages for each database they wanted to connect to. Now, ODBC has made the choice of the database system almost irrelevant from a coding perspective, which is as it should be. Application developers have much more important things to worry about than the syntax that is needed to port their program from one database to another when business needs suddenly change.
Through the ODBC Administrator in Control Panel, you can specify the particular database that is associated with a data source that an ODBC application program is written to use. Think of an ODBC data source as a door with a name on it. Each door will lead you to a particular database. For example, the data source named Sales Figures might be a SQL Server database, whereas the Accounts Payable data source could refer to an Access database. The physical database referred to by a data source can reside anywhere on the LAN.
The ODBC system files are not installed on your system by Windows 95. Rather, they are installed when you set up a separate database application, such as SQL Server Client or Visual Basic 4.0. When the ODBC icon is installed in Control Panel, it uses a file called ODBCINST.DLL. It is also possible to administer your ODBC data sources through a stand-alone program called ODBCADM.EXE. There is a 16-bit and a 32-bit version of this program, and each maintains a separate list of ODBC data sources.
From a programming perspective, the beauty of ODBC is that the application can be written to use the same set of function calls to interface with any data source, regardless of the database vendor. The source code of the application doesn’t change whether it talks to Oracle or SQL Server. We only mention these two as an example. There are ODBC drivers available for several dozen popular database systems. Even Excel spreadsheets and plain text files can be turned into data sources. The operating system uses the Registry information written by ODBC Administrator to determine which low-level ODBC drivers are needed to talk to the data source (such as the interface to Oracle or SQL Server). The loading of the ODBC drivers is transparent to the ODBC application program. In a client/server environment, the ODBC API even handles many of the network issues for the application programmer.
The advantages of this scheme are so numerous that you are probably thinking there must be some catch. The only disadvantage of ODBC is that it isn’t as efficient as talking directly to the native database interface. ODBC has had many detractors make the charge that it is too slow. Microsoft has always claimed that the critical factor in performance is the quality of the driver software that is used. In our humble opinion, this is true. The availability of good ODBC drivers has improved a great deal recently. And anyway, the criticism about performance is somewhat analogous to those who said that compilers would never match the speed of pure assembly language. Maybe not, but the compiler (or ODBC) gives you the opportunity to write cleaner programs, which means you finish sooner. Meanwhile, computers get faster every year.
6.6 JDBC:
In an effort to set an independent database standard API for Java, Sun Microsystems developed Java Database Connectivity, or JDBC. JDBC offers a generic SQL database access mechanism that provides a consistent interface to a variety of RDBMSs. This consistent interface is achieved through the use of “plug-in” database connectivity modules, or drivers. If a database vendor wishes to have JDBC support, he or she must provide the driver for each platform that the database and Java run on.
To gain a wider acceptance of JDBC, Sun based JDBC’s framework on ODBC. As you discovered earlier in this chapter, ODBC has widespread support on a variety of platforms. Basing JDBC on ODBC will allow vendors to bring JDBC drivers to market much faster than developing a completely new connectivity solution.
JDBC was announced in March of 1996. It was released for a 90 day public review that ended June 8, 1996. Because of user input, the final JDBC v1.0 specification was released soon after.
The remainder of this section will cover enough information about JDBC for you to know what it is about and how to use it effectively. This is by no means a complete overview of JDBC. That would fill an entire book.
6.7 JDBC Goals:
Few software packages are designed without goals in mind. JDBC is one package whose many goals drove the development of the API. These goals, in conjunction with early reviewer feedback, have finalized the JDBC class library into a solid framework for building database applications in Java.
The goals that were set for JDBC are important. They will give you some insight as to why certain classes and functionalities behave the way they do. The design goals for JDBC are as follows:
- SQL Level API
The designers felt that their main goal was to define a SQL interface for Java. Although not the lowest database interface level possible, it is at a low enough level for higher-level tools and APIs to be created. Conversely, it is at a high enough level for application programmers to use it confidently. Attaining this goal allows for future tool vendors to “generate” JDBC code and to hide many of JDBC’s complexities from the end user.
- SQL Conformance
SQL syntax varies as you move from database vendor to database vendor. In an effort to support a wide variety of vendors, JDBC will allow any query statement to be passed through it to the underlying database driver. This allows the connectivity module to handle non-standard functionality in a manner that is suitable for its users.
- JDBC must be implementable on top of common database interfaces
The JDBC SQL API must “sit” on top of other common SQL level APIs. This goal allows JDBC to use existing ODBC level drivers by the use of a software interface. This interface would translate JDBC calls to ODBC and vice versa.
- Provide a Java interface that is consistent with the rest of the Java system
Because of Java’s acceptance in the user community thus far, the designers feel that they should not stray from the current design of the core Java system.
- Keep it simple
This goal probably appears in all software design goal listings. JDBC is no exception. Sun felt that the design of JDBC should be very simple, allowing for only one method of completing a task per mechanism. Allowing duplicate functionality only serves to confuse the users of the API.
- Use strong, static typing wherever possible
Strong typing allows more error checking to be done at compile time; also, fewer errors appear at runtime.
- Keep the common cases simple
Because, more often than not, the usual SQL calls used by the programmer are simple SELECTs, INSERTs, DELETEs, and UPDATEs, these queries should be simple to perform with JDBC. However, more complex SQL statements should also be possible.
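A sketch of such a common case with JDBC, assuming a hypothetical users table with id and name columns; fetchName would need a real JDBC driver on the class path and a valid database URL to run:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// The "common case" with JDBC: a simple parameterized SELECT.
// The JDBC URL, table name, and column names are hypothetical examples.
public class SimpleSelect {

    // Pure helper: builds the parameterized SQL text.
    public static String buildQuery(String table, String column) {
        return "SELECT " + column + " FROM " + table + " WHERE id = ?";
    }

    // Runs the query through whatever JDBC driver matches the URL.
    public static String fetchName(String jdbcUrl, int id) throws SQLException {
        try (Connection con = DriverManager.getConnection(jdbcUrl);
             PreparedStatement ps = con.prepareStatement(buildQuery("users", "name"))) {
            ps.setInt(1, id);                       // bind the ? placeholder
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getString(1) : null;
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(buildQuery("users", "name"));
    }
}
```

Note that the code talks only to the java.sql interfaces; swapping the database vendor changes the JDBC URL and driver, not this source code, which is exactly the uniform-access goal described above.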
Finally, we decided to proceed with the implementation using Java networking, and for dynamically updating the cache table we chose the MS Access database.

As noted above, Java is both a programming language and a platform, and each Java program is both compiled and interpreted: the compiler translates the program into the intermediate Java byte codes, the platform-independent instructions that are then parsed and run on the computer. Compilation happens just once; interpretation occurs each time the program is executed.
6.8 NETWORKING TCP/IP STACK:
The TCP/IP stack is shorter than the OSI one:
TCP is a connection-oriented protocol; UDP (User Datagram Protocol) is a connectionless protocol.
IP datagrams:
The IP layer provides a connectionless and unreliable delivery system. It considers each datagram independently of the others. Any association between datagrams must be supplied by the higher layers. The IP layer supplies a checksum that includes its own header. The header includes the source and destination addresses. The IP layer handles routing through an Internet. It is also responsible for breaking up large datagrams into smaller ones for transmission and reassembling them at the other end.
UDP:
UDP is also connectionless and unreliable. What it adds to IP is a checksum for the contents of the datagram and port numbers. These are used to give a client/server model – see later.
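The client/server use of UDP port numbers can be sketched with Java's DatagramSocket: a datagram is sent to the receiver's port on the loopback interface and read back. The socket bound to port 0 simply asks the system for any free port:

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

// Sketch of UDP's client/server model: one socket plays the server
// (receiver), another the client (sender), both on the loopback interface.
public class UdpEcho {

    public static String sendAndReceive(String text) throws Exception {
        try (DatagramSocket receiver = new DatagramSocket(0);   // any free port
             DatagramSocket sender = new DatagramSocket()) {
            byte[] payload = text.getBytes(StandardCharsets.UTF_8);
            InetAddress loopback = InetAddress.getLoopbackAddress();

            // The port number identifies the receiving service on the host.
            sender.send(new DatagramPacket(payload, payload.length,
                                           loopback, receiver.getLocalPort()));

            byte[] buf = new byte[1024];
            DatagramPacket in = new DatagramPacket(buf, buf.length);
            receiver.setSoTimeout(5000);                        // don't block forever
            receiver.receive(in);
            return new String(in.getData(), 0, in.getLength(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(sendAndReceive("hello over UDP"));
    }
}
```

Because UDP is connectionless and unreliable, a real application would have to handle the timeout case where the datagram never arrives; this sketch simply lets the exception propagate.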
TCP:
TCP supplies logic to give a reliable connection-oriented protocol above IP. It provides a virtual circuit that two processes can use to communicate.
Internet addresses
In order to use a service, you must be able to find it. The Internet uses an address scheme for machines so that they can be located. The address is a 32-bit integer, which gives the IP address.
Network address:
Class A uses 8 bits for the network address, with 24 bits left over for other addressing. Class B uses 16-bit network addressing. Class C uses 24-bit network addressing, and class D addresses are reserved for multicast.
Subnet address:
Internally, the UNIX network is divided into sub networks. Building 11 is currently on one sub network and uses 10-bit addressing, allowing 1024 different hosts.
Host address:
8 bits are finally used for host addresses within our subnet. This places a limit of 256 machines that can be on the subnet.
Total address:
The 32-bit address is usually written as four integers separated by dots.
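The dotted notation can be computed directly from the 32-bit integer, one byte per field, as this Java sketch shows:

```java
// Converts a 32-bit IP address into the usual dotted-decimal notation:
// four integers separated by dots, one per byte, high byte first.
public class DottedQuad {

    public static String toDotted(int address) {
        return ((address >>> 24) & 0xFF) + "." +
               ((address >>> 16) & 0xFF) + "." +
               ((address >>>  8) & 0xFF) + "." +
               ( address         & 0xFF);
    }

    public static void main(String[] args) {
        // 0x7F000001 is the loopback address.
        System.out.println(toDotted(0x7F000001));  // prints 127.0.0.1
    }
}
```

The unsigned right shift (>>>) matters here: Java ints are signed, and addresses with the high bit set would otherwise produce negative field values.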
Port addresses
A service exists on a host and is identified by its port. This is a 16-bit number. To send a message to a server, you send it to the port for that service on the host that it is running on. This is not location transparency! Certain of these ports are “well known”.
Sockets:
A socket is a data structure maintained by the system to handle network connections. A socket is created using the socket call, which returns an integer that is like a file descriptor. In fact, under Windows, this handle can be used with the ReadFile and WriteFile functions.
#include <sys/types.h>
#include <sys/socket.h>
int socket(int family, int type, int protocol);
Here, family will be AF_INET for IP communications, protocol will be zero, and type will depend on whether TCP or UDP is used. Two processes wishing to communicate over a network create a socket each. These are similar to two ends of a pipe, but the actual pipe does not yet exist.
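The C prototype above creates one end of such a pipe; the same two-ends idea in Java, where ServerSocket and Socket form the two ends of a TCP connection, can be sketched as follows (a one-line echo over the loopback interface):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

// The two-ends-of-a-pipe idea in Java: a server socket accepts a
// connection and echoes one line back to the client, all on loopback.
public class TcpPipe {

    public static String roundTrip(String line) throws Exception {
        try (ServerSocket server = new ServerSocket(0)) {   // any free port
            Thread echo = new Thread(() -> {
                try (Socket s = server.accept();
                     BufferedReader in = new BufferedReader(
                             new InputStreamReader(s.getInputStream()));
                     PrintWriter out = new PrintWriter(s.getOutputStream(), true)) {
                    out.println(in.readLine());             // echo the line back
                } catch (IOException ignored) { }
            });
            echo.start();

            try (Socket client = new Socket(InetAddress.getLoopbackAddress(),
                                            server.getLocalPort());
                 PrintWriter out = new PrintWriter(client.getOutputStream(), true);
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(client.getInputStream()))) {
                out.println(line);
                return in.readLine();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip("hello over TCP"));
    }
}
```

Unlike the UDP case, this connection is a reliable virtual circuit: once established, the two processes can read and write as if through an actual pipe.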
6.9 JFREE CHART:
JFreeChart is a free 100% Java chart library that makes it easy for developers to display professional quality charts in their applications. JFreeChart’s extensive feature set includes:
A consistent and well-documented API, supporting a wide range of chart types;
A flexible design that is easy to extend, and targets both server-side and client-side applications;
Support for many output types, including Swing components, image files (including PNG and JPEG), and vector graphics file formats (including PDF, EPS and SVG);
JFreeChart is “open source” or, more specifically, free software. It is distributed under the terms of the GNU Lesser General Public Licence (LGPL), which permits use in proprietary applications.
6.9.1 Map Visualizations:
Charts showing values that relate to geographical areas. Some examples include: (a) population density in each state of the United States, (b) income per capita for each country in Europe, (c) life expectancy in each country of the world. The tasks in this project include: Sourcing freely redistributable vector outlines for the countries of the world, states/provinces in particular countries (USA in particular, but also other areas);
Creating an appropriate dataset interface (plus default implementation), a renderer, and integrating this with the existing XYPlot class in JFreeChart; testing, documenting, testing some more, documenting some more.
6.9.2 Time Series Chart Interactivity
Implement a new (to JFreeChart) feature for interactive time series charts — to display a separate control that shows a small version of ALL the time series data, with a sliding “view” rectangle that allows you to select the subset of the time series data to display in the main chart.
6.9.3 Dashboards
There is currently a lot of interest in dashboard displays. Create a flexible dashboard mechanism that supports a subset of JFreeChart chart types (dials, pies, thermometers, bars, and lines/time series) that can be delivered easily via both Java Web Start and an applet.
6.9.4 Property Editors
The property editor mechanism in JFreeChart only handles a small subset of the properties that can be set for charts. Extend (or reimplement) this mechanism to provide greater end-user control over the appearance of the charts.
CHAPTER 7
7.0 APPENDIX
7.1 SAMPLE SCREEN SHOTS:
7.2 SAMPLE SOURCE CODE:
CHAPTER 8
8.1 CONCLUSION
8.2 FUTURE ENHANCEMENT:
PSMPA: Patient Self-Controllable and Multi-Level Privacy-Preserving Cooperative Authentication in Distributed m-Healthcare Cloud Computing Systems
The distributed m-healthcare cloud computing system considerably facilitates secure and efficient patient treatment for medical consultation by sharing personal health information among the healthcare providers. This system, however, brings about the challenge of keeping both the data confidentiality and the patients’ identity privacy simultaneously. Many existing access control and anonymous authentication schemes cannot be straightforwardly exploited. To solve the problem, a novel authorized accessible privacy model (AAPM) is established. Patients can authorize physicians by setting an access tree supporting flexible threshold predicates.
Based on our new technique of attribute-based designated verifier signature, a patient self-controllable multi-level privacy-preserving cooperative authentication scheme (PSMPA) realizing three levels of security and privacy requirements in the distributed m-healthcare cloud computing system is proposed. The directly authorized physicians, the indirectly authorized physicians, and the unauthorized persons in medical consultation can, respectively, decipher the personal health information and/or verify patients’ identities by satisfying the access tree with their own attribute sets.
1.2 INTRODUCTION:
Distributed m-healthcare cloud computing systems have been increasingly adopted worldwide, including in the European Commission activities, the US Health Insurance Portability and Accountability Act (HIPAA), and many other government initiatives, for efficient and high-quality medical treatment. In m-healthcare social networks, the personal health information is always shared among the patients located in respective social communities suffering from the same disease for mutual support, and across distributed healthcare providers (HPs) equipped with their own cloud servers for medical consultation. However, this also brings about a series of challenges, especially how to ensure the security and privacy of the patients’ personal health information from various attacks in the wireless communication channel, such as eavesdropping and tampering. As to the security facet, one of the main issues is access control of patients’ personal health information; namely, it is only the authorized physicians or institutions that can recover the patients’ personal health information during data sharing in the distributed m-healthcare cloud computing system. In practice, most patients are concerned about the confidentiality of their personal health information, since any kind of unauthorized collection and disclosure is likely to get them into trouble.
Therefore, in distributed m-healthcare cloud computing systems, which part of the patients’ personal health information should be shared, and with which physicians it should be shared, have become two intractable problems demanding urgent solutions. Various research results have emerged focusing on them. A fine-grained distributed data access control scheme has been proposed using the technique of attribute-based encryption (ABE). A rendezvous-based access control method provides the access privilege if and only if the patient and the physician meet in the physical world. Recently, a patient-centric and fine-grained data access control scheme in multi-owner settings was constructed for securing personal health records in cloud computing. However, it mainly focuses on the central cloud computing system, which is not sufficient for efficiently processing the increasing volume of personal health information in the m-healthcare cloud computing system.
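To make the flavor of such attribute-based, threshold-style access control concrete, here is a toy Java sketch: access is granted when a user's attribute set matches at least k of the n required attributes. The attribute names are illustrative only, and this is just the predicate idea; the schemes above enforce it cryptographically, not with a plain set test:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Toy threshold predicate: grant access when the user's attributes
// satisfy at least `threshold` of the required attributes. This models
// only the access-policy logic, not the cryptographic enforcement.
public class ThresholdPredicate {

    public static boolean satisfies(Set<String> userAttributes,
                                    Set<String> required, int threshold) {
        int matched = 0;
        for (String attr : required) {
            if (userAttributes.contains(attr)) matched++;
        }
        return matched >= threshold;
    }

    public static void main(String[] args) {
        // Hypothetical physician attributes and a 2-of-3 policy.
        Set<String> physician = new HashSet<>(
                Arrays.asList("cardiology", "attending", "hospital-A"));
        Set<String> required = new HashSet<>(
                Arrays.asList("cardiology", "attending", "hospital-B"));
        System.out.println(satisfies(physician, required, 2));  // prints true
    }
}
```

In an ABE-style scheme, such predicates become gates in an access tree, and a user who fails the predicate simply cannot derive the decryption key rather than being turned away by an access check.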
Moreover, it is not enough to guarantee only the data confidentiality of the patient's personal health information in the honest-but-curious cloud server model, since the frequent communication between a patient and a professional physician can lead the adversary to conclude, with high probability, that the patient is suffering from a specific disease. Unfortunately, the problem of how to protect both the patients' data confidentiality and identity privacy in the distributed m-healthcare cloud computing scenario under the malicious model has been left untouched.
In this paper, we consider simultaneously achieving data confidentiality and identity privacy with high efficiency. As described in Fig. 1, in distributed m-healthcare cloud computing systems, all the members can be classified into three categories: the directly authorized physicians (green labels) in the local healthcare provider, who are authorized by the patients and can both access the patient's personal health information and verify the patient's identity; the indirectly authorized physicians (yellow labels) in the remote healthcare providers, who are authorized by the directly authorized physicians for medical consultation or research purposes (since they are not authorized by the patients themselves, we use the term 'indirectly authorized'), and who can access the personal health information but not the patient's identity; and the unauthorized persons (red labels), who can obtain nothing. We realize these three levels of privacy protection by extending the techniques of attribute-based access control and designated verifier signatures (DVS) on de-identified health information.
1.3 LITERATURE SURVEY
SECURING PERSONAL HEALTH RECORDS IN CLOUD COMPUTING: PATIENT-CENTRIC AND FINE-GRAINED DATA ACCESS CONTROL IN MULTI-OWNER SETTINGS
AUTHOR: M. Li, S. Yu, K. Ren, and W. Lou
PUBLISH: Proc. 6th Int. ICST Conf. Security Privacy Comm. Netw., 2010, pp. 89–106.
EXPLANATION:
Online personal health record (PHR) services enable patients to manage their own medical records in a centralized way, which greatly facilitates the storage, access, and sharing of personal health data. With the emergence of cloud computing, it is attractive for PHR service providers to shift their PHR applications and storage into the cloud, in order to enjoy elastic resources and reduce operational cost. However, by storing PHRs in the cloud, patients lose physical control over their personal health data, which makes it necessary for each patient to encrypt her PHR data before uploading it to the cloud servers. Under encryption, it is challenging to achieve fine-grained access control to PHR data in a scalable and efficient way: for each patient, the PHR data should be encrypted so that access remains scalable with the number of users authorized to read it. Also, since there are multiple owners (patients) in a PHR system and every owner encrypts her PHR files with a different set of cryptographic keys, it is important to reduce the key distribution complexity in such multi-owner settings. Existing cryptographically enforced access control schemes are mostly designed for single-owner scenarios. In this paper, we propose a novel framework for access control to PHRs within a cloud computing environment. To enable fine-grained and scalable access control, we leverage attribute-based encryption (ABE) techniques to encrypt each patient's PHR data. To reduce key distribution complexity, we divide the system into multiple security domains, where each domain manages only a subset of the users. In this way, each patient has full control over her own privacy, and the key management complexity is reduced dramatically.
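The multi-domain access-control idea above can be sketched in a few lines. The sketch below models only the access decision (does a user's attribute set satisfy the owner's policy?), not the underlying ABE cryptography; all class, user, and attribute names are hypothetical.

```python
# Illustrative model of attribute-based access control for PHRs.
# Only the access-decision logic is shown; real ABE would enforce
# this cryptographically rather than by a runtime check.

class SecurityDomain:
    """A domain manages attributes for a subset of users (e.g., one hospital)."""
    def __init__(self, name):
        self.name = name
        self.users = {}  # user id -> set of granted attributes

    def grant(self, user_id, attributes):
        self.users[user_id] = set(attributes)

def can_access(domain, user_id, policy):
    """A user may decrypt a PHR file iff their granted attributes
    include every attribute required by the owner's policy."""
    granted = domain.users.get(user_id, set())
    return policy <= granted

hospital = SecurityDomain("City Hospital")
hospital.grant("dr_lee", {"physician", "cardiology"})

# The patient encrypts her cardiology records under this policy:
policy = {"physician", "cardiology"}
print(can_access(hospital, "dr_lee", policy))    # True
print(can_access(hospital, "nurse_kim", policy)) # False
```

Because each domain only tracks its own users, key (here, attribute) management stays local, which is the complexity reduction the paragraph describes.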
PRIVACY AND EMERGENCY RESPONSE IN E-HEALTHCARE LEVERAGING WIRELESS BODY SENSOR NETWORKS
AUTHOR: J. Sun, Y. Fang, and X. Zhu
PUBLISH: IEEE Wireless Commun., vol. 17, no. 1, pp. 66–73, Feb. 2010.
EXPLANATION:
Electronic healthcare
is becoming a vital part of our living environment and exhibits advantages over
paper-based legacy systems. Privacy is the foremost concern of patients and the
biggest impediment to e-healthcare deployment. In addressing privacy issues,
conflicts from the functional requirements must be taken into account. One such
requirement is efficient and effective response to medical emergencies. In this
article, we provide detailed discussions on the privacy and security issues in
e-healthcare systems and viable techniques for these issues. Furthermore, we
demonstrate the design challenge in the fulfillment of conflicting goals
through an exemplary scenario, where the wireless body sensor network is
leveraged, and a sound solution is proposed to overcome the conflict.
HCPP: CRYPTOGRAPHY BASED SECURE EHR SYSTEM FOR PATIENT PRIVACY AND EMERGENCY HEALTHCARE
AUTHOR: J. Sun, X. Zhu, C. Zhang, and Y. Fang
PUBLISH: Proc. 31st Int. Conf. Distrib. Comput. Syst., 2011, pp. 373–382.
EXPLANATION:
Privacy concern is
arguably the major barrier that hinders the deployment of electronic health
record (EHR) systems which are considered more efficient, less error-prone, and
of higher availability compared to traditional paper record systems. Patients
are unwilling to accept the EHR system unless their protected health
information (PHI) containing highly confidential data is guaranteed proper use
and disclosure, which cannot be easily achieved without patients’ control over
their own PHI. However, cautions must be taken to handle emergencies in which
the patient may be physically incompetent to retrieve the controlled PHI for
emergency treatment. In this paper, we propose a secure EHR system, HCPP (Healthcare system for Patient Privacy), based on cryptographic constructions and existing wireless network infrastructures, to provide privacy protection to patients under any circumstances while enabling timely PHI retrieval for life-saving treatment in emergency situations. Furthermore, our HCPP system
restricts PHI access to authorized (not arbitrary) physicians, who can be
traced and held accountable if the accessed PHI is found improperly disclosed.
Last but not least, HCPP leverages wireless network access to support efficient
and private storage/retrieval of PHI, which underlies a secure and feasible EHR
system.
CHAPTER 2
2.0 SYSTEM ANALYSIS
2.1 EXISTING SYSTEM:
In the existing system, data confidentiality is important, but guaranteeing it alone is not enough: in the honest-but-curious cloud server model, the frequent communication between a patient and a professional physician can lead the adversary to conclude, with high probability, that the patient is suffering from a specific disease. Unfortunately, the problem of how to protect both the patients' data confidentiality and identity privacy in the distributed m-healthcare cloud computing scenario under the malicious model has been left untouched.
Patients are unwilling to accept the EHR system unless their protected health information (PHI), which contains highly confidential data, is guaranteed proper use and disclosure; this cannot be easily achieved without patients' control over their own PHI. However, caution must be taken to handle emergencies in which the patient may be physically incompetent to retrieve the controlled PHI for emergency treatment. HCPP (Healthcare system for Patient Privacy) is a secure EHR system based on cryptographic constructions and existing wireless network infrastructures that provides privacy protection to patients under any circumstances while enabling timely PHI retrieval for life-saving treatment in emergency situations.
2.1.1 DISADVANTAGES:
Existing applications in the e-healthcare scenario can be realized through real-time, continuous vital monitoring to give immediate alerts of changes in patient status. However, the WBAN operates in environments with open access by various people, such as a hospital or medical organization, which also accommodates attackers. The open wireless channel makes the data prone to eavesdropping, modification, and injection. Many kinds of security threats exist, such as unauthenticated or unauthorized access, message disclosure, message modification, denial of service, node capture and compromised nodes, and routing attacks. Among these, two kinds of threats play the leading role: threats from device compromise and threats from network dynamics.
The problem of security is growing. In particular, the privacy of communication through the Internet is at risk of attack in a number of ways: online collecting, transmitting, and processing of personal data pose a severe threat to privacy. Once Internet-based services are used online, the lack of privacy in network communication becomes a matter of public concern. This problem is far more significant in the modern medical environment, as e-healthcare networks are implemented and developed. According to common standards, such a network links general practitioners, hospitals, and social centers at a national or international scale. While running the risk of leaking private data, such networks' privacy information faces great danger.
- Data confidentiality is low.
- Data redundancy is high.
- There is a violation in data security.
2.2 PROPOSED SYSTEM:
We present a new architecture of pseudonymization for protecting privacy in e-health (PIPE) that integrates pseudonymization of medical data, identity management, and obfuscation of metadata with anonymous authentication, to prevent disclosure attacks and statistical analysis, and we suggest a secure mechanism guaranteeing anonymity and privacy in both the transfer of personal health information and its storage at a central m-healthcare cloud server.
We propose an anonymous authentication of membership in dynamic groups. However, since anonymous authentication schemes of this kind are established on a public key infrastructure (PKI), the need for an online certificate authority (CA) and one public key encryption for each symmetric key k used for data encryption at the portal of authorized physicians makes the overhead of the construction grow linearly with the size of the group. Furthermore, the anonymity level depends on the size of the anonymity set, making anonymous authentication impractical in settings where the patients are sparsely distributed.
In this paper, the security and anonymity level of our proposed construction is significantly enhanced by associating it with the underlying Gap Bilinear Diffie-Hellman (GBDH) problem and with the number of patients' attributes, to deal with the privacy leakage in scenarios where patients are sparsely distributed. Moreover, without knowing which physician in the healthcare provider is professional in treating his illness, the best strategy for the patient is to encrypt his own PHI under a specified access policy rather than assign each physician a secret key. As a result, the authorized physicians whose attribute sets satisfy the access policy can recover the PHI, and access control management also becomes more efficient.
2.2.1 ADVANTAGES:
Prior work achieves patient-centric and fine-grained data access control using ABE to secure personal health records in cloud computing, but without privacy-preserving authentication. For comparison, achieving the same functions as PSMPA could be regarded as combining ABE and DVS; the computational complexity of PSMPA remains constant regardless of the number of directly authorized physicians, at nearly half that of the combined ABE and DVS construction supporting flexible predicates.
The communication cost of PSMPA also remains constant, at almost half that of the combined construction, and is independent of the number of attributes d. Although the storage overhead of PSMPA is slightly more than that of the combined construction, it is independent of the number of directly authorized physicians and performs significantly better than traditional DVS, whose computational, communication, and storage overheads all increase linearly with the number of directly authorized physicians.
- M-healthcare system is fully controlled and secured with encryption standards.
- There is no data loss and data redundancy.
- System provides full protection for patient’s data and their attributes.
2.3 HARDWARE & SOFTWARE REQUIREMENTS:
2.3.1 HARDWARE REQUIREMENT:
- Processor – Pentium IV
- Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Key Board – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA
2.3.2 SOFTWARE REQUIREMENTS:
- Operating System : Windows XP or Win7
- Front End : Microsoft Visual Studio .NET 2008
- Script : C# Script
- Back End : MS-SQL Server 2005
- Document : MS-Office 2007
CHAPTER 3
3.0 SYSTEM DESIGN:
Data Flow Diagram / Use Case Diagram / Flow Diagram:
- The DFD is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on these data, and the output data generated by the system.
- The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components. These components are the system process, the data used by the process, an external entity that interacts with the system and the information flows in the system.
- DFD shows how the information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.
- A DFD may be used to represent a system at any level of abstraction, and may be partitioned into levels that represent increasing information flow and functional detail.
NOTATION:
SOURCE OR DESTINATION OF DATA:
External sources or destinations, which may be people, organizations, or other entities.
DATA SOURCE:
Here the data referenced by a process is stored and retrieved.
PROCESS:
People, procedures, or devices that produce data; the physical component is not identified.
DATA FLOW:
Data moves in a specific direction from an origin to a destination. The data flow is a “packet” of data.
There are several common modeling rules when creating DFDs:
- All processes must have at least one data flow in and one data flow out.
- All processes should modify the incoming data, producing new forms of outgoing data.
- Each data store must be involved with at least one data flow.
- Each external entity must be involved with at least one data flow.
- A data flow must be attached to at least one process.
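The modeling rules above are mechanical enough to check automatically. The sketch below is a minimal, illustrative validator for them; the element and flow names in the example are hypothetical and do not come from this project's diagrams.

```python
# A minimal checker for the DFD modeling rules listed above.

def check_dfd(processes, stores, entities, flows):
    """flows is a list of (source, target) pairs; returns rule violations."""
    errors = []
    sources = {s for s, _ in flows}
    targets = {t for _, t in flows}
    endpoints = sources | targets
    for p in processes:  # every process needs a flow in and a flow out
        if p not in sources:
            errors.append(f"process '{p}' has no outgoing flow")
        if p not in targets:
            errors.append(f"process '{p}' has no incoming flow")
    for d in stores:     # every data store must touch at least one flow
        if d not in endpoints:
            errors.append(f"data store '{d}' has no flow")
    for e in entities:   # every external entity must touch at least one flow
        if e not in endpoints:
            errors.append(f"external entity '{e}' has no flow")
    for s, t in flows:   # every flow must be attached to a process
        if s not in processes and t not in processes:
            errors.append(f"flow {s}->{t} is not attached to a process")
    return errors

flows = [("Patient", "Register"), ("Register", "PatientDB")]
print(check_dfd({"Register"}, {"PatientDB"}, {"Patient"}, flows))  # []
```

An empty result means the diagram satisfies all five rules; any violation is reported as a readable message.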
3.1 ARCHITECTURE DIAGRAM
3.2 DATAFLOW DIAGRAM:
UML DIAGRAMS:
3.3 USE CASE DIAGRAM:
3.4 CLASS DIAGRAM:
3.5 SEQUENCE DIAGRAM:
3.6 ACTIVITY DIAGRAM:
CHAPTER 4
4.0 IMPLEMENTATION:
In our implementation, we choose the MIRACL library for simulating cryptographic operations using Microsoft C/C++ compilers, with parameters chosen to achieve security comparable to 1,024-bit RSA according to the standards of the Pairing-Based Crypto Library. We compare against a patient-centric and fine-grained data access control scheme using ABE to secure personal health records in cloud computing [30] without privacy-preserving authentication. To achieve the same functions as PSMPA, that scheme can be considered as a combination of ABE and DVS; the computational complexity of PSMPA remains constant regardless of the number of directly authorized physicians, at nearly half that of the combined ABE and DVS construction supporting flexible predicates. Fig. 5 illustrates that the communication cost of PSMPA also remains constant, at almost half that of the combined construction, and is independent of the number of attributes d in vD. Fig. 6 shows that though the storage overhead of PSMPA is slightly more than that of the combined construction, it is independent of the number of directly authorized physicians and performs significantly better than traditional DVS, whose computational, communication, and storage overheads all increase linearly with the number of directly authorized physicians. The computational and communication overheads of the combined construction decrease slightly faster than PSMPA's as the threshold k increases; however, even when k reaches its maximum value of d, the overheads are still much higher than PSMPA's. In the comparison between our scheme and anonymous authentication based on PKI, the storage, communication, and computational overheads with respect to N and k are identical to DVS, since to realize the same identity privacy, all the constructions assign a pair of public and private keys to each directly authorized physician, and the number of signature operations is also linear in the number of physicians, independent of the threshold k.
The simulation results show our PSMPA better adapts to the distributed m-healthcare cloud computing system than previous schemes, especially for enhancing the energy constrained mobile devices (the data sink’s) efficiency.
4.1 ALGORITHM
Attribute-Based Designated Verifier Signature Scheme: We propose a patient self-controllable and multi-level privacy-preserving cooperative authentication scheme based on ADVS to realize three levels of security and privacy requirements in the distributed m-healthcare cloud computing system. It mainly consists of the following five algorithms: Setup, Key Extraction, Sign, Verify, and Transcript Simulation Generation. Denote the universe of attributes as U.
4.2 MODULES:
E-HEALTHCARE SYSTEM FRAMEWORK:
AUTHORIZED ACCESSIBLE PRIVACY MODEL:
SECURITY VERIFICATION:
PERFORMANCE EVALUATION:
4.3 MODULE DESCRIPTION:
E-healthcare System Framework:
The e-healthcare system consists of three components: body area networks (BANs), wireless transmission networks, and the healthcare providers equipped with their own cloud servers. The patient's personal health information is securely transmitted to the healthcare provider for the authorized physicians to access and perform medical treatment. The framework illustrates the unique characteristics of distributed m-healthcare cloud computing systems, where all the personal health information can be shared among patients suffering from the same disease for mutual support, or among the authorized physicians in distributed healthcare providers and medical research institutions for medical consultation.
Authorized accessible privacy model:
Multi-level privacy-preserving cooperative authentication is established to allow the patients to authorize corresponding privileges to different kinds of physicians located in distributed healthcare providers by setting an access tree supporting flexible threshold predicates. We propose a novel authorized accessible privacy model for distributed m-healthcare cloud computing systems which consists of two components: an attribute-based designated verifier signature scheme (ADVS) and the corresponding adversary model.
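An access tree with threshold predicates can be evaluated recursively. The sketch below shows only that evaluation logic; the node encoding and the example attribute names are hypothetical, chosen to mirror the local/remote physician distinction described earlier.

```python
# Sketch of threshold access-tree evaluation: each interior node
# (k, children) is satisfied when at least k child subtrees are
# satisfied; each leaf is an attribute the verifier must hold.

def satisfies(node, attrs):
    if isinstance(node, str):          # leaf: a single attribute
        return node in attrs
    k, children = node                 # interior: threshold gate
    return sum(satisfies(c, attrs) for c in children) >= k

# Policy: 'local' AND ('cardiology' OR 'internal-medicine'),
# expressed as a 2-of-2 gate over a leaf and a 1-of-2 gate.
tree = (2, ["local", (1, ["cardiology", "internal-medicine"])])
print(satisfies(tree, {"local", "cardiology"}))   # True
print(satisfies(tree, {"remote", "cardiology"}))  # False
```

AND and OR are just the special cases k = number of children and k = 1, so one threshold gate type expresses the whole family of flexible predicates.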
Security Verification:
The security and anonymity level of our proposed construction is significantly enhanced by associating it with the underlying Gap Bilinear Diffie-Hellman (GBDH) problem and with the number of patients' attributes, to deal with the privacy leakage in scenarios where patients are sparsely distributed. More significantly, without knowing which physician in the healthcare provider is professional in treating his illness, the best strategy for the patient is to encrypt his own PHI under a specified access policy rather than assign each physician a secret key. As a result, the authorized physicians whose attribute sets satisfy the access policy can recover the PHI, and access control management also becomes more efficient.
Performance Evaluation:
We evaluate the efficiency of PSMPA in terms of storage overhead, computational complexity, and communication cost, compared against a patient-centric and fine-grained data access control scheme that uses ABE to secure personal health records in cloud computing without privacy-preserving authentication. To achieve the same security, our construction performs more efficiently than the traditional designated verifier signature applied to all the directly authorized physicians, whose overheads are linear in the number of directly authorized physicians.
CHAPTER 5
5.0 SYSTEM STUDY:
5.1 FEASIBILITY STUDY:
The feasibility of the project is analyzed in this phase, and a business proposal is put forth with a very general plan for the project and some cost estimates. During system analysis, the feasibility study of the proposed system is carried out to ensure that the proposed system is not a burden to the company. For feasibility analysis, some understanding of the major requirements for the system is essential.
Three key considerations involved in the feasibility analysis are
- ECONOMICAL FEASIBILITY
- TECHNICAL FEASIBILITY
- SOCIAL FEASIBILITY
5.1.1 ECONOMICAL FEASIBILITY:
This study is carried out to check the economic impact the system will have on the organization. The amount of funds that the company can pour into the research and development of the system is limited, so the expenditures must be justified. The developed system is well within the budget, and this was achieved because most of the technologies used are freely available; only the customized products had to be purchased.
5.1.2 TECHNICAL FEASIBILITY:
This study is carried out to check the technical feasibility, that is, the technical requirements of the system. Any system developed must not place a high demand on the available technical resources, since this would lead to high demands being placed on the client. The developed system must have modest requirements, as only minimal or no changes are required for implementing this system.
5.1.3 SOCIAL FEASIBILITY:
This aspect of the study checks the level of acceptance of the system by the user. This includes the process of training the user to use the system efficiently. The user must not feel threatened by the system, but instead must accept it as a necessity. The level of acceptance by the users depends on the methods employed to educate the user about the system and to make him familiar with it. His level of confidence must be raised so that he is also able to offer constructive criticism, which is welcomed, as he is the final user of the system.
5.2 SYSTEM TESTING:
Testing is a process of checking whether the developed system works according to the original objectives and requirements. It is a set of activities that can be planned in advance and conducted systematically. Testing is vital to the success of the system. System testing makes the logical assumption that if all the parts of the system are correct, the goal will be successfully achieved. Inadequate testing, or no testing at all, leads to errors that may not appear until many months later. This creates two problems: the time lag between the cause and the appearance of the problem, and the effect of system errors on the files and records within the system. A small system error can conceivably explode into a much larger problem. Effective testing early in the process translates directly into long-term cost savings from a reduced number of errors. Another reason for system testing is its utility as a user-oriented vehicle before implementation. The best programs are worthless if they do not produce correct outputs.
5.2.1 UNIT TESTING:
A program represents the
logical elements of a system. For a program to run satisfactorily, it must
compile and test data correctly and tie in properly with other programs.
Achieving an error free program is the responsibility of the programmer.
Program testing checks for two types of errors: syntax and logical. A syntax error is a program statement that violates one or more rules of the language in which it is written; an improperly defined field dimension or omitted keywords are common syntax errors, and these are shown through error messages generated by the computer. For logic errors, the programmer must examine the output carefully.
UNIT TESTING:
Description | Expected result |
Test for application window properties. | All the properties of the windows are to be properly aligned and displayed. |
Test for mouse operations. | All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions. |
5.2.2 FUNCTIONAL TESTING:
Functional testing of an application is used to prove that the application delivers correct results, using enough inputs to give an adequate level of confidence that it will work correctly for all sets of inputs. The functional testing will need to prove that the application works for each client type and that the personalization functions work correctly. When a program is tested, the actual output is compared with the expected output. When there is a discrepancy, the sequence of instructions must be traced to determine the problem.
The process is facilitated by breaking the program into self-contained
portions, each of which can be checked at certain key points. The idea is to compare
program values against desk-calculated values to isolate the problems.
FUNCTIONAL TESTING:
Description | Expected result |
Test for all modules. | All peers should communicate in the group. |
Test for various peer in a distributed network framework as it display all users available in the group. | The result after execution should give the accurate result. |
5.2.3 NON-FUNCTIONAL TESTING:
Non-functional software testing encompasses a rich spectrum of testing strategies, describing the expected results for every test case, and uses symbolic analysis techniques. This testing is used to check that an application will work in the operational environment. Non-functional testing includes:
- Load testing
- Performance testing
- Usability testing
- Reliability testing
- Security testing
5.2.4 LOAD TESTING:
An important tool for implementing system tests is a load generator. A load generator is essential for testing quality requirements such as performance and stress. The load can be real, that is, the system can be put under test with real usage by having actual users connected to it; they will generate the test input data for the system test.
LOAD TESTING:
Description | Expected result |
It is necessary to ascertain that the application behaves correctly under loads when ‘Server busy’ response is received. | Should designate another active node as a Server. |
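A toy load generator of the kind described above can be sketched as follows: a pool of worker threads fires many concurrent requests at a handler and the total count and elapsed time are measured. The handler here is a hypothetical stand-in for the real system under test.

```python
# Minimal illustrative load generator: n_workers threads drain a
# shared queue of request ids through a handler.
import threading
import time

def handler(request_id):
    time.sleep(0.001)  # simulate the service time of one request
    return f"ok:{request_id}"

def generate_load(n_requests, n_workers=8):
    results, lock = [], threading.Lock()
    ids = list(range(n_requests))

    def worker():
        while True:
            with lock:                 # pop the next request id safely
                if not ids:
                    return
                rid = ids.pop()
            response = handler(rid)
            with lock:
                results.append(response)

    start = time.time()
    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(results), time.time() - start

done, elapsed = generate_load(200)
print(done)  # 200
```

Raising n_workers while watching elapsed time gives a crude stress profile; a real tool would also record per-request latency percentiles.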
5.2.5 PERFORMANCE TESTING:
Performance tests are
utilized in order to determine the widely defined performance of the software
system such as execution time associated with various parts of the code,
response time and device utilization. The intent of this testing is to identify
weak points of the software system and quantify its shortcomings.
PERFORMANCE TESTING:
Description | Expected result |
This is required to assure that an application performs adequately, having the capability to handle many peers, delivering its results in the expected time and using an acceptable level of resources; it is an aspect of operational management. | Should handle large input values, and produce an accurate result in the expected time. |
5.2.6 RELIABILITY TESTING:
Software reliability is the ability of a system or component to perform its required functions under stated conditions for a specified period of time, and it is ensured in this testing. Reliability can be expressed as the ability of the software to reveal defects under testing conditions, according to the specified requirements. It is the probability that a software system will operate without failure under given conditions for a given time interval, and it focuses on the behavior of the software element. It forms a part of software quality control.
RELIABILITY TESTING:
Description | Expected result |
This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in providing the application. | In case of failure of the server, an alternate server should take over the job. |
5.2.7 SECURITY TESTING:
Security testing evaluates
system characteristics that relate to the availability, integrity and
confidentiality of the system data and services. Users/Clients should be
encouraged to make sure their security needs are very clearly known at
requirements time, so that the security issues can be addressed by the
designers and testers.
SECURITY TESTING:
Description | Expected result |
Checking that the user identification is authenticated. | In case of failure, it should not be connected in the framework. |
Check whether group keys in a tree are shared by all peers. | The peers should know group key in the same group. |
5.2.8 WHITE BOX TESTING:
White box testing, sometimes called glass-box testing, is a test-case design method that uses the control structure of the procedural design to derive test cases. Using the white box testing method, the software engineer can derive test cases. White box testing focuses on the inner structure of the software to be tested.
WHITE BOX TESTING:
Description | Expected result |
Exercise all logical decisions on their true and false sides. | All the logical decisions must be valid. |
Execute all loops at their boundaries and within their operational bounds. | All the loops must be finite. |
Exercise internal data structures to ensure their validity. | All the data structures must be valid. |
5.2.9 BLACK BOX TESTING:
Black box testing, also called behavioral testing, focuses on the functional requirements of the software. That is, black box testing enables the software engineer to derive sets of input conditions that will fully exercise all functional requirements for a program. Black box testing is not an alternative to white box techniques; rather, it is a complementary approach that is likely to uncover a different class of errors than white box methods. Black box testing attempts to find errors by focusing on the inputs, outputs, and principal functions of a software module. The starting point of black box testing is either a specification or the code. The contents of the box are hidden, and the stimulated software should produce the desired results.
BLACK BOX TESTING:
Description | Expected result |
To check for incorrect or missing functions. | All the functions must be valid. |
To check for interface errors. | The entire interface must function normally. |
To check for errors in data structures or external database access. | Database update and retrieval must be done correctly. |
To check for initialization and termination errors. | All the functions and data structures must be initialized properly and terminated normally. |
All the above system testing strategies are carried out during development, as documentation and institutionalization of the proposed goals and related policies is essential.
CHAPTER 7
7.0 SOFTWARE SPECIFICATION:
7.1 FEATURES OF .NET:
Microsoft .NET is a set of Microsoft software technologies for rapidly building and integrating XML Web services, Microsoft Windows-based applications, and Web solutions. The .NET Framework is a language-neutral platform for writing programs that can easily and securely interoperate. There's no language barrier with .NET: there are numerous languages available to the developer, including Managed C++, C#, Visual Basic, and JScript.
The .NET framework provides the foundation for components to interact seamlessly, whether locally or remotely on different platforms. It standardizes common data types and communications protocols so that components created in different languages can easily interoperate.
“.NET” is
also the collective name given to various software components built upon the
.NET platform. These will be both products (Visual Studio.NET and Windows.NET
Server, for instance) and services (like Passport, .NET My Services, and so
on).
7.2 THE .NET FRAMEWORK
The .NET Framework has two main parts:
1. The Common Language Runtime (CLR).
2. A hierarchical set of class libraries.
The CLR is described as the “execution engine” of .NET. It provides the environment within which programs run. The most important features are
- Conversion from a low-level assembler-style language, called Intermediate Language (IL), into code native to the platform being executed on.
- Memory management, notably including garbage collection.
- Checking and enforcing security restrictions on the running code.
- Loading and executing programs, with version control and other such features.
The following features of the .NET Framework are also worth describing:
Managed Code
Managed code is code that targets .NET and contains certain extra information, called "metadata", to describe itself. While both managed and unmanaged code can run in the runtime, only managed code contains the information that allows the CLR to guarantee, for instance, safe execution and interoperability.
Managed Data
With managed code comes managed data. The CLR provides memory allocation and deallocation facilities, and garbage collection. Some .NET languages use managed data by default, such as C#, Visual Basic.NET and JScript.NET, whereas others, namely C++, do not. Targeting the CLR can, depending on the language you’re using, impose certain constraints on the features available. As with managed and unmanaged code, one can have both managed and unmanaged data in .NET applications: data that doesn’t get garbage collected but instead is looked after by unmanaged code.
Common Type System
The CLR uses the Common Type System (CTS) to strictly enforce type safety. This ensures that all classes are compatible with each other by describing types in a common way. The CTS defines how types work within the runtime, which enables types in one language to interoperate with types in another language, including cross-language exception handling. As well as ensuring that types are only used in appropriate ways, the runtime also ensures that code doesn’t attempt to access memory that hasn’t been allocated to it.
Common Language Specification
The CLR provides built-in support for language interoperability. To ensure that you can develop managed code that can be fully used by developers using any programming language, a set of language features and rules for using them called the Common Language Specification (CLS) has been defined. Components that follow these rules and expose only CLS features are considered CLS-compliant.
7.3 THE CLASS LIBRARY
.NET provides a single-rooted hierarchy of classes, containing over 7000 types. The root namespace is called System; it contains basic types like Byte, Double, Boolean, and String, as well as Object. All objects derive from System.Object. As well as objects, there are value types. Value types can be allocated on the stack, which can provide useful flexibility. There are also efficient means of converting value types to object types if and when necessary.
The set of classes is pretty comprehensive, providing collections, file, screen, and network I/O, threading, and so on, as well as XML and database connectivity.
The class library is subdivided into a number of sets (or namespaces), each providing distinct areas of functionality, with dependencies between the namespaces kept to a minimum.
7.4 LANGUAGES SUPPORTED BY .NET
The multi-language capability of the .NET Framework and Visual Studio .NET enables developers to use their existing programming skills to build all types of applications and XML Web services. The .NET framework supports new versions of Microsoft’s old favorites Visual Basic and C++ (as VB.NET and Managed C++), but there are also a number of new additions to the family.
Visual Basic .NET has been updated to include many new and improved language features that make it a powerful object-oriented programming language. These features include inheritance, interfaces, and overloading, among others. Visual Basic also now supports structured exception handling, custom attributes and also supports multi-threading.
Visual Basic .NET is also CLS compliant, which means that any CLS-compliant language can use the classes, objects, and components you create in Visual Basic .NET.
Managed Extensions for C++ and attributed programming are just some of the enhancements made to the C++ language. Managed Extensions simplify the task of migrating existing C++ applications to the new .NET Framework.
C# is Microsoft’s new language. It’s a C-style language that is essentially “C++ for Rapid Application Development”. Unlike other languages, its specification is just the grammar of the language. It has no standard library of its own, and instead has been designed with the intention of using the .NET libraries as its own.
Microsoft Visual J# .NET provides the easiest transition for Java-language developers into the world of XML Web Services and dramatically improves the interoperability of Java-language programs with existing software written in a variety of other programming languages.
Active State has created Visual Perl and Visual Python, which enable .NET-aware applications to be built in either Perl or Python. Both products can be integrated into the Visual Studio .NET environment. Visual Perl includes support for Active State’s Perl Dev Kit.
Other languages for which .NET compilers are available include
- FORTRAN
- COBOL
- Eiffel
ASP.NET: XML Web Services | Windows Forms |
Base Class Libraries |
Common Language Runtime |
Operating System |
Fig 1. The .NET Framework stack
C#.NET is also compliant with the CLS (Common Language Specification) and supports structured exception handling. The CLS is a set of rules and constructs that are supported by the CLR (Common Language Runtime). The CLR is the runtime environment provided by the .NET Framework; it manages the execution of code and also makes the development process easier by providing services.
C#.NET is a CLS-compliant language. Any objects, classes, or components created in C#.NET can be used in any other CLS-compliant language. In addition, we can use objects, classes, and components created in other CLS-compliant languages in C#.NET. The use of the CLS ensures complete interoperability among applications, regardless of the languages used to create them.
CONSTRUCTORS AND DESTRUCTORS:
Constructors are used to initialize objects, whereas destructors are used to destroy them; in other words, destructors release the resources allocated to the object. In C#.NET the Finalize procedure is available. The Finalize procedure is used to complete the tasks that must be performed when an object is destroyed. It is called automatically when an object is destroyed and can be called only from the class it belongs to or from derived classes.
GARBAGE COLLECTION
Garbage Collection is another new feature in C#.NET. The .NET Framework monitors allocated resources, such as objects and variables. In addition, the .NET Framework automatically releases memory for reuse by destroying objects that are no longer in use.
In C#.NET, the garbage collector checks for the objects that are not currently in use by applications. When the garbage collector comes across an object that is marked for garbage collection, it releases the memory occupied by the object.
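As a cross-language illustration (a minimal Python sketch rather than C#.NET; the Resource class is a made-up stand-in), a weak reference makes this automatic reclamation visible:

```python
import gc
import weakref

class Resource:
    """Stand-in for any object the runtime manages."""
    pass

obj = Resource()
ref = weakref.ref(obj)   # track the object without keeping it alive
del obj                  # drop the last strong reference
gc.collect()             # ask the collector to reclaim unreachable objects
# ref() is now None: the unused object was destroyed and its memory released
```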
OVERLOADING
Overloading is another feature in C#. Overloading enables us
to define multiple procedures with the same name, where each procedure has a
different set of arguments. Besides using overloading for procedures, we can
use it for constructors and properties in a class.
MULTITHREADING:
C#.NET also supports multithreading. An application that supports multithreading can handle multiple tasks simultaneously. We can use multithreading to decrease the time taken by an application to respond to user interaction.
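As an illustrative sketch (in Python rather than C#.NET, with made-up task names), two threads handling independent tasks at the same time look like this:

```python
import threading
import time

results = []

def task(name: str, delay: float) -> None:
    time.sleep(delay)        # simulate a slow operation
    results.append(name)

# Both tasks run concurrently instead of one after the other.
threads = [threading.Thread(target=task, args=("download", 0.05)),
           threading.Thread(target=task, args=("render", 0.01))]
for t in threads:
    t.start()
for t in threads:
    t.join()                 # wait until both tasks have finished
# results now holds both task names
```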
STRUCTURED EXCEPTION HANDLING
C#.NET supports structured exception handling, which enables us to detect and handle errors at runtime. In C#.NET, we use Try…Catch…Finally statements to create exception handlers. Using Try…Catch…Finally statements, we can create robust and effective exception handlers to improve the reliability of our applications.
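The Try…Catch…Finally pattern maps directly onto most languages; a minimal sketch in Python (try/except/finally, with a hypothetical safe_divide helper) shows the same structure:

```python
log = []

def safe_divide(a: float, b: float) -> float:
    try:                              # Try: the code that may fail
        return a / b
    except ZeroDivisionError:         # Catch: handle the runtime error
        log.append("division by zero")
        return float("inf")
    finally:                          # Finally: always runs, error or not
        log.append("cleanup")

safe_divide(10, 2)   # normal path: only "cleanup" is logged
safe_divide(1, 0)    # error path: the handler runs, then "cleanup" again
```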
7.5 THE .NET FRAMEWORK
The .NET Framework is a new computing platform that simplifies application development in the highly distributed environment of the Internet.
OBJECTIVES OF .NET FRAMEWORK
1. To provide a consistent object-oriented programming environment whether object code is stored and executed locally, executed locally but Internet-distributed, or executed remotely.
2. To provide a code-execution environment that minimizes software deployment and versioning conflicts and guarantees safe execution of code.
3. To eliminate the performance problems of scripted or interpreted environments.
There are different types of applications, such as Windows-based applications and Web-based applications.
7.6 FEATURES OF SQL-SERVER
The OLAP Services feature available in SQL Server version 7.0 is now called SQL Server 2000 Analysis Services. The term OLAP Services has been replaced with the term Analysis Services. Analysis Services also includes a new data mining component. The Repository component available in SQL Server version 7.0 is now called Microsoft SQL Server 2000 Meta Data Services. References to the component now use the term Meta Data Services. The term repository is used only in reference to the repository engine within Meta Data Services.
A SQL Server database consists of several types of objects, including:
1. TABLE
2. QUERY
3. FORM
4. REPORT
5. MACRO
7.7 TABLE:
A table is a collection of data about a specific topic.
VIEWS OF TABLE:
We can work with a table in two views:
1. Design View
2. Datasheet View
Design View
To build or modify the structure of a table, we work in the table's Design view. Here we can specify what kind of data the table will hold.
Datasheet View
To add, edit, or analyze the data itself, we work in the table's Datasheet view.
QUERY:
A query is a question that is asked of the data. Access gathers the data that answers the question from one or more tables. The data that makes up the answer is either a dynaset (if you can edit it) or a snapshot (which cannot be edited). Each time we run a query, we get the latest information in the dynaset. Access either displays the dynaset or snapshot for us to view, or performs an action on it, such as deleting or updating.
CHAPTER 7
7.0 APPENDIX
7.1 SAMPLE SCREEN SHOTS:
7.2 SAMPLE SOURCE CODE:
CHAPTER 8
8.1 CONCLUSION AND FUTURE ENHANCEMENT:
In this paper, a novel authorized accessible privacy model and a patient self-controllable multi-level privacy-preserving cooperative authentication scheme (PSMPA), realizing three different levels of security and privacy requirements in the distributed m-healthcare cloud computing system, are proposed. The formal security proof and efficiency evaluations illustrate that PSMPA can resist various kinds of malicious attacks and far outperforms previous schemes in terms of storage, computational, and communication overhead.
Privacy-Preserving Detection of Sensitive Data Exposure
Statistics from security firms, research institutions, and government organizations show that the number of data-leak instances has grown rapidly in recent years. Among various data-leak cases, human mistakes are one of the main causes of data loss. Solutions exist that detect inadvertent sensitive data leaks caused by human mistakes and provide alerts for organizations. A common approach is to screen content in storage and transmission for exposed sensitive information. Such an approach usually requires the detection operation to be conducted in secrecy. However, this secrecy requirement is challenging to satisfy in practice, as detection servers may be compromised or outsourced.
In this paper, we present a privacy-preserving data-leak detection (DLD) solution to solve this issue, in which a special set of sensitive data digests is used in detection. The advantage of our method is that it enables the data owner to safely delegate the detection operation to a semihonest provider without revealing the sensitive data to the provider. We describe how Internet service providers can offer their customers DLD as an add-on service with strong privacy guarantees. The evaluation results show that our method can support accurate detection with a very small number of false alarms under various data-leak scenarios.
1.2 INTRODUCTION
According to a report from Risk Based Security (RBS), the number of leaked sensitive data records has increased dramatically during the last few years, i.e., from 412 million in 2012 to 822 million in 2013. Deliberately planned attacks, inadvertent leaks (e.g., forwarding confidential emails to unclassified email accounts), and human mistakes (e.g., assigning the wrong privilege) lead to most of the data-leak incidents. Detecting and preventing data leaks requires a set of complementary solutions, which may include data-leak detection, data confinement, stealthy malware detection and policy enforcement.
Network data-leak detection (DLD) typically performs deep packet inspection (DPI) and searches for any occurrences of sensitive data patterns. DPI is a technique to analyze payloads of IP/TCP packets for inspecting application layer data, e.g., HTTP header/content. Alerts are triggered when the amount of sensitive data found in traffic passes a threshold. The detection system can be deployed on a router or integrated into existing network intrusion detection systems (NIDS). Straightforward realizations of data-leak detection require the plaintext sensitive data.
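A toy sketch of this threshold-based inspection follows; the patterns, payloads, and threshold below are invented for illustration and are not from the paper:

```python
# Hypothetical sensitive patterns and alert threshold.
SENSITIVE_PATTERNS = [b"123-45-6789", b"CONFIDENTIAL"]
ALERT_THRESHOLD = 2

def inspect(payloads) -> bool:
    """DPI sketch: count sensitive-pattern occurrences across packet
    payloads and raise an alert once the count reaches the threshold."""
    hits = sum(p.count(pat) for p in payloads for pat in SENSITIVE_PATTERNS)
    return hits >= ALERT_THRESHOLD

traffic = [b"GET /index.html", b"ssn=123-45-6789", b"CONFIDENTIAL report"]
alert = inspect(traffic)   # two occurrences found, so an alert is raised
```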
However, this requirement is undesirable, as it may threaten the confidentiality of the sensitive information. If a detection system is compromised, then it may expose the plaintext sensitive data (in memory). In addition, the data owner may need to outsource the data-leak detection to providers, but may be unwilling to reveal the plaintext sensitive data to them. Therefore, one needs new data-leak detection solutions that allow the providers to scan content for leaks without learning the sensitive information.
In this paper, we propose a data-leak detection solution which can be outsourced and be deployed in a semihonest detection environment. We design, implement, and evaluate our fuzzy fingerprint technique that enhances data privacy during data-leak detection operations. Our approach is based on a fast and practical one-way computation on the sensitive data (SSN records, classified documents, sensitive emails, etc.). It enables the data owner to securely delegate the content-inspection task to DLD providers without exposing the sensitive data. Using our detection method, the DLD provider, who is modeled as an honest-but-curious (aka semi-honest) adversary, can only gain limited knowledge about the sensitive data from either the released digests, or the content being inspected. Using our techniques, an Internet service provider (ISP) can perform detection on its customers’ traffic securely and provide data-leak detection as an add-on service for its customers. In another scenario, individuals can mark their own sensitive data and ask the administrator of their local network to detect data leaks for them.
In our detection procedure, the data owner computes a special set of digests or fingerprints from the sensitive data and then discloses only a small amount of them to the DLD provider. The DLD provider computes fingerprints from network traffic and identifies potential leaks in them. To prevent the DLD provider from gathering exact knowledge about the sensitive data, the collection of potential leaks is composed of real leaks and noises. It is the data owner, who post-processes the potential leaks sent back by the DLD provider and determines whether there is any real data leak.
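The round trip above can be sketched as follows; note that a plain SHA-256 hash stands in for the paper's fuzzy fingerprints, and the sample data and disclosure size are illustrative assumptions:

```python
import hashlib
import random

def digest(chunk: bytes) -> str:
    """Stand-in fingerprint: a plain SHA-256 digest of one data chunk."""
    return hashlib.sha256(chunk).hexdigest()

# 1) Data owner: compute digests and disclose only a sample of them.
sensitive = [b"ssn:123-45-6789", b"project-x-report", b"board-minutes"]
all_digests = {digest(s) for s in sensitive}
released = set(random.sample(sorted(all_digests), 2))   # partial disclosure

# 2) DLD provider: fingerprint the traffic, report candidate leaks only.
traffic = [b"hello world", b"ssn:123-45-6789", b"weather update"]
candidates = [t for t in traffic if digest(t) in released]

# 3) Data owner: post-process the candidates to confirm any real leak.
real_leaks = [c for c in candidates if c in sensitive]
```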
Our contributions are summarized as follows.
1) We describe a privacy-preserving data-leak detection model for preventing inadvertent data leak in network traffic. Our model supports detection operation delegation and ISPs can provide data-leak detection as an add-on service to their customers using our model. We design, implement, and evaluate an efficient technique, fuzzy fingerprint, for privacy-preserving data-leak detection. Fuzzy fingerprints are special sensitive data digests prepared by the data owner for release to the DLD provider.
2) We implement our detection system and perform extensive experimental evaluation on a 2.6 GB Enron dataset, the Internet surfing traffic of 20 users, and 5 simulated real-world data-leak scenarios to measure its privacy guarantee, detection rate, and efficiency. Our results indicate high accuracy achieved by our underlying scheme with a very low false-positive rate. Our results also show that the detection accuracy does not degrade much when only partial (sampled) sensitive-data digests are used. In addition, we give an empirical analysis of our fuzzification as well as of the fairness of fingerprint partial disclosure.
1.3 LITERATURE SURVEY
PRIVACY-AWARE COLLABORATIVE SPAM FILTERING
AUTHORS: K. Li, Z. Zhong, and L. Ramaswamy
PUBLISH: IEEE Trans. Parallel Distrib. Syst., vol. 20, no. 5, pp. 725–739, May 2009.
EXPLANATION:
While the concept of collaboration provides a natural defense against massive spam e-mails directed at large numbers of recipients, designing effective collaborative anti-spam systems raises several important research challenges. First and foremost, since e-mails may contain confidential information, any collaborative anti-spam approach has to guarantee strong privacy protection to the participating entities. Second, the continuously evolving nature of spam demands that the collaborative techniques be resilient to various kinds of camouflage attacks. Third, the collaboration has to be lightweight, efficient, and scalable. Toward addressing these challenges, this paper presents ALPACAS, a privacy-aware framework for collaborative spam filtering. In designing the ALPACAS framework, we make two unique contributions. The first is a feature-preserving message transformation technique that is highly resilient against the latest kinds of spam attacks. The second is a privacy-preserving protocol that provides enhanced privacy guarantees to the participating entities. Our experimental results, conducted on a real e-mail data set, show that the proposed framework provides a 10-fold improvement in the false negative rate over the Bayesian-based Bogofilter when faced with one of the recent kinds of spam attacks. Further, privacy breaches are extremely rare. This demonstrates the strong privacy protection provided by the ALPACAS system.
DATA LEAK DETECTION AS A SERVICE: CHALLENGES AND SOLUTIONS
AUTHORS: X. Shu and D. Yao
PUBLISH: Proc. 8th Int. Conf. Secur. Privacy Commun. Netw., 2012, pp. 222–240
EXPLANATION:
We describe a network-based data-leak detection (DLD) technique, the main feature of which is that the detection does not require the data owner to reveal the content of the sensitive data. Instead, only a small amount of specialized digests is needed. Our technique, referred to as the fuzzy fingerprint, can be used to detect accidental data leaks due to human errors or application flaws. The privacy-preserving feature of our algorithms minimizes the exposure of sensitive data and enables the data owner to safely delegate the detection to others. We describe how cloud providers can offer their customers data-leak detection as an add-on service with strong privacy guarantees. We perform extensive experimental evaluation on the privacy, efficiency, accuracy, and noise tolerance of our techniques. Our evaluation results under various data-leak scenarios and setups show that our method can support accurate detection with a very small number of false alarms, even when the presentation of the data has been transformed. They also indicate that the detection accuracy does not degrade when partial digests are used. We further provide a quantifiable method to measure the privacy guarantee offered by our fuzzy fingerprint framework.
QUANTIFYING INFORMATION LEAKS IN OUTBOUND WEB TRAFFIC
AUTHORS: K. Borders and A. Prakash
PUBLISH: Proc. 30th IEEE Symp. Secur. Privacy, May 2009, pp. 129–140.
EXPLANATION:
As the Internet grows and network bandwidth continues to increase, administrators are faced with the task of keeping confidential information from leaving their networks. Today's network traffic is so voluminous that manual inspection would be unreasonably expensive. In response, researchers have created data loss prevention systems that check outgoing traffic for known confidential information. These systems stop naive adversaries from leaking data, but are fundamentally unable to identify encrypted or obfuscated information leaks. What remains is a high-capacity pipe for tunneling data to the Internet. We present an approach for quantifying the information-leak capacity in network traffic. Instead of trying to detect the presence of sensitive data (an impossible task in the general case), our goal is to measure and constrain its maximum volume. We take advantage of the insight that most network traffic is repeated or determined by external information, such as protocol specifications or messages sent by a server. By filtering this data, we can isolate and quantify the true information flowing from a computer. In this paper, we present measurement algorithms for the Hypertext Transfer Protocol (HTTP), the main protocol for Web browsing. When applied to real Web browsing traffic, the algorithms were able to discount 98.5% of measured bytes and effectively isolate information leaks.
CHAPTER 2
2.0 SYSTEM ANALYSIS
2.1 EXISTING SYSTEM:
- Existing approaches to detecting and preventing data leaks require a set of complementary solutions, which may include data-leak detection, data confinement, stealthy malware detection, and policy enforcement.
- Network data-leak detection (DLD) typically performs deep packet inspection (DPI) and searches for any occurrences of sensitive data patterns. DPI is a technique to analyze payloads of IP/TCP packets for inspecting application layer data, e.g., HTTP header/content.
- Alerts are triggered when the amount of sensitive data found in traffic passes a threshold. The detection system can be deployed on a router or integrated into existing network intrusion detection systems (NIDS).
- Straightforward realizations of data-leak detection require the plaintext sensitive data. However, this requirement is undesirable, as it may threaten the confidentiality of the sensitive information. If a detection system is compromised, then it may expose the plaintext sensitive data (in memory).
- In addition, the data owner may need to outsource the data-leak detection to providers, but may be unwilling to reveal the plaintext sensitive data to them. Therefore, one needs new data-leak detection solutions that allow the providers to scan content for leaks without learning the sensitive information.
2.1.1 DISADVANTAGES:
- As the Internet grows and network bandwidth continues to increase, administrators are faced with the task of keeping confidential information from leaving their networks. In response, researchers have created data loss prevention systems that check outgoing traffic for known confidential information.
- These systems stop naive adversaries from leaking data, but are fundamentally unable to identify encrypted or obfuscated information leaks. What remains is a high-capacity pipe for tunneling data to the Internet.
- The existing approach quantifies the information-leak capacity in network traffic: instead of trying to detect the presence of sensitive data (an impossible task in the general case), the goal is to measure and constrain its maximum volume.
- This approach takes advantage of the insight that most network traffic is repeated or determined by external information, such as protocol specifications or messages sent by a server; by filtering this data, it can only isolate and quantify the true information flowing from a computer, rather than detect specific leaks.
2.2 PROPOSED SYSTEM:
- We propose a data-leak detection solution which can be outsourced and be deployed in a semihonest detection environment. We design, implement, and evaluate our fuzzy fingerprint technique that enhances data privacy during data-leak detection operations.
- Our approach is based on a fast and practical one-way computation on the sensitive data (SSN records, classified documents, sensitive emails, etc.). It enables the data owner to securely delegate the content-inspection task to DLD providers without exposing the sensitive data.
- Using our detection method, the DLD provider, which is modeled as an honest-but-curious (a.k.a. semi-honest) adversary, can only gain limited knowledge about the sensitive data from either the released digests or the content being inspected. Using our techniques, an Internet service provider (ISP) can perform detection on its customers’ traffic securely and provide data-leak detection as an add-on service for its customers. In another scenario, individuals can mark their own sensitive data and ask the administrator of their local network to detect data leaks for them.
- In our detection procedure, the data owner computes a special set of digests or fingerprints from the sensitive data and then discloses only a small number of them to the DLD provider. The DLD provider computes fingerprints from network traffic and identifies potential leaks in them.
- To prevent the DLD provider from gathering exact knowledge about the sensitive data, the collection of potential leaks is composed of real leaks and noises. It is the data owner, who post-processes the potential leaks sent back by the DLD provider and determines whether there is any real data leak.
2.2.1 ADVANTAGES:
- We describe a privacy-preserving data-leak detection model for preventing inadvertent data leaks in network traffic. Our model supports detection operation delegation, and ISPs can provide data-leak detection as an add-on service to their customers using our model.
- We design, implement, and evaluate an efficient technique, fuzzy fingerprint, for privacy-preserving data-leak detection. Fuzzy fingerprints are special sensitive data digests prepared by the data owner for release to the DLD provider.
- We implement our detection system and perform extensive experimental evaluation on the Internet surfing traffic of 20 users, and also 5 simulated real-world data-leak scenarios, to measure its privacy guarantee, detection rate, and efficiency.
- Our results indicate the high accuracy achieved by our underlying scheme, with a very low false-positive rate. Our results also show that the detection accuracy does not degrade much when only partial (sampled) sensitive-data digests are used. In addition, we give an empirical analysis of our fuzzification as well as of the fairness of fingerprint partial disclosure.
2.3 HARDWARE & SOFTWARE REQUIREMENTS:
2.3.1 HARDWARE REQUIREMENT:
- Processor – Pentium IV
- Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Key Board – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA
2.3.2 SOFTWARE REQUIREMENTS:
- Operating System : Windows XP or Win7
- Front End : Microsoft Visual Studio .NET
- Back End : MS-SQL Server
- Server : ASP .NET Web Server
- Script : C# Script
- Document : MS-Office 2007
CHAPTER 3
3.0 SYSTEM DESIGN:
Data Flow Diagram / Use Case Diagram / Flow Diagram:
- The DFD is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on these data, and the output data generated by the system.
- The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components. These components are the system process, the data used by the process, an external entity that interacts with the system and the information flows in the system.
- DFD shows how the information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.
- A DFD may be used to represent a system at any level of abstraction. DFDs may be partitioned into levels that represent increasing information flow and functional detail.
NOTATION:
SOURCE OR DESTINATION OF DATA:
External sources or destinations, which may be people or organizations or other entities
DATA STORE:
Here the data referenced by a process is stored and retrieved.
PROCESS:
People, procedures, or devices that produce data; the physical component is not identified.
DATA FLOW:
Data moves in a
specific direction from an origin to a destination. The data flow is a “packet”
of data.
There are several common modeling rules when creating DFDs:
- All processes must have at least one data flow in and one data flow out.
- All processes should modify the incoming data, producing new forms of outgoing data.
- Each data store must be involved with at least one data flow.
- Each external entity must be involved with at least one data flow.
- A data flow must be attached to at least one process.
3.2 DATAFLOW DIAGRAM
UML DIAGRAMS:
3.2 USE CASE DIAGRAM:
3.3 CLASS DIAGRAM:
3.4 SEQUENCE DIAGRAM:
3.5 ACTIVITY DIAGRAM:
CHAPTER 4
4.0 IMPLEMENTATION:
FUZZY FINGERPRINT METHOD AND PROTOCOL
We describe the technical details of our fuzzy fingerprint mechanism in this section. The DLD provider obtains digests of sensitive data from the data owner. The data owner uses a sliding window and the Rabin fingerprint algorithm to generate short and hard-to-reverse (i.e., one-way) digests through the fast polynomial modulus operation. The sliding window generates small fragments of the processed data (sensitive data or network traffic), which preserves the local features of the data and provides the noise tolerance property. Rabin fingerprints are computed as polynomial modulus operations, and can be implemented with fast XOR, shift, and table look-up operations.
The Rabin fingerprint algorithm has a unique min-wise independence property, which supports fast random fingerprints selection (in uniform distribution) for partial fingerprints disclosure. The shingle-and-fingerprint process is defined as follows. A sliding window is used to generate q-grams on an input binary string first. The fingerprints of q-grams are then computed. A shingle (q-gram) is a fixed-size sequence of contiguous bytes. For example, the 3-gram shingle set of string abcdefgh consists of six elements {abc, bcd, cde, def, efg, fgh}. Local feature preservation is accomplished through the use of shingles. Therefore, our approach can tolerate sensitive data modification to some extent, e.g., inserted tags, small amount of character substitution, and lightly reformatted data.
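A minimal Python sketch of the shingle-and-fingerprint step described above (the window size and the modulus polynomial are illustrative choices; a real deployment would use a vetted irreducible polynomial):

```python
def shingles(data: bytes, q: int) -> list:
    """Slide a q-byte window over the input and collect each q-gram."""
    return [data[i:i + q] for i in range(len(data) - q + 1)]

POLY = 0x3DA3358B4DC173   # illustrative degree-53 modulus (irreducibility not verified)
DEGREE = 53

def rabin_fingerprint(gram: bytes) -> int:
    """Toy Rabin fingerprint: treat the bytes as a polynomial over GF(2)
    and reduce it modulo POLY using only shifts and XORs."""
    fp = 0
    for byte in gram:
        for bit in range(7, -1, -1):
            fp = (fp << 1) | ((byte >> bit) & 1)
            if fp >> DEGREE:   # degree overflow: subtract (XOR) the modulus
                fp ^= POLY
    return fp

grams = shingles(b"abcdefgh", 3)
# grams reproduces the six 3-gram shingles listed in the text
digests = {rabin_fingerprint(g) for g in grams}
```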
From the detection perspective, a straightforward method is for the DLD provider to raise an alert if any sensitive fingerprint matches the fingerprints from the traffic. However, this approach has a privacy issue. If there is a data leak, there is a match between two fingerprints from sensitive data and network traffic. Then, the DLD provider learns the corresponding shingle, as it knows the content of the packet. Therefore, the central challenge is to prevent the DLD provider from learning the sensitive values even in data-leak scenarios, while allowing the provider to carry out the traffic inspection.
We propose an efficient technique to address this problem. The main idea is to relax the comparison criteria by strategically introducing matching instances on the DLD provider's side without increasing false alarms for the data owner. Specifically, i) the data owner perturbs the sensitive-data fingerprints before disclosing them to the DLD provider, and ii) the DLD provider detects leaks using a range-based comparison instead of an exact match. The range used in the comparison is pre-defined by the data owner and correlates with the perturbation procedure.
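A minimal sketch of steps i) and ii), assuming the fuzzy range covers the low bits of the fingerprint (the FUZZY_BITS parameter and sample value are made up for illustration):

```python
import secrets

FUZZY_BITS = 12   # size of the fuzzy range; an assumed parameter

def fuzzify(fp: int) -> int:
    """Data-owner side: randomize the low FUZZY_BITS of a sensitive
    fingerprint before release, so the exact value is never disclosed."""
    return (fp & ~((1 << FUZZY_BITS) - 1)) | secrets.randbits(FUZZY_BITS)

def range_match(traffic_fp: int, released_fp: int) -> bool:
    """DLD-provider side: range-based comparison; a match means the two
    fingerprints agree on everything above the low FUZZY_BITS."""
    return (traffic_fp >> FUZZY_BITS) == (released_fp >> FUZZY_BITS)

sensitive_fp = 0x1A2B3C4D5E
released = fuzzify(sensitive_fp)
# A real leak still matches, but unrelated fingerprints in the same range
# also match -- this noise is what hides the exact value from the provider.
assert range_match(sensitive_fp, released)
```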
4.2 MODULES:
NETWORK SECURITY PRIVACY:
SECURITY GOAL AND THREAT MODEL:
PRIVACY GOAL AND THREAT MODEL:
PRIVACY-ENHANCING DLD:
EXPERIMENTAL EVALUATION
4.3 MODULE DESCRIPTION:
NETWORK SECURITY PRIVACY:
Network-accessible decoy resources may be deployed in a network as surveillance and early-warning tools, since such resources are not normally accessed for legitimate purposes. Techniques used by attackers attempting to compromise these decoy resources are studied during and after an attack to keep an eye on new exploitation techniques. Such analysis may be used to further tighten the security of the actual network being protected. A decoy can also direct an attacker's attention away from legitimate servers: it encourages attackers to spend their time and energy on the decoy server while distracting their attention from the data on the real server. Like a real server, a decoy is set up with intentional vulnerabilities; its purpose is to invite attacks so that the attacker's methods can be studied and that information used to increase network security.
SECURITY GOAL AND THREAT MODEL:
We categorize three causes for sensitive data to appear in the outbound traffic of an organization, including legitimate data use by employees.
• Case I Inadvertent data leak: The sensitive data is accidentally leaked in the outbound traffic by a legitimate user. This paper focuses on detecting this type of accidental data leak over supervised network channels. An inadvertent data leak may be due to human errors, such as forgetting to use encryption or carelessly forwarding an internal email and attachments to outsiders, or due to application flaws. A supervised network channel can be an unencrypted channel, or an encrypted channel whose content can be extracted and checked by an authority. Such a channel is widely used for advanced NIDS, where MITM (man-in-the-middle) SSL sessions are established instead of normal SSL sessions.
• Case II Malicious data leak: A rogue insider or a piece of stealthy software may steal sensitive personal or organizational data from a host. Because the malicious adversary can use strong private encryption, steganography, or covert channels to disable content-based traffic inspection, this type of leak is out of the scope of our network-based solution; host-based defenses (such as detecting the infection onset) need to be deployed instead.
• Case III Legitimate and intended data transfer: The sensitive data is sent by a legitimate user for legitimate purposes. In this paper, we assume that the data owner is aware of legitimate data transfers and permits such transfers, so the data owner can tell whether a piece of sensitive data in the network traffic is a leak by consulting its legitimate data-transfer policies.
PRIVACY GOAL AND THREAT MODEL:
To prevent the DLD provider from gaining knowledge of sensitive data during the detection process, we need to set up a privacy goal that is complementary to the security goal above. We model the DLD provider as a semi-honest adversary, who follows our protocol to carry out the operations but may attempt to gain knowledge about the sensitive data of the data owner. Our privacy goal is defined as follows. The DLD provider is given digests of sensitive data from the data owner and the content of network traffic to be examined. The DLD provider should not find out the exact value of a piece of sensitive data with a probability greater than 1/K, where K is an integer representing the number of all possible sensitive-data candidates that can be inferred by the DLD provider. We present a privacy-preserving DLD model with a new fuzzy fingerprint mechanism to improve data protection against a semi-honest DLD provider. We generate digests of sensitive data through a one-way function, and then hide the sensitive values among other non-sensitive values via fuzzification. The privacy guarantee of such an approach is much higher than 1/K when there is no leak in the traffic, because the adversary's inference can only be gained through brute-force guesses. The traffic content is accessible to the DLD provider in plaintext. Therefore, in the event of a data leak, the DLD provider may learn sensitive information from the traffic, which is inevitable for all deep packet inspection approaches. Our solution confines the maximal amount of information learned during the detection and provides a quantitative guarantee for data privacy.
PRIVACY-ENHANCING DLD:
Our privacy-preserving data-leak detection method supports practical data-leak detection as a service and minimizes the knowledge that a DLD provider may gain during the process. Fig. 1 lists the six operations executed by the data owner and the DLD provider in our protocol:
- PREPROCESS: run by the data owner to prepare the digests of sensitive data.
- RELEASE: the data owner sends the digests to the DLD provider.
- MONITOR and DETECT: the DLD provider collects the outgoing traffic of the organization, computes digests of the traffic content, and identifies potential leaks.
- REPORT: the DLD provider returns data-leak alerts to the data owner, possibly including false positives (i.e., false alarms).
- POSTPROCESS: the data owner pinpoints true data-leak instances.
Details are presented in the next section. The protocol is based on strategically computing data similarity, specifically the quantitative similarity between the sensitive information and the observed network traffic. High similarity indicates a potential data leak. For data-leak detection, the ability to tolerate a certain degree of data transformation in traffic is important; we refer to this property as noise tolerance.
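The six operations can be lined up in a simplified, exact-match sketch. This omits the fuzzification step; SHA-256 digests stand in for our Rabin fingerprints, and the sample strings are hypothetical.

```python
import hashlib

def shingle(data: bytes, q: int = 8) -> list:
    return [data[i:i + q] for i in range(len(data) - q + 1)]

def digest(chunk: bytes) -> bytes:
    return hashlib.sha256(chunk).digest()  # one-way digest of a shingle

# PREPROCESS and RELEASE (data owner): disclose only digests.
sensitive = b"SSN 078-05-1120 belongs to Alice"
released = {digest(s) for s in shingle(sensitive)}

# MONITOR and DETECT (DLD provider): digest traffic, look for matches.
traffic = b"...forwarded mail: SSN 078-05-1120 belongs to Alice..."
candidates = [s for s in shingle(traffic) if digest(s) in released]

# REPORT (provider) and POSTPROCESS (owner): a similarity score drives alerts.
similarity = len(candidates) / max(len(shingle(sensitive)), 1)
print(f"leak similarity: {similarity:.2f}")
```

Here every sensitive shingle reappears in the traffic, so the similarity is 1.0 and a leak would be reported; the owner confirms it in POSTPROCESS.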
Our key idea for fast and noise-tolerant comparison is the design and use of a set of local features that are representative of local data patterns; e.g., when byte b2 appears in the sensitive data, it is usually surrounded by bytes b1 and b3, forming a local pattern b1, b2, b3. Local features preserve data patterns even when modifications (insertion, deletion, and substitution) are made to parts of the data. For example, if a byte b4 is inserted after b3, the local pattern b1, b2, b3 is retained, though the global pattern (e.g., a hash of the entire document) is destroyed. To achieve the privacy goal, the data owner generates a special type of digest, which we call a fuzzy fingerprint. Intuitively, the purpose of fuzzy fingerprints is to hide the true sensitive data in a crowd, preventing the DLD provider from learning their exact values. We describe the technical details next.
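The noise tolerance of local features can be demonstrated directly. In this small experiment (the document contents are made up), a single inserted byte destroys the global hash but leaves most shingles intact:

```python
import hashlib

def shingle_set(data: bytes, q: int = 3) -> set:
    return {data[i:i + q] for i in range(len(data) - q + 1)}

original = b"confidential quarterly report"
modified = b"confidential  quarterly report"  # one byte inserted

# The global pattern (hash of the entire document) is destroyed...
print(hashlib.sha256(original).digest() == hashlib.sha256(modified).digest())

# ...but the local patterns are largely retained.
a, b = shingle_set(original), shingle_set(modified)
print(f"shingle overlap: {len(a & b) / len(a | b):.2f}")
```

The first print is False, while the shingle-set overlap stays around 0.9, which is what allows a similarity-based detector to survive light reformatting.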
EXPERIMENTAL EVALUATION:
Our data-leak detection solution can be outsourced; it is based on a fuzzy fingerprint technique that enhances data privacy during data-leak detection operations. Our approach uses a fast and practical one-way computation on the sensitive data (SSN records, classified documents, sensitive emails, etc.). It enables the data owner to securely delegate the content-inspection task to DLD providers without exposing the sensitive data. Using our detection method, the DLD provider, who is modeled as an honest-but-curious (a.k.a. semi-honest) adversary, can only gain limited knowledge about the sensitive data from either the released digests or the content being inspected.
Using our techniques, an Internet service provider (ISP) can perform detection on its customers' traffic securely and provide data-leak detection as an add-on service for its customers. We implemented our fuzzy fingerprint framework in Python, including packet collection, shingling, and Rabin fingerprinting, as well as the partial-disclosure and fingerprint-filter extensions. The Rabin fingerprint is based on a cyclic redundancy code (CRC), and we use a padding scheme to handle small inputs. In all experiments, the shingles are 8 bytes long and the fingerprints are 32 bits (33-bit irreducible polynomials in the Rabin fingerprint).
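As a rough sketch of these parameters, the following uses Python's zlib.crc32 (a fixed-polynomial CRC) as a stand-in for a true Rabin fingerprint, which would instead use a randomly chosen 33-bit irreducible polynomial; the function name and sample input are ours:

```python
import zlib

SHINGLE_LEN = 8  # 8-byte shingles, as in the experiments

def fingerprints(data: bytes) -> list:
    """32-bit CRC-based fingerprints of each 8-byte shingle of `data`."""
    return [zlib.crc32(data[i:i + SHINGLE_LEN]) & 0xFFFFFFFF
            for i in range(len(data) - SHINGLE_LEN + 1)]

fps = fingerprints(b"sensitive record 12345")
print(len(fps), [hex(f) for f in fps[:3]])
```

The output is deterministic for a given input, so the data owner and the DLD provider compute identical fingerprints for identical shingles.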
We set up a networking environment in VirtualBox and created a scenario in which sensitive data is leaked from a local network to the Internet. Multiple users' hosts (Windows 7) are placed in the local network and connect to the Internet via a gateway (Fedora). Multiple servers (HTTP, FTP, etc.) and an attacker-controlled host are placed on the Internet side. The gateway dumps the network traffic and sends it to a DLD server/provider (Linux). Using the sensitive-data fingerprints defined by the users in the local network, the DLD server performs off-line data-leak detection.
CHAPTER 5
5.0 SYSTEM STUDY:
5.1 FEASIBILITY STUDY:
The feasibility of the project is analyzed in this phase, and a business proposal is put forth with a very general plan for the project and some cost estimates. During system analysis, the feasibility study of the proposed system is carried out to ensure that the proposed system is not a burden to the company. For feasibility analysis, some understanding of the major requirements for the system is essential.
Three key considerations involved in the feasibility analysis are
- ECONOMICAL FEASIBILITY
- TECHNICAL FEASIBILITY
- SOCIAL FEASIBILITY
5.1.1 ECONOMICAL FEASIBILITY:
This study is carried out to check the economic impact that the system will have on the organization. The amount of funds that the company can pour into the research and development of the system is limited, and the expenditures must be justified. The developed system is well within budget; this was achieved because most of the technologies used are freely available. Only the customized products had to be purchased.
5.1.2 TECHNICAL FEASIBILITY:
This study is carried out to check the technical feasibility, that is, the technical requirements of the system. Any system developed must not place a high demand on the available technical resources, as this would lead to high demands being placed on the client. The developed system has modest requirements, as only minimal or no changes are required for implementing it.
5.1.3 SOCIAL FEASIBILITY:
This aspect of the study checks the level of acceptance of the system by the user. This includes the process of training the user to use the system efficiently. The user must not feel threatened by the system, but must instead accept it as a necessity. The level of acceptance by the users depends solely on the methods employed to educate the user about the system and to make him familiar with it. His level of confidence must be raised so that he is also able to make some constructive criticism, which is welcomed, as he is the final user of the system.
5.2 SYSTEM TESTING:
Testing is the process of checking whether the developed system works according to the original objectives and requirements. It is a set of activities that can be planned in advance and conducted systematically. Testing is vital to the success of the system. System testing makes the logical assumption that if all parts of the system are correct, the goal will be successfully achieved. Inadequate testing, or no testing at all, leads to errors that may not appear until months later. This creates two problems: the time lag between the cause and the appearance of the problem, and the effect of system errors on the files and records within the system. A small system error can conceivably explode into a much larger problem. Effective testing early in the process translates directly into long-term cost savings from a reduced number of errors. Another reason for system testing is its utility as a user-oriented vehicle before implementation. The best programs are worthless if they do not produce correct outputs.
5.2.1 UNIT TESTING:
A program represents the logical elements of a system. For a program to run satisfactorily, it must compile, process test data correctly, and tie in properly with other programs. Achieving an error-free program is the responsibility of the programmer. Program testing checks for two types of errors: syntax and logic. A syntax error is a program statement that violates one or more rules of the language in which it is written. An improperly defined field dimension or omitted keywords are common syntax errors; these errors are shown through error messages generated by the computer. For logic errors, the programmer must examine the output carefully.
UNIT TESTING:
Description | Expected result |
Test for application window properties. | All the properties of the windows are to be properly aligned and displayed. |
Test for mouse operations. | All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions. |
5.2.2 FUNCTIONAL TESTING:
Functional testing of an application is used to prove that the application delivers correct results, using enough inputs to give an adequate level of confidence that it will work correctly for all sets of inputs. The functional testing will need to prove that the application works for each client type and that the personalization functions work correctly. When a program is tested, the actual output is compared with the expected output. When there is a discrepancy, the sequence of instructions must be traced to determine the problem.
The process is facilitated by breaking the program into self-contained
portions, each of which can be checked at certain key points. The idea is to
compare program values against desk-calculated values to isolate the problems.
FUNCTIONAL TESTING:
Description | Expected result |
Test for all modules. | All peers should communicate in the group. |
Test for various peers in a distributed network framework, as it displays all users available in the group. | The result after execution should be accurate. |
5.2.3 NON-FUNCTIONAL TESTING:
Non-functional software testing encompasses a rich spectrum of testing strategies, describing the expected results for every test case, and uses symbolic analysis techniques. This testing is used to check that an application will work in the operational environment. Non-functional testing includes:
- Load testing
- Performance testing
- Usability testing
- Reliability testing
- Security testing
5.2.4 LOAD TESTING:
An important tool for implementing system tests is a load generator. A load generator is essential for testing quality requirements such as performance and stress. A load can be a real load; that is, the system can be put under test with real usage by having actual users connected to it. They will generate test input data for the system test.
Load Testing
Description | Expected result |
It is necessary to ascertain that the application behaves correctly under load when a 'Server busy' response is received. | Should designate another active node as the server. |
5.2.5 PERFORMANCE TESTING:
Performance tests are used to determine the broadly defined performance of the software system, such as the execution time associated with various parts of the code, response time, and device utilization. The intent of this testing is to identify weak points of the software system and quantify its shortcomings.
PERFORMANCE TESTING:
Description | Expected result |
This is required to assure that the application performs adequately, having the capability to handle many peers, delivering its results in the expected time, and using an acceptable level of resources; it is an aspect of operational management. | Should handle large input values and produce accurate results in the expected time. |
5.2.6 RELIABILITY TESTING:
Software reliability is the ability of a system or component to perform its required functions under stated conditions for a specified period of time, and it is ensured in this testing. Reliability can be expressed as the ability of the software to reveal defects under testing conditions, according to the specified requirements. It is the probability that a software system will operate without failure under given conditions for a given time interval; it focuses on the behavior of the software element and forms a part of software quality control.
RELIABILITY TESTING:
Description | Expected result |
This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in providing the application. | In case of failure of the server, an alternate server should take over the job. |
5.2.7 SECURITY TESTING:
Security testing evaluates
system characteristics that relate to the availability, integrity and
confidentiality of the system data and services. Users/Clients should be
encouraged to make sure their security needs are very clearly known at
requirements time, so that the security issues can be addressed by the
designers and testers.
SECURITY TESTING:
Description | Expected result |
Checking that the user identification is authenticated. | In case of failure, it should not be connected to the framework. |
Check whether group keys in a tree are shared by all peers. | The peers in the same group should know the group key. |
5.2.8 WHITE BOX TESTING:
White box testing, sometimes called glass-box testing, is a test-case design method that uses the control structure of the procedural design to derive test cases. White box testing focuses on the inner structure of the software to be tested.
WHITE BOX TESTING:
Description | Expected result |
Exercise all logical decisions on their true and false sides. | All the logical decisions must be valid. |
Execute all loops at their boundaries and within their operational bounds. | All the loops must be finite. |
Exercise internal data structures to ensure their validity. | All the data structures must be valid. |
5.2.9 BLACK BOX TESTING:
Black box testing, also called behavioral testing, focuses on the functional requirements of the software. That is, black box testing enables the software engineer to derive sets of input conditions that will fully exercise all functional requirements for a program. Black box testing is not an alternative to white box techniques; rather, it is a complementary approach that is likely to uncover a different class of errors than white box methods. Black box testing attempts to find errors by focusing on the inputs, outputs, and principal functions of a software module. The starting point of black box testing is either a specification or code. The contents of the box are hidden, and the stimulated software should produce the desired results.
BLACK BOX TESTING:
Description | Expected result |
To check for incorrect or missing functions. | All the functions must be valid. |
To check for interface errors. | The entire interface must function normally. |
To check for errors in a data structures or external data base access. | The database updation and retrieval must be done. |
To check for initialization and termination errors. | All the functions and data structures must be initialized properly and terminated normally. |
All of the above system-testing strategies are carried out, as the development, documentation, and institutionalization of the proposed goals and related policies are essential.
CHAPTER 6
6.0 SOFTWARE SPECIFICATION:
6.1 FEATURES OF .NET:
Microsoft .NET is a set of Microsoft software technologies for rapidly building and integrating XML Web services, Microsoft Windows-based applications, and Web solutions. The .NET Framework is a language-neutral platform for writing programs that can easily and securely interoperate. There is no language barrier with .NET: numerous languages are available to the developer, including Managed C++, C#, Visual Basic, and JScript.
The .NET framework provides the foundation for components to interact seamlessly, whether locally or remotely on different platforms. It standardizes common data types and communications protocols so that components created in different languages can easily interoperate.
“.NET” is
also the collective name given to various software components built upon the
.NET platform. These will be both products (Visual Studio.NET and Windows.NET
Server, for instance) and services (like Passport, .NET My Services, and so
on).
6.2 THE .NET FRAMEWORK
The .NET Framework has two main parts:
1. The Common Language Runtime (CLR).
2. A hierarchical set of class libraries.
The CLR is described as the “execution engine” of .NET. It provides the environment within which programs run. The most important features are
- Conversion from a low-level assembler-style language, called Intermediate Language (IL), into code native to the platform being executed on.
- Memory management, notably including garbage collection.
- Checking and enforcing security restrictions on the running code.
- Loading and executing programs, with version control and other such features.
The following features of the .NET Framework are also worth describing:
Managed Code
The code that targets .NET and contains certain extra information ("metadata") to describe itself. While both managed and unmanaged code can run in the runtime, only managed code contains the information that allows the CLR to guarantee, for instance, safe execution and interoperability.
Managed Data
With Managed Code comes Managed Data. The CLR provides memory allocation and deallocation facilities, and garbage collection. Some .NET languages use Managed Data by default, such as C#, Visual Basic.NET, and JScript.NET, whereas others, namely C++, do not. Targeting the CLR can, depending on the language you're using, impose certain constraints on the features available. As with managed and unmanaged code, one can have both managed and unmanaged data in .NET applications: data that doesn't get garbage collected but instead is looked after by unmanaged code.
Common Type System
The CLR uses something called the Common Type System (CTS) to strictly enforce type safety. This ensures that all classes are compatible with each other, by describing types in a common way. The CTS defines how types work within the runtime, which enables types in one language to interoperate with types in another language, including cross-language exception handling. As well as ensuring that types are only used in appropriate ways, the runtime also ensures that code doesn't attempt to access memory that hasn't been allocated to it.
Common Language Specification
The CLR provides built-in support for language interoperability. To ensure that you can develop managed code that can be fully used by developers using any programming language, a set of language features and rules for using them called the Common Language Specification (CLS) has been defined. Components that follow these rules and expose only CLS features are considered CLS-compliant.
6.3 THE CLASS LIBRARY
.NET provides a single-rooted hierarchy of classes, containing over 7000 types. The root of the namespace is called System; this contains basic types like Byte, Double, Boolean, and String, as well as Object. All objects derive from System.Object. As well as objects, there are value types. Value types can be allocated on the stack, which can provide useful flexibility. There are also efficient means of converting value types to object types if and when necessary.
The set of classes is pretty comprehensive, providing collections, file, screen, and network I/O, threading, and so on, as well as XML and database connectivity.
The class library is subdivided into a number of sets (or namespaces), each providing distinct areas of functionality, with dependencies between the namespaces kept to a minimum.
6.4 LANGUAGES SUPPORTED BY .NET
The multi-language capability of the .NET Framework and Visual Studio .NET enables developers to use their existing programming skills to build all types of applications and XML Web services. The .NET framework supports new versions of Microsoft’s old favorites Visual Basic and C++ (as VB.NET and Managed C++), but there are also a number of new additions to the family.
Visual Basic .NET has been updated to include many new and improved language features that make it a powerful object-oriented programming language. These features include inheritance, interfaces, and overloading, among others. Visual Basic also now supports structured exception handling, custom attributes and also supports multi-threading.
Visual Basic .NET is also CLS compliant, which means that any CLS-compliant language can use the classes, objects, and components you create in Visual Basic .NET.
Managed Extensions for C++ and attributed programming are just some of the enhancements made to the C++ language. Managed Extensions simplify the task of migrating existing C++ applications to the new .NET Framework.
C# is Microsoft’s new language. It’s a C-style language that is essentially “C++ for Rapid Application Development”. Unlike other languages, its specification is just the grammar of the language. It has no standard library of its own, and instead has been designed with the intention of using the .NET libraries as its own.
Microsoft Visual J# .NET provides the easiest transition for Java-language developers into the world of XML Web Services and dramatically improves the interoperability of Java-language programs with existing software written in a variety of other programming languages.
Active State has created Visual Perl and Visual Python, which enable .NET-aware applications to be built in either Perl or Python. Both products can be integrated into the Visual Studio .NET environment. Visual Perl includes support for Active State’s Perl Dev Kit.
Other languages for which .NET compilers are available include
- FORTRAN
- COBOL
- Eiffel
ASP.NET XML Web Services | Windows Forms
Base Class Libraries
Common Language Runtime
Operating System
Fig. 1: The .NET Framework
C#.NET is also compliant with CLS (Common Language Specification) and supports structured exception handling. CLS is set of rules and constructs that are supported by the CLR (Common Language Runtime). CLR is the runtime environment provided by the .NET Framework; it manages the execution of the code and also makes the development process easier by providing services.
C#.NET is a CLS-compliant language. Any objects, classes, or components created in C#.NET can be used in any other CLS-compliant language. In addition, we can use objects, classes, and components created in other CLS-compliant languages in C#.NET. The use of the CLS ensures complete interoperability among applications, regardless of the languages used to create them.
CONSTRUCTORS AND DESTRUCTORS:
Constructors are used to initialize objects, whereas destructors are used to destroy them. In other words, destructors are used to release the resources allocated to the object. In C#.NET, the Finalize procedure is available. The Finalize procedure is used to complete the tasks that must be performed when an object is destroyed, and it is called automatically when the object is destroyed. In addition, the Finalize procedure can be called only from the class it belongs to or from derived classes.
GARBAGE COLLECTION
Garbage Collection is another new feature in C#.NET. The .NET Framework monitors allocated resources, such as objects and variables. In addition, the .NET Framework automatically releases memory for reuse by destroying objects that are no longer in use.
In C#.NET, the garbage collector checks for the objects that are not currently in use by applications. When the garbage collector comes across an object that is marked for garbage collection, it releases the memory occupied by the object.
OVERLOADING
Overloading is another feature in C#. Overloading enables us
to define multiple procedures with the same name, where each procedure has a
different set of arguments. Besides using overloading for procedures, we can
use it for constructors and properties in a class.
MULTITHREADING:
C#.NET also supports multithreading. An application that supports multithreading can handle multiple tasks simultaneously; we can use multithreading to decrease the time taken by an application to respond to user interaction.
STRUCTURED EXCEPTION HANDLING
C#.NET supports structured exception handling, which enables us to detect and remove errors at runtime. In C#.NET, we use Try…Catch…Finally statements to create exception handlers. Using these statements, we can create robust and effective exception handlers to improve the performance of our application.
6.5 THE .NET FRAMEWORK
The .NET Framework is a new computing platform that simplifies application development in the highly distributed environment of the Internet.
OBJECTIVES OF .NET FRAMEWORK
1. To provide a consistent object-oriented programming environment, whether object code is stored and executed locally, executed locally but Internet-distributed, or executed remotely.
2. To provide a code-execution environment that minimizes software deployment problems and guarantees safe execution of code.
3. To eliminate performance problems.
There are
different types of application, such as Windows-based applications and Web-based
applications.
6.6 FEATURES OF SQL-SERVER
The OLAP Services feature available in SQL Server version 7.0 is now called SQL Server 2000 Analysis Services. The term OLAP Services has been replaced with the term Analysis Services. Analysis Services also includes a new data mining component. The Repository component available in SQL Server version 7.0 is now called Microsoft SQL Server 2000 Meta Data Services. References to the component now use the term Meta Data Services. The term repository is used only in reference to the repository engine within Meta Data Services.
A SQL-SERVER database consists of several types of objects:
1. TABLE
2. QUERY
3. FORM
4. REPORT
5. MACRO
6.7 TABLE:
A database is a collection of data about a specific topic.
VIEWS OF TABLE:
We can work with a table in two types,
1. Design View
2. Datasheet View
Design View
To build or modify the structure of a table, we work in the table design view, where we can specify what kind of data the table will hold.
Datasheet View
To add, edit, or analyze the data itself, we work in the table's datasheet view mode.
QUERY:
A query is a question asked of the data. Access gathers data that answers the question from one or more tables. The data that makes up the answer is either a dynaset (if you can edit it) or a snapshot (which cannot be edited). Each time we run a query, we get the latest information in the dynaset. Access either displays the dynaset or snapshot for us to view, or performs an action on it, such as deleting or updating.
CHAPTER 7
APPENDIX
7.1 SAMPLE SOURCE CODE
7.2 SAMPLE OUTPUT
CHAPTER 8
8.1 CONCLUSION
Our fuzzy fingerprint method differs from these solutions and enables its adopter to provide data-leak detection as a service: the customer or data owner does not need to fully trust the DLD provider using our approach. The Bloom filter is a space-saving data structure for set-membership tests and is used in network security from the network layer upward. The fuzzy Bloom filter constructs a special Bloom filter that probabilistically sets the corresponding filter bits to 1's; although designed to support a resource-sufficient routing scheme, it is a potential privacy-preserving technique. We do not invent a variant of the Bloom filter for our fuzzy fingerprint; our fuzzification process is separate from the membership test. The advantage of separating fingerprint fuzzification from the membership test is the flexibility to test whether a fingerprint is sensitive with or without fuzzification.
Besides our fuzzy fingerprint solution for data-leak detection, there are other privacy-preserving techniques invented for specific processes (e.g., data matching) or for general-purpose use, e.g., secure multi-party computation (SMC). SMC is a cryptographic mechanism that supports a wide range of fundamental arithmetic, set, and string operations, as well as complex functions such as knapsack computation, automated trouble-shooting, network event statistics, private information retrieval, genomic computation, private database queries, private join operations, and distributed data mining. The provable privacy guarantees offered by SMC come at a cost in terms of computational complexity and realization difficulty. The advantage of our approach is its concision and efficiency.
8.2 FUTURE ENHANCEMENT:
We proposed fuzzy fingerprint, a privacy-preserving data-leak detection model, and presented its realization. Using special digests, the exposure of the sensitive data is kept to a minimum during the detection. We have conducted extensive experiments to validate the accuracy, privacy, and efficiency of our solutions. For future work, we plan to focus on designing a host-assisted mechanism for complete data-leak detection in large-scale organizations.
Panda Public Auditing for Shared Data with Efficient User Revocation in the Cloud
- ABSTRACT:
With data storage and sharing services in the cloud, users can easily modify and share data as a group. To ensure that shared data integrity can be verified publicly, users in the group need to compute signatures on all the blocks in shared data. Different blocks in shared data are generally signed by different users due to data modifications performed by different users. For security reasons, once a user is revoked from the group, the blocks which were previously signed by this revoked user must be re-signed by an existing user. The straightforward method, which allows an existing user to download the corresponding part of shared data and re-sign it during user revocation, is inefficient due to the large size of shared data in the cloud. In this paper, we propose a novel public auditing mechanism for the integrity of shared data with efficient user revocation in mind. By utilizing the idea of proxy re-signatures, we allow the cloud to re-sign blocks on behalf of existing users during user revocation, so that existing users do not need to download and re-sign blocks by themselves. In addition, a public verifier is always able to audit the integrity of shared data without retrieving the entire data from the cloud, even if some part of shared data has been re-signed by the cloud. Moreover, our mechanism is able to support batch auditing by verifying multiple auditing tasks simultaneously. Experimental results show that our mechanism can significantly improve the efficiency of user revocation.
- INTRODUCTION
With data storage and sharing services (such as Dropbox and Google Drive) provided by the cloud, people can easily work together as a group by sharing data with each other. More specifically, once a user creates shared data in the cloud, every user in the group is able to not only access and modify shared data, but also share the latest version of the shared data with the rest of the group. Although cloud providers promise a more secure and reliable environment to the users, the integrity of data in the cloud may still be compromised, due to the existence of hardware/software failures and human errors.
To protect the integrity of data in the cloud, a number of mechanisms have been proposed. In these mechanisms, a signature is attached to each block in data, and the integrity of data relies on the correctness of all the signatures. One of the most significant and common features of these mechanisms is to allow a public verifier to efficiently check data integrity in the cloud without downloading the entire data, referred to as public auditing (or denoted as Provable Data Possession). This public verifier could be a client who would like to utilize cloud data for particular purposes (e.g., search, computation, data mining, etc.) or a third-party auditor (TPA) who is able to provide verification services on data integrity to users. Most of the previous works focus on auditing the integrity of personal data. Different from these works, several recent works focus on how to preserve identity privacy from public verifiers when auditing the integrity of shared data. Unfortunately, none of the above mechanisms considers the efficiency of user revocation when auditing the correctness of shared data in the cloud.
With shared data, once a user modifies a block, she also needs to compute a new signature for the modified block. Due to the modifications from different users, different blocks are signed by different users. For security reasons, when a user leaves the group or misbehaves, this user must be revoked from the group. As a result, this revoked user should no longer be able to access and modify shared data, and the signatures generated by this revoked user are no longer valid to the group. Therefore, although the content of shared data is not changed during user revocation, the blocks, which were previously signed by the revoked user, still need to be re-signed by an existing user in the group. As a result, the integrity of the entire data can still be verified with the public keys of existing users only.
Since shared data is outsourced to the cloud and users no longer store it on local devices, a straightforward method to re-compute these signatures during user revocation is to ask an existing user to first download the blocks previously signed by the revoked user, verify the correctness of these blocks, then re-sign these blocks, and finally upload the new signatures to the cloud. However, this straightforward method may cost the existing user a huge amount of communication and computation resources by downloading and verifying blocks, and by re-computing and uploading signatures, especially when the number of re-signed blocks is quite large or the membership of the group is frequently changing. To make this matter even worse, existing users may access their data sharing services provided by the cloud with resource-limited devices, such as mobile phones, which further prevents existing users from maintaining the correctness of shared data efficiently during user revocation.
Clearly, if the cloud could possess each user’s private key, it can easily finish the re-signing task for existing users without asking them to download and re-sign blocks. However, since the cloud is not in the same trusted domain with each user in the group, outsourcing every user’s private key to the cloud would introduce significant security issues. Another important problem we need to consider is that the re-computation of any signature during user revocation should not affect the most attractive property of public auditing — auditing data integrity publicly without retrieving the entire data. Therefore, how to efficiently reduce the significant burden to existing users introduced by user revocation, and still allow a public verifier to check the integrity of shared data without downloading the entire data from the cloud, is a challenging task.
In this paper, we propose Panda, a novel public auditing mechanism for the integrity of shared data with efficient user revocation in the cloud. In our mechanism, by utilizing the idea of proxy re-signatures, once a user in the group is revoked, the cloud is able to re-sign the blocks, which were signed by the revoked user, with a re-signing key. As a result, the efficiency of user revocation can be significantly improved, and computation and communication resources of existing users can be easily saved. Meanwhile, the cloud, which is not in the same trusted domain as each user, is only able to convert a signature of the revoked user into a signature of an existing user on the same block, but it cannot sign arbitrary blocks on behalf of either the revoked user or an existing user. By designing a new proxy re-signature scheme with nice properties, which traditional proxy re-signatures do not have, our mechanism is always able to check the integrity of shared data without retrieving the entire data from the cloud.
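The signature-conversion idea described above can be sketched with a toy BLS-style proxy re-signature. The following Python snippet is a conceptual illustration only, not the Panda scheme itself: real constructions use elliptic-curve pairings, and the tiny parameters here offer no security. It shows how a re-signing key lets the cloud turn a revoked user's signature into an existing user's signature on the same block, without the cloud ever holding either private key.

```python
import hashlib
import random

# Toy BLS-style proxy re-signature over a small prime-order subgroup.
# Conceptual sketch only: real schemes use elliptic-curve pairings,
# and these tiny parameters provide no security whatsoever.
p, q, g = 2039, 1019, 4        # p = 2q + 1; g generates the order-q subgroup

def h(message):
    digest = hashlib.sha256(message.encode()).digest()
    return int.from_bytes(digest, "big") % q or 1

def sign(sk, message):
    # sigma = H(m)^sk, with the hash-to-group map modeled as g^h(m)
    return pow(pow(g, h(message), p), sk, p)

sk_revoked = random.randrange(2, q)     # private key of the revoked user
sk_existing = random.randrange(2, q)    # private key of an existing user

# Re-signing key handed to the cloud: rk = sk_existing / sk_revoked (mod q).
# Knowing rk alone reveals neither private key.
rk = sk_existing * pow(sk_revoked, -1, q) % q

sigma_old = sign(sk_revoked, "block 17")
sigma_new = pow(sigma_old, rk, p)       # the cloud converts the signature

print(sigma_new == sign(sk_existing, "block 17"))   # True
```

The conversion works because exponents compose: (H(m)^a)^(b/a) = H(m)^b, so the re-signed block verifies under the existing user's key exactly as if that user had signed it.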
- LITERATURE SURVEY
PUBLIC AUDITING FOR SHARED DATA WITH EFFICIENT USER REVOCATION IN THE CLOUD
PUBLICATION: B. Wang, B. Li, and H. Li, in the Proceedings of IEEE INFOCOM 2013, 2013, pp. 2904–2912.
With data storage and sharing services in the cloud, users can easily modify and share data as a group. To ensure shared data integrity can be verified publicly, users in the group need to compute signatures on all the blocks in shared data. Different blocks in shared data are generally signed by different users due to data modifications performed by different users. For security reasons, once a user is revoked from the group, the blocks which were previously signed by this revoked user must be re-signed by an existing user. The straightforward method, which allows an existing user to download the corresponding part of shared data and re-sign it during user revocation, is inefficient due to the large size of shared data in the cloud. In this paper, we propose a novel public auditing mechanism for the integrity of shared data with efficient user revocation in mind. By utilizing the idea of proxy re-signatures, we allow the cloud to re-sign blocks on behalf of existing users during user revocation, so that existing users do not need to download and re-sign blocks by themselves. In addition, a public verifier is always able to audit the integrity of shared data without retrieving the entire data from the cloud, even if some part of shared data has been re-signed by the cloud. Moreover, our mechanism is able to support batch auditing by verifying multiple auditing tasks simultaneously. Experimental results show that our mechanism can significantly improve the efficiency of user revocation.
A VIEW OF CLOUD COMPUTING, COMMUNICATIONS OF THE ACM
PUBLICATION: M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. H. Katz, A. Konwinski, G. Lee, D. A. Patterson, A. Rabkin, I. Stoica, and M. Zaharia, vol. 53, no. 4, pp. 50–58, April 2010.
Cloud computing, the long-held dream of computing as a utility, has the potential to transform a large part of the IT industry, making software even more attractive as a service and shaping the way IT hardware is designed and purchased. Developers with innovative ideas for new Internet services no longer require the large capital outlays in hardware to deploy their service or the human expense to operate it. They need not be concerned about overprovisioning for a service whose popularity does not meet their predictions, thus wasting costly resources, or underprovisioning for one that becomes wildly popular, thus missing potential customers and revenue. Moreover, companies with large batch-oriented tasks can get results as quickly as their programs can scale, since using 1,000 servers for one hour costs no more than using one server for 1,000 hours. This elasticity of resources, without paying a premium for large scale, is unprecedented in the history of IT.
PROVABLE DATA POSSESSION AT UNTRUSTED STORES
PUBLICATION: G. Ateniese, R. Burns, R. Curtmola, J. Herring, L. Kissner, Z. Peterson, and D. Song, in the Proceedings of ACM CCS 2007, 2007, pp. 598–610.
We introduce a model for provable data possession (PDP) that allows a client that has stored data at an untrusted server to verify that the server possesses the original data without retrieving it. The model generates probabilistic proofs of possession by sampling random sets of blocks from the server, which drastically reduces I/O costs. The client maintains a constant amount of metadata to verify the proof. The challenge/response protocol transmits a small, constant amount of data, which minimizes network communication. Thus, the PDP model for remote data checking supports large data sets in widely-distributed storage systems. We present two provably-secure PDP schemes that are more efficient than previous solutions, even when compared with schemes that achieve weaker guarantees. In particular, the overhead at the server is low (or even constant), as opposed to linear in the size of the data. Experiments using our implementation verify the practicality of PDP and reveal that the performance of PDP is bounded by disk I/O and not by cryptographic computation.
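The PDP idea of constant client metadata plus random block sampling can be sketched as follows. This Python snippet substitutes a Merkle tree for PDP's homomorphic tags, so it is a simplification of the idea rather than the scheme from the paper; the block contents and counts are illustrative. The client keeps only a single root hash, challenges a few random block indices, and verifies each response without retrieving the rest of the file.

```python
import hashlib
import random

# PDP-style spot check (simplified): the client keeps one constant-size
# Merkle root as metadata and challenges a few random blocks, rather than
# downloading the whole file. Real PDP uses homomorphic tags instead.

def H(data):
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    level = [H(x) for x in leaves]
    while len(level) > 1:
        level = [H(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def merkle_proof(leaves, idx):
    """Authentication path for leaf idx (leaf count must be a power of two)."""
    level = [H(x) for x in leaves]
    path = []
    while len(level) > 1:
        path.append((level[idx ^ 1], idx % 2))     # (sibling, is-right-child)
        level = [H(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        idx //= 2
    return path

def verify(root, block, path):
    node = H(block)
    for sibling, is_right in path:
        node = H(sibling + node) if is_right else H(node + sibling)
    return node == root

# Client outsources 16 blocks, keeping only the constant-size root.
blocks = [f"block-{i}".encode() for i in range(16)]
root = merkle_root(blocks)

# Challenge/response: sample random indices; the server answers with
# block + proof; the verifier checks without retrieving the rest.
for i in random.sample(range(16), 3):
    assert verify(root, blocks[i], merkle_proof(blocks, i))
print("spot check passed")
```

This captures the I/O saving the paper describes: the challenge and the stored metadata stay small and constant-size regardless of how many blocks the server holds.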
COMPACT PROOFS OF RETRIEVABILITY
PUBLICATION: H. Shacham and B. Waters, in the Proceedings of ASIACRYPT 2008. Springer-Verlag, 2008, pp. 90–107.
In a proof-of-retrievability system, a data storage center must prove to a verifier that he is actually storing all of a client's data. The central challenge is to build systems that are both efficient and provably secure, that is, it should be possible to extract the client's data from any prover that passes a verification check. In this paper, we give the first proof-of-retrievability schemes with full proofs of security against arbitrary adversaries in the strongest model, that of Juels and Kaliski. Our first scheme, built from BLS signatures and secure in the random oracle model, features a proof-of-retrievability protocol in which the client's query and server's response are both extremely short. This scheme allows public verifiability: anyone can act as a verifier, not just the file owner. Our second scheme, which builds on pseudorandom functions (PRFs) and is secure in the standard model, allows only private verification. It features a proof-of-retrievability protocol with an even shorter server's response than our first scheme, but the client's query is long. Both schemes rely on homomorphic properties to aggregate a proof into one small authenticator value.
CHAPTER 2
2.0 SYSTEM ANALYSIS
2.1 EXISTING SYSTEM:
In the existing system, files uploaded to the cloud are not signed by the user at each upload, so the integrity of shared data cannot be verified. Moreover, since the cloud is not in the same trusted domain as each user in the group, outsourcing every user's private key to the cloud would introduce significant security issues.
2.1.1 DISADVANTAGES:
2.2 PROPOSED SYSTEM:
In our proposed system, the cloud may lie to verifiers about the incorrectness of shared data in order to save the reputation of its data services and avoid losing money on those services. In addition, we also assume there is no collusion between the cloud and any user during the design of our mechanism. Generally, the incorrectness of shared data under the above semi-trusted model can be introduced by hardware/software failures or human errors in the cloud. Considering these factors, users do not fully trust the cloud with the integrity of shared data.
2.2.1 ADVANTAGES:
1. Blocking user account
2. Security question
3. Login with secret key each time
2.3 HARDWARE & SOFTWARE REQUIREMENTS:
2.3.1 HARDWARE REQUIREMENT:
- Processor – Pentium IV
- Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Keyboard – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA
2.3.2 SOFTWARE REQUIREMENTS:
- Operating System : Windows XP
- Front End : Microsoft Visual Studio .NET 2008
- Back End : MS-SQL Server 2005
- Document : MS-Office 2007
CHAPTER 3
3.0 SYSTEM DESIGN:
Data Flow Diagram / Use Case Diagram / Flow Diagram:
- The DFD is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on these data, and the output data generated by the system.
- The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components. These components are the system process, the data used by the process, an external entity that interacts with the system and the information flows in the system.
- DFD shows how the information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.
- DFD is also known as bubble chart. A DFD may be used to represent a system at any level of abstraction. DFD may be partitioned into levels that represent increasing information flow and functional detail.
NOTATION:
SOURCE OR DESTINATION OF DATA:
External sources or destinations, which may be people or organizations or other entities
DATA STORE:
Here the data referenced by a process is stored and retrieved.
PROCESS:
People, procedures or devices that produce data. The physical component is not identified.
DATA FLOW:
Data moves in a specific direction from an origin to a destination. The data flow is a “packet” of data.
MODELING RULES:
There are several common modeling rules when creating DFDs:
- All processes must have at least one data flow in and one data flow out.
- All processes should modify the incoming data, producing new forms of outgoing data.
- Each data store must be involved with at least one data flow.
- Each external entity must be involved with at least one data flow.
- A data flow must be attached to at least one process.
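The modeling rules above lend themselves to a mechanical check. The following Python sketch models a DFD as sets of elements plus (source, destination) flows and flags violations of the flow-related rules; the element names are hypothetical, and the rule that processes must transform their data is a semantic property that cannot be checked from the graph alone.

```python
# A minimal, hypothetical model of a DFD that mechanically checks the
# modeling rules listed above. Element and flow names are illustrative.

processes = {"validate order", "update stock"}
stores = {"orders"}
entities = {"customer"}

flows = [                                 # (source, destination) pairs
    ("customer", "validate order"),
    ("validate order", "orders"),
    ("orders", "update stock"),
    ("update stock", "customer"),
]

def check_dfd(processes, stores, entities, flows):
    errors = []
    incoming = {dst for _, dst in flows}
    outgoing = {src for src, _ in flows}
    for p in processes:                   # rule: flow in AND flow out
        if p not in incoming or p not in outgoing:
            errors.append(f"process '{p}' lacks a flow in or out")
    for e in stores | entities:           # rule: at least one data flow
        if e not in incoming and e not in outgoing:
            errors.append(f"'{e}' is not involved in any data flow")
    for src, dst in flows:                # rule: flow attached to a process
        if src not in processes and dst not in processes:
            errors.append(f"flow {src} -> {dst} touches no process")
    return errors

print(check_dfd(processes, stores, entities, flows))   # []
```

Running the checker on the sample diagram reports no violations; dropping a flow or adding a store-to-entity flow would produce a corresponding error message.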
3.1 BLOCK DIAGRAM
3.2 DATAFLOW DIAGRAM
UML DIAGRAMS:
3.3 USE CASE DIAGRAM:
3.4 CLASS DIAGRAM:
3.5 SEQUENCE DIAGRAM:
3.6 ACTIVITY DIAGRAM:
CHAPTER 4
4.0 IMPLEMENTATION:
4.1 ALGORITHM
4.2 MODULES:
1. USER MODULE:
Registration
File Upload
Download
Reupload
Unblock module
2. AUDITOR MODULE:
File Verification module
View File
3. ADMIN MODULE:
View Files
Block user
4.3 MODULE DESCRIPTION:
- USER MODULE:
Registration:
In this module, each user registers his details for using files. Only a registered user is able to log in to the cloud server.
File Upload:
In this module, the user uploads a block of files to the cloud, encrypted using his secret key. This ensures the files are protected from unauthorized users.
Download:
This module allows the user to download a file using his secret key to decrypt the downloaded data of a blocked user, verify the data, and re-upload the block of files to the cloud server with encryption. This ensures the files are protected from unauthorized users.
Reupload:
This module allows the user to re-upload the downloaded files of a blocked user to the cloud server, re-signing the files; i.e., the files are uploaded with a new signature (a new secret key) and encryption, to protect the data from unauthorized users.
Unblock Module:
This module allows the user to unblock his account by answering the security question he provided at registration time. Only if the answer matches the one given at registration is the account unlocked.
- AUDITOR MODULE:
File Verification module:
The public verifier is able to correctly check the integrity of shared data. The public verifier can audit the integrity of shared data without retrieving the entire data from the cloud, even if some blocks in shared data have been re-signed by the cloud.
Files View:
In this module, the public auditor views all details of uploads, downloads, blocked users, and re-uploads.
- ADMIN MODULE:
View Files:
In this module, the admin views all details of uploads, downloads, blocked users, and re-uploads.
Block User:
In this module, the admin blocks a misbehaving user's account to protect the integrity of shared data.
CHAPTER 5
5.0 SYSTEM STUDY:
5.1 FEASIBILITY STUDY:
The feasibility of the project is analyzed in this phase, and a business proposal is put forth with a very general plan for the project and some cost estimates. During system analysis, the feasibility study of the proposed system is carried out. This is to ensure that the proposed system is not a burden to the company. For feasibility analysis, some understanding of the major requirements for the system is essential.
Three key considerations involved in the feasibility analysis are
- ECONOMICAL FEASIBILITY
- TECHNICAL FEASIBILITY
- SOCIAL FEASIBILITY
5.1.1 ECONOMICAL FEASIBILITY:
This study is carried out to check the economic impact that the system will have on the organization. The amount of funds that the company can pour into the research and development of the system is limited. The expenditures must be justified. Thus the developed system is well within the budget, and this was achieved because most of the technologies used are freely available. Only the customized products had to be purchased.
5.1.2 TECHNICAL FEASIBILITY:
This study is carried out to check the technical feasibility, that is, the technical requirements of the system. Any system developed must not place a high demand on the available technical resources, as this would lead to high demands being placed on the client. The developed system must have modest requirements, as only minimal or no changes are required for implementing this system.
5.1.3 SOCIAL FEASIBILITY:
This aspect of the study is to check the level of acceptance of the system by the user. This includes the process of training the user to use the system efficiently. The user must not feel threatened by the system, but must instead accept it as a necessity. The level of acceptance by the users solely depends on the methods that are employed to educate the user about the system and to make him familiar with it. His level of confidence must be raised so that he is also able to offer constructive criticism, which is welcomed, as he is the final user of the system.
5.2 SYSTEM TESTING:
Testing is a process of checking whether the developed system works according to the original objectives and requirements. It is a set of activities that can be planned in advance and conducted systematically. Testing is vital to the success of the system. System testing makes a logical assumption that if all the parts of the system are correct, the goal will be successfully achieved. Inadequate testing, or no testing at all, leads to errors that may not appear until many months later. This creates two problems: the time lag between the cause and the appearance of the problem, and the effect of system errors on the files and records within the system. A small system error can conceivably explode into a much larger problem. Effective testing early in the process translates directly into long-term cost savings from a reduced number of errors. Another reason for system testing is its utility as a user-oriented vehicle before implementation. The best program is worthless if it does not produce the correct outputs.
5.2.1 UNIT TESTING:
A program represents the logical elements of a system. For a program to run satisfactorily, it must compile and test data correctly and tie in properly with other programs. Achieving an error free program is the responsibility of the programmer. Program testing checks for two types of errors: syntax and logical. Syntax error is a program statement that violates one or more rules of the language in which it is written. An improperly defined field dimension or omitted keywords are common syntax errors. These errors are shown through error message generated by the computer. For Logic errors the programmer must examine the output carefully.
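A unit test for a single program element can be sketched as follows. The function under test and its data are hypothetical, chosen to echo the block-signing domain of this project; the point is that each logical element is exercised in isolation, catching logic errors that the compiler's syntax checks cannot.

```python
import unittest

def block_signature_count(blocks, signer):
    """Count blocks signed by a given user (hypothetical helper)."""
    return sum(1 for block in blocks if block["signer"] == signer)

class TestBlockSignatureCount(unittest.TestCase):
    def test_counts_only_matching_signer(self):
        blocks = [{"signer": "alice"}, {"signer": "bob"}, {"signer": "alice"}]
        self.assertEqual(block_signature_count(blocks, "alice"), 2)

    def test_empty_input_yields_zero(self):
        self.assertEqual(block_signature_count([], "alice"), 0)

# Run the suite programmatically so the result can be inspected.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestBlockSignatureCount)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())   # True
```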
UNIT TESTING:
Description | Expected result |
Test for application window properties. | All the properties of the windows are to be properly aligned and displayed. |
Test for mouse operations. | All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions. |
5.2.2 FUNCTIONAL TESTING:
Functional testing of an application is used to prove that the application delivers correct results, using enough inputs to give an adequate level of confidence that it will work correctly for all sets of inputs. The functional testing will need to prove that the application works for each client type and that the personalization functions work correctly. When a program is tested, the actual output is compared with the expected output. When there is a discrepancy, the sequence of instructions must be traced to determine the problem. The process is facilitated by breaking the program into self-contained portions, each of which can be checked at certain key points. The idea is to compare program values against desk-calculated values to isolate the problems.
FUNCTIONAL TESTING:
Description | Expected result |
Test for all modules. | All peers should communicate in the group. |
Test for various peer in a distributed network framework as it display all users available in the group. | The result after execution should give the accurate result. |
5.2.3 NON-FUNCTIONAL TESTING:
Non-functional software testing encompasses a rich spectrum of testing strategies, describing the expected results for every test case. It uses symbolic analysis techniques. This testing is used to check that an application will work in the operational environment. Non-functional testing includes:
- Load testing
- Performance testing
- Usability testing
- Reliability testing
- Security testing
5.2.4 LOAD TESTING:
An important tool for implementing system tests is a load generator. A load generator is essential for testing quality requirements such as performance and stress. A load can be a real load; that is, the system can be put under real usage by having actual users connected to it, generating test input data for the system test.
Load Testing
Description | Expected result |
It is necessary to ascertain that the application behaves correctly under loads when ‘Server busy’ response is received. | Should designate another active node as a Server. |
5.2.5 PERFORMANCE TESTING:
Performance tests are utilized in order to determine the widely defined performance of the software system such as execution time associated with various parts of the code, response time and device utilization. The intent of this testing is to identify weak points of the software system and quantify its shortcomings.
PERFORMANCE TESTING:
Description | Expected result |
This is required to assure that the application performs adequately, having the capability to handle many peers, delivering its results in the expected time and using an acceptable level of resources; it is an aspect of operational management. | Should handle large input values, and produce accurate results in the expected time. |
5.2.6 RELIABILITY TESTING:
Software reliability is the ability of a system or component to perform its required functions under stated conditions for a specified period of time, and it is ensured in this testing. Reliability can be expressed as the ability of the software to reveal defects under testing conditions, according to the specified requirements. It is the probability that a software system will operate without failure under given conditions for a given time interval, and it focuses on the behavior of the software element. It forms a part of software quality control.
RELIABILITY TESTING:
Description | Expected result |
This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in providing the application. | In case of failure of the server, an alternate server should take over the job. |
5.2.7 SECURITY TESTING:
Security
testing evaluates system characteristics that relate to the availability,
integrity and confidentiality of the system data and services. Users/Clients
should be encouraged to make sure their security needs are very clearly known
at requirements time, so that the security issues can be addressed by the
designers and testers.
SECURITY TESTING:
Description | Expected result |
Checking that the user identification is authenticated. | In case of failure, it should not be connected in the framework. |
Check whether group keys in a tree are shared by all peers. | The peers should know group key in the same group. |
5.2.8 WHITE BOX TESTING:
White box testing, sometimes called glass-box testing, is a test case design method that uses the control structure of the procedural design to derive test cases. Using the white box testing method, the software engineer can derive test cases. White box testing focuses on the inner structure of the software to be tested.
WHITE BOX TESTING:
Description | Expected result |
Exercise all logical decisions on their true and false sides. | All the logical decisions must be valid. |
Execute all loops at their boundaries and within their operational bounds. | All the loops must be finite. |
Exercise internal data structures to ensure their validity. | All the data structures must be valid. |
5.2.9 BLACK BOX TESTING:
Black box testing, also called behavioral testing, focuses on the functional requirements of the software. That is, black box testing enables the software engineer to derive sets of input conditions that will fully exercise all functional requirements for a program. Black box testing is not an alternative to white box techniques; rather, it is a complementary approach that is likely to uncover a different class of errors than white box methods. Black box testing attempts to find errors with a focus on the inputs, outputs, and principal functions of a software module. The starting point of black box testing is either a specification or code. The contents of the box are hidden, and the stimulated software should produce the desired results.
BLACK BOX TESTING:
Description | Expected result |
To check for incorrect or missing functions. | All the functions must be valid. |
To check for interface errors. | The entire interface must function normally. |
To check for errors in a data structures or external data base access. | The database updation and retrieval must be done. |
To check for initialization and termination errors. | All the functions and data structures must be initialized properly and terminated normally. |
All the above system testing strategies are carried out, as the development, documentation, and institutionalization of the proposed goals and related policies are essential.
CHAPTER 6
6.0 SOFTWARE SPECIFICATION:
6.1 FEATURES OF .NET:
Microsoft .NET is a set of Microsoft software technologies for rapidly building and integrating XML Web services, Microsoft Windows-based applications, and Web solutions. The .NET Framework is a language-neutral platform for writing programs that can easily and securely interoperate. There's no language barrier with .NET: there are numerous languages available to the developer, including Managed C++, C#, Visual Basic, and JScript.
The .NET framework provides the foundation for components to interact seamlessly, whether locally or remotely on different platforms. It standardizes common data types and communications protocols so that components created in different languages can easily interoperate.
“.NET” is
also the collective name given to various software components built upon the
.NET platform. These will be both products (Visual Studio.NET and Windows.NET
Server, for instance) and services (like Passport, .NET My Services, and so
on).
6.2 THE .NET FRAMEWORK
The .NET Framework has two main parts:
1. The Common Language Runtime (CLR).
2. A hierarchical set of class libraries.
The CLR is described as the “execution engine” of .NET. It provides the environment within which programs run. The most important features are
- Conversion from a low-level assembler-style language, called Intermediate Language (IL), into code native to the platform being executed on.
- Memory management, notably including garbage collection.
- Checking and enforcing security restrictions on the running code.
- Loading and executing programs, with version control and other such features.
- The following features of the .NET framework are also worth description:
Managed Code
The code
that targets .NET, and which contains certain extra Information – “metadata” –
to describe itself. Whilst both managed and unmanaged code can run in the
runtime, only managed code contains the information that allows the CLR to
guarantee, for instance, safe execution and interoperability.
Managed Data
With Managed Code comes Managed Data. The CLR provides memory allocation and deallocation facilities, and garbage collection. Some .NET languages use Managed Data by default, such as C#, Visual Basic.NET and JScript.NET, whereas others, namely C++, do not. Targeting the CLR can, depending on the language you're using, impose certain constraints on the features available. As with managed and unmanaged code, one can have both managed and unmanaged data in .NET applications – data that doesn't get garbage collected but instead is looked after by unmanaged code.
Common Type System
The CLR uses something called the Common Type System (CTS) to strictly enforce type-safety. This ensures that all classes are compatible with each other by describing types in a common way. The CTS defines how types work within the runtime, which enables types in one language to interoperate with types in another language, including cross-language exception handling. As well as ensuring that types are only used in appropriate ways, the runtime also ensures that code doesn’t attempt to access memory that hasn’t been allocated to it.
Common Language Specification
The CLR provides built-in support for language interoperability. To ensure that you can develop managed code that can be fully used by developers using any programming language, a set of language features and rules for using them called the Common Language Specification (CLS) has been defined. Components that follow these rules and expose only CLS features are considered CLS-compliant.
6.3 THE CLASS LIBRARY
.NET provides a single-rooted hierarchy of classes, containing over 7000 types. The root of the namespace is called System; this contains basic types like Byte, Double, Boolean, and String, as well as Object. All objects derive from System.Object. As well as objects, there are value types. Value types can be allocated on the stack, which can provide useful flexibility. There are also efficient means of converting value types to object types if and when necessary.
The set of classes is pretty comprehensive, providing collections, file, screen, and network I/O, threading, and so on, as well as XML and database connectivity.
The class library is subdivided into a number of sets (or namespaces), each providing distinct areas of functionality, with dependencies between the namespaces kept to a minimum.
6.4 LANGUAGES SUPPORTED BY .NET
The multi-language capability of the .NET Framework and Visual Studio .NET enables developers to use their existing programming skills to build all types of applications and XML Web services. The .NET framework supports new versions of Microsoft’s old favorites Visual Basic and C++ (as VB.NET and Managed C++), but there are also a number of new additions to the family.
Visual Basic .NET has been updated to include many new and improved language features that make it a powerful object-oriented programming language. These features include inheritance, interfaces, and overloading, among others. Visual Basic .NET now also supports structured exception handling, custom attributes, and multithreading.
Visual Basic .NET is also CLS compliant, which means that any CLS-compliant language can use the classes, objects, and components you create in Visual Basic .NET.
Managed Extensions for C++ and attributed programming are just some of the enhancements made to the C++ language. Managed Extensions simplify the task of migrating existing C++ applications to the new .NET Framework.
C# is Microsoft’s new language. It’s a C-style language that is essentially “C++ for Rapid Application Development”. Unlike other languages, its specification is just the grammar of the language. It has no standard library of its own, and instead has been designed with the intention of using the .NET libraries as its own.
Microsoft Visual J# .NET provides the easiest transition for Java-language developers into the world of XML Web Services and dramatically improves the interoperability of Java-language programs with existing software written in a variety of other programming languages.
Active State has created Visual Perl and Visual Python, which enable .NET-aware applications to be built in either Perl or Python. Both products can be integrated into the Visual Studio .NET environment. Visual Perl includes support for Active State’s Perl Dev Kit.
Other languages for which .NET compilers are available include
- FORTRAN
- COBOL
- Eiffel
ASP.NET / XML Web Services | Windows Forms
Base Class Libraries
Common Language Runtime
Operating System
Fig 1. .NET Framework
C#.NET is also compliant with the CLS (Common Language Specification) and supports structured exception handling. The CLS is a set of rules and constructs that are supported by the CLR (Common Language Runtime). The CLR is the runtime environment provided by the .NET Framework; it manages the execution of the code and also makes the development process easier by providing services.
C#.NET is a CLS-compliant language. Any objects, classes, or components created in C#.NET can be used in any other CLS-compliant language. In addition, we can use objects, classes, and components created in other CLS-compliant languages in C#.NET. The use of the CLS ensures complete interoperability among applications, regardless of the languages used to create them.
CONSTRUCTORS AND DESTRUCTORS:
Constructors are used to initialize objects, whereas destructors are used to destroy them. In other words, destructors are used to release the resources allocated to the object. In C#.NET, the Finalize method is available for this purpose. The Finalize method is used to complete the tasks that must be performed when an object is destroyed, and it is called automatically when an object is destroyed. In addition, the Finalize method can be called only from the class it belongs to or from derived classes.
GARBAGE COLLECTION
Garbage Collection is another new feature in C#.NET. The .NET Framework monitors allocated resources, such as objects and variables. In addition, the .NET Framework automatically releases memory for reuse by destroying objects that are no longer in use.
In C#.NET, the garbage collector checks for the objects that are not currently in use by applications. When the garbage collector comes across an object that is marked for garbage collection, it releases the memory occupied by the object.
OVERLOADING
Overloading is another feature in C#. Overloading enables us to define multiple procedures with the same name, where each procedure has a different set of arguments. Besides using overloading for procedures, we can use it for constructors and properties in a class.
MULTITHREADING:
C#.NET also supports multithreading. An application that supports multithreading can handle multiple tasks simultaneously. We can use multithreading to decrease the time taken by an application to respond to user interaction.
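The responsiveness gain described above comes from moving slow work onto a background thread while the main thread stays free for user interaction. A minimal, language-agnostic sketch of the same pattern in Python (the task and its timing are illustrative, not part of this project):

```python
import threading
import time

def long_running_task(results):
    """Simulates slow work that would otherwise block user interaction."""
    time.sleep(0.1)
    results.append("done")

results = []
worker = threading.Thread(target=long_running_task, args=(results,))
worker.start()   # the main thread stays free to respond to the user
# ... the main thread could handle user interaction here ...
worker.join()    # wait for the background task before using its result
```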
STRUCTURED EXCEPTION HANDLING
C#.NET supports structured exception handling, which enables us to detect and handle errors at run time. In C#.NET, we use Try…Catch…Finally statements to create exception handlers. Using Try…Catch…Finally statements, we can create robust and effective exception handlers to improve the reliability of our application.
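The Try…Catch…Finally structure maps directly onto try/except/finally in other languages. A minimal sketch of the pattern in Python, where the file path and messages are purely illustrative:

```python
def read_config(path):
    """Attempt to read a file, handling errors in a structured way."""
    f = None
    try:
        f = open(path)          # may raise OSError (the "Try" block)
        return f.read()
    except OSError as exc:      # the "Catch" block: handle the runtime error
        print(f"could not read {path}: {exc}")
        return ""
    finally:                    # the "Finally" block: always runs
        if f is not None:
            f.close()
```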
6.5 THE .NET FRAMEWORK
The .NET Framework is a new computing platform that simplifies application development in the highly distributed environment of the Internet.
OBJECTIVES OF .NET FRAMEWORK
1. To provide a consistent object-oriented programming environment whether object code is stored and executed locally, executed locally but Internet-distributed, or executed remotely.
2. To provide a code-execution environment that minimizes software deployment conflicts and guarantees safe execution of code.
3. To eliminate the performance problems of scripted or interpreted environments.
There are different types of applications, such as Windows-based applications and Web-based applications.
6.6 FEATURES OF SQL-SERVER
The OLAP Services feature available in SQL Server version 7.0 is now called SQL Server 2000 Analysis Services. The term OLAP Services has been replaced with the term Analysis Services. Analysis Services also includes a new data mining component. The Repository component available in SQL Server version 7.0 is now called Microsoft SQL Server 2000 Meta Data Services. References to the component now use the term Meta Data Services. The term repository is used only in reference to the repository engine within Meta Data Services.
A SQL-SERVER database consists of several types of objects.
They are:
1. TABLE
2. QUERY
3. FORM
4. REPORT
5. MACRO
TABLE:
A table is a collection of data about a specific topic.
VIEWS OF TABLE:
We can work with a table in two views:
1. Design View
2. Datasheet View
Design View
To build or modify the structure of a table, we work in the table’s Design view. We can specify what kind of data the table will hold.
Datasheet View
To add, edit, or analyze the data itself, we work in the table’s Datasheet view.
QUERY:
A query is a question asked of the data. Access gathers the data that answers the question from one or more tables. The data that makes up the answer is either a dynaset (if you can edit it) or a snapshot (which cannot be edited). Each time we run a query, we get the latest information in the dynaset. Access displays the dynaset or snapshot for us to view, or we can perform an action on it, such as deleting or updating.
CHAPTER 7
APPENDIX
7.1 SAMPLE SOURCE CODE
7.2 SAMPLE OUTPUT
CHAPTER 8
8.1 CONCLUSION
In this paper, we proposed a new public auditing mechanism for shared data with efficient user revocation in the cloud. When a user in the group is revoked, we allow the semi-trusted cloud to re-sign blocks that were signed by the revoked user with proxy re-signatures. Experimental results show that the cloud can improve the efficiency of user revocation, and existing users in the group can save a significant amount of computation and communication resources during user revocation.
CHAPTER 9
Neighbor Similarity Trust against Sybil Attack in P2P E-Commerce
In this paper, we present a distributed, structured approach to the Sybil attack, based on the neighbor similarity trust relationship among neighboring peers. Given a P2P e-commerce trust relationship based on interest, the transactions among peers are flexible, as each peer can decide to trade with another peer at any time. A peer does not have to consult others in a group unless a recommendation is needed. This approach shows the advantage of exploiting the similarity trust relationship among peers, in which the peers are able to monitor each other.
Our contribution in this paper is threefold:
1) We propose SybilTrust that can identify and protect honest peers from Sybil attack. The Sybil peers can have their trust canceled and dismissed from a group.
2) Based on the group infrastructure in P2P e-commerce, each neighbor is connected to the peers by the success of the transactions it makes or the trust evaluation level. A peer can only be recognized as a neighbor depending on whether or not trust level is sustained over a threshold value.
3) SybilTrust enables neighbor peers to carry recommendation identifiers among the peers in a group. This ensures that the group detection algorithms that identify Sybil attack peers are efficient and scalable in large P2P e-commerce networks.
- GOAL OF THE PROJECT:
The goal of trust systems is to ensure that honest peers are accurately identified as trustworthy and Sybil peers as untrustworthy. To unify terminology, we call all identities created by malicious users Sybil peers. In a P2P e-commerce application scenario, most of the trust considerations depend on the historical factors of the peers. The influence of Sybil identities can be reduced based on the historical behavior and recommendations from other peers. For example, a peer may give a positive recommendation to a peer that is later discovered to be a Sybil or malicious peer. This can diminish the influence of Sybil identities and hence reduce Sybil attacks. A peer which has been giving dishonest recommendations will have its trust level reduced; in case it reaches a certain threshold level, the peer can be expelled from the group. Each peer has an identity, which is either honest or Sybil.
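The trust bookkeeping described above (dishonest recommendations lower a peer’s trust level, and falling below a threshold triggers expulsion from the group) can be sketched as follows. The class names, penalty, and threshold values are hypothetical illustrations, not taken from the paper:

```python
EXPEL_THRESHOLD = 0.2   # hypothetical trust level below which a peer is expelled
PENALTY = 0.1           # hypothetical penalty per dishonest recommendation

class Peer:
    def __init__(self, pid, trust=0.5):
        self.pid = pid
        self.trust = trust

class Group:
    def __init__(self):
        self.members = {}

    def join(self, peer):
        self.members[peer.pid] = peer

    def report_dishonest_recommendation(self, pid):
        """Reduce the reported peer's trust; expel it below the threshold."""
        peer = self.members.get(pid)
        if peer is None:
            return
        peer.trust = max(0.0, peer.trust - PENALTY)
        if peer.trust < EXPEL_THRESHOLD:
            del self.members[pid]   # expelled from the group
```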
A Sybil identity can be an identity owned by a malicious user, a bribed/stolen identity, or a fake identity obtained through a Sybil attack. These Sybil attack peers are employed to target honest peers and hence subvert the system. In a Sybil attack, a single malicious user creates a large number of peer identities called sybils. These sybils are used to launch security attacks, both at the application level and at the overlay level. At the application level, sybils can target other honest peers while transacting with them, whereas at the overlay level, sybils can disrupt the services offered by the overlay layer, such as routing, data storage, and lookup. In trust systems, colluding Sybil peers may artificially increase a (malicious) peer’s rating (e.g., on eBay).
1.2 INTRODUCTION:
P2P networks range from communication systems like email and instant messaging to collaborative content rating, recommendation, and delivery systems such as YouTube, Gnutella, Facebook, Digg, and BitTorrent. They allow any user to join the system easily, at the expense of trust, with very little validation control. P2P overlay networks are known for their many desired attributes like openness, anonymity, decentralized nature, self-organization, scalability, and fault tolerance. Each peer plays the dual role of client and server, meaning that each has its own control. All the resources utilized in the P2P infrastructure are contributed by the peers themselves, unlike traditional methods where a central authority is used. Peers can collude and carry out all sorts of malicious activities in these open-access distributed systems. These malicious behaviors lead to service quality degradation and monetary loss among business partners. Peers are vulnerable to exploitation due to the open and near-zero cost of creating new identities. The peer identities are then utilized to influence the behavior of the system.
However, if a single defective entity can present multiple identities, it can control a substantial fraction of the system, thereby undermining the redundancy. The number of identities that an attacker can generate depends on the attacker’s resources, such as bandwidth, memory, and computational power. The goal of trust systems is to ensure that honest peers are accurately identified as trustworthy and Sybil peers as untrustworthy. To unify terminology, we call all identities created by malicious users Sybil peers. In a P2P e-commerce application scenario, most of the trust considerations depend on the historical factors of the peers. The influence of Sybil identities can be reduced based on the historical behavior and recommendations from other peers. For example, a peer may give a positive recommendation to a peer that is later discovered to be a Sybil or malicious peer. This can diminish the influence of Sybil identities and hence reduce Sybil attacks. A peer which has been giving dishonest recommendations will have its trust level reduced; in case it reaches a certain threshold level, the peer can be expelled from the group.
Each peer has an identity, which is either honest or Sybil. A Sybil identity can be an identity owned by a malicious user, a bribed/stolen identity, or a fake identity obtained through a Sybil attack. These Sybil attack peers are employed to target honest peers and hence subvert the system. In a Sybil attack, a single malicious user creates a large number of peer identities called sybils. These sybils are used to launch security attacks, both at the application level and at the overlay level. At the application level, sybils can target other honest peers while transacting with them, whereas at the overlay level, sybils can disrupt the services offered by the overlay layer, such as routing, data storage, and lookup. In trust systems, colluding Sybil peers may artificially increase a (malicious) peer’s rating (e.g., on eBay). Systems like Credence rely on a trusted central authority to prevent maliciousness.
Defending against a Sybil attack is quite a challenging task. A peer can pretend to be trusted with a hidden motive. The peer can pollute the system with bogus information, which interferes with genuine business transactions and the functioning of the systems. This must be countered to protect the honest peers. The link between an honest peer and a Sybil peer is known as an attack edge. As each edge involved resembles a human-established trust, it is difficult for the adversary to introduce an excessive number of attack edges. The only known promising defense against Sybil attacks is to use social networks to perform user admission control and limit the number of bogus identities admitted to a system. The use of social networks between two peers represents a real-world trust relationship between users. In addition, authentication-based mechanisms are used to verify the identities of the peers using shared encryption keys or location information.
1.3 LITERATURE SURVEY:
KEEP YOUR FRIENDS CLOSE: INCORPORATING TRUST INTO SOCIAL NETWORK-BASED SYBIL DEFENSES
AUTHOR: A. Mohaisen, N. Hopper, and Y. Kim
PUBLISH: Proc. IEEE Int. Conf. Comput. Commun., 2011, pp. 1–9.
EXPLANATION:
Social network-based Sybil defenses
exploit the algorithmic properties of social graphs to infer the extent to
which an arbitrary node in such a graph should be trusted. However, these
systems do not consider the different amounts of trust represented by different
graphs, and different levels of trust between nodes, though trust is a
crucial requirement in these systems. For instance, co-authors in an academic
collaboration graph are trusted in a different manner than social friends.
Furthermore, some social friends are more trusted than others. However,
previous designs for social network-based Sybil defenses have not considered
the inherent trust properties of the graphs they use. In this paper we introduce
several designs to tune the performance of Sybil defenses by accounting for
differential trust in social graphs and modeling these trust values by biasing
random walks performed on these graphs. Surprisingly, we find that the cost
function, the required length of random walks to accept all honest nodes with
overwhelming probability, is much greater in graphs with high trust values,
such as co-author graphs, than in graphs with low trust values such as online
social networks. We show that this behavior is due to the community structure
in high-trust graphs, requiring longer walks to traverse multiple communities.
Furthermore, we show that our proposed designs to account for trust, while increasing the cost function of graphs with low trust values, decrease the advantage of the attacker.
FOOTPRINT: DETECTING SYBIL ATTACKS IN URBAN VEHICULAR NETWORKS
AUTHOR: S. Chang, Y. Qi, H. Zhu, J. Zhao, and X. Shen
PUBLISH: IEEE Trans. Parallel Distrib. Syst., vol. 23, no. 6, pp. 1103–1114, Jun. 2012.
EXPLANATION:
In urban vehicular networks, where
privacy, especially the location privacy of anonymous vehicles is highly
concerned, anonymous verification of vehicles is indispensable. Consequently,
an attacker who succeeds in forging multiple hostile identities can easily launch
a Sybil attack, gaining a disproportionately large influence. In this paper, we
propose a novel Sybil attack detection mechanism, Footprint, using the
trajectories of vehicles for identification while still preserving their
location privacy. More specifically, when a vehicle approaches a road-side unit
(RSU), it actively demands an authorized message from the RSU as the proof of
the appearance time at this RSU. We design a location-hidden authorized message
generation scheme for two objectives: first, RSU signatures on messages are
signer ambiguous so that the RSU location information is concealed from the
resulting authorized message; second, two authorized messages signed by the same
RSU within the same given period of time (temporarily linkable) are recognizable
so that they can be used for identification. With the temporal limitation on
the linkability of two authorized messages, authorized messages used for
long-term identification are prohibited. With this scheme, vehicles can
generate a location-hidden trajectory for location-privacy-preserved
identification by collecting a consecutive series of authorized messages.
Utilizing social relationship among trajectories according to the similarity
definition of two trajectories, Footprint can recognize and therefore dismiss
“communities” of Sybil trajectories. Rigorous security analysis and extensive
trace-driven simulations demonstrate the efficacy of Footprint.
SYBILLIMIT: A NEAR-OPTIMAL SOCIAL NETWORK DEFENSE AGAINST SYBIL ATTACKS
AUTHOR: H. Yu, P. Gibbons, M. Kaminsky, and F. Xiao
PUBLISH: IEEE/ACM Trans. Netw., vol. 18, no. 3, pp. 3–17, Jun. 2010.
EXPLANATION:
Decentralized distributed systems
such as peer-to-peer systems are particularly vulnerable to sybil attacks,
where a malicious user pretends to have multiple identities (called sybil
nodes). Without a trusted central authority, defending against sybil attacks is
quite challenging. Among the small number of decentralized approaches, our
recent SybilGuard protocol [H. Yu et al., 2006] leverages a key insight on
social networks to bound the number of sybil nodes accepted. Although its
direction is promising, SybilGuard can allow a large number of sybil nodes to
be accepted. Furthermore, SybilGuard assumes that social networks are fast
mixing, which has never been confirmed in the real world. This paper presents
the novel SybilLimit protocol that leverages the same insight as SybilGuard but
offers dramatically improved and near-optimal guarantees. The number of sybil
nodes accepted is reduced by a factor of Θ(√n), or around 200 times in
our experiments for a million-node system. We further prove that SybilLimit’s
guarantee is at most a log n factor away from optimal, when considering
approaches based on fast-mixing social networks. Finally, based on three
large-scale real-world social networks, we provide the first evidence that
real-world social networks are indeed fast mixing. This validates the
fundamental assumption behind SybilLimit’s and SybilGuard’s approach.
CHAPTER 2
2.0 SYSTEM ANALYSIS
2.1 EXISTING SYSTEM:
Existing work on the Sybil attack makes use of social networks to eliminate Sybil attacks, and the findings are based on preventing Sybil identities. In this paper, we propose the use of neighbor similarity trust in a group P2P e-commerce network based on interest relationships, to eliminate maliciousness among the peers. This is referred to as SybilTrust. In SybilTrust, the peers in the interest-based group infrastructure have a neighbor similarity trust between each other; hence they are able to prevent Sybil attacks. SybilTrust gives a better relationship in e-commerce transactions, as the peers create a link between peer neighbors. This provides an important avenue for peers to advertise their products to other interested peers and to discover new market destinations and contacts as well. In addition, the group enables a peer to join the P2P e-commerce network and makes forging an identity more difficult.
Peers use self-certifying identifiers that are
exchanged when they initially come into contact. These can be used as public
keys to verify digital signatures on the messages sent by their neighbors. We
note that, all communications between peers are digitally signed. In this kind
of relationship, we use neighbors as our point of reference to address Sybil
attack. Under whatever admission control a group enforces, there are honest, malicious, and
Sybil peers who are authenticated by an admission control mechanism to join the
group. More honest peers are admitted compared to malicious peers, where the
trust association is aimed at positive results. The knowledge of the graph may
reside in a single party, or be distributed across all users.
2.1.0 DISADVANTAGES:
If a peer trades with very few unsuccessful transactions, we can deduce the peer is a Sybil peer. This is supported by our approach, which proposes that peers existing in a group have six types of keys.
The keys which exist mostly are pairwise keys supported by the group keys. We also note that if an honest group has a link with another group which has Sybil peers, the Sybil group tends to have information which is not complete.
- Fake users can enter the system easily.
- This enables Sybil attacks.
2.2 PROPOSED SYSTEM:
In this paper, we assume there are three kinds of peers in the system: legitimate peers, malicious peers, and Sybil peers. Each malicious peer cheats its neighbors by creating multiple identities, referred to as Sybil peers. In this paper, P2P e-commerce communities are organized in several groups. A group can be either open or restrictive, depending on the interest of the peers. We investigate the peers belonging to a certain interest group. In each group, there is a group leader who is responsible for managing the coordination of activities in the group.
The principal building block of the SybilTrust approach is the identifier distribution process. In the approach, all the peers with similar behavior in a group can be used as identifier sources. They can send identifiers to others as the system regulates. If a peer sends fewer or more identifiers than regulated, the system may be facing a Sybil attack peer. This information can be broadcast to the rest of the peers in the group. When peers join a group, they acquire different identities in reference to the group. Each peer has neighbors in the group and outside the group. Sybil attack peers forged by the same malicious peer have the same set of physical neighbors that the malicious peer has.
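The identifier-distribution check described above can be sketched as a simple per-round anomaly test: peers sending noticeably fewer or more identifiers than the system regulates are flagged as suspected Sybil peers. The expected count and tolerance are hypothetical values, not taken from the paper:

```python
EXPECTED = 5    # hypothetical number of identifiers the system regulates per round
TOLERANCE = 1   # hypothetical allowed deviation before a peer is flagged

def suspected_sybils(sent_counts):
    """sent_counts: {peer_id: identifiers sent this round}.
    Returns the peers whose identifier count deviates beyond tolerance."""
    return {pid for pid, n in sent_counts.items()
            if abs(n - EXPECTED) > TOLERANCE}
```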
Each neighbor is connected to the peers by the success of the transactions it makes or the trust evaluation level. To detect the Sybil attack, where a peer can have different identities, a peer is evaluated with reference to its trustworthiness and its similarity to the neighbors. If the neighbors do not have the same trust data as the concerned peer, including its position, it can be detected that the peer has multiple identities and is cheating.
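The detection rule above compares the trust data, including position, that neighbors hold about a peer against the peer’s own claim; any mismatch suggests multiple identities. A sketch under the illustrative assumption that this data is a (trust level, position) tuple:

```python
def is_suspected_multiple_identity(claimed, neighbor_views):
    """claimed: the peer's own (trust_level, position) tuple.
    neighbor_views: the tuples its neighbors hold about it.
    Returns True if any neighbor's view disagrees with the claim."""
    return any(view != claimed for view in neighbor_views)
```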
2.2.0 ADVANTAGES:
Our perception is that the attacker controls a number of neighbor similarity peers, whereby a randomly chosen identifier source is relatively “far away” from most Sybil attack peer relationships. Every peer uses a “reversed” routing table. The source peer will always send some information to the peers which have neighbor similarity trust. However, if they do not reply, it can blacklist them. If they do reply and the source is overwhelmed by the overhead of such replies, then the adversary is effectively launching a DoS attack. Notice that the adversary can launch a DoS attack against the source. This enables two peers to propagate their public keys and IP addresses backward along the route to learn about the peers.
- It is helpful in finding Sybil attacks.
- It can be used to find fake user IDs.
- It is feasible to limit the number of attack edges in online social networks by relationship rating.
2.3 HARDWARE & SOFTWARE REQUIREMENTS:
2.3.0 HARDWARE REQUIREMENT:
- Processor – Pentium IV
- Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Keyboard – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA
2.3.1 SOFTWARE REQUIREMENTS:
- Operating System : Windows XP or Win7
- Front End : Microsoft Visual Studio .NET
- Script : C# Script
- Document : MS-Office 2007
CHAPTER 3
3.0 SYSTEM DESIGNS
3.1 ARCHITECTURE DIAGRAM:
3.2 DATAFLOW DIAGRAM:
LEVEL 0:
Neighbor Nodes |
Source |
LEVEL 1:
P2P Sybil Trust Mode |
Send Data Request |
LEVEL 2:
Data Receive |
P2P ACK |
Active Attack (Malicious Node) |
Send Data Request |
LEVEL 3:
3.3 UML DIAGRAMS
3.3.0 USECASE DIAGRAM:
SERVER CLIENT
3.3.1 CLASS DIAGRAM:
3.3.2 SEQUENCE DIAGRAM:
3.4 ACTIVITY DIAGRAM:
CHAPTER 4
4.0 IMPLEMENTATION:
In this paper, P2P e-commerce communities are organized in several groups. A group can be either open or restrictive, depending on the interest of the peers. We investigate the peers belonging to a certain interest group. In each group, there is a group leader who is responsible for managing the coordination of activities in the group. When peers join a group, they acquire different identities in reference to the group. Each peer has neighbors in the group and outside the group. Sybil attack peers forged by the same malicious peer have the same set of physical neighbors that the malicious peer has. Each neighbor is connected to the peers by the success of the transactions it makes or the trust evaluation level. To detect the Sybil attack, where a peer can have different identities, a peer is evaluated with reference to its trustworthiness and its similarity to the neighbors. If the neighbors do not have the same trust data as the concerned peer, including its position, it can be detected that the peer has multiple identities and is cheating. The method of detection of the Sybil attack is depicted in Fig. 2. A1 and A2 refer to the same peer but with different identities.
In our approach, the identifiers are only propagated by the peers who exhibit neighbor similarity trust. Our perception is that the attacker controls a number of neighbor similarity peers, whereby a randomly chosen identifier source is relatively “far away” from most Sybil attack peer relationships. Every peer uses a “reversed” routing table. The source peer will always send some information to the peers which have neighbor similarity trust. However, if they do not reply, it can blacklist them. If they do reply and the source is overwhelmed by the overhead of such replies, then the adversary is effectively launching a DoS attack. Notice that the adversary can launch a DoS attack against the source. This enables two peers to propagate their public keys and IP addresses backward along the route to learn about the peers. SybilTrust proposes that an honest peer should not have an excessive number of neighbors. The neighbors we refer to should be member peers existing in a group. The restriction helps to bound the number of peers against any additional attack among the neighbors. If there are too many neighbors, SybilTrust will (internally) only use a subset of the peer’s edges while ignoring all others. Following Liben-Nowell and Kleinberg, we define the attributes of a given pair of peers as the intersection of their sets of similar products.
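The attribute definition above is the intersection of two peers’ sets of similar products; a common way to turn such an intersection into a similarity score is the Jaccard coefficient. The sketch below uses that coefficient as an illustrative assumption (the text itself only specifies the intersection):

```python
def shared_attributes(products_u, products_v):
    """The pair's attributes: intersection of their sets of similar products."""
    return products_u & products_v

def jaccard_similarity(products_u, products_v):
    """Similarity score: |intersection| / |union|, in [0, 1]."""
    union = products_u | products_v
    if not union:
        return 0.0
    return len(products_u & products_v) / len(union)
```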
4.1 MODULES:
SIMILARITY TRUST RELATIONSHIP:
NEIGHBOR SIMILARITY TRUST:
DETECTION OF SYBIL ATTACK:
SECURITY AND PERFORMANCE:
4.2 MODULES DESCRIPTION:
SIMILARITY TRUST RELATIONSHIP:
We focus on the active attacks in P2P e-commerce. When a peer is compromised, all of its information can be extracted. In our work, we have proposed the use of SybilTrust, which is based on the neighbor similarity relationship of the peers. SybilTrust is efficient and scalable to a group P2P e-commerce network. Sybil attack peers may attempt to compromise the edges or the peers of the group P2P e-commerce network, and they can execute further malicious actions in the network. The threat being addressed is identity-based active attacks, as peers are continuously transacting with one another; each controller should admit only the honest peers.
Our method assumes that the controller undergoes synchronization to prove whether the peers which acted as distributors of identifiers had similarity or not. If a peer never had similarity, the peer is assumed to have been a Sybil attack peer. A pairing method is used to generate an expander graph with a high-probability expansion factor. Every pair of neighbor peers shares a unique symmetric secret key (the edge key), established out of band, for authenticating each other. Peers may deliberately cause Byzantine faults in which their multiple identities and incorrect behavior end up undetected.
The Sybil attack peers can
create more non-existent links. The protocols and services for P2P, such as
routing protocols must operate efficiently regardless of the group size. In the
neighbor similarity trust, peers must have self-healing capability in order to recover
automatically from any state. Sybil attack can defeat replication and fragmentation
performed in distributed hash tables. Geographic routing in P2P can also be a
routing mechanism which can be compromised by Sybil peers.
NEIGHBOR SIMILARITY TRUST:
We present a Sybil identification algorithm that takes place in a neighbor similarity trust. The network is a directed graph with edges and vertices: in our work, V is the set of peers and E is the set of edges. The edges in a neighbor similarity graph include attack edges, which must be safeguarded against Sybil attacks. A peer u and a peer v that trade can assess whether the other is a Sybil peer or not. Being in a group, a comparison can be made to determine the number of peers which trade with a given peer. If a peer has traded with very few peers and its transactions were unsuccessful, we can deduce that the peer is a Sybil peer. This is supported by our approach, which proposes that a peer existing in a group holds six types of keys; those that exist mostly are pairwise keys supported by the group keys. We also note that if an honest group has a link with another group which contains Sybil peers, the Sybil group tends to have incomplete information. Our algorithm adaptively tests the suspected peer while maintaining the neighbor similarity trust connection over time.
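The core test above — flag a peer whose trading record with the group shows too few successful transactions — can be sketched as follows. The 0.5 threshold and the field names are illustrative assumptions, not values from the document.

```csharp
using System;

// Sketch of the neighbor-similarity Sybil test: a peer is suspected when it
// has no trading evidence in the group, or when its ratio of successful
// transactions falls below a threshold.
class PeerRecord
{
    public int Successful;    // positive transactions with group members
    public int Unsuccessful;  // negative transactions with group members
}

static class SybilTest
{
    public static bool IsSuspectedSybil(PeerRecord p, double threshold = 0.5)
    {
        int total = p.Successful + p.Unsuccessful;
        if (total == 0) return true;   // never traded: no similarity evidence at all
        return (double)p.Successful / total < threshold;
    }
}
```

In line with the adaptive testing the algorithm describes, such a check would be repeated over time rather than applied once.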
DETECTION OF SYBIL ATTACK:
In a Sybil attack, a malicious peer tries to present multiple distinct identities. This can be achieved either by generating legal identities or by impersonating other normal peers. Some peers may launch arbitrary attacks to interfere with P2P e-commerce operations or with the normal functioning of the network. An attacker can succeed in launching a Sybil attack through:
- Heterogeneous configuration: malicious peers can have more communication and computation resources than the honest peers.
- Message manipulation: the attacker can eavesdrop on nearby communications with other parties, obtaining and interpolating the information needed to impersonate others. Major attacks in P2P e-commerce can be classified as passive and active attacks.
- Passive attack: the attacker listens to incoming and outgoing messages in order to infer relevant information from the transmitted recommendations, i.e., eavesdropping, but does not harm the system. A peer can be in passive mode and later switch to active mode.
- Active attack: when a malicious peer receives a recommendation for forwarding, it can modify it; when requested to provide recommendations on another peer, it can inflate them or bad-mouth. Bad-mouthing is a situation where a malicious peer colludes with other malicious peers to take revenge on an honest peer. In the Sybil attack, a malicious peer generates a large number of identities and uses them together to disrupt normal operation.
SECURITY AND PERFORMANCE:
We evaluate the performance of the proposed SybilTrust. We measure two metrics, namely, the non-trustworthy rate and the detection rate. The non-trustworthy rate is the ratio of the number of honest peers erroneously marked as Sybil/malicious peers to the total number of honest peers. The detection rate is the proportion of detected Sybil/malicious peers to the total Sybil/malicious peers.
Communication cost: the trust level is sent with the recommendation feedback from one peer to another. If a peer is compromised, the information is broadcast to all peers so that revocation of its trust level can be carried out.
Computation cost: the SybilTrust approach is efficient in the computation of polynomial evaluation. The calculation of the trust level evaluation is based on a pseudo-random function (PRF), which is a deterministic function.
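The two evaluation metrics defined above are simple ratios; a minimal sketch (method names are ours):

```csharp
using System;

// Evaluation metrics for the Sybil defense:
//   non-trustworthy rate = honest peers wrongly marked Sybil / total honest peers
//   detection rate       = detected Sybil peers / total Sybil peers
static class Metrics
{
    public static double NonTrustworthyRate(int honestMarkedSybil, int totalHonest)
        => (double)honestMarkedSybil / totalHonest;

    public static double DetectionRate(int detectedSybil, int totalSybil)
        => (double)detectedSybil / totalSybil;
}
```

For example, 5 of 100 honest peers wrongly flagged gives a non-trustworthy rate of 0.05, and 80 of 100 Sybil peers caught gives a detection rate of 0.8.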
In our simulation, we use the C# .NET tool. Each honest and malicious peer interacted with a random number of peers defined by a uniform distribution. All the peers are restricted to the group. In our approach, the P2P e-commerce community has a total of three different categories of interest. Transaction interactions between peers with a similar interest are defined as successful or unsuccessful, expressed as positive or negative respectively. To evaluate the impact of the first two parameters on the performance of the mechanism, the percentage of requests each malicious peer replies to is chosen randomly, and transactions are run with 10 to 40 percent malicious peers.
Our SybilTrust approach detects more malicious peers than EigenTrust and Eigen Group Trust [26], as shown in Fig. 4, which plots the detection rates of the P2P network as the number of malicious peers increases. When the number of deployed peers is small, e.g., 40 peers, the chance that no peers are around a malicious peer is high. Fig. 4 also illustrates the variation of the non-trustworthy rate for different numbers of honest peers as the number of malicious peers increases. It shows that the non-trustworthy rate increases as the numbers of honest peers and malicious peers increase. The reason is that when there are more malicious peers, the number of target groups is larger; moreover, the neighbor relationship is used to categorize peers in the proposed approach. The number of target groups also increases when the number of honest peers is higher. As a result, the honest peers are examined more times, and the chance that an honest peer is erroneously determined to be a Sybil/malicious peer increases, although more Sybil attack peers can also be identified. Fig. 4 further displays the detection rate when the reply rate of each malicious peer is the same. The detection rate does not decrease when the reply rate is more than 80 percent, because of the enhancement.
The enhancement can still be observed even when a malicious peer replies to almost all of its Sybil attack peer requests. Furthermore, the detection rate grows as the number of malicious peers increases, which means the proposed mechanism is able to resist a Sybil attack mounted by more malicious peers. The detection rate remains above 80 percent even in a sparse network, and reaches 95 percent when the number of legitimate nodes is 300. This is also because the number of target groups increases as the number of malicious peers increases, so the honest peers are examined more times. Therefore, the rate at which an honest peer is erroneously identified as a Sybil/malicious peer also increases.
CHAPTER 5
5.0 SYSTEM STUDY:
5.1 FEASIBILITY STUDY:
The feasibility of the project is analyzed in this phase, and a business proposal is put forth with a very general plan for the project and some cost estimates. During system analysis, the feasibility study of the proposed system is carried out to ensure that the proposed system is not a burden to the company. For feasibility analysis, some understanding of the major requirements for the system is essential.
Three key considerations involved in the feasibility analysis are
- ECONOMICAL FEASIBILITY
- TECHNICAL FEASIBILITY
- SOCIAL FEASIBILITY
5.1.1 ECONOMICAL FEASIBILITY:
This study is carried out to check the economic impact that the system will have on the organization. The amount of funds that the company can pour into the research and development of the system is limited, so the expenditures must be justified. The developed system is well within the budget, and this was achieved because most of the technologies used are freely available. Only the customized products had to be purchased.
5.1.2 TECHNICAL FEASIBILITY:
This study is carried out to check the technical feasibility, that is, the technical requirements of the system. Any system developed must not place a high demand on the available technical resources, as this would lead to high demands being placed on the client. The developed system must have modest requirements, as only minimal or no changes are required for implementing this system.
5.1.3 SOCIAL FEASIBILITY:
This aspect of the study checks the level of acceptance of the system by the user. This includes the process of training the user to use the system efficiently. The user must not feel threatened by the system, but must instead accept it as a necessity. The level of acceptance by the users depends solely on the methods that are employed to educate users about the system and to make them familiar with it. Their level of confidence must be raised so that they are also able to offer constructive criticism, which is welcomed, as they are the final users of the system.
5.2 SYSTEM TESTING:
Testing is the process of checking whether the developed system works according to the original objectives and requirements. It is a set of activities that can be planned in advance and conducted systematically. Testing is vital to the success of the system. System testing makes the logical assumption that if all the parts of the system are correct, the goal will be successfully achieved. Inadequate testing, or no testing at all, leads to errors that may not appear for many months. This creates two problems: the time lag between the cause and the appearance of the problem, and the effect of system errors on the files and records within the system. A small system error can conceivably explode into a much larger problem. Effective testing early in the process translates directly into long-term cost savings from a reduced number of errors. Another reason for system testing is its utility as a user-oriented vehicle before implementation. The best programs are worthless if they do not produce correct outputs.
5.2.1 UNIT TESTING:
A program represents the logical elements of a system. For a program to run satisfactorily, it must compile, process test data correctly, and tie in properly with other programs. Achieving an error-free program is the responsibility of the programmer. Program testing checks for two types of errors: syntax and logic. A syntax error is a program statement that violates one or more rules of the language in which it is written. An improperly defined field dimension or omitted keywords are common syntax errors. These errors are shown through error messages generated by the computer. For logic errors, the programmer must examine the output carefully.
UNIT TESTING:
Description | Expected result |
Test for application window properties. | All the properties of the windows are to be properly aligned and displayed. |
Test for mouse operations. | All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions. |
5.2.2 FUNCTIONAL TESTING:
Functional testing of an application is used to prove that the application delivers correct results, using enough inputs to give an adequate level of confidence that it will work correctly for all sets of inputs. The functional testing will need to prove that the application works for each client type and that the personalization functions work correctly. When a program is tested, the actual output is compared with the expected output. When there is a discrepancy, the sequence of instructions must be traced to determine the problem.
The process is facilitated by breaking the program into self-contained
portions, each of which can be checked at certain key points. The idea is to
compare program values against desk-calculated values to isolate the problems.
FUNCTIONAL TESTING:
Description | Expected result |
Test for all modules. | All peers should communicate in the group. |
Test for various peers in a distributed network framework, as it displays all users available in the group. | The result after execution should give the accurate result. |
5.2.3 NON-FUNCTIONAL TESTING:
Non-functional software testing encompasses a rich spectrum of testing strategies, describing the expected results for every test case, and uses symbolic analysis techniques. This testing is used to check that an application will work in the operational environment. Non-functional testing includes:
- Load testing
- Performance testing
- Usability testing
- Reliability testing
- Security testing
5.2.4 LOAD TESTING:
An important tool for implementing system tests is a load generator. A load generator is essential for testing quality requirements such as performance and stress. A load can be a real load; that is, the system can be put under real usage by having actual telephone users connected to it, who generate the test input data for the system test.
Load Testing
Description | Expected result |
It is necessary to ascertain that the application behaves correctly under loads when ‘Server busy’ response is received. | Should designate another active node as a Server. |
5.2.5 PERFORMANCE TESTING:
Performance tests are used to determine the broadly defined performance of the software system, such as the execution time associated with various parts of the code, response time, and device utilization. The intent of this testing is to identify weak points of the software system and quantify its shortcomings.
PERFORMANCE TESTING:
Description | Expected result |
This is required to assure that the application performs adequately, having the capability to handle many peers, delivering its results in the expected time, and using an acceptable level of resources; it is an aspect of operational management. | Should handle large input values and produce accurate results in the expected time. |
5.2.6 RELIABILITY TESTING:
Software reliability is the ability of a system or component to perform its required functions under stated conditions for a specified period of time, and it is ensured in this testing. Reliability can be expressed as the ability of the software to reveal defects under testing conditions, according to the specified requirements. It is the probability that a software system will operate without failure under given conditions for a given time interval, and it focuses on the behavior of the software element. It forms a part of software quality control.
RELIABILITY TESTING:
Description | Expected result |
This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in providing the application. | In case of failure of the server an alternate server should take over the job. |
5.2.7 SECURITY TESTING:
Security testing evaluates system characteristics that relate to the availability, integrity, and confidentiality of the system's data and services. Users/clients should be encouraged to make sure their security needs are clearly known at requirements time, so that the security issues can be addressed by the designers and testers.
SECURITY TESTING:
Description | Expected result |
Checking that the user identification is authenticated. | In case of failure it should not be connected in the framework. |
Check whether group keys in a tree are shared by all peers. | The peers should know the group key in the same group. |
5.2.8 WHITE BOX TESTING:
White box testing, sometimes called glass-box testing, is a test case design method that uses the control structure of the procedural design to derive test cases. Using the white box testing method, the software engineer can derive test cases that focus on the inner structure of the software to be tested.
WHITE BOX TESTING:
Description | Expected result |
Exercise all logical decisions on their true and false sides. | All the logical decisions must be valid. |
Execute all loops at their boundaries and within their operational bounds. | All the loops must be finite. |
Exercise internal data structures to ensure their validity. | All the data structures must be valid. |
5.2.9 BLACK BOX TESTING:
Black box testing, also called behavioral testing, focuses on the functional requirements of the software. That is, black box testing enables the software engineer to derive sets of input conditions that will fully exercise all functional requirements for a program. Black box testing is not an alternative to white box techniques; rather, it is a complementary approach that is likely to uncover a different class of errors than white box methods. Black box testing attempts to find errors by focusing on the inputs, outputs, and principal function of a software module. The starting point of black box testing is either a specification or code. The contents of the box are hidden, and the stimulated software should produce the desired results.
BLACK BOX TESTING:
Description | Expected result |
To check for incorrect or missing functions. | All the functions must be valid. |
To check for interface errors. | The entire interface must function normally. |
To check for errors in data structures or external database access. | Database update and retrieval must be done correctly. |
To check for initialization and termination errors. | All the functions and data structures must be initialized properly and terminated normally. |
All the above system testing strategies are carried out, as the development, documentation, and institutionalization of the proposed goals and related policies are essential.
CHAPTER 7
7.0 SOFTWARE SPECIFICATION:
7.1 FEATURES OF .NET:
Microsoft .NET is a set of Microsoft software technologies for rapidly building and integrating XML Web services, Microsoft Windows-based applications, and Web solutions. The .NET Framework is a language-neutral platform for writing programs that can easily and securely interoperate. There is no language barrier with .NET: there are numerous languages available to the developer, including Managed C++, C#, Visual Basic and JScript.
The .NET framework provides the foundation for components to interact seamlessly, whether locally or remotely on different platforms. It standardizes common data types and communications protocols so that components created in different languages can easily interoperate.
“.NET” is
also the collective name given to various software components built upon the
.NET platform. These will be both products (Visual Studio.NET and Windows.NET
Server, for instance) and services (like Passport, .NET My Services, and so
on).
7.2 THE .NET FRAMEWORK
The .NET Framework has two main parts:
1. The Common Language Runtime (CLR).
2. A hierarchical set of class libraries.
The CLR is described as the “execution engine” of .NET. It provides the environment within which programs run. The most important features are
- Conversion from a low-level assembler-style language, called Intermediate Language (IL), into code native to the platform being executed on.
- Memory management, notably including garbage collection.
- Checking and enforcing security restrictions on the running code.
- Loading and executing programs, with version control and other such features.
The following features of the .NET framework are also worth describing:
Managed Code
Managed code is code that targets .NET and contains certain extra information ("metadata") to describe itself. Whilst both managed and unmanaged code can run in the runtime, only managed code contains the information that allows the CLR to guarantee, for instance, safe execution and interoperability.
Managed Data
With managed code comes managed data. The CLR provides memory allocation and deallocation facilities, and garbage collection. Some .NET languages use managed data by default, such as C#, Visual Basic.NET and JScript.NET, whereas others, namely C++, do not. Targeting the CLR can, depending on the language you're using, impose certain constraints on the features available. As with managed and unmanaged code, one can have both managed and unmanaged data in .NET applications – data that doesn't get garbage collected but instead is looked after by unmanaged code.
Common Type System
The CLR uses the Common Type System (CTS) to strictly enforce type safety. This ensures that all classes are compatible with each other, by describing types in a common way. The CTS defines how types work within the runtime, which enables types in one language to interoperate with types in another language, including cross-language exception handling. As well as ensuring that types are only used in appropriate ways, the runtime also ensures that code doesn't attempt to access memory that hasn't been allocated to it.
Common Language Specification
The CLR provides built-in support for language interoperability. To ensure that you can develop managed code that can be fully used by developers using any programming language, a set of language features and rules for using them called the Common Language Specification (CLS) has been defined. Components that follow these rules and expose only CLS features are considered CLS-compliant.
7.3 THE CLASS LIBRARY
.NET provides a single-rooted hierarchy of classes, containing over 7000 types. The root of the namespace is called System; this contains basic types like Byte, Double, Boolean, and String, as well as Object. All objects derive from System.Object. As well as objects, there are value types. Value types can be allocated on the stack, which can provide useful flexibility. There are also efficient means of converting value types to object types if and when necessary.
The set of classes is pretty comprehensive, providing collections, file, screen, and network I/O, threading, and so on, as well as XML and database connectivity.
The class library is subdivided into a number of sets (or namespaces), each providing distinct areas of functionality, with dependencies between the namespaces kept to a minimum.
7.4 LANGUAGES SUPPORTED BY .NET
The multi-language capability of the .NET Framework and Visual Studio .NET enables developers to use their existing programming skills to build all types of applications and XML Web services. The .NET framework supports new versions of Microsoft’s old favorites Visual Basic and C++ (as VB.NET and Managed C++), but there are also a number of new additions to the family.
Visual Basic .NET has been updated to include many new and improved language features that make it a powerful object-oriented programming language. These features include inheritance, interfaces, and overloading, among others. Visual Basic also now supports structured exception handling, custom attributes and also supports multi-threading.
Visual Basic .NET is also CLS compliant, which means that any CLS-compliant language can use the classes, objects, and components you create in Visual Basic .NET.
Managed Extensions for C++ and attributed programming are just some of the enhancements made to the C++ language. Managed Extensions simplify the task of migrating existing C++ applications to the new .NET Framework.
C# is Microsoft’s new language. It’s a C-style language that is essentially “C++ for Rapid Application Development”. Unlike other languages, its specification is just the grammar of the language. It has no standard library of its own, and instead has been designed with the intention of using the .NET libraries as its own.
Microsoft Visual J# .NET provides the easiest transition for Java-language developers into the world of XML Web Services and dramatically improves the interoperability of Java-language programs with existing software written in a variety of other programming languages.
Active State has created Visual Perl and Visual Python, which enable .NET-aware applications to be built in either Perl or Python. Both products can be integrated into the Visual Studio .NET environment. Visual Perl includes support for Active State’s Perl Dev Kit.
Other languages for which .NET compilers are available include
- FORTRAN
- COBOL
- Eiffel
| ASP.NET: XML Web Services | Windows Forms |
| Base Class Libraries |
| Common Language Runtime |
| Operating System |
Fig 1. The .NET Framework
C#.NET is also compliant with CLS (Common Language Specification) and supports structured exception handling. CLS is set of rules and constructs that are supported by the CLR (Common Language Runtime). CLR is the runtime environment provided by the .NET Framework; it manages the execution of the code and also makes the development process easier by providing services.
C#.NET is a CLS-compliant language. Any objects, classes, or components created in C#.NET can be used in any other CLS-compliant language. In addition, we can use objects, classes, and components created in other CLS-compliant languages in C#.NET. The use of the CLS ensures complete interoperability among applications, regardless of the languages used to create them.
CONSTRUCTORS AND DESTRUCTORS:
Constructors are used to initialize objects, whereas destructors are used to destroy them; in other words, destructors release the resources allocated to the object. In C#.NET a destructor (finalizer) is available. The finalizer is used to complete the tasks that must be performed when an object is destroyed, and it is called automatically when the object is destroyed. In addition, the finalizer can be invoked only through the class it belongs to or its derived classes, never called directly by user code.
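A minimal C# sketch of a constructor and a destructor (finalizer); the class name and messages are illustrative only:

```csharp
using System;

// LogFile shows both lifecycle hooks: the constructor initializes the
// object, and the ~LogFile finalizer runs when the garbage collector
// destroys it (it cannot be called directly by user code).
class LogFile
{
    private readonly string name;

    public LogFile(string name)          // constructor: initialize the object
    {
        this.name = name;
        Console.WriteLine("opened " + name);
    }

    ~LogFile()                           // destructor/finalizer: release resources
    {
        Console.WriteLine("released " + name);
    }
}
```

Note that the finalizer runs at a time chosen by the garbage collector, so the "released" message appears nondeterministically rather than at a fixed point in the program.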
GARBAGE COLLECTION
Garbage Collection is another new feature in C#.NET. The .NET Framework monitors allocated resources, such as objects and variables. In addition, the .NET Framework automatically releases memory for reuse by destroying objects that are no longer in use.
In C#.NET, the garbage collector checks for the objects that are not currently in use by applications. When the garbage collector comes across an object that is marked for garbage collection, it releases the memory occupied by the object.
OVERLOADING
Overloading is another feature in C#. Overloading enables us
to define multiple procedures with the same name, where each procedure has a
different set of arguments. Besides using overloading for procedures, we can
use it for constructors and properties in a class.
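Overloading as described above can be shown with one method name and several parameter lists; the compiler selects the overload from the argument types. The `Calc.Add` names are ours, chosen only for illustration:

```csharp
using System;

// Three overloads of Add: same name, different parameter types.
static class Calc
{
    public static int Add(int a, int b) => a + b;             // integer addition
    public static double Add(double a, double b) => a + b;    // floating-point addition
    public static string Add(string a, string b) => a + b;    // string concatenation
}
```

A call such as `Calc.Add(2, 3)` binds to the integer overload, while `Calc.Add("a", "b")` binds to the string overload, with no runtime dispatch needed.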
MULTITHREADING:
C#.NET also supports multithreading. An application that supports multithreading can handle multiple tasks simultaneously; we can use multithreading to decrease the time taken by an application to respond to user interaction.
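A small sketch of handling two tasks simultaneously with `System.Threading.Thread`: the work is split across two worker threads and the main thread waits for both to finish. The class and method names are illustrative assumptions:

```csharp
using System;
using System.Threading;

// Each Worker sums a half-open range [from, to); two workers run on
// separate threads at the same time instead of back to back.
class Worker
{
    public long Sum;
    private readonly int from, to;
    public Worker(int from, int to) { this.from = from; this.to = to; }
    public void Run() { for (int i = from; i < to; i++) Sum += i; }
}

static class ParallelSum
{
    public static long Compute(int n)
    {
        var w1 = new Worker(0, n / 2);
        var w2 = new Worker(n / 2, n);
        var t1 = new Thread(w1.Run);     // one thread per worker
        var t2 = new Thread(w2.Run);
        t1.Start(); t2.Start();          // both tasks now run concurrently
        t1.Join(); t2.Join();            // wait for both before combining results
        return w1.Sum + w2.Sum;
    }
}
```

Because each worker writes only its own `Sum` field and the results are combined after `Join`, no locking is needed in this particular sketch.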
STRUCTURED EXCEPTION HANDLING
C#.NET supports structured exception handling, which enables us to detect and handle errors at runtime. In C#.NET, we use try…catch…finally statements to create exception handlers. Using try…catch…finally statements, we can create robust and effective exception handlers that improve the reliability of our applications.
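The try…catch…finally pattern can be illustrated with a division helper: the catch block handles the runtime error, and the finally block runs whether or not an exception was thrown. `SafeMath.Divide` is our own illustrative name:

```csharp
using System;

static class SafeMath
{
    public static string Divide(int a, int b)
    {
        try
        {
            return (a / b).ToString();        // may throw at runtime
        }
        catch (DivideByZeroException)
        {
            return "undefined";               // handle the specific error
        }
        finally
        {
            Console.WriteLine("division attempted");  // always executes
        }
    }
}
```

On the error path the method returns "undefined" instead of crashing, and the finally block still runs, which is where cleanup such as closing files would normally go.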
7.5 THE .NET FRAMEWORK
The .NET Framework is a new computing platform that simplifies application development in the highly distributed environment of the Internet.
OBJECTIVES OF .NET FRAMEWORK
1. To provide a consistent object-oriented programming environment, whether object code is stored and executed locally, executed locally but Internet-distributed, or executed remotely.
2. To provide a code-execution environment that minimizes software deployment conflicts and guarantees safe execution of code.
3. To eliminate the performance problems of scripted or interpreted environments.
There are different types of applications, such as Windows-based applications and Web-based applications.
7.6 FEATURES OF SQL-SERVER
The OLAP Services feature available in SQL Server version 7.0 is now called SQL Server 2000 Analysis Services; the term OLAP Services has been replaced with the term Analysis Services. Analysis Services also includes a new data mining component. The Repository component available in SQL Server version 7.0 is now called Microsoft SQL Server 2000 Meta Data Services, and references to the component now use the term Meta Data Services. The term repository is used only in reference to the repository engine within Meta Data Services.
A SQL-SERVER database consists of the following types of objects:
1. TABLE
2. QUERY
3. FORM
4. REPORT
5. MACRO
7.7 TABLE:
A database is a collection of data about a specific topic.
VIEWS OF TABLE:
We can work with a table in two types,
1. Design View
2. Datasheet View
Design View
To build or modify the structure of a table, we work in the table design view. We can specify what kind of data will be held.
Datasheet View
To add, edit, or analyze the data itself, we work in the table's datasheet view mode.
QUERY:
A query is a question posed to the data. Access gathers the data that answers the question from one or more tables. The data that makes up the answer is either a dynaset (if you can edit it) or a snapshot (which cannot be edited). Each time we run a query, we get the latest information in the dynaset. Access either displays the dynaset or snapshot for us to view, or performs an action on it, such as deleting or updating.
CHAPTER 7
APPENDIX
7.1 SAMPLE SOURCE CODE
7.2 SAMPLE OUTPUT
CHAPTER 8
8.0 CONCLUSION AND FUTURE WORK:
We presented SybilTrust, a defense against the Sybil attack in P2P e-commerce. Compared to other approaches, our approach is based on neighborhood similarity trust in a group P2P e-commerce community, and exploits the relationship between peers in a neighborhood setting. Our results on real-world P2P e-commerce confirmed the fast-mixing property and hence validated the fundamental assumption behind SybilGuard's approach. We also describe defense types such as key validation, distribution, and position verification; these methods can be applied simultaneously with neighbor similarity trust, which gives a better defense mechanism. For future work, we intend to implement SybilTrust in the context of peers which exist in many groups. Neighbor similarity trust helps to weed out Sybil peers and isolate maliciousness to specific Sybil peer groups, rather than allowing attacks on honest groups with all honest peers.
Malware Propagation in Large-Scale Networks
Malware Propagation in Large-Scale NetworksAbstract—Malware is pervasive in networks, and poses a critical threat to network security. However, we have very limitedunderstanding of malware behavior in networks to date. In this paper, we investigate how malware propagates in networks from aglobal perspective. We formulate the problem, and establish a rigorous two layer epidemic model for malware propagation fromnetwork to network. Based on the proposed model, our analysis indicates that the distribution of a given malware follows exponentialdistribution, power law distribution with a short exponential tail, and power law distribution at its early, late and final stages, respectively.Extensive experiments have been performed through two real-world global scale malware data sets, and the results confirm ourtheoretical findings.Index Terms—Malware, propagation, modelling, power lawÇ1 INTRODUCTIONMALWARE are malicious software programs deployedby cyber attackers to compromise computer systemsby exploiting their security vulnerabilities. Motivated byextraordinary financial or political rewards, malware ownersare exhausting their energy to compromise as many networkedcomputers as they can in order to achieve theirmalicious goals. A compromised computer is called a bot,and all bots compromised by a malware form a botnet. Botnetshave become the attack engine of cyber attackers, andthey pose critical challenges to cyber defenders. In order tofight against cyber criminals, it is important for defenders tounderstand malware behavior, such as propagation ormembership recruitment patterns, the size of botnets, anddistribution of bots.To date, we do not have a solid understanding about thesize and distribution of malware or botnets. Researchershave employed various methods to measure the size of botnets,such as botnet infiltration [1], DNS redirection [3],external information [2]. These efforts indicate that the sizeof botnets varies from millions to a few thousand. 
There areno dominant principles to explain these variations. As aresult, researchers desperately desire effective models andexplanations for the chaos. Dagon et al. [3] revealed thattime zone has an obvious impact on the number of availablebots. Mieghem et al. [4] indicated that network topology hasan important impact on malware spreading through theirrigorous mathematical analysis. Recently, the emergence ofmobile malware, such as Cabir [5], Ikee [6], and Brador [7],further increases the difficulty level of our understandingon how they propagate. More details about mobile malwarecan be found at a recent survey paper [8]. To the best of ourknowledge, the best finding about malware distribution inlarge-scale networks comes from Chen and Ji [9]: the distributionis non-uniform. All this indicates that the research inthis field is in its early stage.The epidemic theory plays a leading role in malwarepropagation modelling. The current models for malwarespread fall in two categories: the epidemiology model andthe control theoretic model. The control system theorybased models try to detect and contain the spread of malware[10], [11]. The epidemiology models are more focusedon the number of compromised hosts and their distributions,and they have been explored extensively in the computerscience community [12], [13], [14]. Zou et al. [15] useda susceptible-infected (SI) model to predict the growth ofInternet worms at the early stage. Gao and Liu [16] recentlyemployed a susceptible-infected-recovered (SIR) model todescribe mobile virus propagation. One critical conditionfor the epidemic models is a large vulnerable populationbecause their principle is based on differential equations.More details of epidemic modelling can be find in [17]. Aspointed by Willinger et al. [18], the findings, which weextract from a set of observed data, usually reflect parts ofthe studied objects. 
It is more reliable to extract theoretical results from appropriate models with confirmation from sufficient real-world data set experiments. We practice this principle in this study.

In this paper, we study the distribution of malware in terms of networks (e.g., autonomous systems (AS), ISP domains, and abstract networks of smartphones that share the same vulnerabilities) at large scales. In this kind of setting, we have a sufficient volume of data at a large enough scale to meet the requirements of the SI model.

- S. Yu is with the School of Information Technology, Deakin University, Burwood, Victoria 3125, Australia. E-mail: syu@deakin.edu.au.
- G. Gu is with the Department of Computer Science and Engineering, Texas A&M University, College Station, TX 77843-3112. E-mail: guofei@cse.tamu.edu.
- A. Barnawi is with the Faculty of Computing and IT, King Abdulaziz University, Jeddah, Saudi Arabia. E-mail: ambarnawi@kau.edu.sa.
- S. Guo is with the School of Computer Science and Engineering, The University of Aizu, Aizuwakamatsu, Japan. E-mail: sguo@u-aizu.ac.jp.
- I. Stojmenovic is with the School of Information Technology, Deakin University, Australia; King Abdulaziz University, Jeddah, Saudi Arabia; and the School of EECS, University of Ottawa, Ottawa, ON K1N 6N5, Canada. E-mail: ivan@site.uottawa.ca.

Manuscript received 1 Jan. 2014; revised 14 Apr. 2014; accepted 15 Apr. 2014. Date of publication 28 Apr. 2014; date of current version 1 Dec. 2014. Recommended for acceptance by F. Bonchi. Digital Object Identifier no. 10.1109/TKDE.2014.2320725. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 27, NO. 1, JANUARY 2015. 1041-4347 © 2014 IEEE.

Different from the
traditional epidemic models, we break our model into two layers. First of all, for a given time since the breakout of a malware, we calculate how many networks have been compromised based on the SI model. Second, for a compromised network, we calculate how many hosts have been compromised since the time that the network was compromised. With this two-layer model in place, we can determine the total number of compromised hosts and their distribution in terms of networks. Through our rigorous analysis, we find that the distribution of a given malware follows an exponential distribution at its early stage, obeys a power law distribution with a short exponential tail at its late stage, and finally converges to a power law distribution. We examine our theoretical findings through two large-scale real-world data sets: the Android based malware [19] and Conficker [20]. The experimental results strongly support our theoretical claims. To the best of our knowledge, the proposed two-layer epidemic model and the findings are the first work in the field.

Our contributions are summarized as follows.

- We propose a two-layer malware propagation model to describe the development of a given malware at the Internet level. Compared with the existing single-layer epidemic models, the proposed model represents malware propagation better in large-scale networks.
- We find that the malware distribution in terms of networks varies from exponential, to power law with a short exponential tail, and to power law distribution at its early, late, and final stage, respectively. These findings are first theoretically proved based on the proposed model, and then confirmed by the experiments through the two large-scale real-world data sets.

The rest of the paper is structured as follows. Related work is briefly listed in Section 2.
We present the preliminaries for the proposed model in Section 3. The studied problem is discussed in Section 4. A two-layer malware propagation model is established in Section 5, followed by a rigorous mathematical analysis in Section 6. Experiments are conducted to confirm our findings in Section 7. In Section 8, we provide a further discussion about the study. Finally, we summarize the paper and present future work in Section 9.

2 RELATED WORK

The basic story of malware is as follows. A malware programmer writes a program, called a bot or agent, and then installs the bots at compromised computers on the Internet using various network virus-like techniques. All of his bots form a botnet, which is controlled by its owner to commit illegal tasks, such as launching DDoS attacks, sending spam emails, performing phishing activities, and collecting sensitive information. There is a command and control (C&C) server (or servers) to communicate with the bots and collect data from them. In order to disguise himself from legal forces, the botmaster changes the URL of his C&C frequently, e.g., weekly. An excellent explanation of this can be found in [1].

With the significant growth of smartphones, we have witnessed an increasing number of mobile malware. Malware writers have developed many mobile malware programs in recent years. Cabir [5] was developed in 2004, and was the first malware targeting the Symbian operating system for mobile devices. Moreover, it was also the first malware propagating via Bluetooth. Ikee [6] was the first mobile malware against Apple iPhones, while Brador [7] was developed against Windows CE operating systems. The attack vectors for mobile malware are diverse, such as SMS, MMS, Bluetooth, WiFi, and Web browsing. Peng et al. [8] presented the short history of mobile malware since 2004, and surveyed their propagation models.

A direct method to count the number of bots is to use botnet infiltration to count the bot IDs or IP addresses. Stone-Gross et al.
[1] registered the URL of the Torpig botnet before the botmaster did, and therefore were able to hijack the C&C server for ten days and collect about 70 GB of data from the bots of the Torpig botnet. They reported that the footprint of the Torpig botnet was 182,800, and the median and average size of Torpig's live population was 49,272 and 48,532, respectively. They found 49,294 new infections during the ten days of takeover. Their research also indicated that the live population fluctuates periodically as users switch between being online and offline. This issue was also tackled by Dagon et al. in [3].

Another method is to use DNS redirection. Dagon et al. [3] analyzed bots captured by honeypots, and then identified the C&C server using source code reverse engineering tools. They then manipulated the DNS entry related to a botnet's IRC server, and redirected the DNS requests to a local sinkhole. They could therefore count the number of bots in the botnet. As discussed previously, their method counts the footprint of the botnet, which was 350,000 in their report.

In this paper, we use two large-scale malware data sets for our experiments. Conficker is a well-known and one of the most recently widespread malware. Shin et al. [20] collected a data set of about 25 million Conficker victims from all over the world at different levels. At the same time, malware targeting Android based mobile systems has been developing quickly in recent years. Zhou and Jiang [19] collected a large data set of Android based malware.

In [2], Rajab et al. pointed out that it is inaccurate to count the unique IP addresses of bots, because DHCP and NAT techniques are employed extensively on the Internet ([1] confirms this by their observation that 78.9 percent of the infected machines were behind a NAT, VPN, proxy, or firewall). They therefore proposed to examine the hits on DNS caches to find a lower bound on the size of a given botnet.

Rajab et al.
[21] reported that botnets can be categorized into two major genres in terms of membership recruitment: worm-like botnets and variable scanning botnets. The latter accounts for about 82 percent of the 192 IRC bots that they investigated, and is the more prevalent class seen currently. Such botnets usually perform localized and non-uniform scanning, and are difficult to track due to their intermittent and continuously changing behavior. The statistics on the lifetime of bots are also reported: 25 minutes on average, with 90 percent of them staying for less than 50 minutes.

Malware propagation modelling has been extensively explored. Based on epidemiology research, Zou et al. [15] proposed a number of models for malware monitoring at the early stage. They pointed out that these kinds of models are appropriate for a system that consists of a large number of vulnerable hosts; in other words, the model is effective at the early stage of the outbreak of malware, and the accuracy of the model drops when the malware develops further. As a variant of the epidemic category, Sellke et al. [12] proposed a stochastic branching process model for characterizing the propagation of Internet worms, which especially focuses on the number of compromised computers against the number of worm scans, and presented a closed-form expression for the relationship. Dagon et al. [3] extended the model of [15] by introducing time zone information \alpha(t), and built a model to describe the impact of the diurnal effect on the number of live members of botnets.

The impact of side information on the spreading behavior of network viruses has also been explored. Ganesh et al. [22] thoroughly investigated the effect of network topology on the spread of epidemics.
By combining graph theory and an SIS (susceptible-infective-susceptible) model, they found that if the ratio of cure to infection rates is smaller than the spectral radius of the graph of the studied network, then the average epidemic lifetime is of order \log n, where n is the number of nodes. On the other hand, if the ratio is larger than a generalization of the isoperimetric constant of the graph, then the average epidemic lifetime is of order e^{n^{a}}, where a is a positive constant. Similarly, Mieghem et al. [4] applied the N-intertwined Markov chain model, an application of mean field theory, to analyze the spread of viruses in networks. They found that

  \tau_c = 1 / \lambda_{\max}(A),

where \tau_c is the sharp epidemic threshold, and \lambda_{\max}(A) is the largest eigenvalue of the adjacency matrix A of the studied network. Moreover, there have been many other methodologies to tackle the problem, such as game theory [23].

3 PRELIMINARIES

Preliminaries of epidemic modelling and complex networks are presented in this section, as this work is mainly based on these two fields. For the sake of convenience, we summarize the symbols that we use in this paper in Table 1.

3.1 Deterministic Epidemic Models

After nearly 100 years of development, the epidemic models [17] have proved effective and appropriate for a system that possesses a large number of vulnerable hosts. In other words, they are suitable at a macro level. Zou et al. [15] demonstrated that they were suitable for studies of Internet based virus propagation at the early stage.

We note that there are many factors that impact malware propagation or botnet membership recruitment, such as network topology, recruitment frequency, and connection status of vulnerable hosts. All these factors contribute to the speed of malware propagation. Fortunately, we can fold all of them into one parameter, the infection rate \beta of epidemic theory. Therefore, in our study, let N be the total number of vulnerable hosts of a large-scale network (e.g., the Internet) for a given malware.
There are two statuses for any one of the N hosts: either infected or susceptible. Let I(t) be the number of infected hosts at time t; then we have

  dI(t)/dt = \beta(t) [N - R(t) - I(t) - Q(t)] I(t) - dR(t)/dt,  (1)

where R(t) and Q(t) represent the number of removed hosts from the infected population and the number of removed hosts from the susceptible population at time t, respectively. The variable \beta(t) is the infection rate at time t.

For our study, model (1) is more detailed than necessary, as we only expect to know the propagation and distribution of a given malware. As a result, we employ the following susceptible-infected model:

  dI(t)/dt = \beta I(t) [N - I(t)],  (2)

where the infection rate \beta is a constant for a given malware on any network.

We note that the variable t is continuous in models (1) and (2). In practice, we measure I(t) at discrete time points, t = 0, 1, 2, .... We can interpret each time point as a new round of malware membership recruitment, such as vulnerable host scanning. As a result, we can transform model (2) into the discrete form

  I(t) = (1 + \alpha \Delta) I(t-1) - \beta \Delta I(t-1)^2,  (3)

where t = 0, 1, 2, ...; \Delta is the unit of time; I(0) is the initial number of infected hosts (we also call them seeds in this paper); and \alpha = \beta N, which represents the average number of vulnerable hosts that can be infected by one infected host per time unit.

In order to simplify our analysis, let \Delta = 1; it could be one second, one minute, one day, or one month, even one year, depending on the time scale in a given context. Hence, we have the simpler discrete form

  I(t) = (1 + \alpha) I(t-1) - \beta I(t-1)^2.  (4)

Based on Equation (4), we define the increase of infected hosts for each time unit as

  \Delta I(t) := I(t) - I(t-1), t = 1, 2, ....  (5)

To date, many studies are confined to the "early stage" of an epidemic, such as [15]. Under the early stage condition, I(t) << N, and therefore N - I(t) \approx N.
As a result, a closed-form solution is obtained as follows:

  I(t) = I(0) e^{\beta N t}.  (6)

TABLE 1. Notations of symbols in this paper.

When we take the natural logarithm of both sides of Equation (6), we have

  \ln I(t) = \beta N t + \ln I(0).  (7)

For a given vulnerable network, \beta, N, and I(0) are constants; therefore, the graphical representation of Equation (7) is a straight line.

Based on the definition in Equation (5), we obtain the increase of new members of a malware at the early stage as

  \Delta I(t) = (e^{\beta N} - 1) I(t-1) = (e^{\beta N} - 1) I(0) e^{\beta N (t-1)}.  (8)

Taking the logarithm of both sides of (8), we have

  \ln \Delta I(t) = \beta N (t-1) + \ln((e^{\beta N} - 1) I(0)).  (9)

Similar to Equation (7), the graphical representation of Equation (9) is also a straight line. In other words, the number of recruited members for each round follows an exponential distribution at the early stage.

We have to note that it is hard to know in practice whether an epidemic is at its early stage or not. Moreover, there is no mathematical definition of the term early stage.

In epidemic models, the infection rate \beta has a critical impact on the membership recruitment progress, and \beta is usually a small positive number, such as 0.00084 for the worm Code Red [12]. For example, for a network with N = 10,000 vulnerable hosts, we show the recruited members under different infection rates in Fig. 1. From this diagram, we can see that the recruitment goes slowly when \beta = 0.0001; however, all vulnerable hosts have been compromised in less than 7 time units when \beta = 0.0003, and the recruitment progresses in an exponential fashion.

This reflects the malware propagation styles in practice. For malware based on "contact", such as Bluetooth contacts, or viruses depending on emails to propagate, the infection rate is usually small, and it takes a long time to compromise a large number of vulnerable hosts in a given network.
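Equations (6), (8), and (9) can be checked numerically: for the early-stage closed form, the per-round increments are themselves exponential in t, so consecutive values of \ln \Delta I(t) differ by exactly \beta N. A small sketch (the values \beta N = 0.2 and I(0) = 10 are illustrative choices of ours, not taken from the data sets):

```python
import math

bN, I0 = 0.2, 10.0  # illustrative early-stage parameters (our choice)

# Equation (6): early-stage closed form I(t) = I(0) * exp(bN * t).
I = [I0 * math.exp(bN * t) for t in range(8)]

# Equation (5): per-round increments Delta I(t) = I(t) - I(t-1).
dI = [I[t] - I[t - 1] for t in range(1, 8)]

# Equation (8): Delta I(t) = (e^{bN} - 1) * I(0) * e^{bN (t-1)}.
predicted = [(math.exp(bN) - 1) * I0 * math.exp(bN * (t - 1))
             for t in range(1, 8)]

# Equation (9): ln Delta I(t) is linear in t, so consecutive
# log-increments all equal bN.
log_slopes = [math.log(dI[i + 1]) - math.log(dI[i])
              for i in range(len(dI) - 1)]
```

The increments match the prediction of Equation (8) term by term, and the log-increments are constant, which is the straight-line behaviour claimed after Equation (9).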
On the other hand, some malware take active actions for recruitment, such as vulnerable host scanning; such malware may take only one or a few rounds of scanning to recruit all or a majority of the vulnerable hosts in a given network. We will apply this in the following analysis and performance evaluation.

3.2 Complex Networks

Research on complex networks has demonstrated that the number of hosts of networks follows the power law. People have found that size distributions usually follow the power law, such as populations of cities in a country or personal income in a nation [24]. In terms of the Internet, researchers have also discovered many power law phenomena, such as the size distribution of web files [25]. Recent progress reported in [26] further demonstrated that the size of networks follows the power law.

The power law has two expression forms: the Pareto distribution and the Zipf distribution. For the same objects obeying the power law, we can use either of them as the representation. However, the Zipf distributions are tidier than the expressions of the Pareto distributions. In this paper, we will use Zipf distributions to represent the power law. The Zipf expression is as follows:

  \Pr\{x = i\} = C / i^{\alpha},  (10)

where C is a constant, \alpha is a positive parameter called the Zipf index, \Pr\{x = i\} represents the probability of the i-th (i = 1, 2, ...) largest object in terms of size, and \sum_i \Pr\{x = i\} = 1.

A more general form of the distribution is called the Zipf-Mandelbrot distribution [27], which is defined as follows:

  \Pr\{x = i\} = C / (i + q)^{\alpha},  (11)

where the additional constant q (q >= 0) is called the plateau factor, which makes the probability of the highest ranked objects flat. The Zipf-Mandelbrot distribution becomes the Zipf distribution when q = 0.

Currently, the common way to claim that a distribution is a power law is to take the log-log plot of the data; we usually say it is a power law if the result shows a straight line. We have to note that this is not a rigorous method; however, it is widely applied in practice.
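Equations (10) and (11) only fix the shape of the distribution; the constant C comes from normalisation. A minimal sketch of the two forms (the function name and parameter values are ours):

```python
def zipf_mandelbrot_pmf(M, alpha, q=0.0):
    """Pr{x = i} = C / (i + q)^alpha over ranks i = 1..M (Equation (11)).
    q = 0 recovers the plain Zipf form of Equation (10); the constant C
    is fixed by requiring the probabilities to sum to 1."""
    weights = [1.0 / (i + q) ** alpha for i in range(1, M + 1)]
    C = 1.0 / sum(weights)
    return [C * w for w in weights]

zipf = zipf_mandelbrot_pmf(M=1000, alpha=1.5)        # plain Zipf
zm = zipf_mandelbrot_pmf(M=1000, alpha=1.5, q=5.0)   # flattened head
```

With q > 0 the probabilities of the top-ranked objects flatten, which is how the flat head observed later in the Conficker log-log plots (Fig. 6) is explained.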
Power law distributions enjoy one important property: they are scale free. We refer interested readers to [28] for the power law and its properties.

4 PROBLEM DESCRIPTION

In this section, we describe the malware propagation problem in general. As shown in Fig. 2, we study the malware propagation issue at two levels: the Internet level and the network level. We note that at the network level, a network could be defined in many different ways; it could be an ISP domain, a country network, the group of a specific kind of mobile device, and so on. At the Internet level, we treat every network of the network level as one element.

Fig. 1. The impact of the infection rate \beta on the recruitment progress for a given vulnerable network with N = 10,000.

At the Internet level, we suppose there are M networks, each denoted as L_i (1 <= i <= M). For any network L_i, we suppose it physically possesses N_i hosts. Moreover, we denote the probability that a host of L_i is vulnerable as p_i (0 <= p_i <= 1). In general, it is highly possible that N_i \neq N_j and p_i \neq p_j for i \neq j, 1 <= i, j <= M. Moreover, due to differences in network topology, operating system, security investment, and so on, the infection rates differ from network to network; we denote the infection rate of L_i as \beta_i. Similarly, it is highly possible that \beta_i \neq \beta_j for i \neq j, 1 <= i, j <= M.

For any given network L_i with p_i N_i vulnerable hosts and infection rate \beta_i, we suppose the malware propagation starts at time 0. Based on Equation (4), we obtain the number of infected hosts, I_i(t), of L_i at time t as follows:

  I_i(t) = (1 + \alpha_i) I_i(t-1) - \beta_i I_i(t-1)^2
         = (1 + \beta_i p_i N_i) I_i(t-1) - \beta_i I_i(t-1)^2.  (12)

In this paper, we are interested in a global sense of malware propagation. We study the following question: for a given time t since the outbreak of a malware, what are the characteristics of the number of compromised hosts of each network in the view of the whole Internet?
In other words, we want to find a function F of I_i(t) (1 <= i <= M), namely the pattern of

  F(I_1(t), I_2(t), ..., I_M(t)).  (13)

For simplicity of presentation, we use S(L_i, t) to replace I_i(t) at the network level, and I(t) is dedicated to the Internet level. Following Equation (12), for any network L_i (1 <= i <= M), we have

  S(L_i, t) = (1 + \beta_i p_i N_i) S(L_i, t-1) - \beta_i S(L_i, t-1)^2.  (14)

At the Internet level, we suppose there are k_1, k_2, ..., k_t networks that have been compromised at each round for each time unit from 1 to t. Any k_i (1 <= i <= t) is decided by Equation (4) as follows:

  k_i = (1 + \beta_n M) I(i-1) - \beta_n I(i-1)^2,  (15)

where M is the total number of networks over the Internet, and \beta_n is the infection rate among networks. Moreover, suppose the number of seeds, k_0, is known.

At this time point t, the landscape of the compromised hosts in terms of networks is as follows:

  S(L^1_{k_1}, t),   S(L^2_{k_1}, t),   ..., S(L^{k_1}_{k_1}, t)        (k_1 networks)
  S(L^1_{k_2}, t-1), S(L^2_{k_2}, t-1), ..., S(L^{k_2}_{k_2}, t-1)      (k_2 networks)
  ...
  S(L^1_{k_t}, 1),   S(L^2_{k_t}, 1),   ..., S(L^{k_t}_{k_t}, 1),      (k_t networks)  (16)

where L^j_{k_i} represents the j-th network that was compromised at round i.
In other words, there are k_1 compromised networks each of which has progressed t time units; k_2 compromised networks each of which has progressed t-1 time units; and so on, down to k_t compromised networks each of which has progressed 1 time unit.

It is natural to obtain the total number of compromised hosts at the Internet level as

  I(t) = S(L^1_{k_1}, t) + S(L^2_{k_1}, t) + ... + S(L^{k_1}_{k_1}, t)
       + S(L^1_{k_2}, t-1) + ... + S(L^{k_2}_{k_2}, t-1)
       + ...
       + S(L^1_{k_t}, 1) + S(L^2_{k_t}, 1) + ... + S(L^{k_t}_{k_t}, 1).  (17)

Suppose k_i (i = 1, 2, ...) follows a distribution with probability distribution p_n (n stands for number), and the size of a compromised network, S(L_i, t), follows another probability distribution p_s (s stands for size). Let p_I be the probability distribution of I(t) (t = 0, 1, ...). Based on Equation (17), we find that p_I is exactly the convolution of p_n and p_s:

  p_I = p_n \ast p_s,  (18)

where \ast is the convolution operation. Our goal is to find the pattern of p_I in Equation (18).

5 MALWARE PROPAGATION MODELLING

As shown in Fig. 2, we abstract the M networks of the Internet into M basic elements in our model. As a result, any two large networks, L_i and L_j (i \neq j), are similar to each other at this level. Therefore, we can model the studied problem as a homogeneous system; namely, all the M networks share the same vulnerability probability (denoted as p) and the same infection rate (denoted as \beta).
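The landscape of Equation (16) and the total of Equation (17) can be evaluated directly once the per-round recruitment counts k_1, ..., k_t and the vulnerable populations of the recruited networks are given. A minimal sketch under the assumptions above (all names and the toy parameters are ours; the within-network term iterates the SI recurrence of Equation (14), with a clamp added as a guard against overshoot):

```python
def si_size(Nv, beta, I0, rounds):
    """Within-network SI recurrence (Equation (14)) run for `rounds`
    steps. Nv is the network's vulnerable population; the clamp at Nv
    is our own addition, since the raw map can overshoot Nv."""
    I, alpha = float(I0), beta * Nv
    for _ in range(rounds):
        I = min((1 + alpha) * I - beta * I * I, Nv)
    return I

def total_compromised(k, sizes, beta, I0=1.0):
    """Equation (17): sum the sizes of all compromised networks, where
    the k[r] networks recruited at round r+1 have each progressed
    t - r time units by time t = len(k). `sizes` lists the vulnerable
    population of each network in recruitment order."""
    t = len(k)
    total, idx = 0.0, 0
    for r, kr in enumerate(k):
        for _ in range(kr):
            total += si_size(sizes[idx], beta, I0, t - r)
            idx += 1
    return total

# Toy landscape: k_1 = 1, k_2 = 2, k_3 = 4 networks recruited per round.
sizes = [2000.0, 1500.0, 1200.0, 900.0, 800.0, 700.0, 600.0]
I_t = total_compromised(k=[1, 2, 4], sizes=sizes, beta=0.001)
```

Networks recruited earlier have evolved for more rounds, so they contribute larger terms to the sum, which is the intuition behind the tail argument used later in Theorem 3.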
A simple way to obtain these two parameters is to use the means:

  p = (1/M) \sum_{i=1}^{M} p_i,
  \beta = (1/M) \sum_{i=1}^{M} \beta_i.  (19)

Fig. 2. The system architecture of the studied malware propagation.

For any network L_i, let N_i denote its total number of vulnerable hosts; then we have

  N_i = p \bar{N}_i, i = 1, 2, ..., M,  (20)

where \bar{N}_i is the total number of computers of network L_i.

As discussed in Section 3, we know that \bar{N}_i (i = 1, 2, ..., M) follows the power law. As p is a constant in Equation (20), N_i (i = 1, 2, ..., M) follows the power law as well. Without loss of generality, let L_i represent the i-th largest network in terms of total vulnerable hosts (N_i). Based on the Zipf distribution, if we randomly choose a network X, the probability that it is network L_j is

  \Pr\{X = L_j\} = p_z(j) = N_j / \sum_{i=1}^{M} N_i = C / j^{\alpha}.  (21)

Equation (21) shows clearly that a network with a larger number of vulnerable hosts has a higher probability of being compromised.

Following Equation (16), at time t, we have k_1 + k_2 + ... + k_t networks that have been compromised. Combining this with Equation (21), in general, the first round of recruitment takes the k_1 largest networks, the second round takes the k_2 largest networks among the remaining networks, and so on. We therefore can simplify Equation (17) as

  I(t) = \sum_{j=1}^{k_1} S(N_j, t) p_z(j)
       + \sum_{j=1}^{k_2} S(N_{k_1+j}, t-1) p_z(k_1 + j)
       + ...
       + \sum_{j=1}^{k_t} S(N_{k_1+...+k_{t-1}+j}, 1) p_z(k_1 + ... + k_{t-1} + j).  (22)

From Equation (22), we know the total number of compromised hosts and their distribution in terms of networks for a given time point t.

6 ANALYSIS OF THE PROPOSED MALWARE PROPAGATION MODEL

In this section, we try to extract the pattern of I(t) in terms of S(L_i, t'), i.e., the distribution p_I of Equation (18). We make the following definitions before we proceed with the analysis.

1) Early stage. An early stage of the breakout of a malware means only a small percentage of vulnerable hosts have been compromised, and the propagation follows exponential distributions.

2) Final stage.
The final stage of the propagation of a malware means that all vulnerable hosts of a given network have been compromised.

3) Late stage. A late stage means the time interval between the early stage and the final stage.

We note that many studies focus on the early stage, and we define the early stage to meet the pervasively accepted condition; we coin the other two terms for the convenience of our following discussion. Moreover, we set the variable T_e as the time point at which a malware's progress transfers from its early stage to its late stage. In mathematical expressions, the early, late, and final stages correspond to 0 <= t < T_e, T_e <= t < \infty, and t = \infty, respectively.

Due to the complexity of Equation (22), it is difficult to obtain conclusions in a dynamic style. However, we are able to extract some conclusions under special conditions.

Lemma 1. If distributions p(x) and q(x) follow exponential distributions, then p(x) \ast q(x) follows an exponential distribution as well.

Due to the space limitation, we skip the proof and refer interested readers to [29].

At the early stage of a malware breakout, we are in a good position to obtain a clear conclusion.

Theorem 1. For large-scale networks, such as the Internet, at the early stage of a malware propagation, the malware distribution in terms of networks follows exponential distributions.

Proof.
At a time point of the early stage (0 <= t < T_e) of a malware breakout, following Equation (6), we obtain the number of compromised networks as

  I(t) = I(0) e^{\beta_n M t}.  (23)

It is clear that I(t) follows an exponential distribution. For any of the compromised networks, we suppose it has progressed t' (0 < t' <= t < T_e) time units, and its size is

  S(L_i, t') = I_i(0) e^{\beta N_i t'}.  (24)

Based on Equation (24), we find that the size of any compromised network follows an exponential distribution. As a result, all the sizes of compromised networks follow exponential distributions at the early stage. Based on Lemma 1, we obtain that the malware distribution in terms of networks follows exponential distributions at its early stage. □

Moreover, we can obtain a concrete conclusion on the propagation of malware at the final stage.

Theorem 2. For large-scale networks, such as the Internet, at the final stage (t = \infty) of a malware propagation, the malware distribution in terms of networks follows the power law distribution.

Proof. At the final stage, all vulnerable hosts have been compromised; namely,

  S(L_i, \infty) = N_i, i = 1, 2, ..., M.

Based on our previous discussion, we know that N_i (i = 1, 2, ..., M) follows the power law. As a result, the theorem holds. □

Now, we move our study to the late stage of malware propagation.

Theorem 3. For large-scale networks, such as the Internet, at the late stage (T_e <= t < \infty) of a malware breakout, the malware distribution includes two parts: a dominant power law body and a short exponential tail.

Proof. Suppose a malware propagation has progressed for t (t >> T_e) time units. Let t' = t - T_e. If we separate all the compromised I(t) hosts by time point t', we have two groups of compromised hosts. Following Theorem 2, as t' >> T_e, the hosts compromised before t' follow the power law. At the same time, all the networks compromised after t' are still in their early stage.
Therefore, these recently compromised networks follow exponential distributions.

Now, we need to prove that the networks compromised after time point t' are at the tail of the distribution. First of all, for a given network L_i and t_1 > t_2, we have

  S(L_i, t_1) >= S(L_i, t_2).  (25)

For two networks, L_i and L_j, if N_i >= N_j, then L_i should be compromised earlier than L_j. Combining this with (25), we know the later compromised networks usually lie at the tail of the distribution. Due to the fact that t' >> T_e, the length of the exponential tail is much shorter than the length of the main body of the distribution. □

7 PERFORMANCE EVALUATION

In this section, we examine our theoretical analysis through two well-known large-scale malware families: Android malware and Conficker. Android malware is a recent, fast developing, and dominant smartphone based malware [19]. Different from Android malware, the Conficker worm is an Internet based state-of-the-art botnet [20]. Both data sets have been widely used by the community.

From the Android malware data set, we have an overview of the malware development from August 2010 to October 2011. There are 1,260 samples in total from 49 different Android malware families in the data set. A given Android malware program only focuses on one or a number of specific vulnerabilities. Therefore, all smartphones sharing these vulnerabilities form a specific network for that Android malware. In other words, there are 49 networks in the data set, and it is reasonable to assume that the population of each network is huge. We sort the malware subclasses according to their size (number of samples in the data set), and present them in a log-log format in Fig. 3; the diagram is roughly a straight line. In other words, we can say that the Android malware distribution in terms of networks follows the power law.

We now examine the growth pattern of the total number of compromised hosts of Android malware against time, namely, the pattern of I(t). We extract the data from the data set and present it in Table 2.
We further transform the data into a graph as shown in Fig. 4. It shows that the member recruitment of Android malware follows an exponential distribution nicely during the 15-month time interval. We have to note that our experiments also indicate that this data does not fit the power law (we do not show them here due to space limitation).

Fig. 3. The probability distribution of Android malware in terms of networks.
TABLE 2. The number of different Android malware against time (months) in 2010-2011.
Fig. 4. The growth of total compromised hosts by Android malware against time from August 2010 to October 2011.

In Fig. 4, we match a straight line to the real data through the least squares method. Based on the data, we can estimate that the number of seeds (I(0)) is 10, and \alpha = 0.2349. Following our previous discussion, we infer that the propagation of Android malware was in its early stage. This is reasonable, as the size of each Android vulnerable network is huge and the infection rate is quite low (the infection is basically based on contacts).

We also collected a large data set of Conficker from various aspects. Due to the space limitation, we can only present a few of them here to examine our theoretical analysis. First of all, we treat ASs as networks in the Internet. In general, ASs are large-scale elements of the Internet. A few key statistics from the data set are listed in Table 3. We present the data in a log-log format in Fig. 5, which indicates that the distribution does follow the power law.

A unique feature of the power law is the scale free property. In order to examine this feature, we measure the compromised hosts in terms of domain names at three different domain levels: the top level, level 1, and level 2, respectively. Some statistics of this experiment are listed in Table 4. Once again, we present the data in a log-log format in Figs. 6a, 6b, and 6c, respectively.
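The straight-line fit used for Fig. 4 is an ordinary least-squares fit of \ln I(t) against t (cf. Equation (7)). A minimal sketch of such a fit on synthetic monthly counts generated with the rate and seed count estimated above (\alpha = 0.2349, I(0) = 10); the function name and the synthetic data are ours, not the paper's data set:

```python
import math

def fit_exponential_growth(counts):
    """Least-squares line through (t, ln I(t)) (cf. Equation (7)):
    ln I(t) = a * t + ln I(0). Returns (a, estimated I(0))."""
    n = len(counts)
    ys = [math.log(c) for c in counts]
    t_mean = (n - 1) / 2.0          # mean of t = 0 .. n-1
    y_mean = sum(ys) / n
    num = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(ys))
    den = sum((t - t_mean) ** 2 for t in range(n))
    a = num / den
    return a, math.exp(y_mean - a * t_mean)

# Synthetic 15 months of counts at rate 0.2349 per month from 10 seeds.
counts = [10 * math.exp(0.2349 * t) for t in range(15)]
a_hat, I0_hat = fit_exponential_growth(counts)
```

On perfectly exponential data the fit recovers the generating parameters; on the real Table 2 data the same procedure yields the estimates quoted above.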
The diagrams show that the main bodies of the three scale measures are roughly straight lines. In other words, they all fall into power law distributions. We note that the flat head in Fig. 6 can be explained through a Zipf-Mandelbrot distribution. Therefore, Theorem 2 holds.

In order to examine whether the tails are exponential, we take the smallest six data points from each tail of the three levels. It is reasonable to say that they are the networks compromised in the last 6 time units; the details are listed in Table 5 (we note that t = 1 is the sixth last time point, and t = 6 is the last time point). When we present the data of Table 5 in a graph as shown in Fig. 7, we find that they fit an exponential distribution very well, especially for the level 2 and level 3 domain name cases. This experiment confirms our claim in Theorem 3.

Table 3. Statistics for Conficker distribution in terms of ASs.
Fig. 5. Power law distribution of Conficker in terms of autonomous networks.
Table 4. Statistics for Conficker distribution in terms of domain names at the three top levels.
Fig. 6. Power law distribution of the Conficker botnet in the top three levels of domain names.

YU ET AL.: MALWARE PROPAGATION IN LARGE-SCALE NETWORKS 177

8 FURTHER DISCUSSION

In this paper, we have explored the problem of malware distribution in large-scale networks. There are many directions that could be further explored. We list some important ones as follows.

1) The dynamics of the late stage. We have found that the main body of malware distribution follows the power law with a short exponential tail at the late stage. It is very attractive to explore the mathematical mechanism of how the propagation leads to such kinds of mixed distributions.

2) The transition from exponential distribution to power law distribution. It is necessary to investigate when and how a malware distribution moves from an exponential distribution to the power law. In other words, how can we clearly define the transition point between the early stage and the late stage?

3) Multiple layer modelling. We use the fluid model in both of the two layers in our study, as both layers are sufficiently large and meet the conditions for the modelling methods. In order to improve the accuracy of malware propagation, we may extend our work to n (n > 2) layers. In another scenario, we may expect to model a malware distribution for middle-size networks, e.g., an ISP network with many subnetworks. In these cases, the conditions for the fluid model may not hold. Therefore, we need to seek suitable models to address the problem.

4) Epidemic model for the proposed two layer model. In this paper, we use the SI model, which is the simplest for epidemic analysis. More practical models, e.g., SIS or SIR, could be chosen to serve the same problem.

5) Distribution of coexisting multiple malware in networks. In reality, multiple malware may coexist in the same networks. Due to the fact that different malware focus on different vulnerabilities, the distributions of different malware should not be the same. It is challenging and interesting to establish mathematical models for multiple malware distributions in terms of networks.

9 SUMMARY AND FUTURE WORK

In this paper, we thoroughly explore the problem of malware distribution in large-scale networks. The solution to this problem is desperately desired by cyber defenders, as the network security community does not yet have solid answers. Different from previous modelling methods, we propose a two layer epidemic model: the upper layer focuses on networks of a large scale, for example, domains of the Internet; the lower layer focuses on the hosts of a given network. This two layer model improves the accuracy compared with the available single layer epidemic models in malware modelling.
Moreover, the proposed two layer model offers us the distribution of malware in terms of the low-layer networks.

We perform a restricted analysis based on the proposed model, and obtain three conclusions: the distribution for a given malware in terms of networks follows an exponential distribution, a power law distribution with a short exponential tail, and a power law distribution, at its early, late, and final stages, respectively. In order to examine our theoretical findings, we have conducted extensive experiments based on two real-world large-scale malware, and the results confirm our theoretical claims.

In regards to future work, we will first further investigate the dynamics of the late stage. More details of the findings are expected to be further studied, such as the length of the exponential tail of a power law distribution at the late stage. Second, defenders may care more about their own network, e.g., the distribution of a given malware in their ISP domains, where the conditions for the two layer model may not hold. We need to seek appropriate models to address this problem. Finally, we are interested in studying the distribution of multiple malware on large-scale networks, as we only focus on one malware in this paper. We believe it is not a simple linear relationship in the multiple malware case compared to the single malware one.

ACKNOWLEDGMENTS

Dr Yu's work is partially supported by the National Natural Science Foundation of China (grant No. 61379041), Prof. Stojmenovic's work is partially supported by an NSERC Canada Discovery grant (grant No. 41801-2010), and the KAU Distinguished Scientists Program.

Shui Yu (M'05-SM'12) received the BEng and MEng degrees from the University of Electronic Science and Technology of China, Chengdu, P. R. China, in 1993 and 1999, respectively, and the PhD degree from Deakin University, Victoria, Australia, in 2004. He is currently a senior lecturer with the School of Information Technology, Deakin University, Victoria, Australia.
He has published nearly 100 peer-reviewed papers, including in top journals and top conferences, such as IEEE TPDS, IEEE TIFS, IEEE TFS, IEEE TMC, and IEEE INFOCOM. His research interests include networking theory, network security, and mathematical modeling. He actively serves his research communities in various roles, which include the editorial boards of the IEEE Transactions on Parallel and Distributed Systems, IEEE Communications Surveys and Tutorials, and IEEE Access, IEEE INFOCOM TPC member 2012-2015, symposium co-chair of IEEE ICC 2014 and IEEE ICNC 2013-2015, and many different roles on international conference organizing committees. He is a senior member of the IEEE, and a member of the AAAS.

Guofei Gu (S'06-M'08) received the PhD degree in computer science from the College of Computing, Georgia Institute of Technology. He is an assistant professor in the Department of Computer Science and Engineering, Texas A&M University (TAMU), College Station, TX. His research interests are in network and system security, such as malware analysis, detection, defense, intrusion and anomaly detection, and web and social networking security. He is currently directing the Secure Communication and Computer Systems (SUCCESS) Laboratory at TAMU. He received the 2010 National Science Foundation (NSF) CAREER Award and is a co-recipient of the 2010 IEEE Symposium on Security and Privacy (Oakland '10) Best Student Paper Award. He is a member of the IEEE.

Ahmed Barnawi received the PhD degree from the University of Bradford, United Kingdom, in 2006. He is an associate professor at the Faculty of Computing and IT, King Abdulaziz University, Jeddah, Saudi Arabia, where he has worked since 2007. He was a visiting professor at the University of Calgary in 2009. His research areas are cellular and mobile communications, mobile ad hoc and sensor networks, cognitive radio networks, and security. He received three strategic research grants and registered two patents in the US.
He is a member of the IEEE.

Song Guo (M'02-SM'11) received the PhD degree in computer science from the University of Ottawa, Canada, in 2005. He is currently a senior associate professor at the School of Computer Science and Engineering, the University of Aizu, Japan. His research interests are mainly in the areas of protocol design and performance analysis for reliable, energy-efficient, and cost-effective communications in wireless networks. He is an associate editor of the IEEE Transactions on Parallel and Distributed Systems and an editor of Wireless Communications and Mobile Computing. He is a senior member of the IEEE and the ACM.

Ivan Stojmenovic was editor-in-chief of the IEEE Transactions on Parallel and Distributed Systems (2010-2013), and is the founder of three journals. He is an editor of the IEEE Transactions on Computers, IEEE Network, IEEE Transactions on Cloud Computing, and ACM Wireless Networks, and a steering committee member of the IEEE Transactions on Emergent Topics in Computing. He is on the Thomson Reuters list of Highly Cited Researchers from 2013, has the top h-index in Canada for mathematics and statistics, and has more than 15,000 citations. He received five Best Paper Awards. He is a fellow of the IEEE, the Canadian Academy of Engineering, and Academia Europaea. He has received the Humboldt Research Award.
Lossless and Reversible Data Hiding in Encrypted Images with Public Key Cryptography
Abstract—This paper proposes lossless, reversible, and combined data hiding schemes for ciphertext images encrypted by public key cryptosystems with probabilistic and homomorphic properties. In the lossless scheme, the ciphertext pixels are replaced with new values to embed the additional data into several LSB-planes of the ciphertext pixels by multi-layer wet paper coding. Then, the embedded data can be directly extracted from the encrypted domain, and the data embedding operation does not affect the decryption of the original plaintext image. In the reversible scheme, a preprocessing step is employed to shrink the image histogram before image encryption, so that the modification of encrypted images for data embedding cannot cause any pixel oversaturation in the plaintext domain. Although a slight distortion is introduced, the embedded data can be extracted and the original image can be recovered from the directly decrypted image. Due to the compatibility of the lossless and reversible schemes, the data embedding operations of the two manners can be performed simultaneously in one encrypted image. With the combined technique, a receiver may extract part of the embedded data before decryption, and extract the other part of the embedded data and recover the original plaintext image after decryption.
Index Terms—reversible data hiding, lossless data hiding, image encryption
I. INTRODUCTION
Encryption and data hiding are two effective means of data protection. While encryption techniques convert plaintext content into unreadable ciphertext, data hiding techniques embed additional data into cover media by introducing slight modifications. In some distortion-unacceptable scenarios, data hiding may be performed in a lossless or reversible manner. Although the terms "lossless" and "reversible" have the same meaning in a number of previous references, we distinguish them in this work.
We say a data hiding method is lossless if the display of the cover signal containing embedded data is the same as that of the original cover, even though the cover data have been modified for data embedding. For example, in [1], the pixels with the most used color in a palette image are assigned to some unused color indices for carrying the additional data, and these indices are redirected to the most used color. This way, although the indices of these pixels are altered, the actual colors of the pixels are kept unchanged. On the other hand, we say a data hiding method is reversible if the original cover content can be perfectly recovered from the cover version containing embedded data, even though a slight distortion has been introduced by the data embedding procedure. A number of mechanisms, such as difference expansion [2], histogram shift [3], and lossless compression [4], have been employed to develop reversible data hiding techniques for digital images. Recently, several good prediction approaches [5] and the optimal transition probability under a payload-distortion criterion [6, 7] have been introduced to improve the performance of reversible data hiding.
The combination of data hiding and encryption has been studied in recent years. In some works, data hiding and encryption are joined in a simple manner. For example, a part of the cover data is used for carrying additional data and the rest of the data are encrypted for privacy protection [8, 9]. Alternatively, the additional data are embedded into a data space that is invariant to encryption operations [10]. In another type of work, data embedding is performed in the encrypted domain, and an authorized receiver can recover the original plaintext cover image and extract the embedded data. This technique is termed reversible data hiding in encrypted images (RDHEI). In some scenarios, for securely sharing secret images, a content owner may encrypt the images before transmission, and an inferior assistant or a channel administrator hopes to append some additional messages, such as origin information, image notations, or authentication data, within the encrypted images, even though he does not know the image content. For example, when medical images have been encrypted for protecting patient privacy, a database administrator may aim to embed personal information into the corresponding encrypted images. Here, it is desirable that the original content can be recovered without any error after decryption and that the additional message can be retrieved at the receiver side. In [11], the original image is encrypted by an exclusive-or operation with pseudo-random bits, and then the additional data are embedded by flipping a part of the least significant bits (LSB) of the encrypted image. By exploiting the spatial correlation in natural images, the embedded data and the original content can be retrieved at the receiver side. The performance of RDHEI can be further
Xinpeng Zhang, Jing Long, Zichi Wang, and Hang Cheng 1051-8215 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCSVT.2015.2433194, IEEE Transactions on Circuits and Systems for Video Technology
improved by introducing an implementation order [12] or a flipping ratio [13]. In [14], each additional bit is embedded into a block of data encrypted by the Advanced Encryption Standard (AES). When a receiver decrypts the encrypted image containing additional data, however, the quality of the decrypted image is significantly degraded due to the disturbance of the additional data. In [15], the data-hider compresses the LSB of the encrypted image to generate a sparse space for carrying the additional data. Since only the LSB is changed in the data embedding phase, the quality of the directly decrypted image is satisfactory. Reversible data hiding schemes for encrypted JPEG images have also been presented [16]. In [17], a sparse data space for accommodating additional data is directly created by compressing the encrypted data. If the creation of the sparse data space or the compression is implemented before encryption, a better performance can be achieved [18, 19].
While the additional data are embedded into encrypted images with symmetric cryptosystems in the above-mentioned RDHEI methods, an RDHEI method with a public key cryptosystem is proposed in [20]. Although the computational complexity is higher, the establishment of a secret key through a secure channel between the sender and the receiver becomes needless. With the method in [20], each pixel is divided into two parts: an even integer and a bit, and the two parts are encrypted using the Paillier mechanism [21], respectively. Then, the ciphertext values of the second parts of two adjacent pixels are modified to accommodate an additional bit. Due to the homomorphic property of the cryptosystem, the embedded bit can be extracted by comparing the corresponding decrypted values at the receiver side. In fact, the homomorphic property may be further exploited to implement signal processing in the encrypted domain [22, 23, 24]. For recovering the original plaintext image, an inverse operation to retrieve the second part of each pixel in the plaintext domain is required, and then the two decrypted parts of each pixel must be reorganized as a pixel.
This paper proposes lossless, reversible, and combined data hiding schemes for public-key-encrypted images by exploiting the probabilistic and homomorphic properties of cryptosystems. With these schemes, the pixel division/reorganization is avoided and the encryption/decryption is performed on the cover pixels directly, so that the amount of encrypted data and the computational complexity are lowered. In the lossless scheme, due to the probabilistic property, although the data of the encrypted image are modified for data embedding, a direct decryption can still result in the original plaintext image, while the embedded data can be extracted in the encrypted domain. In the reversible scheme, a histogram shrink is realized before encryption so that the modification of the encrypted image for data embedding does not cause any pixel oversaturation in the plaintext domain. Although the data embedding in the encrypted domain may result in a slight distortion in the plaintext domain due to the homomorphic property, the embedded data can be extracted and the original content can be recovered from the directly decrypted image. Furthermore, the data embedding operations of the lossless and the reversible schemes can be performed simultaneously in one encrypted image. With the combined technique, a receiver may extract part of the embedded data before decryption, and extract the other part of the embedded data and recover the original plaintext image after decryption.
II. LOSSLESS DATA HIDING SCHEME
In this section, a lossless data hiding scheme for public-key-encrypted images is proposed. There are three parties in the scheme: an image provider, a data-hider, and a receiver. With a cryptosystem possessing the probabilistic property, the image provider encrypts each pixel of the original plaintext image using the public key of the receiver, and a data-hider who does not know the original image can modify the ciphertext pixel values to embed some additional data into the encrypted image by multi-layer wet paper coding, under the condition that the decrypted values of the new and original ciphertext pixel values must be the same. When given the encrypted image containing the additional data, a receiver knowing the data hiding key may extract the embedded data, while a receiver with the private key of the cryptosystem may perform decryption to retrieve the original plaintext image. In other words, the embedded data can be extracted in the encrypted domain, but cannot be extracted after decryption, since the decrypted image would be the same as the original plaintext image due to the probabilistic property. That also means the data embedding does not affect the decryption of the plaintext image. The sketch of the lossless data hiding scheme is shown in Figure 1.
Figure 1. Sketch of the lossless data hiding scheme for public-key-encrypted images.
A. Image encryption
In this phase, the image provider encrypts a plaintext image using the public key pk of a probabilistic cryptosystem. For each pixel value m(i, j), where (i, j) indicates the pixel position, the image provider calculates its ciphertext value,
c(i, j) = E[pk, m(i, j), r(i, j)]    (1)
where E is the encryption operation and r(i, j) is a random value. Then, the image provider collects the ciphertext values of all pixels to form an encrypted image.
Actually, the proposed scheme is compatible with various probabilistic public-key cryptosystems, such as the Paillier [21] and Damgard-Jurik [25] cryptosystems. With the Paillier cryptosystem [21], for two large primes p and q, calculate n = p⋅q and λ = lcm(p−1, q−1), where lcm means the least common multiple. Here, it should hold that gcd(n, (p−1)⋅(q−1)) = 1, where gcd means the greatest common divisor. The public key is composed of n and a randomly selected integer g in Z*_{n^2}, while the private key is composed of λ and μ,
μ = (L(g^λ mod n^2))^{−1} mod n    (2)
where
L(x) = (x − 1)/n    (3)
In this case, (1) implies
c(i, j) = g^{m(i, j)} ⋅ r(i, j)^n mod n^2    (4)
where r(i, j) is a random integer in Z*n. The plaintext pixel value can be obtained using the private key,
m(i, j) = L(c(i, j)^λ mod n^2) ⋅ μ mod n    (5)
As a generalization of the Paillier cryptosystem, the Damgard-Jurik cryptosystem [25] can also be used to encrypt the plaintext image. Here, the public key is composed of n and an element g in Z*_{n^{s+1}} such that g = (1+n)^j ⋅ x mod n^{s+1} for a known j relatively prime to n and an x belonging to a group isomorphic to Z*_n, and we may choose d as the private key when it meets d mod n ∈ Z*_n and d = 0 mod λ. Then, the encryption in (1) can be rewritten as
c(i, j) = g^{m(i, j)} ⋅ r(i, j)^{n^s} mod n^{s+1}    (6)
where r(i, j) is a random integer in Z*_{n^{s+1}}. By applying a recursive version of the Paillier decryption, the plaintext value can be obtained from the ciphertext value using the private key. Note that, because of the probabilistic property of the two cryptosystems, the same gray values at different positions may correspond to different ciphertext values.
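As a concrete illustration of Eqs. (4)-(5) and the probabilistic property, here is a toy Paillier sketch. It uses the common choice g = n + 1 (for which L(g^λ mod n^2) = λ, so μ = λ^{−1} mod n); the tiny primes are an assumption for readability only and offer no security.

```python
import math, random

random.seed(1)

def keygen(p=47, q=59):
    # toy primes only; gcd(n, (p-1)(q-1)) = 1 holds for this pair
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    g = n + 1                      # a standard valid choice of g
    mu = pow(lam, -1, n)           # since L(g^lam mod n^2) = lam when g = n + 1
    return (n, g), (lam, mu)

def L(x, n):
    return (x - 1) // n            # Eq. (3)

def rand_unit(n):
    while True:                    # draw r with gcd(r, n) = 1
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            return r

def encrypt(pk, m):
    n, g = pk
    r = rand_unit(n)
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)    # Eq. (4)

def decrypt(pk, sk, c):
    n, _ = pk
    lam, mu = sk
    return (L(pow(c, lam, n * n), n) * mu) % n                # Eq. (5)

pk, sk = keygen()
c1, c2 = encrypt(pk, 200), encrypt(pk, 200)
print(c1 != c2)                    # same pixel value, different ciphertexts
print(decrypt(pk, sk, c1), decrypt(pk, sk, c2))   # 200 200
```

Two encryptions of the same gray value differ (with overwhelming probability) because each uses a fresh r, yet both decrypt to the original pixel.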
B. Data embedding
When having the encrypted image, the data-hider may embed some additional data into it in a lossless manner. The pixels in the encrypted image are reorganized as a sequence according to the data hiding key. For each encrypted pixel, the data-hider selects a random integer r’(i, j) in Z*n and calculates
c′(i, j) = c(i, j) ⋅ r′(i, j)^n mod n^2    (7)
if the Paillier cryptosystem is used for image encryption, while the data-hider selects a random integer r′(i, j) in Z*_{n^{s+1}} and calculates
c′(i, j) = c(i, j) ⋅ r′(i, j)^{n^s} mod n^{s+1}    (8)
if Damgard-Jurik cryptosystem is used for image encryption. We denote the binary representations of c(i, j) and c’(i, j) as bk(i, j) and b’k(i, j), respectively,
b_k(i, j) = ⌊c(i, j)/2^{k−1}⌋ mod 2,  k = 1, 2, …    (9)
b′_k(i, j) = ⌊c′(i, j)/2^{k−1}⌋ mod 2,  k = 1, 2, …    (10)
Clearly, the probability of bk(i, j) = b’k(i, j) (k = 1, 2, …) is 1/2. We also define the sets
S_1 = {(i, j) | b_1(i, j) ≠ b′_1(i, j)}
S_2 = {(i, j) | b_1(i, j) = b′_1(i, j), b_2(i, j) ≠ b′_2(i, j)}
…
S_K = {(i, j) | b_k(i, j) = b′_k(i, j) for k = 1, 2, …, K−1, b_K(i, j) ≠ b′_K(i, j)}    (11)
By viewing the k-th LSB of encrypted pixels as a wet paper channel (WPC) [26] and the k-th LSB in Sk as “dry” elements of the wet paper channel, the data-hider may employ the wet paper coding [26] to embed the additional data by replacing a part of c(i, j) with c’(i, j). The details will be given in the following.
Considering the first LSB, if c(i, j) are replaced with c’(i, j), the first LSB in S1 would be flipped and the rest first LSB would be unchanged. So, the first LSB of the encrypted pixels can be regarded as a WPC, which includes changeable (dry) elements and unchangeable (wet) elements. In other words, the first LSB in S1 are dry elements and the rest first LSB are wet positions. By using the wet paper coding [26], one can represent on average Nd bits by only flipping a part of dry elements where Nd is the number of dry elements. In this scenario, the data-hider may flip the dry elements by replacing c(i, j) with c’(i, j). Denoting the number of pixels in the image as N, the data-hider may embed on average N/2 bits in the first LSB-layer using wet paper coding.
Considering the second LSB (SLSB) layer, we call the SLSB in S2 as dry elements and the rest SLSB as wet elements. Note that the first LSB of ciphertext pixels in S1 have been determined by replacing c(i, j) with c’(i, j) or keeping c(i, j) unchanged in the first LSB-layer embedding, meaning that the SLSB in S1 are unchangeable in the second layer. Then, the data-hider may flip a part of SLSB in S2 by replacing c(i, j) with c’(i, j) to embed on average N/4 bits using wet paper coding.
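The layer partition of Eq. (11) and the resulting per-layer capacities can be sketched directly. The bits below are random stand-ins for the LSBs of c(i, j) and c′(i, j) (in the scheme each pair agrees with probability 1/2, per Eqs. (9)-(10)), so the dry-element counts come out near N/2, N/4, N/8, matching the wet-paper capacities stated above.

```python
import random

random.seed(0)
K, N = 3, 10000

# stand-ins for the K lowest bits of c(i, j) and c'(i, j); real values
# would be computed from the ciphertexts via Eqs. (9)-(10)
b  = [[random.getrandbits(1) for _ in range(K)] for _ in range(N)]
bp = [[random.getrandbits(1) for _ in range(K)] for _ in range(N)]

# Eq. (11): pixel i is "dry" in layer k if its first k-1 bit pairs agree
# and the k-th pair differs; each pixel lands in at most one S_k
S = {k: [] for k in range(1, K + 1)}
for i in range(N):
    for k in range(1, K + 1):
        if b[i][:k - 1] == bp[i][:k - 1] and b[i][k - 1] != bp[i][k - 1]:
            S[k].append(i)

for k in range(1, K + 1):
    print(k, len(S[k]))        # roughly N/2, N/4, N/8 dry elements
```

Wet paper coding then embeds, on average, as many bits per layer as there are dry elements, without the receiver needing to know which elements were dry.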
Similarly, in the k-th LSB layer, the data-hider may flip a part of the k-th LSB in S_k to embed on average N/2^k bits. When the data embedding is implemented in K layers, a total of N⋅(1−1/2^K) bits, on average, are embedded. That implies the embedding rate, the ratio between the number of embedded bits and the number of pixels in the cover image, is approximately (1−1/2^K), so the upper bound of the embedding rate is 1 bit per pixel. The next subsection will show that, although a part of the c(i, j) are replaced with c′(i, j), the original plaintext image can still be obtained by decryption.
C. Data extraction and image decryption
After receiving an encrypted image containing the additional data, if the receiver knows the data-hiding key, he may calculate the k-th LSB of encrypted pixels, and then extract the embedded data from the K LSB-layers using wet paper coding. On the other hand, if the receiver knows the private key of the used cryptosystem, he may perform decryption to obtain the original plaintext image. When Paillier cryptosystem is used, Equation (4) implies
c(i, j) = g^{m(i, j)} ⋅ r(i, j)^n + α ⋅ n^2    (12)
where α is an integer. By substituting (12) into (7), there is
c′(i, j) = g^{m(i, j)} ⋅ (r(i, j) ⋅ r′(i, j))^n mod n^2    (13)
Since r(i, j)⋅r’(i, j) can be viewed as another random integer in Z*n, the decryption on c’(i, j) will result in the plaintext value,
m(i, j) = L(c′(i, j)^λ mod n^2) ⋅ μ mod n    (14)
Similarly, when Damgard-Jurik cryptosystem is used,
c′(i, j) = g^{m(i, j)} ⋅ (r(i, j) ⋅ r′(i, j))^{n^s} mod n^{s+1}    (15)
The decryption on c’(i, j) will also result in the plaintext value. In other words, the replacement of ciphertext pixel values for data embedding does not affect the decryption result.
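The invariance argued in (12)-(14) is easy to check numerically: multiplying a ciphertext by a fresh r′^n (Eq. (7)) changes the ciphertext but not its decryption. A minimal sketch with the same toy parameters as before (tiny primes, g = n + 1; an illustrative assumption, not a secure setup):

```python
import math, random

random.seed(7)
p, q = 47, 59                    # toy primes, no security
n = p * q; n2 = n * n
lam = math.lcm(p - 1, q - 1)
g = n + 1
mu = pow(lam, -1, n)

def rand_unit():
    while True:                  # unit r' != 1, so c' is guaranteed to change
        r = random.randrange(2, n)
        if math.gcd(r, n) == 1:
            return r

def dec(c):                                                 # Eq. (5)
    return (((pow(c, lam, n2) - 1) // n) * mu) % n

m = 123
c = (pow(g, m, n2) * pow(rand_unit(), n, n2)) % n2          # Eq. (4)
c_marked = (c * pow(rand_unit(), n, n2)) % n2               # Eq. (7)
print(c_marked != c, dec(c_marked) == m)                    # True True
```

This is exactly why the lossless embedding is transparent to the receiver: the replacement only re-randomizes the ciphertext.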
III. REVERSIBLE DATA HIDING SCHEME
This section proposes a reversible data hiding scheme for public-key-encrypted images. In the reversible scheme, a preprocessing step is employed to shrink the image histogram, and then each pixel is encrypted with an additively homomorphic cryptosystem by the image provider. When given the encrypted image, the data-hider modifies the ciphertext pixel values to embed a bit-sequence generated from the additional data and error-correction codes. Due to the homomorphic property, the modification in the encrypted domain will result in a slight increase or decrease of the plaintext pixel values, implying that a decryption can be implemented to obtain an image similar to the original plaintext image at the receiver side. Because of the histogram shrink before encryption, the data embedding operation does not cause any overflow/underflow in the directly decrypted image. Then, the original plaintext image can be recovered and the embedded additional data can be extracted from the directly decrypted image. Note that the data extraction and content recovery of the reversible scheme are performed in the plaintext domain, while the data extraction of the previous lossless scheme is performed in the encrypted domain and content recovery is needless. The sketch of the reversible data hiding scheme is given in Figure 2.
Figure 2. Sketch of the reversible data hiding scheme for public-key-encrypted images.
A. Histogram shrink and image encryption
In the reversible scheme, a small integer δ shared by the image provider, the data-hider and the receiver will be used, and its value will be discussed later. Denote the number of pixels in the original plaintext image with gray value v as hv, implying
Σ_{v=0}^{255} h_v = N    (16)
where N is the number of all pixels in the image. The image provider collects the pixels with gray values in [0, δ+1] and represents their values as a binary stream BS1. When an efficient lossless source coding is used, the length of BS1 satisfies
l_1 ≈ (Σ_{v=0}^{δ+1} h_v) ⋅ H(h_0 / Σ_{v=0}^{δ+1} h_v, h_1 / Σ_{v=0}^{δ+1} h_v, …, h_{δ+1} / Σ_{v=0}^{δ+1} h_v)    (17)
where H(⋅) is the entropy function. The image provider also collects the pixels with gray values in [255−δ, 255] and represents their values as a binary stream BS2 with length l_2. Similarly,
l_2 ≈ (Σ_{v=255−δ}^{255} h_v) ⋅ H(h_{255−δ} / Σ_{v=255−δ}^{255} h_v, …, h_{255} / Σ_{v=255−δ}^{255} h_v)    (18)
Then, the gray values of all pixels are enforced into [δ+1, 255−δ],
m_S(i, j) = 255 − δ,  if m(i, j) ≥ 255 − δ;
m_S(i, j) = m(i, j),  if δ + 1 < m(i, j) < 255 − δ;
m_S(i, j) = δ + 1,  if m(i, j) ≤ δ + 1    (19)
Denoting the new histogram as h′_v, there must be
h′_v = 0,  if v < δ + 1 or v > 255 − δ;
h′_v = Σ_{u=0}^{δ+1} h_u,  if v = δ + 1;
h′_v = h_v,  if δ + 1 < v < 255 − δ;
h′_v = Σ_{u=255−δ}^{255} h_u,  if v = 255 − δ    (20)
The image provider finds the peak of the new histogram,
V = arg max_{δ+1 ≤ v ≤ 255−δ} h′_v    (21)
The image provider also divides all pixels into two sets: the first set including (N−8) pixels and the second set including the rest 8 pixels, and maps each bit of BS1, BS2 and the LSB of pixels in the second set to a pixel in the first set with gray value V. Since the gray values close to extreme black/white are rare, there is
h′_V ≥ l_1 + l_2 + 16    (22)
when δ is not too large. In this case, the mapping operation is feasible. Here, the 8 pixels in the second set cannot be used to carry BS1/BS2 since their LSB should be used to carry the value of V, while 8 pixels in the first set cannot be used to carry BS1/BS2 since their LSB should be used to carry the original LSB of the second set. So, a total of 16 pixels cannot be used for carrying BS1/BS2; that is the reason for the value 16 in (22). The experimental result on 1000 natural images shows that (22) always holds when δ is less than 15. So, we recommend the parameter δ < 15. Then, a histogram shift operation is made,
m_T(i, j) = m_S(i, j),  if m_S(i, j) > V;
m_T(i, j) = m_S(i, j),  if m_S(i, j) = V and the corresponding bit is 0;
m_T(i, j) = m_S(i, j) − 1,  if m_S(i, j) = V and the corresponding bit is 1;
m_T(i, j) = m_S(i, j) − 1,  if m_S(i, j) < V    (23)
In other words, BS1, BS2, and the LSB of the pixels in the second set are carried by the pixels in the first set. After this, the image provider represents the value of V as 8 bits and maps them to the pixels in the second set in a one-to-one manner. Then, the values of the pixels in the second set are modified as follows,
m_T(i, j) = m_S(i, j),  if the LSB of m_S(i, j) is the same as the corresponding bit;
m_T(i, j) = m_S(i, j) − 1,  if the LSB of m_S(i, j) differs from the corresponding bit    (24)
That means the value of V is embedded into the LSB of the second set. This way, all pixel values must fall into [δ, 255−δ].
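The preprocessing above can be sketched on a toy image: Eq. (19) clamps the gray values into [δ+1, 255−δ] and Eq. (21) picks the histogram peak V that will carry BS1/BS2. The flat pixel list and uniform random values are illustrative assumptions; the collection of BS1/BS2 and the shift of Eqs. (23)-(24) are omitted for brevity.

```python
import random

random.seed(3)
delta = 4
# toy 8-bit "image" as a flat pixel list; a real implementation walks the 2-D image
m = [random.randrange(256) for _ in range(10000)]

# Eq. (19): enforce all gray values into [delta+1, 255-delta]
m_s = [255 - delta if v >= 255 - delta
       else (delta + 1 if v <= delta + 1 else v)
       for v in m]

# Eqs. (20)-(21): histogram of the shrunk image and its peak V
hist = [0] * 256
for v in m_s:
    hist[v] += 1
V = max(range(delta + 1, 256 - delta), key=lambda v: hist[v])
print(min(m_s), max(m_s))      # 5 251: all values inside [delta+1, 255-delta]
```

After the subsequent shift of Eq. (23), which lowers some values by 1, every pixel stays within [δ, 255−δ], which is what makes the later ±δ modifications in the encrypted domain safe from overflow.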
Finally, the image provider encrypts all pixels using a public-key cryptosystem with the additive homomorphic property, such as the Paillier and Damgard-Jurik cryptosystems. When the Paillier cryptosystem is used, the ciphertext pixel is
$$c(i,j) = g^{m_T(i,j)} \cdot r(i,j)^n \bmod n^2 \tag{25}$$

where r(i, j) is randomly selected in $Z_n^*$.
When the Damgard-Jurik cryptosystem is used, the ciphertext pixel is
$$c(i,j) = g^{m_T(i,j)} \cdot r(i,j)^{n^s} \bmod n^{s+1} \tag{26}$$
Then, the ciphertext values of all pixels are collected to form an encrypted image.
B. Data embedding
With the encrypted image, the data-hider divides the ciphertext pixels into two sets: Set A including c(i, j) with odd value of (i+j), and Set B including c(i, j) with even value of (i+j). Without loss of generality, we suppose the pixel number in Set A is N/2. Then, the data-hider employs error-correction codes to expand the additional data into a bit-sequence with length N/2, and maps the bits in the coded bit-sequence to the ciphertext pixels in Set A in a one-to-one manner, which is determined by the data-hiding key. When the Paillier cryptosystem is used, if the bit is 0, the corresponding ciphertext pixel is modified as
$$c'(i,j) = c(i,j) \cdot g^{n-\delta} \cdot r'(i,j)^n \bmod n^2 \tag{27}$$
where r'(i, j) is an integer randomly selected in $Z_n^*$. If the bit is 1, the corresponding ciphertext pixel is modified as
$$c'(i,j) = c(i,j) \cdot g^{\delta} \cdot r'(i,j)^n \bmod n^2 \tag{28}$$
When the Damgard-Jurik cryptosystem is used, if the bit is 0, the corresponding ciphertext pixel is modified as
$$c'(i,j) = c(i,j) \cdot g^{n^s-\delta} \cdot r'(i,j)^{n^s} \bmod n^{s+1} \tag{29}$$
where r'(i, j) is an integer randomly selected in $Z_{n^{s+1}}^*$. If the bit is 1, the corresponding ciphertext pixel is modified as
$$c'(i,j) = c(i,j) \cdot g^{\delta} \cdot r'(i,j)^{n^s} \bmod n^{s+1} \tag{30}$$
This way, an encrypted image containing additional data is produced. Note that the additional data are embedded into Set A. Although the pixels in Set B may provide side information about the pixel values in Set A, which will be used for data extraction, the pixel values in Set A are difficult to obtain precisely on the receiver side, leading to possible errors in the directly extracted data. So, the error-correction coding mechanism is employed here to ensure successful data extraction and perfect image recovery.
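The homomorphic operations used above can be illustrated with a toy Paillier implementation. This is a sketch under illustrative assumptions: the tiny primes and names are ours, and a real system would use a cryptographic library with large moduli. Multiplying a ciphertext by a fresh encryption of δ adds δ to the hidden plaintext, which is the essence of the Eq. (27)/(28)-style embedding.

```python
import random
from math import gcd

# Toy Paillier cryptosystem with tiny primes -- for illustration only.
p, q = 17, 19
n, n2 = p * q, (p * q) ** 2
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)   # lambda = lcm(p-1, q-1)
g = n + 1                                       # common choice of generator
mu = pow(lam, -1, n)                            # precomputed decryption factor

def encrypt(m):
    r = random.randrange(2, n)                  # r must be coprime to n
    while gcd(r, n) != 1:
        r = random.randrange(2, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2     # Eq. (25)-style encryption

def decrypt(c):
    # L(c^lam mod n^2) * mu mod n, with L(x) = (x-1)/n
    return ((pow(c, lam, n2) - 1) // n * mu) % n

m1, delta = 120, 4
c1 = encrypt(m1)
# Additive homomorphism: multiplying by Enc(delta) adds delta to the
# plaintext, i.e. an Eq. (28)-style embedding of a bit '1'.
c_plus = (c1 * encrypt(delta)) % n2
assert decrypt(c1) == m1
assert decrypt(c_plus) == m1 + delta
```

Embedding a bit 0 works the same way with a factor of $g^{n-\delta}$, which subtracts δ modulo n.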
C. Image decryption, data extraction and content recovery
After receiving an encrypted image containing additional data, the receiver firstly performs decryption using his private key. We denote the decrypted pixels as m’(i, j). Due to the homomorphic property, the decrypted pixel values in Set A meet
$$m'(i,j) = \begin{cases} m_T(i,j) + \delta, & \text{if the corresponding bit is } 1 \\ m_T(i,j) - \delta, & \text{if the corresponding bit is } 0 \end{cases} \tag{31}$$
On the other hand, the decrypted pixel values in Set B are just mT(i, j) since their ciphertext values are unchanged in data embedding phase. When δ is small, the decrypted image is perceptually similar to the original plaintext image.
Then, the receiver with the data-hiding key can extract the embedded data from the directly decrypted image. He estimates the pixel values in Set A using their neighbors,
$$\bar{m}(i,j) = \frac{m_T(i-1,j) + m_T(i+1,j) + m_T(i,j-1) + m_T(i,j+1)}{4} \tag{32}$$
and obtains an estimated version of the coded bit-sequence by comparing the decrypted and estimated pixel values in Set A. That means the coded bit is estimated as 0 if $\bar{m}(i,j) > m'(i,j)$, or as 1 if $\bar{m}(i,j) \le m'(i,j)$. With the estimate of the coded
bit-sequence, the receiver may employ the error-correction method to retrieve the original coded bit-sequence and the embedded additional data. Note that, with a larger δ, the error rate in the estimate of coded bits would be lower, so that more additional data can be embedded while still ensuring successful error correction and data extraction. In other words, a smaller δ would result in a higher error rate in the estimate of coded bits, so that the error correction may be unsuccessful when an excessive payload is embedded. That means the embedding capacity of the reversible data hiding scheme depends on the value of δ.
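A compact sketch of this estimation step, assuming the checkerboard Set A/Set B partition and the four-neighbour predictor of Eq. (32); the function name is ours:

```python
def estimate_bits(img):
    """Estimate embedded bits from a directly decrypted image (Eq. (32) sketch).

    img: 2-D list of decrypted values m'(i, j).  Pixels with odd i+j form
    Set A; each is predicted as the average of its four neighbours, which
    all lie in the unmodified Set B.  A bit is read as 0 when the prediction
    exceeds the pixel value (m' = mT - delta), and as 1 otherwise.
    """
    bits = {}
    for i in range(1, len(img) - 1):
        for j in range(1, len(img[0]) - 1):
            if (i + j) % 2 == 1:                        # Set A only
                est = (img[i - 1][j] + img[i + 1][j] +
                       img[i][j - 1] + img[i][j + 1]) / 4
                bits[(i, j)] = 0 if est > img[i][j] else 1
    return bits
```

On a flat region, a pixel shifted up by δ is read as a 1 and a pixel shifted down by δ is read as a 0; the error correction then cleans up the wrong guesses in textured regions.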
After retrieving the original coded bit-sequence and the embedded additional data, the original plaintext image may be further recovered. For the pixels in Set A, mT(i, j) are retrieved according to the coded bit-sequence,
$$m_T(i,j) = \begin{cases} m'(i,j) - \delta, & \text{if the corresponding bit is } 1 \\ m'(i,j) + \delta, & \text{if the corresponding bit is } 0 \end{cases} \tag{33}$$
For the pixels in Set B, as mentioned above, mT(i, j) are just m'(i, j). Then, the receiver divides all mT(i, j) into two sets: the first one including (N−8) pixels and the second one including the remaining 8 pixels. The receiver may obtain the value of V from the LSBs of the second set, and retrieve mS(i, j) of the first set,
$$m_S(i,j) = \begin{cases} m_T(i,j), & \text{if } m_T(i,j) > V \\ V, & \text{if } m_T(i,j) = V \text{ or } V-1 \\ m_T(i,j)+1, & \text{if } m_T(i,j) < V-1 \end{cases} \tag{34}$$
Meanwhile, the receiver extracts a bit 0 from a pixel with mT(i, j) = V and a bit 1 from a pixel with mT(i, j) = V−1. After decomposing the extracted data into BS1, BS2 and the LSB of mS(i, j) in the second set, the receiver retrieves mS(i, j) of the second set,
$$m_S(i,j) = \begin{cases} m_T(i,j), & \text{if the extracted LSB of } m_S(i,j) \text{ and the LSB of } m_T(i,j) \text{ are the same} \\ m_T(i,j)+1, & \text{if they are different} \end{cases} \tag{35}$$
The receiver collects all pixels with mS(i, j) = δ+1 and, according to BS1, recovers their original values within [0, δ+1]. Similarly, the original values of pixels with mS(i, j) = 255−δ are recovered within [255−δ, 255] according to BS2. This way, the original plaintext image is recovered.
IV. COMBINED DATA HIDING SCHEME
As described in Sections 2 and 3, a lossless and a reversible data hiding scheme for public-key-encrypted images are proposed. In both schemes, the data embedding operations are performed in the encrypted domain, but the data extraction procedures of the two schemes are very different. With the lossless scheme, data embedding does not affect the plaintext content, and data extraction is also performed in the encrypted domain. With the reversible scheme, data embedding introduces slight distortion into the directly decrypted image, and data extraction and image recovery must be performed in the plaintext domain. That implies that, on the receiver side, the additional data embedded by the lossless scheme cannot be extracted after decryption, while the additional data embedded by the reversible scheme cannot be extracted before decryption. In this section, we combine the lossless and reversible schemes to construct a new scheme in which data extraction is feasible in either of the two domains. That means additional data for various purposes may be embedded into an encrypted image, with one part extractable before decryption and another part extractable after decryption.
In the combined scheme, the image provider performs histogram shrink and image encryption as described in Subsection 3.A. Given the encrypted image, the data-hider may embed the first part of the additional data using the method described in Subsection 3.B. Denoting the ciphertext pixel values containing the first part of additional data as c'(i, j), the data-hider calculates
$$c''(i,j) = c'(i,j) \cdot r''(i,j)^n \bmod n^2 \tag{36}$$
or
$$c''(i,j) = c'(i,j) \cdot r''(i,j)^{n^s} \bmod n^{s+1} \tag{37}$$
where r''(i, j) are randomly selected in $Z_n^*$ or $Z_{n^{s+1}}^*$ for the Paillier and Damgard-Jurik cryptosystems, respectively. Then, he may employ
wet paper coding in several LSB-planes of ciphertext pixel values to embed the
second part of additional data by replacing a part of c’(i, j)
with c”(i, j). In other words, the method described in
Subsection 2.B is used to embed the second part of additional data. On receiver
side, the receiver firstly extracts the second part of additional data from the
LSB-planes of encrypted domain. Then, after decryption with his private key, he
extracts the first part of additional data and recovers the original plaintext
image from the directly decrypted image as described in Subsection 3.C. The
sketch of the combined scheme is shown in Figure 3. Note that, since the
reversibly embedded data should be extracted in the plaintext domain and the
lossless embedding does not affect the decrypted result, the lossless embedding
should be implemented after the reversible embedding in the combined scheme.
[Figure 3. Sketch of the combined scheme: the image provider applies histogram shrink and image encryption to the original image; the data-hider embeds additional data 1 by reversible data embedding and additional data 2 by lossless data embedding, producing the encrypted image containing embedded data; the receiver extracts additional data 2 in the encrypted domain, performs decryption, and then extracts additional data 1 and recovers the original image in the plaintext domain.]
V. EXPERIMENTAL RESULTS
Four gray images sized 512×512, Lena, Man, Plane and Crowd, shown in Figure 4, and 50 natural gray images sized 1920×2560, which contain landscape and people, were used as the original plaintext covers in the experiment. With the lossless scheme, all pixels in the cover images were first encrypted using the Paillier cryptosystem, and then the additional data were embedded into the LSB-planes of the ciphertext pixel-values using multi-layer wet paper coding as in Subsection 2.B. Table 1 lists the average embedding rates when K LSB-planes were used for carrying the additional data in the 54 encrypted images. In fact, the average embedding rate is very close to (1−1/2^K). On the receiver side, the embedded data can be extracted from the encrypted domain. Also, the original plaintext images can be retrieved by direct decryption. In other words, when decryption was performed on the encrypted images containing additional data, the original plaintext images were obtained.
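The closeness of the measured rate to (1−1/2^K) is easy to sanity-check: on an informal reading of the multi-layer wet paper coding argument, the i-th LSB-plane contributes roughly 1/2^i bits per pixel, and the contributions telescope to 1−1/2^K.

```python
# Expected embedding rate when K LSB-planes carry wet-paper payload:
# layer i contributes about 1/2**i bits per pixel, so the total telescopes
# to 1 - 1/2**K.
def expected_rate(K):
    return sum(1 / 2 ** i for i in range(1, K + 1))

for K in range(1, 6):
    assert abs(expected_rate(K) - (1 - 1 / 2 ** K)) < 1e-12
```

For K = 1, 2, 3 this gives 0.5, 0.75 and 0.875 bits per pixel, matching the trend reported in Table 1.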
With the reversible scheme, all pixels were encrypted after histogram shrink as in Subsection 3.A. Then, half of the ciphertext pixels were modified to carry the additional data as in Subsection 3.B, and after decryption, we implemented the data extraction and image recovery in the plaintext domain. Here, low-density parity-check (LDPC) coding was used to expand the additional data into a bit-sequence in the data embedding phase, and to retrieve the coded bit-sequence and the embedded additional data on the receiver side. Although the error-correction mechanism was employed, an excessive payload may cause the failure of data extraction and image recovery. With a larger value of δ, a higher embedding capacity can be ensured, while a higher distortion is introduced into the directly decrypted image. For instance, when using Lena as the cover and δ = 4, a total of 4.6×10^4 bits were embedded and the PSNR of the directly decrypted image was 40.3 dB. When using δ = 7, a total of 7.7×10^4 bits were embedded and the PSNR of the directly decrypted image was 36.3 dB. In both cases, the embedded additional data and the original plaintext image were extracted and recovered without any error. Figure 5 gives the two directly decrypted images. Figure 6 shows the rate-distortion curves generated from different cover images and various values of δ under the condition of successful data-extraction/image-recovery. The abscissa represents the pure embedding rate, and the ordinate is the PSNR of the directly decrypted image. The rate-distortion curves on the four test images, Lena, Man, Plane and Crowd, are given in Figure 6. We also used the 50 natural gray images sized 1920×2560 as the original plaintext covers and calculated the average embedding rates and PSNR values, which are shown as a curve marked by asterisks in the figure.
Furthermore, Figure 7 compares the average rate-PSNR performance between the proposed reversible scheme with public-key cryptosystems and several previous methods with symmetric cryptosystems, under the condition that the original plaintext image can be recovered without any error using the data-hiding and encryption keys. In [11] and [12], each encrypted-image block of a given size is used to carry one additional bit, so the embedding rates of these two works are fixed and low. With various parameters, we obtain the performance curves of the method in [15] and the proposed reversible scheme, which are shown in the figure. It can be seen that the proposed reversible scheme significantly outperforms the previous methods when the embedding rate is larger than 0.01 bpp. With the combined scheme, we implemented the histogram shrink operation with a value of parameter δ, and encrypted the
Efficient Top-k Retrieval on Massive Data
Abstract:
Top-k query is an important operation to return a set of interesting points in a potentially huge data space. This paper analyzes why the existing algorithms cannot process top-k queries on massive data efficiently, and proposes a novel table-scan-based T2S algorithm to compute top-k results on massive data efficiently. T2S first constructs the presorted table, whose tuples are arranged in the order of the round-robin retrieval on the sorted lists. T2S maintains only a fixed number of tuples to compute the results. The early termination checking for T2S is presented in this paper, along with the analysis of scan depth. Selective retrieval is devised to skip the tuples in the presorted table which are not top-k results. The theoretical analysis proves that selective retrieval can reduce the number of retrieved tuples significantly. The construction and incremental-update/batch-processing methods for the used structures are also proposed.
Introduction:
Top-k query is an important operation to return a set of interesting points from a potentially huge data space. In a top-k query, a ranking function F is provided to determine the score of each tuple, and the k tuples with the largest scores are returned. Due to its practical importance, top-k query has attracted extensive attention. This paper proposes a novel table-scan-based T2S algorithm (Top-k by Table Scan) to compute top-k results on massive data efficiently.
The analysis of scan depth in T2S is also developed. Since the result size k is usually small and the vast majority of the tuples retrieved in PT are not top-k results, this paper devises selective retrieval to skip the tuples in PT which are not query results. The theoretical analysis proves that selective retrieval can reduce the number of retrieved tuples significantly.
The construction and incremental-update/batch-processing methods for the data structures are proposed in this paper. The extensive experiments are conducted on synthetic and real life data sets.
Existing System:
Due to its practical importance, top-k query has attracted extensive attention. The existing top-k algorithms can be classified into three types: index-based methods, view-based methods, and sorted-list-based methods. Index-based methods (or view-based methods) make use of pre-constructed indexes or views to process top-k queries.
Since a concrete index or view is constructed on a specific subset of attributes, indexes or views of exponential order with respect to the attribute number have to be built to cover the actual queries, which is prohibitively expensive. In practice, the indexes or views can only be built on a small and selective set of attribute combinations.
Sorted-list-based methods retrieve the sorted lists in a round-robin fashion, maintain the retrieved tuples, and update their lower-bound and upper-bound scores. When the kth largest lower-bound score is not less than the upper-bound scores of the other candidates, the k candidates with the largest lower-bound scores are the top-k results.
Sorted-list-based methods compute top-k results by retrieving the involved sorted lists and thus naturally support the actual queries. However, this paper shows that the numbers of tuples retrieved and maintained in these methods increase exponentially with the attribute number, and polynomially with the tuple number and result size.
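The round-robin lower/upper-bound procedure described above can be sketched as follows. This is a minimal NRA-style implementation assuming F is the sum of non-negative attribute values; all names are illustrative, not the paper's code.

```python
def nra_topk(data, k):
    """Sorted-list-based top-k (NRA-style sketch), with F = sum of attributes.

    data: {tid: tuple of non-negative attribute values}.
    """
    m = len(next(iter(data.values())))
    # One list per attribute, sorted by that attribute in descending order.
    lists = [sorted(data, key=lambda t: -data[t][a]) for a in range(m)]
    seen = {}                      # tid -> {attribute index: known value}
    bottoms = [0] * m              # last value read from each sorted list
    for depth in range(len(data)):
        for a in range(m):         # round-robin sequential access
            tid = lists[a][depth]
            seen.setdefault(tid, {})[a] = data[tid][a]
            bottoms[a] = data[tid][a]
        # Lower bound: unknown attributes contribute 0.
        lower = {t: sum(v.values()) for t, v in seen.items()}
        # Upper bound: unknown attributes contribute the list bottoms.
        upper = {t: sum(v.get(a, bottoms[a]) for a in range(m))
                 for t, v in seen.items()}
        ranked = sorted(seen, key=lambda t: -lower[t])
        if len(ranked) >= k:
            kth = lower[ranked[k - 1]]
            others_ok = all(upper[t] <= kth for t in ranked[k:])
            unseen_ok = sum(bottoms) <= kth   # no unseen tuple can do better
            if others_ok and unseen_ok:       # termination condition
                return ranked[:k]
    return ranked[:k]
```

The candidate bookkeeping (the `seen` map plus the bound recomputation each round) is exactly the cost that grows quickly with attribute number, which motivates the table-scan approach below.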
Disadvantages:
- High computational overhead.
- High data redundancy.
- Time-consuming processing.
Problem Definition:
Ranking is a central part of many information retrieval problems, such as document retrieval, collaborative filtering, sentiment analysis, computational advertising (online ad placement).
Training data consists of queries and documents matching them, together with the relevance degree of each match. It may be prepared manually by human assessors (or raters, as Google calls them), who check results for some queries and determine the relevance of each result. It is not feasible to check the relevance of all documents, so typically a technique called pooling is used: only the top few documents retrieved by some existing ranking models are checked.
Typically, users expect a search query to complete in a short time (such as a few hundred milliseconds for web search), which makes it impossible to evaluate a complex ranking model on each document in the corpus, and so a two-phase scheme is used.
Proposed System:
Our proposed system uses layered indexing to organize the tuples into multiple consecutive layers, so that the top-k results can be computed from at most k layers of tuples. We also propose a layer-based Pareto-Based Dominant Graph to express the dominance relationship between records, with which a top-k query is implemented as a graph traversal problem.
We then propose a dual-resolution layer structure: a top-k query can be processed efficiently by traversing the dual-resolution layer through the relationships between tuples. We further propose the Hybrid-Layer Index, which integrates layer-level filtering and list-level filtering to significantly reduce the number of tuples retrieved in query processing, and view-based algorithms that pre-construct specified materialized views according to some ranking functions.
Given a top-k query, one or more optimal materialized views are selected to return the top-k results efficiently. We propose LPTA+ to significantly improve the efficiency of the state-of-the-art LPTA algorithm. Because the materialized views are cached in memory, LPTA+ can reduce the iterative calling of the linear-programming sub-procedure, thus greatly improving the efficiency over the LPTA algorithm. In practical applications, a concrete index (or view) is built on a specific subset of attributes. Due to the prohibitively expensive overhead of covering all attribute combinations, the indexes (or views) can only be built on a small and selective set of attribute combinations.
If the attribute combinations of top-k queries are fixed, index-based or view-based methods can provide superior performance. However, on massive data, users often issue ad-hoc queries, so it is very likely that the indexes (or views) involved in an ad-hoc query have not been built, which greatly limits the practicability of these methods.
Correspondingly, T2S builds only the presorted table, on which a top-k query on any attribute combination can be processed. This reduces the space overhead significantly compared with index-based (or view-based) methods, and makes T2S practical.
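A minimal sketch of the presorted-table idea, in our own simplification with F as the sum of attribute values: tuples are laid out in round-robin order of the sorted lists, scanned sequentially, and the scan stops early once the k-th best score reaches the round-robin threshold.

```python
import heapq

def build_pt(data):
    """Presorted table: tuple ids in round-robin retrieval order."""
    m = len(next(iter(data.values())))
    lists = [sorted(data, key=lambda t: -data[t][a]) for a in range(m)]
    pt, added = [], set()
    for depth in range(len(data)):
        for a in range(m):
            tid = lists[a][depth]
            if tid not in added:          # keep first appearance only
                added.add(tid)
                pt.append((tid, depth))
    return pt, lists

def t2s_topk(data, k):
    """Table-scan top-k sketch with early termination, F = sum of attributes."""
    pt, lists = build_pt(data)
    m = len(lists)
    heap = []                             # min-heap of (score, tid), size <= k
    for tid, depth in pt:
        score = sum(data[tid])
        if len(heap) < k:
            heapq.heappush(heap, (score, tid))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, tid))
        # Any later tuple first appears at round-robin depth >= depth, so its
        # score is at most the sum of the list values at this depth.
        threshold = sum(data[lists[a][depth]][a] for a in range(m))
        if len(heap) == k and heap[0][0] >= threshold:
            break                         # early termination
    return sorted(heap, reverse=True)
```

Only the fixed-size heap is maintained during the scan, which mirrors the claim that T2S keeps a fixed and small number of candidates.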
Advantages:
- The evaluation of an information retrieval system is the process of assessing how well a system meets the information needs of its users.
- Traditional evaluation metrics, designed for Boolean retrieval or top-k retrieval, include precision and recall.
- All common measures described here assume a ground truth notion of relevancy: every document is known to be either relevant or non-relevant to a particular query.
Modules:
Multi-keyword ranked search:
To design search schemes which allow multi-keyword query and provide result similarity ranking for effective data retrieval, instead of returning undifferentiated results.
Privacy-preserving:
To prevent the cloud server from learning additional information from the data set and the index, and to meet privacy requirements. If the cloud server deduces any association between keywords and encrypted documents from the index, it may learn the major subject of a document, or even the content of a short document. Therefore, the searchable index should be constructed to prevent the cloud server from performing such association attacks.
Efficiency:
The above goals on functionality and privacy should be achieved with low communication and computation overhead. Assume the number of query keywords appearing in a document is x_i; the final similarity score is then a linear function of x_i, where the coefficient r is set as a positive random number. However, because a random factor ε_i is introduced as part of the similarity score, the final search result based on sorting similarity scores may not be as accurate as that in the original scheme. For the consideration of search accuracy, we can let ε_i follow a normal distribution, where the standard deviation functions as a flexible trade-off parameter between search accuracy and security.
Conclusion:
The proposed novel T2S algorithm was successfully implemented to efficiently return top-k results on massive data by sequentially scanning the presorted table, in which the tuples are arranged in the order of round-robin retrieval on the sorted lists. Only a fixed number of candidates needs to be maintained in T2S. This paper proposes early termination checking and the analysis of the scan depth. Selective retrieval is devised in T2S, and it is shown that most of the candidates in the presorted table can be skipped. The experimental results show that T2S significantly outperforms the existing algorithms.
Future Enhancement:
In future development, the multi-keyword ranked search scheme should explore checking the integrity of the rank order in the search result returned from the untrusted network server infrastructure.
Feature Enhancement:
A novel table-scan-based T2S algorithm was implemented successfully to compute top-k results on massive data efficiently. Given table T, T2S first presorts T to obtain table PT (Presorted Table), whose tuples are arranged in the order of the round-robin retrieval on the sorted lists. During its execution, T2S maintains only a fixed and small number of tuples to compute the results. It is proved that T2S has the characteristic of early termination: it does not need to examine all tuples in PT to return the results.