RANK-BASED SIMILARITY SEARCH REDUCING THE DIMENSIONAL DEPENDENCE

ABSTRACT:

This paper introduces a data structure for k-NN search, the Rank Cover Tree (RCT), whose pruning tests rely solely on the comparison of similarity values; other properties of the underlying space, such as the triangle inequality, are not employed. Objects are selected according to their ranks with respect to the query object, allowing much tighter control on the overall execution costs. A formal theoretical analysis shows that with very high probability, the RCT returns a correct query result in time that depends very competitively on a measure of the intrinsic dimensionality of the data set. The experimental results for the RCT show that non-metric pruning strategies for similarity search can be practical even when the representational dimension of the data is extremely high. They also show that the RCT is capable of meeting or exceeding the level of performance of state-of-the-art methods that make use of metric pruning or other selection tests involving numerical constraints on distance values.

INTRODUCTION

Of the fundamental operations employed in data mining tasks such as classification, cluster analysis, and anomaly detection, perhaps the most widely-encountered is that of similarity search. Similarity search is the foundation of k-nearest-neighbor (k-NN) classification, which often produces competitively-low error rates in practice, particularly when the number of classes is large. The error rate of nearest-neighbor classification has been shown to be ‘asymptotically optimal’ as the training set size increases. For clustering, many of the most effective and popular strategies require the determination of neighbor sets based at a substantial proportion of the data set objects: examples include hierarchical (agglomerative) methods such as content-based filtering methods for recommender systems and anomaly detection methods commonly make use of k-NN techniques, either through the direct use of k-NN search, or by means of k-NN cluster analysis.

A very popular density-based measure, the Local Outlier Factor (LOF), relies heavily on k-NN set computation to determine the relative density of the data in the vicinity of the test point [8]. For data mining applications based on similarity search, data objects are typically modeled as feature vectors of attributes for which some measure of similarity is defined Motivated at least in part by the impact of similarity search on problems in data mining, machine learning, pattern recognition, and statistics, the design and analysis of scalable and effective similarity search structures has been the subject of intensive research for many decades. Until relatively recently, most data structures for similarity search targeted low-dimensional real vector space representations and the euclidean or other Lp distance metrics.

However, many public and commercial data sets available today are more naturally represented as vectors spanning many hundreds or thousands of feature attributes that can be real or integer-valued, ordinal or categorical, or even a mixture of these types. This has spurred the development of search structures for more general metric spaces, such as the MultiVantage-Point Tree, the Geometric Near-neighbor Access Tree (GNAT), Spatial Approximation Tree (SAT), the M-tree, and (more recently) the Cover Tree (CT). Despite their various advantages, spatial and metric search structures are both limited by an effect often referred to as the curse of dimensionality.

One way in which the curse may manifest itself is in a tendency of distances to concentrate strongly around their mean values as the dimension increases. Consequently, most pairwise distances become difficult to distinguish, and the triangle inequality can no longer be effectively used to eliminate candidates from consideration along search paths. Evidence suggests that when the representational dimension of feature vectors is high (roughly 20 or more traditional similarity search accesses an unacceptably-high proportion of the data elements, unless the underlying data distribution has special properties. Even though the local neighborhood information employed by data mining applications is useful and meaningful, high data dimensionality tends to make this local information very expensive to obtain.

The performance of similarity search indices depends crucially on the way in which they use similarity information for the identification and selection of objects relevant to the query. Virtually all existing indices make use of numerical constraints for pruning and selection. Such constraints include the triangle inequality (a linear constraint on three distance values), other bounding surfaces defined in terms of distance (such as hypercubes or hyperspheres), range queries involving approximation factors as in Locality-Sensitive Hashing (LSH) or absolute quantities as additive distance terms. One serious drawback of such operations based on numerical constraints such as the triangle inequality or distance ranges is that the number of objects actually examined can be highly variable, so much so that the overall execution time cannot be easily predicted.

Similarity search, researchers and practitioners have investigated practical methods for speeding up the computation of neighborhood information at the expense of accuracy. For data mining applications, the approaches considered have included feature sampling for local outlier detection, data sampling for clustering, and approximate similarity search for k-NN classification. Examples of fast approximate similarity search indices include the BD-Tree, a widely-recognized benchmark for approximate k-NN search; it makes use of splitting rules and early termination to improve upon the performance of the basic KD-Tree. One of the most popular methods for indexing, Locality-Sensitive Hashing can also achieve good practical search performance for range queries by managing parameters that influence a tradeoff between accuracy and time.

HARDWARE & SOFTWARE REQUIREMENTS:

HARDWARE REQUIREMENT:

v    Processor                                 –    Pentium –IV

  • Speed       –    1 GHz
  • RAM       –    256 MB (min)
  • Hard Disk      –   20 GB
  • Floppy Drive       –    44 MB
  • Key Board      –    Standard Windows Keyboard
  • Mouse       –    Two or Three Button Mouse
  • Monitor      –    SVGA

SOFTWARE REQUIREMENTS:

JAVA

  • Operating System        :           Windows XP or Win7
  • Front End       :           JAVA JDK 1.7
  • Back End :           MYSQL Server
  • Server :           Apache Tomact Server
  • Script :           JSP Script
  • Document :           MS-Office 2007

.NET

  • Operating System        :           Windows XP or Win7
  • Front End       :           Microsoft Visual Studio .NET 2008
  • Script :           C# Script
  • Back End :           MS-SQL Server 2005
  • Document :           MS-Office 2007

PSMPA: PATIENT SELF-CONTROLLABLE AND MULTI-LEVEL PRIVACY-PRESERVING COOPERATIVE AUTHENTICATION IN DISTRIBUTED M-HEALTHCARE CLOUD COMPUTING SYSTEM

ABSTRACT:

The Distributed m-healthcare cloud computing system considerably facilitates secure and efficient patient treatment for medical consultation by sharing personal health information among the healthcare providers. This system should bring about the challenge of keeping both the data confidentiality and patients’ identity privacy simultaneously. Many existing access control and anonymous authentication schemes cannot be straightforwardly exploited. To solve the problem proposed a novel authorized accessible privacy model (AAPM) is established. Patients can authorize physicians by setting an access tree supporting flexible threshold predicates.

Our new technique of attribute based designated verifier signature, a patient self-controllable multi-level privacy preserving cooperative authentication scheme (PSMPA) realizing three levels of security and privacy requirement in distributed m-healthcare cloud computing system is proposed. The directly authorized physicians, the indirectly authorized physicians and the unauthorized persons in medical consultation can respectively decipher the personal health information and/or verify patients’ identities by satisfying the access tree with their own attribute sets.

INTRODUCTION:

Distributed m-healthcare cloud computing systems have been increasingly adopted worldwide including the European Commission activities, the US Health Insurance Portability and Accountability Act (HIPAA) and many other governments for efficient and high-quality medical treatment. In m-healthcare social networks, the personal health information is always shared among the patients located in respective social communities suffering from the same disease for mutual support, and across distributed healthcare providers (HPs) equipped with their own cloud servers for medical consultant. However, it also brings about a series of challenges, especially how to ensure the security and privacy of the patients’ personal health information from various attacks in the wireless communication channel such as eavesdropping and tampering As to the security facet, one of the main issues is access control of patients’ personal health information, namely it is only the authorized physicians or institutions that can recover the patients’ personal health information during the data sharing in the distributed m-healthcare cloud computing system. In practice, most patients are concerned about the confidentiality of their personal health information since it is likely to make them in trouble for each kind of unauthorized collection and disclosure.

Therefore, in distributed m-healthcare cloud computing systems, which part of the patients’ personal health information should be shared and which physicians their personal health information should be shared with have become two intractable problems demanding urgent solutions. There has emerged various research results focusing on them. A fine-grained distributed data access control scheme is proposed using the technique of attribute based encryption (ABE). A rendezvous-based access control method provides access privilege if and only if the patient and the physician meet in the physical world. Recently, a patient-centric and fine-grained data access control in multi-owner settings is constructed for securing personal health records in cloud computing. However, it mainly focuses on the central cloud computing system which is not sufficient for efficiently processing the increasing volume of personal health information in m-healthcare cloud computing system.

Moreover, it is not enough for to only guarantee the data confidentiality of the patient’s personal health information in the honest-but-curious cloud server model since the frequent communication between a patient and a professional physician can lead the adversary to conclude that the patient is suffering from a specific disease with a high probability. Unfortunately, the problem of how to protect both the patients’ data confidentiality and identity privacy in the distributed m-healthcare cloud computing scenario under the malicious model was left untouched.

In this paper, we consider simultaneously achieving data confidentiality and identity privacy with high efficiency. As is described in Fig. 1, in distributed m-healthcare cloud computing systems, all the members can be classified into three categories: the directly authorized physicians with green labels in the local healthcare provider who are authorized by the patients and can both access the patient’s personal health information and verify the patient’s identity and the indirectly authorized physicians with yellow labels in the remote healthcare providers who are authorized by the directly authorized physicians for medical consultant or some research purposes (i.e., since they are not authorized by the patients, we use the term ‘indirectly authorized’ instead). They can only access the personal health information, but not the patient’s identity. For the unauthorized persons with red labels, nothing could be obtained. By extending the techniques of attribute based access control and designated verifier signatures (DVS) on de-identified health information

LITRATURE SURVEY

SECURING PERSONAL HEALTH RECORDS IN CLOUD COMPUTING: PATIENT-CENTRIC AND FINE-GRAINED DATA ACCESS CONTROL IN MULTI-OWNER SETTINGS

AUTHOR: M. Li, S. Yu, K. Ren, and W. Lou

PUBLISH: Proc. 6th Int. ICST Conf. Security Privacy Comm. Netw., 2010, pp. 89–106.

EXPLANATION:

Online personal health record (PHR) enables patients to manage their own medical records in a centralized way, which greatly facilitates the storage, access and sharing of personal health data. With the emergence of cloud computing, it is attractive for the PHR service providers to shift their PHR applications and storage into the cloud, in order to enjoy the elastic resources and reduce the operational cost. However, by storing PHRs in the cloud, the patients lose physical control to their personal health data, which makes it necessary for each patient to encrypt her PHR data before uploading to the cloud servers. Under encryption, it is challenging to achieve fine-grained access control to PHR data in a scalable and efficient way. For each patient, the PHR data should be encrypted so that it is scalable with the number of users having access. Also, since there are multiple owners (patients) in a PHR system and every owner would encrypt her PHR files using a different set of cryptographic keys, it is important to reduce the key distribution complexity in such multi-owner settings. Existing cryptographic enforced access control schemes are mostly designed for the single-owner scenarios. In this paper, we propose a novel framework for access control to PHRs within cloud computing environment. To enable fine-grained and scalable access control for PHRs, we leverage attribute based encryption (ABE) techniques to encrypt each patients’ PHR data. To reduce the key distribution complexity, we divide the system into multiple security domains, where each domain manages only a subset of the users. In this way, each patient has full control over her own privacy, and the key management complexity is reduced dramatically.

PRIVACY AND EMERGENCY RESPONSE IN E-HEALTHCARE LEVERAGING WIRELESS BODY SENSOR NETWORKS

AUTHOR: J. Sun, Y. Fang, and X. Zhu

PUBLISH: IEEE Wireless Commun., vol. 17, no. 1, pp. 66–73, Feb. 2010.

EXPLANATION:

Electronic healthcare is becoming a vital part of our living environment and exhibits advantages over paper-based legacy systems. Privacy is the foremost concern of patients and the biggest impediment to e-healthcare deployment. In addressing privacy issues, conflicts from the functional requirements must be taken into account. One such requirement is efficient and effective response to medical emergencies. In this article, we provide detailed discussions on the privacy and security issues in e-healthcare systems and viable techniques for these issues. Furthermore, we demonstrate the design challenge in the fulfillment of conflicting goals through an exemplary scenario, where the wireless body sensor network is leveraged, and a sound solution is proposed to overcome the conflict.

HCPP: CRYPTOGRAPHY BASED SECURE EHR SYSTEM FOR PATIENT PRIVACY AND EMERGENCY HEALTHCARE

AUTHOR: J. Sun, X. Zhu, C. Zhang, and Y. Fang

PUBLISH: Proc. 31st Int. Conf. Distrib. Comput. Syst., 2011, pp. 373–382.

EXPLANATION:

Privacy concern is arguably the major barrier that hinders the deployment of electronic health record (EHR) systems which are considered more efficient, less error-prone, and of higher availability compared to traditional paper record systems. Patients are unwilling to accept the EHR system unless their protected health information (PHI) containing highly confidential data is guaranteed proper use and disclosure, which cannot be easily achieved without patients’ control over their own PHI. However, cautions must be taken to handle emergencies in which the patient may be physically incompetent to retrieve the controlled PHI for emergency treatment. In this paper, we propose a secure EHR system, HCPP (Healthcaresystem for Patient Privacy), based on cryptographic constructions and existing wireless network infrastructures, to provide privacy protection to patients under any circumstances while enabling timelyPHI retrieval for life-saving treatment in emergency situations. Furthermore, our HCPP system restricts PHI access to authorized (not arbitrary) physicians, who can be traced and held accountable if the accessed PHI is found improperly disclosed. Last but not least, HCPP leverages wireless network access to support efficient and private storage/retrieval of PHI, which underlies a secure and feasible EHR system.

PRIVACY-PRESERVING DETECTION OF SENSITIVE DATA EXPOSURE

ABSTRACT:

Statistics from security firms, research institutions and government organizations show that the numbers of data-leak instances have grown rapidly in recent years. Among various data-leak cases, human mistakes are one of the main causes of data loss. There exist solutions detecting inadvertent sensitive data leaks caused by human mistakes and to provide alerts for organizations. A common approach is to screen content in storage and transmission for exposed sensitive information. Such an approach usually requires the detection operation to be conducted in secrecy. However, this secrecy requirement is challenging to satisfy in practice, as detection servers may be compromised or outsourced.

In this paper, we present a privacy preserving data-leak detection (DLD) solution to solve the issue where a special set of sensitive data digests is used in detection. The advantage of our method is that it enables the data owner to safely delegate the detection operation to a semihonest provider without revealing the sensitive data to the provider. We describe how Internet service providers can offer their customers DLD as an add-on service with strong privacy guarantees. The evaluation results show that our method can support accurate detection with very small number of false alarms under various data-leak scenarios.

INTRODUCTION

According to a report from Risk Based Security (RBS), the number of leaked sensitive data records has increased dramatically during the last few years, i.e., from 412 million in 2012 to 822 million in 2013. Deliberately planned attacks, inadvertent leaks (e.g., forwarding confidential emails to unclassified email accounts), and human mistakes (e.g., assigning the wrong privilege) lead to most of the data-leak incidents. Detecting and preventing data leaks requires a set of complementary solutions, which may include data-leak detection, data confinement, stealthy malware detection and policy enforcement.

Network data-leak detection (DLD) typically performs deep packet inspection (DPI) and searches for any occurrences of sensitive data patterns. DPI is a technique to analyze payloads of IP/TCP packets for inspecting application layer data, e.g., HTTP header/content. Alerts are triggered when the amount of sensitive data found in traffic passes a threshold. The detection system can be deployed on a router or integrated into existing network intrusion detection systems (NIDS). Straightforward realizations of data-leak detection require the plaintext sensitive data.

However, this requirement is undesirable, as it may threaten the confidentiality of the sensitive information. If a detection system is compromised, then it may expose the plaintext sensitive data (in memory). In addition, the data owner may need to outsource the data-leak detection to providers, but may be unwilling to reveal the plaintext sensitive data to them. Therefore, one needs new data-leak detection solutions that allow the providers to scan content for leaks without learning the sensitive information.

In this paper, we propose a data-leak detection solution which can be outsourced and be deployed in a semihonest detection environment. We design, implement, and evaluate our fuzzy fingerprint technique that enhances data privacy during data-leak detection operations. Our approach is based on a fast and practical one-way computation on the sensitive data (SSN records, classified documents, sensitive emails, etc.). It enables the data owner to securely delegate the content-inspection task to DLD providers without exposing the sensitive data. Using our detection method, the DLD provider, who is modeled as an honest-but-curious (aka semi-honest) adversary, can only gain limited knowledge about the sensitive data from either the released digests, or the content being inspected. Using our techniques, an Internet service provider (ISP) can perform detection on its customers’ traffic securely and provide data-leak detection as an add-on service for its customers. In another scenario, individuals can mark their own sensitive data and ask the administrator of their local network to detect data leaks for them.

In our detection procedure, the data owner computes a special set of digests or fingerprints from the sensitive data and then discloses only a small amount of them to the DLD provider. The DLD provider computes fingerprints from network traffic and identifies potential leaks in them. To prevent the DLD provider from gathering exact knowledge about the sensitive data, the collection of potential leaks is composed of real leaks and noises. It is the data owner, who post-processes the potential leaks sent back by the DLD provider and determines whether there is any real data leak.

Our contributions are summarized as follows.

1) We describe a privacy-preserving data-leak detection model for preventing inadvertent data leak in network traffic. Our model supports detection operation delegation and ISPs can provide data-leak detection as an add-on service to their customers using our model. We design, implement, and evaluate an efficient technique, fuzzy fingerprint, for privacy-preserving data-leak detection. Fuzzy fingerprints are special sensitive data digests prepared by the data owner for release to the DLD provider.

2) We implement our detection system and perform extensive experimental evaluation on 2.6 GB Enron dataset, Internet surfing traffic of 20 users, and also 5 simulated real-worlds data-leak scenarios to measure its privacy guarantee, detection rate and efficiency. Our results indicate high accuracy achieved by our underlying scheme with very low false positive rate. Our results also show that the detection accuracy does not degrade much when only partial (sampled) sensitive-data digests are used. In addition, we give an empirical analysis of our fuzzification as well as of the fairness of fingerprint partial disclosure.

SYSTEM ANALYSIS

EXISTING SYSTEM:

  • Existing detecting and preventing data leaks requires a set of complementary solutions, which may include data-leak detection, data confinement, stealthy malware detection, and policy enforcement.
  • Network data-leak detection (DLD) typically performs deep packet inspection (DPI) and searches for any occurrences of sensitive data patterns. DPI is a technique to analyze payloads of IP/TCP packets for inspecting application layer data, e.g., HTTP header/content.
  • Alerts are triggered when the amount of sensitive data found in traffic passes a threshold. The detection system can be deployed on a router or integrated into existing network intrusion detection systems (NIDS).
  • Straightforward realizations of data-leak detection require the plaintext sensitive data. However, this requirement is undesirable, as it may threaten the confidentiality of the sensitive information. If a detection system is compromised, then it may expose the plaintext sensitive data (in memory).
  • In addition, the data owner may need to outsource the data-leak detection to providers, but may be unwilling to reveal the plaintext sensitive data to them. Therefore, one needs new data-leak detection solutions that allow the providers to scan content for leaks without learning the sensitive information.

DISADVANTAGES:

  • As the Internet grows and network bandwidth continues to increase, administrators are faced with the task of keeping confidential information from leaving their networks. In response, researchers have created data loss prevention systems that check outgoing traffic for known confidential information.
  • These systems stop naive adversaries from leaking data, but are fundamentally unable to identify encrypted or obfuscated information leaks. What remains is a high-capacity pipe for tunneling data to the Internet.
  • Existing approach for quantifying information leak capacity in network traffic instead of trying to detect the presence of sensitive data-an impossible task in the general case–our goal is to measure and constrain its maximum volume.
  • We take disadvantage of the insight that most network traffic is repeated or determined by external information, such as protocol specifications or messages sent by a server. By filtering this data, we can isolate and quantify true information flowing from a computer.

PROPOSED SYSTEM:

  • We propose a data-leak detection solution which can be outsourced and be deployed in a semihonest detection environment. We design, implement, and evaluate our fuzzy fingerprint technique that enhances data privacy during data-leak detection operations.
  • Our approach is based on a fast and practical one-way computation on the sensitive data (SSN records, classified documents, sensitive emails, etc.). It enables the data owner to securely delegate the content-inspection task to DLD providers without exposing the sensitive data.
  • Our detection method, the DLD provider, who is modeled as an honest-but-curious (aka semi-honest) adversary, can only gain limited knowledge about the sensitive data from either the released digests, or the content being inspected. Using our techniques, an Internet service provider (ISP) can perform detection on its customers’ traffic securely and provide data-leak detection as an add-on service for its customers. In another scenario, individuals can mark their own sensitive data and ask the administrator of their local network to detect data leaks for them.
  • Our detection procedure, the data owner computes a special set of digests or fingerprints from the sensitive data and then discloses only a small amount of them to the DLD provider. The DLD provider computes fingerprints from network traffic and identifies potential leaks in them.
  • To prevent the DLD provider from gathering exact knowledge about the sensitive data, the collection of potential leaks is composed of real leaks and noises. It is the data owner, who post-processes the potential leaks sent back by the DLD provider and determines whether there is any real data leak.

ADVANTAGES:

  • We describe privacy-preserving data-leak detection model for preventing inadvertent data leak in network traffic. Our model supports detection operation delegation and ISPs can provide data-leak detection as an add-on service to their customers using our model.
  • We design, implement, and evaluate an efficient technique, fuzzy fingerprint, for privacy-preserving data-leak detection. Fuzzy fingerprints are special sensitive data digests prepared by the data owner for release to the DLD provider.
  • We implement our detection system and perform extensive experimental evaluation on internet surfing traffic of 20 users, and also 5 simulated real-worlds data-leak scenarios to measure its privacy guarantee, detection rate and efficiency.
  • Our results indicate high accuracy achieved by our underlying scheme with very low false positive rate. Our results also show that the detection accuracy does not degrade much when only partial (sampled) sensitive-data digests are used an empirical analysis of our fuzzification as well as of the fairness of fingerprint partial disclosure.

HARDWARE & SOFTWARE REQUIREMENTS:

HARDWARE REQUIREMENT:

   Processor                                 –    Pentium –IV

  • Speed       –    1 GHz
  • RAM       –    256 MB (min)
  • Hard Disk       –   20 GB
  • Floppy Drive        –    44 MB
  • Key Board       –    Standard Windows Keyboard
  • Mouse        –    Two or Three Button Mouse
  • Monitor       –    SVGA

SOFTWARE REQUIREMENTS:

  • Operating System          :           Windows XP or Win7
  • Front End        :           Microsoft Visual Studio .NET  
  • Back End :           MS-SQL Server
  • Server :           ASP .NET Web Server
  • Script :           C# Script
  • Document :           MS-Office 2007

PASSIVE IP TRACEBACK: DISCLOSING THE LOCATIONS OF IP SPOOFERS FROM PATH BACKSCATTER

 ABSTRACT:

It is long known attackers may use forged source IP address to conceal their real locations. To capture the spoofers, a number of IP traceback mechanisms have been proposed. However, due to the challenges of deployment, there has been not a widely adopted IP traceback solution, at least at the Internet level. As a result, the mist on the locations of spoofers has never been dissipated till now.

This paper proposes passive IP traceback (PIT) that bypasses the deployment difficulties of IP traceback techniques. PIT investigates Internet Control Message Protocol error messages (named path backscatter) triggered by spoofing traffic, and tracks the spoofers based on public available information (e.g., topology). In this way, PIT can find the spoofers without any deployment requirement.

This paper illustrates the causes, collection, and the statistical results on path backscatter, demonstrates the processes and effectiveness of PIT, and shows the captured locations of spoofers through applying PIT on the path backscatter data set.

These results can help further reveal IP spoofing, which has been studied for long but never well understood. Though PIT cannot work in all the spoofing attacks, it may be the most useful mechanism to trace spoofers before an Internet-level traceback system has been deployed in real.

 INTRODUCTION

IP spoofing, which means attackers launching attacks with forged source IP addresses, has been recognized as a serious security problem on the Internet for long. By using addresses that are assigned to others or not assigned at all, attackers can avoid exposing their real locations, or enhance the effect of attacking, or launch reflection based attacks. A number of notorious attacks rely on IP spoofing, including SYN flooding, SMURF, DNS amplification etc. A DNS amplification attack which severely degraded the service of a Top Level Domain (TLD) name server is reported in though there has been a popular conventional wisdom that DoS attacks are launched from botnets and spoofing is no longer critical, the report of ARBOR on NANOG 50th meeting shows spoofing is still significant in observed DoS attacks. Indeed, based on the captured backscatter messages from UCSD Network Telescopes, spoofing activities are still frequently observed.

To capture the origins of IP spoofing traffic is of great importance. As long as the real locations of spoofers are not disclosed, they cannot be deterred from launching further attacks. Even just approaching the spoofers, for example, determining the ASes or networks they reside in, attackers can be located in a smaller area, and filters can be placed closer to the attacker before attacking traffic get aggregated. The last but not the least, identifying the origins of spoofing traffic can help build a reputation system for ASes, which would be helpful to push the corresponding ISPs to verify IP source address.

Instead of proposing another IP traceback mechanism with improved tracking capability, we propose a novel solution, named Passive IP Traceback (PIT), to bypass the challenges in deployment. Routers may fail to forward an IP spoofing packet due to various reasons, e.g., TTL exceeding. In such cases, the routers may generate an ICMP error message (named path backscatter) and send the message to the spoofed source address. Because the routers can be close to the spoofers, the path backscatter messages may potentially disclose the locations of the spoofers. PIT exploits these path backscatter messages to find the location of the spoofers. With the locations of the spoofers known, the victim can seek help from the corresponding ISP to filter out the attacking packets, or take other counterattacks. PIT is especially useful for the victims in reflection based spoofing attacks, e.g., DNS amplification attacks. The victims can find the locations of the spoofers directly from the attacking traffic.

In this article, at first we illustrate the generation, types, collection, and the security issues of path backscatter messages in section III. Then in section IV, we present PIT, which tracks the location of the spoofers based on path backscatter messages together with the topology and routing information. We discuss how to apply PIT when both topology and routing are known, or only topology is known, or neither are known respectively. We also present two effective algorithms to apply PIT in large scale networks. In the following section, at first we show the statistical results on path backscatter messages. Then we evaluate the two key mechanisms of PIT which work without routing information. At last, we give the tracking result when applying PIT on the path backscatter message dataset: a number of ASes in which spoofers are found.

Our work has the following contributions:

1) This is the first article known which deeply investigates path backscatter messages. These messages are valuable to help understand spoofing activities. Though Moore et al. [8] has exploited backscatter messages, which are generated by the targets of spoofing messages, to study Denial of Services (DoS), path backscatter messages, which are sent by intermediate devices rather than the targets, have not been used in traceback. 2) A practical and effective IP traceback solution based on path backscatter messages, i.e., PIT, is proposed. PIT bypasses the deployment difficulties of existing IP traceback mechanisms and actually is already in force. Though given the limitation that path backscatter messages are not generated with stable possibility, PIT cannot work in all the attacks, but it does work in a number of spoofing activities. At least it may be the most useful traceback mechanism before an AS-level traceback system has been deployed in real. 3) Through applying PIT on the path backscatter dataset, a number of locations of spoofers are captured and presented. Though this is not a complete list, it is the first known list disclosing the locations of spoofers.

PANDA: PUBLIC AUDITING FOR SHARED DATA WITH EFFICIENT USER REVOCATION IN THE CLOUD

ABSTRACT:

With data storage and sharing services in the cloud, users can easily modify and share data as a group. To ensure share data integrity can be verified publicly, users in the group need to compute signatures on all the blocks in shared data. Different blocks in shared data are generally signed by different users due to data modifications performed by different users. For security reasons, once a user is revoked from the group, the blocks which were previously signed by this revoked user must be re-signed by an existing user. The straight forward method, which allows an existing user to download the corresponding part of shared data and re-sign it during user revocation, is inefficient due to the large size of shared data in the cloud. In this paper, we propose a novel public auditing mechanism

For the integrity of shared data with efficient user revocation in mind. By utilizing the idea of proxy re-signatures, we allow the cloud tore-sign blocks on behalf of existing users during user revocation, so that existing users do not need to download and re-sign blocks by themselves. In addition, a public verifier is always able to audit the integrity of shared data without retrieving the entire data from the

Cloud, even if some part of shared data has been re-signed by the cloud. Moreover, our mechanism is able to support batch auditing by verifying multiple auditing tasks simultaneously. Experimental results show that our mechanism can significantly improve the efficiency of user revocation.

INTRODUCTION

With data storage and sharing services (such as Dropbox and Google Drive) provided by the cloud, people can easily work together as a group by sharing data with each other. More specifically, once a user creates shared data in the cloud, every user in the group is able to not only access and modify shared data, but also share the latest version of the shared data with the rest of the group. Although cloud providers promise a more secure and reliable environment to the users, the integrity of data in the cloud may still be compromised, due to the existence of hardware/software failures and human errors.

To protect the integrity of data in the cloud, a number of mechanisms have been proposed. In these mechanisms, a signature is attached to each block in data, and the integrity of data relies on the correctness of all the signatures. One of the most significant and common features of these mechanisms is to allow a public verifier to efficiently check data integrity in the cloud without downloading the entire data, referred to as public auditing (or denoted as Provable Data Possession). This public verifier could be a client who would like to utilize cloud data for particular purposes (e.g., search, computation, data mining, etc.) or a thirdparty auditor (TPA) who is able to provide verification services on data integrity to users. Most of the previous works focus on auditing the integrity of personal data. Different from these works, several recent works focus on how to preserve identity privacy from public verifiers when auditing the integrity of shared data. Unfortunately, none of the above mechanisms, considers the efficiency of user revocation when auditing the correctness of shared data in the cloud.

With shared data, once a user modifies a block, she also needs to compute a new signature for the modified block. Due to the modifications from different users, different blocks are signed by different users. For security reasons, when a user leaves the group or misbehaves, this user must be revoked from the group. As a result, this revoked user should no longer be able to access and modify shared data, and the signatures generated by this revoked user are no longer valid to the group. Therefore, although the content of shared data is not changed during user revocation, the blocks, which were previously signed by the revoked user, still need to be re-signed by an existing user in the group. As a result, the integrity of the entire data can still be verified with the public keys of existing users only.

Since shared data is outsourced to the cloud and users no longer store it on local devices, a straightforward method to re-compute these signatures during user revocation is to ask an existing user to first download the blocks previously signed by the revoked user verify the correctness of these blocks, then re-sign these blocks, and finally upload the new signatures to the cloud. However, this straightforward method may cost the existing user a huge amount of communication and computation resources by downloading and verifying blocks, and by re-computing and uploading signatures, especially when the number of re-signed blocks is quite large or the membership of the group is frequently changing. To make this matter even worse, existing users may access their data sharing services provided by the cloud with resource limited devices, such as mobile phones, which further prevents existing users from maintaining the correctness of shared data efficiently during user revocation.

Clearly, if the cloud could possess each user’s private key, it can easily finish the re-signing task for existing users without asking them to download and re-sign blocks. However, since the cloud is not in the same trusted domain with each user in the group, outsourcing every user’s private key to the cloud would introduce significant security issues. Another important problem we need to consider is that the re-computation of any signature during user revocation should not affect the most attractive property of public auditing — auditing data integrity publicly without retrieving the entire data. Therefore, how to efficiently reduce the significant burden to existing users introduced by user revocation, and still allow a public verifier to check the integrity of shared data without downloading the entire data from the cloud, is a challenging task.

In this paper, we propose Panda, a novel public auditing mechanism for the integrity of shared data with efficient user revocation in the cloud. In our mechanism, by utilizing the idea of proxy re-signatures, once a user in the group is revoked, the cloud is able to resign the blocks, which were signed by the revoked user, with a re-signing key. As a result, the efficiency of user revocation can be significantly improved, and computation and communication resources of existing users can be easily saved. Meanwhile, the cloud, who is not in the same trusted domain with each user, is only able to convert a signature of the revoked user into a signature of an existing user on the same block, but it cannot sign arbitrary blocks on behalf of either the revoked user or an existing user. By designing a new proxy re-signature scheme with nice properties, which traditional proxy resignatures do no have, our mechanism is always able to check the integrity of shared data without retrieving the entire data from the cloud.

NEIGHBOR SIMILARITY TRUST AGAINST SYBIL ATTACK IN P2P E-COMMERCE

In this paper, we present a distributed structured approach to Sybil attack. This is derived from the fact that our approach is based on the neighbor similarity trust relationship among the neighbor peers. Given a P2P e-commerce trust relationship based on interest, the transactions among peers are flexible as each peer can decide to trade with another peer any time. A peer doesn’t have to consult others in a group unless a recommendation is needed. This approach shows the advantage in exploiting the similarity trust relationship among peers in which the peers are able to monitor each other.

Our contribution in this paper is threefold:

1) We propose SybilTrust that can identify and protect honest peers from Sybil attack. The Sybil peers can have their trust canceled and dismissed from a group.

2) Based on the group infrastructure in P2P e-commerce, each neighbor is connected to the peers by the success of the transactions it makes or the trust evaluation level. A peer can only be recognized as a neighbor depending on whether or not trust level is sustained over a threshold value.

3) SybilTrust enables neighbor peers to carry recommendation identifiers among the peers in a group. This ensures that the group detection algorithms to identify Sybil attack peers to be efficient and scalable in large P2P e-commerce networks.

Malware Propagation in Large-Scale Networks

Malware is pervasive in networks, and poses a critical threat to network security. However, we have very limited understanding of malware behavior in networks to date. In this paper, we investigate how malware propagate in networks from a global perspective. We formulate the problem, and establish a rigorous two layer epidemic model for malware propagation from network to network. Based on the proposed model, our analysis indicates that the distribution of a given malware follows exponential distribution, power law distribution with a short exponential

tail, and power law distribution at its early, late and final stages, respectively. Extensive experiments have been performed through two real-world global scale malware data sets, and the results confirm our theoretical findings.

LOSSLESS AND REVERSIBLE DATA HIDING IN ENCRYPTED IMAGES WITH PUBLIC KEY CRYPTOGRAPHY

This paper proposes a lossless, a reversible, and a combined data hiding schemes for ciphertext images encrypted by public key cryptosystems with probabilistic and homomorphic properties. In the lossless scheme, the ciphertext pixels are replaced with new values to embed the additional data into several LSB-planes of ciphertext pixels by multi-layer wet paper coding. Then, the embedded data can be directly extracted from the encrypted domain, and the data embedding operation does not affect the decryption of original plaintext image. In the reversible scheme, a preprocessing is employed to shrink the image histogram before image encryption, so that the modification on encrypted images for data embedding will not cause any pixel oversaturation in plaintext domain. Although a slight distortion is introduced, the embedded data can be extracted and the original image can be recovered from the directly decrypted image. Due to the compatibility between the lossless and reversible schemes, the data embedding operations in the two manners can be simultaneously performed in an encrypted image. With the combined technique, a receiver may extract a part of embedded data before decryption, and extract another part of embedded data and recover the original plaintext image after decryption.

Joint Beamforming, Power and Channel Allocation in Multi-User and Multi-Channel Underlay MISO Cognitive Radio Networks

In this paper, we consider a joint beamforming, power, and channel allocation in a multi-user and multi-channel underlay multiple input single output (MISO) cognitive radio network (CRN). In this system, primary users’ (PUs’) spectrum can be reused by the secondary user transmitters (SUTXs) to maximize the spectrum utilization while the intra-user interference is minimized by implementing beamforming at each SU-TX. After formulating the joint optimization problem as a non-convex, mixed integer nonlinear programming (MINLP) problem, we propose a solution which consists of two stages.

In the first stage, a feasible solution for power allocation and beamforming vectors is derived under a given channel allocation by converting the original problem into a convex form with an introduced optimal auxiliary variable and semidefinite relaxation (SDR) approach. After that, in the second stage, two explicit searching algorithms, i.e., genetic algorithm (GA) and simulated annealing (SA)-based algorithm, are proposed to determine suboptimal channel allocations. Simulation results show that
beamforming, power and channel allocation with SA (BPCA-SA) algorithm can achieve close-to-optimal sum-rate while having a lower computational complexity compared with beamforming, power and channel allocation with GA (BPCA-GA) algorithm.

Furthermore, our proposed allocation scheme has significant improvement in achievable sum-rate compared to the existing zero-forcing beamforming (ZFBF).

GDCLUSTER A GENERAL DECENTRALIZED CLUSTERING ALGORITHM

In many popular applications like peer-to-peer systems, large amounts of data are distributed among multiple sources. Analysis of this data and identifying clusters is challenging due to processing, storage, and transmission costs. In this paper, we propose GDCluster, a general fully decentralized clustering method, which is capable of clustering dynamic and distributed data sets. Nodes continuously cooperate through decentralized gossip-based communication to maintain summarized views of the data set. We customize GDCluster for execution of the partition-based and density-based clustering methods on the summarized views, and also offer enhancements to the basic algorithm. Coping with dynamic data is made possible by gradually adapting the clustering model. Our experimental evaluations show that GDCluster can discover the clusters efficiently with scalable transmission cost, and also expose its supremacy in comparison to the popular method LSP2P.

Efficient Top-k Retrieval on Massive Data

Top-k query is an important operation to return a set of interesting points in a potentially huge data space. It is analyzed in this paper that the existing algorithms cannot process top-k query on massive data efficiently. This paper proposes a novel table-scan-based T2S algorithm to efficiently compute top-k results on massive data. T2S first constructs the presorted table, whose tuples are arranged in the order of the round-robin retrieval on the sorted lists. T2S maintains only fixed number of tuples to compute results. The early termination checking for T2S is presented in this paper, along with the analysis of scan depth. The selective retrieval is devised to skip the tuples in the presorted table which are not top-k results. The theoretical analysis proves that selective retrieval can reduce the number of the retrieved tuples significantly. The construction and incremental-update/batch-processing methods for the used structures are proposed.

Effective Key Management in Dynamic Wireless Sensor Networks

Recently, wireless sensor networks (WSNs) have been deployed for a wide variety of applications, including military sensing and tracking, patient status monitoring, traffic flow monitoring, where sensory devices often move between different locations. Securing data and communications requires suitable encryption key protocols. In this paper, we propose a certificateless-effective key management (CL-EKM) protocol for secure communication in dynamic WSNs characterized by node mobility. The CL-EKM supports efficient key updates when a node leaves or joins a cluster and ensures forward and backward key secrecy. The protocol also supports efficient key revocation for compromised nodes and minimizes the impact of a node compromise on the security of other communication links. A security analysis of our scheme shows that our protocol is effective in defending against various attacks.We implement CL-EKM in Contiki OS and simulate it using Cooja simulator to assess its time, energy, communication, and memory performance.

Discovery of Ranking Fraud for Mobile Apps

Ranking fraud in the mobile App market refers to fraudulent or deceptive activities which have a purpose of bumping up the Apps in the popularity list. Indeed, it becomes more and more frequent for App developers to use shady means, such as inflating their Apps’ sales or posting phony App ratings, to commit ranking fraud. While the importance of preventing ranking fraud has been widely recognized, there is limited understanding and research in this area. To this end, in this paper, we provide a holistic view of ranking fraud and propose a ranking fraud detection system for mobile Apps. Specifically, we first propose to accurately locate the ranking fraud by mining the active periods, namely leading sessions, of mobile Apps. Such leading sessions can be leveraged for detecting the local anomaly instead of global anomaly of App rankings. Furthermore, we investigate three types of evidences, i.e., ranking based evidences, rating based evidences and review based evidences, by modeling Apps’ ranking, rating and review behaviors through statistical hypotheses tests. In addition, we propose an optimization based aggregation method to integrate all the evidences for fraud detection.

DETECTION AND RECTIFICATION OF DISTORTED FINGERPRINTS

Elastic distortion of fingerprints is one of the major causes for false non-match. While this problem affects all fingerprint recognition applications, it is especially dangerous in negative recognition applications, such as watchlist and deduplication applications. In such applications, malicious users may purposely distort their fingerprints to evade identification. In this paper, we proposed novel algorithms to detect and rectify skin distortion based on a single fingerprint image. Distortion detection is viewed as a two-class classification problem, for which the registered ridge orientation map and period map of a fingerprint are used as the feature vector and a SVM classifier is trained to perform the classification task.

Distortion rectification (or equivalently distortion field estimation) is viewed as a regression problem, where the input is a distorted fingerprint and the output is the distortion field. To solve this problem, a database (called reference database) of various distorted reference fingerprints and corresponding distortion fields is built in the offline stage, and then in the online stage, the nearest neighbor of the input fingerprint is found in the reference database and the corresponding distortion field is used to transform the input fingerprint into a normal one. Promising results have been obtained on three databases containing many distorted fingerprints, namely FVC2004 DB1, Tsinghua Distorted Fingerprint database, and the NIST SD27 latent fingerprint database.

DATA-DRIVEN COMPOSITION FOR SERVICE-ORIENTED SITUATIONAL WEB APPLICATIONS

This paper presents a systematic data-driven approach to assisting situational application development. We first propose a technique to extract useful information from multiple sources to abstract service capabilities with set tags. This supports intuitive expression of user’s desired composition goals by simple queries, without having to know underlying technical details. A planning technique then exploits composition solutions which can constitute the desired goals, even with some potential new interesting composition opportunities. A browser-based tool facilitates visual and iterative refinement of composition solutions, to finally come up with the satisfying outputs. A series of experiments demonstrate the efficiency and effectiveness of our approach.

Data-driven composition technique for situational web applications by using tag-based semantics in to illustrate the overall life-cycle of our “compose as-you-search” composition approach, to propose the clustering technique for deriving tag-based composition semantics, and to evaluate the composition planning effectiveness, respectively. Compared with previous work, this paper is significantly updated by introducing a semi-supervised technique for clustering hierarchical tag based semantics from service documentations and human-annotated annotations. The derived semantics link service capabilities and developers’ processing goals, so that the composition is processed by planning the “Tag HyperLinks” from initialquery to the goals.

The planning algorithm is also further evaluated in terms of recommendation quality, performance, and scalability over data sets from real-world service repositories. Results show that our approach reaches satisfying precision and high-quality composition recommendations. We also demonstrate that our approach can accommodate even larger size of services than real world repositories so as to promise performance. Besides, more details of our interactive development prototyping are presented. We particularly demonstrate how the composition UI can help developers intuitively compose situational applications, and iteratively refine their goals until requirements are finally satisfied.