PRIVACY-PRESERVING RELATIVE LOCATION BASED SERVICES FOR MOBILE USERS

ABSTRACT:

Location-aware applications are widely used thanks to the positioning features of modern smartphones, such as GPS and AGPS. However, all existing applications gather users’ geographical data and transform it into pertinent information to give it meaning and value. For such solutions, privacy and security issues arise because the user’s geographical location must be exposed to the service provider. This article proposes a novel and practical solution that provides the relative location of two mobile users based on their WiFi scan results, without any additional sensors. The solution raises no privacy concern because end users never collect or send sensitive information to the server. It adopts a Client/Server (C/S) architecture in which the mobile user, as a client, reports the ambient WiFi APs, and the server calculates distances from the topological relationships among those APs. A series of techniques is explored to improve the accuracy of the estimated distance, and the corresponding algorithms are proposed. We also prove feasibility with a prototype, the “Circle Your Friends” System (CYFS), on an Android phone, which lets a mobile user learn the distance between himself and his social network friends.

INTRODUCTION:

LOCATION-AWARE APPLICATIONS:

Location awareness refers to devices that can passively or actively determine their location. Navigational instruments provide location coordinates for vessels and vehicles, and surveying equipment identifies location with respect to a well-known reference point. Network location awareness (NLA) describes the location of a node in a network. The term applies to navigation, real-time locating and positioning support with global, regional or local scope, and it has been applied to traffic, logistics, business administration and leisure applications. Location awareness is supported by navigation systems, positioning systems and/or locating services. Locating a device without its active participation is known as non-cooperative locating or detection. Location-aware applications use the geographical position of a mobile worker or an asset to execute a task. Position is detected mainly through satellite technologies, such as GPS, or through mobile location technologies in cellular networks and mobile devices.

Examples include fleet management applications with mapping, navigation and routing functionalities, government inspections, and integration with geographic information system applications. Location-aware applications deliver specific messages to users based on their physical location. These services can be divided into two types: absolute-location services and relative-location services. Absolute location identifies a place using a coordinate system, whereas relative location identifies a place relative to other landmarks. Absolute-location services require users to report their absolute location data to the server, which then returns the query result. The technologies typically used to detect and retrieve location data include GPS, the mobile cell ID (CID) and WiFi APs. These methodologies raise serious privacy concerns because they enable continuous tracking of the involved users’ locations. Two major types of privacy concern are triggered: potential information leakage in communications, and inappropriate use of this information by the service providers.

EXISTING SYSTEM:

The rapid proliferation of smartphone technology in urban communities has enabled mobile users to utilize context-aware services on their devices. Service providers take advantage of this dynamic and ever-growing technology landscape by proposing innovative context-dependent services for mobile subscribers. Location-Based Services (LBS), for example, are used by millions of mobile subscribers every day to obtain location-specific information. Two popular features of location-based services are location check-ins and location sharing. By checking into a location, users can share their current location with family and friends or obtain location-specific services from third-party providers; in this case, the obtained service does not depend on the locations of other users.

Other types of location-based services, which rely on a group of users sharing their locations (or location preferences) in order to obtain some service for the whole group, are also becoming popular. According to a recent study, location-sharing services are used by almost 20% of all mobile phone users. One prominent example of such a service is the taxi-sharing application, offered by a global telecom operator, where smartphone users can share a taxi with other users at a suitable location by revealing their departure and destination locations. Similarly, another popular service enables a group of users to find the most geographically convenient place to meet.

DISADVANTAGES:

  • Privacy of a user’s location or location preferences, with respect to other users and the third-party service provider, is a critical concern in such location-sharing-based applications.
  • For instance, such information can be used to de-anonymize users and their availabilities, to track their preferences or to identify their social networks.
  • For example, in the taxi-sharing application, a curious third-party service provider could easily deduce home/work location pairs of users who regularly use their service.
  • Without effective protection, even sparse location information has been shown to provide reliable information about a user’s private sphere, which could have severe consequences for the user’s social, financial and private life.
  • Even service providers who legitimately track users’ location information in order to improve the offered service can inadvertently harm users’ privacy, if the collected data is leaked in an unauthorized fashion or improperly shared with corporate partners.

PROPOSED SYSTEM:

We propose a simple and novel solution that provides the relative distance between two mobile devices without collecting any personal sensitive data. It can fully protect users’ privacy while providing a location-based service: since no absolute location information is detected, none of the above privacy-protection mechanisms needs to be adopted in our solution. At the same time, several methods are put forward to improve the accuracy of the relative distance. Our approach integrates additional parameters, such as the IEEE protocol type and the overlap ratio, to improve the accuracy of the WiFi positioning system. More importantly, all these mechanisms have been revisited and redesigned carefully to make them more applicable.

We address the privacy issue in LSBSs by focusing on a specific problem solved by the CYFS. Given a set of user location preferences, CYFS determines a location among the proposed ones such that the maximum distance between this location and all other users’ locations is minimized, i.e., it is fair to all users. To prove its feasibility, a Facebook-based prototype was developed for Android mobile devices. Although the evaluated accuracy of the estimated distance is not as good as GPS, it shows that our privacy-preserving solution is suitable for social networking and location-based applications. Future work includes developing the application for Apple iOS devices in addition to Google Android and, if possible, integrating the privacy-preserving relative location based service into other social networking applications such as WeChat and QQ.
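The server-side distance estimate described above can be sketched in a few lines: each client reports only the BSSIDs of its ambient WiFi APs, never any coordinates, and the server maps the overlap between two scan lists to a coarse distance. The class name, the use of Jaccard overlap, and the 50 m AP-range constant are illustrative assumptions, not the exact algorithm of the paper.

```java
import java.util.HashSet;
import java.util.Set;

public class RelativeDistanceEstimator {
    // Assumed typical AP coverage radius; the real system tunes this per
    // protocol type (802.11b/g/n) and overlap ratio.
    static final double AP_RANGE_METERS = 50.0;

    // Jaccard overlap of the two reported BSSID sets.
    public static double overlapRatio(Set<String> scanA, Set<String> scanB) {
        if (scanA.isEmpty() && scanB.isEmpty()) return 0.0;
        Set<String> inter = new HashSet<>(scanA);
        inter.retainAll(scanB);
        Set<String> union = new HashSet<>(scanA);
        union.addAll(scanB);
        return (double) inter.size() / union.size();
    }

    // Higher overlap means the users are closer; no overlap means they are
    // at least one AP range apart.
    public static double estimateDistanceMeters(Set<String> scanA, Set<String> scanB) {
        return (1.0 - overlapRatio(scanA, scanB)) * AP_RANGE_METERS;
    }
}
```

Because only BSSID strings cross the network, the server never learns an absolute position, which is the core privacy property claimed above.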

ADVANTAGES:

  • The proposed system solves the problem in a privacy-preserving fashion, where each user participates by providing only a single location preference to the CYFS solver or the service provider.
  • In this significantly extended version of our earlier conference paper, we evaluate the security of our proposal under various passive and active adversarial scenarios, including collusion.
  • We also provide an accurate and detailed analysis of the privacy properties of our proposal and show that our algorithms do not provide any probabilistic advantage to a passive adversary in correctly guessing the preferred location of any participant.
  • In addition to the theoretical analysis, we also evaluate the practical efficiency and performance of the proposed algorithms by means of a prototype implementation on a test bed of Nokia mobile devices. We also address the multi-preference case, where each user may have multiple prioritized location preferences.
  • We highlight the main differences, in terms of performance, from the single-preference case, and also present initial experimental results for the multi-preference implementation. Finally, by means of a targeted user study, we provide insight into the usability of our proposed solutions.

HARDWARE & SOFTWARE REQUIREMENTS:

HARDWARE REQUIREMENT:

  • Processor      –    Pentium IV
  • Speed          –    1 GHz
  • RAM            –    256 MB (min)
  • Hard Disk      –    20 GB
  • Floppy Drive   –    1.44 MB
  • Keyboard       –    Standard Windows Keyboard
  • Mouse          –    Two or Three Button Mouse
  • Monitor        –    SVGA

SOFTWARE REQUIREMENTS:

  • Operating System   :    Windows XP or Windows 7
  • Front End          :    Java JDK 1.7
  • Back End           :    MySQL Server
  • Script             :    JSP
  • Documentation      :    MS Office 2007

PRIVACY-PRESERVING DETECTION OF SENSITIVE DATA EXPOSURE

ABSTRACT:

Statistics from security firms, research institutions and government organizations show that the number of data-leak instances has grown rapidly in recent years. Among the various data-leak cases, human mistakes are one of the main causes of data loss. Solutions exist that detect inadvertent sensitive data leaks caused by human mistakes and provide alerts for organizations. A common approach is to screen content in storage and transmission for exposed sensitive information. Such an approach usually requires the detection operation to be conducted in secrecy. However, this secrecy requirement is challenging to satisfy in practice, as detection servers may be compromised or outsourced.

In this paper, we present a privacy-preserving data-leak detection (DLD) solution to address this issue, in which a special set of sensitive data digests is used in detection. The advantage of our method is that it enables the data owner to safely delegate the detection operation to a semi-honest provider without revealing the sensitive data to the provider. We describe how Internet service providers can offer their customers DLD as an add-on service with strong privacy guarantees. The evaluation results show that our method supports accurate detection with a very small number of false alarms under various data-leak scenarios.

INTRODUCTION:

According to a report from Risk Based Security (RBS), the number of leaked sensitive data records has increased dramatically during the last few years, i.e., from 412 million in 2012 to 822 million in 2013. Deliberately planned attacks, inadvertent leaks (e.g., forwarding confidential emails to unclassified email accounts), and human mistakes (e.g., assigning the wrong privilege) lead to most of the data-leak incidents. Detecting and preventing data leaks requires a set of complementary solutions, which may include data-leak detection, data confinement, stealthy malware detection and policy enforcement.

Network data-leak detection (DLD) typically performs deep packet inspection (DPI) and searches for any occurrences of sensitive data patterns. DPI is a technique to analyze payloads of IP/TCP packets for inspecting application layer data, e.g., HTTP header/content. Alerts are triggered when the amount of sensitive data found in traffic passes a threshold. The detection system can be deployed on a router or integrated into existing network intrusion detection systems (NIDS). Straightforward realizations of data-leak detection require the plaintext sensitive data.
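The threshold-based DPI step described above can be illustrated with a minimal sketch: the payload is cut into fixed-length shingles (substrings), each shingle is checked against the set of shingles derived from the sensitive data, and an alert fires when the matched fraction passes a threshold. The shingle length, the class name, and the threshold value are illustrative choices, not values from the paper.

```java
import java.util.HashSet;
import java.util.Set;

public class DpiLeakDetector {
    static final int SHINGLE_LEN = 8; // assumed shingle size

    // All fixed-length substrings of the given text.
    static Set<String> shingles(String text) {
        Set<String> out = new HashSet<>();
        for (int i = 0; i + SHINGLE_LEN <= text.length(); i++)
            out.add(text.substring(i, i + SHINGLE_LEN));
        return out;
    }

    // Fraction of payload shingles that also occur in the sensitive data.
    public static double sensitivity(String payload, String sensitive) {
        Set<String> p = shingles(payload);
        if (p.isEmpty()) return 0.0;
        Set<String> s = shingles(sensitive);
        int hits = 0;
        for (String sh : p)
            if (s.contains(sh)) hits++;
        return (double) hits / p.size();
    }

    // Alert when the amount of sensitive content passes the threshold,
    // mirroring the threshold rule described in the text.
    public static boolean alert(String payload, String sensitive, double threshold) {
        return sensitivity(payload, sensitive) >= threshold;
    }
}
```

A production DLD system would of course hash the shingles rather than compare plaintext; that refinement is exactly what the fuzzy-fingerprint approach below provides.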

However, this requirement is undesirable, as it may threaten the confidentiality of the sensitive information. If a detection system is compromised, then it may expose the plaintext sensitive data (in memory). In addition, the data owner may need to outsource the data-leak detection to providers, but may be unwilling to reveal the plaintext sensitive data to them. Therefore, one needs new data-leak detection solutions that allow the providers to scan content for leaks without learning the sensitive information.

In this paper, we propose a data-leak detection solution which can be outsourced and be deployed in a semihonest detection environment. We design, implement, and evaluate our fuzzy fingerprint technique that enhances data privacy during data-leak detection operations. Our approach is based on a fast and practical one-way computation on the sensitive data (SSN records, classified documents, sensitive emails, etc.). It enables the data owner to securely delegate the content-inspection task to DLD providers without exposing the sensitive data. Using our detection method, the DLD provider, who is modeled as an honest-but-curious (aka semi-honest) adversary, can only gain limited knowledge about the sensitive data from either the released digests, or the content being inspected. Using our techniques, an Internet service provider (ISP) can perform detection on its customers’ traffic securely and provide data-leak detection as an add-on service for its customers. In another scenario, individuals can mark their own sensitive data and ask the administrator of their local network to detect data leaks for them.

In our detection procedure, the data owner computes a special set of digests or fingerprints from the sensitive data and then discloses only a small amount of them to the DLD provider. The DLD provider computes fingerprints from network traffic and identifies potential leaks in them. To prevent the DLD provider from gathering exact knowledge about the sensitive data, the collection of potential leaks is composed of real leaks and noises. It is the data owner, who post-processes the potential leaks sent back by the DLD provider and determines whether there is any real data leak.
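The digest-and-noise procedure just described can be sketched as follows: the data owner hashes each sensitive shingle but releases only the digest with its low-order bits masked out. The DLD provider reports every traffic shingle whose masked digest matches, so real leaks are hidden among noise, and the owner then checks the exact digests locally. The hash function and the mask width here are stand-in assumptions, not the paper's actual fuzzy-fingerprint construction.

```java
public class FuzzyFingerprint {
    static final int FUZZY_BITS = 4; // low-order bits withheld from the provider
    static final long MASK = ~((1L << FUZZY_BITS) - 1);

    // Stand-in one-way digest; a real system would use Rabin fingerprints
    // or a keyed cryptographic hash.
    public static long digest(String shingle) {
        long h = 1125899906842597L; // prime seed
        for (int i = 0; i < shingle.length(); i++)
            h = 31 * h + shingle.charAt(i);
        return h;
    }

    // What the data owner releases: the exact digest is never disclosed.
    public static long fuzzify(long d) {
        return d & MASK;
    }

    // Provider side: a traffic shingle is a candidate leak if its masked
    // digest matches a released fuzzy digest (real leaks plus noise).
    public static boolean candidateMatch(String trafficShingle, long fuzzyDigest) {
        return fuzzify(digest(trafficShingle)) == fuzzyDigest;
    }

    // Owner-side post-processing: confirm a reported candidate exactly.
    public static boolean confirmLeak(String candidate, long exactDigest) {
        return digest(candidate) == exactDigest;
    }
}
```

With 4 fuzzy bits, up to 16 distinct exact digests share one released value, which is what prevents the semi-honest provider from pinpointing the sensitive data.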

Our contributions are summarized as follows.

1) We describe a privacy-preserving data-leak detection model for preventing inadvertent data leaks in network traffic. Our model supports delegation of the detection operation, and ISPs can use it to provide data-leak detection as an add-on service to their customers. We design, implement, and evaluate an efficient technique, the fuzzy fingerprint, for privacy-preserving data-leak detection. Fuzzy fingerprints are special sensitive data digests prepared by the data owner for release to the DLD provider.

2) We implement our detection system and perform extensive experimental evaluation on a 2.6 GB Enron dataset, the Internet surfing traffic of 20 users, and 5 simulated real-world data-leak scenarios to measure its privacy guarantee, detection rate and efficiency. Our results indicate that our underlying scheme achieves high accuracy with a very low false positive rate. They also show that the detection accuracy does not degrade much when only partial (sampled) sensitive-data digests are used. In addition, we give an empirical analysis of our fuzzification as well as of the fairness of fingerprint partial disclosure.

LITERATURE SURVEY

PRIVACY-AWARE COLLABORATIVE SPAM FILTERING

AUTHORS: K. Li, Z. Zhong, and L. Ramaswamy

PUBLISH: IEEE Trans. Parallel Distrib. Syst., vol. 20, no. 5, pp. 725–739, May 2009.

EXPLANATION:

While the concept of collaboration provides a natural defense against massive spam e-mails directed at large numbers of recipients, designing effective collaborative anti-spam systems raises several important research challenges. First and foremost, since e-mails may contain confidential information, any collaborative anti-spam approach has to guarantee strong privacy protection to the participating entities. Second, the continuously evolving nature of spam demands that the collaborative techniques be resilient to various kinds of camouflage attacks. Third, the collaboration has to be lightweight, efficient, and scalable. Toward addressing these challenges, this paper presents ALPACAS, a privacy-aware framework for collaborative spam filtering. In designing the ALPACAS framework, we make two unique contributions. The first is a feature-preserving message transformation technique that is highly resilient against the latest kinds of spam attacks. The second is a privacy-preserving protocol that provides enhanced privacy guarantees to the participating entities. Our experimental results, conducted on a real e-mail data set, show that the proposed framework provides a 10-fold improvement in the false negative rate over the Bayesian-based Bogofilter when faced with one of the recent kinds of spam attacks. Further, privacy breaches are extremely rare, demonstrating the strong privacy protection provided by the ALPACAS system.

DATA LEAK DETECTION AS A SERVICE: CHALLENGES AND SOLUTIONS

AUTHORS: X. Shu and D. Yao

PUBLISH: Proc. 8th Int. Conf. Secur. Privacy Commun. Netw., 2012, pp. 222–240

EXPLANATION:

We describe a network-based data-leak detection (DLD) technique, the main feature of which is that the detection does not require the data owner to reveal the content of the sensitive data. Instead, only a small amount of specialized digests is needed. Our technique, referred to as the fuzzy fingerprint, can be used to detect accidental data leaks due to human errors or application flaws. The privacy-preserving feature of our algorithms minimizes the exposure of sensitive data and enables the data owner to safely delegate the detection to others. We describe how cloud providers can offer their customers data-leak detection as an add-on service with strong privacy guarantees. We perform extensive experimental evaluation on the privacy, efficiency, accuracy and noise tolerance of our techniques. Our evaluation results under various data-leak scenarios and setups show that our method can support accurate detection with a very small number of false alarms, even when the presentation of the data has been transformed. It also indicates that the detection accuracy does not degrade when partial digests are used. We further provide a quantifiable method to measure the privacy guarantee offered by our fuzzy fingerprint framework.

QUANTIFYING INFORMATION LEAKS IN OUTBOUND WEB TRAFFIC

AUTHORS: K. Borders and A. Prakash

PUBLISH: Proc. 30th IEEE Symp. Secur. Privacy, May 2009, pp. 129–140.

EXPLANATION:

As the Internet grows and network bandwidth continues to increase, administrators are faced with the task of keeping confidential information from leaving their networks. Today's network traffic is so voluminous that manual inspection would be unreasonably expensive. In response, researchers have created data loss prevention systems that check outgoing traffic for known confidential information. These systems stop naive adversaries from leaking data, but are fundamentally unable to identify encrypted or obfuscated information leaks. What remains is a high-capacity pipe for tunneling data to the Internet. We present an approach for quantifying information leak capacity in network traffic. Instead of trying to detect the presence of sensitive data, an impossible task in the general case, our goal is to measure and constrain its maximum volume. We take advantage of the insight that most network traffic is repeated or determined by external information, such as protocol specifications or messages sent by a server. By filtering this data, we can isolate and quantify the true information flowing from a computer. In this paper, we present measurement algorithms for the Hypertext Transfer Protocol (HTTP), the main protocol for Web browsing. When applied to real Web browsing traffic, the algorithms were able to discount 98.5% of measured bytes and effectively isolate information leaks.
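The filtering idea summarized above can be sketched very roughly: instead of detecting sensitive content, count only the outbound bytes that are not determined by previously observed traffic. Here "determined" is approximated by fixed-length chunks already seen, a much cruder stand-in for the paper's HTTP-aware measurement algorithms; the class and chunk size are assumptions for illustration.

```java
import java.util.HashSet;
import java.util.Set;

public class LeakCapacityMeter {
    static final int CHUNK = 4; // assumed chunk granularity
    private final Set<String> seen = new HashSet<>();

    // Returns how many bytes of this request count as novel, i.e. as
    // potential information-leak capacity. Repeated chunks are discounted.
    public int novelBytes(String request) {
        int novel = 0;
        for (int i = 0; i < request.length(); i += CHUNK) {
            String chunk = request.substring(i, Math.min(i + CHUNK, request.length()));
            if (seen.add(chunk)) novel += chunk.length();
        }
        return novel;
    }
}
```

Replaying an identical request yields zero novel bytes, capturing the insight that repeated traffic carries no new information out of the network.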

SYSTEM ANALYSIS

EXISTING SYSTEM:

  • Existing approaches to detecting and preventing data leaks require a set of complementary solutions, which may include data-leak detection, data confinement, stealthy malware detection, and policy enforcement.
  • Network data-leak detection (DLD) typically performs deep packet inspection (DPI) and searches for any occurrences of sensitive data patterns. DPI is a technique to analyze payloads of IP/TCP packets for inspecting application layer data, e.g., HTTP header/content.
  • Alerts are triggered when the amount of sensitive data found in traffic passes a threshold. The detection system can be deployed on a router or integrated into existing network intrusion detection systems (NIDS).
  • Straightforward realizations of data-leak detection require the plaintext sensitive data. However, this requirement is undesirable, as it may threaten the confidentiality of the sensitive information. If a detection system is compromised, then it may expose the plaintext sensitive data (in memory).
  • In addition, the data owner may need to outsource the data-leak detection to providers, but may be unwilling to reveal the plaintext sensitive data to them. Therefore, one needs new data-leak detection solutions that allow the providers to scan content for leaks without learning the sensitive information.

DISADVANTAGES:

  • As the Internet grows and network bandwidth continues to increase, administrators are faced with the task of keeping confidential information from leaving their networks. In response, researchers have created data loss prevention systems that check outgoing traffic for known confidential information.
  • These systems stop naive adversaries from leaking data, but are fundamentally unable to identify encrypted or obfuscated information leaks. What remains is a high-capacity pipe for tunneling data to the Internet.
  • Existing approaches to quantifying information-leak capacity in network traffic do not try to detect the presence of sensitive data, an impossible task in the general case; instead, their goal is to measure and constrain its maximum volume.
  • They rely on the insight that most network traffic is repeated or determined by external information, such as protocol specifications or messages sent by a server. By filtering this data, the true information flowing from a computer can be isolated and quantified.

PROPOSED SYSTEM:

  • We propose a data-leak detection solution which can be outsourced and be deployed in a semihonest detection environment. We design, implement, and evaluate our fuzzy fingerprint technique that enhances data privacy during data-leak detection operations.
  • Our approach is based on a fast and practical one-way computation on the sensitive data (SSN records, classified documents, sensitive emails, etc.). It enables the data owner to securely delegate the content-inspection task to DLD providers without exposing the sensitive data.
  • With our detection method, the DLD provider, who is modeled as an honest-but-curious (aka semi-honest) adversary, can only gain limited knowledge about the sensitive data from either the released digests or the content being inspected. Using our techniques, an Internet service provider (ISP) can perform detection on its customers’ traffic securely and provide data-leak detection as an add-on service for its customers. In another scenario, individuals can mark their own sensitive data and ask the administrator of their local network to detect data leaks for them.
  • In our detection procedure, the data owner computes a special set of digests or fingerprints from the sensitive data and then discloses only a small amount of them to the DLD provider. The DLD provider computes fingerprints from network traffic and identifies potential leaks in them.
  • To prevent the DLD provider from gathering exact knowledge about the sensitive data, the collection of potential leaks is composed of real leaks and noises. It is the data owner, who post-processes the potential leaks sent back by the DLD provider and determines whether there is any real data leak.

ADVANTAGES:

  • We describe a privacy-preserving data-leak detection model for preventing inadvertent data leaks in network traffic. Our model supports delegation of the detection operation, and ISPs can use it to provide data-leak detection as an add-on service to their customers.
  • We design, implement, and evaluate an efficient technique, fuzzy fingerprint, for privacy-preserving data-leak detection. Fuzzy fingerprints are special sensitive data digests prepared by the data owner for release to the DLD provider.
  • We implement our detection system and perform extensive experimental evaluation on internet surfing traffic of 20 users, and also 5 simulated real-worlds data-leak scenarios to measure its privacy guarantee, detection rate and efficiency.
  • Our results indicate that our underlying scheme achieves high accuracy with a very low false positive rate. They also show that the detection accuracy does not degrade much when only partial (sampled) sensitive-data digests are used. We further give an empirical analysis of our fuzzification as well as of the fairness of fingerprint partial disclosure.

HARDWARE & SOFTWARE REQUIREMENTS:

HARDWARE REQUIREMENT:

  • Processor      –    Pentium IV
  • Speed          –    1 GHz
  • RAM            –    256 MB (min)
  • Hard Disk      –    20 GB
  • Floppy Drive   –    1.44 MB
  • Keyboard       –    Standard Windows Keyboard
  • Mouse          –    Two or Three Button Mouse
  • Monitor        –    SVGA

SOFTWARE REQUIREMENTS:

  • Operating System   :    Windows XP or Windows 7
  • Front End          :    Java JDK 1.7
  • Back End           :    MySQL Server
  • Server             :    Apache Tomcat Server
  • Script             :    JSP
  • Documentation      :    MS Office 2007

PRIVACY POLICY INFERENCE OF USER-UPLOADED IMAGES ON CONTENT SHARING SITES

ABSTRACT:

With the increasing volume of images users share through social sites, maintaining privacy has become a major problem, as demonstrated by a recent wave of publicized incidents where users inadvertently shared personal information. In light of these incidents, the need of tools to help users control access to their shared content is apparent. Toward addressing this need, we propose an Adaptive Privacy Policy Prediction (A3P) system to help users compose privacy settings for their images. We examine the role of social context, image content, and metadata as possible indicators of users’ privacy preferences.

We propose a two-level framework which according to the user’s available history on the site, determines the best available privacy policy for the user’s images being uploaded. Our solution relies on an image classification framework for image categories which may be associated with similar policies, and on a policy prediction algorithm to automatically generate a policy for each newly uploaded image, also according to users’ social features. Over time, the generated policies will follow the evolution of users’ privacy attitude. We provide the results of our extensive evaluation over 5,000 policies, which demonstrate the effectiveness of our system, with prediction accuracies over 90 percent.
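The two-level prediction described above can be sketched simply: an uploaded image is first mapped to a content category, and the policy most often applied by this user to past images of that category is proposed as the default. The category names, policy labels, and majority-vote rule are assumptions for illustration; A3P's actual classifiers and policy mining are considerably richer.

```java
import java.util.HashMap;
import java.util.Map;

public class PolicyPredictor {
    // History built from the user's past uploads: category -> (policy -> count).
    private final Map<String, Map<String, Integer>> history =
            new HashMap<String, Map<String, Integer>>();

    // Record the policy the user actually chose for an image of this category.
    public void record(String category, String policy) {
        Map<String, Integer> counts = history.get(category);
        if (counts == null) {
            counts = new HashMap<String, Integer>();
            history.put(category, counts);
        }
        Integer c = counts.get(policy);
        counts.put(policy, c == null ? 1 : c + 1);
    }

    // Most frequent past policy for this category; a conservative default
    // when the user has no history (the cold-start case).
    public String predict(String category) {
        Map<String, Integer> counts = history.get(category);
        if (counts == null || counts.isEmpty()) return "PRIVATE";
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }
}
```

As the history grows, the predicted policies track the evolution of the user's privacy attitude, which is the adaptive behavior the abstract refers to.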

INTRODUCTION

Images are now one of the key enablers of users’ connectivity. Sharing takes place both among previously established groups of known people or social circles (e.g., Google+, Flickr or Picasa), and increasingly with people outside a user’s social circles, for purposes of social discovery: to help users identify new peers and learn about peers’ interests and social surroundings. However, semantically rich images may reveal content-sensitive information. Consider, for example, a photo of a student’s 2012 graduation ceremony.

It could be shared within a Google+ circle or Flickr group, but may unnecessarily expose the student’s family members and other friends. Sharing images within online content sharing sites, therefore, may quickly lead to unwanted disclosure and privacy violations. Further, the persistent nature of online media makes it possible for other users to collect rich aggregated information about the owner of the published content and the subjects in it. This aggregated information can result in unexpected exposure of one’s social environment and lead to abuse of one’s personal information.

Most content sharing websites allow users to enter their privacy preferences. Unfortunately, recent studies have shown that users struggle to set up and maintain such privacy settings. One of the main reasons provided is that given the amount of shared information this process can be tedious and error-prone. Therefore, many have acknowledged the need of policy recommendation systems which can assist users to easily and properly configure privacy settings. However, existing proposals for automating privacy settings appear to be inadequate to address the unique privacy needs of images due to the amount of information implicitly carried within images, and their relationship with the online environment wherein they are exposed.

LITERATURE SURVEY

TITLE NAME: SHEEPDOG: GROUP AND TAG RECOMMENDATION FOR FLICKR PHOTOS BY AUTOMATIC SEARCH-BASED LEARNING

AUTHOR: H.-M. Chen, M.-H. Chang, P.-C. Chang, M.-C. Tien, W. H. Hsu, and J.-L. Wu,

PUBLISH: Proc. 16th ACM Int. Conf. Multimedia, 2008, pp. 737–740.

EXPLANATION:

Online photo albums have been prevalent in recent years and have resulted in more and more applications developed to provide convenient functionalities for photo sharing. In this paper, we propose a system named SheepDog to automatically add photos into appropriate groups and recommend suitable tags for users on Flickr. We adopt concept detection to predict relevant concepts of a photo and probe into the issue about training data collection for concept classification. From the perspective of gathering training data by web searching, we introduce two mechanisms and investigate their performances of concept detection. Based on some existing information from Flickr, a ranking-based method is applied not only to obtain reliable training data, but also to provide reasonable group/tag recommendations for input photos. We evaluate this system with a rich set of photos and the results demonstrate the effectiveness of our work.

TITLE NAME: CONNECTING CONTENT TO COMMUNITY IN SOCIAL MEDIA VIA IMAGE CONTENT, USER TAGS AND USER COMMUNICATION

AUTHOR: M. D. Choudhury, H. Sundaram, Y.-R. Lin, A. John, and D. D. Seligmann

PUBLISH: Proc. IEEE Int. Conf. Multimedia Expo, 2009, pp.1238–1241.

EXPLANATION:

In this paper we develop a recommendation framework to connect image content with communities in online social media. The problem is important because users are looking for useful feedback on their uploaded content, but finding the right community for feedback is challenging for the end user. Social media are characterized by both content and community. Hence, in our approach, we characterize images through three types of features: visual features, user generated text tags, and social interaction (user communication history in the form of comments). A recommendation framework based on learning a latent space representation of the groups is developed to recommend the most likely groups for a given image. The model was tested on a large corpus of Flickr images comprising 15,689 images. Our method outperforms the baseline method, with a mean precision 0.62 and mean recall 0.69. Importantly, we show that fusing image content, text tags with social interaction features outperforms the case of only using image content or tags.

TITLE NAME: ANALYSING FACEBOOK FEATURES TO SUPPORT EVENT DETECTION FOR PHOTO-BASED FACEBOOK APPLICATIONS

AUTHOR: M. Rabbath, P. Sandhaus, and S. Boll

PUBLISH: Proc. 2nd ACM Int. Conf. Multimedia Retrieval, 2012, pp. 11:1–11:8.

EXPLANATION:

Facebook is witnessing an explosion in the number of shared photos: with 100 million photo uploads a day, it creates as much volume as the whole of Flickr every two months. Facebook also has one of the healthiest platforms for third-party applications, many of which deal with photos and related events. Although it is essential for many Facebook applications, until now there has been no easy way to detect and link photos that are related to the same event, which are usually distributed across friends and albums. In this work, we introduce an approach that exploits Facebook features to link photos related to the same event. Since the EXIF header of photos is currently missing in Facebook, we extract visual-based, tagged-area-based, friendship-based and structure-based features. We evaluate each of these features and use the results in our approach. We introduce and evaluate a semi-supervised probabilistic approach that takes the evaluation of these features into account. In this approach we create a lookup table of the initialization values of our model variables and make it available for other Facebook applications or researchers to use. The evaluation of our approach showed promising results, and it outperformed the baseline of using the unsupervised EM algorithm to estimate the parameters of a Gaussian mixture model. We also give two examples of the applicability of this approach in helping Facebook applications better serve the user.
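The baseline mentioned above, unsupervised EM for a Gaussian mixture, can be illustrated in one dimension, e.g. clustering photo timestamps into two candidate events. This is a generic textbook EM sketch, not the authors' semi-supervised model, and the timestamps are invented:

```python
import math

def em_gmm_1d(xs, iters=50):
    """Unsupervised EM for a two-component 1-D Gaussian mixture
    (the baseline the semi-supervised approach is compared against)."""
    mu = [min(xs), max(xs)]          # crude initialization at the extremes
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            p = [pi[k] / math.sqrt(2 * math.pi * var[k]) *
                 math.exp(-(x - mu[k]) ** 2 / (2 * var[k])) for k in range(2)]
            s = sum(p) or 1e-12
            resp.append([pk / s for pk in p])
        # M-step: re-estimate mixing weights, means and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp) or 1e-12
            pi[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = max(sum(r[k] * (x - mu[k]) ** 2
                             for r, x in zip(resp, xs)) / nk, 1e-6)
    return mu, var, pi

# Invented photo timestamps (hours) from two events around t=2 and t=10.
times = [1.8, 2.0, 2.2, 2.1, 9.8, 10.0, 10.2, 10.1]
mu, var, pi = em_gmm_1d(times)
```

The paper's point is that initializing such model variables from a learned lookup table (semi-supervised) beats this blind EM initialization.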

SYSTEM ANALYSIS

EXISTING SYSTEM:

Image content sharing environments such as Flickr or YouTube contain a large amount of private resources such as photos showing weddings, family holidays, and private parties. These resources can be of a highly sensitive nature, disclosing many details of the users’ private sphere. In order to support users in making privacy decisions in the context of image sharing, and to provide them with a better overview of privacy-related visual content available on the Web, techniques are needed to automatically detect private images and to enable privacy-oriented image search.

To this end, we learn privacy classifiers trained on a large set of manually assessed Flickr photos, combining textual metadata of images with a variety of visual features. We employ the resulting classification models for specifically searching for private photos, and for diversifying query results to provide users with a better coverage of private and public content. Most content sharing websites allow users to enter their privacy preferences. Unfortunately, recent studies have shown that users struggle to set up and maintain such privacy settings.

  • One of the main reasons given is that, with the amount of information being shared, this process can be tedious and error-prone. This has motivated policy recommendation systems that can assist users to easily and properly configure their privacy settings.

DISADVANTAGES:

  • Sharing images within online content sharing sites, therefore, may quickly lead to unwanted disclosure and privacy violations.
  • Further, the persistent nature of online media makes it possible for other users to collect rich aggregated information about the owner of the published content and the subjects in the published content.
  • The aggregated information can result in unexpected exposure of one’s social environment and lead to abuse of one’s personal information.

PROPOSED SYSTEM:

We propose an Adaptive Privacy Policy Prediction (A3P) system which aims to provide users with a hassle-free privacy-settings experience by automatically generating personalized policies. The A3P system handles user-uploaded images and factors in the following criteria that influence one’s privacy settings for images:

The impact of social environment and personal characteristics: Social context of users, such as their profile information and relationships with others may provide useful information regarding users’ privacy preferences. For example, users interested in photography may like to share their photos with other amateur photographers. Users who have several family members among their social contacts may share with them pictures related to family events. However, using common policies across all users or across users with similar traits may be too simplistic and not satisfy individual preferences.

Users may have drastically different opinions even on the same type of images. For example, a person unconcerned about privacy may be willing to share all his personal images, while a more conservative person may want to share personal images only with his family members. In light of these considerations, it is important to find the balancing point between the impact of the social environment and users’ individual characteristics in order to predict the policies that match each individual’s needs.

The role of an image’s content and metadata: In general, similar images often incur similar privacy preferences, especially when people appear in the images. For example, one may upload several photos of his kids and specify that only his family members are allowed to see these photos. He may upload some other photos of landscapes which he took as a hobby, and for these photos he may set a privacy preference allowing anyone to view and comment on them. However, analyzing the visual content alone may not be sufficient to capture users’ privacy preferences. Tags and other metadata are indicative of the social context of the image, including where it was taken and why, and also provide a synthetic description of images, complementing the information obtained from visual content analysis.
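The observation that similar images incur similar privacy preferences suggests a simple nearest-neighbor baseline over tag metadata: predict a new image's policy by majority vote among the user's most tag-similar earlier uploads. This is only an illustrative sketch (A3P's actual pipeline is richer), and the tags and policy labels are invented:

```python
def jaccard(a, b):
    # Set overlap between two tag lists.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def predict_policy(tags, history, k=3):
    """history: list of (tags, policy) pairs from the user's earlier uploads.
    Returns the majority policy among the k most tag-similar uploads."""
    ranked = sorted(history, key=lambda h: jaccard(tags, h[0]), reverse=True)
    votes = {}
    for _, policy in ranked[:k]:
        votes[policy] = votes.get(policy, 0) + 1
    return max(votes, key=votes.get)

# Invented upload history: kid photos restricted, landscapes public.
history = [
    (["kids", "birthday", "home"], "family-only"),
    (["kids", "park"], "family-only"),
    (["landscape", "mountain"], "public"),
    (["landscape", "lake", "sunset"], "public"),
]
print(predict_policy(["kids", "home", "dinner"], history))
```

A real system would also weight in visual similarity and the social context discussed above, not tags alone.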

ADVANTAGES:

  • The A3P-core focuses on analyzing each individual user’s own images and metadata, while the A3P-Social offers a community perspective of privacy setting recommendations for a user’s potential privacy improvement.
  • Our algorithm in A3P-core is parameterized based on user groups and factors in possible outliers, while the new A3P-Social module develops the notion of social context to refine and extend the prediction power of our system.
  • We design the interaction flows between the two building blocks to balance the benefits from meeting personal characteristics and obtaining community advice.

HARDWARE & SOFTWARE REQUIREMENTS:

HARDWARE REQUIREMENT:

  • Processor       –    Pentium IV
  • Speed           –    1 GHz
  • RAM             –    256 MB (min)
  • Hard Disk       –    20 GB
  • Floppy Drive    –    1.44 MB
  • Key Board       –    Standard Windows Keyboard
  • Mouse           –    Two or Three Button Mouse
  • Monitor         –    SVGA

SOFTWARE REQUIREMENTS:

  • Operating System   :    Windows XP or Windows 7
  • Front End          :    Java JDK 1.7
  • Back End           :    MySQL Server
  • Server             :    Apache Tomcat Server
  • Script             :    JSP
  • Document           :    MS Office 2007

PREDICTING ASTHMA-RELATED EMERGENCY DEPARTMENT VISITS USING BIG DATA

ABSTRACT:

Asthma is one of the most prevalent and costly chronic conditions in the United States, and it cannot be cured. However, accurate and timely surveillance data could allow for timely and targeted interventions at the community or individual level. Current national asthma disease surveillance systems can have data-availability lags of up to two weeks. Rapid progress has been made in gathering non-traditional, digital information to perform disease surveillance.

We introduce a novel method of using multiple data sources for predicting the number of asthma-related emergency department (ED) visits in a specific area. Twitter data, Google search interests and environmental sensor data were collected for this purpose. Our preliminary findings show that our model can predict the number of asthma ED visits based on near-real-time environmental and social media data with approximately 70% precision. The results can be helpful for public health surveillance, emergency department preparedness, and targeted patient interventions.

INTRODUCTION:

Asthma is one of the most prevalent and costly chronic conditions in the United States, with 25 million people affected. Asthma accounts for about two million emergency department (ED) visits, half a million hospitalizations, and 3,500 deaths, and incurs more than 50 billion dollars in direct medical costs annually. Moreover, asthma is a leading cause of lost productivity, with nearly 11 million missed school days and more than 14 million missed work days every year. Although asthma cannot be cured, many of its adverse events can be prevented by appropriate medication use and avoidance of environmental triggers. The prediction of population- and individual-level risk for asthma adverse events using accurate and timely surveillance data could guide timely and targeted interventions to reduce the societal burden of asthma. At the population level, current national asthma disease surveillance programs rely on weekly reports to the Centers for Disease Control and Prevention (CDC) of data collected from various local resources by state health departments.

Notoriously, such data have a lag time of weeks, providing retrospective information that is not amenable to proactive and timely preventive interventions. At the individual level, known predictors of asthma ED visits and hospitalizations include past acute care utilization, medication use, and sociodemographic characteristics. Common data sources for these variables include electronic medical records (EMR), medical insurance claims data, and population surveys, all of which are also subject to significant time lag. In an ongoing quality improvement project for asthma care, Parkland Center for Clinical Innovation (PCCI) researchers have built an asthma predictive model relying on a combination of EMR and claims data to predict the risk of asthma-related ED visits within three months of data collection [Unpublished reports from PCCI]. Although the model performance (C-statistic 72%) and prediction timeframe (three months) are satisfactory, a narrower prediction timeframe could potentially provide additional risk stratification for more efficiency and timeliness in resource deployment. For instance, resources might be prioritized to first serve patients at high risk for an asthma ED visit within two weeks of data collection, while being safely deferred for patients with a later predicted high-risk period.

Novel sources of timely data on population- and individual-level asthma activities are needed to provide additional temporal and geographical granularity to asthma risk stratification. Short of collecting information directly from individual patients (a time- and resource-intensive endeavor), readily available public data will have to be repurposed intelligently to provide the required information. There has been increasing interest in gathering non-traditional, digital information to perform disease surveillance. These include diverse datasets such as those stemming from social media, internet search, and environmental data. Twitter is an online social media platform that enables users to post and read 140-character messages called “tweets”. It is a popular data source for disease surveillance using social media since it can provide nearly instant access to real-time social opinions. More importantly, tweets are often tagged by geographic location and time stamps potentially providing information for disease surveillance.

Another notable non-traditional disease surveillance system has been a data-aggregating tool called Google Flu Trends, which uses aggregated search data to estimate flu activity. Google Flu Trends was quite successful in its estimation of influenza-like illness. It is based on Google’s search engine, which tracks how often a particular search term is entered relative to the total search volume across a particular area. This enables access to the latest data from web search interest trends on a variety of topics, including diseases like asthma. Air pollutants are known triggers for asthma symptoms and exacerbations. The United States Environmental Protection Agency (EPA) provides access to monitored air quality data collected at outdoor sensors across the country, which could be used as a data source for asthma prediction. Meanwhile, as health reform progresses, the quantity and variety of health records being made available electronically are increasing dramatically. In contrast to traditional disease surveillance systems, these new data sources have the potential to enable health organizations to respond to chronic conditions, like asthma, in real time. This in turn implies that health organizations can appropriately plan for staffing and equipment availability in a flexible manner. They can also provide early warning signals to people at risk for asthma adverse events, and enable timely, proactive, and targeted preventive and therapeutic interventions.
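As a toy illustration of fusing these sources, one could standardize each daily signal (asthma-related tweet counts, EPA air-quality index), average them into a composite index, and fit a least-squares line against ED visit counts. This is a deliberately minimal sketch, not the paper's actual model, and all numbers are invented:

```python
def zscores(xs):
    # Standardize a series to zero mean and unit variance.
    m = sum(xs) / len(xs)
    s = (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5 or 1.0
    return [(x - m) / s for x in xs]

def fit_line(x, y):
    # Ordinary least squares for y = a*x + b on one predictor.
    mx, my = sum(x) / len(x), sum(y) / len(y)
    a = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) /
         sum((xi - mx) ** 2 for xi in x))
    return a, my - a * mx

# Invented daily signals for one metro area over a week.
asthma_tweets = [12, 15, 30, 42, 18, 11, 38]    # geo-tagged asthma tweet counts
aqi           = [40, 45, 90, 120, 55, 38, 105]  # EPA air-quality index
ed_visits     = [20, 22, 35, 44, 24, 19, 40]    # asthma-related ED visits

# Fuse the standardized signals into one social/environmental index.
index = [(t + q) / 2 for t, q in zip(zscores(asthma_tweets), zscores(aqi))]
a, b = fit_line(index, ed_visits)
pred = [a * xi + b for xi in index]
```

The real study feeds many more engineered features into machine-learning models; the sketch only shows how heterogeneous signals can be normalized onto a common scale before regression.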

LITERATURE SURVEY:

USE OF HANGEUL TWITTER TO TRACK AND PREDICT HUMAN INFLUENZA INFECTION

AUTHOR: Kim, Eui-Ki, et al.

PUBLISH: PLoS ONE, vol. 8, no. 7, e69305, 2013.

EXPLANATION:

Influenza epidemics arise through the accumulation of viral genetic changes. The emergence of new virus strains coincides with a higher level of influenza-like illness (ILI), which is seen as a peak of a normal season. Monitoring the spread of an epidemic influenza in populations is a difficult and important task. Twitter is a free social networking service whose messages can improve the accuracy of forecasting models by providing early warnings of influenza outbreaks. In this study, we have examined the use of information embedded in the Hangeul Twitter stream to detect rapidly evolving public awareness or concern with respect to influenza transmission and developed regression models that can track levels of actual disease activity and predict influenza epidemics in the real world. Our prediction model using a delay mode provides not only a real-time assessment of the current influenza epidemic activity but also a significant improvement in prediction performance at the initial phase of ILI peak when prediction is of most importance.

A NEW AGE OF PUBLIC HEALTH: IDENTIFYING DISEASE OUTBREAKS BY ANALYZING TWEETS

AUTHOR: Krieck, Manuela, Johannes Dreesman, Lubomir Otrusina, and Kerstin Denecke.

PUBLISH: In Proceedings of Health Web-Science Workshop, ACM Web Science Conference. 2011.

EXPLANATION:

Traditional disease surveillance is a very time-consuming reporting process. Cases of notifiable diseases are reported to the different levels of the national health care system before actions can be taken. But early detection of disease activity, followed by a rapid response, is crucial to reduce the impact of epidemics. To address this challenge, alternative sources of information are investigated for disease surveillance. In this paper, the relevance of Twitter messages for outbreak detection is investigated from two directions. First, Twitter messages potentially related to disease outbreaks are retrospectively searched and analyzed. Second, incoming Twitter messages are assessed with respect to their relevance for outbreak detection. The studies show that Twitter messages can be – to a certain extent – highly relevant for early detection of public health threats. Under the German Protection against Infection Act (Infektionsschutzgesetz, IfSG, 2001), traditional disease surveillance relies on data from mandatory reporting of cases by physicians and laboratories. They inform local county health departments (Landkreis), which in turn report to state health departments (Land). At the end of the reporting pipeline, the national surveillance institute (Robert Koch Institute) is informed about the outbreak. It is clear that these different stages of reporting take time and delay a timely reaction.

TOWARDS DETECTING INFLUENZA EPIDEMICS BY ANALYZING TWITTER MESSAGES

AUTHOR: Culotta, Aron.

PUBLISH: In Proceedings of the first workshop on social media analytics, pp. 115-122. ACM, 2010.

EXPLANATION:

Rapid response to a health epidemic is critical to reduce loss of life. Existing methods mostly rely on expensive surveys of hospitals across the country, typically with lag times of one to two weeks for influenza reporting, and even longer for less common diseases. In response, there have been several recently proposed solutions to estimate a population’s health from Internet activity, most notably Google’s Flu Trends service, which correlates search term frequency with influenza statistics reported by the Centers for Disease Control and Prevention (CDC). In this paper, we analyze messages posted on the micro-blogging site Twitter.com to determine if a similar correlation can be uncovered. We propose several methods to identify influenza-related messages and compare a number of regression models to correlate these messages with CDC statistics. Using over 500,000 messages spanning 10 weeks, we find that our best model achieves a correlation of 0.78 with CDC statistics by leveraging a document classifier to identify relevant messages.
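The core measurement in such studies, correlating a weekly tweet-derived signal with official ILI statistics, reduces to a Pearson correlation. A minimal sketch with invented weekly counts:

```python
from math import sqrt

def pearson(x, y):
    # Pearson correlation coefficient between two equal-length series.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Invented weekly counts over 10 weeks: flu-related tweets vs. an ILI rate.
flu_tweets = [120, 150, 180, 260, 400, 520, 480, 390, 250, 160]
cdc_ili    = [1.1, 1.3, 1.6, 2.4, 3.8, 4.9, 4.6, 3.5, 2.2, 1.4]
r = pearson(flu_tweets, cdc_ili)
```

In the paper the tweet series itself comes from a document classifier that filters relevant messages first; the correlation step above is the same.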

SYSTEM ANALYSIS

EXISTING SYSTEM:

With the increased availability of information on the Web, a new research area has developed in recent years, namely infodemiology. It can be defined as the “science of distribution and determinants of information in an electronic medium, specifically the Internet, or in a population, with the ultimate aim to inform public health and public policy”. As part of this research area, several kinds of data have been studied for their applicability to disease surveillance. Google Flu Trends exploits search behavior to monitor current flu-related disease activity. Carneiro and Mylonakis showed that Google Flu Trends can detect regional outbreaks of influenza 7–10 days before conventional Centers for Disease Control and Prevention surveillance systems.

The relevance of Twitter messages for disease outbreak detection has already been reported; in particular, tweets proved useful in predicting outbreaks such as a Norovirus outbreak at a university. Chew et al. analysed Twitter messages during the 2009 influenza epidemic, comparing use of the terms “H1N1” and “swine flu” over time. Furthermore, they analysed the content of the tweets (ten content concepts) and validated Twitter as a real-time content source. They analysed the data via Infovigil, an infosurveillance system, using automated coding. To find out whether there is a relationship between automated and manual coding, the tweets were evaluated with Pearson’s correlation; Chew et al. found a significant correlation between both codings for seven content concepts. It still needs to be investigated whether this source is relevant for detecting disease outbreaks in Germany; therefore, only German keywords are exploited to identify Twitter messages. Further, we are interested not only in influenza-like illnesses, as in the studies available so far, but also in other infectious diseases (e.g. Norovirus and Salmonella).

DISADVANTAGES:

Tweets have a common format: [username] [text] [date time client], and their length is restricted to 140 characters. Linguistically, each Twitter user writes as he or she likes, so messages range from complete sentences to lists of keywords. Hashtags, i.e. terms combined with a hash sign (e.g. #flu), denote topics and are primarily used by experienced users. According to their contents, messages can:

  • Provide information,
  • Express opinions, or
  • Report personal issues.

When information is provided, its authority normally cannot be determined, so it may be unverified. Opinions are often expressed with humor or sarcasm and may be highly contradictory in the emotions they express.
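As a rough illustration of the message format described above, a few lines suffice to split a tweet into its [username] [text] [date time client] parts and pull out disease-related hashtags. The regular expression and keyword set are illustrative assumptions, not a robust parser:

```python
import re

# Matches "[username] text [date time client]".
TWEET_RE = re.compile(r"^\[(?P<user>[^\]]+)\]\s+(?P<text>.*?)\s+\[(?P<meta>[^\]]+)\]$")
DISEASE_TAGS = {"#flu", "#influenza", "#norovirus"}  # illustrative keyword set

def parse_tweet(raw):
    m = TWEET_RE.match(raw)
    if not m:
        return None
    text = m.group("text")
    return {
        "user": m.group("user"),
        "text": text,
        "hashtags": set(re.findall(r"#\w+", text.lower())),
    }

def disease_related(tweet):
    # Crude relevance filter: any overlap with the disease hashtag set.
    return bool(tweet["hashtags"] & DISEASE_TAGS)

t = parse_tweet("[alice] Fever again, third day home #flu [2009-11-02 10:15 web]")
```

A real filter would also have to handle messages without hashtags, misspellings, and the information/opinion/personal distinction described above.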

PROPOSED SYSTEM:

We propose methods to leverage social media, internet search, and environmental air-quality data to estimate ED visits for asthma in a relatively discrete geographic area (a metropolitan area) within a relatively short time period (days). To this end, we have gathered asthma-related ED visit data, social media data from Twitter, internet users’ search interests from Google, and pollution sensor data from the EPA, all from the same geographic area and time period, to create a model for predicting asthma-related ED visits. This work differs from extant studies, which typically predict the spread of contagious diseases using social media such as Twitter. Unlike influenza or other viral diseases, asthma is a non-communicable health condition, and we demonstrate the utility and value of linking big data from diverse sources in developing predictive models for non-communicable diseases, with a specific focus on asthma.

Research studies have explored the use of novel data sources to propose rapid, cost-effective health-status surveillance methodologies. Some of the early studies rely on document classification, suggesting that Twitter data can be highly relevant for early detection of public health threats. Others employ more complex linguistic analysis, such as the Ailment Topic Aspect Model, which is useful for syndromic surveillance. This type of analysis is useful for demonstrating the significance of social media as a promising new data source for health surveillance. Other recent studies have linked social media data with real-world disease incidence to generate actionable knowledge useful for making health care decisions. These include work that analyzed Twitter messages related to influenza and correlated them with reported CDC statistics, and work that validated Twitter as a real-time content, sentiment, and public-attention trend-tracking tool. Collier employed supervised classifiers (SVM and Naive Bayes) to classify tweets into four self-reported protective-behavior categories. This study adds to the evidence supporting a high degree of correlation between pre-diagnostic social media signals and diagnostic influenza case data.
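A Naive Bayes tweet classifier of the kind Collier used can be sketched as multinomial NB with add-one smoothing. The tiny training set and category labels below are invented for illustration:

```python
import math
from collections import Counter, defaultdict

def train_nb(samples):
    """samples: list of (word_list, label). Multinomial Naive Bayes
    with add-one (Laplace) smoothing."""
    word_counts = defaultdict(Counter)  # label -> word frequencies
    label_counts = Counter()
    vocab = set()
    for words, label in samples:
        label_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return word_counts, label_counts, vocab

def classify(words, model):
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best, best_lp = None, float("-inf")
    for label in label_counts:
        # log prior + sum of smoothed log likelihoods.
        lp = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in words:
            lp += math.log((word_counts[label][w] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

# Invented two-class training set: self-reported illness vs. other flu talk.
train = [
    (["i", "have", "the", "flu", "today"], "self-report"),
    (["feeling", "sick", "flu", "fever"], "self-report"),
    (["flu", "season", "news", "article"], "other"),
    (["cdc", "reports", "flu", "activity"], "other"),
]
model = train_nb(train)
```

Collier's actual task had four protective-behavior categories and a far larger corpus; the smoothing and log-probability mechanics are the same.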

ADVANTAGES:

Our work uses a combination of data from multiple sources to predict the number of asthma-related ED visits in near real-time. In doing so, we exploit the geographic information associated with each dataset. We describe the techniques to process multiple types of datasets, extract signals from each, integrate them, and feed them into a prediction model using machine learning algorithms, and we demonstrate the feasibility of such a prediction.

The main contributions of this work are:

  • Analysis of tweets with respect to their relevance for disease surveillance,
  • Content analysis and content classification of tweets,
  • Linguistic analysis of disease-reporting twitter messages,
  • Recommendations on search patterns for tweet search in the context of disease surveillance.

HARDWARE & SOFTWARE REQUIREMENTS:

HARDWARE REQUIREMENT:

  • Processor       –    Pentium IV
  • Speed           –    1 GHz
  • RAM             –    256 MB (min)
  • Hard Disk       –    20 GB
  • Floppy Drive    –    1.44 MB
  • Key Board       –    Standard Windows Keyboard
  • Mouse           –    Two or Three Button Mouse
  • Monitor         –    SVGA

SOFTWARE REQUIREMENTS:

  • Operating System   :    Windows XP or Windows 7
  • Front End          :    Java JDK 1.7
  • Back End           :    MySQL Server
  • Server             :    Apache Tomcat Server
  • Script             :    JSP
  • Document           :    MS Office 2007

PERFORMING INITIATIVE DATA PREFETCHING IN DISTRIBUTED FILE SYSTEMS FOR CLOUD COMPUTING

ABSTRACT:

We present an initiative data prefetching scheme on the storage servers in distributed file systems for cloud computing. In this prefetching technique, the client machines are not substantially involved in the process of data prefetching; instead, the storage servers can directly prefetch the data after analyzing the history of disk I/O access events, and then send the prefetched data to the relevant client machines proactively. To put this technique to work, the information about client nodes is piggybacked onto the real client I/O requests and then forwarded to the relevant storage server. Next, two prediction algorithms have been proposed to forecast future block access operations, directing what data should be fetched on storage servers in advance.

Finally, the prefetched data can be pushed to the relevant client machine from the storage server. Through a series of evaluation experiments with a collection of application benchmarks, we have demonstrated that our initiative prefetching technique can help distributed file systems for cloud environments achieve better I/O performance. In particular, configuration-limited client machines in the cloud are not responsible for predicting I/O access operations, which clearly contributes to preferable system performance on them.

INTRODUCTION

The assimilation of distributed computing for search engines, multimedia websites, and data-intensive applications has brought about the generation of data at unprecedented speed. For instance, the amount of data created, replicated, and consumed in the United States may double every three years through the end of this decade. In general, the file system deployed in a distributed computing environment is called a distributed file system, which is commonly used as a backend storage system to provide I/O services for various sorts of data-intensive applications in cloud computing environments. In fact, the distributed file system employs multiple distributed I/O devices by striping file data across the I/O nodes, and uses high aggregate bandwidth to meet the growing I/O requirements of distributed and parallel scientific applications.

However, because distributed file systems scale both numerically and geographically, network delay is becoming the dominant factor in remote file system access [26], [34]. With regard to this issue, numerous data prefetching mechanisms have been proposed to hide the latency in distributed file systems caused by network communication and disk operations. In these conventional prefetching mechanisms, the client file system (which is a part of the file system and runs on the client machine) is supposed to predict future accesses by analyzing the history of I/O accesses, without any application intervention. After that, the client file system may send relevant I/O requests to the storage servers to read the relevant data in advance. Consequently, applications with intensive read workloads can automatically achieve not only better use of available bandwidth, but also fewer file operations via batched I/O requests through prefetching.

On the other hand, mobile devices generally have limited processing power, battery life and storage, while cloud computing offers the illusion of infinite computing resources. The mobile cloud computing research field emerged to combine mobile devices and cloud computing into a new infrastructure [45]. Namely, mobile cloud computing provides mobile applications with data storage and processing services in clouds, obviating the need for a powerful hardware configuration, because all resource-intensive computing can be completed in the cloud. Thus, conventional prefetching schemes are not the best-suited optimization strategies for distributed file systems to boost I/O performance in mobile clouds, since these schemes require the client file systems running on client machines to proactively issue prefetching requests after analyzing the access events recorded by them, which inevitably places a burden on the client nodes.

Furthermore, although disk I/O events reveal disk-track information that offers critical input for I/O optimization tactics, and certain prefetching techniques have accordingly been proposed to read data on the disk in advance after analyzing disk I/O traces, this kind of prefetching only works for local file systems, and the prefetched data is cached on the local machine to fulfill the application’s I/O requests passively. In brief, although block access history reveals the behavior of disk tracks, there are no prefetching schemes on storage servers in a distributed file system for yielding better system performance. The reason for this situation is the difficulty of modeling the block access history to generate block access patterns, and of deciding the destination client machine to which the prefetched data should be pushed from the storage servers.
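One simple way to model block access history on the storage server is a first-order Markov chain over observed block IDs, keyed by the client identity piggybacked on each I/O request: the server records transitions and prefetches the most frequent successor of the client's last block. This is a minimal sketch of the idea, not the paper's two prediction algorithms:

```python
from collections import Counter, defaultdict

class BlockPredictor:
    """First-order Markov model over the per-client block access history,
    maintained on the storage server side."""
    def __init__(self):
        self.trans = defaultdict(Counter)  # block -> Counter of successor blocks
        self.last = {}                     # client id -> last accessed block

    def record(self, client, block):
        # Called for every incoming I/O request; the client id is the
        # information piggybacked onto the request.
        prev = self.last.get(client)
        if prev is not None:
            self.trans[prev][block] += 1
        self.last[client] = block

    def predict(self, client):
        # Candidate block to prefetch and push to this client, if any.
        prev = self.last.get(client)
        if prev is None or not self.trans[prev]:
            return None
        return self.trans[prev].most_common(1)[0][0]

p = BlockPredictor()
# Client "c1" repeatedly touches blocks 7 -> 8 -> 9.
for blk in [7, 8, 9, 7, 8, 9, 7, 8]:
    p.record("c1", blk)
```

Knowing the client key is exactly what solves the "destination client machine" problem noted above: the server can push `p.predict("c1")` back to `c1` proactively.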

LITERATURE SURVEY

PARTIAL REPLICATION OF METADATA TO ACHIEVE HIGH METADATA AVAILABILITY IN PARALLEL FILE SYSTEMS

AUTHOR: J. Liao, Y. Ishikawa

PUBLISH: In the Proceedings of 41st International Conference on Parallel Processing (ICPP ’12), pp. 168–177, 2012.

EXPLANATION:

This paper presents PARTE, a prototype parallel file system with active/standby configured metadata servers (MDSs). PARTE replicates and distributes a part of each file’s metadata to the corresponding metadata stripes on the storage servers (OSTs) with per-file granularity; meanwhile, the client file system (client) keeps certain sent metadata requests. If the active MDS has crashed for some reason, these client backup requests will be replayed by the standby MDS to restore the lost metadata. In case one or more backup requests are lost due to network problems or dead clients, the latest metadata saved in the associated metadata stripes will be used to construct consistent and up-to-date metadata on the standby MDS. Moreover, the clients and OSTs can work in both normal mode and recovery mode in the PARTE file system. This differs from conventional parallel file systems with active/standby configured MDSs, which hang all I/O requests and metadata requests during restoration of the lost metadata. In the PARTE file system, previously connected clients can continue to perform I/O operations and relevant metadata operations, because the OSTs work as temporary MDSs during that period by using the replicated metadata in the relevant metadata stripes. Through examination of experimental results, we show the feasibility of the main ideas presented in this paper for providing a highly available metadata service with only a slight overhead effect on I/O performance. Furthermore, since previously connected clients are never hung during metadata recovery, in contrast to conventional systems, a better overall I/O data throughput can be achieved with PARTE.

EVALUATING PERFORMANCE AND ENERGY IN FILE SYSTEM SERVER WORKLOADS

AUTHOR: P. Sehgal, V. Tarasov, E. Zadok

PUBLISH: the 8th USENIX Conference on File and Storage Technologies (FAST ’10), pp.253-266, 2010.

EXPLANATION:

Recently, power has emerged as a critical factor in designing components of storage systems, especially for power-hungry data centers. While there is some research into power-aware storage stack components, there are no systematic studies evaluating each component’s impact separately. This paper evaluates the file system’s impact on energy consumption and performance. We studied several popular Linux file systems, with various mount and format options, using the FileBench workload generator to emulate four server workloads: Web, database, mail, and file server. For a server node consisting of a single disk, CPU power generally exceeds disk-power consumption. However, file system design, implementation, and available features have a significant effect on CPU/disk utilization, and hence on performance and power. We discovered that default file system options are often suboptimal, and even poor. We show that a careful matching of expected workloads to file system types and options can improve power-performance efficiency by a factor ranging from 1.05 to 9.4 times.

FLEXIBLE, WIDEAREA STORAGE FOR DISTRIBUTED SYSTEMS WITH WHEELFS

AUTHOR: J. Stribling, Y. Sovran, I. Zhang, R. Morris et al.

PUBLISH: In Proceedings of the 6th USENIX symposium on Networked systems design and implementation (NSDI’09), USENIX Association, pp. 43–58, 2009.

EXPLANATION:

WheelFS is a wide-area distributed storage system intended to help multi-site applications share data and gain fault tolerance. WheelFS takes the form of a distributed file system with a familiar POSIX interface. Its design allows applications to adjust the tradeoff between prompt visibility of updates from other sites and the ability for sites to operate independently despite failures and long delays. WheelFS allows these adjustments via semantic cues, which provide application control over consistency, failure handling, and file and replica placement. WheelFS is implemented as a user-level file system and is deployed on PlanetLab and Emulab. Three applications (a distributed Web cache, an email service and large file distribution) demonstrate that WheelFS’s file system interface simplifies construction of distributed applications by allowing reuse of existing software. These applications would perform poorly with the strict semantics implied by a traditional file system interface, but by providing cues to WheelFS they are able to achieve good performance. Measurements show that applications built on WheelFS deliver comparable performance to services such as CoralCDN and BitTorrent that use specialized wide-area storage systems.

SYSTEM ANALYSIS

EXISTING SYSTEM:

The file system deployed in a distributed computing environment is called a distributed file system, which commonly serves as a backend storage system providing I/O services for various sorts of data-intensive applications in cloud computing environments. In fact, the distributed file system employs multiple distributed I/O devices by striping file data across the I/O nodes, and uses high aggregate bandwidth to meet the growing I/O requirements of distributed and parallel scientific applications. The existing evaluation uses the Sysbench benchmark to create OLTP workloads, since it is able to create OLTP workloads similar to those that exist in real systems. All the configured client file systems executed the same script, and each of them ran several threads that issue OLTP requests. Because Sysbench requires MySQL installed as a backend for OLTP workloads, we configured the mysqld process on 16 cores of the storage servers. As a consequence, it is possible to measure the response time to client requests while handling the generated workloads.

DISADVANTAGES:

  • High network latency when accessing geographically remote file systems
  • Mobile devices generally have limited processing power, battery life and storage

PROPOSED SYSTEM:

Prefetching schemes have been proposed in succession that read data on the disk in advance after analyzing disk I/O traces, but such prefetching only works for local file systems, and the prefetched data is cached on the local machine to fulfill the application’s I/O requests passively. In brief, although block access history reveals the behavior of disk tracks, there are no prefetching schemes on storage servers in a distributed file system for yielding better system performance. The reason is the difficulty of modeling the block access history to generate block access patterns and of deciding which client machine the prefetched data should be pushed to from the storage servers. To yield attractive I/O performance in a distributed file system deployed in a mobile cloud environment, or in a cloud environment that has many resource-limited client machines, this paper presents an initiative data prefetching mechanism. The proposed mechanism first analyzes disk I/O traces to predict future disk I/O accesses so that the storage servers can fetch data in advance, and then forwards the prefetched data to the relevant client file systems for potential future use.

This paper makes the following two contributions:

1) Chaotic time series prediction and linear regression prediction to forecast disk I/O access. We have modeled the disk I/O access operations and classified them into two kinds of access patterns, i.e., the random access pattern and the sequential access pattern. In order to predict future I/O accesses belonging to these different access patterns as accurately as possible (the future I/O access indicates what data will be requested in the near future), two prediction algorithms, the chaotic time series prediction algorithm and the linear regression prediction algorithm, have been proposed respectively. 2) Initiative data prefetching on storage servers. Without any intervention from client file systems, except for piggybacking their information onto relevant I/O requests to the storage servers, the storage servers log disk I/O accesses and classify access patterns after modeling disk I/O events. Next, by properly using the two proposed prediction algorithms, the storage servers can predict future disk I/O accesses to guide data prefetching. Finally, the storage servers proactively forward the prefetched data to the relevant client file systems to satisfy future application requests.
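
As an illustration of the sequential-pattern case, the linear regression predictor can be sketched in a few lines. This is a minimal sketch under the assumption that block offsets are fitted against access times over a window of recent I/O events; the paper’s exact formulation and window management may differ:

```python
def fit_line(times, offsets):
    """Least-squares fit offset = a * time + b over recent I/O events."""
    n = len(times)
    mean_t = sum(times) / n
    mean_o = sum(offsets) / n
    cov = sum((t - mean_t) * (o - mean_o) for t, o in zip(times, offsets))
    var = sum((t - mean_t) ** 2 for t in times)
    a = cov / var
    b = mean_o - a * mean_t
    return a, b

def predict_next_offsets(times, offsets, future_times):
    """Extrapolate the fitted line to decide which blocks to prefetch."""
    a, b = fit_line(times, offsets)
    return [a * t + b for t in future_times]

# A strictly sequential trace: the block offset grows by 8 per time unit.
history_t = [1, 2, 3, 4, 5]
history_o = [100, 108, 116, 124, 132]
print(predict_next_offsets(history_t, history_o, [6, 7]))  # [140.0, 148.0]
```

For random access patterns, where a linear fit no longer holds, the chaotic time series predictor would take the place of `fit_line`.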

ADVANTAGES:

  • Applications with intensive read workloads automatically make better use of the available bandwidth
  • Less file operations via batched I/O requests through prefetching
  • Cloud computing offers an illusion of infinite computing resources

HARDWARE & SOFTWARE REQUIREMENTS:

HARDWARE REQUIREMENT:

  • Processor      –    Pentium IV
  • Speed          –    1 GHz
  • RAM            –    256 MB (min)
  • Hard Disk      –    20 GB
  • Floppy Drive   –    1.44 MB
  • Key Board      –    Standard Windows Keyboard
  • Mouse          –    Two or Three Button Mouse
  • Monitor        –    SVGA

SOFTWARE REQUIREMENTS:

JAVA

  • Operating System   :   Windows XP or Windows 7
  • Front End          :   Java JDK 1.7
  • Script             :   JavaScript
  • Document           :   MS Office 2007

PASSIVE IP TRACEBACK: DISCLOSING THE LOCATIONS OF IP SPOOFERS FROM PATH BACKSCATTER

ABSTRACT:

It has long been known that attackers may use forged source IP addresses to conceal their real locations. To capture the spoofers, a number of IP traceback mechanisms have been proposed. However, due to the challenges of deployment, there has not been a widely adopted IP traceback solution, at least at the Internet level. As a result, the mist over the locations of spoofers has never been dissipated till now.

This paper proposes passive IP traceback (PIT) that bypasses the deployment difficulties of IP traceback techniques. PIT investigates Internet Control Message Protocol error messages (named path backscatter) triggered by spoofing traffic, and tracks the spoofers based on public available information (e.g., topology). In this way, PIT can find the spoofers without any deployment requirement.

This paper illustrates the causes, collection, and the statistical results on path backscatter, demonstrates the processes and effectiveness of PIT, and shows the captured locations of spoofers through applying PIT on the path backscatter data set.

These results can help further reveal IP spoofing, which has long been studied but never well understood. Though PIT cannot work for all spoofing attacks, it may be the most useful mechanism to trace spoofers until an Internet-level traceback system is deployed in practice.

INTRODUCTION

IP spoofing, which means attackers launching attacks with forged source IP addresses, has long been recognized as a serious security problem on the Internet. By using addresses that are assigned to others or not assigned at all, attackers can avoid exposing their real locations, enhance the effect of attacking, or launch reflection-based attacks. A number of notorious attacks rely on IP spoofing, including SYN flooding, SMURF, DNS amplification, etc. A DNS amplification attack that severely degraded the service of a Top Level Domain (TLD) name server has been reported. Though there has been a popular conventional wisdom that DoS attacks are launched from botnets and spoofing is no longer critical, the ARBOR report at the 50th NANOG meeting shows that spoofing is still significant in observed DoS attacks. Indeed, based on the backscatter messages captured by the UCSD Network Telescope, spoofing activities are still frequently observed.

Capturing the origins of IP spoofing traffic is of great importance. As long as the real locations of spoofers are not disclosed, they cannot be deterred from launching further attacks. Even merely approximating the spoofers, for example by determining the ASes or networks they reside in, narrows the attackers down to a smaller area and allows filters to be placed closer to the attacker before attacking traffic gets aggregated. Last but not least, identifying the origins of spoofing traffic can help build a reputation system for ASes, which would be helpful to push the corresponding ISPs to verify IP source addresses.

Instead of proposing another IP traceback mechanism with improved tracking capability, we propose a novel solution, named Passive IP Traceback (PIT), to bypass the challenges in deployment. Routers may fail to forward an IP spoofing packet for various reasons, e.g., the TTL being exceeded. In such cases, the routers may generate an ICMP error message (named path backscatter) and send it to the spoofed source address. Because these routers can be close to the spoofers, the path backscatter messages may potentially disclose the locations of the spoofers. PIT exploits these path backscatter messages to find the location of the spoofers. With the locations of the spoofers known, the victim can seek help from the corresponding ISP to filter out the attacking packets, or take other countermeasures. PIT is especially useful for victims of reflection-based spoofing attacks, e.g., DNS amplification attacks, because the victims can find the locations of the spoofers directly from the attacking traffic.
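
The extraction step at the heart of PIT can be sketched as follows. This is a hypothetical, simplified IPv4 parser for illustration only (real ICMP handling must deal with header options, checksums, and message variants), and the addresses in the demo are documentation-range examples:

```python
import socket

def parse_path_backscatter(ip_packet):
    """Extract (reporting router, spoofed source, original destination)
    from a raw IPv4 packet carrying an ICMP error message.
    ICMP errors embed the IP header of the packet that triggered them."""
    ihl = (ip_packet[0] & 0x0F) * 4    # outer IP header length in bytes
    reporting_router = socket.inet_ntoa(ip_packet[12:16])
    icmp = ip_packet[ihl:]
    if icmp[0] not in (3, 11, 12):     # dest unreachable, time exceeded, param problem
        return None                    # not a path backscatter message
    inner = icmp[8:]                   # embedded original IP header
    spoofed_src = socket.inet_ntoa(inner[12:16])
    original_dst = socket.inet_ntoa(inner[16:20])
    return reporting_router, spoofed_src, original_dst

# Synthetic time-exceeded message: router 10.0.0.1 reports a packet that
# claimed source 198.51.100.5 (the spoofed address) heading to 203.0.113.9.
pkt = (b"\x45" + b"\x00" * 11
       + socket.inet_aton("10.0.0.1") + socket.inet_aton("198.51.100.5")
       + b"\x0b\x00" + b"\x00" * 6    # ICMP type 11 (time exceeded), code 0
       + b"\x45" + b"\x00" * 11
       + socket.inet_aton("198.51.100.5") + socket.inet_aton("203.0.113.9"))
print(parse_path_backscatter(pkt))  # ('10.0.0.1', '198.51.100.5', '203.0.113.9')
```

The reporting router’s address, combined with topology and routing data, is what lets PIT place the spoofer near that router.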

In this article, we first illustrate the generation, types, collection, and security issues of path backscatter messages in Section III. Then, in Section IV, we present PIT, which tracks the location of the spoofers based on path backscatter messages together with topology and routing information. We discuss how to apply PIT when both topology and routing are known, when only topology is known, and when neither is known. We also present two effective algorithms to apply PIT in large-scale networks. In the following section, we first show the statistical results on path backscatter messages. Then we evaluate the two key mechanisms of PIT that work without routing information. Finally, we give the tracking results of applying PIT to the path backscatter message dataset: a number of ASes in which spoofers are found.

Our work has the following contributions:

1) To our knowledge, this is the first article that deeply investigates path backscatter messages. These messages are valuable for understanding spoofing activities. Though Moore et al. [8] have exploited backscatter messages, which are generated by the targets of spoofing traffic, to study Denial of Service (DoS) attacks, path backscatter messages, which are sent by intermediate devices rather than the targets, have not been used in traceback.

2) A practical and effective IP traceback solution based on path backscatter messages, i.e., PIT, is proposed. PIT bypasses the deployment difficulties of existing IP traceback mechanisms and is actually already in force. Given the limitation that path backscatter messages are not generated with stable probability, PIT cannot work for all attacks, but it does work for a number of spoofing activities. At the least, it may be the most useful traceback mechanism until an AS-level traceback system is deployed in practice.

3) By applying PIT to the path backscatter dataset, a number of locations of spoofers are captured and presented. Though this is not a complete list, it is the first known list disclosing the locations of spoofers.

LITERATURE SURVEY

DEFENSE AGAINST SPOOFED IP TRAFFIC USING HOP-COUNT FILTERING

PUBLICATION: IEEE/ACM Trans. Netw., vol. 15, no. 1, pp. 40–53, Feb. 2007.

AUTHORS: H. Wang, C. Jin, and K. G. Shin

EXPLANATION:

IP spoofing has often been exploited by Distributed Denial of Service (DDoS) attacks to: 1) conceal flooding sources and dilute localities in flooding traffic, and 2) coax legitimate hosts into becoming reflectors, redirecting and amplifying flooding traffic. Thus, the ability to filter spoofed IP packets near victim servers is essential to their own protection and to preventing them from becoming involuntary DoS reflectors. Although an attacker can forge any field in the IP header, he cannot falsify the number of hops an IP packet takes to reach its destination. More importantly, since hop-count values are diverse, an attacker cannot randomly spoof IP addresses while maintaining consistent hop-counts. On the other hand, an Internet server can easily infer the hop-count information from the Time-to-Live (TTL) field of the IP header. Using a mapping between IP addresses and their hop-counts, the server can distinguish spoofed IP packets from legitimate ones. Based on this observation, we present a novel filtering technique, called Hop-Count Filtering (HCF), which builds an accurate IP-to-hop-count (IP2HC) mapping table to detect and discard spoofed IP packets. HCF is easy to deploy, as it does not require any support from the underlying network. Through analysis using network measurement data, we show that HCF can identify close to 90% of spoofed IP packets and discard them with little collateral damage. We implement and evaluate HCF in the Linux kernel, demonstrating its effectiveness with experimental measurements.
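
The core HCF check can be sketched as follows. This is a simplified illustration (the real scheme clusters address prefixes and keeps the IP2HC table up to date), and the set of common initial TTL values and the sample table entry are assumptions:

```python
# Operating systems pick the initial TTL from a small well-known set, so
# the sender's initial TTL can be guessed from the received TTL.
INITIAL_TTLS = (30, 32, 60, 64, 128, 255)

def infer_hop_count(received_ttl):
    """Hop count = smallest plausible initial TTL minus received TTL."""
    initial = min(t for t in INITIAL_TTLS if t >= received_ttl)
    return initial - received_ttl

def is_spoofed(src_ip, received_ttl, ip2hc, tolerance=0):
    """Flag a packet whose inferred hop count contradicts the IP2HC table."""
    expected = ip2hc.get(src_ip)
    if expected is None:
        return False  # unknown source: cannot judge, let it pass
    return abs(infer_hop_count(received_ttl) - expected) > tolerance

ip2hc = {"198.51.100.5": 14}                   # learned from prior traffic
print(is_spoofed("198.51.100.5", 50, ip2hc))   # True: inferred 10 hops, expected 14
print(is_spoofed("198.51.100.5", 114, ip2hc))  # False: inferred 14 hops matches
```

A nonzero `tolerance` absorbs small hop-count fluctuations caused by route changes.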

DYNAMIC PROBABILISTIC PACKET MARKING FOR EFFICIENT IP TRACEBACK

PUBLICATION: Comput. Netw., vol. 51, no. 3, pp. 866–882, 2007.

AUTHORS: J. Liu, Z.-J. Lee, and Y.-C. Chung

EXPLANATION:

Recently, denial-of-service (DoS) attack has become a pressing problem due to the lack of an efficient method to locate the real attackers and ease of launching an attack with readily available source codes on the Internet. Traceback is a subtle scheme to tackle DoS attacks. Probabilistic packet marking (PPM) is a new way for practical IP traceback. Although PPM enables a victim to pinpoint the attacker’s origin to within 2–5 equally possible sites, it has been shown that PPM suffers from uncertainty under spoofed marking attack. Furthermore, the uncertainty factor can be amplified significantly under distributed DoS attack, which may diminish the effectiveness of PPM. In this work, we present a new approach, called dynamic probabilistic packet marking (DPPM), to further improve the effectiveness of PPM. Instead of using a fixed marking probability, we propose to deduce the traveling distance of a packet and then choose a proper marking probability. DPPM may completely remove uncertainty and enable victims to precisely pinpoint the attacking origin even under spoofed marking DoS attacks. DPPM supports incremental deployment. Formal analysis indicates that DPPM outperforms PPM in most aspects.
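
The effect DPPM aims for can be seen in a small simulation. This is a toy sketch under the assumption that each router can correctly deduce its position i on the packet’s path and marks with probability 1/i, overwriting any earlier mark; the outcome is that every router on the path is equally likely to leave the surviving mark, unlike fixed-probability PPM:

```python
import random
from collections import Counter

def dppm_mark(path_len):
    """Router i (1-indexed) deduces the traveled distance and marks the
    packet with probability 1/i, overwriting any earlier mark."""
    mark = None
    for i in range(1, path_len + 1):
        if random.random() < 1.0 / i:
            mark = i
    return mark

random.seed(1)
counts = Counter(dppm_mark(5) for _ in range(100000))
print(sorted(counts.items()))  # each of the 5 routers survives in ~20000 packets
```

With p_i = 1/i, the chance that router i’s mark survives is (1/i) · (i/(i+1)) · … · ((n−1)/n) = 1/n, so the victim collects marks from near and far routers at the same rate.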

FLEXIBLE DETERMINISTIC PACKET MARKING: AN IP TRACEBACK SYSTEM TO FIND THE REAL SOURCE OF ATTACKS

PUBLICATION: IEEE Trans. Parallel Distrib. Syst., vol. 20, no. 4, pp. 567–580, Apr. 2009.

AUTHORS: Y. Xiang, W. Zhou, and M. Guo

EXPLANATION:

IP traceback is the enabling technology to control Internet crime. In this paper we present a novel and practical IP traceback system called Flexible Deterministic Packet Marking (FDPM), which provides a defense system with the ability to find out the real sources of attacking packets that traverse the network. While a number of other traceback schemes exist, FDPM provides innovative features to trace the source of IP packets and can obtain better tracing capability than others. In particular, FDPM adopts a flexible mark-length strategy to make it compatible with different network environments; it also adaptively changes its marking rate according to the load of the participating router via a flexible flow-based marking scheme. Evaluations of both simulation and a real system implementation demonstrate that FDPM requires a moderately small number of packets to complete the traceback process, adds little additional load to routers, and can trace a large number of sources in one traceback process with low false positive rates. The built-in overload prevention mechanism makes this system capable of achieving a satisfactory traceback result even when the router is heavily loaded. It has been used not only to trace DDoS attack packets but also to enhance the filtering of attack traffic.

SYSTEM ANALYSIS

EXISTING SYSTEM:

In existing IP marking approaches, routers probabilistically write some encoding of partial path information into packets during forwarding. A basic technique, the edge sampling algorithm, writes edge information into the packets. This scheme reserves two static fields of the size of an IP address, start and end, and a static distance field in each packet. Each router updates these fields as follows. Each router marks the packet with some probability. When a router decides to mark the packet, it writes its own IP address into the start field and writes zero into the distance field. Otherwise, if the distance field is already zero, which indicates that the previous router marked the packet, it writes its own IP address into the end field, thus representing the edge between itself and the previous router.

If a router does not mark the packet, it always increments the distance field. Thus, the distance field in the packet indicates the number of routers the packet has traversed from the router that marked it to the victim. The distance field should be updated using saturating addition, meaning that the distance field is not allowed to wrap. The mandatory increment of the distance field is used to prevent spoofing by an attacker: any packet written by the attacker will have a distance field greater than or equal to the length of the real attack path. We call a router a false positive if it is in the reconstructed attack graph but not in the real attack graph. Similarly, we call a router a false negative if it is in the true attack graph but not in the reconstructed attack graph. We call a solution to the IP traceback problem robust if it has very low rates of false negatives and false positives.
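
The marking procedure described above can be sketched as a small simulation. The marking probability and router names are assumed for illustration; a real implementation encodes the start, end, and distance fields compactly into the IP header:

```python
import random

def route_packet(path, p):
    """Edge sampling along a path of routers: each packet carries
    (start, end, distance) fields, initially unset."""
    start, end, distance = None, None, None
    for router in path:
        if random.random() < p:
            start, end, distance = router, None, 0   # mark: reset the fields
        else:
            if distance == 0:
                end = router                         # completes edge (start, end)
            if distance is not None:
                distance += 1                        # saturating add in a real header
    return start, end, distance

# The victim recovers edges from marked packets and chains them by distance.
random.seed(0)
marks = [route_packet(["R1", "R2", "R3"], p=0.5) for _ in range(2000)]
edges = sorted({(s, e) for s, e, d in marks if e is not None})
print(edges)  # [('R1', 'R2'), ('R2', 'R3')]
```

Sorting the recovered edges by their distance values reconstructs the attack path from the victim back toward the marking routers.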

DISADVANTAGES:

  • Existing approach has a very high computation overhead for the victim to reconstruct the attack paths, and gives a large number of false positives when the denial-of-service attack originates from multiple attackers.
  • Existing approach can require days of computation to reconstruct the attack paths and give thousands of false positives even when there are only 25 distributed attackers. This approach is also vulnerable to compromised routers.
  • If a router is compromised, it can forge markings from other uncompromised routers and hence lead the reconstruction to wrong results. Even worse, the victim will not be able to tell that a router is compromised just from the information in the packets it receives.

PROPOSED SYSTEM:

We propose a novel solution, named Passive IP Traceback (PIT), to bypass the challenges in deployment. Routers may fail to forward an IP spoofing packet for various reasons, e.g., the TTL being exceeded. In such cases, the routers may generate an ICMP error message (named path backscatter) and send it to the spoofed source address. Because these routers can be close to the spoofers, the path backscatter messages may potentially disclose the locations of the spoofers. PIT exploits these path backscatter messages to find the location of the spoofers. With the locations of the spoofers known, the victim can seek help from the corresponding ISP to filter out the attacking packets, or take other countermeasures. PIT is especially useful for victims of reflection-based spoofing attacks, e.g., DNS amplification attacks, because the victims can find the locations of the spoofers directly from the attacking traffic.

We present PIT, which tracks the location of the spoofers based on path backscatter messages together with topology and routing information. We discuss how to apply PIT when both topology and routing are known, when only topology is known, and when neither is known. We also present two effective algorithms to apply PIT in large-scale networks. In the following section, we first show the statistical results on path backscatter messages. Then we evaluate the two key mechanisms of PIT that work without routing information. Finally, we give the tracking results of applying PIT to the path backscatter message dataset: a number of ASes in which spoofers are found.

ADVANTAGES:

1) To our knowledge, this is the first article that deeply investigates path backscatter messages. These messages are valuable for understanding spoofing activities. Though prior work has exploited backscatter messages, which are generated by the targets of spoofing traffic, to study Denial of Service (DoS) attacks, path backscatter messages, which are sent by intermediate devices rather than the targets, have not been used in traceback.

2) A practical and effective IP traceback solution based on path backscatter messages, i.e., PIT, is proposed. PIT bypasses the deployment difficulties of existing IP traceback mechanisms and is actually already in force. Given the limitation that path backscatter messages are not generated with stable probability, PIT cannot work for all attacks, but it does work for a number of spoofing activities. At the least, it may be the most useful traceback mechanism until an AS-level traceback system is deployed in practice.

3) By applying PIT to the path backscatter dataset, a number of locations of spoofers are captured and presented. Though this is not a complete list, it is the first known list disclosing the locations of spoofers.

HARDWARE & SOFTWARE REQUIREMENTS:

HARDWARE REQUIREMENT:

  • Processor      –    Pentium IV
  • Speed          –    1 GHz
  • RAM            –    256 MB (min)
  • Hard Disk      –    20 GB
  • Floppy Drive   –    1.44 MB
  • Key Board      –    Standard Windows Keyboard
  • Mouse          –    Two or Three Button Mouse
  • Monitor        –    SVGA

SOFTWARE REQUIREMENTS:

  • Operating System   :   Windows XP or Windows 7
  • Front End          :   Java JDK 1.7
  • Document           :   MS Office 2007

Optimal Configuration of Network Coding in Ad Hoc Networks

Abstract: 

We analyze the impact of network coding (NC) configuration on the performance of ad hoc networks, considering two significant factors, namely, the throughput loss and the decoding loss, which are jointly treated as the overhead of NC. In particular, physical-layer NC and random linear NC are adopted in static and mobile ad hoc networks (MANETs), respectively. Furthermore, we characterize the goodput and the delay/goodput tradeoff in static networks, which are also analyzed in MANETs for different mobility models (i.e., the random independent and identically distributed mobility model and the random walk model) and transmission schemes.

Introduction:

Network coding was initially designed as a kind of source coding. Further studies showed that the capacity of wired networks can be improved by network coding (NC), which can fully utilize the network resources.

Due to this advantage, how to employ NC in wireless ad hoc networks has been intensively studied in recent years, with the purpose of improving the throughput and delay performance. The main difference between wired and wireless networks is that there is non-negligible interference between nodes in wireless networks.

Therefore, it is important to design NC for wireless ad hoc networks with interference so as to improve system performance metrics such as the goodput and the delay/goodput tradeoff.

Existing System:

The probability that random linear NC is valid for a multicast connection problem on an arbitrary network with independent sources is at least (1 − d/q)^η, where η is the number of links with associated random coefficients, d is the number of receivers, and q is the size of the Galois field Fq.

It is obvious that a large q is required to guarantee that a system with RLNC is valid. When considering these two factors, the traditional definition of throughput in ad hoc networks is no longer appropriate, since it does not account for the bits of NC coefficients and the linearly correlated packets that do not carry any valuable data. Instead, the goodput and the delay/goodput tradeoff are investigated in this paper, which only take into account the successfully decoded data.

Moreover, if we treat the data size of each packet, the generation size (the number of packets that are combined by NC as a group), and the size of the NC coefficient Galois field as the configuration of NC, it is necessary to find the scaling laws of the optimal configuration for a given network model and transmission scheme.
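
The validity bound above is easy to evaluate numerically; the sketch below (parameter values are arbitrary examples) shows why a large field size q is needed:

```python
# Lower bound on the probability that random linear NC is valid:
# (1 - d/q) ** eta, with d receivers, eta links carrying random
# coefficients, and Galois-field size q.
def rlnc_validity_bound(d, eta, q):
    return (1.0 - d / q) ** eta

# The bound approaches 1 as the field size q grows.
for q in (16, 256, 65536):
    print(q, round(rlnc_validity_bound(d=2, eta=10, q=q), 4))
```

The tension this exposes is exactly the configuration problem: a larger q makes decoding more likely to succeed, but each coefficient then costs more header bits, increasing the throughput loss.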

Disadvantages:

  • Throughput loss.
  • The decoding loss.
  • Time delay.

Proposed System:

The proposed system starts with the basic idea of NC and the scaling laws of the throughput loss and decoding loss. Furthermore, some useful concepts and parameters are listed. Finally, we give the definitions of some network performance metrics.

Physical-layer network coding (PNC) is designed based on the channel state information (CSI) and the network topology. PNC is appropriate for static networks, since the CSI and network topology are known in advance in the static case.

There are G nodes in one cell, and node i (i = 1, 2, . . . , G) holds packet xi. All of the G packets are independent, and they belong to the same unicast session. The packets are transmitted to a node i’ in the next cell simultaneously. gii’ is a complex number that represents the CSI between i and i’ in the frequency domain.

Advantages:

  • System minimizes data loss.
  • System reduces time delay.

Modules:

Network Topology:

The network consists of n randomly and evenly distributed static nodes in a unit square area. These nodes are randomly grouped into S–D (source–destination) pairs.

Transmission Model:

We adopt the protocol model, a simplified version of the physical model that ignores long-distance interference and transmission. Moreover, it has been shown that the physical model can be treated as the protocol model in terms of scaling laws when transmission is allowed only if the signal-to-interference-plus-noise ratio is larger than a given threshold.

Transmission Schemes for Mobile Networks:

When the relay receives a new packet, it combines the packets it holds with the one it receives using randomly selected coefficients, and then generates a new packet. Simultaneous transmission in one cell is not allowed, since it is hard for the receiver to obtain multiple CSI from different transmitters at the same time. Hence, we employ random linear NC for the mobile models.
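
The relay behaviour described above can be sketched with random linear NC over GF(2), the simplest coefficient field: a coded packet is the XOR of a random subset of the generation, and a receiver decodes once it has collected enough linearly independent combinations. This is an illustrative sketch only (real deployments typically use GF(2^8) coefficients):

```python
import random

def rlnc_encode(packets, rng):
    """A coded packet: random GF(2) coefficients over the generation,
    payload = XOR of the packets whose coefficient is 1."""
    coeffs = [rng.randint(0, 1) for _ in packets]
    payload = bytes(len(packets[0]))
    for c, pkt in zip(coeffs, packets):
        if c:
            payload = bytes(a ^ b for a, b in zip(payload, pkt))
    return coeffs, payload

def rlnc_decode(coded, n):
    """Gauss-Jordan elimination over GF(2): recover the n original
    packets from linearly independent (coefficients, payload) pairs."""
    pivots = {}                               # pivot column -> (coeffs, payload)
    for coeffs, payload in coded:
        c, p = list(coeffs), bytearray(payload)
        for col, (pc, pp) in pivots.items():  # reduce by known pivots
            if c[col]:
                for k in range(n):
                    c[k] ^= pc[k]
                for k in range(len(p)):
                    p[k] ^= pp[k]
        lead = next((k for k in range(n) if c[k]), None)
        if lead is None:
            continue                          # linearly dependent, discard
        for col, (pc, pp) in pivots.items():  # keep existing pivots reduced
            if pc[lead]:
                for k in range(n):
                    pc[k] ^= c[k]
                for k in range(len(pp)):
                    pp[k] ^= p[k]
        pivots[lead] = (c, p)
    if len(pivots) < n:
        return None                           # generation not yet decodable
    return [bytes(pivots[i][1]) for i in range(n)]

rng = random.Random(7)
originals = [b"alpha!", b"bravo!", b"charly"]
coded, decoded = [], None
while decoded is None:                        # keep receiving coded packets
    coded.append(rlnc_encode(originals, rng))
    decoded = rlnc_decode(coded, len(originals))
print(decoded)  # [b'alpha!', b'bravo!', b'charly']
```

Discarded dependent combinations are exactly the decoding loss the paper accounts for; a larger coefficient field makes them rarer at the cost of more coefficient bits per packet.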

Conclusion:

We analyzed the NC configuration in both static and mobile ad hoc networks to optimize the delay/goodput tradeoff and the goodput, taking into consideration the throughput loss and decoding loss of NC. These results reveal the impact of network scale on the NC system, which has not been studied in previous works. Moreover, we also compared the performance with that of corresponding networks without NC.

ON TRAFFIC-AWARE PARTITION AND AGGREGATION IN MAPREDUCE FOR BIG DATA APPLICATIONS

ABSTRACT:

To reduce network traffic within a MapReduce job, we consider aggregating data with the same keys before sending them to remote reduce tasks. Although a similar function, called combine, has already been adopted by Hadoop, it operates immediately after a map task, solely on that task’s generated data, failing to exploit the data aggregation opportunities among multiple tasks on different machines. We jointly consider data partition and aggregation for a MapReduce job with the objective of minimizing the total network traffic. In particular, we propose a distributed algorithm for big data applications by decomposing the original large-scale problem into several subproblems that can be solved in parallel. Moreover, an online algorithm is designed to deal with data partition and aggregation in a dynamic manner. Finally, extensive simulation results demonstrate that our proposals can significantly reduce the network traffic cost in both offline and online cases.

INTRODUCTION

MapReduce has emerged as the most popular computing framework for big data processing due to its simple programming model and automatic management of parallel execution. MapReduce and its open-source implementation Hadoop have been adopted by leading companies, such as Yahoo!, Google and Facebook, for various big data applications, such as machine learning, bioinformatics, and cybersecurity. MapReduce divides a computation into two main phases, namely map and reduce, which in turn are carried out by several map tasks and reduce tasks, respectively. In the map phase, map tasks are launched in parallel to convert the original input splits into intermediate data in the form of key/value pairs. These key/value pairs are stored on the local machine and organized into multiple data partitions, one per reduce task. In the reduce phase, each reduce task fetches its own share of data partitions from all map tasks to generate the final result.
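
The two phases described above can be sketched as a minimal in-memory MapReduce (a hypothetical toy, not Hadoop’s actual API), with word count as the classic example job:

```python
from collections import defaultdict

def map_reduce(splits, map_fn, reduce_fn, n_reducers=2):
    """Minimal in-memory sketch of the MapReduce flow: map tasks emit
    key/value pairs, the shuffle step hash-partitions them, and each
    reduce task merges the values for its keys."""
    # Map phase: one data partition per reduce task.
    partitions = [defaultdict(list) for _ in range(n_reducers)]
    for split in splits:
        for key, value in map_fn(split):
            partitions[hash(key) % n_reducers][key].append(value)
    # Reduce phase: each reduce task processes its own partition.
    result = {}
    for part in partitions:
        for key, values in part.items():
            result[key] = reduce_fn(key, values)
    return result

counts = map_reduce(
    ["big data big", "data shuffle"],
    map_fn=lambda line: [(w, 1) for w in line.split()],
    reduce_fn=lambda k, vs: sum(vs))
print(counts)  # {'big': 2, 'data': 2, 'shuffle': 1} (in some order)
```

The `hash(key) % n_reducers` line is the traffic-oblivious partition step that the rest of the paper sets out to improve.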

There is a shuffle step between the map and reduce phases.

In this step, the data produced by the map phase are ordered, partitioned, and transferred to the appropriate machines executing the reduce phase. The resulting all-to-all traffic pattern from map tasks to reduce tasks can cause a great volume of network traffic, imposing a serious constraint on the efficiency of data-analytic applications. For example, with tens of thousands of machines, data shuffling accounts for 58.6% of the cross-pod traffic and amounts to over 200 petabytes in total in the analysis of SCOPE jobs. For shuffle-heavy MapReduce tasks, this traffic can incur a considerable performance overhead of up to 30–40%. By default, intermediate data are shuffled according to a hash function in Hadoop, which can lead to large network traffic because it ignores network topology and the data size associated with each key.

We consider a toy example with two map tasks and two reduce tasks, where the intermediate data of three keys K1, K2, and K3 are denoted by rectangle bars under each machine. If the hash function assigns the data of K1 and K3 to reducer 1, and K2 to reducer 2, a large amount of traffic will go through the top switch. To tackle this problem, incurred by the traffic-oblivious partition scheme, we take into account both task locations and the data size associated with each key in this paper. By assigning keys with larger data sizes to reduce tasks closer to the map tasks, network traffic can be significantly reduced. In the same example, if we assign K1 and K3 to reducer 2, and K2 to reducer 1, as shown in Fig. 1(b), the data transferred through the top switch will be significantly reduced.
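
The toy example can be made concrete with assumed sizes and link costs (all numbers below are hypothetical): enumerate every key-to-reducer assignment and pick the one with the least total traffic. The traffic-aware optimum sends each key’s bulk data to the nearby reducer:

```python
from itertools import product

# Hypothetical toy instance: data size produced per (map task, key), and
# the cost of moving one unit from a map task to a reducer
# (1 = same rack, 3 = through the top switch).
data = {("M1", "K1"): 5, ("M1", "K2"): 1, ("M1", "K3"): 4,
        ("M2", "K1"): 1, ("M2", "K2"): 6, ("M2", "K3"): 1}
cost = {("M1", "R1"): 3, ("M1", "R2"): 1,
        ("M2", "R1"): 1, ("M2", "R2"): 3}

def traffic(assignment):
    """Total network cost of shipping every key's data to its reducer."""
    return sum(size * cost[(m, assignment[k])]
               for (m, k), size in data.items())

keys, reducers = ["K1", "K2", "K3"], ["R1", "R2"]
best = min((dict(zip(keys, choice)) for choice in
            product(reducers, repeat=len(keys))), key=traffic)
print(best, traffic(best))  # {'K1': 'R2', 'K2': 'R1', 'K3': 'R2'} 24
```

Brute force works only for toy instances; for real jobs with many keys, the paper’s decomposition into parallel subproblems replaces this enumeration.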

To further reduce network traffic within a MapReduce job, we consider aggregating data with the same keys before sending them to remote reduce tasks. Although a similar function, called combine, has already been adopted by Hadoop, it operates immediately after a map task, solely on that task’s generated data, failing to exploit the data aggregation opportunities among multiple tasks on different machines. As the example in Fig. 2(a) shows, in the traditional scheme, two map tasks individually send data of key K1 to the reduce task. If we aggregate the data of the same keys before sending them over the top switch, as shown in Fig. 2(b), the network traffic will be reduced.
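
The cross-task aggregation idea can be sketched as follows, assuming the values for a key can be merged commutatively (counts, in this toy example):

```python
from collections import defaultdict

def aggregate_before_shuffle(map_outputs):
    """Merge key/value pairs from several co-located map tasks before
    they cross the top switch, so each key is sent once per machine."""
    merged = defaultdict(int)
    for task_output in map_outputs:
        for key, value in task_output:
            merged[key] += value
    return dict(merged)

# Two map tasks on the same machine both emit data for K1.
sent = aggregate_before_shuffle([[("K1", 3), ("K2", 1)], [("K1", 2)]])
print(sent)  # {'K1': 5, 'K2': 1} -> K1 crosses the top switch once, not twice
```

Unlike Hadoop’s per-task combine, this merges the outputs of multiple map tasks, which is the aggregation opportunity the paper exploits.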

In this paper, we jointly consider data partition and aggregation for a MapReduce job with the objective of minimizing the total network traffic. In particular, we propose a distributed algorithm for big data applications by decomposing the original large-scale problem into several subproblems that can be solved in parallel. Moreover, an online algorithm is designed to deal with data partition and aggregation in a dynamic manner. Finally, extensive simulation results demonstrate that our proposals can significantly reduce the network traffic cost in both offline and online cases.

LITERATURE SURVEY

MAPREDUCE: SIMPLIFIED DATA PROCESSING ON LARGE CLUSTERS

AUTHOR: J. Dean and S. Ghemawat

PUBLISH: Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.

EXPLANATION:

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google’s clusters every day.
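The programming model described above can be illustrated with the classic word-count job. This is a minimal in-memory sketch: the runtime concerns the paper handles (partitioning, scheduling, fault tolerance) are omitted, and the shuffle is simulated with a simple grouping step.

```python
from collections import defaultdict
from itertools import chain

def map_fn(_key, line):
    # Emit an intermediate (word, 1) pair for every word in the line.
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    # Merge all intermediate values associated with the same key.
    return (word, sum(counts))

def run_job(documents):
    intermediate = chain.from_iterable(
        map_fn(i, doc) for i, doc in enumerate(documents))
    groups = defaultdict(list)
    for k, v in intermediate:          # the "shuffle" phase, in miniature
        groups[k].append(v)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

print(run_job(["big data big cluster", "big job"]))
```

Running this prints `{'big': 3, 'cluster': 1, 'data': 1, 'job': 1}`; the user only writes `map_fn` and `reduce_fn`, which is the division of labor the paper describes.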

CLOUDBLAST: COMBINING MAPREDUCE AND VIRTUALIZATION ON DISTRIBUTED RESOURCES FOR BIOINFORMATICS APPLICATIONS

AUTHOR: A. Matsunaga, M. Tsugawa, and J. Fortes,

PUBLISH: IEEE Fourth International Conference on eScience. IEEE, 2008, pp. 222–229.

EXPLANATION:

This paper proposes and evaluates an approach to the parallelization, deployment and management of bioinformatics applications that integrates several emerging technologies for distributed computing. The proposed approach uses the MapReduce paradigm to parallelize tools and manage their execution, machine virtualization to encapsulate their execution environments and commonly used data sets into flexibly deployable virtual machines, and network virtualization to connect resources behind firewalls/NATs while preserving the necessary performance and communication environment. An implementation of this approach is described and used to demonstrate and evaluate it. The implementation integrates Hadoop, Virtual Workspaces, and ViNe as the MapReduce, virtual machine and virtual network technologies, respectively, to deploy the commonly used bioinformatics tool NCBI BLAST on a WAN-based test bed consisting of clusters at two distinct locations, the University of Florida and the University of Chicago. This WAN-based implementation, called CloudBLAST, was evaluated against both non-virtualized and LAN-based implementations in order to assess the overheads of machine and network virtualization, which were shown to be insignificant. To compare the proposed approach against an MPI-based solution, CloudBLAST performance was experimentally contrasted with the publicly available mpiBLAST on the same WAN-based test bed. Both versions demonstrated performance gains as the number of available processors increased, with CloudBLAST delivering speedups of 57, against 52.4 for the MPI version, when 64 processors at 2 sites were used. The results encourage the use of the proposed approach for the execution of large-scale bioinformatics applications on emerging distributed environments that provide access to computing resources as a service.

MAP TASK SCHEDULING IN MAPREDUCE WITH DATA LOCALITY: THROUGHPUT AND HEAVY-TRAFFIC OPTIMALITY

AUTHOR: W. Wang, K. Zhu, L. Ying, J. Tan, and L. Zhang

PUBLISH: INFOCOM, 2013 Proceedings IEEE. IEEE, 2013, pp. 1609–1617.

EXPLANATION:

Scheduling map tasks to improve data locality is crucial to the performance of MapReduce. Many works have been devoted to increasing data locality for better efficiency. However, to the best of our knowledge, fundamental limits of MapReduce computing clusters with data locality, including the capacity region and theoretical bounds on the delay performance, have not been studied. In this paper, we address these problems from a stochastic network perspective. Our focus is to strike the right balance between data-locality and load-balancing to simultaneously maximize throughput and minimize delay.

We present a new queueing architecture and propose a map task scheduling algorithm constituted by the Join the Shortest Queue policy together with the MaxWeight policy. We identify an outer bound on the capacity region, and then prove that the proposed algorithm stabilizes any arrival rate vector strictly within this outer bound. It shows that the algorithm is throughput optimal and the outer bound coincides with the actual capacity region. Further, we study the number of backlogged tasks under the proposed algorithm, which is directly related to the delay performance based on Little’s law. We prove that the proposed algorithm is heavy-traffic optimal, i.e., it asymptotically minimizes the number of backlogged tasks as the arrival rate vector approaches the boundary of the capacity region. Therefore, the proposed algorithm is also delay optimal in the heavy-traffic regime.
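The two policies named above can be sketched in a highly simplified, hypothetical form (this is not the paper's exact algorithm): an arriving map task joins the shortest queue among its data-local servers (JSQ), and a free server then serves the queue with the largest backlog-times-rate weight (MaxWeight), where local service is assumed faster than remote service.

```python
LOCAL_RATE, REMOTE_RATE = 1.0, 0.5   # assumed service rates

def route_task(queues, local_servers):
    """JSQ: join the shortest queue among the task's data-local servers."""
    target = min(local_servers, key=lambda s: len(queues[s]))
    queues[target].append("task")
    return target

def pick_queue_to_serve(queues, server):
    """MaxWeight: serve the nonempty queue maximizing backlog * service rate."""
    def weight(s):
        return len(queues[s]) * (LOCAL_RATE if s == server else REMOTE_RATE)
    best = max(queues, key=weight)
    return best if queues[best] else None

queues = {0: [], 1: [], 2: []}
# Each task lists the servers holding its input data (made-up locality sets).
for locals_ in [(0, 1), (0, 1), (1, 2), (0, 2), (1, 2), (0, 1)]:
    route_task(queues, locals_)

# Server 2 has local work, but queue 0's backlog makes remote help worthwhile.
print({s: len(q) for s, q in queues.items()})
print(pick_queue_to_serve(queues, 2))
```

The final step shows the load-balancing half of the trade-off: server 2 steals from the long remote queue 0 once the backlog weight outweighs the locality discount.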

SYSTEM ANALYSIS

EXISTING SYSTEM:

The existing problem is optimizing network usage in MapReduce scheduling. The reason we are interested in network usage is twofold. First, network utilization is a quantity of independent interest, as it is directly related to the throughput of the system. Note that the total amount of data processed in unit time is simply (CPU utilization)·(CPU capacity) + (network utilization)·(network capacity). CPU utilization will always be 1 as long as there are enough jobs in the map queue, but network utilization can be very sensitive to scheduling. Indeed, network utilization has been identified as a key component in the optimization of MapReduce systems in several previous works.
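A worked instance of the throughput identity quoted above, with made-up capacities, shows why network utilization matters even when the CPU is saturated:

```python
# Hypothetical capacities for one machine.
cpu_capacity = 80.0   # MB/s of data a fully busy CPU can process
net_capacity = 40.0   # MB/s the network can shuffle

def throughput(cpu_util, net_util):
    # Total data processed per unit time, per the identity in the text.
    return cpu_util * cpu_capacity + net_util * net_capacity

# CPU stays saturated; only network utilization varies with scheduling.
print(throughput(1.0, 0.5))   # poorly scheduled shuffle -> 100.0
print(throughput(1.0, 0.9))   # well scheduled shuffle   -> 116.0
```

With these numbers, raising network utilization from 0.5 to 0.9 through better scheduling lifts system throughput by 16%, with no change in CPU behavior.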

Second, optimizing network usage could lead to algorithms with a smaller mean response time. The main motivation for this direction of our work comes from the aforementioned results, where algorithms that overlap the map and shuffle phases are shown to yield significantly better mean response time than Hadoop's fair scheduler. However, we observed that neither of these two algorithms explicitly attempts to optimize network usage, which suggests room for improvement. Since MapReduce has become one of the most popular frameworks for large-scale distributed computing, there exists a huge body of work regarding performance optimization of MapReduce.

For instance, researchers have tried to optimize MapReduce systems by efficiently detecting and eliminating so-called “stragglers,” providing better data locality, preventing starvation caused by large jobs, and analyzing the problem from a purely theoretical viewpoint. The shuffle workload available at any given time is closely related to the output rate of the map phase, due to the inherent dependency between the map and shuffle phases. In particular, when the job being processed is ‘map-heavy,’ the available workload of the same job in the shuffle phase is upper-bounded by the output rate of the map phase. Therefore, poor scheduling of map tasks can have adverse effects on the throughput of the shuffle phase, causing the network to be idle and the efficiency of the entire system to decrease.

DISADVANTAGES:

The existing model, called the overlapping tandem queue model, is a job-level model for MapReduce in which the map and shuffle phases of the MapReduce framework are modeled as two queues placed in tandem. Since it is a job-level model, each job is represented by only its map size and shuffle size. This simplification is justified by two main assumptions. The first assumption states that each job consists of a large number of small-sized tasks, which allows us to represent the progress of each phase by real numbers.
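A toy discrete-time fluid simulation makes the tandem coupling concrete (all rates and sizes here are hypothetical, not from the model's analysis): the map queue drains at the CPU rate, and its output feeds the shuffle queue, which drains at the network rate but can never get ahead of what the map phase has produced.

```python
def simulate(map_size, shuffle_size, cpu_rate, net_rate, steps, eps=1e-9):
    map_left = float(map_size)
    shuffle_left = float(shuffle_size)
    shuffle_avail = 0.0          # shuffle data released so far by map progress
    for t in range(steps):
        done_map = min(cpu_rate, map_left)
        map_left -= done_map
        # Map progress releases a proportional share of the shuffle workload.
        shuffle_avail += done_map * (shuffle_size / map_size)
        done_net = min(net_rate, shuffle_left, shuffle_avail)
        shuffle_left -= done_net
        shuffle_avail -= done_net
        if map_left <= eps and shuffle_left <= eps:  # eps guards float drift
            return t + 1         # response time in time steps
    return None                  # job not finished within `steps`

# A map-heavy job: the network idles early on because shuffle data
# only becomes available as map tasks complete.
print(simulate(map_size=10, shuffle_size=4, cpu_rate=2, net_rate=4, steps=100))
```

Here the network could move 4 units per step but only ever gets 0.8 per step from the map phase, illustrating how a map-heavy job upper-bounds shuffle throughput.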

The job-level model differs in two significant ways from the more complicated task-level models.

Firstly, it gives rise to algorithms that are much simpler than those of task-level models, which improves their chances of being deployed in an actual system.

Secondly, the number of jobs in a system is often smaller than the number of tasks by several orders of magnitude, making the problem computationally much less strenuous. Note that there are still open questions regarding the general applicability of the job-level model's additional assumptions, which are interesting research questions in their own right.

PROPOSED SYSTEM:

In this paper, we jointly consider data partition and aggregation for a MapReduce job with the objective of minimizing the total network traffic. In particular, we propose a distributed algorithm for big data applications by decomposing the original large-scale problem into several subproblems that can be solved in parallel. Moreover, an online algorithm is designed to deal with data partition and aggregation in a dynamic manner. Finally, extensive simulation results demonstrate that our proposals can significantly reduce network traffic cost in both offline and online cases.

A MapReduce resource allocation system can enhance the performance of MapReduce jobs in the cloud by locating intermediate data on local or close-by physical machines; this locality awareness reduces the network traffic generated in the shuffle phase in the cloud data center. However, little work has studied how to optimize the network performance of the shuffle process, which generates large amounts of data traffic in MapReduce jobs. A critical factor in the network performance of the shuffle phase is the intermediate data partition. The default scheme adopted by Hadoop is hash-based partition, which can yield unbalanced loads among reduce tasks because it is unaware of the data size associated with each key.

We have developed a fairness-aware key partition approach that keeps track of the distribution of intermediate keys' frequencies and guarantees a fair distribution among reduce tasks. Prior work has introduced a combiner function that reduces the amount of data to be shuffled and merged to reduce tasks, as well as an in-mapper combining scheme that exploits the fact that mappers can preserve state across the processing of multiple input key/value pairs and defer emission of intermediate data until all input records have been processed. Both proposals are constrained to a single map task, ignoring the data aggregation opportunities across multiple map tasks. A MapReduce-like system has also been proposed to decrease traffic by pushing aggregation from the edge into the network.
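One simple way to realize a fairness-aware partition in the spirit described above is a greedy, size-aware assignment (the key sizes below are hypothetical, and the paper may use a different rule): keys are placed largest-first onto the currently least-loaded reduce task, instead of by a size-oblivious hash.

```python
def fair_partition(key_sizes, n_reducers):
    """Greedy largest-first assignment of keys to the least-loaded reducer."""
    loads = [0] * n_reducers
    assignment = {}
    for key, size in sorted(key_sizes.items(), key=lambda kv: -kv[1]):
        r = loads.index(min(loads))      # least-loaded reducer so far
        assignment[key] = r
        loads[r] += size
    return assignment, loads

# Hypothetical tracked frequencies (bytes of intermediate data per key).
key_sizes = {"K1": 50, "K2": 20, "K3": 20, "K4": 8, "K5": 2}
assignment, loads = fair_partition(key_sizes, n_reducers=2)
print(assignment)
print(loads)   # perfectly balanced here: [50, 50]
```

A hash partition could easily put K1 and K2 on the same reducer (70 vs. 30); the greedy rule yields a 50/50 split on this instance.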

ADVANTAGES:

  • We compare our proposed distributed algorithm with the optimal solution obtained by solving the MILP formulation. Due to the high computational complexity of the MILP formulation, we consider small-scale problem instances with 10 keys in this set of simulations.
  • Our distributed algorithm comes very close to the optimal solution. Although the network traffic cost increases as the number of keys grows for all algorithms, the performance advantage of our proposed algorithm over the other two schemes becomes larger.
  • We also compare our distributed algorithm with the other two schemes under a default simulation setting with a number of parameters, and then study the performance by changing one parameter while fixing the others. We consider a MapReduce job with 100 keys, with the other parameters the same as above. The network traffic cost is an increasing function of the number of keys from 1 to 100 under all algorithms.

HARDWARE & SOFTWARE REQUIREMENTS:

HARDWARE REQUIREMENT:

  • Processor      –    Pentium IV

  • Speed       –    1 GHz
  • RAM       –    256 MB (min)
  • Hard Disk      –   20 GB
  • Floppy Drive       –    1.44 MB
  • Key Board      –    Standard Windows Keyboard
  • Mouse       –    Two or Three Button Mouse
  • Monitor      –    SVGA

SOFTWARE REQUIREMENTS:

  • Operating System        :           Windows XP or Win7
  • Front End       :           JAVA JDK 1.7
  • Script :           Java Script
  • Tool :           NetBeans 7
  • Document :           MS-Office 2007

Network-Based Modeling and Intelligent Data Mining of Social Media for Improving Care

Abstract

Intelligently extracting knowledge from social media has recently attracted great interest from the Biomedical and Health Informatics community to simultaneously improve healthcare outcomes and reduce costs using consumer-generated opinion. We propose a two-step analysis framework that focuses on positive and negative sentiment, as well as the side effects of treatment, in users’ forum posts, and identifies user communities (modules) and influential users for the purpose of ascertaining user opinion of cancer treatment. We used a self-organizing map to analyze word frequency data derived from users’ forum posts. We then introduced a novel network-based approach for modeling users’ forum interactions and employed a network partitioning method based on optimizing a stability quality measure. This allowed us to determine consumer opinion and identify influential users within the retrieved modules using information derived from both word-frequency data and network-based properties. Our approach can expand research into intelligently mining social media data for consumer opinion of various treatments to provide rapid, up-to-date information for the pharmaceutical industry, hospitals, and medical staff, on the effectiveness (or ineffectiveness) of future treatments.

Index Terms—Data mining, complex networks, neural networks, semantic web, social computing.

INTRODUCTION

Social media is providing limitless opportunities for patients to discuss their experiences with drugs and devices, and for companies to receive feedback on their products and services [1]–[3]. Pharmaceutical companies are prioritizing social network monitoring within their IT departments, creating an opportunity for rapid dissemination and feedback of products and services to optimize and enhance delivery, increase turnover and profit, and reduce costs [4]. Social media data harvesting for bio-surveillance has also been reported [5]. Social media enables a virtual networking environment. Modeling social media using available network modeling and computational tools is one way of extracting knowledge and trends from the information ‘cloud’: a social network is a structure made of nodes and edges that connect nodes in various relationships. Graphical representation is the most common method to visually represent the information. Network modeling can also be used to study the simulation of network properties and internal dynamics.

A sociomatrix can be used to construct representations of a social network structure. Node degree, network density, and other large-scale parameters can derive information about the importance of certain entities within the network. Such communities are clusters or modules. Specific algorithms can perform network-clustering, one of the fundamental tasks in network analysis. Detecting particular user communities requires identifying specific, networked nodes that will allow information extraction. Healthcare providers could use patient opinion to improve their services. Physicians could collect feedback from other doctors and patients to improve their treatment recommendations and results. Patients could use other consumers’ knowledge in making better-informed healthcare decisions.
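The large-scale parameters mentioned above are straightforward to compute from a sociomatrix. The following sketch uses a hypothetical 4-user forum-reply adjacency matrix; the data and interpretation are illustrative only.

```python
# Hypothetical sociomatrix: socio[i][j] == 1 if user i interacted with user j.
socio = [
    [0, 1, 1, 1],   # user 0 replied to users 1, 2, 3
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]

n = len(socio)
degree = [sum(row) for row in socio]

# Density of an undirected network: edges present / edges possible.
n_edges = sum(degree) / 2
density = n_edges / (n * (n - 1) / 2)

print(degree)    # user 0 has the highest degree: a candidate influencer
print(density)
```

Here user 0's degree of 3 (out of a possible 3) marks it as the best-connected node, and the density of 2/3 says two thirds of all possible ties exist in this small network.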

The nature of social networks makes data collection difficult. Several methods have been employed, such as link mining [6], classification through links [7], predictions based on objects [8], links [9], existence [10], estimation [11], object [12], group [13], and subgroup detection [14], and mining the data [15], [16]. Link prediction, viral marketing, online discussion groups (and rankings) allow for the development of solutions based on user feedback.

Traditional social sciences use surveys and involve subjects in the data collection process, resulting in small sample sizes per study. With social media, more content is readily available, particularly when combined with web-crawling and scraping software that would allow real-time monitoring of changes within the network. Previous studies used technical solutions to extract user sentiment on influenza [17], technology stocks [18], context and sentence structure [19], online shopping [20], multiple classifications [21], government health monitoring [22], specific terms relating to consumer satisfaction [23], polarity of newspaper articles [24], and assessment of user satisfaction from companies [25], [26]. Despite the extensive literature, none have identified influential users, and how forum relationships affect network dynamics. In the first stage of our current study, we employ exploratory analysis using the self-organizing maps (SOMs) to assess correlations between user posts and positive or negative opinion on the drug. In the second stage, we model the users and their posts using a network-based approach.

We build on our previous study [27] and use an enhanced method for identifying user communities (modules) and influential users therein. The current approach effectively searches for potential levels of organization (scales) within the networks and uncovers dense modules using a partition stability quality measure [28]. The approach enables us to find the optimal network partition. We subsequently enrich the retrieved modules with word-frequency information from the users' posts contained in each module to derive local and global measures of user opinion, and to raise flags about potential side effects of Erlotinib, a drug used in the treatment of one of the most prevalent cancers: lung cancer [29].
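The stability measure of [28] is beyond a short sketch, but its simpler cousin, Newman modularity, conveys the same idea of scoring a candidate partition by how many edges fall inside modules versus what a random graph would give. The toy graph and partitions below are made up for illustration.

```python
# A toy graph: two triangles (0,1,2) and (3,4,5) joined by one bridge edge.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

def modularity(edges, community):
    """Newman modularity: sum over communities of e_c - a_c^2, where e_c is
    the fraction of edges inside community c and a_c its degree fraction."""
    m = len(edges)
    inside, deg = {}, {}
    for u, v in edges:
        if community[u] == community[v]:
            inside[community[u]] = inside.get(community[u], 0) + 1
        for node in (u, v):
            c = community[node]
            deg[c] = deg.get(c, 0) + 1
    return sum(inside.get(c, 0) / m - (deg[c] / (2 * m)) ** 2 for c in deg)

good = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}  # the two triangles
bad  = {0: "A", 1: "B", 2: "A", 3: "B", 4: "A", 5: "B"}  # arbitrary split
print(modularity(edges, good))   # clearly positive
print(modularity(edges, bad))    # negative: worse than random
```

A partition optimizer searches over such candidate assignments for the one maximizing the score, which is the role the stability measure plays in the method above.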

NEIGHBOR SIMILARITY TRUST AGAINST SYBIL ATTACK IN P2P E-COMMERCE

ABSTRACT:

In this paper, we present a distributed structured approach to Sybil attack. This is derived from the fact that our approach is based on the neighbor similarity trust relationship among the neighbor peers. Given a P2P e-commerce trust relationship based on interest, the transactions among peers are flexible as each peer can decide to trade with another peer any time. A peer doesn’t have to consult others in a group unless a recommendation is needed. This approach shows the advantage in exploiting the similarity trust relationship among peers in which the peers are able to monitor each other.

Our contribution in this paper is threefold:

1) We propose SybilTrust that can identify and protect honest peers from Sybil attack. The Sybil peers can have their trust canceled and dismissed from a group.

2) Based on the group infrastructure in P2P e-commerce, each neighbor is connected to the peers by the success of the transactions it makes or by its trust evaluation level. A peer can only be recognized as a neighbor depending on whether or not its trust level is sustained above a threshold value.

3) SybilTrust enables neighbor peers to carry recommendation identifiers among the peers in a group. This ensures that the group detection algorithms used to identify Sybil attack peers are efficient and scalable in large P2P e-commerce networks.
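The neighbor-recognition rule in contribution 2 can be sketched as follows. The threshold and the exponential update rule are hypothetical choices, not taken from the paper: trust moves toward 1.0 after a successful transaction and toward 0.0 after a failure, and neighbor status requires staying above the threshold.

```python
TRUST_THRESHOLD = 0.6   # assumed cut-off for neighbor status

def update_trust(trust, success, weight=0.1):
    """Move trust toward 1.0 on success, toward 0.0 on failure."""
    target = 1.0 if success else 0.0
    return trust + weight * (target - trust)

def is_neighbor(trust):
    return trust >= TRUST_THRESHOLD

trust = 0.5   # a new peer starts below the neighbor threshold
for outcome in [True, True, True, False, True]:
    trust = update_trust(trust, outcome)

print(round(trust, 3))   # 0.615
print(is_neighbor(trust))
```

After three successes, one failure, and another success, the peer's trust ends just above the threshold, so it is recognized as a neighbor; a run of failures would drop it back out.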

GOAL OF THE PROJECT:

The goal of trust systems is to ensure that honest peers are accurately identified as trustworthy and Sybil peers as untrustworthy. To unify terminology, we call all identities created by malicious users Sybil peers. In a P2P e-commerce application scenario, most trust considerations depend on the peers' historical factors. The influence of Sybil identities can be reduced based on historical behavior and recommendations from other peers. For example, a peer can give a negative recommendation to a peer which is discovered to be a Sybil or malicious peer. This can diminish the influence of Sybil identities and hence mitigate the Sybil attack. A peer which has been giving dishonest recommendations will have its trust level reduced. If it falls below a certain threshold level, the peer can be expelled from the group. Each peer has an identity, which is either honest or Sybil.

A Sybil identity can be an identity owned by a malicious user, a bribed/stolen identity, or a fake identity obtained through a Sybil attack. These Sybil attack peers are employed to target honest peers and hence subvert the system. In a Sybil attack, a single malicious user creates a large number of peer identities called sybils. These sybils are used to launch security attacks at both the application level and the overlay level: at the application level, sybils can target other honest peers while transacting with them, whereas at the overlay level, sybils can disrupt the services offered by the overlay layer, such as routing, data storage, and lookup. In trust systems, colluding Sybil peers may artificially increase a (malicious) peer's rating (e.g., on eBay).

INTRODUCTION:

P2P networks range from communication systems like email and instant messaging to collaborative content rating, recommendation, and delivery systems such as YouTube, Gnutella, Facebook, Digg, and BitTorrent. They allow any user to join the system easily, at the expense of trust and with very little validation control. P2P overlay networks are known for many desired attributes like openness, anonymity, decentralized nature, self-organization, scalability, and fault tolerance. Each peer plays the dual role of client and server, meaning that each has its own control. All the resources utilized in the P2P infrastructure are contributed by the peers themselves, unlike traditional methods where a central authority is in control. Peers can collude and engage in all sorts of malicious activities in such open-access distributed systems. These malicious behaviors lead to service quality degradation and monetary loss among business partners. Peers are vulnerable to exploitation due to the open and near-zero cost of creating new identities. The peer identities are then utilized to influence the behavior of the system.

However, if a single defective entity can present multiple identities, it can control a substantial fraction of the system, thereby undermining this redundancy. The number of identities that an attacker can generate depends on the attacker's resources such as bandwidth, memory, and computational power. As noted above, the influence of Sybil identities can be reduced based on historical behavior and recommendations from other peers, and a peer whose trust level falls below a threshold can be expelled from the group.

Systems like Credence rely on a trusted central authority to prevent maliciousness.

Defending against Sybil attack is quite a challenging task. A peer can pretend to be trusted with a hidden motive. The peer can pollute the system with bogus information, which interferes with genuine business transactions and the functioning of the system. This must be prevented in order to protect the honest peers. The link between an honest peer and a Sybil peer is known as an attack edge. As each edge involved resembles a human-established trust relationship, it is difficult for the adversary to introduce an excessive number of attack edges. The only known promising defense against Sybil attack is to use social networks to perform user admission control and limit the number of bogus identities admitted to a system. The use of a social network edge between two peers represents a real-world trust relationship between users. In addition, authentication-based mechanisms are used to verify the identities of the peers using shared encryption keys or location information.

LITERATURE SURVEY:

KEEP YOUR FRIENDS CLOSE: INCORPORATING TRUST INTO SOCIAL NETWORK-BASED SYBIL DEFENSES

AUTHOR: A. Mohaisen, N. Hopper, and Y. Kim

PUBLISH: Proc. IEEE Int. Conf. Comput. Commun., 2011, pp. 1–9.

EXPLANATION:

Social network-based Sybil defenses exploit the algorithmic properties of social graphs to infer the extent to which an arbitrary node in such a graph should be trusted. However, these systems do not consider the different amounts of trust represented by different graphs, or the different levels of trust between nodes, even though trust is a crucial requirement in these systems. For instance, co-authors in an academic collaboration graph are trusted in a different manner than social friends. Furthermore, some social friends are more trusted than others. However, previous designs for social network-based Sybil defenses have not considered the inherent trust properties of the graphs they use. In this paper we introduce several designs that tune the performance of Sybil defenses by accounting for differential trust in social graphs, modeling these trust values by biasing the random walks performed on the graphs. Surprisingly, we find that the cost function, the required length of random walks to accept all honest nodes with overwhelming probability, is much greater in graphs with high trust values, such as co-author graphs, than in graphs with low trust values such as online social networks. We show that this behavior is due to the community structure in high-trust graphs, which requires longer walks to traverse multiple communities. Furthermore, we show that our proposed designs to account for trust, while increasing the cost function of graphs with low trust values, decrease the advantage of the attacker.
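The core mechanism, biasing a random walk's transition probabilities by per-edge trust, can be shown in miniature. The three-node graph and trust weights below are made up; the paper's actual designs and graphs differ.

```python
import random

# Hypothetical adjacency with trust weights: graph[u] = {v: trust(u, v)}.
graph = {
    "a": {"b": 0.9, "c": 0.1},
    "b": {"a": 0.9, "c": 0.5},
    "c": {"a": 0.1, "b": 0.5},
}

def biased_step(node, rng):
    """Pick the next node with probability proportional to edge trust."""
    nbrs = list(graph[node])
    weights = [graph[node][v] for v in nbrs]
    return rng.choices(nbrs, weights=weights, k=1)[0]

rng = random.Random(7)           # fixed seed for a reproducible walk
node, visits = "a", {"a": 0, "b": 0, "c": 0}
for _ in range(1000):
    node = biased_step(node, rng)
    visits[node] += 1

print(visits)   # the high-trust a-b edge dominates the walk
```

The walk visits node b far more often than c because the low-trust edges into c are rarely traversed, which is exactly how trust-biased walks limit the reach of weakly trusted (potentially Sybil) regions.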

FOOTPRINT: DETECTING SYBIL ATTACKS IN URBAN VEHICULAR NETWORKS

AUTHOR: S. Chang, Y. Qi, H. Zhu, J. Zhao, and X. Shen

PUBLISH: IEEE Trans. Parallel Distrib. Syst., vol. 23, no. 6, pp. 1103–1114, Jun. 2012.

EXPLANATION:

In urban vehicular networks, where privacy, especially the location privacy of anonymous vehicles, is highly valued, anonymous verification of vehicles is indispensable. Consequently, an attacker who succeeds in forging multiple hostile identities can easily launch a Sybil attack, gaining a disproportionately large influence. In this paper, we propose a novel Sybil attack detection mechanism, Footprint, using the trajectories of vehicles for identification while still preserving their location privacy. More specifically, when a vehicle approaches a road-side unit (RSU), it actively demands an authorized message from the RSU as proof of its appearance time at this RSU. We design a location-hidden authorized message generation scheme for two objectives: first, RSU signatures on messages are signer-ambiguous so that the RSU location information is concealed from the resulting authorized message; second, two authorized messages signed by the same RSU within the same given period of time (temporarily linkable) are recognizable so that they can be used for identification. With the temporal limitation on the linkability of two authorized messages, authorized messages used for long-term identification are prohibited. With this scheme, vehicles can generate a location-hidden trajectory for location-privacy-preserved identification by collecting a consecutive series of authorized messages. Utilizing social relationships among trajectories according to the similarity definition of two trajectories, Footprint can recognize and therefore dismiss “communities” of Sybil trajectories. Rigorous security analysis and extensive trace-driven simulations demonstrate the efficacy of Footprint.

SYBILLIMIT: A NEAR-OPTIMAL SOCIAL NETWORK DEFENSE AGAINST SYBIL ATTACKS

AUTHOR: H. Yu, P. Gibbons, M. Kaminsky, and F. Xiao

PUBLISH: IEEE/ACM Trans. Netw., vol. 18, no. 3, pp. 3–17, Jun. 2010.

EXPLANATION:

Decentralized distributed systems such as peer-to-peer systems are particularly vulnerable to sybil attacks, where a malicious user pretends to have multiple identities (called sybil nodes). Without a trusted central authority, defending against sybil attacks is quite challenging. Among the small number of decentralized approaches, our recent SybilGuard protocol [H. Yu et al., 2006] leverages a key insight on social networks to bound the number of sybil nodes accepted. Although its direction is promising, SybilGuard can allow a large number of sybil nodes to be accepted. Furthermore, SybilGuard assumes that social networks are fast mixing, which has never been confirmed in the real world. This paper presents the novel SybilLimit protocol that leverages the same insight as SybilGuard but offers dramatically improved and near-optimal guarantees. The number of sybil nodes accepted is reduced by a factor of Θ(√n), or around 200 times in our experiments for a million-node system. We further prove that SybilLimit’s guarantee is at most a log n factor away from optimal when considering approaches based on fast-mixing social networks. Finally, based on three large-scale real-world social networks, we provide the first evidence that real-world social networks are indeed fast mixing. This validates the fundamental assumption behind SybilLimit’s and SybilGuard’s approach.

SYSTEM ANALYSIS

EXISTING SYSTEM:

Existing work on Sybil attack makes use of social networks to eliminate Sybil attack, and the findings are based on preventing Sybil identities. In this paper, we propose the use of neighbor similarity trust in a group-based P2P e-commerce network built on interest relationships, to eliminate maliciousness among the peers. This is referred to as SybilTrust. In SybilTrust, the peers in the interest-based group infrastructure have a neighbor similarity trust between each other, hence they are able to prevent a Sybil attack. SybilTrust yields better relationships in e-commerce transactions, as the peers create links between peer neighbors. This provides an important avenue for peers to advertise their products to other interested peers and to learn of new market destinations and contacts as well. In addition, the group structure enables a peer to join the P2P e-commerce network while making identity forgery more difficult.

Peers use self-certifying identifiers that are exchanged when they initially come into contact. These can be used as public keys to verify digital signatures on the messages sent by their neighbors. We note that, all communications between peers are digitally signed. In this kind of relationship, we use neighbors as our point of reference to address Sybil attack. In a group, whatever admission we set, there are honest, malicious, and Sybil peers who are authenticated by an admission control mechanism to join the group. More honest peers are admitted compared to malicious peers, where the trust association is aimed at positive results. The knowledge of the graph may reside in a single party, or be distributed across all users.

DISADVANTAGES:

If a peer trades with very few unsuccessful transactions, we can deduce that the peer is a Sybil peer. This is supported by our approach, which proposes that peers existing in a group have six types of keys.

The keys which exist mostly are pairwise keys, supported by the group keys. We also note that if an honest group has a link with another group which has Sybil peers, the Sybil group tends to have incomplete information.

  1. Fake users can enter the network easily.
  2. This enables Sybil attacks.

PROPOSED SYSTEM:

In this paper, we assume there are three kinds of peers in the system: legitimate peers, malicious peers, and Sybil peers. Each malicious peer cheats its neighbors by creating multiple identities, referred to as Sybil peers. In this paper, P2P e-commerce communities are organized into several groups. A group can be either open or restrictive depending on the interest of the peers. We investigate the peers belonging to a certain interest group. In each group, there is a group leader who is responsible for coordinating the activities in the group.

The principal building block of the SybilTrust approach is the identifier distribution process. In this approach, all the peers with similar behavior in a group can be used as identifier sources. They can send identifiers to others as the system regulates. If a peer sends fewer or more identifiers than regulated, the system may be facing a Sybil attack peer. This information can be broadcast to the rest of the peers in the group. When peers join a group, they acquire different identities with reference to the group. Each peer has neighbors inside and outside the group. Sybil peers forged by the same malicious peer have the same set of physical neighbors that the malicious peer has.
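The identifier distribution check can be illustrated with a toy sketch: a peer that sends fewer or more identifiers per round than the system regulates is flagged as a possible Sybil source. The counts, tolerance, and function name are our own assumptions, not part of the SybilTrust specification.

```python
# Flag peers whose identifier-distribution count deviates from the regulated
# per-round quota; such deviation suggests a possible Sybil attack peer.
def flag_suspects(sent_counts: dict, expected: int, tolerance: int = 0) -> set:
    return {peer for peer, n in sent_counts.items()
            if abs(n - expected) > tolerance}

counts = {"A": 5, "B": 5, "C": 12, "D": 1}    # identifiers sent this round
print(sorted(flag_suspects(counts, expected=5)))   # ['C', 'D']
```

Flagged identities could then be broadcast to the rest of the group, as the text describes.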

Each neighbor is connected to the peers by the success of the transactions it makes or by its trust evaluation level. To detect a Sybil attack, in which a peer can assume different identities, a peer is evaluated with reference to its trustworthiness and its similarity to its neighbors. If the neighbors do not have the same trust data as the concerned peer, including its position, it can be detected that the peer holds multiple identities and is cheating.
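The neighbor-similarity check above can be sketched as follows: a peer's claimed trust data is compared entry by entry with what its neighbors have recorded, and a low agreement score marks the peer as suspect. The 0.5 threshold and record format are illustrative assumptions.

```python
def similarity(a: dict, b: dict) -> float:
    # Fraction of trust entries on which the two records agree;
    # 1.0 means the peer's claims match its neighbors' data exactly.
    keys = set(a) | set(b)
    if not keys:
        return 1.0
    agree = sum(1 for k in keys if a.get(k) == b.get(k))
    return agree / len(keys)

# Trust data a peer claims vs. what its neighbors have recorded about it.
claimed  = {"B": 0.9, "C": 0.8, "D": 0.7}
recorded = {"B": 0.2, "C": 0.1, "D": 0.3}   # neighbors disagree strongly
THRESHOLD = 0.5
suspect = similarity(claimed, recorded) < THRESHOLD
print(suspect)   # True: the peer's trust data diverges from its neighbors'
```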

ADVANTAGES:

Our perception is that the attacker controls a number of neighbor-similarity peers, whereby a randomly chosen identifier source is relatively “far away” from most Sybil peer relationships. Every peer uses a “reversed” routing table. The source peer always sends some information to the peers which have neighbor similarity trust; if they do not reply, it can blacklist them. If they do reply and the source is overwhelmed by the overhead of such replies, then the adversary is effectively launching a DoS attack; notice that the adversary can launch a DoS attack against the source. This mechanism enables two peers to propagate their public keys and IP addresses backward along the route to learn about each other.

  • It helps detect Sybil attacks.
  • It can be used to find fake user IDs.
  • It is feasible to limit the number of attack edges in online social networks by relationship rating.

HARDWARE & SOFTWARE REQUIREMENTS:

HARDWARE REQUIREMENT:

  • Processor      –    Pentium IV
  • Speed          –    1 GHz
  • RAM            –    256 MB (min)
  • Hard Disk      –    20 GB
  • Floppy Drive   –    1.44 MB
  • Key Board      –    Standard Windows Keyboard
  • Mouse          –    Two or Three Button Mouse
  • Monitor        –    SVGA

SOFTWARE REQUIREMENTS:

  • Operating System   :    Windows XP or Windows 7
  • Front End          :    Java JDK 1.7
  • Script             :    JavaScript
  • Tools              :    NetBeans 7
  • Document           :    MS Office 2007

MAXIMIZING P2P FILE ACCESS AVAILABILITY IN MOBILE AD HOC NETWORKS THROUGH REPLICATION FOR EFFICIENT FILE SHARING

ABSTRACT:

File sharing applications in mobile ad hoc networks (MANETs) have attracted more and more attention in recent years. The efficiency of file querying suffers from the distinctive properties of such networks including node mobility and limited communication range and resource. An intuitive method to alleviate this problem is to create file replicas in the network. However, despite the efforts on file replication, no research has focused on the global optimal replica creation with minimum average querying delay.

Specifically, current file replication protocols in mobile ad hoc networks have two shortcomings. First, they lack a rule to allocate limited resources to different files in order to minimize the average querying delay. Second, they simply consider storage as available resources for replicas, but neglect the fact that the file holders’ frequency of meeting other nodes also plays an important role in determining file availability. Actually, a node that has a higher meeting frequency with others provides higher availability to its files. This becomes even more evident in sparsely distributed MANETs, in which nodes meet disruptively.

In this paper, we introduce a new concept of resource for file replication, which considers both node storage and node meeting ability. We theoretically study the influence of resource allocation on the average querying delay and derive an optimal file replication rule (OFRR) that allocates resources to each file based on its popularity and size. We then propose a file replication protocol based on the rule, which approximates the minimum global querying delay in a fully distributed manner. Our experiment and simulation results show the superior performance of the proposed protocol in comparison with other representative replication protocols.

INTRODUCTION

With the increasing popularity of mobile devices, e.g., smartphones and laptops, we envision a future of MANETs consisting of these mobile devices. By MANETs, we refer to both normal MANETs and disconnected MANETs, also known as delay tolerant networks (DTNs). The former has a relatively dense node distribution in an area, while the latter has sparsely distributed nodes that meet each other opportunistically. On the other side, the emergence of mobile file sharing applications motivates peer-to-peer (P2P) file sharing over such MANETs. The local P2P file sharing model provides three advantages. First, it enables file sharing when no base stations are available (e.g., in rural areas). Second, with the P2P architecture, the bottleneck of overloaded servers in current client-server based file sharing systems can be avoided. Third, it exploits otherwise wasted peer-to-peer communication opportunities among mobile nodes. As a result, nodes can freely and unobtrusively access and share files in the distributed MANET environment, which can possibly support interesting applications.

For example, mobile nodes can share files based on users’ proximity in the same building or in a local community. Tourists can share their travel experiences or emergency information with other tourists through digital devices directly even when no base station is available in remote areas. Drivers can share road information through the vehicle-to-vehicle communication. However, the distinctive properties of MANETs, i.e., node mobility, limited communication range and resource, have rendered many difficulties in realizing such a P2P file sharing system. For example, file searching turns out to be difficult since nodes in MANETs move around freely and can exchange information only when they are within the communication range. Broadcasting can quickly discover files, but it leads to the broadcast storm problem with high energy consumption.

Probabilistic routing and file discovery protocols avoid broadcasting by forwarding a query to a node with higher probability of meeting the destination. But the opportunistic encountering of nodes in MANETs makes file searching and retrieval non-deterministic. File replication is an effective way to enhance file availability and reduce file querying delay. It creates replicas for a file to improve its probability of being encountered by requests. Unfortunately, it is impractical and inefficient to enable every node to hold the replicas of all files in the system considering limited node resources. Also, file querying delay is always a main concern in a file sharing system. Users often desire to receive their requested files quickly no matter whether the files are popular or not. Thus, a critical issue is raised for further investigation: how to allocate the limited resource in the network to different files for replication so that the overall average file querying delay is minimized? Recently, a number of file replication protocols have been proposed for MANETs. In these protocols, each individual node replicates files it frequently queries or a group of nodes create one replica for each file they frequently query. In the former, redundant replicas are easily created in the system, thereby wasting resources.

In the latter, though redundant replicas are reduced by group-based cooperation, neighboring nodes may separate from each other due to node mobility, leading to large query delay. There are also some works addressing content caching in disconnected MANETs/DTNs for efficient data retrieval or message routing. They basically cache data that are frequently queried at places that are visited frequently by mobile nodes. Both categories of replication methods fail to thoroughly consider that a node’s mobility affects the availability of its files. In spite of these efforts, current file replication protocols lack a rule to allocate limited resources to files for replica creation in order to achieve the minimum average querying delay, i.e., global search efficiency optimization under limited resources. They simply consider storage as the resource for replicas, but neglect that a node’s frequency of meeting other nodes (meeting ability in short) also influences the availability of its files. Files on a node with a higher meeting ability have higher availability.
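To make the resource notion concrete, the following toy sketch allocates a replication budget across files in proportion to a simple popularity-to-size priority. This stand-in ratio only illustrates the idea of popularity- and size-aware allocation; it is not the paper's exact OFRR formula, and the file names and numbers are invented.

```python
# Toy replica-resource allocation: each file receives a share of the total
# resource (storage weighted by the holders' meeting ability) in proportion
# to a priority of popularity / size. Illustrative stand-in, not OFRR itself.
def allocate(files, total_resource):
    # files: {name: (popularity, size)}
    pri = {f: pop / size for f, (pop, size) in files.items()}
    s = sum(pri.values())
    return {f: total_resource * p / s for f, p in pri.items()}

files = {"a.mp3": (100, 4.0), "b.pdf": (10, 2.0), "c.avi": (50, 50.0)}
shares = allocate(files, total_resource=60.0)
# Popular, small files receive the largest share of the budget:
assert shares["a.mp3"] > shares["b.pdf"] > shares["c.avi"]
```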

LITERATURE SURVEY

CONTACT DURATION AWARE DATA REPLICATION IN DELAY TOLERANT NETWORKS

AUTHOR: X. Zhuo, Q. Li, W. Gao, G. Cao, and Y. Dai

PUBLISH: Proc. IEEE 19th Int’l Conf. Network Protocols (ICNP), 2011.

EXPLANATION:

The recent popularization of hand-held mobile devices, such as smartphones, enables the inter-connectivity among mobile users without the support of Internet infrastructure. When mobile users move and contact each other opportunistically, they form a Delay Tolerant Network (DTN), which can be exploited to share data among them. Data replication is one of the common techniques for such data sharing. However, the unstable network topology and limited contact duration in DTNs make it difficult to directly apply traditional data replication schemes. Although there are a few existing studies on data replication in DTNs, they generally ignore the contact duration limits. In this paper, we recognize the deficiency of existing data replication schemes which treat the complete data item as the replication unit, and propose to replicate data at the packet level. We analytically formulate the contact duration aware data replication problem and give a centralized solution to better utilize the limited storage buffers and the contact opportunities. We further propose a practical contact Duration Aware Replication Algorithm (DARA) which operates in a fully distributed manner and reduces the computational complexity. Extensive simulations on both synthetic and realistic traces show that our distributed scheme achieves close-to-optimal performance, and outperforms other existing replication schemes.

SOCIAL-BASED COOPERATIVE CACHING IN DTNS: A CONTACT DURATION AWARE APPROACH

AUTHOR: X. Zhuo, Q. Li, G. Cao, Y. Dai, B.K. Szymanski, and T.L. Porta,

PUBLISH: Proc. IEEE Eighth Int’l Conf. Mobile Adhoc and Sensor Systems (MASS), 2011.

EXPLANATION:

Data access is an important issue in Delay Tolerant Networks (DTNs), and a common technique to improve the performance of data access is cooperative caching. However, due to the unpredictable node mobility in DTNs, traditional caching schemes cannot be directly applied. In this paper, we propose DAC, a novel caching protocol adaptive to the challenging environment of DTNs. Specifically, we exploit the social community structure to combat the unstable network topology in DTNs. We propose a new centrality metric to evaluate the caching capability of each node within a community, and solutions based on this metric are proposed to determine where to cache. More importantly, we consider the impact of the contact duration limitation on cooperative caching, which has been ignored by the existing works. We prove that the marginal caching benefit that a node can provide diminishes when more data is cached. We derive an adaptive caching bound for each mobile node according to its specific contact patterns with others, to limit the amount of data it caches. In this way, both the storage space and the contact opportunities are better utilized. To mitigate the coupon collector’s problem, network coding techniques are used to further improve the caching efficiency. Extensive trace-driven simulations show that our cooperative caching protocol can significantly improve the performance of data access in DTNs.

SEDUM: EXPLOITING SOCIAL NETWORKS IN UTILITY-BASED DISTRIBUTED ROUTING FOR DTNS

AUTHOR: Z. Li and H. Shen

PUBLISH: IEEE Trans. Computers, vol. 62, no. 1, pp. 83-97, Jan. 2012.

EXPLANATION:

However, current probabilistic forwarding methods only consider node contact frequency in calculating the utility while neglecting the influence of contact duration on the throughput, though both contact frequency and contact duration reflect the node movement pattern in a social network. In this paper, we theoretically prove that considering both factors leads to higher throughput than considering only contact frequency. To fully exploit a social network for high throughput and low routing delay, we propose a Social network oriented and duration utility-based distributed multicopy routing protocol (SEDUM) for DTNs. SEDUM is distinguished by three features. First, it considers both contact frequency and duration in node movement patterns of social networks. Second, it uses multicopy routing and can discover the minimum number of copies of a message to achieve a desired routing delay. Third, it has an effective buffer management mechanism to increase throughput and decrease routing delay. Theoretical analysis and simulation results show that SEDUM provides high throughput and low routing delay compared to existing routing approaches. The results conform to our expectation that considering both contact frequency and duration for delivery utility in routing can achieve higher throughput than considering only contact frequency, especially in a highly dynamic environment with large routing messages.

SYSTEM ANALYSIS

EXISTING SYSTEM:

This work focuses on Delay Tolerant Networks (DTNs) in a social network environment. DTNs do not have a complete path from a source to a destination most of the time. Previous data routing approaches in DTNs are primarily based on either flooding or single-copy routing. However, these methods incur either high overhead due to excessive transmissions or long delays due to suboptimal choices for relay nodes. Probabilistic forwarding that forwards a message to a node with a higher delivery utility enhances single-copy routing.

Previous file sharing applications in mobile ad hoc networks (MANETs) have attracted much attention, but the efficiency of file querying suffers from the distinctive properties of MANETs, including node mobility and limited communication range and resources. An intuitive method to alleviate this problem is to create file replicas in the network. However, despite the efforts on file replication, no research has focused on globally optimal replica sharing with minimum average querying delay. Communication links between mobile nodes are transient, and network maintenance overhead is a major performance bottleneck for data transmission. Low node density makes it difficult to establish end-to-end connections, thus impeding a continuous end-to-end path between a source and a destination.

DTN technology was originally designed for communication in outer space but is now directly accessible from our pockets. Considering both the characteristics of MANETs and the requirements of P2P file sharing, an application-layer overlay network is used: we port a DTN-type solution into an infrastructure-less environment like MANETs and leverage peer mobility to reach data in other disconnected networks. This is done by implementing an asynchronous communication model, store-delegate-and-forward, as in DTNs, where a peer can delegate unaccomplished file download or query tasks to special peers. To improve data transmission performance while reducing communication overhead, we select these special peers by the expectation of encountering them again in the future and assign them different download starting points in the file.

DISADVANTAGES:

  • Limited communication range and resource have rendered many difficulties in realizing such a P2P file sharing system. For example, file searching turns out to be difficult since nodes in MANETs move around freely and can exchange information only when they are within the communication range.
  • It lacks transparency. Receiving a URL that explicitly points to a certain data replica means the browser becomes aware of the switching between the different machines.
  • As for scalability, because the single service machine that clients must contact is always the same, it becomes a bottleneck as the number of clients increases, which makes the situation worse.

PROPOSED SYSTEM:

We propose a distributed file replication protocol that can approximately realize the optimal file replication rule under the two mobility models in a distributed manner. Since the OFRR under the two mobility models (i.e., Equations (22) and (28)) has the same form, we present the protocol in this section without indicating the specific mobility model. We first introduce the challenges in realizing the OFRR and our solutions. We then propose a replication protocol to realize the OFRR and analyze the effect of the protocol.

We propose the priority competition and split file replication protocol (PCS). We first introduce how a node retrieves the parameters needed in PCS and then present the details of PCS; we also briefly prove the effectiveness of PCS. We refer to the process in which a node tries to copy a file to its neighbors as one round of replica distribution. Recall that when a replica is created for a file with priority P, the two copies will replicate the file with priority P/2 in the next round. This means that the creation of replicas does not increase the overall priority P of the file. Also, after each round, the priority value of each file or replica is updated based on the received requests for the file.

Then, though some replicas may be deleted in the competition, the total amount of requests for the file remains stable, making the sum of the priorities of all replicas and the original file roughly equal to the overall priority value of the file. Then, we can regard the replicas of a file as an entity that competes for available resources in the system with accumulated priority P in each round. Therefore, in each round of replica distribution, based on our design of PCS, the overall probability of creating a replica for an original file is proportional to its accumulated priority.
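The priority-conservation invariant described above can be sketched in a few lines: splitting a file of priority P into two copies of priority P/2 leaves the summed priority of the original and all its replicas unchanged. This is a sketch of the invariant only, not the full PCS competition protocol, and the replica-naming scheme is our own.

```python
import itertools
_counter = itertools.count(1)

# When a replica is created for a file with priority P, the original and the
# new copy each continue with priority P/2, so replication never inflates
# the file's overall priority.
def split(priorities, name):
    p = priorities[name] / 2.0
    priorities[name] = p
    replica = f"{name}#r{next(_counter)}"   # unique illustrative replica name
    priorities[replica] = p
    return replica

ps = {"file1": 8.0}
r1 = split(ps, "file1")   # file1 and its replica each carry priority 4.0
r2 = split(ps, r1)        # the replica splits again: 4.0 -> 2.0 + 2.0
assert sum(ps.values()) == 8.0   # overall priority is conserved
```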

ADVANTAGES:

The community-based mobility model has been used in content dissemination or routing algorithms for disconnected MANETs/DTNs to depict node mobility. In this model, the entire test area is split into different sub-areas, denoted as caves. Each cave holds one community.

Under the RWP model, we can assume that the inter-meeting time between nodes follows an exponential distribution. Then, the probability of meeting a node is independent of the previously encountered node. Therefore, we define the meeting ability of a node as the average number of nodes it meets in a unit of time and use it to investigate the optimal file replication.
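A node's meeting ability can be estimated directly from its encounter log as the number of encounters divided by the observation time. The log format and numbers below are illustrative.

```python
# Meeting ability: average number of encounters per unit time, estimated
# from a log of (timestamp, peer) encounter events over an observation window.
def meeting_ability(encounters, duration):
    return len(encounters) / duration

log = [(1, "B"), (3, "C"), (7, "B"), (9, "D")]   # 4 encounters in 10 time units
assert meeting_ability(log, duration=10) == 0.4
```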

To evaluate PCS, we used two routing protocols in the experiments. We first used the Static Wait protocol in the GENI experiment, in which each query stays on the source node waiting for the destination. We then used a probabilistic routing protocol (PROPHET), in which a node routes requests to the neighbor with the highest meeting ability.

HARDWARE & SOFTWARE REQUIREMENTS:

HARDWARE REQUIREMENT:

  • Processor      –    Pentium IV
  • Speed          –    1 GHz
  • RAM            –    256 MB (min)
  • Hard Disk      –    20 GB
  • Floppy Drive   –    1.44 MB
  • Key Board      –    Standard Windows Keyboard
  • Mouse          –    Two or Three Button Mouse
  • Monitor        –    SVGA

SOFTWARE REQUIREMENTS:

  • Operating System   :    Windows XP or Windows 7
  • Front End          :    Java JDK 1.7
  • Tools              :    NetBeans 7
  • Script             :    JavaScript
  • Document           :    MS Office 2007

Innovative Schemes for Resource Allocation in the Cloud for Media Streaming Applications

Abstract—Media streaming applications have recently attracted a large number of users in the Internet. With the advent of these bandwidth-intensive applications, it is economically inefficient to provide streaming distribution with guaranteed QoS relying only on central resources at a media content provider. Cloud computing offers an elastic infrastructure that media content providers (e.g., Video on Demand (VoD) providers) can use to obtain streaming resources that match the demand. Media content providers are charged for the amount of resources allocated (reserved) in the cloud. Most of the existing cloud providers employ a pricing model for the reserved resources that is based on non-linear time-discount tariffs (e.g., Amazon CloudFront and Amazon EC2). Such a pricing scheme offers discount rates depending non-linearly on the period of time during which the resources are reserved in the cloud. In this case, an open problem is to decide on both the right amount of resources reserved in the cloud, and their reservation time such that the financial cost on the media content provider is minimized. We propose a simple—easy to implement—algorithm for resource reservation that maximally exploits discounted rates offered in the tariffs, while ensuring that sufficient resources are reserved in the cloud. Based on the prediction of demand for streaming capacity, our algorithm is carefully designed to reduce the risk of making wrong resource allocation decisions. The results of our numerical evaluations and simulations show that the proposed algorithm significantly reduces the monetary cost of resource allocations in the cloud as compared to other conventional schemes.

INTRODUCTION
Media streaming applications have recently attracted a large number of users on the Internet. In 2010, the number of video streams served increased 38.8 percent to 24.92 billion as compared to 2009 [1]. This huge demand creates a burden on centralized data centers at media content providers such as Video-on-Demand (VoD) providers to sustain the required QoS guarantees [2]. The problem becomes more critical with the increasing demand for the higher bit rates required by the growing amount of higher-definition video content desired by consumers. In this paper, we explore new approaches that mitigate the cost of streaming distribution on media content providers using cloud computing.
A media content provider needs to equip its data center with an over-provisioned (excessive) amount of resources in order to meet the strict QoS requirements of streaming traffic. Since it is possible to anticipate the size of usage peaks for streaming capacity on a daily, weekly, monthly, and yearly basis, a media content provider can make long-term investments in infrastructure (e.g., bandwidth and computing capacities) to target the expected usage peak. However, this causes economic inefficiency problems in view of flash-crowd events. Since the data centers of a media content provider are equipped with resources that target the peak expected demand, most servers in a typical data center of a media content provider are only used at about 30 percent of their capacity [3]. Hence, a huge amount of capacity at the servers will be idle most of the time, which is highly wasteful and inefficient. Cloud computing creates the possibility for media content providers to convert the upfront infrastructure investment to operating expenses charged by cloud providers (e.g., Netflix moved its streaming servers to Amazon Web Services (AWS) [4], [5]). Instead of buying over-provisioned servers and building private data centers, media content providers can use computing and bandwidth resources of cloud service providers. Hence, a media content provider can be viewed as a re-seller of cloud resources, where it pays the cloud service provider for the streaming resources (bandwidth) served from the cloud directly to clients of the media content provider. This paradigm reduces the expenses of media content providers in terms of purchase and maintenance of over-provisioned resources at their data centers.
In the cloud, the amount of allocated resources can be changed adaptively at a fine granularity, which is commonly referred to as auto-scaling. The auto-scaling ability of the cloud enhances resource utilization by matching the supply with the demand. So far, CPU and memory are the common resources offered by the cloud providers (e.g., Amazon EC2 [6]). However, recently, streaming resources (bandwidth) have become a feature offered by many cloud providers to users with intensive bandwidth demand (e.g.,
Amazon CloudFront and Octoshape) [5], [7], [8], [9].

The delay sensitive nature of media streaming traffic poses unique challenges due to the need for guaranteed throughput (i.e., download rate no smaller than the video playback rate) in order to enable users to smoothly watch video content on-line. Hence, the media content provider needs to allocate streaming resources in the cloud such that the demand for streaming capacity can be sustained at any instant of time.
The common type of resource provisioning plan offered by cloud providers is referred to as the on-demand plan. This plan allows the media content provider to purchase resources when needed. The pricing model that cloud providers employ for the on-demand plan is pay-per-use. Another type of streaming resource provisioning plan offered by many cloud providers is based on resource reservation. With the reservation plan, the media content provider allocates (reserves) resources in advance, and pricing is charged before the resources are utilized (upon receiving the request by the cloud provider, i.e., prepaid resources). The reserved streaming resources are basically the bandwidth (streaming data rate) that the cloud provider guarantees to deliver to clients of the media content provider (content viewers) according to the required QoS.
In general, the prices (tariffs) of the reservation plan are cheaper than those of the on-demand plan (i.e., time discount rates are only offered to the reserved (prepaid) resources). We consider a pricing model for resource reservation in the cloud that is based on non-linear time-discount tariffs. In such a pricing scheme, the cloud service provider offers higher discount rates to the resources reserved in the cloud for longer times. Such a pricing scheme enables a cloud service provider to better utilize its abundantly available resources because it encourages consumers to reserve resources in the cloud for longer times.
This pricing scheme is currently being used by many cloud providers [10]. See for example the pricing of virtual machines (VM) in the reservation phase defined by Amazon EC2 in February 2010. In this case, an open problem is to decide on both the optimum amount of resources reserved in the cloud (i.e., the prepaid allocated resources), and the optimum period of time during which those resources are reserved such that the monetary cost on the media content provider is minimized. In order for a media content provider to address this problem, prediction of future demand for streaming capacity is required to help with the resource reservation planning. Many methods have been proposed in prior works to predict the demand for streaming capacity [11], [12], [13], [14].
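The effect of a non-linear time-discount tariff can be sketched with a toy price curve in which the per-unit-time price of reserved capacity decays with the reservation length T. The exponent and base price below are invented for illustration and do not correspond to any real provider's tariff.

```python
# Toy non-linear time-discount tariff: the per-unit-time price falls as the
# reservation period T grows, so longer reservations earn larger discounts.
def unit_price(T, base=1.0, alpha=0.3):
    return base * T ** (-alpha)

def reservation_cost(capacity, T):
    # Prepaid cost of reserving `capacity` units of bandwidth for T time units.
    return capacity * unit_price(T) * T

# Longer reservations are cheaper per delivered unit of capacity:
assert unit_price(100) < unit_price(10) < unit_price(1)
```

The open problem stated above is then to pick both `capacity` and `T` so that total cost is minimized while demand is still covered.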
Our main contribution in this paper is a practical—easy to implement—Prediction-Based Resource Allocation algorithm (PBRA) that minimizes the monetary cost of resource reservation in the cloud by maximally exploiting discounted rates offered in the tariffs, while ensuring that sufficient resources are reserved in the cloud with some level of confidence in probabilistic sense. We first describe the system model. We formulate the problem based on the prediction of future demand for streaming capacity (Section 3). We then describe the design of our proposed algorithm for solving the problem (Section 4).
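The reservation decision itself can be sketched as a quantile rule: model the predicted demand as a distribution and reserve enough capacity to cover it with the desired confidence. The normal demand model below is our simplifying assumption, not necessarily PBRA's exact formulation.

```python
import statistics

# Sketch of the reservation decision: model predicted streaming demand as a
# normal distribution and reserve enough capacity to cover the demand with
# probability `confidence` (a simplifying assumption for illustration).
def capacity_to_reserve(mean_demand, std_demand, confidence=0.95):
    z = statistics.NormalDist().inv_cdf(confidence)   # demand quantile
    return mean_demand + z * std_demand

c = capacity_to_reserve(mean_demand=1000.0, std_demand=100.0, confidence=0.95)
assert 1160 < c < 1170   # 1000 + 1.645 * 100 ~= 1164.5
```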
The results of our numerical evaluations and simulations show that the proposed algorithms significantly reduce the monetary cost of resource allocations in the cloud as compared to other conventional schemes.

IMPROVING WEB NAVIGATION USABILITY BY COMPARING ACTUAL AND ANTICIPATED USAGE

ABSTRACT:

We present a new method to identify navigation related Web usability problems based on comparing actual and anticipated usage patterns. The actual usage patterns can be extracted from Web server logs routinely recorded for operational websites by first processing the log data to identify users, user sessions, and user task-oriented transactions, and then applying a usage mining algorithm to discover patterns among actual usage paths. The anticipated usage, including information about both the path and time required for user-oriented tasks, is captured by our ideal user interactive path models constructed by cognitive experts based on their cognition of user behavior.

The comparison is performed via an automated result-checking mechanism that identifies user navigation difficulties. The deviation data produced from this comparison can help us discover usability issues and suggest corrective actions to improve usability. A software tool was developed to automate a significant part of the activities involved. With an experiment on a small service-oriented website, we identified usability problems, which were cross-validated by domain experts, and quantified usability improvement by the higher task success rate and lower time and effort for given tasks after the suggested corrections were implemented. This case study provides an initial validation of the applicability and effectiveness of our method.

INTRODUCTION

As the World Wide Web becomes prevalent today, building and ensuring easy-to-use Web systems is becoming a core competency for business survival. Usability is defined as the effectiveness, efficiency, and satisfaction with which specific users can complete specific tasks in a particular environment. Three basic Web design principles, i.e., structural firmness, functional convenience, and presentational delight, were identified to help improve users’ online experience. Structural firmness relates primarily to the characteristics that influence the website security and performance. Functional convenience refers to the availability of convenient characteristics, such as a site’s ease of use and ease of navigation, that help users’ interaction with the interface. Presentational delight refers to the website characteristics that stimulate users’ senses. Usability engineering provides methods for measuring usability and for addressing usability issues. Heuristic evaluation by experts and user-centered testing are typically used to identify usability issues and to ensure satisfactory usability.

However, significant challenges exist, including 1) inaccurate problem identification due to the false alarms common in expert evaluation, 2) unrealistic evaluation of usability due to differences between the testing environment and the actual usage environment, and 3) increased cost due to the prolonged evolution and maintenance cycles typical of many Web applications. On the other hand, log data routinely kept at Web servers represent actual usage. Such data have been used for usage-based testing and quality assurance, and also for understanding user behavior and guiding user interface design.

Server-side logs can be automatically generated by Web servers, with each entry corresponding to a user request. By analyzing these logs, Web workload can be characterized and used to suggest performance enhancements for Internet Web servers. Because of the vastly uneven Web traffic, massive user population, and diverse usage environment, coverage-based testing is insufficient to ensure the quality of Web applications. Therefore, server-side logs have been used to construct Web usage models for usage-based Web testing, or to automatically generate test cases and thereby improve test efficiency.
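As a concrete illustration of the log analysis above, here is a minimal Java sketch that extracts the fields needed to reconstruct navigation paths from one server-log entry. The Common Log Format and the field choices are assumptions for illustration, not the actual pipeline of this work:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal sketch: extract host, timestamp, method, path, and status from a
// server-log line in (assumed) Common Log Format. Navigation paths can then
// be built by grouping entries per host and ordering them by timestamp.
public class LogEntryParser {
    // host ident authuser [date] "METHOD path protocol" status bytes
    private static final Pattern CLF = Pattern.compile(
        "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"(\\S+) (\\S+) [^\"]*\" (\\d{3}) \\S+");

    public static String[] parse(String line) {
        Matcher m = CLF.matcher(line);
        if (!m.matches()) return null;  // not a well-formed CLF entry
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4), m.group(5) };
    }

    public static void main(String[] args) {
        String line = "192.0.2.1 - - [10/Oct/2023:13:55:36 +0000] "
                    + "\"GET /catalog/chairs HTTP/1.1\" 200 2326";
        String[] f = parse(line);
        System.out.println(f[0] + " -> " + f[3] + " (" + f[4] + ")");
    }
}
```

In practice the parsed entries would still need session reconstruction (grouping by user and time gaps) before pattern extraction.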

LITERATURE SURVEY

WEB USABILITY PROBE: A TOOL FOR SUPPORTING REMOTE USABILITY EVALUATION OF WEB SITES

PUBLICATION: Human-Computer Interaction—INTERACT 2011. New York, NY, USA: Springer, 2011, pp. 349–357.

AUTHORS: T. Carta, F. Paternò, and V. F. D. Santana

EXPLANATION:

Usability evaluation of Web sites is still a difficult and time-consuming task, often performed manually. This paper presents a tool that supports remote usability evaluation of Web sites. The tool considers client-side data on user interactions and JavaScript events. In addition, it allows the definition of custom events, giving evaluators the flexibility to add specific events to be detected and considered in the evaluation. The tool supports evaluation of any Web site by exploiting a proxy-based architecture and enables the evaluator to perform a comparison between actual user behavior and an optimal sequence of actions.

SUPPORTING ACTIVITY MODELLING FROM ACTIVITY TRACES

PUBLICATION: Expert Syst., vol. 29, no. 3, pp. 261–275, 2012.

AUTHORS: O. L. Georgeon, A. Mille, T. Bellet, B. Mathern, and F. E. Ritter

EXPLANATION:

We present a new method and tool for activity modelling through qualitative sequential data analysis. In particular, we address the question of constructing a symbolic abstract representation of an activity from an activity trace. We use knowledge engineering techniques to help the analyst build an ontology of the activity, that is, a set of symbols and hierarchical semantics that supports the construction of activity models. The ontology construction is pragmatic, evolutionary, and driven by the analyst in accordance with their modelling goals and research questions. Our tool helps the analyst define transformation rules to process the raw trace into abstract traces based on the ontology. The analyst visualizes the abstract traces and iteratively tests the ontology, the transformation rules, and the visualization format to confirm the models of activity. With this tool and this method, we found innovative ways to represent a car-driving activity at different levels of abstraction from activity traces collected from an instrumented vehicle. As examples, we report two new strategies of lane changing on motorways that we have found and modelled with this approach.

TOOLS FOR REMOTE USABILITY EVALUATION OF WEB APPLICATIONS THROUGH BROWSER LOGS AND TASK MODELS

PUBLICATION: Behavior Res.Methods, Instrum., Comput., vol. 35, no. 3, pp. 369–378, 2003

AUTHORS: L. Paganelli and F. Paternò

EXPLANATION:

The dissemination of Web applications is extensive and still growing. The great penetration of Web sites raises a number of challenges for usability evaluators. Video-based analysis can be rather expensive and may provide limited results. In this article, we discuss what information can be provided by automatic tools able to process the information contained in browser logs and task models. To this end, we present a tool that can be used to compare log files of user behavior with the task model representing the actual Web site design, in order to identify where users’ interactions deviate from those envisioned by the system design.

SYSTEM ANALYSIS

EXISTING SYSTEM:

Usability has long been addressed and discussed in previous studies. When people navigate the Web, they often encounter a number of usability issues. This is partly because Web surfers often decide on the spur of the moment what to do and whether to continue navigating a Web site. Usability evaluation is thus an important phase in the deployment of Web applications, and automatic tools are very useful for gathering larger amounts of usability data and supporting their analysis.

Remote evaluation implies that users and evaluators are separated in time and/or space. This makes it possible to analyse users in their daily environments and decreases the cost of evaluation, since no specific laboratory is required and users do not have to travel. In addition, tools for remote Web usability evaluation should be general enough to analyse user behaviour across various browsers and across applications developed with different toolkits. We prefer logging on the client side in order to capture any user-generated event, which can provide useful hints about possible usability problems.

Existing approaches have been used to support usability evaluation. An example is WebRemUsine, a tool for remote usability evaluation of Web applications through browser logs and task models. Propp and Forbrig used task models for supporting usability evaluation of a different type of application: cooperative behaviour of people interacting in smart environments. In a different use of models, the authors discuss how task models can enhance visualization of the usability test log. In our case, we do not require the effort of developing models to apply our tool. We only require that the designer provide an example of optimal use for each relevant task. The tool then compares the logs of actual use with the optimal log in order to identify deviations, which may indicate potential usability problems.

DISADVANTAGES:

One Web navigation study used a logger to collect data from a user session test on a Web interface prototype running on a PDA simulator, in order to evaluate different types of Web navigation tools and identify the best one for small-display devices.

Users were asked to find the answers to specific questions using different types of navigation tools to move from one page to another. A database was used to store users' actions, but only the answer given by the user to each specific question was logged. Moreover, every term searched by the user through the internal search tool was stored separately.

Client-side data collection faces several challenges: identifying the elements users are interacting with, managing element identification when the page changes dynamically, and managing data logging as users move from one page to another, among others. The following are some of the solutions we adopted to deal with these issues.

PROPOSED SYSTEM:

We propose a new method to identify navigation related usability problems by comparing Web usage patterns extracted from server logs against anticipated usage represented in some cognitive user models (RQ2). Fig. 1 shows the architecture of our method. It includes three major modules: Usage Pattern Extraction, IUIP Modeling, and Usability Problem Identification. First, we extract actual navigation paths from server logs and discover patterns for some typical events. In parallel, we construct IUIP models for the same events. IUIP models are based on the cognition of user behavior and can represent anticipated paths for specific user-oriented tasks.

Our IUIP models are based on the cognitive models surveyed in Section II, particularly the ACT-R model. Due to the complexity of ACT-R model development and the low-level rule-based programming language it relies on, we constructed our own cognitive architecture and supporting tool based on ideas from ACT-R. In general, user behavior patterns can be traced as a sequence of states and transitions, and our IUIP model accordingly consists of a number of states and transitions. For a particular goal, a sequence of related operation rules can be specified for a series of transitions. The IUIP model specifies both the path and a benchmark interaction time (no more than a maximum time) for specific states (pages). The benchmark time can first be specified based on general rules for common types of Web pages. Humans usually try to complete their tasks in the most efficient manner, maximizing their returns while minimizing cost.

Typically, experts and novices perform tasks differently: novices need to learn task-specific knowledge while performing the task, whereas experts can complete the task in the most efficient manner. Because IUIP models capture this cognitive mechanism, our method is cost-effective. It is particularly valuable in two common situations: when an adequate number of actual users cannot be involved in testing, and when cognitive experts are in short supply. Server logs in our method represent real users' operations under natural working conditions, and our IUIP models, injected with human behavior cognition, represent part of the cognitive experts' work. We are currently integrating these modeling and analysis tools into a tool suite that supports measurement, analysis, and overall quality improvement for Web applications.

ADVANTAGES:

1) Logical deviation calculation:

  a) When the path choice anticipated by the IUIP model is available but not selected, a single deviation is counted.
  b) Sum all such deviations over all the selected user transactions for each page.

2) Temporal deviation calculation:

  a) When a user spends more time at a specific page than the benchmark specified for the corresponding state in the IUIP model, a single deviation is counted.
  b) Sum all such deviations over all the selected user transactions for each page.

The successive pages related to furniture categories are grouped in a dashed box. The pages with deviations, and the unanticipated follow-up pages below them, are marked with solid rectangular boxes. The unanticipated follow-up pages are not themselves used in deviation calculations, to avoid double counting.
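The two deviation counts above can be sketched in Java. This is a hedged illustration, not the paper's tool: the IUIP model is reduced to two hypothetical lookup tables, the anticipated next page per state and a benchmark dwell time per page.

```java
import java.util.List;
import java.util.Map;

// Sketch of the logical and temporal deviation counts described above.
public class DeviationCounter {

    // One page visit within a user transaction.
    static class Visit {
        final String page;
        final int seconds;            // dwell time on the page
        Visit(String page, int seconds) { this.page = page; this.seconds = seconds; }
    }

    // Logical deviations: the anticipated next page existed but was not chosen.
    static int logicalDeviations(List<Visit> path, Map<String, String> anticipatedNext) {
        int count = 0;
        for (int i = 0; i + 1 < path.size(); i++) {
            String expected = anticipatedNext.get(path.get(i).page);
            if (expected != null && !expected.equals(path.get(i + 1).page)) count++;
        }
        return count;
    }

    // Temporal deviations: dwell time exceeded the page's benchmark.
    static int temporalDeviations(List<Visit> path, Map<String, Integer> benchmarkSeconds) {
        int count = 0;
        for (Visit v : path) {
            Integer limit = benchmarkSeconds.get(v.page);
            if (limit != null && v.seconds > limit) count++;
        }
        return count;
    }
}
```

Summing each count over all selected transactions per page, as steps b) above describe, gives the per-page deviation totals used to flag candidate usability problems.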

HARDWARE & SOFTWARE REQUIREMENTS:

HARDWARE REQUIREMENT:

  • Processor – Pentium IV
  • Speed – 1 GHz
  • RAM – 256 MB (min)
  • Hard Disk – 20 GB
  • Floppy Drive – 1.44 MB
  • Keyboard – Standard Windows Keyboard
  • Mouse – Two or Three Button Mouse
  • Monitor – SVGA

SOFTWARE REQUIREMENTS:

  • Operating System : Windows XP or Windows 7
  • Front End : Java JDK 1.7
  • Back End : MySQL Server
  • Server : Apache Tomcat Server
  • Script : JSP
  • Documentation : MS Office 2007

Improving Physical-Layer Security in Wireless Communications Using Diversity Techniques

Due to the broadcast nature of radio propagation, wireless transmission can be readily overheard by unauthorized users for interception purposes and is thus highly vulnerable to eavesdropping attacks. To this end, physical-layer security is emerging as a promising paradigm to protect the wireless communications against eavesdropping attacks by exploiting the physical characteristics of wireless channels. This article is focused on the investigation of diversity techniques to improve physical-layer security differently from the conventional artificial noise generation and beamforming techniques, which typically consume additional power for generating artificial noise and exhibit high implementation complexity for beamformer design. We present several diversity approaches to improve wireless physical-layer security, including multiple-input multiple-output (MIMO), multiuser diversity, and cooperative diversity. To illustrate the security improvement through diversity, we propose a case study of exploiting cooperative relays to assist the signal transmission from source to destination while defending against eavesdropping attacks.
We evaluate the security performance of cooperative relay transmission in Rayleigh fading environments in terms of secrecy capacity and intercept probability. It is shown that as the number of relays increases, the secrecy capacity of cooperative relay transmission increases and its intercept probability decreases significantly, implying an advantage in exploiting cooperative diversity to improve physical-layer security against eavesdropping attacks.

In wireless networks, transmission between legitimate users can easily be overheard by an eavesdropper for interception due to the broadcast nature of the wireless medium, making wireless transmission highly vulnerable to eavesdropping attacks. In order to achieve confidential transmission, existing communications systems typically adopt cryptographic techniques to prevent an eavesdropper from tapping data transmission between legitimate users [1, 2]. Taking symmetric key encryption as an example, the original data (called plaintext) is first encrypted at the source node using an encryption algorithm along with a secret key that is shared only with the destination node. The encrypted plaintext (known as ciphertext) is then transmitted to the destination, which decrypts the received ciphertext with the preshared secret key. In this way, even if an eavesdropper overhears the ciphertext transmission, it is still difficult for the eavesdropper to recover the plaintext from the intercepted ciphertext without the secret key. It should be pointed out that ciphertext transmission is not perfectly secure, since the ciphertext can still be decrypted by an eavesdropper through an exhaustive key search, also known as a brute-force attack. To this end, physical-layer security is emerging as an alternative paradigm to protect wireless communications against eavesdropping attacks, including brute-force attacks.
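The symmetric round trip described above can be demonstrated with the JDK's built-in crypto API. This is a minimal sketch using AES-GCM, not part of the article itself; the `SecretKey` stands in for the preshared secret, and without it the ciphertext alone does not reveal the plaintext.

```java
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

// Symmetric encryption round trip: encrypt at the "source", decrypt at the
// "destination" with the same preshared key.
public class SymmetricDemo {
    static String roundTrip(String message) throws Exception {
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128);
        SecretKey key = kg.generateKey();          // the preshared secret

        byte[] iv = new byte[12];
        new SecureRandom().nextBytes(iv);          // fresh nonce per message

        Cipher enc = Cipher.getInstance("AES/GCM/NoPadding");
        enc.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
        byte[] ciphertext = enc.doFinal(message.getBytes(StandardCharsets.UTF_8));
        // An eavesdropper observes only `ciphertext` (and the public iv).

        Cipher dec = Cipher.getInstance("AES/GCM/NoPadding");
        dec.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(128, iv));
        return new String(dec.doFinal(ciphertext), StandardCharsets.UTF_8);
    }
}
```

A brute-force attacker would have to try keys exhaustively, which is exactly the residual risk that motivates the physical-layer approaches discussed next.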
Physical-layer security work was pioneered by Wyner in [3], where a discrete memoryless wiretap channel was examined for secure communications in the presence of an eavesdropper. It was proved in [3] that perfectly secure data transmission can be achieved if the channel capacity of the main link (from source to destination) is higher than that of the wiretap link (from source to eavesdropper). Later on, in [4], Wyner’s results were extended from the discrete memoryless wiretap channel to the Gaussian wiretap channel, where a so-called secrecy capacity was developed, and shown as the difference between the channel capacity of the main link and that of the wiretap link. If the secrecy capacity falls below zero, the transmission from source to destination becomes insecure, and the eavesdropper can succeed in intercepting the source transmission (i.e., an intercept event occurs). In order to improve transmission security against eavesdropping attacks, it is of importance to reduce the probability of occurrence of an intercept event (called intercept probability) through enlarging secrecy capacity. However, in wireless communications, secrecy capacity is severely degraded due to the fading effect.
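The quantities above can be written compactly. In the standard Gaussian wiretap formulation (the SNR symbols $\gamma_m$ and $\gamma_w$ for the main and wiretap links are our notation, not the article's):

```latex
C_s = C_m - C_w = \log_2(1+\gamma_m) - \log_2(1+\gamma_w),
\qquad
P_{\mathrm{int}} = \Pr\{C_s < 0\} = \Pr\{\gamma_m < \gamma_w\}.
```

Enlarging $C_s$ (by strengthening the main link or degrading the wiretap link) directly reduces the intercept probability, which is the goal of the diversity techniques that follow.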

As a consequence, extensive works have aimed at increasing the secrecy capacity of wireless communications by exploiting multiple antennas [5] and cooperative relays [6].
Specifically, the multiple-input multiple-output (MIMO) wiretap channel was studied in [7] to enhance the wireless secrecy capacity in fading environments. In [8], cooperative relays were examined for improving the physical-layer security in terms of the secrecy rate performance. A hybrid cooperative beamforming and jamming approach was investigated in [9] to enhance the wireless secrecy capacity, where partial relay nodes are allowed to assist the source transmission to the legitimate destination with the aid of distributed beamforming, while the remaining relay nodes are used to transmit artificial noise to confuse the eavesdropper. More recently, a joint physical-application layer security framework was proposed in [10] for improving the security of wireless multimedia delivery by simultaneously exploiting physical-layer signal processing techniques as well as upper-layer authentication and watermarking methods. In [11], error control coding for secrecy was discussed for achieving the physical-layer security.
Additionally, in [12, 13], physical-layer security was further investigated in emerging cognitive radio networks. At the time of writing, most research efforts are devoted to examining artificial noise and beamforming techniques to combat eavesdropping attacks, but these techniques consume additional power to generate artificial noise and increase the computational complexity of beamformer design.
Therefore, this article is motivated to enhance physical-layer security through diversity techniques without additional power cost, including MIMO, multiuser diversity, and cooperative diversity, aimed at increasing the capacity of the main channel while degrading the wiretap channel. For illustration purposes, we present a case study of exploiting cooperative relays to improve physical-layer security against eavesdropping attacks, where the best relay is selected and used to forward the signal transmission from source to destination. We evaluate the secrecy capacity and intercept probability of the proposed cooperative relay transmission in Rayleigh fading environments. It is shown that with an increasing number of relays, the security performance of cooperative relay transmission significantly improves in terms of secrecy capacity and intercept probability. This confirms the advantage of using cooperative relays to protect wireless communications against eavesdropping attacks.
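The best-relay rule described above can be sketched in a few lines. This is a hedged illustration under a common formulation (pick the relay maximizing the instantaneous secrecy capacity, given each relay's SNR toward the destination and toward the eavesdropper); the SNR inputs are illustrative, not measured values.

```java
// Best-relay selection: among candidate relays, choose the one maximizing
// C_s = log2(1 + snrToDestination) - log2(1 + snrToEavesdropper).
public class BestRelaySelector {
    static double secrecyCapacity(double snrMain, double snrWiretap) {
        return (Math.log(1 + snrMain) - Math.log(1 + snrWiretap)) / Math.log(2);
    }

    // Returns the index of the relay with the largest secrecy capacity.
    static int selectBestRelay(double[] snrToDest, double[] snrToEve) {
        int best = 0;
        double bestCs = secrecyCapacity(snrToDest[0], snrToEve[0]);
        for (int i = 1; i < snrToDest.length; i++) {
            double cs = secrecyCapacity(snrToDest[i], snrToEve[i]);
            if (cs > bestCs) { bestCs = cs; best = i; }
        }
        return best;
    }
}
```

With more relays, the maximum over more independent fading draws tends to grow, which is the intuition behind the improvement in secrecy capacity and intercept probability reported above.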
The remainder of this article is organized as follows. The next section presents the system model of physical-layer security in wireless communications. After that, we focus on the physical-layer security enhancement through diversity techniques, including MIMO, multiuser diversity, and cooperative diversity. For the purpose of illustrating the security improvement through diversity, we present a case study of exploiting cooperative relays to assist signal transmission from source to destination against eavesdropping attacks. Finally, we provide some concluding remarks.

IDENTITY-BASED ENCRYPTION WITH OUTSOURCED REVOCATION IN CLOUD COMPUTING

ABSTRACT:

Identity-Based Encryption (IBE) which simplifies the public key and certificate management at Public Key Infrastructure (PKI) is an important alternative to public key encryption. However, one of the main efficiency drawbacks of IBE is the overhead computation at Private Key Generator (PKG) during user revocation. Efficient revocation has been well studied in traditional PKI setting, but the cumbersome management of certificates is precisely the burden that IBE strives to alleviate. In this paper, aiming at tackling the critical issue of identity revocation, we introduce outsourcing computation into IBE for the first time and propose a revocable IBE scheme in the server-aided setting.

Our scheme offloads most of the key generation related operations during key-issuing and key-update processes to a Key Update Cloud Service Provider, leaving only a constant number of simple operations for PKG and users to perform locally. This goal is achieved by utilizing a novel collusion-resistant technique: we employ a hybrid private key for each user, in which an AND gate is involved to connect and bound the identity component and the time component. Furthermore, we propose another construction which is provable secure under the recently formulized Refereed Delegation of Computation model. Finally, we provide extensive experimental results to demonstrate the efficiency of our proposed construction.

INTRODUCTION:

Identity-Based Encryption (IBE) is an interesting alternative to public key encryption, proposed to simplify key management in a certificate-based Public Key Infrastructure (PKI) by using human-intelligible identities (e.g., unique name, email address, IP address) as public keys. Therefore, a sender using IBE does not need to look up the receiver's public key and certificate, but directly encrypts the message with the receiver's identity.

Accordingly, the receiver, after obtaining from the Private Key Generator (PKG) the private key associated with the corresponding identity, is able to decrypt the ciphertext. Although IBE allows an arbitrary string as the public key, which is considered an appealing advantage over PKI, it demands an efficient revocation mechanism. Specifically, if the private keys of some users are compromised, we must provide a means to revoke those users from the system. In the PKI setting, revocation is realized by appending validity periods to certificates or by using involved combinations of techniques.

Nevertheless, the cumbersome management of certificates is precisely the burden that IBE strives to alleviate. As far as we know, although revocation has been thoroughly studied in PKI, few revocation mechanisms are known in the IBE setting. Boneh and Franklin suggested that users renew their private keys periodically and that senders use the receivers' identities concatenated with the current time period. But this mechanism results in an overhead load at the PKG: all users, regardless of whether their keys have been revoked, must contact the PKG periodically to prove their identities and obtain new private keys. It requires the PKG to be online and a secure channel to be maintained for all transactions, which becomes a bottleneck for the IBE system as the number of users grows.

A later work presented a revocable IBE scheme built on the fuzzy IBE primitive but utilizing a binary tree data structure to record users' identities at leaf nodes. Key-update efficiency at the PKG is thereby reduced from linear to the height of the binary tree (i.e., logarithmic in the number of users). Nevertheless, we point out that although the binary tree achieves relatively high performance, it introduces other problems:

1) The PKG has to generate a key pair for every node on the path from the identity leaf node to the root node, which makes the complexity of issuing a single private key logarithmic in the number of users in the system.

2) The size of a private key grows logarithmically with the number of users in the system, which complicates private key storage for users.

3) As the number of users in the system grows, the PKG has to maintain a binary tree with a large number of nodes, which introduces another bottleneck for the global system.

In tandem with the development of cloud computing, users can now buy on-demand computing from cloud-based services such as Amazon's EC2 and Microsoft's Windows Azure. This calls for a new working paradigm that introduces such cloud services into IBE revocation to address the efficiency and storage-overhead issues described above. A naive approach would be to simply hand over the PKG's master key to the Cloud Service Providers (CSPs).
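The logarithmic costs in points 1) and 2) above can be made concrete with a small sketch. The complete-binary-tree assumption is ours: with n user leaves, issuing one private key touches every node on the leaf-to-root path, i.e. about log2(n) + 1 nodes.

```java
// Per-user key-issuing cost in the binary-tree revocation approach:
// number of nodes on the path from a leaf to the root of a complete
// binary tree with numLeaves user leaves.
public class TreePathCost {
    static int keysPerUser(int numLeaves) {
        int depth = 0;
        for (int n = numLeaves; n > 1; n >>= 1) depth++;  // floor(log2(numLeaves))
        return depth + 1;                                 // path includes leaf and root
    }
}
```

So a system with about a thousand users already needs roughly eleven node keys per issued private key, which is the overhead the outsourced scheme below avoids.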

The CSPs could then simply update all private keys using the traditional key-update technique [4] and transmit the private keys back to unrevoked users. However, this naive approach rests on the unrealistic assumption that the CSPs are fully trusted and may access the master key of the IBE system. In practice, public clouds are likely outside the trusted domain of users and are curious about users' individual privacy. This raises the challenge of designing a secure revocable IBE scheme that reduces the computation overhead at the PKG with an untrusted CSP.

In this paper, we introduce outsourcing computation into IBE revocation and, to the best of our knowledge, formalize the security definition of outsourced revocable IBE for the first time. We propose a scheme that offloads all key-generation-related operations during key-issuing and key-update, leaving only a constant number of simple operations for the PKG and eligible users to perform locally. In our scheme, as with the suggestion above, we realize revocation by updating the private keys of the unrevoked users, but unlike that work, which trivially concatenates the time period with the identity for key generation/update and requires re-issuing the whole private key for unrevoked users, we proceed as follows.

We propose a novel collusion-resistant key-issuing technique: we employ a hybrid private key for each user, in which an AND gate connects and bounds two sub-components, namely the identity component and the time component. First, a user obtains the identity component and a default time component (i.e., for the current time period) from the PKG as his/her private key during key-issuing. Afterwards, in order to maintain decryptability, unrevoked users periodically request a key-update of the time component from a newly introduced entity named the Key Update Cloud Service Provider (KU-CSP).

Our scheme does not have to re-issue whole private keys; it only needs to update a lightweight component at a specialized entity, the KU-CSP. We also note that: 1) with the aid of the KU-CSP, a user need not contact the PKG for key-update; in other words, the PKG may go offline after sending the revocation list to the KU-CSP; and 2) no secure channel or user authentication is required during key-update between the user and the KU-CSP. Furthermore, we consider realizing revocable IBE with a semi-honest KU-CSP. To achieve this goal, we present a security-enhanced construction under the recently formalized Refereed Delegation of Computation (RDoC) model. Finally, we provide extensive experimental results to demonstrate the efficiency of our proposed construction.
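The hybrid-key idea above can be sketched structurally. This is a hypothetical illustration of the AND-gate logic only, not the cryptographic construction: a key decrypts only when its identity component AND its time component are both valid, so a revoked user, whose time component the KU-CSP stops refreshing, loses decryptability at the next period.

```java
// Structural sketch of the hybrid private key: identity component issued
// once by the PKG, time component refreshed each period by the KU-CSP.
// Field names and the boolean check are illustrative stand-ins for the
// actual pairing-based operations.
public class HybridKey {
    final String identityComponent;   // issued once by the PKG
    int timeComponent;                // refreshed each period by the KU-CSP

    HybridKey(String identity, int period) {
        this.identityComponent = identity;
        this.timeComponent = period;
    }

    // AND gate: both sub-components must be valid for decryption to succeed.
    boolean canDecrypt(String ciphertextIdentity, int currentPeriod) {
        return identityComponent.equals(ciphertextIdentity)
            && timeComponent == currentPeriod;
    }
}
```

Binding the two components under one AND gate is what makes the scheme collusion-resistant: a revoked user's stale time component cannot be combined with another user's fresh one, because each pair is bound per user.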

EXISTING SYSTEM:

  • Identity-Based Encryption (IBE) is an interesting alternative to public key encryption, which is proposed to simplify key management in a certificate-based Public Key Infrastructure (PKI) by using human-intelligible identities (e.g., unique name, email address, IP address, etc) as public keys.
  • Boneh and Franklin suggested that users renew their private keys periodically and senders use the receivers’ identities concatenated with current time period.
  • Hanaoka et al. proposed a way for users to periodically renew their private keys without interacting with PKG.
  • Lin et al. proposed a space-efficient revocable IBE mechanism from non-monotonic Attribute-Based Encryption (ABE), but their construction requires a number of bilinear pairing operations proportional to the number of revoked users for a single decryption.

DISADVANTAGES:

Boneh and Franklin's mechanism results in an overhead load at the PKG. In other words, all users, regardless of whether their keys have been revoked, must contact the PKG periodically to prove their identities and obtain new private keys. It requires the PKG to be online and a secure channel to be maintained for all transactions, which becomes a bottleneck for the IBE system as the number of users grows.

  • Boneh and Franklin's suggestion is viable in principle but impractical at scale.
  • In Hanaoka et al.'s system, however, each user is assumed to possess a tamper-resistant hardware device.
  • If an identity is revoked, the mediator is instructed to stop helping the user. Obviously, this is impractical, since users are unable to decrypt on their own and must communicate with the mediator for each decryption.

PROPOSED SYSTEM:

  • In this paper, we introduce outsourcing computation into IBE revocation and, to the best of our knowledge, formalize the security definition of outsourced revocable IBE for the first time. We propose a scheme that offloads all key-generation-related operations during key-issuing and key-update, leaving only a constant number of simple operations for the PKG and eligible users to perform locally.
  • In our scheme, as with the suggestion above, we realize revocation by updating the private keys of unrevoked users. But unlike that work, which trivially concatenates the time period with the identity for key generation/update and requires re-issuing the whole private key for unrevoked users, we propose a novel collusion-resistant key-issuing technique: we employ a hybrid private key for each user, in which an AND gate connects and bounds two sub-components, namely the identity component and the time component.
  • First, a user obtains the identity component and a default time component (i.e., for the current time period) from the PKG as his/her private key during key-issuing. Afterwards, in order to maintain decryptability, unrevoked users periodically request a key-update of the time component from a newly introduced entity named the Key Update Cloud Service Provider (KU-CSP).

ADVANTAGES:

  • Compared with previous work, our scheme does not re-issue whole private keys; it only updates a lightweight component at a specialized entity, the KU-CSP.
  • With the aid of the KU-CSP, a user need not contact the PKG for key-update; in other words, the PKG may go offline after sending the revocation list to the KU-CSP.
  • No secure channel or user authentication is required during key-update between the user and the KU-CSP.
  • Furthermore, we consider realizing revocable IBE with a semi-honest KU-CSP. To achieve this goal, we present a security-enhanced construction under the recently formalized Refereed Delegation of Computation (RDoC) model.
  • Finally, we provide extensive experimental results to demonstrate the efficiency of our proposed construction.

HARDWARE REQUIREMENT:

  • Processor – Pentium IV
  • Speed – 1 GHz
  • RAM – 256 MB (min)
  • Hard Disk – 20 GB
  • Floppy Drive – 1.44 MB
  • Keyboard – Standard Windows Keyboard
  • Mouse – Two or Three Button Mouse
  • Monitor – SVGA

SOFTWARE REQUIREMENTS:

  • Operating System : Windows XP or Windows 7
  • Front End : Java JDK 1.7
  • Back End : MySQL Server
  • Server : Apache Tomcat Server
  • Script : JSP
  • Documentation : MS Office 2007