Statistical Dissemination Control in Large Machine-to-Machine Communication Networks

05/08/201902/07/2019 by admin

Cloud-based machine-to-machine (M2M) communications have emerged to achieve ubiquitous and autonomous data transportation for future daily life in the cyber-physical world. To characterize such networks, we analyze the connected M2M network formed by a machine swarm with geometric random graph topology, including its degree distribution, network diameter, and average distance (i.e., number of hops). Because it avoids the catastrophic complexity of gathering end-to-end information, information dissemination appears to be an effective approach in the machine swarm. To fully understand practical data transportation, a G/G/1 queuing network model is used to obtain the average end-to-end delay and the maximum achievable system throughput.

Furthermore, as real applications may require dependable networking performance across the swarm, providing quality of service (QoS) over a large network diameter creates a new intellectual challenge. We extend the concept of the small-world network to form shortcuts among data aggregators, yielding an infrastructure-swarm two-tier heterogeneous network architecture, and then leverage statistical network control instead of precise network optimization to achieve QoS guarantees. Simulation results confirm that the proposed heterogeneous network architecture effectively provides delay guarantees in a statistical way and facilitates a new design paradigm for reliable M2M communications.

1.2 INTRODUCTION:

Cloud-based machine-to-machine (M2M) communications have emerged to enable services through interaction between the cyber and physical worlds, achieving ubiquitous and autonomous data transportation among objects and the surrounding environment in our daily lives. A wireless network that involves so many machines that end-to-end information cannot be made available at each machine is referred to as a large M2M network, and such networks are gaining importance in next-generation wireless systems. Because these machines have only short-range communication capabilities, multi-hop networking is a must for information dissemination over the machine swarm. Connectivity and low delivery latency in the machine swarm are consequently crucial to achieving reliable communications.

However, without a complete understanding of large network characteristics, effective traffic control for message delivery remains an open problem, and a proper routing control scheme with quality-of-service (QoS) guarantees on end-to-end delay is urgently needed to make M2M communications practical. This is even more challenging because of the scalability limits of multi-hop ad hoc networks and the need for energy-efficient and spectrally efficient operation at each machine. To investigate routing mechanisms for large-scale networks, network topology can be studied through random network analysis, and prior work provides a comprehensive study of network structure and function from the complex networks perspective. Aiming at social communities mediated by network technologies, other work reviews the historical research on community analysis and community discovery methods in social media.

Prior work develops unbiased sampling of users in an online social network by crawling the social graph, and further examines multiple underlying relations in such a network to introduce random walk sampling. In social network research, information-centric networking has been proposed because it benefits both the network operator and the end users. Exploring various research challenges in context management, another study presents a context management architecture suitable for social networking systems enhanced with pervasive features. A survey of current routing solutions discusses the trend toward social-based routing protocols, which are classified by the network graph they employ.

In addition, pioneering work on employing social network analysis in message delivery applies the small-world phenomenon of social networks to navigation, successfully creating transmissions with less delay. The small-world phenomenon plays a crucial role in social networks: it states that each individual in such a network is linked to others by a short chain of acquaintances, and it has great potential for improving spectral and energy efficiency by shortening the end-to-end delay. The literature also presents a thorough examination of average message delivery time for small-world networks in the continuum limit. Via random network analysis, one study examines the properties of the giant component in wireless multi-hop networks, while another provides a heterogeneous structure for such networks and conducts throughput and delay analysis. Furthermore, rumor and gossip routing algorithms are widely employed in sensor networks, in disconnected delay-tolerant MANETs, and in generalized complex networks, and related studies provide social network analysis of information flow and of epidemic information dissemination.

In this paper, inspired by the small-world phenomenon, we connect data aggregators (DAs) to the machine swarm and propose a two-tier heterogeneous architecture with a small-world network of DAs for statistical traffic control in large M2M communication networks. To address efficient dissemination control for routing and QoS in applications such as surveillance, we first analytically derive the condition for establishing connected M2M networks and explore essential geometric properties of these networks (i.e., degree distribution, network diameter, and average distance). Analytic bounds on the average distance characterize the average number of hops that machines’ packets must traverse over the swarm, and thus dominate the QoS guarantee capability for reliable communications. Furthermore, practical data transportation in connected M2M networks is modeled through a G/G/1 queuing network, in which both the inter-arrival time and the service time of a traffic queue follow arbitrary distributions. Both the average end-to-end delay and the maximum achievable throughput per machine for information dissemination over multi-hop networking in the machine swarm are examined.

1.3 LITERATURE SURVEY

TOWARD UBIQUITOUS MASSIVE ACCESSES IN 3GPP MACHINE-TO-MACHINE COMMUNICATIONS

AUTHOR: S. Lien, K. C. Chen, and Y. Lin

PUBLISH: IEEE Commun. Mag., vol. 49, no. 4, pp. 66–74, Apr. 2011.

EXPLANATION:

To enable full mechanical automation where each smart device can play multiple roles among sensor, decision maker, and action executor, it is essential to construct scrupulous connections among all devices. Machine-to-machine communications thus emerge to achieve ubiquitous communications among all devices. With the merit of providing higher-layer connections, scenarios of 3GPP have been regarded as the promising solution facilitating M2M communications, which is being standardized as an emphatic application to be supported by LTE-Advanced. However, distinct features in M2M communications create diverse challenges from those in human-to-human communications. To deeply understand M2M communications in 3GPP, in this article, we provide an overview of the network architecture and features of M2M communications in 3GPP, and identify potential issues on the air interface, including physical layer transmissions, the random access procedure, and radio resources allocation supporting the most critical QoS provisioning. An effective solution is further proposed to provide QoS guarantees to facilitate M2M applications with inviolable hard timing constraints.

SMALL-WORLD NETWORKS EMPOWERED LARGE MACHINE-TO-MACHINE COMMUNICATIONS

AUTHOR: L. Gu, S. C. Lin, and K. C. Chen

PUBLISH: IEEE WCNC, 2013, pp. 1–6.

EXPLANATION:

Cloud-based machine-to-machine communications emerge to facilitate services through linkage between cyber and physical worlds. In addition to great challenges in a large network of machine/sensor swarm, effective network architecture involving interconnection of wireless infrastructure and multi-hop ad hoc networking in the machine swarm remains open. Inspired by the small-world phenomenon in social networks, we may establish a short-cut path under heterogeneous network architecture through wireless infrastructure and cloud, by connecting to data aggregators or access points in the machine swarm, such that end-to-end delay can be significantly reduced. Our mathematical analysis on network diameter and average delay, along with verifications by simulations, demonstrate spectral and energy efficiency of our proposed heterogeneous network architecture in large machine-to-machine communication networks.

COGNITIVE MACHINE-TO-MACHINE COMMUNICATIONS: VISIONS AND POTENTIALS FOR THE SMART GRID

AUTHOR: Y. Zhang et al.,

PUBLISH: IEEE Netw., vol. 26, no. 3, pp. 6–13, May/Jun. 2012.

EXPLANATION:

Visual capability introduced to Wireless Sensor Networks (WSNs) render many novel applications that would otherwise be infeasible. However, unlike legacy WSNs which are commercially deployed in applications, visual sensor networks create additional research problems that delay the real-world implementations. Conveying real-time video streams over resource constrained sensor hardware remains to be a challenging task. As a remedy, we propose a fairness-based approach to enhance the event reporting and detection performance of the Video Surveillance Sensor Networks. Instead of achieving fairness only for flows or for nodes as investigated in the literature, we concentrate on the whole application requirement. Accordingly, our Event-Based Fairness (EBF) scheme aims at fair resource allocation for the application level messaging units called events. We identify the crucial network-wide resources as the in-queue processing turn of the frames and the channel access opportunities of the nodes. We show that fair treatment of events, as opposed to regular flow of frames, results in enhanced performance in terms of the number of frames reported per event and the reporting latency. EBF is a robust mechanism that can be used as a stand-alone or as a complementary method to other possible performance enhancement methods for video sensor networks implemented at other communication layers.

CHAPTER 2

2.0 SYSTEM ANALYSIS

2.1 EXISTING SYSTEM:

Existing methods investigated in the literature build on machine-to-machine communications that facilitate services by linking the cyber and physical worlds. Beyond the great challenges of a large machine/sensor swarm network, an effective network architecture that interconnects wireless infrastructure with multi-hop ad hoc networking in the machine swarm remains open. Inspired by the small-world phenomenon in social networks, a short-cut path may be established under a heterogeneous network.

Previous discussion notes a tradeoff in existing schemes, whereas heterogeneous schemes are able to provide promising guaranteed throughput even under a strict QoS demand with a tight delay constraint τ. Moreover, Fig. 8 provides an exhaustive throughput comparison among different scenarios to complete the evaluation. While the QoS guaranteed throughput is upper bounded by the maximum achievable throughput, the heterogeneous architecture provides a large throughput improvement over the plain machine swarm.

Existing fairness approaches allocate resources to application-level messaging units called events. The crucial network-wide resources are identified as the in-queue processing turn of the frames and the channel access opportunities of the nodes. Fair treatment of events, as opposed to a regular flow of frames, results in enhanced performance in terms of the number of frames reported per event and the reporting latency, and it can be used stand-alone or as a complement to other performance enhancement methods for video sensor networks implemented at other communication layers.

2.1.1 DISADVANTAGES:

With a single source-destination pair, there exist a source machine, a destination machine, and several relay machines that forward traffic from the source to the destination.

Without loss of generality, it is assumed that sequences of packets follow a general arrival process and a general service time, and each transmission link is modeled as a queue.

Such a queue represents a queuing system with a single server, infinite buffer size, and a given scheduling discipline, in which interarrival times have a general (meaning arbitrary) distribution and service times have a (different) general distribution.

2.2 PROPOSED SYSTEM:

Machine-to-machine (M2M) communications have emerged to operate autonomously and link interactions between the Internet cyber world and physical systems. We present the technological scenario of M2M communications, which consists of wireless infrastructure connected to the cloud and a machine swarm of a tremendous number of devices. Related technologies toward practical realization are explored to complete the fundamental understanding and engineering knowledge of this new communication and networking technology front. We connect data aggregators (DAs) to the machine swarm and propose a two-tier heterogeneous architecture with a small-world network of DAs for statistical traffic control in large M2M communication networks, addressing efficient dissemination control for routing and QoS in applications such as surveillance.

We first analytically derive the condition for establishing connected M2M networks and explore essential geometric properties of these networks (i.e., degree distribution, network diameter, and average distance). Analytic bounds on the average distance characterize the average number of hops that machines’ packets must traverse over the swarm, and thus dominate the QoS guarantee capability for reliable communications. Furthermore, practical data transportation in connected M2M networks is modeled through a G/G/1 queuing network, in which both the inter-arrival time and the service time of a traffic queue follow arbitrary distributions.

Aiming at statistical performance in large M2M networks, we propose a statistical control mechanism that establishes the heterogeneous network architecture and exploits statistical QoS guarantees for end-to-end transmissions without the need for feedback control at each link. By forming a DA network with the small-world property and linking machines to DAs, this heterogeneous architecture significantly improves end-to-end traffic performance for tolerable delay and makes dependable communications possible by guaranteeing traffic QoS, with extremely simple network operation for each machine.
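One way to read a statistical QoS guarantee is as a bound on how often the delay constraint is missed, rather than a hard per-packet deadline. The sketch below is only an illustration under that reading: the violation tolerance `epsilon`, the function names, and the definition of guaranteed throughput as the on-time share of the offered rate are all assumptions, not definitions taken from the paper.

```python
def meets_statistical_qos(delays, tau, epsilon):
    """True if the observed fraction of packets later than tau is at most epsilon.

    delays  : measured end-to-end delays (any consistent time unit)
    tau     : delay constraint
    epsilon : tolerated violation probability (hypothetical parameter)
    """
    late = sum(1 for d in delays if d > tau)
    return late / len(delays) <= epsilon


def guaranteed_throughput(delays, tau, offered_rate):
    """Offered rate scaled by the fraction of packets delivered within tau.

    This is one simple, assumed notion of throughput under a delay guarantee.
    """
    on_time = sum(1 for d in delays if d <= tau)
    return offered_rate * on_time / len(delays)


# Five observed delays, one of which exceeds the constraint tau = 5.
observed = [1.0, 2.0, 3.0, 4.0, 10.0]
print(meets_statistical_qos(observed, tau=5.0, epsilon=0.2))        # True
print(guaranteed_throughput(observed, tau=5.0, offered_rate=10.0))  # 8.0
```

Because the check uses only delays observed at the destination, it needs no per-link feedback, which matches the spirit of the mechanism described above.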

2.2.1 ADVANTAGES:

To understand the geometric properties of large M2M networks and thus benchmark performance, we first analytically examine network connectivity, degree distribution, network diameter, and average distance under a Poisson Point Process (PPP) machine distribution.

By introducing queuing network theory into this network analysis for practical data transportation, the average delay and achievable throughput for message delivery in connected M2M networks are analytically obtained, as well as the QoS guaranteed throughput in real applications.

Building on this analysis, statistical dissemination control is proposed that combines a DA network with the machine swarm (or sensor swarm) to form a favorable heterogeneous network architecture.

Because end-to-end information exchange and the subsequent precise control are infeasible, we exploit statistical QoS guarantees over the two-tier heterogeneous network architecture to achieve a remarkable enhancement of system performance and to bring the merits of the small-world phenomenon into engineering reality.

2.3 HARDWARE & SOFTWARE REQUIREMENTS:

2.3.1 HARDWARE REQUIREMENT:

- Processor – Pentium IV
- Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Key Board – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA

2.3.2 SOFTWARE REQUIREMENTS:

Operating System : Windows XP or Win7
Front End : JAVA JDK 1.7
Script : JavaScript
Tools : Netbeans 7
Document : MS-Office 2007

CHAPTER 3

3.0 SYSTEM DESIGN:

Data Flow Diagram / Use Case Diagram / Flow Diagram:

A data flow diagram (DFD), also called a bubble chart, is a simple graphical formalism that represents a system in terms of the input data to the system, the processing carried out on these data, and the output data generated by the system.

The DFD is one of the most important modeling tools. It is used to model the system components: the system processes, the data used by the processes, the external entities that interact with the system, and the information flows in the system.

A DFD shows how information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.

A DFD may be used to represent a system at any level of abstraction, and it may be partitioned into levels that represent increasing information flow and functional detail.

NOTATION:

SOURCE OR DESTINATION OF DATA:

External sources or destinations, which may be people, organizations, or other entities.

DATA STORE:

Here the data referenced by a process is stored and retrieved.

PROCESS:

People, procedures, or devices that produce data. The physical component is not identified.

DATA FLOW:

Data moves in a specific direction from an origin to a destination. The data flow is a “packet” of data.

MODELING RULES:

There are several common modeling rules when creating DFDs:

All processes must have at least one data flow in and one data flow out.
All processes should modify the incoming data, producing new forms of outgoing data.
Each data store must be involved with at least one data flow.
Each external entity must be involved with at least one data flow.
A data flow must be attached to at least one process.
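The structural rules above can be checked mechanically once a DFD is written down as data. The following is a minimal sketch, assuming a DFD is described by three sets of names plus a list of flows. The `Flow` type, the `check_dfd` function, and the example element names are hypothetical. The rule that a process should modify its incoming data is semantic and is not checked here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    source: str
    target: str
    label: str


def check_dfd(processes, stores, externals, flows):
    """Return a list of rule violations; arguments are sets of names and a list of Flow."""
    problems = []
    # Every process needs at least one data flow in and one data flow out.
    for p in sorted(processes):
        has_in = any(f.target == p for f in flows)
        has_out = any(f.source == p for f in flows)
        if not (has_in and has_out):
            problems.append(f"process '{p}' needs at least one inflow and one outflow")
    # Every data store and external entity must touch at least one data flow.
    for name in sorted(stores | externals):
        if not any(name in (f.source, f.target) for f in flows):
            kind = "data store" if name in stores else "external entity"
            problems.append(f"{kind} '{name}' has no data flow")
    # Every data flow must be attached to at least one process.
    for f in flows:
        if f.source not in processes and f.target not in processes:
            problems.append(f"flow '{f.label}' is not attached to any process")
    return problems


flows = [
    Flow("Machine", "Route packet", "packet"),
    Flow("Route packet", "Delay log", "delay record"),
    Flow("Machine", "Delay log", "raw dump"),  # bypasses every process
]
print(check_dfd({"Route packet"}, {"Delay log"}, {"Machine"}, flows))
# ["flow 'raw dump' is not attached to any process"]
```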

3.1 ARCHITECTURE DIAGRAM

3.2 DATAFLOW DIAGRAM

UML DIAGRAMS:

3.2 USE CASE DIAGRAM:

3.3 CLASS DIAGRAM:

3.4 SEQUENCE DIAGRAM:

3.5 ACTIVITY DIAGRAM:

CHAPTER 4

4.0 IMPLEMENTATION:

GEOMETRIC RANDOM GRAPH (GRG) :

An M2M communication network consists of a tremendous number of self-organized machines/sensors and enables autonomous connections among different applications for ubiquitous communications over such a large swarm system. To put this scenario into practice, connectivity accompanied by reliable transportation is a must for such a large network. In the following, we highlight the relevant research and introduce the M2M network model, which uses the geometric random graph (GRG) because its topology and local clustering property are suitable for benchmarking large wireless ad hoc sensor networks.
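To make the GRG model concrete, the sketch below drops machines on a square according to a Poisson point process, links any two machines within communication range, and measures connectivity, mean degree, network diameter, and average distance in hops. It is an empirical illustration only, not the analytic derivation. The density, side length, radius, and all function names are assumed values chosen for the example.

```python
import math
import random
from collections import deque


def poisson(rng, mean):
    """Draw a Poisson variate by counting unit-rate exponential arrivals in [0, mean]."""
    count, t = 0, rng.expovariate(1.0)
    while t <= mean:
        count += 1
        t += rng.expovariate(1.0)
    return count


def geometric_random_graph(density, side, radius, rng):
    """PPP points on a side x side square; link two machines if within `radius`."""
    n = poisson(rng, density * side * side)
    pts = [(rng.uniform(0, side), rng.uniform(0, side)) for _ in range(n)]
    adj = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(pts[i], pts[j]) <= radius:
                adj[i].append(j)
                adj[j].append(i)
    return pts, adj


def hop_counts(adj, src):
    """Breadth-first search: hop distance from src to every reachable machine."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def summarize(adj):
    """Connectivity, mean degree, diameter and average distance (reachable pairs only)."""
    n = len(adj)
    all_hops = [hop_counts(adj, s) for s in range(n)]
    pair_hops = [d for h in all_hops for d in h.values() if d > 0]
    return {
        "machines": n,
        "connected": all(len(h) == n for h in all_hops),
        "mean_degree": sum(len(a) for a in adj) / n,
        "diameter": max(pair_hops),
        "average_distance": sum(pair_hops) / len(pair_hops),
    }


rng = random.Random(1)
pts, adj = geometric_random_graph(density=200, side=1.0, radius=0.15, rng=rng)
print(summarize(adj))
```

Shrinking `radius` while keeping `density` fixed eventually disconnects the graph, which is why the communication range has to be chosen with respect to the number of machines.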

Because it avoids the catastrophic complexity of gathering end-to-end information, information dissemination becomes the only practical approach in the machine swarm. We exploit an open G/G/1 queuing network model for delay and throughput analysis of M2M networks, and the diffusion approximation is used to analyze the queuing network. This analytical methodology handles wireless networks with general inter-arrival and service time distributions by providing closed-form expressions for the end-to-end delay and the maximum achievable throughput per node. In the following, to fully understand practical data transportation, we present the traffic model and an equivalent queuing network model for connected M2M networks.
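As a numerical companion to the queuing model, the sketch below simulates a multi-hop route as a chain of single-server queues with arbitrary inter-arrival and service time distributions. It is a plain simulation, not the diffusion approximation used in the analysis. It assumes first-come, first-served relays with infinite buffers, and the uniform and gamma distributions, the hop count, and the function name are illustrative choices.

```python
import random


def tandem_delay(num_packets, hops, draw_interarrival, draw_service, rng):
    """Mean end-to-end delay over a chain of single-server FCFS queues.

    draw_interarrival(rng) and draw_service(rng) may return samples from any
    distribution, which is what makes each hop a G/G/1 queue.
    """
    last_departure = [0.0] * hops   # time at which each relay last finished a packet
    arrival = 0.0
    total = 0.0
    for _ in range(num_packets):
        arrival += draw_interarrival(rng)
        t = arrival
        for k in range(hops):
            start = max(t, last_departure[k])   # wait if the relay is still busy
            t = start + draw_service(rng)
            last_departure[k] = t
        total += t - arrival
    return total / num_packets


rng = random.Random(7)
mean_delay = tandem_delay(
    num_packets=20000,
    hops=5,
    draw_interarrival=lambda r: r.uniform(0.5, 1.5),   # mean 1.0
    draw_service=lambda r: r.gammavariate(2.0, 0.2),   # mean 0.4
    rng=rng,
)
print(round(mean_delay, 3))
```

The chain is stable only while the mean service time at every hop stays below the mean inter-arrival time. Raising the arrival rate toward that limit makes the delay grow without bound, which is the intuition behind a maximum achievable throughput.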

4.1 ALGORITHM

M2M ROUTING ALGORITHM:

For the M2M routing algorithm, this paper studies the asymptotic performance of several statistical QoS metrics, such as end-to-end delay, maximum throughput, and throughput under guaranteed delay, for a general forwarding scheme in an M2M network. More importantly, our previous work focuses on obtaining the traffic performance under a specific scenario setting, which simplifies the analysis but fails to maintain the same level of transmission quality when the scenario changes, e.g., when the network topology or traffic pattern becomes different.

The proposed algorithms solve this challenge through statistical dissemination control by leveraging the heterogeneous network architecture. In particular, the upper layer, the DAs’ network, enables shortcut transmissions that reduce the excess end-to-end delay caused by long-route transmissions in the lower layer, the machine swarm. A comprehensive performance analysis of this heterogeneous architecture is also included in this paper. With these accomplishments, we provide an original paradigm to facilitate M2M communications, practically realizing information dissemination control to meet the needs of time-sensitive applications in next-generation wireless standards.
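The effect of the upper-layer shortcuts on hop count can be illustrated with a toy two-tier topology. This sketch makes several simplifying assumptions that are not from the paper: a square lattice stands in for the geometric random graph, each DA is co-located with one machine, and every pair of DAs is joined by a single overlay hop. The function names and DA positions are hypothetical.

```python
from collections import deque
from itertools import combinations


def grid_swarm(side):
    """Machines on a side x side lattice, each linked to its nearest neighbours."""
    adj = {(x, y): set() for x in range(side) for y in range(side)}
    for (x, y) in adj:
        for nx, ny in ((x + 1, y), (x, y + 1)):
            if (nx, ny) in adj:
                adj[(x, y)].add((nx, ny))
                adj[(nx, ny)].add((x, y))
    return adj


def add_da_overlay(adj, da_sites):
    """Return a copy in which every pair of DA sites is joined by a one-hop shortcut."""
    overlay = {node: set(nbrs) for node, nbrs in adj.items()}
    for a, b in combinations(da_sites, 2):
        overlay[a].add(b)
        overlay[b].add(a)
    return overlay


def average_hops(adj):
    """Average shortest-path hop count over all ordered pairs of reachable machines."""
    total = pairs = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


swarm = grid_swarm(12)
das = [(2, 2), (2, 9), (9, 2), (9, 9)]
hetero = add_da_overlay(swarm, das)
print(average_hops(swarm), average_hops(hetero))
```

Adding links can never lengthen a shortest path, so the second figure is always at most the first. How much it drops depends on how many DAs are installed and where they sit.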

4.2 MODULES:

NETWORK TOPOLOGY DESIGN:

SERVER CLIENT MODULE:

STATISTICAL QOS GUARANTEE:

M2M COMMUNICATION CONTROL:

END-TO-END DELAY ANALYSIS:

4.3 MODULE DESCRIPTION:

NETWORK TOPOLOGY DESIGN:

This module develops a wireless mesh topology in which all nodes are placed at particular distances. Packet data are transmitted and received over fully wireless equipment, without any cables. The distance between each node and the wireless sensors is calculated together with the transmission range, so that all nodes are physically interconnected. The sink is at the center of the circular sensing area.

The module also handles node creation, with more than 20 nodes placed at particular distances and wireless sensors placed in the intermediate area. Each node knows its location relative to the sink, and each node is programmed with the total number of nodes in the network.

SERVER CLIENT MODULE:

Client-server computing or networking is a distributed application architecture that partitions tasks or workloads between service providers (servers) and service requesters, called clients. Often clients and servers operate over a computer network on separate hardware. A server machine is a high-performance host that runs one or more server programs, which share its resources with clients. A client does not share any of its resources; clients therefore initiate communication sessions with servers, which await (listen for) incoming requests.

STATISTICAL QOS GUARANTEE:

M2M COMMUNICATION CONTROL:

Three link types are considered: machine-to-machine communication with low data rate and energy cost, machine-to-DA communication with medium data rate, and DA-to-DA communication with high data rate. We adopt the related values shown in Table II and set up the experiment as follows. 1 Mb of data is sent from the source machine to the destination machine in the plain machine swarm and in the heterogeneous architecture separately. Moreover, the DAs’ communication capabilities are characterized by the number of machines z that can be served simultaneously by a single DA.

For the heterogeneous architecture, as the DA’s capability (the number of machines it can serve) increases linearly, the required number of DAs drops exponentially. This suggests that a few powerful DAs are preferable to a large number of DAs with limited capability. Furthermore, Fig. 10 shows the average end-to-end delay with respect to different area sizes of the Metropolis. As the area size increases (and with it the number of machines in each block), the heterogeneous architecture incurs much less traffic delay than the plain machine swarm.

For example, with an area size of 60 km² and 10^8 machines, the delay of the heterogeneous architecture is 115 s, compared with 2,500 s for the swarm. Moreover, the linear curves in the log scale of Fig. 10(b) confirm our asymptotic results and suggest that the heterogeneous architecture outperforms the plain machine swarm with about 95% delay reduction for 10 billion machines. To conclude, by efficiently connecting a few DAs to construct small-world shortcuts, the proposed statistical control together with the heterogeneous architecture resolves undependable end-to-end transmissions.
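As a quick arithmetic check on the quoted 60 km² example, the relative delay reduction can be computed directly from the two delays. The helper name below is hypothetical, and the calculation covers only this one data point, not the asymptotic trend read from Fig. 10(b).

```python
def delay_reduction(swarm_delay_s, hetero_delay_s):
    """Fraction of end-to-end delay removed by the heterogeneous architecture."""
    return 1.0 - hetero_delay_s / swarm_delay_s


reduction = delay_reduction(swarm_delay_s=2500.0, hetero_delay_s=115.0)
print(f"{reduction:.1%}")  # prints 95.4%
```

The result for this point is in line with the roughly 95% reduction reported for the largest network size.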

END-TO-END DELAY ANALYSIS:

We compare the performance of the proposed heterogeneous network architecture with that of the plain machine swarm. Simulation results confirm that the heterogeneous architecture achieves remarkable delay reduction as well as high throughput gain with only a few DAs installed, which favors practical implementation in large M2M networks. All simulation parameters and value settings are listed in Table I. In particular, to ensure that every packet can be sent from its source to its corresponding destination, a connected M2M network is first established via the proposed analysis (i.e., by selecting the appropriate machine communication range r with respect to the total machine number n). When a source machine generates a packet, it routes the packet to a specific destination, uniformly selected among the other machines.

Moreover, in the plain machine swarm the source simply hops forward based on sensing and relaying, whereas in the heterogeneous architecture it employs dissemination without selecting a particular DA. In the following, we first evaluate the average distance to DAs and the end-to-end distance for the plain machine swarm and the heterogeneous architecture. Next, end-to-end packet delay, maximum system throughput, and throughput under guaranteed delay are thoroughly examined for the two architectures and compared with simulation validation. Finally, a Metropolis scenario is established to bring our design to an even more practical stage.

CHAPTER 8

8.1 CONCLUSION AND FUTURE WORK:

In this paper, we address the critical challenge of providing statistical control for reliable information dissemination over large M2M communication networks. By examining the network topology of M2M networks, the geometric properties of such large networks are studied to analytically characterize message delivery over connected M2M networks.

Moreover, by leveraging a queuing network model, practical data transportation is modeled, and both the average end-to-end delay and the maximum achievable throughput of these connected networks are obtained. Based on these explorations, statistical control with a small-world network of data aggregators, and thus the heterogeneous architecture, is proposed to establish shortcut paths among machine communications.

Performance evaluation verifies that, instead of relying on long concatenations of multi-hop transmissions in the machine swarm, our heterogeneous network architecture enables machines to communicate through an overlaid ultra-fast “highway”, like a shortcut in small-world networks, with the desired throughput. This is particularly crucial for next-generation networks with tremendous numbers of machines. Therefore, we achieve reliable communications via the proposed methodology and facilitate novel traffic control in M2M communication networks.

Shared Authority Based Privacy-Preserving Authentication Protocol in Cloud Computing

05/08/201902/07/2019 by admin

Shared Authority Based Privacy-PreservingAuthentication Protocol in Cloud ComputingHong Liu, Student Member, IEEE, Huansheng Ning, Senior Member, IEEE,Qingxu Xiong, Member, IEEE, and Laurence T. Yang, Member, IEEEAbstract—Cloud computing is an emerging data interactive paradigm to realize users’ data remotely stored in an online cloudserver. Cloud services provide great conveniences for the users to enjoy the on-demand cloud applications without considering thelocal infrastructure limitations. During the data accessing, different users may be in a collaborative relationship, and thus datasharing becomes significant to achieve productive benefits. The existing security solutions mainly focus on the authentication torealize that a user’s privative data cannot be illegally accessed, but neglect a subtle privacy issue during a user challenging thecloud server to request other users for data sharing. The challenged access request itself may reveal the user’s privacy no matterwhether or not it can obtain the data access permissions. In this paper, we propose a shared authority based privacy-preservingauthentication protocol (SAPA) to address above privacy issue for cloud storage. In the SAPA, 1) shared access authority isachieved by anonymous access request matching mechanism with security and privacy considerations (e.g., authentication, dataanonymity, user privacy, and forward security); 2) attribute based access control is adopted to realize that the user can only accessits own data fields; 3) proxy re-encryption is applied to provide data sharing among the multiple users. Meanwhile, universalcomposability (UC) model is established to prove that the SAPA theoretically has the design correctness. 
It indicates that theproposed protocol is attractive for multi-user collaborative cloud applications.Index Terms—Cloud computing, authentication protocol, privacy preservation, shared authority, universal composabilityÇ1 INTRODUCTIONCLOUD computing is a promising information technologyarchitecture for both enterprises and individuals. Itlaunches an attractive data storage and interactive paradigmwith obvious advantages, including on-demand selfservices,ubiquitous network access, and location independentresource pooling [1]. Towards the cloud computing, atypical service architecture is anything as a service (XaaS),in which infrastructures, platform, software, and others areapplied for ubiquitous interconnections. Recent studieshave been worked to promote the cloud computing evolvetowards the internet of services [2], [3]. Subsequently, securityand privacy issues are becoming key concerns with theincreasing popularity of cloud services. Conventional securityapproaches mainly focus on the strong authenticationto realize that a user can remotely access its own data in ondemandmode. Along with the diversity of the applicationrequirements, users may want to access and share each other’sauthorized data fields to achieve productive benefits,which brings new security and privacy challenges for thecloud storage.An example is introduced to identify the main motivation.In the cloud storage based supply chain management,there are various interest groups (e.g., supplier, carrier, andretailer) in the system. Each group owns its users which arepermitted to access the authorized data fields, and differentusers own relatively independent access authorities. Itmeans that any two users from diverse groups shouldaccess different data fields of the same file. Thereinto, a suppliermay want to access a carrier’s data fields, but it is notsure whether the carrier will allow its access request. 
If the carrier refuses its request, the supplier's access desire will be revealed while nothing is obtained from the desired data fields. In fact, the supplier might not send the access request, or might withdraw the unaccepted request in advance, if it knew for certain that its request would be refused by the carrier. It is unreasonable to thoroughly disclose the supplier's private information without any privacy considerations. Fig. 1 illustrates three revised cases to address the above imperceptible privacy issue.

- Case 1. The carrier also wants to access the supplier's data fields, and the cloud server should inform each other and transmit the shared access authority to both users;
- Case 2. The carrier has no interest in other users' data fields, therefore its authorized data fields should be properly protected, meanwhile the supplier's access request will also be concealed;
- Case 3. The carrier may want to access the retailer's data fields, but it is not certain whether the retailer will accept its request or not. The retailer's authorized data fields should not be public if the retailer has no interest in the carrier's data fields, and the carrier's request is also privately hidden.

Towards the above three cases, security protection and privacy preservation are both considered without revealing sensitive access desire related information.

H. Liu and Q. Xiong are with the School of Electronic and Information Engineering, Beihang University, Beijing, China. E-mail: liuhongler@ee.buaa.edu.cn, qxxiong@buaa.edu.cn.
H. Ning is with the School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing, China, and the School of Electronic and Information Engineering, Beihang University, Beijing, China. E-mail: ninghuansheng@ustb.edu.cn.
L.T. Yang is with the School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, China, and the Department of Computer Science, St. Francis Xavier University, Antigonish, NS, Canada. E-mail: ltyang@stfx.ca.
Manuscript received 3 Nov. 2013; revised 23 Dec. 2013; accepted 30 Dec. 2013. Date of publication 24 Feb. 2014; date of current version 5 Dec. 2014. Recommended for acceptance by J. Chen. Digital Object Identifier no. 10.1109/TPDS.2014.2308218.
IEEE Transactions on Parallel and Distributed Systems, vol. 26, no. 1, January 2015. 1045-9219 © 2014 IEEE.

In the cloud environments, a reasonable security protocol should achieve the following requirements.
1) Authentication: a legal user can access its own data fields, only the authorized partial or entire data fields can be identified by the legal user, and any forged or tampered data fields cannot deceive the legal user.
2) Data anonymity: any irrelevant entity cannot recognize the exchanged data and communication state even if it intercepts the exchanged messages via an open channel.
3) User privacy: any irrelevant entity cannot know or guess a user's access desire, which represents a user's interest in another user's authorized data fields. If and only if both users have mutual interests in each other's authorized data fields, the cloud server will inform the two users to realize the access permission sharing.
4) Forward security: any adversary cannot correlate two communication sessions to derive the prior interrogations according to the currently captured messages.

Research has been carried out to strengthen security protection and privacy preservation in cloud applications, and there are various cryptographic algorithms to address potential security and privacy problems, including security architectures [4], [5], data possession protocols [6], [7], data public auditing protocols [8], [9], [10], secure data storage and data sharing protocols [11], [12], [13], [14], [15], [16], access control mechanisms [17], [18], [19], privacy preserving protocols [20], [21], [22], [23], and key management [24], [25], [26], [27]. However, most previous research focuses on authentication to ensure that only a legal user can access its authorized data, which ignores that different users may want to access and share each other's authorized data fields to achieve productive benefits. When a user challenges the cloud server to request other users for data sharing, the access request itself may reveal the user's privacy no matter whether or not it can obtain the data access permissions. In this work, we aim to address a user's sensitive access desire related privacy during data sharing in the cloud environments, and it is significant to design a humanistic security scheme to simultaneously achieve data access control, access authority sharing, and privacy preservation.

In this paper, we address the aforementioned privacy issue to propose a shared authority based privacy-preserving authentication protocol (SAPA) for the cloud data storage, which realizes authentication and authorization without compromising a user's private information.
The main contributions are as follows.

1) Identify a new privacy challenge in cloud storage, and address a subtle privacy issue during a user challenging the cloud server for data sharing, in which the challenged request itself cannot reveal the user's privacy no matter whether or not it can obtain the access authority.
2) Propose an authentication protocol to enhance a user's access request related privacy, and the shared access authority is achieved by an anonymous access request matching mechanism.
3) Apply ciphertext-policy attribute based access control to realize that a user can reliably access its own data fields, and adopt the proxy re-encryption to provide temp authorized data sharing among multiple users.

The remainder of the paper is organized as follows. Section 2 introduces related works. Section 3 introduces the system model, and Section 4 presents the proposed authentication protocol. The universal composability (UC) model based formal security analysis is performed in Section 5. Finally, Section 6 draws a conclusion.

2 RELATED WORK

Dunning and Kresman [11] proposed an anonymous ID assignment based data sharing algorithm (AIDA) for multiparty oriented cloud and distributed computing systems. In the AIDA, an integer data sharing algorithm is designed on top of a secure sum data mining operation, and adopts a variable and unbounded number of iterations for anonymous assignment. Specifically, Newton's identities and Sturm's theorem are used for the data mining, a distributed solution of certain polynomials over finite fields enhances the algorithm scalability, and Markov chain representations are used to determine statistics on the required number of iterations.

Liu et al. [12] proposed a multi-owner data sharing secure scheme (Mona) for dynamic groups in the cloud applications. The Mona aims to realize that a user can securely share its data with other users via the untrusted cloud server, and can efficiently support dynamic group interactions.
In the scheme, a newly granted user can directly decrypt data files without pre-contacting data owners, and user revocation is achieved by a revocation list without updating the secret keys of the remaining users. Access control is applied to ensure that any user in a group can anonymously utilize the cloud resources, and the data owners' real identities can only be revealed by the group manager for dispute arbitration. The results indicate that the storage overhead and encryption computation cost are independent of the number of users.

Fig. 1. Three possible cases during data accessing and data sharing in cloud applications.

Grzonkowski and Corcoran [13] proposed a zero-knowledge proof (ZKP) based authentication scheme for cloud services. Based on the social home networks, a user-centric approach is applied to enable the sharing of personalized content and sophisticated network-based services via TCP/IP infrastructures, in which a trusted third party is introduced for decentralized interactions.

Nabeel et al. [14] proposed a broadcast group key management (BGKM) to improve the weakness of symmetric key cryptosystems in public clouds. With the BGKM, a user need not utilize public key cryptography, and can dynamically derive the symmetric keys during decryption. Accordingly, an attribute based access control mechanism is designed so that a user can decrypt the contents if and only if its identity attributes satisfy the content provider's policies. The fine-grained algorithm applies an access control vector (ACV) for assigning secrets to users based on the identity attributes, and allows the users to derive actual symmetric keys based on their secrets and other public information. The BGKM has an obvious advantage during adding/revoking users and updating access control policies.

Wang et al.
[15] proposed a distributed storage integrity auditing mechanism, which introduces the homomorphic token and distributed erasure-coded data to enhance secure and dependable storage services in cloud computing. The scheme allows users to audit the cloud storage with lightweight communication overheads and computation cost, and the auditing result ensures strong cloud storage correctness and fast data error localization. Towards the dynamic cloud data, the scheme supports dynamic outsourced data operations. The analysis indicates that the scheme is resilient against Byzantine failure, malicious data modification attacks, and server colluding attacks.

Sundareswaran et al. [16] established a decentralized information accountability framework to track the users' actual data usage in the cloud, and proposed an object-centered approach to enable enclosing the logging mechanism with the users' data and policies. The Java ARchives (JAR) programmable capability is leveraged to create a dynamic and mobile object, and to ensure that the users' data access will launch authentication. Additionally, distributed auditing mechanisms are also provided to strengthen the user's data control, and experiments demonstrate the approach's efficiency and effectiveness.

In the aforementioned works, various security issues are addressed. However, a user's subtle access request related privacy problem caused by data accessing and data sharing has not been studied yet in the literature. Here, we identify a new privacy challenge, and propose a protocol not only focusing on authentication to realize valid data accessing, but also considering authorization to provide privacy-preserving access authority sharing. The attribute based access control and proxy re-encryption mechanisms are jointly applied for authentication and authorization.

3 SYSTEM MODEL

Fig. 2 illustrates a system model for the cloud storage architecture, which includes three main network entities: users ($U_x$), a cloud server ($S$), and a trusted third party.

- User.
An individual or group entity, which owns its data stored in the cloud for online data storage and computing. Different users may be affiliated with a common organization, and are assigned independent authorities on certain data fields.
- Cloud server. An entity, which is managed by a particular cloud service provider or cloud application operator to provide data storage and computing services. The cloud server is regarded as an entity with unrestricted storage and computational resources.
- Trusted third party. An optional and neutral entity, which has advanced capabilities on behalf of the users, to perform data public auditing and dispute arbitration.

In the cloud storage, a user remotely stores its data via online infrastructures, platforms, or software for cloud services, which are operated in the distributed, parallel, and cooperative modes. During cloud data accessing, the user autonomously interacts with the cloud server without external interferences, and is assigned the full and independent authority on its own data fields. It is necessary to guarantee that the users' outsourced data cannot be accessed by other users without authorization, and it is of critical importance to protect the private information during the users' data access challenges. In some scenarios, there are multiple users in a system (e.g., supply chain management), and the users could have different affiliation attributes from different interest groups. One of the users may want to access other associated users' data fields to achieve bi-directional data sharing, but it cares about two aspects: whether the aimed user would like to share its data fields, and how to avoid exposing its access request if the aimed user declines or ignores its challenge.
In the paper, we pay more attention to the process of data access control and access authority sharing than to the specific file oriented cloud data management.

In the system model, assume that point-to-point communication channels between users and a cloud server are reliable with the protection of the secure shell protocol (SSH). The related authentication handshakes are not highlighted in the following protocol presentation.

Fig. 2. The cloud storage system model.

LIU ET AL.: SHARED AUTHORITY BASED PRIVACY-PRESERVING AUTHENTICATION PROTOCOL IN CLOUD COMPUTING

Towards the trust model, there are no full trust relationships between a cloud server $S$ and a user $U_x$.

- $S$ is semi-honest and curious. Being semi-honest means that $S$ can be regarded as an entity that appropriately follows the protocol procedure. Being curious means that $S$ may attempt to obtain $U_x$'s private information (e.g., data content, and user preferences). It means that $S$ is under the supervision of its cloud provider or operator, but may be interested in viewing users' privacy. In the passive or honest-but-curious model, $S$ cannot tamper with the users' data to maintain the system normal operation with undetected monitoring.
- $U_x$ is rational and sensitive. Being rational means that $U_x$'s behavior would never be based on experience or emotion, and misbehavior may only occur for selfish interests. Being sensitive means that $U_x$ is reluctant to disclose its sensitive data, but has strong interests in other users' privacy.

Towards the threat model, it covers the possible security threats and system vulnerabilities during cloud data interactions. The communication channels are exposed in public, and both internal and external attacks exist in the cloud applications [15]. The internal attacks mainly refer to the interactive entities (i.e., $S$, and $U_x$).
In this model, $S$ may be self-centered and utilitarian, and aims to obtain more user data contents and the associated user behaviors/habits for the maximization of commercial interests; $U_x$ may attempt to capture other users' sensitive data fields for certain purposes (e.g., curiosity, and malicious intent). The external attacks mainly consider the data CIA triad (i.e., confidentiality, integrity, and availability) threats from outside adversaries, which could compromise the cloud data storage servers, and subsequently modify (e.g., insert, or delete) the users' data fields.

4 THE SHARED AUTHORITY BASED PRIVACY-PRESERVING AUTHENTICATION PROTOCOL

4.1 System Initialization

The cloud storage system includes a cloud server $S$, and users $\{U_x\}$ ($x = \{1, \ldots, m\}$, $m \in \mathbb{N}^*$). Here, $U_a$ and $U_b$ are two users, which have independent access authorities on their own data fields. It means that a user has an access permission for particular data fields stored by $S$, and the user cannot exceed its authority to obtain other users' data fields. Here, we consider $S$ and $\{U_a, U_b\}$ to present the protocol for data access control and access authority sharing with enhanced privacy considerations. The main notations are introduced in Table 1.

Table 1. Notations.

Let $BG = (q, g, h, G, G', e, H)$ be a pairing group, in which $q$ is a large prime, $\{G, G'\}$ are of prime order $q$, $G = \langle g \rangle = \langle h \rangle$, and $H$ is a collision-resistant hash function. The bilinear map $e : G \times G \to G'$ satisfies the bilinear non-degenerate properties: i.e., for all $g, h \in G$ and $a, b \in \mathbb{Z}_q^*$, it turns out that $e(g^a, h^b) = e(g, h)^{ab}$, and $e(g, h) \neq 1$. Meanwhile, $e(g, h)$ can be efficiently obtained for all $g, h \in G$, and it is a generator of $G'$.

Let $S$ and $U_x$ respectively own the key pairs $\{pk_S, sk_S\}$ and $\{pk_{U_x}, sk_{U_x}\}$. Besides, $S$ is assigned all users' public keys $\{pk_{U_1}, \ldots, pk_{U_m}\}$, and $U_x$ is assigned $pk_S$. Here, the public key $pk_t = g^{sk_t} \pmod q$ ($t \in \{S, U_x\}$) and the corresponding private key $sk_t \in \mathbb{Z}_q^*$ are defined according to the generator $g$.

Let $F(R^{U_y}_{U_x} (R^{U_x}_{U_y})^T) = Cont \in \mathbb{Z}_q$ describe the algebraic relation of $\{R^{U_y}_{U_x}, R^{U_x}_{U_y}\}$, which are mutually inverse access requests challenged by $\{U_x, U_y\}$, and $Cont$ is a constant. Here, $F(\cdot)$ is a collision-resistant function: for any randomized polynomial time algorithm $\mathcal{A}$, there is a negligible function $p(k)$ for a sufficiently large value $k$:

\[
\Pr\left[ \{(x, x'), (y, y')\} \leftarrow \mathcal{A}(1^k) : (x \neq x', y \neq y') \wedge F\left(R^{U_x}_{U_y} \left(R^{U_{y'}}_{U_{x'}}\right)^T\right) = Cont \right] \le p(k).
\]

Note that $R^{U_\ast}_{U_y}$ is an $m$-dimensional Boolean vector, in which only the $\ast$-th pointed element and the $y$-th self element are 1, and other elements are 0. It turns out that:

- $F(R^{U_y}_{U_x} (R^{U_x}_{U_y})^T) = F(2) = Cont$ means that both $U_x$ and $U_y$ are interested in each other's data fields, and the two access requests are matched;
- $F(R^{U_y}_{U_x} (R^{U_{\tilde{x}}}_{U_y})^T) = F(R^{U_{\tilde{y}}}_{U_x} (R^{U_x}_{U_y})^T) = F(1)$ means that only one user (i.e., $U_x$ or $U_y$) is interested in the other's data fields, and the access requests are not matched. Note that $U_{\tilde{x}}$/$U_{\tilde{y}}$ represents that the user is not $U_x$/$U_y$;
- $F(R^{U_{\tilde{y}}}_{U_x} (R^{U_{\tilde{x}}}_{U_y})^T) = F(0)$ means that neither $U_x$ nor $U_y$ is interested in each other's data fields, and the two access requests are not matched.

Let $A$ be the attribute set, there are $n$ attributes $A = \{A_1, A_2, \ldots, A_n\}$ for all users, and $U_x$ has its own attribute set $A_{U_x} \subseteq A$ for data accessing. Let $A_{U_x}$ and $P_{U_x}$ be monotone Boolean matrices to represent $U_x$'s data attribute access list and data access policy.

- Assume that $U_x$ has $A_{U_x} = [a_{ij}]_{n \times m}$, which satisfies that $a_{ij} = 1$ for $A_i \in A_{U_x}$, and $a_{ij} = 0$ for $A_i \notin A_{U_x}$.
- Assume that $S$ owns $P_{U_x} = [p_{ij}]_{n \times m}$, which is applied to restrain $U_x$'s access authority, and satisfies that $p_{ij} = 1$ for $A_i \in P_{U_x}$, and $p_{ij} = 0$ for $A_i \notin P_{U_x}$. If $a_{ij} \le p_{ij}$ holds for all $i = \{1, \ldots, n\}$, $j = \{1, \ldots, m\}$, it will be regarded that $A_{U_x}$ is within $P_{U_x}$'s access authority limitation.

Note that full-fledged cryptographic algorithms (e.g., attribute based access control, and proxy re-encryption) can be exploited to support the SAPA.

4.2 The Proposed Protocol Descriptions

Fig. 3 shows the interactions among $\{U_a, U_b, S\}$, in which both $U_a$ and $U_b$ have interests in each other's authorized data fields for data sharing. Note that the presented interactions may not be synchronously launched, and a certain time interval is allowable.

Fig. 3. The shared authority based privacy-preserving authentication protocol.

4.2.1 $\{U_a, U_b\}$'s Access Challenges and $S$'s Responses

$\{U_a, U_b\}$ respectively generate the session identifiers $\{sid_{U_a}, sid_{U_b}\}$, extract the identity tokens $\{T_{U_a}, T_{U_b}\}$, and transmit $\{sid_{U_a} \| T_{U_a}, sid_{U_b} \| T_{U_b}\}$ to $S$ as an access query to initiate a new session. We take the interactions of $U_a$ and $S$ as an example to introduce the following authentication phase. Upon receiving $U_a$'s challenge, $S$ first generates a session identifier $sid_{S_a}$, and establishes the master public key $mpk = (g_i, h, h_i, BG, e(g, h), H)$ and master private key $msk = (a, g)$. Here, $S$ randomly chooses $a \in \mathbb{Z}_q$, and computes $g_i = g^{a^i}$ and $h_i = h^{a^{i-1}}$ ($i = \{1, \ldots, n\} \in \mathbb{Z}^*$).

$S$ randomly chooses $s \in \{0, 1\}^*$, and extracts $U_a$'s access authority policy $P_{U_a} = [p_{ij}]_{n \times m}$ ($p_{ij} \in \{0, 1\}$), and $U_a$ is assigned the access authority on its own data fields $D_{U_a}$ within $P_{U_a}$'s permission. $S$ further defines a polynomial $F_{S_a}(x, P_{U_a})$ according to $P_{U_a}$ and $T_{U_a}$:

\[
F_{S_a}(x, P_{U_a}) = \prod_{i=1, j=1}^{n, m} \left(x + ij H(T_{U_a})\right)^{p_{ij}} \pmod q.
\]

$S$ computes a set of values $\{M_{S_a0}, M_{S_a1}, \{M_{S_a2i}\}, M_{S_a3}, M_{S_a4}\}$ to establish the ciphertext $C_{S_a} = \{M_{S_a1}, \{M_{S_a2i}\}, M_{S_a3}, M_{S_a4}\}$, and transmits $sid_{S_a} \| C_{S_a}$ to $U_a$:

\[
\begin{aligned}
M_{S_a0} &= H(P_{U_a} \| D_{U_a} \| T_{U_a} \| s), \\
M_{S_a1} &= h^{F_{S_a}(a, P_{U_a}) M_{S_a0}}, \\
M_{S_a2i} &= (g_i)^{M_{S_a0}}, \quad (i = 1, \ldots, n), \\
M_{S_a3} &= H\left(e(g, h)^{M_{S_a0}}\right) \oplus s, \\
M_{S_a4} &= H(sid_{U_a} \| s) \oplus D_{U_a}.
\end{aligned}
\]

Similarly, $S$ performs the corresponding operations for $U_b$, including that $S$ randomly chooses $a' \in \mathbb{Z}_q$ and $s' \in \{0, 1\}^*$, establishes $\{g'_i, h'_i\}$, extracts $\{P_{U_b}, D_{U_b}\}$, defines $F_{S_b}(x, P_{U_b})$, and computes $\{M_{S_b0}, M_{S_b1}, \{M_{S_b2i}\}, M_{S_b3}, M_{S_b4}\}$ to establish the ciphertext $C_{S_b}$ for transmission.

4.2.2 $\{U_a, U_b\}$'s Data Access Control

$U_a$ first extracts its data attribute access list $A_{U_a} = [a_{ij}]$ ($a_{ij} \in \{0, 1\}$, $a_{ij} \le p_{ij}$) to re-structure an access list $L_{U_a} = [l_{ij}]_{n \times m}$ for $l_{ij} = p_{ij} - a_{ij}$. $U_a$ also defines a polynomial $F_{U_a}(x, L_{U_a})$ according to $L_{U_a}$ and $T_{U_a}$:

\[
F_{U_a}(x, L_{U_a}) = \prod_{i=1, j=1}^{n, m} \left(x + ij H(T_{U_a})\right)^{l_{ij}} \pmod q.
\]

It turns out that $F_{U_a}(x, L_{U_a})$ satisfies the equation

\[
F_{U_a}(x, L_{U_a}) = \prod_{i=1, j=1}^{n, m} \left(x + ij H(T_{U_a})\right)^{p_{ij} - a_{ij}} = F_{S_a}(x, P_{U_a}) / F_{S_a}(x, A_{U_a}).
\]

Afterwards, $U_a$ randomly chooses $b \in \mathbb{Z}_q$, and the decryption key $k_{A_{U_a}}$ for $A_{U_a}$ can be obtained as follows:

\[
k_{A_{U_a}} = \left(g^{(b+1)/F_{S_a}(a, A_{U_a})}, h^{b-1}\right).
\]

$U_a$ further computes a set of values $\{N_{U_a1}, N_{U_a2}, N_{U_a3}\}$. Here, $f_{S_ai}$ is used to represent $x^i$'s coefficient in $F_{S_a}(x, P_{U_a})$, and $f_{U_ai}$ is used to represent $x^i$'s coefficient in $F_{U_a}(x, L_{U_a})$:

\[
\begin{aligned}
N_{U_a1} &= e\left(M_{S_a21}, \prod_{i=1}^{n} (h_i)^{f_{U_ai}} h^{f_{U_a0}}\right), \\
N_{U_a2} &= e\left(\prod_{i=1}^{n} (M_{S_a2i})^{f_{U_ai}}, h^{b-1}\right), \\
N_{U_a3} &= e\left(g^{(b+1)/F_{S_a}(a, A_{U_a})}, M_{S_a1}\right).
\end{aligned}
\]

It turns out that $e(g, h)^{M_{S_a0}}$ satisfies the equation

\[
e(g, h)^{M_{S_a0}} = \left(\frac{N_{U_a3}}{N_{U_a1} N_{U_a2}}\right)^{1/f_{U_a0}}. \qquad (1)
\]

For the right side of (1), we have,

\[
\begin{aligned}
N_{U_a1} &= e\left(g^{a M_{S_a0}}, \prod_{i=1}^{n} (h_i)^{f_{U_ai}} h^{f_{U_a0}}\right) \\
&= e(g, h)^{a M_{S_a0} \sum_{i=1}^{n} \left(a^{i-1} f_{U_ai} + f_{U_a0}\right)} \\
&= e(g, h)^{M_{S_a0} F_{U_a}(a, L_{U_a})}, \\
N_{U_a2} &= e\left(\prod_{i=1}^{n} g^{a^i M_{S_a0} f_{U_ai}}, h^{b-1}\right) \\
&= e(g, h)^{M_{S_a0} \left(\sum_{i=1}^{n} a^i f_{U_ai} + f_{U_a0} - f_{U_a0}\right)(b-1)} \\
&= e(g, h)^{M_{S_a0} b F_{U_a}(a, L_{U_a}) - M_{S_a0} f_{U_a0}}, \\
N_{U_a3} &= e\left(g^{(b+1)/F_{S_a}(a, A_{U_a})}, h^{f_{S_a0} M_{S_a0}} \prod_{i=1}^{n} (h_i)^{f_{S_ai} M_{S_a0}}\right) \\
&= e(g, h)^{\frac{b+1}{F_{S_a}(a, A_{U_a})} F_{S_a}(a, P_{U_a}) M_{S_a0}} \\
&= e(g, h)^{M_{S_a0} b F_{U_a}(a, L_{U_a}) + M_{S_a0} F_{U_a}(a, L_{U_a})}.
\end{aligned}
\]

$U_a$ locally re-computes $\{s^{\ell}, M^{\ell}_{S_a0}\}$, derives its own authorized data fields $D_{U_a}$, and checks whether the ciphertext $C_{S_a}$ is encrypted by $M^{\ell}_{S_a0}$.
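The relation $F_{U_a}(x, L_{U_a}) = F_{S_a}(x, P_{U_a}) / F_{S_a}(x, A_{U_a})$ from the data access control phase can be checked numerically in its rearranged form $F(x, L) \cdot F(x, A) = F(x, P) \pmod q$. The following Python lines are a minimal sketch under stated assumptions, not the authors' implementation: the modulus `Q`, the SHA-256 based stand-in for $H$, the function names, and the example matrices are hypothetical choices made only for this check.

```python
import hashlib

# Toy modulus standing in for the prime q (assumption: 2**61 - 1, a Mersenne prime).
Q = 2**61 - 1


def hash_to_int(token: bytes) -> int:
    """Stand-in for H(T): SHA-256 reduced modulo Q (illustrative choice)."""
    return int.from_bytes(hashlib.sha256(token).digest(), "big") % Q


def poly_eval(x: int, matrix, token: bytes) -> int:
    """Evaluate prod_{i,j} (x + i*j*H(T))^{m_ij} mod Q, with i and j starting at 1."""
    ht = hash_to_int(token)
    result = 1
    for i, row in enumerate(matrix, start=1):
        for j, bit in enumerate(row, start=1):
            if bit:  # entries are 0 or 1, so the exponent only switches a factor on or off
                result = result * ((x + i * j * ht) % Q) % Q
    return result


# Hypothetical policy matrix P and attribute list A with a_ij <= p_ij.
P = [[1, 1, 0],
     [1, 0, 1]]
A = [[1, 0, 0],
     [0, 0, 1]]
# Re-structured access list with l_ij = p_ij - a_ij.
L = [[p - a for p, a in zip(p_row, a_row)] for p_row, a_row in zip(P, A)]

x = 123456789
token = b"identity-token-of-Ua"  # placeholder for T_Ua

lhs = poly_eval(x, L, token) * poly_eval(x, A, token) % Q
rhs = poly_eval(x, P, token)
assert lhs == rhs  # F(x, L) * F(x, A) == F(x, P) (mod Q)
```

Because every entry of $P$ is split between $L$ and $A$, each factor of the product appears exactly once on both sides, which is why the identity holds for any evaluation point.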
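If the check holds, $U_a$ is a legal user that can properly decrypt the ciphertext $C_{S_a}$; otherwise, the protocol terminates. The re-computed values are

\[
\begin{aligned}
s^{\ell} &= M_{S_a3} \oplus H\left(e(g, h)^{M_{S_a0}}\right), \\
M^{\ell}_{S_a0} &= H\left(P_{U_a} \| D_{U_a} \| T_{U_a} \| s^{\ell}\right), \\
D_{U_a} &= M_{S_a4} \oplus H\left(sid_{U_a} \| s^{\ell}\right).
\end{aligned}
\]

$U_a$ further extracts its pseudonym $PID_{U_a}$, a session-sensitive access request $R^{U_b}_{U_a}$, and the public key $pk_{U_a}$. Here, $R^{U_b}_{U_a}$ is introduced to let $S$ know $U_a$'s data access desire. It turns out that $R^{U_b}_{U_a}$ makes $S$ know the facts: 1) $U_a$ wants to access $U_b$'s temp authorized data fields $\breve{D}_{U_b}$; 2) $U_a$ will also agree to share its temp authorized data fields $\breve{D}_{U_a}$ with $U_b$ in the case that $U_b$ grants its request. Afterwards, $U_a$ randomly chooses $r_{U_a} \in \mathbb{Z}_q^*$, computes a set of values $\{M_{U_a0}, M_{U_a1}, M_{U_a2}, M_{U_a3}\}$ to establish a ciphertext $C_{U_a}$, and transmits $C_{U_a}$ to $S$ for further access request matching:

\[
\begin{aligned}
M_{U_a0} &= H(sid_{S_a} \| PID_{U_a}) \oplus R^{U_b}_{U_a}, \\
M_{U_a1} &= g^{pk_{U_a} r_{U_a}}, \\
M_{U_a2} &= e(g, h)^{r_{U_a}}, \\
M_{U_a3} &= h^{r_{U_a}}.
\end{aligned}
\]

Similarly, $U_b$ performs the corresponding operations, including that $U_b$ extracts $A_{U_b}$, and determines $\{L_{U_b}, F_{U_b}(x, L_{U_b}), f_{U_bi}\}$. $U_b$ further randomly chooses $b' \in \mathbb{Z}_q$, and computes the values $\{N_{U_b1}, N_{U_b2}, N_{U_b3}, s'^{\ell}, M^{\ell}_{U_b}\}$ to derive its own data fields $D_{U_b}$. $U_b$ also extracts its pseudonym $PID_{U_b}$ and an access request $R^{U_a}_{U_b}$ to establish a ciphertext $C_{U_b}$ with the elements $\{M_{U_b0}, M_{U_b1}, M_{U_b2}, M_{U_b3}\}$.

4.2.3 $\{U_a, U_b\}$'s Access Request Matching and Data Access Authority Sharing

Upon receiving the ciphertexts $\{C_{U_a}, C_{U_b}\}$ within an allowable time interval, $S$ extracts $\{PID_{U_a}, PID_{U_b}\}$ to derive the access requests $\{R^{U_b}_{U_a}, R^{U_a}_{U_b}\}$:

\[
\begin{aligned}
R^{U_b}_{U_a} &= H(sid_{S_a} \| PID_{U_a}) \oplus M_{U_a0}, \\
R^{U_a}_{U_b} &= H(sid_{S_b} \| PID_{U_b}) \oplus M_{U_b0}.
\end{aligned}
\]

$S$ checks whether $\{R^{U_b}_{U_a}, R^{U_a}_{U_b}\}$ satisfy $F(R^{U_b}_{U_a} (R^{U_a}_{U_b})^T) = F(2) = Cont$.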
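The matching test on the access request vectors can be illustrated with plain inner products. This is a minimal sketch, assuming 0-based user indices and leaving out the collision-resistant function $F$, so the inner product is compared with 2 directly; the helper names and the four-user example are hypothetical.

```python
from typing import List


def request_vector(m: int, requester: int, target: int) -> List[int]:
    """Build the m-dimensional Boolean request vector of `requester`.

    Only the requester's own (self) element and the pointed (target)
    element are set to 1; all other elements stay 0.
    """
    vec = [0] * m
    vec[requester] = 1
    vec[target] = 1
    return vec


def inner_product(r1: List[int], r2: List[int]) -> int:
    """Compute R1 * R2^T for two request vectors of equal length."""
    return sum(a * b for a, b in zip(r1, r2))


def requests_match(r1: List[int], r2: List[int]) -> bool:
    """Two requests are matched only when the inner product equals 2."""
    return inner_product(r1, r2) == 2


m = 4
UA, UB, UC, UD = range(m)

r_a_to_b = request_vector(m, UA, UB)   # Ua asks for Ub's data fields
r_b_to_a = request_vector(m, UB, UA)   # Ub asks for Ua's data fields
r_b_to_c = request_vector(m, UB, UC)   # Ub asks for Uc's data fields instead
r_a_to_c = request_vector(m, UA, UC)   # Ua asks for Uc's data fields
r_b_to_d = request_vector(m, UB, UD)   # Ub asks for Ud's data fields

print(inner_product(r_a_to_b, r_b_to_a))  # mutual interest
print(inner_product(r_a_to_b, r_b_to_c))  # one-sided interest
print(inner_product(r_a_to_c, r_b_to_d))  # no interest in each other
```

In this toy setting the three products are 2, 1, and 0, corresponding to the $F(2)$, $F(1)$, and $F(0)$ cases listed in the system initialization.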
If the check holds, $S$ learns that both $U_a$ and $U_b$ have the access desires to access each other's authorized data, and to share their authorized data fields with each other. $S$ extracts the keys $\{sk_S, pk_{U_a}, pk_{U_b}\}$ to establish the aggregated keys $\{k_S, k_{S_u}\}$ by the Diffie-Hellman key agreement, and computes the available re-encryption key $k_{U_u}$ for $U_u$ ($u \in \{a, b\}$):

\[
\begin{aligned}
k_S &= (pk_{U_a} pk_{U_b})^{sk_S} = g^{(sk_{U_a} + sk_{U_b}) sk_S}, \\
k_{S_u} &= (pk_{U_u})^{sk_S} = g^{sk_{U_u} sk_S}, \\
k_{U_u} &= k_{S_u} / pk_{U_u}.
\end{aligned}
\]

$S$ performs re-encryption to obtain $M'_{U_u1}$. Towards $U_a$/$U_b$, $S$ extracts $U_b$/$U_a$'s temp authorized data fields $\breve{D}_{U_b}$/$\breve{D}_{U_a}$ to compute $M'_{U_b2}$/$M'_{U_a2}$:

\[
\begin{aligned}
M'_{U_u1} &= (M_{U_u1})^{k_{U_u}} = g^{k_{S_u} r_{U_u}}, \\
M'_{U_a2} &= M_{U_a2} E_{k_{S_b}}(\breve{D}_{U_a}), \\
M'_{U_b2} &= M_{U_b2} E_{k_{S_a}}(\breve{D}_{U_b}).
\end{aligned}
\]

Thereafter, $S$ establishes the re-structured ciphertext $C'_{U_u} = (M'_{U_u1}, M'_{U_u2}, M_{U_u3})$, and respectively transmits $\{C'_{U_b} \| k_S, C'_{U_a} \| k_S\}$ to $\{U_a, U_b\}$ for access authority sharing.

Upon receiving the messages, $U_a$ computes $k_{S_a} = (pk_S)^{sk_{U_a}}$, and performs verification by checking the following equation:

\[
e\left(M'_{U_b1}, h\right) \stackrel{?}{=} e\left(g^{k_S / k_{S_a}}, M_{U_b3}\right). \qquad (2)
\]

For the left side of (2), we have,

\[
e\left(M'_{U_b1}, h\right) = e\left(g^{g^{sk_{U_b} sk_S} r_{U_b}}, h\right).
\]

For the right side of (2), we have,

\[
e\left(g^{k_S / k_{S_a}}, M_{U_b3}\right) = e\left(g^{(pk_S)^{sk_{U_b}}}, h^{r_{U_b}}\right) = e(g, h)^{g^{sk_S sk_{U_b}} r_{U_b}}.
\]

$U_a$ derives $U_b$'s temp authorized data fields $\breve{D}_{U_b}$:

\[
\breve{D}_{U_b} = E^{-1}_{k_{S_a}}\left(M'_{U_b2} \, e\left(M'_{U_b1}, h\right)^{-k_{S_a} / k_S}\right).
\]

Similarly, $U_b$ performs the corresponding operations, including that $U_b$ obtains the keys $\{k_S, k_{S_b}\}$, checks $U_a$'s validity, and derives the temp authorized data fields $\breve{D}_{U_a}$.

In the SAPA, $S$ acts as a semi-trusted proxy to realize $\{U_a, U_b\}$'s access authority sharing. During the proxy re-encryption, $\{U_a, U_b\}$ respectively establish ciphertexts $\{M_{U_a1}, M_{U_b1}\}$ by their public keys $\{pk_{U_a}, pk_{U_b}\}$, and $S$ generates the corresponding re-encryption keys $\{k_{U_a}, k_{U_b}\}$ for $\{U_a, U_b\}$.
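The key relations behind the verification of (2) can be checked with ordinary modular exponentiation. The sketch below is only an illustration under assumptions: it works in the multiplicative group modulo a toy Mersenne prime rather than in the pairing group of the protocol, and the modulus `P_MOD`, the base `G`, and all variable names are hypothetical.

```python
import secrets

# Toy parameters (assumptions): a Mersenne prime modulus and a small base.
# The identities checked below hold for any base, generator or not.
P_MOD = 2**127 - 1
G = 3


def keygen():
    """Return a (private, public) key pair with pk = G^sk mod P_MOD."""
    sk = secrets.randbelow(P_MOD - 2) + 1
    return sk, pow(G, sk, P_MOD)


sk_S, pk_S = keygen()   # cloud server S
sk_a, pk_a = keygen()   # user Ua
sk_b, pk_b = keygen()   # user Ub

# Server side: per-user keys and the aggregated key.
k_Sa = pow(pk_a, sk_S, P_MOD)                  # (pk_Ua)^{sk_S}
k_Sb = pow(pk_b, sk_S, P_MOD)                  # (pk_Ub)^{sk_S}
k_S = pow(pk_a * pk_b % P_MOD, sk_S, P_MOD)    # (pk_Ua * pk_Ub)^{sk_S}

# User side: Ua derives the same k_Sa from the server's public key.
assert k_Sa == pow(pk_S, sk_a, P_MOD)

# The aggregated key factors into the two per-user keys ...
assert k_S == k_Sa * k_Sb % P_MOD

# ... so k_S / k_Sa equals k_Sb = (pk_S)^{sk_Ub}, the exponent that
# appears on the right side of (2). pow(x, -1, p) needs Python 3.8+.
k_ratio = k_S * pow(k_Sa, -1, P_MOD) % P_MOD
assert k_ratio == k_Sb == pow(pk_S, sk_b, P_MOD)
```

The last assertion is the step that lets $U_a$ remove its own share from the aggregated key without ever learning $sk_{U_b}$ or $sk_S$.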
Based on the re-encryption keys, the ciphertexts $\{M_{U_a1}, M_{U_b1}\}$ are re-encrypted into $\{M'_{U_a1}, M'_{U_b1}\}$, and $\{U_a, U_b\}$ can decrypt the re-structured ciphertexts $\{M'_{U_b1}, M'_{U_a1}\}$ by their own private keys $\{sk_{U_a}, sk_{U_b}\}$ without revealing any sensitive information.

So far, $\{U_a, U_b\}$ have realized the access authority sharing in the case that both $U_a$ and $U_b$ have the access desires on each other's data fields. Meanwhile, there may be other typical cases when $U_a$ has an interest in $U_b$'s data fields with a challenged access request $R^{U_b}_{U_a}$.

1. In the case that $U_b$ has no interest in $U_a$'s data fields, $U_b$'s access request $R^{U_b}_{U_b}$ and $R^{U_b}_{U_a}$ satisfy $F(R^{U_b}_{U_a} (R^{U_b}_{U_b})^T) = F(1)$. For $U_a$, $S$ will extract dummy data fields $D_{null}$ as a response. $U_b$ will be informed that a certain user is interested in its data fields, but cannot determine $U_a$'s detailed identity for privacy considerations.
2. In the case that $U_b$ has an interest in $U_c$'s data fields rather than $U_a$'s data fields, but $U_c$ has no interest in $U_b$'s data fields, the challenged access requests $R^{U_b}_{U_a}$, $R^{U_c}_{U_b}$, and $R^{U_{\tilde{b}}}_{U_c}$ satisfy $F(R^{U_b}_{U_a} (R^{U_c}_{U_b})^T) = F(R^{U_c}_{U_b} (R^{U_{\tilde{b}}}_{U_c})^T) = F(1)$, in which $U_{\tilde{b}}$ indicates that the user is not $U_b$. $D_{null}$ will be transmitted to $\{U_a, U_b, U_c\}$ without data sharing.

In summary, the SAPA adopts integrative approaches to address secure authority sharing in cloud applications.

- Authentication. The ciphertext-policy attribute based access control and bilinear pairings are introduced for identification between $U_u$ and $S$, and only the legal user can derive the ciphertexts. Additionally, $U_u$ checks the re-computed ciphertexts according to the proxy re-encryption, which realizes flexible data sharing instead of publishing the interactive users' secret keys.
- Data anonymity.
The pseudonyms $PID_{U_u}$ are hidden by the hash function so that other entities cannot derive the real values by inverse operations. Meanwhile, $U_{\tilde{u}}$'s temp authorized fields $\breve{D}_{U_{\tilde{u}}}$ are encrypted by $k_{S_u}$ for anonymous data transmission. Hence, an adversary cannot recognize the data: even if the adversary intercepts the transmitted data, it cannot break the full-fledged cryptographic algorithms.
- User privacy. The access request pointer (e.g., $R^{U_x}_{U_u}$) is wrapped along with $H(sid_{S_u} \| PID_{U_u})$ for privately informing $S$ about $U_u$'s access desires. Only if both users are interested in each other's data fields will $S$ establish the re-encryption key $k_{U_u}$ to realize authority sharing between $U_a$ and $U_b$. Otherwise, $S$ will temporarily reserve the desired access requests for a certain period of time, and cannot accurately determine which user is actively interested in the other user's data fields.
- Forward security. The dual session identifiers $\{sid_{S_u}, sid_{U_u}\}$ and pseudorandom numbers are introduced as session variational operators to keep the communications dynamic. An adversary regards the prior session as random even if $\{S, U_u\}$ get corrupted, or the adversary obtains the PRNG algorithm. The current security compromises cannot be correlated with the prior interrogations.

5 FORMAL SECURITY ANALYSIS WITH THE UNIVERSAL COMPOSABILITY MODEL

5.1 Preliminaries

The universal composability model specifies an approach for security proofs [28], and guarantees that the proofs will remain valid if the protocol is modularly composed with other protocols, and/or under arbitrary concurrent protocol executions. There is a real-world simulation, an ideal-world simulation, and a simulator $Sim$ translating the protocol execution from the real-world to the ideal-world. Additionally, the Byzantine attack model is adopted for security analysis, and all the parties are modeled as probabilistic polynomial-time Turing machines (PPTs), and a PPT captures whatever is external to the protocol executions.
The adversary controls message deliveries in all communication channels, may perform malicious attacks (e.g., eavesdropping, forgery, and replay), and may also initiate new communications to interact with the legal parties.

In the real-world, let $p$ be a real protocol, $P_i$ ($i = \{1, \ldots, I\} \in \mathbb{N}^*$) be real parties, and $\mathcal{A}$ be a real-world adversary. In the ideal-world, let $\mathcal{F}$ be an ideal functionality, $\tilde{P}_i$ be dummy parties, and $\tilde{\mathcal{A}}$ be an ideal-world adversary. $\mathcal{Z}$ is an interactive environment, and communicates with all entities except the ideal functionality $\mathcal{F}$. The ideal functionality acts as an incorruptible trusted party to realize specific protocol functions.

Theorem 1. UC Security. The probability that $\mathcal{Z}$ distinguishes between an interaction of $\mathcal{A}$ with $P_i$ and an interaction of $\tilde{\mathcal{A}}$ with $\tilde{P}_i$ is at most negligible. We have that a real protocol $p$ UC-realizes an ideal functionality $\mathcal{F}$, i.e., $\mathrm{Ideal}_{\mathcal{F}, \tilde{\mathcal{A}}, \mathcal{Z}} \approx \mathrm{Real}_{p, \mathcal{A}, \mathcal{Z}}$.

The UC formalization of the SAPA includes the ideal-world model Ideal, and the real-world model Real.

- Ideal: Define two uncorrupted ideal functionalities $\{\mathcal{F}_{access}, \mathcal{F}_{share}\}$, a dummy party $\tilde{P}$ (e.g., $\tilde{U}_u$, $\tilde{S}$, $u \in \{a, b\}$), and an ideal adversary $\tilde{\mathcal{A}}$. $\{\tilde{P}, \tilde{\mathcal{A}}\}$ cannot establish direct communications. $\tilde{\mathcal{A}}$ can arbitrarily interact with $\mathcal{Z}$, and can corrupt any dummy party $\tilde{P}$, but cannot modify the exchanged messages.
- Real: Define a real protocol $p_{share}$ (run by a party $P$ including $U_u$ and $S$) with a real adversary $\mathcal{A}$ and an environment $\mathcal{Z}$. The real parties can communicate with each other, and $\mathcal{A}$ can fully control the interconnections of $P$ to obtain/modify the exchanged messages. During the protocol execution, $\mathcal{Z}$ is activated first, and the dual session identifiers shared by all the involved parties reflect the protocol state.

5.2 Ideal Functionality

Definition 1. Functionality $\mathcal{F}_{access}$.
$\mathcal{F}_{access}$ is an incorruptible ideal data accessing functionality via available channels, as shown in Table 2.

In $\mathcal{F}_{access}$, a party $P$ (e.g., $U_u$, $S$) is initialized (via input Initialize), and thereby initiates a new session along with generating dual session identifiers $\{sid_{U_u}, sid_{S_u}\}$. $P$ follows the assigned protocol procedure to send (via input Send) and receive (via input Receive) messages. A random number $r_{P_u}$ is generated by $P$ for further computation (via input Generate). Data access control is realized by checking $\{send(\cdot), rec(\cdot), local(\cdot)\}$ (via input Access). If $P$ is controlled by an ideal adversary $\tilde{\mathcal{A}}$, four types of behaviors may be performed: $\tilde{\mathcal{A}}$ may record the exchanged messages on listened channels, and may forward the intercepted messages to $P$ (via request Forward); $\tilde{\mathcal{A}}$ may record the state of authentication between $U_u$ and $S$ to interfere in the normal verification (via request Accept); $\tilde{\mathcal{A}}$ may impersonate a legal party to obtain the full state (via request Forge), and may replay the formerly intercepted messages to involve the ongoing communications (via request Replay).

Definition 2. Functionality $\mathcal{F}_{share}$. $\mathcal{F}_{share}$ is an incorruptible ideal authority sharing functionality, as shown in Table 3.

$\mathcal{F}_{share}$ is activated by $P$ (via input Activate), and the initialization is performed via Initialize of $\mathcal{F}_{access}$. The access request pointers $\{R^{U_b}_{U_a}, R^{U_a}_{U_b}\}$ are respectively published and challenged by $\{U_a, U_b\}$ to indicate their desires (via input Challenge). The authority sharing between $\{U_a, U_b\}$ is realized, and the desired data fields $\{\breve{D}_{U_b}, \breve{D}_{U_a}\}$ are accordingly obtained by $\{U_a, U_b\}$ (via input Share). If $P$ is controlled by an ideal adversary $\tilde{\mathcal{A}}$, $\tilde{\mathcal{A}}$ may detect the exchanged challenged access request pointer $R^{U_x}_{U_u}$ (via request Listen); $\tilde{\mathcal{A}}$ may record the request pointer to interfere in the normal authority sharing between $U_a$ and $U_b$ (via request Forge/Replay).

In the UC model, $\mathcal{F}_{access}$ and $\mathcal{F}_{share}$ formally define the basic components of the ideal-world simulation.

- Party.
Party $P$ refers to multiple users $U_u$ (e.g., $U_a$, $U_b$), and a cloud server $S$ involved in a session. Through a successful session execution, $\{U_u, S\}$ establish authentication and access control, and $\{U_a, U_b\}$ obtain each other's temp authorized data fields for data authority sharing.
- Session identifier. The session identifiers $sid_{U_u}$ and $sid_{S_u}$ are generated for initialization by the environment $\mathcal{Z}$. The ideal adversary $\tilde{\mathcal{A}}$ may control and corrupt the interactions between $U_u$ and $S$.
- Access request pointer. The access request pointer $R^{U_x}_{U_u}$ is applied to indicate $U_u$'s access request on $U_x$'s temp authorized data fields $\breve{D}_{U_x}$.

Table 2. Ideal data accessing functionality: $\mathcal{F}_{access}$.

Table 3. Ideal authority sharing functionality: $\mathcal{F}_{share}$.

5.3 Real Protocol $p_{share}$

A real protocol $p_{share}$ is performed based on the ideal functionalities to realize $\mathcal{F}_{share}$ in the $\mathcal{F}_{access}$-hybrid model.

Upon input $Activate(P)$ at $P$ (e.g., $U_u$, and $S$), $P$ is activated via $\mathcal{F}_{share}$ to trigger a new session, in which Initialize of $\mathcal{F}_{access}$ is applied for initialization and assignment. $\{init(sid_{U_u}, U_u), init(sid_{S_u}, S)\}$ are respectively obtained by $\{U_u, S\}$. Message deliveries are accordingly performed by inputting Send and Receive. Upon input Send from $U_u$, $U_u$ records and outputs $send(sid_{U_u}, U_u)$ via $\mathcal{F}_{access}$. Upon input Receive from $S$, $S$ obtains $rec(sid_{U_u}, S)$ via $\mathcal{F}_{access}$. Upon input $Generate(S)$ from $S$, $S$ randomly chooses a random number $r_{S_u}$ to output $gen(r_{S_u})$ and to establish a ciphertext for access control. Upon input $Generate(U_u)$ from $U_u$, $U_u$ generates a random number $r_{U_u}$ for further checking the validity of $\{A_{U_u}, P_{U_u}\}$. Upon input Access from $U_u$, $U_u$ checks whether $\{send(\cdot), rec(\cdot), local(\cdot)\}$ are matched via $\mathcal{F}_{access}$. If so, output $valid(A_{U_u}, P_{U_u})$; otherwise, output $invalid(A_{U_u}, P_{U_u})$ and terminate the protocol. Upon input $Challenge(U_x)$ from $U_u$, $U_u$ generates an access request pointer $R^{U_x}_{U_u}$, and outputs $chall(R^{U_x}_{U_u})$ to $U_x$.
Upon input Send from Uu, Uu computes a message m_Uu, records and outputs send(m_Uu, Uu) via F_access, in which R^Ux_Uu is wrapped in m_Uu. Upon input Receive from S, S obtains rec(m_Uu, S) for access request matching. Upon input Share(Ḋ_Ub, Ua) and Share(Ḋ_Ua, Ub) from {Ua, Ub}, S checks whether {chall(R^Ub_Ua, Ua), chall(R^Ua_Ub, Ub)} are matched. If it holds, output share(Ḋ_Ub, Ua) to Ua and share(Ḋ_Ua, Ub) to Ub to achieve data sharing. Else, output share(D_null, Ua) to Ua and share(D_null, Ub) to Ub for regular data accessing.

5.4 Security Proof of π_share

Theorem 3. The protocol π_share UC-realizes the ideal functionality F_share in the F_access-hybrid model.

Proof: Let A be a real adversary that interacts with the parties running π_share in the F_access-hybrid model. Let Ã be an ideal adversary such that any environment Z cannot distinguish with a non-negligible probability whether it is interacting with A and π_share in Real or it is interacting with Ã and F_share in Ideal. It means that there is a simulator Sim that translates π_share procedures into Real such that these cannot be distinguished by Z.

Construction of the ideal adversary Ã: The ideal adversary Ã acts as Sim to run the simulated copies of Z, A, and P. Ã correlates runs of π_share from Real into Ideal: the interactions of A and P correspond to the interactions of Ã and P̃. The input of Z is forwarded to A as A’s input, and the output of A (after running π_share) is copied to Ã as Ã’s output.

Simulating the party P. Uu and S are activated and initialized by Activate and Initialize, and Ã simulates as A during interactions.

• Whenever Ã obtains {init(sid_Pu, P), gen(r_Pu, P)} from F_access, Ã transmits the messages to A.

• Whenever Ã obtains {rec(·), send(·)} from F_access, Ã transmits the messages to A, and forwards A’s response forward(sid_Pu, m_Pu, P) to F_access.

• Whenever Ã obtains {init(·), forward(·)} from F_access, S transmits the messages to A, and forwards A’s response accept(P) to F_access.

• Whenever Ã obtains chall(R^Ux_Uu, Uu) from F_share, Ã transmits the message to A, and forwards A’s response listen(R^Ux_Uu, Uu) to F_share.

Simulating the party corruption. Whenever P is corrupted by A, Ã corrupts the corresponding P̃. Ã provides A with the corrupted parties’ internal states.

• Whenever Ã obtains access(D_Uu) from F_access, Ã transmits the message access(D_Uu) to A, and forwards A’s response accept(P) to F_access.

• Whenever Ã obtains chall(R^Ux_Uu, Uu) from F_share, Ã transmits the message to A, and forwards A’s response share(D_null, Uu) to F_share.

Ideal and Real are indistinguishable: Assume that {C_S, C_Uu} respectively indicate the events of corruption of {S, U}. Z invokes Activate and Initialize to launch an interaction. The commands Generate and Access are invoked to transmit access(D_Uu) to Ã, and A responds accept(P) to Ã. Thereafter, Challenge and Share are invoked to transmit share(R^Ux_Uu, Uu), and A responds share(D_null, Uu) to Ã. Note that init(·) independently generates dual session identifiers {sid_Uu, sid_Su}, and the simulations of Real and Ideal are consistent even though Ã may intervene to prevent the data access control and authority sharing in Ideal.
The pseudorandom number generator (introduced in {init(·), gen(·)}), and the collision-resistant hash function (introduced in {access(·), share(·)}) are applied to guarantee that the probability that the environment Z can distinguish the adversary’s behaviors in Ideal and Real is at most negligible. The simulation is performed regardless of whether the event C_S or C_Uu occurs. Therefore, π_share UC-realizes the ideal functionality F_share in the F_access-hybrid model. □

6 CONCLUSION

In this work, we have identified a new privacy challenge during data accessing in the cloud computing to achieve privacy-preserving access authority sharing. Authentication is established to guarantee data confidentiality and data integrity. Data anonymity is achieved since the wrapped values are exchanged during transmission. User privacy is enhanced by anonymous access requests to privately inform the cloud server about the users’ access desires. Forward security is realized by the session identifiers to prevent the session correlation. It indicates that the proposed scheme is possibly applied for privacy preservation in cloud applications.

ACKNOWLEDGMENTS

This work was funded by DNSLAB, China Internet Network Information Center, Beijing 100190, China.

[28] R. Canetti, “Universally Composable Security: A New Paradigm for Cryptographic Protocols,” Proc. 42nd IEEE Symp. Foundations of Computer Science (FOCS ’01), pp. 136-145, Oct. 2001.

Hong Liu is currently working toward the PhD degree at the School of Electronic and Information Engineering, Beihang University, China. She focuses on the security and privacy issues in radio frequency identification, vehicle-to-grid networks, and Internet of Things. Her research interests include authentication protocol design, and security formal modeling and analysis. She is a student member of the IEEE.

Huansheng Ning received the BS degree from Anhui University in 1996 and the PhD degree from Beihang University in 2001. He is a professor in the School of Computer and Communication Engineering, University of Science and Technology Beijing, China. His current research interests include Internet of Things, aviation security, electromagnetic sensing and computing. He has published more than 50 papers in journals, international conferences/workshops. He is a senior member of the IEEE.

Qingxu Xiong received the PhD degree in electrical engineering from Peking University, Beijing, China, in 1994. From 1994 to 1997, he worked in the Information Engineering Department at the Beijing University of Posts and Telecommunications as a postdoctoral researcher. He is currently a professor in the School of Electrical and Information Engineering at the Beijing University of Aeronautics and Astronautics. His research interests include scheduling in optical and wireless networks, performance modeling of wireless networks, and satellite communication. He is a member of the IEEE.

Laurence T. Yang received the BE degree in computer science from Tsinghua University, China, and the PhD degree in computer science from the University of Victoria, Canada. He is a professor in the School of Computer Science and Technology at the Huazhong University of Science and Technology, China, and in the Department of Computer Science, St. Francis Xavier University, Canada. His research interests include parallel and distributed computing, and embedded and ubiquitous/pervasive computing. His research is supported by the National Sciences and Engineering Research Council and the Canada Foundation for Innovation. He is a member of the IEEE.

Security Optimization of Dynamic Networks with Probabilistic Graph Modeling and Linear Programming

05/08/201902/07/2019 by admin

Large organizations need rigorous security tools for analyzing potential vulnerabilities in their networks. However, managing large-scale networks with complex configurations is technically challenging. For example, organizational networks are usually dynamic with frequent configuration changes. These changes may include changes in the availability and connectivity of hosts and other devices, and services added to or removed from the network. Network administrators also need to respond to newly discovered vulnerabilities by applying patches and modifications to the network configuration and security policies, or utilizing defensive security resources to minimize the risk from external attacks. For instance, to prevent a remote attack targeting a host it is useful to analyze the candidate defensive strategies in choosing installation and runtime parameters for one or several intrusion prevention systems. To facilitate a scalable security analysis of organizational networks, attack graphs were proposed. Attack graphs show possible attack paths with respect to a particular network setting, which provide the necessary elements for modeling and improving the security of the network.

Existing work utilizes attack graphs for analyzing security risks by quantifying the graphs with a variety of techniques, such as Bayesian belief propagation, basic laws of probability, and vertex ranking algorithms. These models lack a systematic and scalable computation of optimized network configurations. Current attack graph quantification models assume a network with known and fixed configurations in terms of the connectivity, availability, and policies of the network services and components, disregarding the dynamic nature of modern networks. Moreover, except for a few attempts, previous work has solely focused on computing a numerical representation of the risk without addressing the more challenging problem of risk management and reduction.

In this paper, we present a rigorous probabilistic model that measures the security risk as the probability of success in an attack. Our probabilistic model, referred to as the success measurement model, has three main features: (i) a rigorous and scalable model with a clear probabilistic semantic, (ii) computation of risk probabilities with the goal of finding the maximum attack capabilities, and (iii) consideration of dynamic network features and the availability of mobile devices in the network. As an application of our success measurement model, we formalize the problem of utilizing network security resources as an optimization problem with the goal of computing an optimal placement of security products across a network. Our new contribution is to define this optimization problem and provide an efficient algorithm based on a standard technique called sequential linear programming. Our algorithm is proved to converge, and it is scalable to large networks with thousands of components and attack paths.

Our contributions in this paper include:

• A scalable probabilistic model that uses a Bernoulli model to measure the risk in terms of the probability of success to achieve an attack goal.

• An efficient security optimization model, generated based on a quantified attack graph, to compute an optimal placement of security products according to organizational and technical constraints.

• Modeling dynamic network features for a realistic and accurate analysis of the risk associated with modern networks.

The results of our experiments confirm three key properties of our model. First, the vulnerability values computed from our model are accurate. Our manual inspection of the results confirms that the probability values obtained in the experiments correlate to the vulnerabilities of components in the network. Second, our security improvement method efficiently finds the optimal placement of security products subject to constraints. Third, we quantify the additional vulnerabilities introduced by mobile devices of a dynamic network. Our results indicate that an infected mobile device within the trusted region creates a preferred attack direction towards the attack target, which increases the chance of success at the target host. Our implementation efficiently computes the probabilities throughout large attack graphs with a quadratic execution performance.

1.3 LITERATURE SURVEY

DYNAMIC SECURITY RISK MANAGEMENT USING BAYESIAN ATTACK GRAPHS

AUTHOR: N. Poolsappasit, R. Dewri, and I. Ray

PUBLISH: IEEE Transactions on Dependable and Secure Computing, vol. 9, no. 1, pp. 61–74, Jan 2012.

EXPLANATION:

Security risk assessment and mitigation are two vital processes that need to be executed to maintain a productive IT infrastructure. On one hand, models such as attack graphs and attack trees have been proposed to assess the cause-consequence relationships between various network states, while on the other hand, different decision problems have been explored to identify the minimum-cost hardening measures. However, these risk models do not help reason about the causal dependencies between network states. Further, the optimization formulations ignore the issue of resource availability while analyzing a risk model. In this paper, we propose a risk management framework using Bayesian networks that enable a system administrator to quantify the chances of network compromise at various levels. We show how to use this information to develop a security mitigation and management plan. In contrast to other similar models, this risk model lends itself to dynamic analysis during the deployed phase of the network. A multi objective optimization platform provides the administrator with all trade-off information required to make decisions in a resource constrained environment.

TIME-EFFICIENT AND COST EFFECTIVE NETWORK HARDENING USING ATTACK GRAPHS

AUTHOR: M. Albanese, S. Jajodia, and S. Noel

PUBLISH: Proc. 42nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), June 2012

EXPLANATION:

Attack graph analysis has been established as a powerful tool for analyzing network vulnerability. However, previous approaches to network hardening look for exact solutions and thus do not scale. Further, hardening elements have been treated independently, which is inappropriate for real environments. For example, the cost for patching many systems may be nearly the same as for patching a single one. Or patching a vulnerability may have the same effect as blocking traffic with a firewall, while blocking a port may deny legitimate service. By failing to account for such hardening interdependencies, the resulting recommendations can be unrealistic and far from optimal. Instead, we formalize the notion of hardening strategy in terms of allowable actions, and define a cost model that takes into account the impact of interdependent hardening actions. We also introduce a near-optimal approximation algorithm that scales linearly with the size of the graphs, which we validate experimentally.

MINIMUM-COST NETWORK HARDENING USING ATTACK GRAPHS

AUTHOR: L. Wang, S. Noel, and S. Jajodia

PUBLISH: Computer Communications, vol. 29, no. 18, pp. 3812–3824, Nov. 2006. [Online]. Available: http://dx.doi.org/10.1016/j.comcom.2006.06.018

EXPLANATION:

In defending one’s network against cyber attack, certain vulnerabilities may seem acceptable risks when considered in isolation. But an intruder can often infiltrate a seemingly well-guarded network through a multi-step intrusion, in which each step prepares for the next. Attack graphs can reveal the threat by enumerating possible sequences of exploits that can be followed to compromise given critical resources. However, attack graphs do not directly provide a solution to remove the threat. Finding a solution by hand is error-prone and tedious, particularly for larger and less secure networks whose attack graphs are overly complicated. In this paper, we propose a solution to automate the task of hardening a network against multi-step intrusions. Unlike existing approaches whose solutions require removing exploits, our solution is comprised of initially satisfied conditions only. Our solution is thus more enforceable, because the initial conditions can be independently disabled, whereas exploits are usually consequences of other exploits and hence cannot be disabled without removing the causes. More specifically, we first represent given critical resources as a logic proposition of initial conditions. We then simplify the proposition to make hardening options explicit. Among the options we finally choose solutions with the minimum cost. The key improvements over the preliminary version of this paper include a formal framework of the minimum network hardening problem, and an improved one-pass algorithm in deriving the logic proposition while avoiding logic loops.

CHAPTER 2

2.0 SYSTEM ANALYSIS

2.1 EXISTING SYSTEM:

Existing optimization formulations ignore the issue of resource availability while analyzing a risk model. One existing approach is a risk management framework using Bayesian networks, which enables a system administrator to quantify the chances of network compromise at various levels and to use this information to develop a security mitigation and management plan. In contrast to other similar models, this risk model lends itself to dynamic analysis during the deployed phase of the network. A multi-objective optimization platform provides the administrator with all trade-off information required to make decisions in a resource-constrained environment.

2.1.1 DISADVANTAGES:

Except for a few attempts, previous work has solely focused on computing a numerical representation of the risk without addressing the more challenging problem of risk management and reduction.

Current models assume a network with known and fixed configurations in terms of the connectivity, availability, and policies of the network services and components, disregarding the dynamic nature of modern networks.

None of the previous work considers the effect of device availability on open networks. Furthermore, the optimization and improvement of network configurations addressed in our work have not been previously studied.

Bayesian methods are powerful in computing unobserved facts, such as predicting possible threats. It remains unclear how Bayesian methods can be used to support variability in attacker’s decisions, device availability, and the effect of mobile devices.

2.2 PROPOSED SYSTEM:

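We present a rigorous probabilistic model that measures the security risk as the probability of success in an attack. As an application of this model, we formalize the problem of utilizing network security resources as an optimization problem with the goal of computing an optimal placement of security products across a network. Our new contribution is to define this optimization problem and provide an efficient algorithm based on a standard technique called sequential linear programming. Our algorithm is proved to converge, and it is scalable to large networks with thousands of components and attack paths.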

Our experiments confirm three key properties of our model.

First, the vulnerability values computed from our model are accurate. Our manual inspection of the results confirms that the probability values obtained in the experiments correlate to the vulnerabilities of components in the network. Second, our security improvement method efficiently finds the optimal placement of security products subject to constraints. Third, we quantify the additional vulnerabilities introduced by mobile devices of a dynamic network. Our results indicate that an infected mobile device within the trusted region creates a preferred attack direction towards the attack target, which increases the chance of success at the target host. Our implementation efficiently computes the probabilities throughout large attack graphs with a quadratic execution performance.

2.2.1 ADVANTAGES:

Our probabilistic model, referred to as the success measurement model, has the following main features:

• A rigorous and scalable model with a clear probabilistic semantic.

• Computation of risk probabilities with the goal of finding the maximum attack capabilities.

• An efficient security optimization model, generated based on a quantified attack graph, to compute an optimal placement of security products according to organizational and technical constraints.

• Consideration of dynamic network features and the availability of mobile devices in the network.

• As an application of our success measurement model, a formalization of the problem of utilizing network security resources as an optimization problem with the goal of computing an optimal placement of security products across a network.

• Modeling of dynamic network features for a realistic and accurate analysis of the risk associated with modern networks.

2.3 HARDWARE & SOFTWARE REQUIREMENTS:

2.3.1 HARDWARE REQUIREMENT:

- Processor – Pentium IV
- Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Key Board – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA

2.3.2 SOFTWARE REQUIREMENTS:

Operating System : Windows XP or Win7
Front End : JAVA JDK 1.7
Back End : MS-Access 2007
Document : MS-Office 2007

CHAPTER 3

3.0 SYSTEM DESIGN:

Data Flow Diagram / Use Case Diagram / Flow Diagram:

The DFD is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on these data, and the output data generated by the system.

The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components. These components are the system process, the data used by the process, an external entity that interacts with the system and the information flows in the system.

DFD shows how the information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.

A DFD may be used to represent a system at any level of abstraction. A DFD may be partitioned into levels that represent increasing information flow and functional detail.

NOTATION:

SOURCE OR DESTINATION OF DATA:

External sources or destinations, which may be people or organizations or other entities

DATA STORE:

Here the data referenced by a process is stored and retrieved.

PROCESS:

People, procedures, or devices that produce data. The physical component is not identified.

DATA FLOW:

Data moves in a specific direction from an origin to a destination. The data flow is a “packet” of data.

MODELING RULES:

There are several common modeling rules when creating DFDs:

All processes must have at least one data flow in and one data flow out.
All processes should modify the incoming data, producing new forms of outgoing data.
Each data store must be involved with at least one data flow.
Each external entity must be involved with at least one data flow.
A data flow must be attached to at least one process.
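
The structural rules above can be checked mechanically. The following is a minimal sketch in Python, not part of the original tool; the function name `check_dfd` and the node names in the example are hypothetical. The rule that a process should modify its incoming data concerns the meaning of the diagram and is not checked here.

```python
def check_dfd(processes, stores, externals, flows):
    """Return a list of rule violations; `flows` is a list of (source, target)."""
    problems = []
    # Rule: every process needs at least one flow in and one flow out.
    for proc in sorted(processes):
        if not any(dst == proc for _, dst in flows):
            problems.append(f"process {proc!r} has no incoming flow")
        if not any(src == proc for src, _ in flows):
            problems.append(f"process {proc!r} has no outgoing flow")
    # Rule: every data store and external entity takes part in some flow.
    for node in sorted(stores) + sorted(externals):
        if not any(node in flow for flow in flows):
            problems.append(f"{node!r} is not involved in any flow")
    # Rule: every flow is attached to at least one process.
    for src, dst in flows:
        if src not in processes and dst not in processes:
            problems.append(f"flow {src!r} -> {dst!r} touches no process")
    return problems


# Hypothetical diagram loosely modeled on the tool described in this report.
good = check_dfd(
    processes={"compute_ecsa"},
    stores={"attack_graph_file"},
    externals={"administrator"},
    flows=[("administrator", "compute_ecsa"),
           ("attack_graph_file", "compute_ecsa"),
           ("compute_ecsa", "administrator")],
)
bad = check_dfd(
    processes={"compute_ecsa"},
    stores={"attack_graph_file"},
    externals={"administrator"},
    flows=[("administrator", "attack_graph_file")],
)
print(good)
print(bad)
```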

3.1 ARCHITECTURE DIAGRAM

3.2 DATAFLOW DIAGRAM

UML DIAGRAMS:

3.2 USE CASE DIAGRAM:

3.3 CLASS DIAGRAM:

3.4 SEQUENCE DIAGRAM:





3.5 ACTIVITY DIAGRAM:

CHAPTER 4

4.0 IMPLEMENTATION:

ECSA ATTACK MODEL

Our probabilistic quantification model, referred to as the success measurement model, quantifies the vulnerabilities of networked components and resources by computing the expected chance of successful attack (ECSA) at every attack step, which is represented by an attack graph node. Our security improvement model uses the computed probabilities from the success measurement model to find optimal security defense strategies given a set of available options. The success measurement model requires three sets of inputs: a set of attack steps, a set of network configurations and potential vulnerabilities, and a set of ground facts. The first set includes the steps necessary to execute a targeted attack in a network.

These steps represent intermediate attack goals such as compromising a machine that has internal connectivity with a targeted server. In addition, the attack steps also describe the various parallel choices available to an attacker when pursuing a specific target. The second set includes the network configurations and vulnerability data that collectively provide host software installations, inter-host connectivity, running services and connections, and known or potential software vulnerabilities. The third set contains the ground fact values that describe the vulnerability, availability, and connectivity of various network configurations.

In our implementation, the first two sets of inputs (i.e., the attack steps and the network configuration data) are taken from dependency attack graphs. System administrators use vulnerability assessment tools to explore the configurations and vulnerability data in their networks. The output of such an assessment is provided as an input to attack graph generation tools. Attack graph generation tools (such as MulVAL) often include customized predefined attack step rules that are applied to the configurations and vulnerability data of a network and produce a plain (that is, not quantified) attack graph.

Our contribution is to develop a set of ground fact values that bootstrap the computation of success probabilities throughout an attack graph. The output of the computation based on our success measurement model is the input to the security optimization model (Figure 1). Using the security improvement model, we transform the quantified attack graph from the success measurement model into a mathematical program.

The resulting mathematical program includes an additional set of data that represent various network security defense strategies. In the tool that we developed, the security administrators simply feed this information as logical predicates such as ips_installed(T, E), which describes a potential installation of an intrusion prevention system of type T and security effectiveness E. The effectiveness value E is a score estimated by the system administrator based on prior experiences and available effectiveness data.

We present our success measurement model to compute the expected chance of a successful attack on a network with respect to the attack’s ultimate goal. We first present the definitions of the expected chance of a successful attack (ECSA) followed by the description of an efficient method to compute ECSA values. Our success measurement model computes probabilities as a function of initial belief probabilities without the need for specifying conditional probabilities required by Bayes’ theorem. Our model measures the success of an attacker based on the attack dependencies determined by a logical attack graph.
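
To make the idea of propagating initial belief probabilities through a logical attack graph concrete, here is a minimal sketch in Python (the tool described in this report is written in Java). It is not the ECSA definition from the paper: it assumes that an attack step succeeds only if all of its preconditions hold (product of probabilities), that a goal is reached if any one of its alternatives succeeds (noisy-OR), and that predecessors are independent. The graph, node names, and probabilities are hypothetical.

```python
# Hypothetical toy attack graph. Each node is one of:
#   ("fact", p)      ground fact with initial belief probability p
#   ("and", [...])   attack step that needs every listed precondition
#   ("or",  [...])   goal reachable through any one listed alternative
GRAPH = {
    "vuln_exists":    ("fact", 0.8),
    "net_access":     ("fact", 0.9),
    "phish_success":  ("fact", 0.3),
    "remote_exploit": ("and", ["vuln_exists", "net_access"]),
    "exec_code":      ("or", ["remote_exploit", "phish_success"]),
}


def success_probability(graph, target):
    """Propagate ground-fact probabilities up to `target` (acyclic graph)."""
    memo = {}

    def visit(node):
        if node in memo:
            return memo[node]
        kind, payload = graph[node]
        if kind == "fact":
            p = payload
        elif kind == "and":
            p = 1.0
            for pred in payload:          # all preconditions must hold
                p *= visit(pred)
        elif kind == "or":
            miss = 1.0
            for pred in payload:          # fails only if every alternative fails
                miss *= 1.0 - visit(pred)
            p = 1.0 - miss
        else:
            raise ValueError(f"unknown node kind: {kind}")
        memo[node] = p
        return p

    return visit(target)


goal = success_probability(GRAPH, "exec_code")
print(round(goal, 4))
```

The independence assumption breaks down when two predecessors share an ancestor, which is one reason a real quantification model needs more care than this sketch.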

4.1 ALGORITHM

GNU LINEAR PROGRAMMING KIT

We implemented a tool for our computational procedures (Section 4.3) in Java (with approximately 3500 lines of code). We use GLPK (GNU Linear Programming Kit), a well-known open-source linear programming library, for our SLP-based procedure. Our tool parses an attack graph input file (obtained from MulVAL), computes the ECSA values according to various parameters, and performs security improvement analysis based on a set of improvement options and constraints.

We demonstrate the performance of our implementation. For each graph, we repeat the corresponding experiment to measure the time to compute the final expected chance of a successful attack at the graph’s root vertex. We compute ECSA values for the target graphs using our tool. We run our tool as a single-threaded program on a machine with a 2.4 GHz Intel Core i7 processor and 8 GB of DDR3 memory. All our experiments converged towards the solution within at most 20 iterations. On average, 87.99% of the execution time for Procedure 2 is spent on the Taylor expansion, of which on average 78.27% is spent on symbolic differentiation performed using the DJep Java library for symbolic operations. The Taylor expansion is parallelizable and scales with the number of vertices, hence it can be done efficiently offline.

SLP LINEAR ALGORITHM

For a network configuration w, let Gw be the corresponding attack graph. The complete procedure to compute the ECSA values of nodes (Definition 2) for an attack graph (Definition 1) is given next. To prepare the attack graph for computation, we execute the following procedure. Our procedure is based on a technique called sequential linear programming (SLP). SLP is a standard technique for solving nonlinear optimization problems, which is found to be computationally efficient and converges to an optimal solution.
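
The general shape of an SLP iteration can be sketched as follows: linearize the nonlinear objective with a first-order Taylor expansion at the current point, solve the resulting linear program within a small step bound, accept the step if the true objective decreases, and otherwise shrink the bound. The Python sketch below is an illustration under stated assumptions, not the authors' Procedure 2: it uses finite differences instead of symbolic differentiation, and it solves the linear subproblem greedily (which is exact only because the single budget constraint has unit costs) rather than calling GLPK. The function names, the objective `risk`, and all numbers are hypothetical.

```python
def numeric_gradient(f, x, h=1e-6):
    """Central-difference gradient (a stand-in for symbolic differentiation)."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


def solve_lp_step(grad, x, lower, upper, budget, radius):
    """Solve the linearized subproblem
        minimize    grad . d
        subject to  sum(x + d) <= budget,
                    lower <= x + d <= upper,  |d_i| <= radius.
    With a single unit-cost budget row, a greedy fill is an exact LP optimum."""
    n = len(x)
    lo = [max(-radius, lower[i] - x[i]) for i in range(n)]
    hi = [min(radius, upper[i] - x[i]) for i in range(n)]
    d = list(lo)                                   # start at the lowest allowed step
    slack = max(0.0, budget - sum(x) - sum(lo))    # budget left to hand out
    for i in sorted(range(n), key=lambda k: grad[k]):
        if grad[i] >= 0:
            break                                  # raising x_i cannot lower the model
        gain = min(hi[i] - lo[i], slack)
        d[i] += gain
        slack -= gain
    return d


def slp_minimize(f, x0, lower, upper, budget, radius=0.25, tol=1e-9, max_iter=100):
    x = list(x0)
    for _ in range(max_iter):
        step = solve_lp_step(numeric_gradient(f, x), x, lower, upper, budget, radius)
        candidate = [xi + di for xi, di in zip(x, step)]
        if f(candidate) < f(x) - tol:
            x = candidate            # the linear model predicted a real decrease
        else:
            radius /= 2              # linear model too optimistic: shrink the step
            if radius < 1e-6:
                break
    return x


def risk(x):
    """Hypothetical goal risk: two attack paths, each weakened by protection x_i."""
    return 1.0 - (1.0 - 0.72 * (1.0 - x[0])) * (1.0 - 0.30 * (1.0 - x[1]))


best = slp_minimize(risk, [0.0, 0.0], lower=[0.0, 0.0], upper=[1.0, 1.0], budget=1.0)
print([round(v, 3) for v in best], round(risk(best), 4))
```

In this toy problem the whole protection budget ends up on the more dangerous path. A general implementation would hand the linear subproblem to an LP solver, since the greedy shortcut does not extend to arbitrary constraint sets.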

4.2 MODULES:

NETWORK SECURITY:

PROBABILISTIC MODEL:

GENERATING ATTACK GRAPH:

SECURITY OPTIMIZATION:

4.3 MODULE DESCRIPTION:

NETWORK SECURITY:

Decoy network-accessible resources may be deployed in a network as surveillance and early-warning tools, because such decoys are not normally accessed for legitimate purposes. Techniques used by attackers who attempt to compromise these decoy resources are studied during and after an attack to keep an eye on new exploitation techniques. Such analysis may be used to further tighten the security of the actual network being protected. A decoy can also direct an attacker’s attention away from legitimate servers: it encourages attackers to spend their time and energy on the decoy server while distracting their attention from the data on the real server. Similar to a decoy server, a decoy network is a network set up with intentional vulnerabilities. Its purpose is also to invite attacks so that the attacker’s methods can be studied and that information can be used to increase network security.

PROBABILISTIC MODEL:

Our probabilistic model referred to as the success measurement model has three main features: (i) rigorous and scalable model with a clear probabilistic semantic, (ii) computation of risk probabilities with the goal of finding the maximum attack capabilities, and (iii) considering dynamic network features and the availability of mobile devices in the network.

A key requirement of probabilistic risk assessment is to accurately capture attack step dependencies and correlations. Attack dependencies in the form of attack preconditions are intrinsically captured by our model, because we base our analysis on attack graphs that are formed from the dependency relations among the nodes. Therefore, the probabilities of success are computed by considering the dependency relations determined in an attack graph.

The focus of our experiments is to demonstrate the practicality, feasibility, and accuracy of the model.

Our experiments include novel features such as analyzing networks with less studied but potentially vulnerable devices such as mobile devices and networked printers. To the best of our knowledge, the experiments in the network analysis literature lack this level of detail.

Our model will give system administrators a solid analysis of the security in their networks that will assist in the actual implementation of security features to reduce the likelihood of a successful attack.

GENERATING ATTACK GRAPH:

When a goal node in an attack graph has several dependencies, they form a logical disjunction. In reality, this disjunction indicates that there are multiple attack choices for an attacker towards a specific attack goal. For instance, consider a server that has a local privilege escalation vulnerability (which is exploitable remotely in a multistep attack) and runs a network service with multiple remote vulnerabilities. An attacker must exploit one (or more) of these vulnerabilities to gain privileges on the target server. In the absence of observable evidence, one needs to compute the ECSA of a goal node with a function that correctly captures the probabilities of such attack choices. Our approach is to computationally determine attack choice probabilities according to various attack patterns.
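
One way to read "attack choice probabilities according to various attack patterns" is as a weighting over the alternatives of a goal node, with the goal's expected chance of success being the weighted sum. The Python sketch below illustrates that reading only; the three patterns ("best", "uniform", "proportional") and the function names are hypothetical and are not taken from the paper.

```python
def choice_weights(probabilities, pattern):
    """Return the probability that the attacker picks each alternative."""
    n = len(probabilities)
    if pattern == "best":
        # A well-informed attacker always takes the most promising choice.
        top = max(range(n), key=lambda i: probabilities[i])
        return [1.0 if i == top else 0.0 for i in range(n)]
    if pattern == "proportional":
        # Choices are favored in proportion to their chance of success.
        total = sum(probabilities)
        if total > 0:
            return [p / total for p in probabilities]
        pattern = "uniform"      # nothing can succeed, so all choices are equal
    if pattern == "uniform":
        # An uninformed attacker picks any alternative with equal probability.
        return [1.0 / n] * n
    raise ValueError(f"unknown attack pattern: {pattern}")


def expected_success(probabilities, pattern):
    """Expected chance of success at a goal node under the given pattern."""
    weights = choice_weights(probabilities, pattern)
    return sum(w * p for w, p in zip(weights, probabilities))


alternatives = [0.72, 0.30]      # hypothetical success chances of two exploits
for pattern in ("best", "uniform", "proportional"):
    print(pattern, round(expected_success(alternatives, pattern), 4))
```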

SECURITY OPTIMIZATION:

To achieve our main research goal of reducing the probability of success in an attack, and thus optimizing the overall security of the network, we point out the necessity to model this problem as an optimization problem. Further, we attempt to model an important feature that is to consider the availability of machines in the network. In this section we describe these two contributions of our work as summarized below.

Optimizing the security of networks: given a set of security hardening products (e.g., a host-based firewall), we compute an optimal distribution of these resources subject to given placement constraints. Using the rigorous probabilistic model introduced in Section 4.1, this is the first work in which a logical attack graph (Definition 1) is transformed into a system of linear and nonlinear equations with the global objective of reducing the probability of success on the graph’s ultimate attack goal. This transformation is performed efficiently and naturally, and it directly captures our research goal.

Machine availability and the effect of mobile devices:

Our work is the first to show how to represent and assess devices with variable availability (frequently joining and leaving the network), which is one of the characteristics of mobile devices with variable connectivity. When allocating resources for hardening an organizational network, it is important to install a single security hardening product, or a combination of them, so that the expected chance of a successful attack on the network is minimized. To find the best placement of a set of security products in a network, we extend the attack graph to define a security product as a special fact node referred to as an improvement node, which is a fact node that represents a security hardening product, service, practice, or policy. The objective of solving the optimal placement problem is to compute the effects of various placements of one or more improvement nodes subject to certain constraints, and to choose the placement that minimizes the attack goal’s ECSA value.
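The placement objective can be illustrated with a deliberately small sketch. It assumes that the goal is reachable through several independent hosts, that a hardening product multiplies a host's exploit probability by a fixed mitigation factor, and that an exhaustive search over placements is affordable. None of these assumptions comes from the model above, which uses a system of equations rather than enumeration; all names and numbers are hypothetical.

```java
/** Brute-force placement sketch under simplifying assumptions; not the paper's solver. */
final class PlacementSketch {

    /** Goal probability when the hosts flagged in 'mask' carry the hardening product. */
    static double goalProbability(double[] exploit, double mitigation, int mask) {
        double allFail = 1.0;
        for (int i = 0; i < exploit.length; i++) {
            boolean hardened = (mask & (1 << i)) != 0;
            double p = hardened ? exploit[i] * mitigation : exploit[i];
            allFail *= (1.0 - p);
        }
        return 1.0 - allFail;
    }

    /** Tries every placement of exactly 'products' units; returns the best bitmask, or -1. */
    static int bestPlacement(double[] exploit, double mitigation, int products) {
        if (exploit.length > 20) {
            throw new IllegalArgumentException("exhaustive search is for small examples only");
        }
        int best = -1;
        double bestValue = Double.POSITIVE_INFINITY;
        for (int mask = 0; mask < (1 << exploit.length); mask++) {
            if (Integer.bitCount(mask) != products) {
                continue; // placement constraint: use exactly the available products
            }
            double value = goalProbability(exploit, mitigation, mask);
            if (value < bestValue) {
                bestValue = value;
                best = mask;
            }
        }
        return best;
    }
}
```

Bit i of the returned mask is set when host i should receive a product. The search is exponential in the number of hosts, so it only serves to make the objective concrete.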

CHAPTER 5

5.0 SYSTEM STUDY:

5.1 FEASIBILITY STUDY:

The feasibility of the project is analyzed in this phase and business proposal is put forth with a very general plan for the project and some cost estimates. During system analysis the feasibility study of the proposed system is to be carried out. This is to ensure that the proposed system is not a burden to the company. For feasibility analysis, some understanding of the major requirements for the system is essential.

Three key considerations involved in the feasibility analysis are

ECONOMICAL FEASIBILITY
TECHNICAL FEASIBILITY
SOCIAL FEASIBILITY

5.1.1 ECONOMICAL FEASIBILITY:

This study is carried out to check the economic impact that the system will have on the organization. The amount of funds that the company can invest in the research and development of the system is limited, so the expenditures must be justified. The developed system is well within the budget, which was achieved because most of the technologies used are freely available. Only the customized products had to be purchased.

5.1.2 TECHNICAL FEASIBILITY

This study is carried out to check the technical feasibility, that is, the technical requirements of the system. Any system developed must not place a high demand on the available technical resources, since this would in turn place high demands on the client. The developed system must have modest requirements, as only minimal or no changes are required to implement it.

5.1.3 SOCIAL FEASIBILITY:

This aspect of the study checks the level of acceptance of the system by the user. This includes the process of training the user to use the system efficiently. The user must not feel threatened by the system, but must instead accept it as a necessity. The level of acceptance by users depends solely on the methods employed to educate them about the system and to make them familiar with it. Their confidence must be raised so that they are also able to offer constructive criticism, which is welcomed, as they are the final users of the system.

5.2 SYSTEM TESTING:

Testing is a process of checking whether the developed system works according to the original objectives and requirements. It is a set of activities that can be planned in advance and conducted systematically. Testing is vital to the success of the system. System testing makes the logical assumption that if all the parts of the system are correct, the overall goal will be successfully achieved. Inadequate testing, or no testing at all, leads to errors that may not appear until many months later.

This creates two problems: the time lag between the cause and the appearance of the problem, and the effect of the system errors on the files and records within the system. A small system error can conceivably explode into a much larger problem. Effective testing early in the process translates directly into long-term cost savings from a reduced number of errors. Another reason for system testing is its utility as a user-oriented vehicle before implementation. The best programs are worthless if they do not produce the correct outputs.

5.2.1 UNIT TESTING:

Description	Expected result
Test for application window properties.	All the properties of the windows are to be properly aligned and displayed.
Test for mouse operations.	All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions.

A program represents the logical elements of a system. For a program to run satisfactorily, it must compile and test data correctly and tie in properly with other programs. Achieving an error-free program is the responsibility of the programmer. Program testing checks for two types of errors: syntax and logic. A syntax error is a program statement that violates one or more rules of the language in which it is written. An improperly defined field dimension or an omitted keyword is a common syntax error. These errors are shown through error messages generated by the computer. For logic errors, the programmer must examine the output carefully.

5.2.2 FUNCTIONAL TESTING:

Functional testing of an application is used to prove that the application delivers correct results, using enough inputs to give an adequate level of confidence that it will work correctly for all sets of inputs. The functional testing will need to prove that the application works for each client type and that the personalization functions work correctly. When a program is tested, the actual output is compared with the expected output. When there is a discrepancy, the sequence of instructions must be traced to determine the problem. The process is facilitated by breaking the program into self-contained portions, each of which can be checked at certain key points. The idea is to compare program values against desk-calculated values to isolate the problems.

Description	Expected result
Test for all modules.	All peers should communicate in the group.
Test for various peers in a distributed network framework, checking that all users available in the group are displayed.	The result after execution should be accurate.

5.2.3 NON-FUNCTIONAL TESTING:

Non-functional software testing encompasses a rich spectrum of testing strategies, describing the expected results for every test case. It uses symbolic analysis techniques. This testing is used to check that an application will work in the operational environment. Non-functional testing includes:

Load testing
Performance testing
Usability testing
Reliability testing
Security testing

5.2.4 LOAD TESTING:

An important tool for implementing system tests is a load generator. A load generator is essential for testing quality requirements such as performance and stress. A load can be a real load, that is, the system under test can be exposed to real usage by having actual telephone users connected to it. They will generate test input data for the system test.

Description	Expected result
It is necessary to ascertain that the application behaves correctly under load when a ‘Server busy’ response is received.	Should designate another active node as a server.

5.2.5 PERFORMANCE TESTING:

Performance tests are utilized in order to determine the widely defined performance of the software system such as execution time associated with various parts of the code, response time and device utilization. The intent of this testing is to identify weak points of the software system and quantify its shortcomings.

Description	Expected result
This is required to assure that an application performs adequately, having the capability to handle many peers, delivering its results in the expected time, and using an acceptable level of resources; it is an aspect of operational management.	Should handle large input values and produce accurate results in the expected time.

5.2.6 RELIABILITY TESTING:

Software reliability is the ability of a system or component to perform its required functions under stated conditions for a specified period of time, and this testing ensures it. Reliability can be expressed as the ability of the software to reveal defects under testing conditions, according to the specified requirements. It is the probability that a software system will operate without failure under given conditions for a given time interval, and it focuses on the behavior of the software element. Reliability testing forms part of the work of the software quality control team.

Description	Expected result
This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in providing the application.	In case of failure of the server, an alternate server should take over the job.

5.2.7 SECURITY TESTING:

Security testing evaluates system characteristics that relate to the availability, integrity and confidentiality of the system data and services. Users/Clients should be encouraged to make sure their security needs are very clearly known at requirements time, so that the security issues can be addressed by the designers and testers.

Description	Expected result
Checking that the user identification is authenticated.	In case of failure, the user should not be connected to the framework.
Check whether group keys in a tree are shared by all peers.	The peers should know group key in the same group.

5.2.8 WHITE BOX TESTING:

White box testing, sometimes called glass-box testing, is a test case design method that uses the control structure of the procedural design to derive test cases. White box testing focuses on the internal structure of the software to be tested.

Description	Expected result
Exercise all logical decisions on their true and false sides.	All the logical decisions must be valid.
Execute all loops at their boundaries and within their operational bounds.	All the loops must be finite.
Exercise internal data structures to ensure their validity.	All the data structures must be valid.

5.2.9 BLACK BOX TESTING:

Black box testing, also called behavioral testing, focuses on the functional requirements of the software. That is, black box testing enables the software engineer to derive sets of input conditions that will fully exercise all functional requirements for a program. Black box testing is not an alternative to white box techniques. Rather, it is a complementary approach that is likely to uncover a different class of errors than white box methods. Black box testing attempts to find errors by focusing on the inputs, outputs, and principal function of a software module. The starting point of black box testing is either a specification or code. The contents of the box are hidden, and the stimulated software should produce the desired results.

Description	Expected result
To check for incorrect or missing functions.	All the functions must be valid.
To check for interface errors.	The entire interface must function normally.
To check for errors in data structures or external database access.	Database updates and retrievals must complete successfully.
To check for initialization and termination errors.	All the functions and data structures must be initialized properly and terminated normally.

All of the above system testing strategies are carried out because the development, documentation, and institutionalization of the proposed goals and related policies are essential.

CHAPTER 6

6.0 SOFTWARE DESCRIPTION:

6.1 JAVA TECHNOLOGY:

Java technology is both a programming language and a platform.

The Java Programming Language

The Java programming language is a high-level language that can be characterized by all of the following buzzwords:

- Simple
- Architecture neutral
- Object oriented
- Portable
- Distributed
- High performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure

With most programming languages, you either compile or interpret a program so that you can run it on your computer. The Java programming language is unusual in that a program is both compiled and interpreted. With the compiler, first you translate a program into an intermediate language called Java byte codes —the platform-independent codes interpreted by the interpreter on the Java platform. The interpreter parses and runs each Java byte code instruction on the computer. Compilation happens just once; interpretation occurs each time the program is executed. The following figure illustrates how this works.

You can think of Java byte codes as the machine code instructions for the Java Virtual Machine (Java VM). Every Java interpreter, whether it’s a development tool or a Web browser that can run applets, is an implementation of the Java VM. Java byte codes help make “write once, run anywhere” possible. You can compile your program into byte codes on any platform that has a Java compiler. The byte codes can then be run on any implementation of the Java VM. That means that as long as a computer has a Java VM, the same program written in the Java programming language can run on Windows 2000, a Solaris workstation, or on an iMac.
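As a small illustration of the compile-once, run-anywhere idea, the sketch below can be compiled once with `javac HelloPlatform.java` and the resulting `HelloPlatform.class` byte codes run with `java HelloPlatform` on any machine that has a Java VM. The class name is hypothetical; `os.name` and `os.arch` are standard system properties.

```java
/** Compiled once to byte codes; the same .class file runs on any Java VM. */
final class HelloPlatform {

    /** Describes the platform underneath the Java VM that is running this code. */
    static String describeHost() {
        return System.getProperty("os.name") + " / " + System.getProperty("os.arch");
    }

    public static void main(String[] args) {
        // The printed text differs per machine although the byte codes are identical.
        System.out.println("Running on " + describeHost());
    }
}
```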

6.2 THE JAVA PLATFORM:

A platform is the hardware or software environment in which a program runs. We’ve already mentioned some of the most popular platforms like Windows 2000, Linux, Solaris, and MacOS. Most platforms can be described as a combination of the operating system and hardware. The Java platform differs from most other platforms in that it’s a software-only platform that runs on top of other hardware-based platforms.

The Java platform has two components:

The Java Virtual Machine (Java VM)
The Java Application Programming Interface (Java API)

You’ve already been introduced to the Java VM. It’s the base for the Java platform and is ported onto various hardware-based platforms.

The Java API is a large collection of ready-made software components that provide many useful capabilities, such as graphical user interface (GUI) widgets. The Java API is grouped into libraries of related classes and interfaces; these libraries are known as packages. The next section, “What Can Java Technology Do?”, highlights the functionality that some of the packages in the Java API provide.

The following figure depicts a program that’s running on the Java platform. As the figure shows, the Java API and the virtual machine insulate the program from the hardware.

Native code is code that, once compiled, runs on a specific hardware platform. As a platform-independent environment, the Java platform can be a bit slower than native code. However, smart compilers, well-tuned interpreters, and just-in-time byte code compilers can bring performance close to that of native code without threatening portability.

6.3 WHAT CAN JAVA TECHNOLOGY DO?

The most common types of programs written in the Java programming language are applets and applications. If you’ve surfed the Web, you’re probably already familiar with applets. An applet is a program that adheres to certain conventions that allow it to run within a Java-enabled browser.

However, the Java programming language is not just for writing cute, entertaining applets for the Web. The general-purpose, high-level Java programming language is also a powerful software platform. Using the generous API, you can write many types of programs.

An application is a standalone program that runs directly on the Java platform. A special kind of application known as a server serves and supports clients on a network. Examples of servers are Web servers, proxy servers, mail servers, and print servers. Another specialized program is a servlet.

A servlet can almost be thought of as an applet that runs on the server side. Java Servlets are a popular choice for building interactive web applications, replacing the use of CGI scripts. Servlets are similar to applets in that they are runtime extensions of applications. Instead of working in browsers, though, servlets run within Java Web servers, configuring or tailoring the server.

How does the API support all these kinds of programs? It does so with packages of software components that provide a wide range of functionality. Every full implementation of the Java platform gives you the following features:

The essentials: Objects, strings, threads, numbers, input and output, data structures, system properties, date and time, and so on.
Applets: The set of conventions used by applets.
Networking: URLs, TCP (Transmission Control Protocol), UDP (User Datagram Protocol) sockets, and IP (Internet Protocol) addresses.
Internationalization: Help for writing programs that can be localized for users worldwide. Programs can automatically adapt to specific locales and be displayed in the appropriate language.
Security: Both low level and high level, including electronic signatures, public and private key management, access control, and certificates.
Software components: Known as JavaBeans™, can plug into existing component architectures.
Object serialization: Allows lightweight persistence and communication via Remote Method Invocation (RMI).
Java Database Connectivity (JDBC™): Provides uniform access to a wide range of relational databases.

The Java platform also has APIs for 2D and 3D graphics, accessibility, servers, collaboration, telephony, speech, animation, and more. The following figure depicts what is included in the Java 2 SDK.

6.4 HOW WILL JAVA TECHNOLOGY CHANGE MY LIFE?

We can’t promise you fame, fortune, or even a job if you learn the Java programming language. Still, it is likely to make your programs better and requires less effort than other languages. We believe that Java technology will help you do the following:

Get started quickly: Although the Java programming language is a powerful object-oriented language, it’s easy to learn, especially for programmers already familiar with C or C++.
Write less code: Comparisons of program metrics (class counts, method counts, and so on) suggest that a program written in the Java programming language can be four times smaller than the same program in C++.
Write better code: The Java programming language encourages good coding practices, and its garbage collection helps you avoid memory leaks. Its object orientation, its JavaBeans component architecture, and its wide-ranging, easily extendible API let you reuse other people’s tested code and introduce fewer bugs.
Develop programs more quickly: Your development time may be as much as twice as fast versus writing the same program in C++. Why? You write fewer lines of code and it is a simpler programming language than C++.
Avoid platform dependencies with 100% Pure Java: You can keep your program portable by avoiding the use of libraries written in other languages. The 100% Pure Java™ Product Certification Program has a repository of historical process manuals, white papers, brochures, and similar materials online.
Write once, run anywhere: Because 100% Pure Java programs are compiled into machine-independent byte codes, they run consistently on any Java platform.
Distribute software more easily: You can upgrade applets easily from a central server. Applets take advantage of the feature of allowing new classes to be loaded “on the fly,” without recompiling the entire program.

6.5 ODBC:

Microsoft Open Database Connectivity (ODBC) is a standard programming interface for application developers and database systems providers. Before ODBC became a de facto standard for Windows programs to interface with database systems, programmers had to use proprietary languages for each database they wanted to connect to. Now, ODBC has made the choice of the database system almost irrelevant from a coding perspective, which is as it should be. Application developers have much more important things to worry about than the syntax that is needed to port their program from one database to another when business needs suddenly change.

Through the ODBC Administrator in Control Panel, you can specify the particular database that is associated with a data source that an ODBC application program is written to use. Think of an ODBC data source as a door with a name on it. Each door will lead you to a particular database. For example, the data source named Sales Figures might be a SQL Server database, whereas the Accounts Payable data source could refer to an Access database. The physical database referred to by a data source can reside anywhere on the LAN.

The ODBC system files are not installed on your system by Windows 95. Rather, they are installed when you setup a separate database application, such as SQL Server Client or Visual Basic 4.0. When the ODBC icon is installed in Control Panel, it uses a file called ODBCINST.DLL. It is also possible to administer your ODBC data sources through a stand-alone program called ODBCADM.EXE. There is a 16-bit and a 32-bit version of this program and each maintains a separate list of ODBC data sources.

From a programming perspective, the beauty of ODBC is that the application can be written to use the same set of function calls to interface with any data source, regardless of the database vendor. The source code of the application doesn’t change whether it talks to Oracle or SQL Server. We only mention these two as an example. There are ODBC drivers available for several dozen popular database systems. Even Excel spreadsheets and plain text files can be turned into data sources. The operating system uses the Registry information written by ODBC Administrator to determine which low-level ODBC drivers are needed to talk to the data source (such as the interface to Oracle or SQL Server). The loading of the ODBC drivers is transparent to the ODBC application program. In a client/server environment, the ODBC API even handles many of the network issues for the application programmer.

The advantages of this scheme are so numerous that you are probably thinking there must be some catch. The only disadvantage of ODBC is that it isn’t as efficient as talking directly to the native database interface. ODBC has had many detractors make the charge that it is too slow. Microsoft has always claimed that the critical factor in performance is the quality of the driver software that is used. In our humble opinion, this is true. The availability of good ODBC drivers has improved a great deal recently. And anyway, the criticism about performance is somewhat analogous to those who said that compilers would never match the speed of pure assembly language. Maybe not, but the compiler (or ODBC) gives you the opportunity to write cleaner programs, which means you finish sooner. Meanwhile, computers get faster every year.

6.6 JDBC:

In an effort to set an independent database standard API for Java, Sun Microsystems developed Java Database Connectivity, or JDBC. JDBC offers a generic SQL database access mechanism that provides a consistent interface to a variety of RDBMSs. This consistent interface is achieved through the use of “plug-in” database connectivity modules, or drivers. If a database vendor wishes to have JDBC support, the vendor must provide the driver for each platform that the database and Java run on.

To gain a wider acceptance of JDBC, Sun based JDBC’s framework on ODBC. As you discovered earlier in this chapter, ODBC has widespread support on a variety of platforms. Basing JDBC on ODBC will allow vendors to bring JDBC drivers to market much faster than developing a completely new connectivity solution.

JDBC was announced in March of 1996. It was released for a 90 day public review that ended June 8, 1996. Because of user input, the final JDBC v1.0 specification was released soon after.

The remainder of this section will cover enough information about JDBC for you to know what it is about and how to use it effectively. This is by no means a complete overview of JDBC. That would fill an entire book.

6.7 JDBC Goals:

Few software packages are designed without goals in mind. JDBC is one that, because of its many goals, drove the development of the API. These goals, in conjunction with early reviewer feedback, have finalized the JDBC class library into a solid framework for building database applications in Java.

The goals that were set for JDBC are important. They will give you some insight as to why certain classes and functionalities behave the way they do. The design goals for JDBC are as follows:

SQL Level API

The designers felt that their main goal was to define a SQL interface for Java. Although not the lowest database interface level possible, it is at a low enough level for higher-level tools and APIs to be created. Conversely, it is at a high enough level for application programmers to use it confidently. Attaining this goal allows for future tool vendors to “generate” JDBC code and to hide many of JDBC’s complexities from the end user.

SQL Conformance

SQL syntax varies as you move from database vendor to database vendor. In an effort to support a wide variety of vendors, JDBC will allow any query statement to be passed through it to the underlying database driver. This allows the connectivity module to handle non-standard functionality in a manner that is suitable for its users.

JDBC must be implementable on top of common database interfaces

The JDBC SQL API must “sit” on top of other common SQL level APIs. This goal allows JDBC to use existing ODBC level drivers by the use of a software interface. This interface would translate JDBC calls to ODBC and vice versa.

Provide a Java interface that is consistent with the rest of the Java system

Because of Java’s acceptance in the user community thus far, the designers feel that they should not stray from the current design of the core Java system.

Keep it simple

This goal probably appears in all software design goal listings. JDBC is no exception. Sun felt that the design of JDBC should be very simple, allowing for only one method of completing a task per mechanism. Allowing duplicate functionality only serves to confuse the users of the API.

Use strong, static typing wherever possible

Strong typing allows more error checking to be done at compile time, so fewer errors appear at runtime.

Keep the common cases simple

Because the usual SQL calls used by the programmer are, more often than not, simple SELECTs, INSERTs, DELETEs, and UPDATEs, these queries should be simple to perform with JDBC. However, more complex SQL statements should also be possible.
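To make the "common cases" goal concrete, here is a minimal sketch of a parameterized SELECT using only the standard `java.sql` interfaces. The table name `cache_entries` and its columns are hypothetical, and the connection is assumed to be opened by the caller with whatever driver is installed.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

/** Sketch of a simple JDBC query; table and column names are illustrative only. */
final class JdbcSelectSketch {

    // Hypothetical schema: cache_entries(id INTEGER, name VARCHAR).
    static final String SQL = "SELECT name FROM cache_entries WHERE id >= ?";

    /** Runs the SELECT on a connection supplied by the caller and collects the names. */
    static List<String> namesFrom(Connection conn, int minId) throws SQLException {
        List<String> names = new ArrayList<>();
        try (PreparedStatement stmt = conn.prepareStatement(SQL)) {
            stmt.setInt(1, minId); // bind the single ? placeholder
            try (ResultSet rows = stmt.executeQuery()) {
                while (rows.next()) {
                    names.add(rows.getString("name"));
                }
            }
        }
        return names;
    }
}
```

A caller would typically obtain the connection from `DriverManager.getConnection(url)`, where the URL format depends on the driver in use. The try-with-resources blocks close the statement and result set even when the query fails.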

Finally, we decided to proceed with the implementation using Java Networking.

For dynamically updating the cache table, we use an MS Access database.

Java is two things: a programming language and a platform.

Java is a high-level programming language that is all of the following:

- Simple
- Architecture-neutral
- Object-oriented
- Portable
- Distributed
- High-performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure

Java is also unusual in that each Java program is both compiled and interpreted. With a compiler, you translate a Java program into an intermediate language called Java byte codes, the platform-independent code that the interpreter parses and runs on the computer.

Compilation happens just once; interpretation occurs each time the program is executed. The figure illustrates how this works.

6.7 NETWORKING TCP/IP STACK:

The TCP/IP stack is shorter than the OSI one:

TCP is a connection-oriented protocol; UDP (User Datagram Protocol) is a connectionless protocol.

IP datagrams:

The IP layer provides a connectionless and unreliable delivery system. It considers each datagram independently of the others. Any association between datagrams must be supplied by the higher layers. The IP layer supplies a checksum that covers only its own header. The header includes the source and destination addresses. The IP layer handles routing through an internet. It is also responsible for breaking up large datagrams into smaller ones for transmission and reassembling them at the other end.

UDP:

UDP is also connectionless and unreliable. What it adds to IP is a checksum for the contents of the datagram and port numbers. These are used to give a client/server model – see later.

TCP:

TCP supplies logic to give a reliable connection-oriented protocol above IP. It provides a virtual circuit that two processes can use to communicate.

Internet addresses

In order to use a service, you must be able to find it. The Internet uses an address scheme for machines so that they can be located. The address is a 32 bit integer which gives the IP address.

Network address:

Class A uses 8 bits for the network address, with 24 bits left over for other addressing. Class B uses 16-bit network addressing. Class C uses 24-bit network addressing. Class D is reserved for multicast groups and is not split into network and host parts.
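The class of an IPv4 address is determined by the leading bits of its first octet. The sketch below encodes the standard classful ranges (which this text does not spell out) and the network prefix lengths given above; the class and method names are hypothetical.

```java
/** Classful IPv4 addressing sketch based on the first octet of the address. */
final class AddressClassSketch {

    /** Returns 'A' to 'E' for a first octet in the range 0..255. */
    static char classOf(int firstOctet) {
        if (firstOctet < 0 || firstOctet > 255) {
            throw new IllegalArgumentException("octet out of range: " + firstOctet);
        }
        if (firstOctet < 128) return 'A'; // leading bit  0
        if (firstOctet < 192) return 'B'; // leading bits 10
        if (firstOctet < 224) return 'C'; // leading bits 110
        if (firstOctet < 240) return 'D'; // leading bits 1110 (multicast)
        return 'E';                       // leading bits 1111 (reserved)
    }

    /** Network prefix length for classes A, B, and C; -1 where no split applies. */
    static int networkBits(char addressClass) {
        switch (addressClass) {
            case 'A': return 8;
            case 'B': return 16;
            case 'C': return 24;
            default:  return -1;
        }
    }
}
```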

Subnet address:

Internally, the UNIX network is divided into sub networks. Building 11 is currently on one sub network and uses 10-bit addressing, allowing 1024 different hosts.

Host address:

8 bits are finally used for host addresses within our subnet. This places a limit of 256 machines that can be on the subnet.

Total address:

The 32 bit address is usually written as 4 integers separated by dots.
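The dotted notation is simply the four bytes of the 32-bit value written in decimal, most significant byte first. A minimal sketch of the conversion in both directions follows; the class and method names are hypothetical.

```java
/** Converts between a 32-bit IPv4 address and its dotted-quad text form. */
final class DottedQuadSketch {

    /** Formats the address as four decimal integers separated by dots. */
    static String toDottedQuad(int address) {
        return ((address >>> 24) & 0xFF) + "."
             + ((address >>> 16) & 0xFF) + "."
             + ((address >>> 8) & 0xFF) + "."
             + (address & 0xFF);
    }

    /** Parses a dotted-quad string back into the 32-bit value. */
    static int fromDottedQuad(String text) {
        String[] parts = text.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("expected four parts: " + text);
        }
        int address = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("octet out of range: " + part);
            }
            address = (address << 8) | octet; // shift earlier octets up, append this one
        }
        return address;
    }
}
```

The unsigned shift `>>>` matters because Java's `int` is signed, and addresses with a first octet of 128 or more are stored as negative values.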

Port addresses

A service exists on a host, and is identified by its port. This is a 16 bit number. To send a message to a server, you send it to the port for that service of the host that it is running on. This is not location transparency! Certain of these ports are “well known”.

Sockets:

A socket is a data structure maintained by the system to handle network connections. A socket is created using the socket call, which returns an integer that is like a file descriptor. In fact, under Windows, this handle can be used with the ReadFile and WriteFile functions.

#include <sys/types.h>

#include <sys/socket.h>

int socket(int family, int type, int protocol);

Here “family” will be AF_INET for IP communications, protocol will be zero, and type will depend on whether TCP or UDP is used. Two processes wishing to communicate over a network create a socket each. These are similar to two ends of a pipe – but the actual pipe does not yet exist.
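In Java, the `java.net` classes wrap the same system facility: `ServerSocket` and `Socket` for TCP, and `DatagramSocket` for UDP. The sketch below creates both ends of a TCP connection on the loopback interface and passes one byte through it. The class and method names are hypothetical, and the example assumes loopback networking is available on the machine.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

/** Two TCP sockets on the loopback interface, forming the two ends of one connection. */
final class LoopbackSocketSketch {

    /** Sends the low 8 bits of 'value' through a TCP connection and returns what arrives. */
    static int sendOverLoopback(int value) {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // Port 0 asks the system for any free port; a backlog of 1 is enough here.
        try (ServerSocket listener = new ServerSocket(0, 1, loopback);
             Socket client = new Socket(loopback, listener.getLocalPort());
             Socket accepted = listener.accept()) {
            client.getOutputStream().write(value);
            client.getOutputStream().flush();
            return accepted.getInputStream().read(); // blocks until the byte arrives
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sendOverLoopback(42));
    }
}
```

The client can connect before `accept` is called because the system queues the pending connection on the listening socket, which is why a single thread is sufficient for this sketch.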

6.8 JFREE CHART:

JFreeChart is a free 100% Java chart library that makes it easy for developers to display professional quality charts in their applications. JFreeChart’s extensive feature set includes:

A consistent and well-documented API, supporting a wide range of chart types;

A flexible design that is easy to extend, and targets both server-side and client-side applications;

Support for many output types, including Swing components, image files (including PNG and JPEG), and vector graphics file formats (including PDF, EPS and SVG);

JFreeChart is “open source” or, more specifically, free software. It is distributed under the terms of the GNU Lesser General Public Licence (LGPL), which permits use in proprietary applications.

6.8.1. Map Visualizations:

Charts showing values that relate to geographical areas. Some examples include: (a) population density in each state of the United States, (b) income per capita for each country in Europe, (c) life expectancy in each country of the world. The tasks in this project include: Sourcing freely redistributable vector outlines for the countries of the world, states/provinces in particular countries (USA in particular, but also other areas);

Creating an appropriate dataset interface (plus default implementation), a renderer, and integrating this with the existing XYPlot class in JFreeChart; Testing, documenting, testing some more, documenting some more.

6.8.2. Time Series Chart Interactivity

Implement a new (to JFreeChart) feature for interactive time series charts — to display a separate control that shows a small version of ALL the time series data, with a sliding “view” rectangle that allows you to select the subset of the time series data to display in the main chart.

6.8.3. Dashboards

There is currently a lot of interest in dashboard displays. Create a flexible dashboard mechanism that supports a subset of JFreeChart chart types (dials, pies, thermometers, bars, and lines/time series) that can be delivered easily via both Java Web Start and an applet.

6.8.4. Property Editors

The property editor mechanism in JFreeChart only handles a small subset of the properties that can be set for charts. Extend (or reimplement) this mechanism to provide greater end-user control over the appearance of the charts.

CHAPTER 8

8.1 CONCLUSION & FUTURE WORK:

In this work we formalized, implemented, and evaluated a new probabilistic model for measuring the security threats in large enterprise networks. The novelty of our work is the ability to quantitatively analyze the chance of successful attack in the presence of uncertainties about the configuration of a dynamic network and routes of potential attacks.

Our results indicate that an infected mobile device within the trusted region creates a preferred attack direction towards the attack target, which increases the chance of success at the target host. Our implementation efficiently computes the probabilities throughout large attack graphs, with quadratic execution time.

For future work, we plan to utilize and extend our success measurement model and optimal security placement algorithm to solve more complex network security optimization problems. For instance, an important open issue is noise elimination in the initial belief set of values; solving it would lead to more accurate results.

Secure and Distributed Data Discovery and Dissemination in Wireless Sensor Networks

05/08/201902/07/2019 by admin

Abstract: A data discovery and dissemination protocol for wireless sensor networks (WSNs) is responsible for updating configuration parameters of, and distributing management commands to, the sensor nodes. All existing data discovery and dissemination protocols suffer from two drawbacks. First, they are based on the centralized approach; only the base station can distribute data items. Such an approach is not suitable for emergent multi-owner-multi-user WSNs. Second, those protocols were not designed with security in mind and hence adversaries can easily launch attacks to harm the network. This paper proposes the first secure and distributed data discovery and dissemination protocol named DiDrip. It allows the network owners to authorize multiple network users with different privileges to simultaneously and directly disseminate data items to the sensor nodes. Moreover, as demonstrated by our theoretical analysis, it addresses a number of possible security vulnerabilities that we have identified. Extensive security analysis shows that DiDrip is provably secure. We also implement DiDrip in an experimental network of resource-limited sensor nodes to show its high efficiency in practice.

Index Terms: Distributed data discovery and dissemination, security, wireless sensor networks, efficiency

1 INTRODUCTION

After a wireless sensor network (WSN) is deployed, there is usually a need to update buggy/old small programs or parameters stored in the sensor nodes. This can be achieved by the so-called data discovery and dissemination protocol, which enables a source to inject small programs, commands, queries, and configuration parameters into sensor nodes. Note that it is different from the code dissemination protocols (also referred to as data dissemination or reprogramming protocols) [1], [2], which distribute large binaries to reprogram the whole network of sensors.
For example, efficiently disseminating a binary file of tens of kilobytes requires a code dissemination protocol, while disseminating several 2-byte configuration parameters requires a data discovery and dissemination protocol. Considering that the sensor nodes could be distributed in a harsh environment, remotely disseminating such small data to the sensor nodes through the wireless channel is a preferable and more practical approach than manual intervention.

In the literature, several data discovery and dissemination protocols [3], [4], [5], [6] have been proposed for WSNs. Among them, DHV [3], DIP [5] and Drip [4] are regarded as the state-of-the-art protocols and have been included in the TinyOS distributions. All proposed protocols assume that the operating environment of the WSN is trustworthy and has no adversary. However, in reality, adversaries exist and pose threats to the normal operation of WSNs [7], [8]. This issue has only been addressed recently by [7], which identifies the security vulnerabilities of Drip and proposes an effective solution.

More importantly, all existing data discovery and dissemination protocols [3], [4], [5], [6], [7] employ the centralized approach in which, as shown in the top sub-figure in Fig. 1, data items can only be disseminated by the base station. Unfortunately, this approach suffers from a single point of failure, as dissemination is impossible when the base station is not functioning or when the connection between the base station and a node is broken. In addition, the centralized approach is inefficient, non-scalable, and vulnerable to security attacks that can be launched anywhere along the communication path [2]. Even worse, some WSNs do not have any base station at all. For example, for a WSN monitoring human trafficking at a country’s border or a WSN deployed in a remote area to monitor illicit crop cultivation, a base station becomes an attractive target to be attacked.
For such networks, data dissemination is better carried out by authorized network users in a distributed manner.

Additionally, distributed data discovery and dissemination is an increasingly relevant matter in WSNs, especially in the emergent context of shared sensor networks, where sensing/communication infrastructures from multiple owners will be shared by applications from multiple users. For example, large-scale sensor networks are built in recent projects such as Geoss [9], NOPP [10] and ORION [11]. These networks are owned by multiple owners and used by various authorized third-party users. Moreover, it is expected that network owners and different users may have different privileges of dissemination. In this context, distributed operation by network owners and users with different privileges will be a crucial issue, for which efficient solutions are still missing.

D. He is with the School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, P.R. China, and also with the College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, P.R. China. E-mail: hedaojinghit@gmail.com.

S. Chan is with the Department of Electronic Engineering, City University of Hong Kong, Hong Kong SAR, P. R. China. E-mail: eeschan@cityu.edu.hk.

M. Guizani is with Qatar University, Qatar. E-mail: mguizani@ieee.org.

H. Yang is with the School of Computer Science and Engineering, University of Electronic Science and Technology of China, P. R. China. E-mail: haomyang@uestc.edu.cn.

B. Zhou is with the College of Computer Science, Zhejiang University, P. R. China. E-mail: zby@zju.edu.cn.

Manuscript received 31 Dec. 2013; revised 29 Mar. 2014; accepted 31 Mar. 2014. Date of publication 10 Apr. 2014; date of current version 6 Mar. 2015. Recommended for acceptance by V. B. Misic. For information on obtaining reprints of this article, please send e-mail to: reprints@ieee.org, and reference the Digital Object Identifier below. Digital Object Identifier no. 10.1109/TPDS.2014.2316830

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 26, NO. 4, APRIL 2015, p. 1129. 1045-9219 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Motivated by the above observations, this paper has the following main contributions:

1) The need for distributed data discovery and dissemination protocols is not completely new, but previous work did not address this need. We study the functional requirements of such protocols, and set their design objectives. Also, we identify the security vulnerabilities in previously proposed protocols.

2) Based on the design objectives, we propose DiDrip. It is the first distributed data discovery and dissemination protocol, which allows network owners and authorized users to disseminate data items into WSNs without relying on the base station. Moreover, our extensive analysis demonstrates that DiDrip satisfies the security requirements of the protocols of its kind. In particular, we apply the provable security technique to formally prove the authenticity and integrity of the disseminated data items in DiDrip.

3) We demonstrate the efficiency of DiDrip in practice by implementing it in an experimental WSN with resource-limited sensor nodes. This is also the first implementation of a secure and distributed data discovery and dissemination protocol.

The rest of this paper is structured as follows. In Section 2, we first survey the existing data discovery and dissemination protocols, and then discuss their security weaknesses.
Section 3 describes the requirements for a secure and distributed extension of such protocols. Section 4 presents the network, trust and adversary models. Section 5 describes DiDrip in detail. Section 6 provides theoretical analysis of the security properties of DiDrip. Section 7 describes the implementation and experimental results of DiDrip via real sensor platforms. Finally, Section 8 concludes this paper.

2 SECURITY VULNERABILITIES IN DATA DISCOVERY AND DISSEMINATION

2.1 Review of Existing Protocols

The underlying algorithm of both DIP and Drip is Trickle [12]. Initially, Trickle requires each node to periodically broadcast a summary of its stored data. When a node has received an older summary, it sends an update to that source. Once all nodes have consistent data, the broadcast interval is increased exponentially to save energy. However, if a node receives a new summary, it will broadcast this more quickly. In other words, Trickle can disseminate newly injected data very quickly. Among the existing protocols, Drip is the simplest one and it runs an independent instance of Trickle for each data item.

In practice, each data item is identified by a unique key and its freshness is indicated by a version number. For example, for Drip, DIP and DHV, each data item is represented by a 3-tuple <key, version, data>, where key is used to uniquely identify a data item, version indicates the freshness of the data item (the larger the version, the fresher the data), and data is the actual disseminated data (e.g., command, query or parameter).

2.2 Security Vulnerabilities

An adversary can first place some intruder nodes in the network and then use them to alter the data being disseminated or forge a data item. This may result in some important parameters being erased or the entire network being rebooted with wrong data. For example, consider a new data item (key, version, data) being disseminated.
When an intruder node receives this new data item, it can broadcast a malicious data item $(key, version', data')$, where $version' > version$. If $data'$ is set to 0, the parameter identified by key will be erased from all sensor nodes. Alternatively, if $data'$ is different from data, all sensor nodes will update the parameter according to this forged data item. Note that the above attacks can also be launched if an adversary compromises some nodes and has access to their key materials.

In addition, since nodes executing Trickle are required to forward all new data items that they receive, an adversary can launch denial-of-service (DoS) attacks on sensor nodes by injecting a large number of bogus data items. As a result, the processing and energy resources of nodes are expended to process and forward these bogus data items, rather than on the intended functions. Any data discovery and dissemination protocol based on Trickle or its variants is vulnerable to such a DoS attack.

Fig. 1. System overview of centralized and distributed data discovery and dissemination approaches.

3 REQUIREMENTS AND DESIGN CONSIDERATION

A secure and distributed data discovery and dissemination protocol should satisfy the following requirements:

1) Distributed. Multiple authorized users should be allowed to simultaneously disseminate data items into the WSN without relying on the base station.

2) Supporting different user privileges. To provide flexibility, each user may be assigned a certain privilege level by the network owner. For example, a user can only disseminate data items to a set of sensor nodes with specific identities and/or in a specific localized area. Another example is that a user just has the privilege to disseminate data items identified by some specific keys.

3) Authenticity and integrity of data items. A sensor node only accepts data items disseminated by authorized users.
Also, a sensor should be able to ensure that received data items have not been modified during the dissemination process.

4) User accountability. User accountability must be provided since bad user behaviors and insider attacks should be audited and pinpointed. That is, a sender should not be able to deny the distribution of a data item. At the same time, an adversary cannot impersonate any legitimate user even if it has compromised the network owner or the other legitimate users. In many applications, accountability is desirable as it enables collection of users’ activities. For example, from the dissemination record in sensor nodes, the network owner can find out who disseminates most data. This requires the sensor nodes to be able to associate each disseminated data item with the corresponding user’s identity.

5) Node compromise tolerance. The protocol should be resilient to node compromise attacks no matter how many nodes have been compromised, as long as the subset of non-compromised nodes can still form a connected graph with the trusted source.

6) User collusion tolerance. Even if an adversary has compromised some users, a benign node should not grant the adversary any privilege level beyond that of the compromised users.

7) DoS attacks resistance. The functions of the WSN should not be disrupted by DoS attacks.

8) Freshness. A node should be able to differentiate whether an incoming data item is the newest version.

9) Low energy overhead. Most sensor nodes have limited resources. Thus, it is very important that the security functions incur low energy overhead, which can be decomposed into communication and computation overhead.

10) Scalability. The protocol should be efficient even for large-scale WSNs with thousands of sensors and a large user population.

11) Dynamic participation. New sensor nodes and users can be dynamically added to the network.

In order to ensure security, each step of the existing data discovery and dissemination protocol runs should be identified and then protected.
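One step that needs protection is the version check itself. The following is a simplified, hypothetical model in Python (not TinyOS or Drip source code; `DataItem` and `NaiveNode` are illustrative names) of a node that accepts any item carrying a larger version number. It is only meant to show why the forged-item attack of Section 2.2 works when the freshness check is not combined with authentication.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataItem:
    """The 3-tuple <key, version, data> used by Drip-style protocols."""
    key: int
    version: int
    data: bytes


class NaiveNode:
    """Hypothetical node that trusts version numbers without authentication."""

    def __init__(self):
        self.store = {}  # key -> most recent DataItem accepted for that key

    def receive(self, item):
        """Store the item if it is fresher than the stored one; return True if accepted."""
        current = self.store.get(item.key)
        if current is None or item.version > current.version:
            self.store[item.key] = item
            return True
        return False
```

With this model, a legitimate item with version 5 is accepted, a forged item for the same key with version 6 overwrites it, and the legitimate item is then rejected as stale. The sender's identity is never checked.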
In other words, although code dissemination protocols may share the same security requirements as listed above, their security solutions need to be designed in accordance with their characteristics. Consider the well-known open-source code dissemination protocol Deluge [1] as an example. Deluge uses an epidemic protocol based on a page-by-page dissemination strategy for efficient advertisement of metadata. A code image is divided into fixed-size pages, and each page is further split into same-size packets. Due to this way of decomposing code images into packets, our proposed protocol is not applicable for securing Deluge.

The primary challenge of providing security functions in WSNs is the limited capabilities of sensor nodes in terms of computation, energy and storage. For example, to provide an authentication function for disseminated data, a commonly used solution is the digital signature. That is, users digitally sign each packet individually and nodes need to verify the signature before processing it. However, such an asymmetric mechanism incurs significant computational and communication overhead and is not applicable to sensor nodes. To address this problem, TESLA and its various extensions have been proposed [13], [14], which are based on the delayed disclosure of authentication keys, i.e., the key used to authenticate a message is disclosed in the next message. Unfortunately, due to the authentication delay, these mechanisms are vulnerable to a flooding attack which causes each sensor node to buffer all forged data items until the disclosed key is received.

Another possible approach to authentication is symmetric key cryptography. However, this approach is vulnerable to node compromise attacks because once a node is compromised, the globally shared secret keys are revealed. Here we choose digital signatures over other forms for update packet authentication.
That is, the network owner assigns to each network user a public/private key pair that allows the user to digitally sign data items and thus authenticate himself/herself to the sensor nodes. We propose two hybrid approaches to reduce the computation and communication cost. These methods combine digital signatures with an efficient data Merkle hash tree and a data hash chain, respectively. The main idea is that signature generation and verification are carried out over multiple packets instead of individual packets. In this way, the computation cost per packet is significantly reduced. Since elliptic curve cryptography (ECC) is efficient in both computation and communication compared with traditional public key cryptography, DiDrip is based on ECC.

To prevent the network owner from impersonating users, user certificates are issued by a certificate authority of a public key infrastructure (PKI), e.g., a local police office.

4 NETWORK, TRUST AND THREAT MODELS

4.1 Network Model

As shown in the bottom subfigure in Fig. 1, a general WSN comprises a large number of sensor nodes. It is administered by the owner and accessible by many users. The sensor nodes are usually resource constrained with respect to memory space, computation capability, bandwidth, and power supply. Thus, a sensor node can only perform a limited number of public key cryptographic operations during the lifetime of its battery. The network users use mobile devices to disseminate data items into the network. The network owner is responsible for generating keying materials. It can be offline and is assumed to be uncompromisable.

4.2 Trust Model

Network users are assigned dissemination privileges by the trusted authority in a PKI on behalf of the network owner.
However, the network owner may, for various reasons, impersonate network users to disseminate data items.

4.3 Threat Model

The adversary considered in this paper is assumed to be computationally resourceful and can launch a wide range of attacks, which can be classified as external or insider attacks. In external attacks, the adversary has no control of any sensor node in the network. Instead, it would eavesdrop for sensitive information, inject forged messages, launch replay, wormhole and DoS attacks, and impersonate valid sensor nodes. The communication channel may also be jammed by the adversary, but this can only last for a certain period of time after which the adversary will be detected and removed.

By compromising either network users or sensor nodes, the adversary can launch insider attacks on the network. The compromised entities are regarded as insiders because they are members of the network until they are identified. The adversary controls these entities to attack the network in arbitrary ways. For instance, they could be instructed to disseminate false or harmful data, launch attacks such as Sybil attacks or DoS attacks, and be non-cooperative with other nodes.

5 DIDRIP

Referring to the lower sub-figure in Fig. 1, DiDrip consists of four phases: system initialization, user joining, packet pre-processing and packet verification. For our basic protocol, in the system initialization phase, the network owner creates its public and private keys, and then loads the public parameters on each node before the network deployment. In the user joining phase, a user gets the dissemination privilege by registering with the network owner. In the packet pre-processing phase, if a user enters the network and wants to disseminate some data items, he/she will need to construct the data dissemination packets and then send them to the nodes. In the packet verification phase, a node verifies each received packet. If the result is positive, it updates the data according to the received packet.
In the following, each phase is described in detail. The notations used in the description are listed in Table 1. The information processing flow of DiDrip is illustrated in Fig. 2.

5.1 System Initialization Phase

In this phase, an ECC is set up. The network owner carries out the following steps to derive a private key $x$ and some public parameters $\{y, Q, p, q, h(\cdot)\}$. It selects an elliptic curve $E$ over $GF(p)$, where $p$ is a big prime number. Here $Q$ denotes the base point of $E$ while $q$ is also a big prime number and represents the order of $Q$. It then selects the private key $x \in GF(q)$ and computes the public key $y = xQ$. After that, the public parameters are preloaded in each node of the network. We consider 160-bit ECC as an example. In this case, $y$ and $Q$ are both 320 bits long while $p$ and $q$ are 160 bits long.

5.2 User Joining Phase

This phase is invoked when a user with the identity $UID_j$, say $U_j$, hopes to obtain a privilege level. User $U_j$ chooses the private key $SK_j \in GF(q)$ and computes the public key $PK_j = SK_j \cdot Q$. Here the length of $UID_j$ is set to 2 bytes; in this case, it can support 65,536 users. Similarly, assuming that 160-bit ECC is used, $PK_j$ and $SK_j$ are 320 bits and 160 bits long, respectively. Then user $U_j$ sends a 3-tuple $\langle UID_j, Pri_j, PK_j \rangle$ to the network owner, where $Pri_j$ denotes the dissemination privilege of user $U_j$. Upon receiving this message, the network owner generates the certificate $Cert_j$. A form of the certificate consists of the following contents: $Cert_j = \{UID_j, PK_j, Pri_j, SIG_x\{h(UID_j \| PK_j \| Pri_j)\}\}$, where the length of $Pri_j$ is set to 6 bytes; thus the length of $Cert_j$ is 88 bytes.

Table 1. Notations.

Fig. 2. Information processing flow in DiDrip.

5.3 Packet Pre-Processing Phase

Assume that a user, say $U_j$, enters the WSN and wants to disseminate $n$ data items: $d_i = \{key_i, version_i, data_i\}$, $i = 1, 2, \ldots, n$.
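The key and certificate sizes quoted in Sections 5.1 and 5.2 follow from simple arithmetic, which the short Python sketch below reproduces. It rests on two assumptions that the text does not spell out: a curve point is stored as two raw field elements (no format prefix byte), and the ECDSA signature in the certificate is a raw pair (r, s) of 160-bit integers. The helper names are hypothetical.

```python
ECC_BITS = 160  # example field/order size used in the text


def point_bits(field_bits):
    """Assumption: an uncompressed point is two field elements (x, y)."""
    return 2 * field_bits


def signature_bits(order_bits):
    """Assumption: an ECDSA signature is a raw pair (r, s)."""
    return 2 * order_bits


UID_BYTES = 2  # length of UID_j given in Section 5.2
PRI_BYTES = 6  # length of Pri_j given in Section 5.2

pk_bytes = point_bits(ECC_BITS) // 8       # PK_j: 320 bits
sig_bytes = signature_bits(ECC_BITS) // 8  # SIG_x{...}: 320 bits under the assumption above
cert_bytes = UID_BYTES + pk_bytes + PRI_BYTES + sig_bytes
max_users = 2 ** (8 * UID_BYTES)           # distinct 2-byte identities

print(pk_bytes, sig_bytes, cert_bytes, max_users)
```

Under these assumptions the certificate is 2 + 40 + 6 + 40 = 88 bytes and a 2-byte identity distinguishes 65,536 users, matching the figures in the text.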
For the construction of the packets of the respective data, we have two methods, i.e., the data hash chain and the Merkle hash tree [15].

For the data hash chain approach, a packet, say $P_i$, is composed of the packet header, $d_i$, and the hash value of packet $P_{i+1}$ (i.e., $H_{i+1} = h(P_{i+1})$), which is used to verify the next packet, where $i = 1, \ldots, n-1$. Here each cryptographic hash $H_i$ is calculated over the full packet $P_i$, not just the data portion $d_i$, thereby establishing a chain of hashes. After that, user $U_j$ uses his/her private key $SK_j$ to run an ECDSA sign operation to sign the hash value of the first data packet $h(P_1)$ and then creates an advertisement packet $P_0$, which consists of the packet header, user certificate $Cert_j$, $h(P_1)$ and the signature $SIG_{SK_j}\{h(P_1)\}$. Similarly, the network owner assigns a predefined key to identify this advertisement packet.

With the method of the Merkle hash tree, user $U_j$ builds a Merkle hash tree from the $n$ data items in the following way. All the data items are treated as the leaves of the tree. A new set of internal nodes at the upper level is formed; each internal node is computed as the hash value of the concatenation of two child nodes. This process is continued until the root node $H_{root}$ is formed, resulting in a Merkle hash tree with depth $D = \log_2(n)$. Before disseminating the $n$ data items, user $U_j$ signs the root node with his/her private key $SK_j$ and then transmits the advertisement packet $P_0$ comprising user certificate $Cert_j$, $H_{root}$ and $SIG_{SK_j}\{H_{root}\}$. Subsequently, user $U_j$ disseminates each data item along with the appropriate internal nodes for verification purposes. Note that, as described above, user certificate $Cert_j$ contains user identity information $UID_j$ and dissemination privilege $Pri_j$.
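The two packet constructions described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' nesC/C implementation: packet headers and the ECDSA signature over $h(P_1)$ or the root are omitted, data items have a fixed length so the embedded hash can be split off, Merkle leaves are taken to be hashes of the data items, and $n$ is a power of two. All function names are hypothetical; SHA-1 is used only because the paper names it.

```python
import hashlib


def h(data: bytes) -> bytes:
    """Hash function h(.); SHA-1 as in the paper's experiments."""
    return hashlib.sha1(data).digest()


def build_hash_chain(items):
    """Build P_i = d_i || h(P_{i+1}), working backwards from P_n (which carries no hash)."""
    packets = [b""] * len(items)
    next_hash = b""
    for i in range(len(items) - 1, -1, -1):
        packets[i] = items[i] + next_hash
        next_hash = h(packets[i])
    return packets, next_hash  # next_hash is h(P_1), the value the user signs


def verify_hash_chain(packets, signed_h1, item_len):
    """Check each packet against the hash carried by its predecessor (in-order delivery)."""
    expected = signed_h1
    for packet in packets:
        if h(packet) != expected:
            return False
        expected = packet[item_len:]  # H_{i+1} embedded in P_i
    return True


def build_merkle_tree(items):
    """Return all tree levels, leaves first; the last level holds the root."""
    level = [h(d) for d in items]
    levels = [level]
    while len(level) > 1:
        level = [h(level[k] + level[k + 1]) for k in range(0, len(level), 2)]
        levels.append(level)
    return levels


def merkle_proof(levels, index):
    """Sibling hashes on the path from leaf `index` to the root."""
    proof = []
    for level in levels[:-1]:
        proof.append(level[index ^ 1])
        index //= 2
    return proof


def verify_merkle(item, index, proof, root):
    """Recompute the root from one item and its sibling hashes."""
    node = h(item)
    for sibling in proof:
        node = h(node + sibling) if index % 2 == 0 else h(sibling + node)
        index //= 2
    return node == root
```

In this sketch a Merkle proof for $n = 8$ items contains three hashes, one per tree level, and any single packet can be checked on arrival, whereas the hash chain check needs the packets in order.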
Before the network deployment, the network owner assigns a predefined key to identify this advertisement packet.

5.4 Packet Verification Phase

When a sensor node, say $S_j$, receives a packet either from an authorized user or from its one-hop neighbours, it first checks the packet’s key field.

1) If this is an advertisement packet ($P_0 = \{Cert_j, h(P_1), SIG_{SK_j}\{h(P_1)\}\}$ for the data hash chain method, while $P_0 = \{Cert_j, root, SIG_{SK_j}\{root\}\}$ for the Merkle hash tree method), node $S_j$ first pays attention to the legality of the dissemination privilege $Pri_j$. For example, node $S_j$ needs to check whether its own identity is included in the node identity set of $Pri_j$. If the result is positive, node $S_j$ uses the public key $y$ of the network owner to run an ECDSA verify operation to authenticate the certificate. If the certificate $Cert_j$ is valid, node $S_j$ authenticates the signature. If the signature is valid, for the data hash chain method (respectively, the Merkle hash tree method), node $S_j$ stores $\langle UID_j, H_1 \rangle$ (respectively, $\langle UID_j, root \rangle$) included in the advertisement packet; otherwise, node $S_j$ simply discards the packet.

2) Otherwise, it is a data packet $P_i$, where $i = 1, 2, \ldots, n$. Node $S_j$ executes the following procedure:

For the data hash chain method, node $S_j$ checks the authenticity and integrity of $P_i$ by comparing the hash value of $P_i$ with $H_i$, which has been received in the same round and verified. If the result is positive and the version number is new, node $S_j$ then updates the data identified by the key stored in $P_i$ and replaces its stored $\langle round, H_i \rangle$ by $\langle round, H_{i+1} \rangle$ (here $H_{i+1}$ is included in packet $P_i$); otherwise, $P_i$ is discarded.

For the Merkle hash tree method, node $S_j$ checks the authenticity and integrity of $P_i$ through the already verified root node received in the same round.
If the result is positive and the version number is new, node $S_j$ then updates the data identified by the key stored in $P_i$; otherwise, $P_i$ is discarded.

Remark: To prevent the network owner from impersonating users, system initialization and issue of user certificates can be carried out by the certificate authority of a PKI rather than the network owner.

Comparing the two methods, the data hash chain method incurs less communication overhead than the Merkle hash tree method. In the data hash chain method, only one hash value of a packet is included in each packet. On the contrary, in the Merkle hash tree method, $D$ (the tree depth) hash values are included in each packet. However, a limitation of the data hash chain method is that it works well only in networks with in-sequence packet delivery. Such a limitation does not exist in the Merkle hash tree method since it allows each packet to be immediately authenticated upon its arrival at a node. Therefore, the choice of method depends on this characteristic of the WSN.

5.5 Enhancements

We can enhance the efficiency and security of DiDrip by adding additional mechanisms. Readers are referred to the Appendix of this paper for details.

6 SECURITY ANALYSIS OF DIDRIP

In the following, we will analyze the security of DiDrip to verify that the security requirements mentioned in Section 3 are satisfied.

Distributed. As described in Section 5.2, in order to pass the signature verification of sensor nodes, each user has to submit his/her public key and dissemination privilege to the network owner for registration. In addition, as described above, authorized users are able to carry out dissemination in a distributed manner.

Supporting different user privileges. Activities of network users can be restricted by setting user privilege $Pri_j$, which is contained in the user certificate. Since each user certificate is generated based on $Pri_j$, it will not pass the signature verification at sensor nodes if $Pri_j$ is modified.
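The reason a modified privilege fails verification is that the owner's signature covers the digest $h(UID_j \| PK_j \| Pri_j)$. The Python fragment below illustrates only this digest step; the ECDSA signing and verification themselves are omitted, the all-zero public key is a placeholder, and the field values and the helper name `cert_digest` are hypothetical. Fixed field lengths (2, 40 and 6 bytes) are assumed so that the concatenation is unambiguous.

```python
import hashlib


def cert_digest(uid: bytes, pk: bytes, pri: bytes) -> bytes:
    """Compute h(UID_j || PK_j || Pri_j), the value signed by the network owner."""
    assert len(uid) == 2 and len(pk) == 40 and len(pri) == 6
    return hashlib.sha1(uid + pk + pri).digest()


uid = (7).to_bytes(2, "big")           # hypothetical user identity
pk = bytes(40)                         # placeholder for a 320-bit public key
pri = bytes.fromhex("000000000001")    # hypothetical privilege encoding

signed = cert_digest(uid, pk, pri)     # digest covered by the owner's signature
tampered = cert_digest(uid, pk, bytes.fromhex("0000000000ff"))

print(signed != tampered)
```

Because the digests differ, a signature issued over the original fields does not verify against a certificate whose privilege field has been altered.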
Thus, only the network owner can modify $Pri_j$ and then update the certificate accordingly.

Authenticity and integrity of data items. With the Merkle hash tree method (resp. the data hash chain method), an authorized user signs the root of the Merkle hash tree (resp. the hash value of the first data packet $h(P_1)$) with his/her private key. Using the network owner’s public key, each sensor node can authenticate the user certificate and obtain the user’s public key. Then, using the user’s public key, each node can authenticate the root of the Merkle hash tree (resp. the hash value of the first data packet $h(P_1)$). Subsequently, each node can authenticate other data packets based on the Merkle hash tree (resp. the data hash chain). With the assumption that the network owner cannot be compromised, it is guaranteed that any forged or modified data items can be easily detected by the authentication process.

User accountability. Users’ identities and their dissemination activities are exposed to sensor nodes. Thus, sensor nodes can report such records to the network owner periodically. Since each user certificate is generated according to the user identity, no one except the network owner can modify the user identity contained in a user certificate that passes the authentication. Therefore, users cannot repudiate their activities.

Node compromise and user collusion tolerance. As described above, for the basic protocol, only the public parameters are preloaded in each node. Even for the improved protocol, only the public-key/dissemination-privilege pair of each network user is additionally loaded into the nodes. Therefore, no matter how many sensor nodes are compromised, the adversary just obtains the public parameters and the public-key/dissemination-privilege pair of each user. Clearly, the adversary cannot launch any attack by compromising sensor nodes.
As described in Section 5, even if some users collude, a benign node will not grant any dissemination privilege that is beyond those of the colluding users.

Resistance to DoS attacks. There are DoS attacks against basic DiDrip that exploit: (1) authentication delays, (2) the expensive signature verifications, and (3) the Trickle algorithm. First, with the use of the Merkle hash tree or data hash chain, each node can efficiently authenticate a data packet by a few hash operations. Second, using the message specific puzzle approach, each node can efficiently verify a puzzle solution to filter a fake signature message and to forward a data packet using Trickle without waiting for signature verification. Therefore, all the above DoS attacks are defended against. DiDrip can successfully defeat all three types of DoS attacks even if there are compromised network users and sensor nodes. Indeed, without the private key and the unreleased puzzle keys of the network users, even an inside attacker cannot forge any signature/data packets.

Assurance of freshness. If the privilege of a user allows him/her to disseminate data items to his/her own set of nodes, the version number in each item can ensure the freshness of DiDrip. On the other hand, if a node receives data items from multiple users, the version number can be replaced by a timestamp to indicate the freshness of a data item. More specifically, a timestamp is attached to the root of the Merkle hash tree (or the hash value of the first data packet).

Scalability. Different from centralized approaches, an authorized user can enter the network and then disseminate data items into the targeted sensor nodes. Moreover, as will be demonstrated by our experiments in a testbed with 24 TelosB motes in the next section, the security functions in our protocol have low impact on propagation delay. Note that the increase in propagation delay is dominated by the signature verification time incurred at the one-hop neighboring nodes of the authorized user.
Thus, the proposed protocol is efficient even in a large-scale WSN with thousands of sensor nodes. Also, as shown in Section 5.2, our protocol can support a large number of users.

Also, as described above, DiDrip can achieve dynamic participation. Moreover, in the next section, our implementation results will demonstrate that DiDrip has low energy overhead.

In the following, we give the formal proof of the authenticity and integrity of the disseminated data items in DiDrip based on the three assumptions below:

Assumption 1: There exist pseudo-random functions which are polynomially indistinguishable from truly random functions.

Assumption 2: There exist target collision-resistant (TCR) hash functions [16], i.e., hash functions for which every probabilistic-polynomial-time (PPT) adversary $A$ has negligible probability of winning the following game: $A$ first chooses a message $m$, and then $A$ is given a random function $h(\cdot)$. To win, $A$ must output $m' \neq m$ such that $h(m') = h(m)$. Note that in our scheme, the TCR hash function can be implemented by the common hash functions, such as SHA-1.

Assumption 3: The ECC signature is existentially unforgeable under adaptive chosen-message attacks. Note that in our scheme, the ECC signature can use the standard ECDSA of 160 bits.

Theorem 1. DiDrip achieves the authenticity and integrity of data items, assuming the indistinguishability between pseudo-randomness and true randomness, and assuming that $h(\cdot)$ is a TCR hash function and the ECC signature is existentially unforgeable under adaptive chosen-message attacks.

Proof. Our theorem follows from Theorem A.1 in [16], and thus here we only give a brief proof sketch. To begin with, we assume ECC signature generation and verification guarantee authenticity and integrity of signed messages, and every receiver has obtained an authentic copy of the legitimate sender’s public key. We also assume that $h(\cdot)$ is a TCR hash function.
Therefore, here the security of DiDrip is proven based on the indistinguishability between pseudo-randomness and true randomness.

First, suppose that there exists a PPT adversary A which can defeat the authenticity of data items in DiDrip. This means that A controls the communication links and manages, with non-negligible probability, to deliver a message m to a receiver R, such that the sender S has not sent m but R accepts m as authentic and coming from S. Then there is a PPT adversary B that uses A to break the indistinguishability between pseudo-randomness and true randomness with non-negligible advantage. That is, B gets access to h (as an oracle) and can tell with non-negligible probability whether h is a pseudorandom function (PRF(·)) or a totally random function.

To this end, B can query on inputs x of its choice and be answered with h(x). Hence B first simulates for A a network with a sender S and a receiver R. Then B works by running A in a way similar to that in [17]. Namely, B chooses a number l ∈ {1, …, n} at random, where n is the total number of packets to be sent in the data dissemination. Note that B hopes that A will forge the l-th packet P_l.

B is granted access to the oracle h, which is either PRF(·) or an ideal random function. B can adaptively query an arbitrarily chosen x to the oracle and get the output, which is either PRF(x) or a random value uniformly selected from {0, 1}*. After performing polynomially many queries, B finally decides whether the oracle is PRF(·) or the ideal random function. B wins the game if the decision is correct.

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 26, NO. 4, APRIL 2015

We sketch the argument that B succeeds with non-negligible probability. If h is a truly random function, then A has only negligible probability of successfully forging the packet P_l in the data items.
Therefore, if h is random, then B makes the wrong decision only with negligible probability. On the other hand, we have assumed that if the authentication is done using PRF(·), then A forges some packet with non-negligible probability ε. It follows that if h is PRF(·), then B makes the right decision with probability at least ε/l (which is also non-negligible).

In addition, if the adversary A is able to cause a receiver R to accept a forged packet P_l, this implies that A is able to find a collision h(P′_l) = h(P_l) in the packet P_{l−1}. However, according to Assumption 2, h(·) is a TCR hash function. Moreover, due to the unforgeability of the ECC signature (Assumption 3), it is impossible for A to hand R a forged initial packet from S (for the data hash chain approach, A would have to forge the signature SIG_{SK_j}(P_1); for the Merkle hash tree method, A would have to forge the signature SIG_{SK_j}(root)). Therefore, the above contradictions imply the authenticity and integrity of data items in DiDrip. □

7 IMPLEMENTATION AND PERFORMANCE EVALUATION

We evaluate DiDrip by implementing all components on an experimental test-bed. Also, we choose Drip for performance comparison.

7.1 Implementation and Experimental Setup

We have written programs that execute the functions of the network owner, user, and sensor node. The network owner and user side programs are C programs using OpenSSL [18] and running on laptop PCs (with 2 GB RAM) with different computational power under an Ubuntu 11.04 environment. Also, the sensor node side programs are written in nesC and run on resource-limited motes (MicaZ and TelosB). The MicaZ mote features an 8-bit 8-MHz Atmel microcontroller with 4-kB RAM, 128-kB ROM, and 512 kB of flash memory. Also, the TelosB mote has an 8-MHz CPU, 10-kB RAM, 48-kB ROM, 1 MB of flash memory, and an 802.15.4/ZigBee radio. Our motes run TinyOS 2.x. Additionally, SHA-1 is used, and the key sizes of ECC are set to 128, 160, and 192 bits, respectively.
Throughout this paper, unless otherwise stated, all experiments on PCs (respectively, sensor nodes) were repeated 100,000 times (respectively, 1,000 times) for each measurement in order to obtain accurate average results.

To implement DiDrip with the data hash chain method (resp. the Merkle hash tree method), the following functionalities are added to the user side program of Drip: construction of the data hash chain (resp. Merkle hash tree) of a round of dissemination data, and generation of the signature packet and all data packets. To obtain the version number of each data item, the DisseminatorC and DisseminatorP modules in the Drip nesC library have been modified to provide an interface called DisseminatorVersion. Moreover, the proposed hash tree method is implemented without and with the message specific puzzle approach presented in the Appendix, resulting in two implementations of DiDrip: DiDrip1 and DiDrip2. In DiDrip1, when a node receives a signature/data packet with a new version number, it authenticates the packet before broadcasting it to its next-hop neighbours. On the other hand, in DiDrip2, a node only checks the puzzle solution in the packet before broadcasting the packet. We summarize the pros and cons of all related protocols in Table 2.

Based on the design of DiDrip, we implement the verification function for signature and data packets based on the ECDSA verify function and SHA-1 hash function of the TinyECC 2.0 library [19] and add them to the Drip nesC library. Also, in our experiment, when a network user (i.e., a laptop computer) disseminates data items, it first sends them to the serial port of a specific sensor node in the network, which is referred to as the repeater. Then, the repeater carries out the dissemination on behalf of the user using DiDrip.

Similar to [7], we use a circuit to accurately measure the power consumption of various cryptographic operations executed in a mote.
The Tektronix TDS 3034C digital oscilloscope accurately measures the voltage Vr across the resistor. Denoting the battery voltage as Vb (which is 3 volts in our experiments), the voltage across the mote Vm is then Vb − Vr. Once Vr is measured, the current through the circuit I can be obtained by using Ohm's law. The power consumed by the mote is then Vm·I. By also measuring the execution time of the cryptographic operation, we can obtain the energy consumption of the operation by multiplying the power and the execution time.

7.2 Evaluation Results

The following metrics are used to evaluate DiDrip: memory overhead, execution time of cryptographic operations, propagation delay, and energy overhead. The memory overhead measures the required data space in the implementation. The propagation delay is defined as the time from construction of a data hash chain until the parameters on all sensor nodes corresponding to a round of disseminated data items are updated.

TABLE 2. The Pros and Cons of All Related Protocols

HE ET AL.: SECURE AND DISTRIBUTED DATA DISCOVERY AND DISSEMINATION IN WIRELESS SENSOR NETWORKS

Table 3 shows the execution times of some important operations in DiDrip. For example, the execution times for the system initialization phase and signing a random 20-byte message (i.e., the output of the SHA-1 function) are 1.608 and 0.6348 ms on a 1.8-GHz laptop PC, respectively. Thus, if SHA-1 is used, generating a user certificate or signing a message takes 0.6348 ms on a 1.8-GHz laptop PC.

Fig. 3 shows the execution times of the SHA-1 hash function (extracted from TinyECC 2.0 [19]) on MicaZ and TelosB motes. The inputs to the hash function are randomly generated numbers with length varying from 24 to 156 bytes in increments of 6 bytes. Note that, in our protocol, the hash function is applied to an entire packet. There are several reasons why a packet may contain a few tens of bytes. First, the advertisement packet has additional information such as the certificate and signature.
Second, with the Merkle hash tree method, each packet contains the disseminated data item along with the related internal nodes of the tree for verification purposes. Third, although several bytes is a typical size of a data item, sometimes a disseminated data item may be a bit larger. Moreover, for sensors with IEEE 802.15.4 compliant radios, the maximum payload size is 102 bytes for each packet. Therefore, we have chosen a wider range of input sizes to SHA-1 to give readers a more complete picture of the performance. We perform the same experiment 10,000 times and take the average. For example, the execution times on a MicaZ mote for inputs of 54, 114, and 156 bytes are 9.6788, 18.947, and 28.0515 ms, respectively. Also, the execution times on a TelosB mote for inputs of 54, 114, and 156 bytes are 5.7263, 10.7529, and 15.629 ms, respectively.

To measure the execution time of public key cryptography, as shown in Table 4 (see footnote 1), we have implemented the ECC verification operation (with a random 20-byte number as the output) of the TinyECC 2.0 library [19] on MicaZ and TelosB motes. For example, the measured signature verification times are 2.436 and 3.955 seconds, which are 252 and 691 times longer than a SHA-1 hash operation with a 54-byte random number as input on MicaZ and TelosB motes, respectively. It can be seen that packet authentication based on the Merkle hash tree (or data hash chain) is much more efficient. Therefore, it is confirmed that DiDrip is suitable for sensor nodes with limited resources.

Next, we compare the energy consumption of the SHA-1 hash function and ECC verification under the condition that the radio of the mote is turned off. When a MicaZ mote is used in the circuit, Vr = 138 mV, I = 6.7779 mA, Vm = 2.8620 V, and P = 19.3983 mW. When a TelosB mote is used, Vr = 38 mV, I = 1.8664 mA, Vm = 2.9620 V, and P = 5.5283 mW. With the execution time obtained from Fig.
3, the energy consumption on the motes due to the SHA-1 operation can be determined. For example, the energy consumption of the SHA-1 operation with a random 54-byte number as input on MicaZ and TelosB motes is 0.18775 and 0.03166 mJ, respectively. Also, the energy consumption of the ECC signature verification operation on MicaZ and TelosB motes is 2835.2555 and 1316.1777 mJ, respectively.

Next, the impact of security functions on the propagation delay is investigated in an experimental network as shown in Fig. 4. The network has 24 TelosB nodes arranged in a 4 × 6 grid. The distance between adjacent nodes is about 35 cm, and the transmission power is configured to be the lowest level so that only one-hop neighbours are covered in the transmission range. The role of the repeater is played by the node located at a vertex of the grid.

TABLE 3. Running Time for Each Phase of the Basic Protocol of DiDrip (Except the Sensor Node Verification Phase)
Fig. 3. The execution times of SHA-1 hash function on MicaZ and TelosB motes.
TABLE 4. Running Time for ECC Signature Verification
Fig. 4. The 4 × 6 grid network of TelosB motes for measuring propagation delay.
Footnote 1. Note that ECC-160 is faster than ECC-128, because the column width of ECC-160 is set to 5 for hybrid multiplication optimization while that of ECC-128 is set to 4.

In the experiments, the packet delivery rate from the network user is 5 packets/s. The lengths of the round and data fields in a data item are set to 4 bits and 2 bytes, respectively. A hash function with 8-byte truncated output is used to construct data hash chains. An ECC-160 signature is 40 bytes long. Each experiment is repeated 20 times to obtain an average measurement. Figs. 5 and 6 plot the average propagation delays of Drip, DiDrip1, and DiDrip2 when the data hash chain and Merkle hash tree methods are employed, respectively.
It can be seen that the propagation delay increases almost linearly with the number of data items per round for all three protocols. Moreover, the security functions in DiDrip2 have a low impact on propagation delay. For the five experiments with the data hash chain method, the delay of DiDrip2 is just 3.448, 4.158, 3.222, 3.919, and 2.855 s more than that of Drip, respectively. Note that the increase in propagation delay is dominated by the signature verification time incurred at the one-hop neighboring nodes of the base station. This is because each node carries out signature verification only after forwarding data packets (with valid puzzle solutions).

Table 5 shows the memory (ROM and RAM) usage of DiDrip2 (with the data hash chain method) on MicaZ and TelosB motes for the case of four data items per round. The code size of Drip and a set of verification functions from TinyECC (secp128r1, secp160r1, and secp192r1, which are implementations based on various elliptic curves according to the Standards for Efficient Cryptography Group) are included for comparison. For example, the size of the DiDrip implementation corresponds to 26.18 and 56.82 percent of the RAM and ROM capacities of TelosB, respectively. Clearly, the ROM and RAM consumption of DiDrip is more than that of Drip because of the extra security functions. Moreover, it can be seen that the majority of the increased ROM is due to TinyECC.

8 CONCLUSION AND FUTURE WORK

In this paper, we have identified the security vulnerabilities in data discovery and dissemination when used in WSNs, which have not been addressed in previous research. Also, none of those approaches support distributed operation. Therefore, in this paper, a secure and distributed data discovery and dissemination protocol named DiDrip has been proposed. Besides analyzing the security of DiDrip, this paper has also reported the evaluation results of DiDrip in an experimental network of resource-limited sensor nodes, which show that DiDrip is feasible in practice.
We have also given a formal proof of the authenticity and integrity of the disseminated data items in DiDrip. Also, due to the open nature of wireless channels, messages can be easily intercepted. Thus, in future work, we will consider how to ensure data confidentiality in the design of secure and distributed data discovery and dissemination protocols.

APPENDIX
FURTHER IMPROVEMENT OF DIDRIP SECURITY AND EFFICIENCY

By the basic DiDrip protocol, we can achieve secure and distributed data discovery and dissemination. To further enhance the protocol, here we propose two modifications to improve the efficiency and security of DiDrip. For brevity, only those parts of the basic protocol that require changes will be presented.

Fig. 5. Propagation delay comparison of three protocols when the data hash chain method is employed.
Fig. 6. Propagation delay comparison of three protocols when the Merkle hash tree method is employed.
TABLE 5. Code Sizes (Bytes) on MicaZ and TelosB Motes

Avoiding the Generation, Transmission and Verification of Certificates

There are some efficiency problems caused by the generation, transmission, and verification of certificates. First, it is not efficient in communication, as the certificate has to be transmitted along with the advertisement packet across every hop as the message propagates in the WSN. A large per-message overhead will result in more energy consumption on each sensor node. Second, to authenticate each advertisement packet, it always takes two expensive signature verification operations because the certificate should always be authenticated first. To address these challenges, a feasible approach is that before the network deployment, the public key/dissemination-privilege pair of each network user is loaded into the sensor nodes by the network owner.
Once a new user joins the network after the network deployment, the network owner can notify the sensor nodes of the user's public key/dissemination privilege by signing the notification with its own private key. The detailed description is as follows.

User Joining Phase

According to the basic protocol of DiDrip, user Uj generates its public and private keys and sends a 3-tuple <UIDj, Prij, PKj> to the network owner. When the network owner receives the 3-tuple, it no longer generates the certificate Certj. Instead, it signs the 3-tuple with its private key and sends it to the sensor nodes. Finally, each node stores the 3-tuple.

Packet Pre-Processing Phase

The user certificate Certj stored in packet P0 is replaced by UIDj.

Packet Verification Phase

If this is an advertisement packet, according to the received identity UIDj, node Sj first retrieves the dissemination privilege Prij from its storage and then checks the legality of Prij. If the result is positive, node Sj uses the public key PKj from its storage to run an ECDSA verify operation to authenticate the signature; otherwise, node Sj simply discards the packet. Note that node Sj does not need to authenticate the certificate.

As described above, the 3-tuple <UIDj, Prij, PKj> of each network user is just 2 + 6 + 40 = 48 bytes. Therefore, assuming the protocol supports 500 network users, the required storage is about 23 KB (500 × 48 bytes). We consider resource-limited sensor nodes such as TelosB motes as examples. The 1-MB flash memory is enough for storing these public parameters.

Message Specific Puzzle Approach for Resistance to DoS Attacks

DiDrip uses a digital signature to bootstrap the authentication of a round of data discovery and dissemination. This authentication is vulnerable to DoS attacks. That is, an adversary may flood a lot of illegal signature messages (i.e., advertisement messages in this paper) to the sensor nodes to exhaust their resources and render them less capable of processing the legitimate signature messages.
Such an attack can be defended against by applying the message specific puzzle approach [2]. This approach requires each signature message to contain a puzzle solution. When a node receives a signature message, it first checks that the puzzle solution is correct before verifying the signature. There are two characteristics of the puzzles. First, the puzzles are difficult to solve but their solutions are easy to verify. Second, there is a tight time limit to solve a puzzle. This discourages adversaries from launching the DoS attack even if they are computationally powerful. More details about this approach can be found in [2].

Another advantage of applying the message specific puzzle is to reduce the dissemination delay, which is the time for a disseminated packet to reach all nodes in a WSN. Recall that in step 1.a) of the packet verification phase, when a node receives the signature packet, it first carries out the signature verification before using the Trickle algorithm to broadcast the signature packet. This means that the dissemination delay depends on the signature verification time tsv. On the other hand, when the message specific puzzle approach is applied, a node can just verify the validity of the puzzle solution before broadcasting the signature packet. Then, the dissemination delay only depends on the puzzle solution verification time tpv. Since tpv ≪ tsv, the dissemination delay is significantly reduced. Moreover, the reduction in dissemination delay is proportional to the network size. This is demonstrated by the experiments presented in Section 7.2.

ACKNOWLEDGMENTS

This research is supported by a strategic research grant from City University of Hong Kong [Project No. 7004054], the Fundamental Research Funds for the Central Universities, and the Specialized Research Fund for the Doctoral Program of Higher Education. D.
He is the corresponding author of this article.

... and MEng degrees from the Harbin Institute of Technology, China, and the PhD degree from Zhejiang University, China, all in computer science in 2007, 2009, and 2012, respectively. He is with the School of Computer Science and Engineering, South China University of Technology, P.R. China, and also with the College of Computer Science and Technology, Zhejiang University, P.R. China. His research interests include network and systems security. He is an associate editor or on the editorial board of some international journals such as IEEE Communications Magazine, Springer Journal of Wireless Networks, Wiley's Wireless Communications and Mobile Computing Journal, Journal of Communications and Networks, Wiley's Security and Communication Networks Journal, and KSII Transactions on Internet and Information Systems. He is a member of the IEEE.

Sammy Chan (S'87-M'89) received the BE and MEngSc degrees in electrical engineering from the University of Melbourne, Australia, in 1988 and 1990, respectively, and the PhD degree in communication engineering from the Royal Melbourne Institute of Technology, Australia, in 1995. From 1989 to 1994, he was with Telecom Australia Research Laboratories, first as a research engineer, and between 1992 and 1994 as a senior research engineer and project leader. Since December 1994, he has been with the Department of Electronic Engineering, City University of Hong Kong, where he is currently an associate professor. He is a member of the IEEE.

Mohsen Guizani (S'85-M'89-SM'99-F'09) received the BS (with distinction) and MS degrees in electrical engineering, the MS and PhD degrees in computer engineering in 1984, 1986, 1987, and 1990, respectively, from Syracuse University, Syracuse, New York. He is currently a professor and the associate vice president for Graduate Studies at Qatar University, Qatar. His research interests include computer networks, wireless communications and mobile computing, and optical networking.
He currently serves on the editorial boards of six technical journals and is the founder and EIC of the "Wireless Communications and Mobile Computing" Journal published by John Wiley (http://www.interscience.wiley.com/jpages/1530-8669/). He is a fellow of the IEEE and a senior member of ACM.

Haomiao Yang (M'12) received the MS and PhD degrees in computer applied technology from the University of Electronic Science and Technology of China (UESTC) in 2004 and 2008, respectively. From 2012 to 2013, he worked as a postdoctoral fellow at Kyungil University, Republic of Korea. Currently, he is an associate professor at the School of Computer Science and Engineering, UESTC, China. His research interests include cryptography, cloud security, and big data security. He is a member of the IEEE.

Boyang Zhou is currently working toward the PhD degree at the College of Computer Science at Zhejiang University. His research areas include software-defined networking, future internet architecture and flexible reconfigurable networks.
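As a worked check of the energy measurement procedure described in Section 7.1 of the DiDrip evaluation above, the following minimal sketch recomputes the mote power and the SHA-1 energy figures from the quoted measurements (Vb = 3 V, the measured Vr and I, and the 54-byte SHA-1 execution times). The function names are hypothetical and chosen only for illustration; the input values are the ones reported in the text.

```python
def mote_power_mw(v_battery_v, v_resistor_v, current_ma):
    """Power drawn by the mote: P = Vm * I with Vm = Vb - Vr.

    Volts times milliamperes gives milliwatts.
    """
    v_mote_v = v_battery_v - v_resistor_v
    return v_mote_v * current_ma


def energy_mj(power_mw, exec_time_ms):
    """Energy of one operation: E = P * t.

    Milliwatts times milliseconds gives microjoules; divide by 1000 for mJ.
    """
    return power_mw * exec_time_ms / 1000.0


# MicaZ: Vr = 138 mV, I = 6.7779 mA, SHA-1 on 54 bytes takes 9.6788 ms
p_micaz = mote_power_mw(3.0, 0.138, 6.7779)
e_micaz = energy_mj(p_micaz, 9.6788)

# TelosB: Vr = 38 mV, I = 1.8664 mA, SHA-1 on 54 bytes takes 5.7263 ms
p_telosb = mote_power_mw(3.0, 0.038, 1.8664)
e_telosb = energy_mj(p_telosb, 5.7263)

print(round(p_micaz, 4), round(e_micaz, 5))
print(round(p_telosb, 4), round(e_telosb, 5))
```

Under these inputs the sketch lands on the reported power values (about 19.398 mW and 5.528 mW) and the reported SHA-1 energies (about 0.18775 mJ and 0.03166 mJ).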

Receiver Cooperation in Topology Control for Wireless Ad-Hoc Networks

05/08/201902/07/2019 by admin

Abstract: We propose employing receiver cooperation in centralized topology control to improve energy efficiency as well as network connectivity. The idea of transmitter cooperation has been widely considered in topology control to improve network connectivity or energy efficiency. However, receiver cooperation has not previously been considered in topology control. In particular, we show that we can improve both connectivity and energy efficiency if we employ receiver cooperation in addition to transmitter cooperation. Consequently, we conclude that a system based both on transmitter and receiver cooperation is generally superior to one based only on transmitter cooperation. We also show that the increase in network connectivity caused by employing transmitter cooperation in addition to receiver cooperation is at the expense of significantly increased energy consumption. Consequently, system designers may opt for receiver-only cooperation in cases for which energy efficiency is of the highest priority or when connectivity increase is no longer a serious concern.

Index Terms: Ad-hoc network, energy efficiency, multi-hop communications, network connectivity, receiver cooperation, topology control, transmitter cooperation.

I. INTRODUCTION

The wireless ad-hoc network has been receiving growing attention during the last decade for its various advantages such as instant deployment and reconfiguration capability. In general, a node in a wireless ad-hoc network suffers from connectivity instability because of channel quality variation and limited battery lifespan. Therefore, an efficient algorithm for controlling the communication links among nodes is essential for the construction of a wireless ad-hoc network.
In a topology control scheme, communication links among nodes are defined to achieve certain desired properties for connectivity, energy consumption, mobility, network capacity, security, and so on. In this paper, we propose topology control schemes that aim to increase the energy efficiency and the network connectivity simultaneously.

Manuscript received February 23, 2014; revised July 24, 2014 and November 12, 2014; accepted November 12, 2014. Date of publication December 4, 2014; date of current version April 7, 2015. Part of this work was presented at IEEE WCNC, Shanghai, China, April 2013. This work was supported in part by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2010-0025062 and NRF-2013R1A1A2011098). The associate editor coordinating the review of this paper and approving it for publication was M. Elkashlan. (Corresponding authors: Do-Sik Yoo and Seong-Jun Oh.)
K. Moon, W. Lee, and S.-J. Oh are with Korea University, Seoul 136-701, Korea (e-mail: keith@korea.ac.kr; wlee@korea.ac.kr; seongjun@korea.ac.kr).
D.-S. Yoo is with the Department of Electronic and Electrical Engineering, Hongik University, Seoul 121-791, Korea (e-mail: yoodosik@hongik.ac.kr).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TWC.2014.2374617

In a wireless ad-hoc network, two nodes that are not directly connected may possibly communicate with each other through so-called multi-hop communications [1], [2]. By employing multi-hop communication, a node in a wireless ad-hoc network can extend its communication range through cascaded multi-hop links and eliminate some dispensable links to reduce the total required power. Various efforts have been made to study how the links must be maintained and how much power must be associated with each of those links for optimal network operations depending on the situation at hand. For example, Kirousis et al.
For example, Kirousis et al.[3] and Clementi et al. [4] studied the problem of minimizingthe sum power consumption of the nodes in an ad-hoc networkand showed that this problem is nondeterministic polynomialtime(NP) hard. Because the sum power minimization problemis NP hard, the authors in [4] proposed a heuristic solution forpractical ad-hoc networks. Ramanathan and Rosales-Hain, in[5], proposed two topology control schemes that minimize themaximum transmission power of each node with bi-directionaland directional strong connectivities, respectively. When thenumber of participating nodes is very large, it is crucial toreduce the transmission delay due to multi-hop transmissions.To maintain the total transmission delay within a tolerable limit,Zhang et al. studied delay-constrained ad-hoc networks in [6]and Huang et al. proposed a novel topology control scheme in[7] by predicting node movement.In [3]–[7], it was assumed that there exists a centralizedsystem controlling nodes so that global information such asnode positions and synchronization timing is known by eachnode in advance. However, such an assumption can be toostrong, especially in the case of ad-hoc networks. For thisreason, a distributed approach has been widely considered [8]–[11], where each node has to make its decision based on the informationit has collected from nearby neighbor nodes. Li et al.proposed a distributed topology control scheme in [8] andproved that the distributed topology control scheme preservesthe network connectivity compared with a centralized one.Because the topology control schemes in [3]–[8] guarantee onlyone connected neighbor for each node, the network connectivitycan be broken even when only a single link is disconnected.Accordingly, a reliable distributed topology control scheme thatguarantees at least k-neighbors was proposed in [9]. 
The resultin [9] was extended to a low computational complexity schemein [10], to a mobility guaranteeing scheme in [11], and to anenergy saving scheme in [12].1536-1276 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.MOON et al.: RECEIVER COOPERATION IN TOPOLOGY CONTROL FOR WIRELESS AD-HOC NETWORKS 1859In [13], the concept of cooperative communications was firstemployed in centralized topology control, where it was shownthat cooperative communications can dramatically reduce thesum power consumption in broadcast network. Cardei et al.applied the idea of [13] to wireless ad-hoc networks in [14].Yu et al. further showed that cooperative communications canextend the communication range of each node with only amarginal increment in power consumption so that networkconnectivity is increased in an energy efficient manner [15],[16]. Because of these various advantages, the idea of cooperativecommunications has been widely considered in recentstudies on topology control to maximize capacity [17], improverouting efficiency [18], and mitigate interference from nearbynodes [19], [20]. The idea of cooperative communications inthese previous works [13]–[20] is realized in the followingway. First, a transmitting node sends a message to its neighbornodes (called helper nodes). After the helper nodes decode themessage, they (as well as the transmitting node in some cases)retransmit the message to a receiving node, and the receivingnode decodes the message by combining the signals frommultiple nodes. 
Therefore, strictly speaking, only the concept of transmitter cooperation has been employed, and receiver cooperation has not been considered.

In this paper, we propose to employ the idea of receiver cooperation in centralized topology control schemes, possibly in combination with transmitter cooperation, to increase the network connectivity in an energy efficient way. Consequently, we propose two centralized topology control schemes, one based solely on receiver cooperation, and the other based both on transmitter and receiver cooperation. For comparison with the proposed schemes, we consider the cooperative topology control scheme in [16] that is based solely on transmitter cooperation. We show, through extensive simulations, that we can improve both network connectivity and energy efficiency if we employ receiver cooperation in addition to transmitter cooperation. We conclude that the system based both on transmitter and receiver cooperation is generally superior to that based only on transmitter cooperation. We also show that the system based solely on receiver cooperation is as energy efficient as one based both on transmitter and receiver cooperation despite a slight decrease in network connectivity. Although the system based both on transmitter and receiver cooperation achieves higher network connectivity than one based only on receiver cooperation, we show that the additional connectivity increase requires significantly increased energy consumption. For this reason, system designers may opt for receiver-only cooperation if energy efficiency is of the highest priority or connectivity increase is no longer of serious concern.

The remainder of this paper is organized as follows. In Section II, we describe the channel model considered throughout this paper. In Section III, we explain the topology control scheme without cooperation that underlies the two cooperative topology control schemes considered in this paper. The two cooperative topology control schemes are then described in Section IV.
Furthermore, the performance of the two cooperative topology control schemes is numerically analyzed in Section V. Finally, we draw conclusions in Section VI.

II. SYSTEM MODEL

In this section, we describe the system model considered throughout this paper. We consider a network $V \equiv \{v_1, v_2, \ldots, v_n\}$ consisting of $n$ nodes that are assumed to be uniformly distributed over a certain region in $\mathbb{R}^2$. The nodes are assumed to communicate with one another by transmitting signals over a wireless channel with given bandwidth $W$. We assume that the physical location of each node does not change with time.

To model a practical wireless channel, we assume that the path loss $PL(d_{ij})$ between nodes $v_i$ and $v_j$ is given by

$$PL(d_{ij})\,[\mathrm{dB}] = PL_{d_0} + 10k \log\left(\frac{d_{ij}}{d_0}\right) + 2\log h_{ij} + X_\sigma + c. \tag{1}$$

Here, $PL_{d_0}$ is the reference path loss at unit distance $d_0$ obtained from the free space path loss model [21], and $k$ denotes the path loss exponent that represents how quickly the transmit power attenuates as a function of the distance. The variables $d_{ij}$ and $h_{ij}$ respectively denote the distance and the randomly varying fast fading coefficient between $v_i$ and $v_j$. In addition, $X_\sigma$ is a random variable introduced to account for the shadowing effect. We assume that $h_{ij}$ and $X_\sigma$ vary independently from packet to packet, but remain constant during each packet duration. We assume further that $h_{ij}^2$ follows a $\chi^2$-distribution with two degrees of freedom and $X_\sigma$ follows a normal distribution with zero mean and standard deviation $\sigma$. Finally, the variable $c$ is the offset correction factor between the mathematical model and field measurement. We note that the values of $PL_{d_0}$, $d_0$, $k$, $\sigma$, and $c$ vary depending on the channel scenario, urban or suburban [22]. For given $PL_{d_0}$, $d_0$, $k$, $\sigma$, and $c$, when node $v_i$ transmits a signal to node $v_j$ with power $P_i$, the received signal to noise ratio (SNR) $\gamma_{ij}(P_i)$ is given as

$$\gamma_{ij}(P_i) = \frac{P_i}{N_{0,j} W \times 10^{0.1 \times PL(d_{ij})}}, \tag{2}$$

where $N_{0,j}$ denotes the one-sided noise power spectral density at $v_j$.
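The channel model of Eqs. (1) and (2) above can be sketched numerically as follows. This is only an illustration under stated assumptions: the logarithm on the distance term is taken to be base 10, the fading and shadowing contributions are passed in as already-drawn dB values rather than sampled, and all function names and numeric inputs are hypothetical and not taken from the paper.

```python
import math


def path_loss_db(d, pl_d0_db, d0, k, fading_db=0.0, shadowing_db=0.0, c_db=0.0):
    """Path loss in dB with the structure of Eq. (1).

    fading_db and shadowing_db stand in for the fading term and X_sigma of
    one packet (assumption: supplied by the caller as dB realizations).
    """
    distance_term = 10.0 * k * math.log10(d / d0)  # assumes log base 10
    return pl_d0_db + distance_term + fading_db + shadowing_db + c_db


def received_snr(p_tx_w, n0_w_per_hz, bandwidth_hz, pl_db):
    """Linear received SNR of Eq. (2): P_i / (N0 * W * 10^(0.1 * PL))."""
    return p_tx_w / (n0_w_per_hz * bandwidth_hz * 10.0 ** (0.1 * pl_db))


# Illustrative inputs only: 100 m link, 40 dB reference loss at 1 m, k = 3
pl = path_loss_db(d=100.0, pl_d0_db=40.0, d0=1.0, k=3.0)
snr = received_snr(p_tx_w=0.1, n0_w_per_hz=4e-21, bandwidth_hz=1e6, pl_db=pl)
print(pl, 10.0 * math.log10(snr))
```

With these made-up inputs the deterministic part of the path loss is 100 dB and the received SNR is about 34 dB; a negative `shadowing_db` would model a packet that happens to see a better-than-average channel.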
Throughout this paper, we assume that the maximum transmit power of each node is given by $P_{\max}$.

As the final issue in the system model, we briefly discuss network synchronization. Communication in a completely asynchronous manner is impossible, or at least very difficult to achieve. In fact, synchronization can be a particularly important issue in ad-hoc networks [23]–[25]. In this paper, we assume that symbol level synchronization is maintained among participating nodes. Although detailed synchronization techniques are not the main focus of this paper, we briefly describe how the issue of synchronization can be resolved with existing methods. Synchronization techniques that achieve time errors of around 3 to 7 μs have been reported. At such a level of synchronization, it becomes desirable to maintain a symbol duration longer than 50 μs, which corresponds to a symbol rate of up to 20 kilo-symbols per second. A symbol rate of 20 kilo-symbols per second with rudimentary binary phase shift keying (BPSK) modulation results in a data rate of only 20 kbps, which is not very high. However, we can employ multi-carrier techniques such as orthogonal frequency division multiplexing (OFDM) to increase the data rate while maintaining or reducing the symbol rate. For example, if we employ an OFDM system with 512 subcarriers, the data rate can be increased to about 10 Mbps (20 kilo-symbols per second × 512 subcarriers × 1 bit per symbol = 10.24 Mbps) using a simple BPSK sub-carrier modulation scheme. Consequently, even with existing techniques such as the OFDM scheme and synchronization algorithms proposed in [25], it is possible to maintain the symbol-level synchronization required to implement the algorithms proposed in this paper.

1860 IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS, VOL. 14, NO. 4, APRIL 2015

Fig. 1. A pictorial representation of $G = (V, E)$ with $V = \{v_1, \ldots, v_8\}$ and $E = \{(v_1; v_2)_{NN}, (v_1; v_3)_{NN}, (v_4; v_5)_{NN}, (v_4; v_6)_{NN}, (v_4; v_7)_{NN}\}$.

III. NODE-TO-NODE TOPOLOGY CONTROL

In this section, we explain a topology control scheme, which we refer to as the node-to-node topology control (NNTC) scheme, that is based solely on node-to-node communication links. To describe the NNTC scheme, we first consider the concept of a wireless communication link between two nodes and its related definitions. In this paper, a wireless link between two nodes is said to exist if the received SNR exceeds a certain threshold, meaning that the packet error probability is below a certain level (corresponding to the threshold). More formally, we say that there exists a node-to-node (N-N) link from node $v_i$ to node $v_j$ if and only if

$$f(\gamma_{ij}(P_i)) \le \alpha_\tau, \tag{3}$$

for a certain transmit power $P_i \le P_{\max}$ from $v_i$. Here, $f: \mathbb{R}^+ \to [0, 1]$ denotes the packet error probability function associated with the given coding and modulation scheme and $\alpha_\tau$ is the given threshold on the packet error probability, which we call the error threshold hereafter. We assume that $f$ is a monotonically decreasing continuous function and that all the nodes share the same packet error probability function $f$ (see footnote 1).

Footnote 1: In many previous works on topology control [14]–[17], (3) is equivalently written as $\gamma_{ij}(P_i) \ge \mathrm{SNR}_\tau \equiv f^{-1}(\alpha_\tau)$. However, to consider the receiver cooperation scheme in a unified framework, we directly consider the packet error probability function $f$.

When there exists a uni-directional N-N link from $v_i$ to $v_j$, the power $P_i$ that satisfies (3) with equality, which we denote by $P_{NN}(v_i \to v_j)$, is called the minimum N-N routable power of the N-N link from $v_i$ to $v_j$. We note that it directly follows from the definition that

$$P_{NN}(v_i \to v_j) = N_{0,j} W f^{-1}(\alpha_\tau)\, 10^{0.1 \times PL(d_{ij})}. \tag{4}$$

If both the uni-directional N-N links from $v_i$ to $v_j$ and from $v_j$ to $v_i$ exist, we say that there exists an N-N bi-directional link, or simply an N-N link, between the two nodes $v_i$ and $v_j$, which we denote by $(v_i; v_j)_{NN}$.
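As a small illustration of (3) and (4), the sketch below computes the minimum N-N routable power and checks whether a uni-directional link exists. The packet error function used here, $f(\gamma) = e^{-\gamma}$, is a hypothetical stand-in chosen only because it is monotonically decreasing and has a closed-form inverse; it is not the simulated error curve used in the paper, and the function names are illustrative.

```python
import math

def packet_error(gamma):
    """Hypothetical packet error function f (monotonically decreasing)."""
    return math.exp(-gamma)

def packet_error_inverse(alpha):
    """Inverse of the hypothetical f above, i.e. the SNR at which f equals alpha."""
    return -math.log(alpha)

def min_routable_power(alpha_tau, n0, bandwidth, pl_db):
    """Minimum N-N routable power as in (4)."""
    return n0 * bandwidth * packet_error_inverse(alpha_tau) * 10.0 ** (0.1 * pl_db)

def link_exists(alpha_tau, n0, bandwidth, pl_db, p_max):
    """A uni-directional N-N link exists iff (3) can be met with some P <= p_max."""
    return min_routable_power(alpha_tau, n0, bandwidth, pl_db) <= p_max

n0 = 10.0 ** ((-174.0 - 30.0) / 10.0)  # -174 dBm/Hz expressed in W/Hz
for pl_db in (100.0, 130.0):           # hypothetical path loss values
    p = min_routable_power(0.01, n0, 10e6, pl_db)
    print(pl_db, p, link_exists(0.01, n0, 10e6, pl_db, p_max=0.25))
```

The round-trip power in (5) is then the sum of the two one-way values, one computed with the path loss and noise density seen at each end of the link.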
The minimum N-N round-trip power $P_{NN}(v_i, v_j)$ of the bi-directional N-N link $(v_i; v_j)_{NN}$ is defined as the sum of the two uni-directional minimum N-N routable powers, namely, as

$$P_{NN}(v_i, v_j) = P_{NN}(v_i \to v_j) + P_{NN}(v_j \to v_i). \tag{5}$$

We note that there are some situations in which two nodes $v_i$ and $v_j$ can communicate with each other even if there is no N-N link between $v_i$ and $v_j$. For example, we consider the case in which there are two N-N links $(v_1; v_2)_{NN}$ and $(v_1; v_3)_{NN}$. In this case, $v_2$ and $v_3$ can exchange a message through $v_1$ even if there is no N-N link between $v_2$ and $v_3$. To route a message through multiple N-N links, all available N-N links should be known to the nodes. To reduce the routing complexity, only some of the existing N-N links are used for communications in practice. By eliminating redundant links, we can simplify the message routing protocol and save power consumed for exchanging reference signals such as pilot and channel information [26], [27].

We denote the set of N-N links to be used for routing by $E$. Consequently, $(v_i; v_j)_{NN} \in E$ means that there exists an N-N link $(v_i; v_j)_{NN}$ and this N-N link is to be used for routing. Here, we note that $(v_i; v_j)_{NN} \notin E$ does not necessarily mean that there is no N-N link between $v_i$ and $v_j$. In graph theory, the combination $G = (V, E)$ of $V$ and $E$ is called a graph with vertex set $V$ and edge set $E$. In the remainder of this paper, nodes and links shall also be referred to as vertices and edges, respectively.

For a given $E$, if $(v_i; v_j)_{NN} \in E$, $v_i$ is said to be a neighbor of $v_j$ and vice versa. We denote by $N(v_i|E)$ the set of neighbors of $v_i$. For illustration, we consider the graph $G = (V, E)$ with $V = \{v_1, v_2, \ldots, v_8\}$ and $E = \{(v_1; v_2)_{NN}, (v_1; v_3)_{NN}, (v_4; v_5)_{NN}, (v_4; v_6)_{NN}, (v_4; v_7)_{NN}\}$, which compactly describes the situation in Fig. 1. In this example, $v_5$, $v_6$, and $v_7$ are neighbors of $v_4$; therefore, $N(v_4|E) = \{v_5, v_6, v_7\}$. Here, we note that $v_5$ is not a neighbor of $v_7$; however, it is possible for $v_5$ to send a message to $v_7$ if $(v_4; v_5)_{NN}$ and $(v_4; v_7)_{NN}$ are cascaded.
Likewise, if $v_i$ and $v_j$ can send a message bidirectionally using a single or cascaded multiple N-N edges, we say that $v_i$ and $v_j$ are connected by N-N edges. The maximal set of nodes connected by N-N edges in $E$ is referred to as a cluster. For notational convenience, a given cluster $\{v_{i_1}, v_{i_2}, \ldots, v_{i_m}\}$ is denoted by $\Omega_{\max\{i_1, i_2, \ldots, i_m\}}$. For instance, in Fig. 1, there are three clusters $\{v_1, v_2, v_3\}$, $\{v_4, v_5, v_6, v_7\}$, and $\{v_8\}$, which are denoted by $\Omega_3$, $\Omega_7$, and $\Omega_8$, respectively. As shown in this example, several clusters can exist for a given graph. We denote the set of all clusters by $\mathcal{V}$. We note that $\mathcal{V} = \{\Omega_3, \Omega_7, \Omega_8\}$ in the above example.

MOON et al.: RECEIVER COOPERATION IN TOPOLOGY CONTROL FOR WIRELESS AD-HOC NETWORKS 1861

Fig. 2. Steps to construct the edge set $E$ for a given node distribution $V$. (a) Identification of all N-N links. (b) A typical example of a spanning forest of $G_L = (V, L)$.

We now describe precisely how the set $E$ of N-N edges to be used for routing in the NNTC scheme is constructed. For a given node set $V$, the set $L$ of all existing N-N links and the set $\mathcal{V}$ of clusters defined by the graph $G_L = (V, L)$ are identified. Next, the edge set $E$ is defined as a subset of $L$ such that the graph $G = (V, E)$ also leads to the same cluster set $\mathcal{V}$ as graph $G_L = (V, L)$. Several candidate algorithms exist that can build $E$, such as breadth-first search (BFS) [28] and depth-first search (DFS) [29]. In this paper, we use the minimum-weight spanning forest (MSF) algorithm that aims to build a sparse edge set using the optimal average power required for network structure construction [1], [8], [15], [16]. In the MSF algorithm, first a set $T_\Omega$, called a minimum spanning tree (MST), is defined for each cluster $\Omega \in \mathcal{V}$. After obtaining all the MSTs, the set $F_V$, called the minimum spanning forest of $V$, is defined as the union of all the MSTs, namely, as

$$F_V = \bigcup_{\Omega \in \mathcal{V}} T_\Omega, \tag{6}$$

which is defined to be the edge set $E$ in the NNTC scheme.

It now remains to describe how the MST $T_\Omega$ is obtained for each cluster $\Omega \in \mathcal{V}$.
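The cluster identification and the forest in (6) can be sketched with a standard graph routine. The example below uses Kruskal's algorithm with a union-find structure, which is one common way to obtain a minimum spanning forest; the paper does not state which MSF implementation it uses, so this choice is an assumption. The link weights are hypothetical, and two extra links, $(v_2; v_3)$ and $(v_5; v_6)$, are added to the Fig. 1 example so that there is something to prune.

```python
def minimum_spanning_forest(n, weighted_links):
    """Kruskal's algorithm on nodes 1..n; weighted_links holds (weight, i, j).

    Returns the forest edges and the clusters (connected components).
    """
    parent = list(range(n + 1))

    def find(x):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    forest = []
    for _, i, j in sorted(weighted_links):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:       # the link joins two different components
            parent[root_i] = root_j
            forest.append((i, j))  # keep it; any later link between them is redundant

    clusters = {}
    for v in range(1, n + 1):
        clusters.setdefault(find(v), set()).add(v)
    return forest, sorted(clusters.values(), key=min)

# Hypothetical round-trip powers used as weights; the node indices follow Fig. 1.
links = [(1.0, 1, 2), (2.0, 1, 3), (5.0, 2, 3),
         (1.0, 4, 5), (1.5, 4, 6), (2.5, 4, 7), (4.0, 5, 6)]
E, clusters = minimum_spanning_forest(8, links)
print(sorted(E))
print(clusters, [max(c) for c in clusters])
```

With these weights the two redundant links are dropped, the remaining edges match the set $E$ of Fig. 1, and the largest index in each cluster gives the labels $\Omega_3$, $\Omega_7$, and $\Omega_8$.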
If $\Omega$ is a singleton, then $T_\Omega$ is defined to be the empty set $\emptyset$. If $\Omega$ contains more than one node, to obtain $T_\Omega$, it is necessary to consider the set $L|_\Omega$ of all edges that connect nodes in $\Omega$. For instance, we consider the example depicted in Fig. 2(a) in which the network consists of three singleton clusters and nine non-singleton clusters. For a non-singleton cluster $\Omega$ encircled by a red colored line, the edge set $L|_\Omega$ is defined as the set of all edges inside the red circle. We call a subset $T$ of $L|_\Omega$ a spanning tree of $\Omega$ if and only if there are no cycles (loops) in $T$ and any two nodes in $\Omega$ are connected by edges in $T$. For example, the edge set of each cluster depicted in Fig. 2(b) is a spanning tree of that cluster. Among all the existing spanning trees of $\Omega$, the one that leads to the minimum edge-weight sum is referred to as the MST $T_\Omega$ of $\Omega$. Here, the minimum N-N round-trip power $P_{NN}(v_i, v_j)$ of the N-N link is used for the weight of each edge $(v_i; v_j)_{NN} \in L$.

We note that transmission through a link in $F_V$ is not completely error-free, but has a packet error probability of $\alpha_\tau$. However, in the following, we assume that the communication link in $F_V$ is error-free, possibly with the help of an automatic repeat request (ARQ) scheme. Clearly, the repeated transmission will consume additional energy. However, even with the simplest ARQ scheme, the average energy required to complete a successful transmission is increased from that of a single transmission (with packet error rate $\alpha_\tau$) by a factor of $1/(1-\alpha_\tau)$ [30]. We note that the factor $1/(1-\alpha_\tau)$ is reasonably close to 1 if $\alpha_\tau$ is chosen to be small, say, less than 0.1. Therefore, if $\alpha_\tau$ is sufficiently small, the additional cost for error-free communication is only a small fraction of the total cost and hence is negligible.

IV. COOPERATIVE TOPOLOGY CONTROL

We note that inter-cluster communication, namely, communication between nodes belonging to different clusters, is not possible solely through cascaded N-N links.
To make inter-cluster communications possible, [16] employed the idea of transmitter cooperation in which multiple nodes in one cluster simultaneously transmit the same message to a single node in another cluster. In [16], to keep the additional complexity due to the employment of cooperative transmission manageable, it was assumed that a pair of nodes belonging to two communicating clusters were pre-assigned so that communications between the two clusters could only happen between these two nodes with the help of nodes in their neighborhoods. We note that not only the neighboring nodes around the transmitting node but also the nodes around the receiving node can help to establish inter-cluster communications. Consequently, in this paper, we propose to employ receiver cooperation in which the inter-cluster communication is regarded as successful if the receiving node or any of the neighboring nodes succeeds in receiving the message correctly. If the neighboring nodes only around the receiving node participate in the cooperation, the established link between two clusters is referred to as the node-to-cluster (N-C) link. Furthermore, if neighboring nodes around both the transmitting and receiving nodes participate in the link establishment, the inter-cluster communication link is called the cluster-to-cluster (C-C) link.

In this section, we describe two centralized cooperative topology control schemes based on N-C and C-C links that are referred to as node-to-cluster topology control (NCTC) and cluster-to-cluster topology control (CCTC) schemes, respectively. In each of these cooperative topology control schemes, cooperative links are employed to connect the clusters obtained from the graph $G = (V, E)$ described in Section III.
Consequently, the network configuration defined in a cooperative topology control scheme is described by four sets, namely, the set $V$ of nodes, the set $E$ of edges used for routing in the NNTC scheme, the set $\mathcal{V}$ of clusters defined by the graph $G = (V, E)$, and the set $\mathcal{E}$ of cooperative edges. For this reason, the network configurations defined in the NCTC and CCTC schemes are identified by $G_{NC} = (V, E, \mathcal{V}, \mathcal{E}_{NC})$ and $G_{CC} = (V, E, \mathcal{V}, \mathcal{E}_{CC})$, respectively. Here, $\mathcal{E}_{NC}$ and $\mathcal{E}_{CC}$ consist only of N-C and C-C edges, respectively.

A. NCTC

In this subsection, we describe how the network configuration $G_{NC} = (V, E, \mathcal{V}, \mathcal{E}_{NC})$ corresponding to the NCTC scheme is defined. Given graph $G = (V, E)$ and the corresponding cluster set $\mathcal{V}$, the edge set $\mathcal{E}_{NC}$ is obtained in three steps. First, the set $L_{NC}$ of all N-C links connecting clusters in $\mathcal{V}$ is identified. Next, for each N-C link in $L_{NC}$, the weight of the link is defined as the minimum power required to establish it. Finally, the desired edge set $\mathcal{E}_{NC}$ is defined as the MSF of the graph $G_{L_{NC}} = (\mathcal{V}, L_{NC})$.

To describe the NCTC scheme, we first define the node-to-cluster (N-C) link. For a more concrete understanding of the N-C link, we consider a simple example of receiver cooperation between two clusters $\Omega_3 = \{v_1, v_2, v_3\}$ and $\Omega_7 = \{v_4, v_5, v_6, v_7\}$. For illustration, we assume that the inter-cluster communication link between two clusters is established if the error probability is less than or equal to 0.1. We assume that the decoding error probabilities at nodes $v_4$, $v_5$, $v_6$, and $v_7$ are, respectively, given as 0.3, 0.4, 0.8, and 0.9 when $v_1$ sends a message with maximum power. Consequently, node $v_1$ and a node in $\Omega_7$ cannot establish inter-cluster communications between $\Omega_3$ and $\Omega_7$ through N-N links. However, if any of the nodes in $\Omega_7$ succeeds in correctly decoding the message, the message can be routed to any of the desired nodes in $\Omega_7$. If such receiver cooperation is employed, communication fails only when all four nodes $v_4$, $v_5$, $v_6$, and $v_7$ fail to decode the message at the same time.
We note that such a probability is $0.3 \times 0.4 \times 0.8 \times 0.9 = 0.0864 < 0.1$. For this reason, we say that a cooperative communication link between $\Omega_3$ and $\Omega_7$ is established.

In the above example, all nodes in the receiving cluster try to decode the transmitted message. However, if the size of the receiving cluster is large, the routing protocol and maintenance cost can become very burdensome. For this reason, we assume that a certain receiving node and its one-hop neighbors participate in the receiver cooperation. To be more precise, for a given pair of clusters, a certain node is selected from each cluster and the signal is assumed to be transmitted from either of these two nodes and then received by the other node and its one-hop neighbors.

We note that there exists a more aggressive method of receiver cooperation than the one described above. For example, the bridge node can achieve a huge combining gain if the helper nodes transmit observed soft information rather than decoded bits. However, the transmission of the observed data generally consumes a large amount of energy and bandwidth. Consequently, a sufficiently fine quantization must be considered to employ soft combining. Because this problem is highly complex, we assume in this paper that the helper nodes decode the message and deliver it to the bridge node. However, considering the importance of this problem, serious research employing soft combining schemes should be pursued.

For a more formal description, we consider two non-empty clusters $\Omega_l$ and $\Omega_m$ from the given graph $G = (V, E)$ defined in the NNTC scheme. We formally define the concept of an N-C link as follows.

Definition 1: Let $v_{b_l} \in \Omega_l$ and $v_{b_m} \in \Omega_m$. Then, we say that there exists a bi-directional N-C link, or simply, an N-C link denoted by $(v_{b_l}, N(v_{b_l}|L); v_{b_m}, N(v_{b_m}|L))_{NC}$ between $\Omega_l$ and $\Omega_m$, if and only if

$$\prod_{v_r \in \{v_{b_m}\} \cup N(v_{b_m}|L)} f\left(\gamma_{b_l r}(P_{b_l})\right) \le \alpha_\tau \tag{7}$$

and

$$\prod_{v_r \in \{v_{b_l}\} \cup N(v_{b_l}|L)} f\left(\gamma_{b_m r}(P_{b_m})\right) \le \alpha_\tau \tag{8}$$

for some $P_{b_l} \le P_{\max}$ and $P_{b_m} \le P_{\max}$.

Here, $L$ denotes the set of all N-N links described in Section III.
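The one-directional test in (7) or (8) reduces to a product of per-node error probabilities, which the following minimal sketch reproduces for the worked example with $\Omega_3$ and $\Omega_7$. It assumes, as the product form in Definition 1 implies, that decoding failures at different nodes are independent; the function names are illustrative only.

```python
import math

def joint_failure_probability(error_probs):
    """Probability that every cooperating receiver fails to decode."""
    return math.prod(error_probs)

def cooperative_direction_ok(error_probs, alpha_tau):
    """One direction of Definition 1: the joint failure must not exceed alpha_tau."""
    return joint_failure_probability(error_probs) <= alpha_tau

# Decoding error probabilities at v4, v5, v6, v7 from the example in the text.
probs = [0.3, 0.4, 0.8, 0.9]
print(joint_failure_probability(probs))               # approximately 0.0864
print(cooperative_direction_ok(probs, 0.1))           # cooperation: link established
print(cooperative_direction_ok([min(probs)], 0.1))    # best single node alone: no link
```

The last line shows why no N-N link exists in the example: even the best single receiver has an error probability of 0.3, well above the 0.1 threshold.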
In other words, all one-hop neighbors of the receiving node are assumed to participate in receiver cooperation regardless of whether they belong to $E$. We note that the error probability between helper and bridge node is assumed to be zero, as mentioned in Section III. For a given N-C link $(v_{b_l}, N(v_{b_l}|L); v_{b_m}, N(v_{b_m}|L))_{NC}$, nodes $v_{b_l}$ and $v_{b_m}$ and sets $N(v_{b_l}|L)$ and $N(v_{b_m}|L)$ are called the bridge nodes and helper sets, respectively.

In Definition 1, we note that the sum of the $P_{b_l}$ and $P_{b_m}$ values that satisfy (7) and (8) with equality is the minimum total transmission power required to make round-trip communication between $\Omega_l$ and $\Omega_m$ through $(v_{b_l}, N(v_{b_l}|L); v_{b_m}, N(v_{b_m}|L))_{NC}$. Because the sum $P_{b_l} + P_{b_m}$ depends on the choice of the N-C link, it is natural to choose the N-C link that minimizes the sum power $P_{b_l} + P_{b_m}$. The minimized sum power shall be referred to as the minimum N-C round-trip power and the corresponding N-C link as the minimum power N-C link between $\Omega_l$ and $\Omega_m$. We denote by $P_{NC}(\Omega_l, \Omega_m)$ the minimum N-C round-trip power between $\Omega_l$ and $\Omega_m$.

Fig. 3. Steps to construct the edge set $\mathcal{E}_{NC}$ for the given graph $G = (V, E)$. (a) Identification of all N-C links. (b) A typical example of a spanning forest of $G_{L_{NC}} = (\mathcal{V}, L_{NC})$.

We now describe how we establish communications between $\Omega_l$ and $\Omega_m$. First, let $v_{b_l} \in \Omega_l$ and $v_{b_m} \in \Omega_m$ be the bridge nodes of the minimum power N-C link between $\Omega_l$ and $\Omega_m$ and let $H_l$ and $H_m$ be the helper sets of the link. We now assume that a source node $v_s$ in $\Omega_l - \{v_{b_l}\}$ attempts to send a message to destination node $v_d$ in $\Omega_m - \{v_{b_m}\}$. In this case, $v_s$ sends the message to bridge node $v_{b_l}$ through cascaded N-N edges, and then bridge node $v_{b_l}$ transmits the message to $\Omega_m$. The message sent from $v_{b_l}$ is then decoded at bridge node $v_{b_m}$ and all the nodes in the helper set $H_m$.
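The power that satisfies (7) with equality generally has no closed form, but because $f$ is assumed to be monotonically decreasing and continuous, the joint failure probability decreases with the transmit power and a bisection search suffices. The sketch below is a hedged illustration, not the procedure used in the paper: each receiver $r$ is summarized by a hypothetical linear gain $g_r$ so that $\gamma_r = g_r P$, consistent with the linear dependence on $P$ in (2), and $f(\gamma) = e^{-\gamma}$ is again a stand-in error function.

```python
import math

def min_power_for_threshold(gains, alpha_tau, p_max, f, rel_tol=1e-12):
    """Smallest P <= p_max with prod_r f(g_r * P) <= alpha_tau, or None if none exists."""
    def joint_failure(p):
        return math.prod(f(g * p) for g in gains)

    if joint_failure(p_max) > alpha_tau:
        return None  # even maximum power cannot establish this direction
    lo, hi = 0.0, p_max
    while hi - lo > rel_tol * p_max:
        mid = 0.5 * (lo + hi)
        if joint_failure(mid) <= alpha_tau:
            hi = mid  # threshold met: try less power
        else:
            lo = mid  # threshold missed: more power is needed
    return hi

f = lambda gamma: math.exp(-gamma)  # hypothetical packet error function
gains = [2.0, 1.0, 0.5]             # hypothetical gains at the bridge node and two helpers
p = min_power_for_threshold(gains, 0.01, p_max=10.0, f=f)
print(p, -math.log(0.01) / sum(gains))  # bisection result next to the closed form for this f
```

Running the search once per direction and adding the two results gives the round-trip power of one candidate N-C link; the minimum over candidate bridge-node pairs would then play the role of $P_{NC}(\Omega_l, \Omega_m)$.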
Because of the definition of the N-C link, the message must be decoded, with negligible failure rate, at least at one node in $\{v_{b_m}\} \cup H_m$. Because $H_m$ consists only of the one-hop neighbors of $v_{b_m}$, the nodes that successfully decode the message can be determined by $v_{b_m}$ with little overhead. After determining the nodes that successfully decoded the message, $v_{b_m}$ delivers the message to target destination node $v_d$ through the cascaded N-N edges.

Finally, we describe how the edge set $\mathcal{E}_{NC}$ is constructed in the NCTC scheme. First, the minimum power N-C link is identified for each pair of clusters between which N-C links exist. Let $L_{NC}$ denote the set of the minimum power N-C links obtained as the result. For each $(v_{b_l}, H_l; v_{b_m}, H_m)_{NC} \in L_{NC}$, the weight is then defined as the corresponding minimum N-C round-trip power. After computing all the weights of $L_{NC}$, the sparse edge set $\mathcal{E}_{NC}$ is defined as the MSF of $G_{L_{NC}} = (\mathcal{V}, L_{NC})$. Note that the MSF construction procedure described in Section III can be directly applied here by substituting $V$ and $L$ with $\mathcal{V}$ and $L_{NC}$, respectively. In Fig. 3, the procedure is illustrated. For instance, Fig. 3(a) indicates all the minimum power N-C links between clusters by solid red lines and Fig. 3(b) illustrates the shape of a typical spanning forest that does not include any loops. Likewise, after finding all the spanning forests of $G_{L_{NC}} = (\mathcal{V}, L_{NC})$, the one that minimizes the sum weight is defined as the MSF $\mathcal{E}_{NC}$. After obtaining $\mathcal{E}_{NC}$, the desired final graph $G_{NC} = (V, E, \mathcal{V}, \mathcal{E}_{NC})$ for the NCTC scheme is constructed.

B. CCTC

In this subsection, we describe the CCTC scheme and explain how the network configuration $G_{CC} = (V, E, \mathcal{V}, \mathcal{E}_{CC})$ corresponding to the CCTC scheme is defined. We first explain the concept of a cluster-to-cluster (C-C) link and the related routing protocol with a simple example. We assume that source node $v_s \in \Omega_l$ attempts to send a message to destination node $v_d \in \Omega_m$. In this case, $v_s$ sends a message through cascaded N-N edges to a pre-defined bridge node $v_{b_l}$.
After receiving the message, $v_{b_l}$ disseminates the message to the nodes in a pre-defined helper set $H_l$. After decoding the message, $v_{b_l}$ and each $v_{h_l} \in H_l$ simultaneously transmit the message to $\Omega_m$ in the next time frame. In $\Omega_m$, a pre-defined bridge node $v_{b_m}$ and the nodes in a pre-defined helper set $H_m$ attempt to decode the message with the multiple signal replicas from the transmitters. If the maximum ratio combiner (MRC) [31] is employed at the receiving node $v_r \in \{v_{b_m}\} \cup H_m$, the combined average received SNR $\bar{\gamma}_r$ at $v_r$ can be written as

$$\bar{\gamma}_r = \gamma_{b_l r}(P_{b_l}) + \sum_{v_{h_l} \in H_l} \gamma_{h_l r}(P_{h_l}), \tag{9}$$

and the decoding error probability at $v_r$ is given as $f(\bar{\gamma}_r)$. To establish the symbol combining in (9), the same signals from the multiple transmitters should be received at the same time, as assumed in [13]. We note that problems related to time synchronization were discussed in Section II. Similarly to the case for N-C links, we say that the message is decodable, with negligible failure rate, at least at one node in $\{v_{b_m}\} \cup H_m$ if

$$\prod_{v_r \in \{v_{b_m}\} \cup H_m} f(\bar{\gamma}_r) \le \alpha_\tau \tag{10}$$

with small enough $\alpha_\tau$, where $f(\cdot)$ denotes the common packet error probability function for given received SNR, as defined in Section II. If the inequality (10) holds, we say that there exists a C-C link from $\Omega_l$ to $\Omega_m$. Once the message is decoded at nodes in $\{v_{b_m}\} \cup H_m$, the message is delivered to destination node $v_d$ through cascaded N-N edges to complete the routing procedure.

To maintain the C-C link power efficiently, it is necessary to choose appropriately the node pair $(v_{b_l}, v_{b_m})$, the helper set pair $(H_l, H_m)$, and the transmission power from each transmitting node to minimize the power consumption. However, the computational complexity makes such an optimization algorithm hardly feasible not only in practical systems but also in simulation environments [14]. For this reason, it is widely assumed that nodes participating in transmitter cooperation use the same power [15], [16].
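To make (9) and (10) concrete, the following sketch evaluates one direction of a C-C link for a small hypothetical configuration. The gain matrix, the powers, and the error function $f(\gamma) = e^{-\gamma}$ are all illustrative assumptions; `gain_matrix[r][t]` is taken to be the linear gain from transmitter $t$ to receiver $r$, so that each per-link SNR is a gain multiplied by a power.

```python
import math

def combined_snr(gains_to_receiver, powers):
    """MRC output SNR at one receiver as in (9): the sum of the per-transmitter SNRs."""
    return sum(g * p for g, p in zip(gains_to_receiver, powers))

def cc_direction_ok(gain_matrix, powers, alpha_tau, f):
    """Condition (10): joint failure over all receivers must not exceed alpha_tau."""
    failure = math.prod(f(combined_snr(row, powers)) for row in gain_matrix)
    return failure <= alpha_tau

f = lambda gamma: math.exp(-gamma)    # hypothetical packet error function
# Rows: bridge node and one helper on the receiving side.
# Columns: bridge node and one helper on the transmitting side.
gain_matrix = [[1.0, 0.5],
               [0.8, 0.4]]
print(cc_direction_ok(gain_matrix, [1.0, 1.0], 0.1, f))  # both transmitters active
print(cc_direction_ok(gain_matrix, [1.0, 0.0], 0.1, f))  # transmitting helper silent
```

In this toy configuration the threshold is met only when the transmitting helper participates, which is the effect that combining transmitter and receiver cooperation is meant to capture.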
Consequently, we adopt the same assumption when designing the CCTC scheme.

For a more formal description, we consider two non-empty clusters $\Omega_l$ and $\Omega_m$ from a given graph $G = (V, E)$. We define the concept of a C-C link in the following definition.

Definition 2: Let $v_{b_l} \in \Omega_l$, $v_{b_m} \in \Omega_m$, $H_l \subset N(v_{b_l}|L)$, and $H_m \subset N(v_{b_m}|L)$. Then, we say that there exists a bi-directional C-C link, or simply, a C-C link denoted by $(v_{b_l}, H_l; v_{b_m}, H_m)_{CC}$ between $\Omega_l$ and $\Omega_m$ if and only if

$$\prod_{v_r \in \{v_{b_m}\} \cup H_m} f\left(\gamma_{b_l r}(P_{c_l}) + \sum_{v_{h_l} \in H_l} \gamma_{h_l r}(P_{c_l})\right) \le \alpha_\tau \tag{11}$$

and

$$\prod_{v_r \in \{v_{b_l}\} \cup H_l} f\left(\gamma_{b_m r}(P_{c_m}) + \sum_{v_{h_m} \in H_m} \gamma_{h_m r}(P_{c_m})\right) \le \alpha_\tau \tag{12}$$

for some $P_{c_l} \le P_{\max}$ and $P_{c_m} \le P_{\max}$.

Here, $P_{c_l}$ and $P_{c_m}$ denote the common transmission powers of transmitting nodes in $\Omega_l$ and $\Omega_m$, respectively. For a given C-C link $(v_{b_l}, H_l; v_{b_m}, H_m)_{CC}$, the nodes $v_{b_l}$ and $v_{b_m}$ are called the bridge nodes and the sets $H_l$ and $H_m$ are called the helper sets between $\Omega_l$ and $\Omega_m$. This terminology is the same as that for N-C links. However, in the case of C-C links, the nodes in the helper set participate not only in receiver cooperation but also in transmitter cooperation.

In Definition 2, we note that the total transmission power minimally required to make round-trip communication between $\Omega_l$ and $\Omega_m$ is given by $(|H_l| + 1)P_{c_l} + (|H_m| + 1)P_{c_m}$ using the values for $P_{c_l}$ and $P_{c_m}$ that satisfy (11) and (12) with equality. Here, $|X|$ denotes the cardinality of set $X$. We also note that the required total transmission power $(|H_l| + 1)P_{c_l} + (|H_m| + 1)P_{c_m}$ varies depending on the choice of the C-C link. Consequently, it is natural to choose the C-C link that leads to the smallest total required transmission power. The smallest total required transmission power and the corresponding C-C link are referred to as the minimum C-C round-trip power and minimum power C-C link between $\Omega_l$ and $\Omega_m$, respectively. We denote the minimum C-C round-trip power between $\Omega_l$ and $\Omega_m$ by $P_{CC}(\Omega_l, \Omega_m)$.

We now describe how the edge set $\mathcal{E}_{CC}$ is constructed in the CCTC scheme. We note that the procedure for obtaining $\mathcal{E}_{CC}$ is essentially the same as that for obtaining $\mathcal{E}_{NC}$.
Therefore, we describe it only briefly. First, the set $L_{CC}$ of all the minimum power C-C links between clusters is identified. For each $(v_{b_l}, H_l; v_{b_m}, H_m)_{CC} \in L_{CC}$, the weight is defined as the corresponding minimum C-C round-trip power. After computing all the weights of $L_{CC}$, the sparse edge set $\mathcal{E}_{CC}$ is defined as the MSF of $G_{L_{CC}} = (\mathcal{V}, L_{CC})$. After obtaining $\mathcal{E}_{CC}$, the desired final graph $G_{CC} = (V, E, \mathcal{V}, \mathcal{E}_{CC})$ for the CCTC scheme is constructed.

Next, we briefly remark on the additional receiver processing costs required for the NCTC and CCTC schemes. Compared to the transmitter cooperative topology control scheme in [16], additional decoding power is required in the NCTC and CCTC schemes because of multiple-node decoding. This additional decoding increases not only the power consumption, but also the overall system complexity. Furthermore, each receiving helper node should report the received message decodability to the bridge node, which increases system overhead. There are some analytical studies on receiving power consumption [32], [33] and overhead [34] because it could be a critical issue in the case of ad-hoc networks. However, we note that the decoding power consumption and related overhead are heavily dependent on the receiving strategy. For example, one can choose a receiving strategy in which the receiving helper nodes decode the message in the order of channel conditions until a successful decoding node appears. In this case, the average decoding power consumption and system complexity can be reduced. In addition, the search for the optimal receiving strategy is highly non-trivial and requires serious and independent study. However, despite its importance, in this primary effort on topology control, we do not consider such issues any further to keep the problem tractable.

Finally, we briefly consider the impact of mobility on the proposed topology control schemes. Unfortunately, the proposed schemes are basically inapplicable except when the mobility is very low.
When a node moves, three situations can happen. First, in some situations in which only minor movement is involved, there may be no changes in the network topology except for the configurations inside the cluster to which the moved node belongs. Second, in other situations, the cluster to which the moved node originally belonged must be divided into more than one cluster. Finally, in still other situations, some clusters could be unified into one cluster by the N-N links newly defined by the node movement. In the first case, the mobility problem is relatively simple. If the moved node is not a bridge or helper node, the moved node could be simply attached to the nearby cluster. On the other hand, if the moved node is a bridge or helper node, the bridge and/or helper nodes of the corresponding cooperative link are changed to one of the alternatives among the pre-stored alternative bridge and helper nodes. However, if there is no alternative bridge and/or helper node or if the second or the third situation occurs, clusters and cooperative edges should be redefined. In addition, if several nodes move at the same time, the second and third situations may happen more frequently, and this is why the proposed schemes are applicable only when the mobility is very low.

V. PERFORMANCE EVALUATION AND NUMERICAL RESULTS

In this section, we analyze through simulations the performance of the two proposed centralized topology control schemes, namely, the NCTC and CCTC schemes, and compare them to the NNTC scheme and the cooperative topology control scheme in [16] that is based solely on transmitter cooperation. For convenience, we call the topology control scheme in [16] the cluster-to-node topology control (CNTC) scheme.
To the best of our knowledge, the CNTC scheme achieves the highest connectivity with a power requirement that is only marginally greater than that of other existing topology control schemes. In this section, we show that the proposed NCTC scheme provides better energy efficiency with marginal connectivity loss and the CCTC scheme allows both better energy efficiency and higher connectivity than the CNTC scheme.

TABLE I. SIMULATION CONFIGURATION PARAMETERS

A. Simulation Configuration

The system performance is evaluated through simulations in this paper. Although analytic evaluation is generally more desirable, the performance of topology control schemes is very hard to analyze. To the best of our knowledge, only some analytical results have been obtained for the case of non-cooperative communications among an infinite number of nodes [35], [36], and previous studies [13]–[16] on cooperative topology control schemes have only been evaluated through numerical simulations. For this reason, we study the performance through simulations. However, we provide partial analytical reasoning whenever possible. Furthermore, to improve the value of the results, we reflect practical situations as much as possible in the simulation configuration by employing channel parameters based on actual field measurement [22] and the design parameters in the 3GPP standard [37].

To describe the system configuration used for performance evaluation, we need to specify the values of various parameters, which we divide into two categories: channel parameters and system design parameters. The channel parameters include the reference path loss $PL_{d_0}$, path loss exponent $k$, shadowing random variable $X_\sigma$, offset correction factor $c$, and noise power spectral density $N_{0,i}$. First, we assumed that $N_{0,i}$, $i = 1, \ldots, n$, were identically given as −174 dBm/Hz, the noise power spectral density at room temperature.
For the other channel parameters $PL_{d_0}$, $k$, $X_\sigma$, and $c$, we consider two sets of values, given in Table I, that represent suburban and urban scenarios [22].

The system design parameters considered in this section are the number of nodes $n$, simulation area $A$, error threshold $\alpha_\tau$, packet error function $f$, and maximum transmit power $P_{\max}$. Parameters $n$ and $A$ are closely related to the node density, which determines the number of nodes participating in the cooperation. Therefore, we varied $n$ and $A$ to observe how the performance is influenced by the node density. The choice of error function $f$ depends on the error correction coding scheme employed. In this study, we assume that a convolutional code with a constraint length of two is used as the error correction coding scheme with a packet length of 1,024 [38]. Hence, we used the actual packet error rate obtained through extensive simulations with the aforementioned convolutional code for the packet error function $f$. For the choice of $\alpha_\tau$, we used $10^{-2}$, a value often adopted as the target packet error rate in many situations. Finally, we assumed that the node power $P_i$ is limited by $P_{\max} = 250$ mW, and $P_i$ is uniformly distributed over a 10 MHz bandwidth. Detailed values of the above channel and system parameters are summarized in Table I.

B. Connectivity

To compare the level of performance achievable with the proposed topology control schemes, we first consider a metric called connectivity to measure the average proportion of nodes connected to a node. Before proceeding with the formal definition of the connectivity metric, we observe that the performance of a given topology control scheme depends not only on the values of $n$ and $A$ but also on the distribution of these $n$ nodes over area $A$. For this reason, we assume that $n\,(\ge 2)$ nodes are randomly and uniformly distributed over a given area $A$ in the following discussion.

To formally define connectivity, we first denote the set of all nodes connected to node $v_i$ by $R(v_i)$.
We note that the set $R(v_i)$ depends on the choice of topology control scheme. For instance, in the NNTC scheme, $R(v_i)$ is the set of all nodes connected to $v_i$ by N-N edges. On the other hand, in a cooperative topology control scheme, $R(v_i)$ consists of all the nodes that are connected through cascaded N-N and cascaded cooperative edges. Therefore, the connectivity $\Gamma$ (of a given topology control scheme) is defined as

$$\Gamma = \frac{1}{n} E\left[\sum_{i=1}^{n} \frac{|R(v_i)|}{n-1}\right], \tag{13}$$

where $|R(v_i)|$ denotes the cardinality of $R(v_i)$. Here, the expectation $E[\cdot]$ has been taken because the cardinality $|R(v_i)|$ depends on how the nodes are distributed over a given area. We note that $|R(v_i)|/(n-1)$ is the proportion of nodes that are connected to $v_i$ and hence $\Gamma$ is the expected value of its arithmetic mean. For notational convenience, the connectivities of the CCTC, NCTC, NNTC, and CNTC schemes are denoted by $\Gamma_{CC}$, $\Gamma_{NC}$, $\Gamma_{NN}$, and $\Gamma_{CN}$, respectively.

In Fig. 4, the connectivity for various topology control schemes is shown as a function of the number of nodes $n$ for three different areas and two different environments. Most importantly, we observe that $\Gamma_{CC} \ge \Gamma_{CN} \ge \Gamma_{NC} \ge \Gamma_{NN}$ for all values of $n$ and $A$ and for any environment considered. We clearly see that either transmitter or receiver cooperation improves connectivity. The fact that the CCTC scheme achieves the highest connectivity is hardly surprising; hence, what we actually need to observe is how the NCTC and CNTC schemes perform in comparison to it. In particular, since $\Gamma_{CN} \ge \Gamma_{NC}$, we conclude that transmitter cooperation is more effective than receiver cooperation at achieving connectivity.

Fig. 4. Connectivity as a function of the number of nodes for various topology control schemes in various communication environments. (a) Urban. (b) Suburban.

C. Power Consumption

So far, we have observed that the CCTC scheme achieves the highest connectivity and that the connectivity gap between the CNTC and CCTC schemes is not large.
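Returning briefly to the connectivity metric in (13): for a single node placement, the quantity inside the expectation can be computed directly from the cluster sizes, since every node in a group of $s$ mutually connected nodes has $|R(v_i)| = s - 1$. The sketch below evaluates one realization under that reading; averaging the result over many random placements would approximate the expectation. The function name and the use of the Fig. 1 clusters as input are illustrative choices.

```python
def connectivity_one_realization(n, groups):
    """(1/n) * sum_i |R(v_i)| / (n - 1) for one placement of n >= 2 nodes.

    groups lists the sets of mutually connected nodes: clusters for NNTC,
    or clusters merged by cooperative edges for a cooperative scheme.
    """
    total = sum(len(g) * (len(g) - 1) for g in groups)
    return total / (n * (n - 1))

# NNTC clusters from Fig. 1: {v1, v2, v3}, {v4, v5, v6, v7}, {v8}.
fig1 = [{1, 2, 3}, {4, 5, 6, 7}, {8}]
print(connectivity_one_realization(8, fig1))

# If a cooperative edge joined the first two clusters, they would act as one group.
merged = [{1, 2, 3, 4, 5, 6, 7}, {8}]
print(connectivity_one_realization(8, merged))
```

For Fig. 1 this gives 18/56, roughly 0.32, and joining the two larger clusters raises it to 42/56 = 0.75, which illustrates how a single cooperative edge can change the metric substantially.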
In fact, it is not more than 8% in most cases. Consequently, the CNTC scheme is a good alternative to the CCTC scheme if we consider connectivity only. However, the CNTC scheme is not as efficient as the CCTC scheme in terms of power consumption. Before proceeding with the analysis of power consumption, we define $\hat{E}_{CC}$ to be the set of cluster pairs corresponding to the edges in $E_{CC}$. In other words, $(\Omega_l, \Omega_m) \in \hat{E}_{CC}$ if and only if the edge set $E_{CC}$ contains the C-C edge between $\Omega_l$ and $\Omega_m$. In a similar way, we denote the sets of the cluster pairs corresponding to edges in $E_{NC}$ and $E_{CN}$ by $\hat{E}_{NC}$ and $\hat{E}_{CN}$, respectively.

To quantitatively compare the power consumption of the CCTC and CNTC schemes, we now consider the following two quantities

$$\bar{P}_{CC} = \frac{1}{n} E\left[ \sum_{\pi \in \hat{E}_{CC} \cap \hat{E}_{CN}} P_{CC}(\pi) \right] \tag{14}$$

and

$$\bar{P}_{CN} = \frac{1}{n} E\left[ \sum_{\pi \in \hat{E}_{CC} \cap \hat{E}_{CN}} P_{CN}(\pi) \right], \tag{15}$$

where $P_{CN}(\pi)$ denotes the minimum C-N round-trip power between the pair $\pi$ of clusters, similarly to $P_{CC}(\pi)$ and $P_{NC}(\pi)$ as defined in Section IV. We note that these quantities represent the average power required per node to establish cooperative edges between clusters in $\hat{E}_{CC} \cap \hat{E}_{CN}$. Consequently, by comparing $\bar{P}_{CC}$ and $\bar{P}_{CN}$, we intend to compare the power required for the CCTC and CNTC schemes to establish common cooperative edges.

Before proceeding with the evaluation of $\bar{P}_{CC}$ and $\bar{P}_{CN}$, we first note that the two sets $\hat{E}_{CC} - \hat{E}_{CN}$ and $\hat{E}_{CN} - \hat{E}_{CC}$ of cluster pairs are not necessarily empty. Because the CCTC scheme employs receiver cooperation in addition to transmitter cooperation, it appears reasonable to expect $\hat{E}_{CC} - \hat{E}_{CN}$ to contain a sizable number of cooperative edges and $\hat{E}_{CN} - \hat{E}_{CC}$ to be empty. In fact, the average number of elements in $\hat{E}_{CC} - \hat{E}_{CN}$ reaches as much as 25% of that of $\hat{E}_{CC} \cap \hat{E}_{CN}$ in many situations. However, interestingly, $\hat{E}_{CN} - \hat{E}_{CC}$ is not necessarily empty. This is because of the MSF algorithm, which removes some redundant links.
In other words, in the CCTC scheme, some links used in the CNTC scheme are eliminated by the MSF algorithm in some rare situations. From our numerical analysis, we found that the average cardinality of $\hat{E}_{CN} - \hat{E}_{CC}$ sometimes reaches as much as 8% of that of $\hat{E}_{CC} \cap \hat{E}_{CN}$. However, in most cases, the set $\hat{E}_{CN} - \hat{E}_{CC}$ is empty and hence $\hat{E}_{CC} \cap \hat{E}_{CN}$ is the same as $\hat{E}_{CN}$.

Fig. 5(a) illustrates how the values of $\bar{P}_{CC}$ and $\bar{P}_{CN}$ change as a function of the number of nodes $n$. We note that $\bar{P}_{CC}$ first increases as $n$ increases and then decreases after $n$ reaches a certain value. A similar tendency can be found in $\bar{P}_{CN}$. To explain this non-monotonic behavior of $\bar{P}_{CC}$ and $\bar{P}_{CN}$, we define two quantities

$$F_{CC} = \frac{E\left[ \sum_{\pi \in \hat{E}_{CC} \cap \hat{E}_{CN}} P_{CC}(\pi) \right]}{E\left[ |\hat{E}_{CC} \cap \hat{E}_{CN}| \right]} \tag{16}$$

and

$$F_{CN} = \frac{E\left[ \sum_{\pi \in \hat{E}_{CC} \cap \hat{E}_{CN}} P_{CN}(\pi) \right]}{E\left[ |\hat{E}_{CC} \cap \hat{E}_{CN}| \right]}, \tag{17}$$

to describe the average power consumed to establish a C-C link and a C-N link, respectively. As a result, $\bar{P}_{CC}$ and $\bar{P}_{CN}$ can be rewritten as

$$\bar{P}_{CC} = \frac{1}{n} \cdot F_{CC} \cdot N \tag{18}$$

and

$$\bar{P}_{CN} = \frac{1}{n} \cdot F_{CN} \cdot N, \tag{19}$$

where $N = E[|\hat{E}_{CC} \cap \hat{E}_{CN}|]$.

MOON et al.: RECEIVER COOPERATION IN TOPOLOGY CONTROL FOR WIRELESS AD-HOC NETWORKS

Fig. 5. The average additional power required per node to establish cooperative edges in CCTC and CNTC schemes. (a) $\bar{P}_{CN}$ and $\bar{P}_{CC}$. (b) $\bar{P}_{CN}$ over $\bar{P}_{CC}$.

While we cannot provide a fully analytical description of the quantities $\bar{P}_{CC}$ and $\bar{P}_{CN}$, which is very difficult, it is meaningful to consider their qualitative behaviors. First, we note that the quantities $F_{CC}$ and $F_{CN}$ are mainly affected by the distance between clusters. It is natural to expect that the average cluster-to-cluster distance will decrease with an increased number of nodes $n$. However, the average cluster-to-cluster distance decreases as a very slowly varying function of $n$, particularly after $n$ reaches a certain critical value. This is because two clusters are merged into one if the distance between them becomes too small. As a consequence, $F_{CC}$ and $F_{CN}$ decrease very slowly as $n$ increases.
For example, the minimum observed value of $F_{CC}$ was only about 25% lower than the maximum observed value in the simulation performed for an urban 2 × 2 km situation where $n$ ranged from 10 to 100. Because the quantities $F_{CC}$ and $F_{CN}$ are relatively unaffected by the variation of $n$, the behaviors of $\bar{P}_{CC}$ and $\bar{P}_{CN}$ can possibly be accounted for by the behavior of the average number of elements $N$ in $\hat{E}_{CC} \cap \hat{E}_{CN}$, which, in fact, varies very significantly as $n$ varies. When the node density is sufficiently low, $N$ increases as $n$ increases, since an increased $n$ results in an increased number of clusters and then in an increased number of edges. However, when the node density is high enough, adding nodes no longer makes the number of clusters larger because the addition of nodes now results in cluster unification. For this reason, $N$ first increases up to a certain critical value of $n$ and then decreases again as $n$ grows further. However, it is very difficult to predict the behavior of $N$ in a fully analytical manner, since $N$ depends on too many factors such as node distribution, channel and fading models, error probability function, and so on. As far as we know, only a few analytical results [35], [36] have been derived for non-cooperative communications with an infinite number of nodes and none for general cases or cooperative environments.

We now discuss the simulation results of comparing $\bar{P}_{CC}$ and $\bar{P}_{CN}$. Because $F_{CC}$ and $F_{CN}$ vary slowly as functions of $n$, the variations of $\bar{P}_{CC}$ and $\bar{P}_{CN}$ are dominantly determined by $1/n$ and $N$. When $n = 10$, $N$ is almost zero since only a very small number of clusters exist and they are located too far apart. As $n$ increases up to a certain value, the number of clusters increases so that the chance of cooperative communication also increases. In this region, $N$ grows faster than $n$; therefore, $\bar{P}_{CC}$ becomes larger.
On the other hand, if $n$ exceeds a certain value, the number of clusters decreases, and eventually, it goes to one. Therefore, $N$ quickly converges to zero with growing $n$, and this is why $\bar{P}_{CC}$ decreases. In Fig. 5(a), we next observe that $\bar{P}_{CC}$ is always smaller than $\bar{P}_{CN}$. To quantify the difference between the two values, we illustrate the values of $\bar{P}_{CN}/\bar{P}_{CC}$ in Fig. 5(b), where $\bar{P}_{CN}$ is about 10–100% larger than $\bar{P}_{CC}$. From this figure, we clearly see that the CCTC scheme requires significantly less power than the CNTC scheme to establish the same cooperative edges.

Here, the question arises as to how the NCTC scheme compares to the CCTC scheme in terms of power consumption. First, we can compare the amount of power required for the CCTC and NCTC schemes to establish common cooperative edges. In the similar comparison in Fig. 5, we noted that $\bar{P}_{CC}$ is significantly smaller than $\bar{P}_{CN}$. However, in the case of the CCTC and NCTC schemes, there is virtually no difference between the powers required to establish common cooperative edges. This is related to the assumption that the nodes participating in the cooperative transmission use the same transmission power as in the CCTC scheme. Because of this constraint on the transmission power, in almost all cases only one node is selected to transmit signals, even in the CCTC scheme, whenever the cooperative edge is contained in both $\hat{E}_{CC}$ and $\hat{E}_{NC}$.

Fig. 6. The relative amount of power required to establish one more additional cooperative edge with the CCTC scheme in comparison with the NCTC and CNTC schemes. (a) Urban. (b) Suburban.
Therefore, it can be said that the NCTC scheme is almost as efficient as the CCTC scheme in terms of power consumption. Consequently, if connectivity is of lower priority than power consumption, or if the connectivities of CCTC and NCTC are almost the same because of a very high node density, the NCTC scheme can be considered a good alternative to the CCTC scheme. This is particularly so because the average power required to establish a cooperative edge in $\hat{E}_{CC} - \hat{E}_{NC}$ is significantly larger, in many cases, than the power required to establish a cooperative edge in $\hat{E}_{NC}$.

To illustrate this, we consider the metric $\rho^{CC}_{NC}$ defined as

$$\rho^{CC}_{NC} = \frac{D^{CC}_{NC}}{K_{NC}} \tag{20}$$

in which

$$D^{CC}_{NC} = \frac{E\left[ \sum_{\pi \in \hat{E}_{CC} - \hat{E}_{NC}} P_{CC}(\pi) \right]}{E\left[ |\hat{E}_{CC} - \hat{E}_{NC}| \right]} \tag{21}$$

and

$$K_{NC} = \frac{E\left[ \sum_{\pi \in \hat{E}_{NC}} P_{NC}(\pi) \right]}{E\left[ |\hat{E}_{NC}| \right]}. \tag{22}$$

We note that $D^{CC}_{NC}$ denotes the power required to establish one C-C link that cannot be established in the NCTC scheme and that $K_{NC}$ is the power consumption required for one N-C link. Consequently, the metric $\rho^{CC}_{NC}$ measures the relative amount of power required to establish one more additional cooperative edge using the CCTC scheme in comparison to the NCTC scheme. In a similar manner, we define the metric $\rho^{CC}_{CN}$ by

$$\rho^{CC}_{CN} = \frac{E\left[ \sum_{\pi \in \hat{E}_{CC} - \hat{E}_{CN}} P_{CC}(\pi) \right]}{E\left[ |\hat{E}_{CC} - \hat{E}_{CN}| \right]} \div \frac{E\left[ \sum_{\pi \in \hat{E}_{CN}} P_{CN}(\pi) \right]}{E\left[ |\hat{E}_{CN}| \right]} \tag{23}$$

$$= \frac{D^{CC}_{CN}}{K_{CN}} \tag{24}$$

to quantify the relative amount of power required to establish one more additional cooperative edge using the CCTC scheme in comparison to the CNTC scheme.

In Fig. 6, we plot $\rho^{CC}_{NC}$ and $\rho^{CC}_{CN}$ as functions of $n$. Here, we first observe that the numerical values of $\rho^{CC}_{NC}$ and $\rho^{CC}_{CN}$ are around 3 and 1.2, respectively, for all cases considered. We note that, as mentioned in the explanation of Fig. 5, the power consumed to establish a single cooperative link decreases with growing $n$ so that $D^{CC}_{NC}$, $D^{CC}_{CN}$, $K_{NC}$, and $K_{CN}$ are all decreasing functions of $n$.
In addition, we note that the power required to establish a cooperative link is mainly affected by the number of transmitting nodes and the transmitting power of each node. We also note that the cooperative link between two clusters is established by only a small number of nodes located near the boundary of each cluster, even when the cluster size is very large. This means that the number of transmitting nodes is almost constant, regardless of $n$. Therefore, the rate of decreasing power consumption is primarily affected by the transmitting power of each node, which is closely related to the distance between clusters. Because the configuration of clusters is identically given by the NNTC scheme, as $n$ increases, the decreasing rate of the power required to establish cooperative links is relatively similar for all three cooperative schemes, namely, the NCTC, CNTC, and CCTC schemes. For this reason, the ratios $D^{CC}_{NC}/K_{NC}$ and $D^{CC}_{CN}/K_{CN}$ remain roughly the same regardless of the value of $n$.

We next observe that the values of $\rho^{CC}_{NC}$, plotted with solid purple lines, are always around three. This means that to establish an edge that cannot be established in the NCTC scheme, the CCTC scheme requires about three times the power required to establish an edge in the NCTC scheme, regardless of the scenario and node density considered. Combining this result with the connectivity result in Fig. 4, we gain an important insight into the system design. When $n = 50$, the connectivity of the CCTC scheme is almost twice that of the NCTC scheme. Therefore, a three-fold increase in power consumption could be a reasonable choice if connectivity is of the highest priority. However, when $n = 100$, by employing the CCTC scheme, one would achieve a 0.13% increase in connectivity, but three times more power would still be required.
Therefore, some system designers may prefer the NCTC scheme to the CCTC scheme, for instance, where power efficiency is of the highest priority or connectivity increase is not an issue. In contrast, $\rho^{CC}_{CN}$, plotted with dotted green lines, is about 1.2 in all cases. This means that only 20% more power is required to add a new cooperative edge using the CCTC scheme that cannot be established in the CNTC scheme. Consequently, one can replace the CNTC scheme with the CCTC scheme without a serious power consumption burden, regardless of node density.

VI. CONCLUSION

In this paper, we proposed to employ receiver cooperation in topology control to improve energy efficiency as well as network connectivity. In particular, we proposed two centralized topology control schemes, one based solely on receiver cooperation, and the other based on both transmitter and receiver cooperation. For comparison, we also considered a topology control scheme that is based solely on transmitter cooperation. By extensive simulation, we showed that we can improve both connectivity and energy efficiency if we employ receiver cooperation in addition to transmitter cooperation. Consequently, it is generally more desirable to employ both receiver and transmitter cooperation than to employ transmitter cooperation only. We also showed that the increase in network connectivity by employing transmitter cooperation in addition to receiver cooperation is at the expense of significantly increased energy consumption. For this reason, we conclude that the system based only on receiver cooperation could prove to be a good alternative to one based on both receiver and transmitter cooperation, if energy efficiency is of the highest priority or the increase in connectivity is no longer of serious concern.
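
As a rough illustration of the connectivity metric in (13), the sketch below estimates the expected proportion of reachable nodes by Monte Carlo simulation over uniformly placed nodes. It is not the simulation setup of this paper: two nodes are joined whenever they lie within a fixed `radius` of each other (a plain disc model assumed here in place of the power and packet-error criteria of the N-N and cooperative edges), and the function names and default values are hypothetical.

```python
import math
import random


def reachable_counts(points, radius):
    """For each node, count the other nodes reachable over multi-hop paths."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Assumed edge rule: join two nodes if they are within `radius`.
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) <= radius:
                parent[find(i)] = find(j)

    component_size = {}
    for i in range(n):
        root = find(i)
        component_size[root] = component_size.get(root, 0) + 1
    # |R(v_i)| is the size of the component of v_i, excluding v_i itself.
    return [component_size[find(i)] - 1 for i in range(n)]


def connectivity_sample(points, radius):
    """Inner term of (13) for one placement: mean of |R(v_i)| / (n - 1)."""
    n = len(points)
    return sum(r / (n - 1) for r in reachable_counts(points, radius)) / n


def estimate_connectivity(n, side, radius, trials=200, seed=0):
    """Average the per-placement value over random uniform placements."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        points = [(rng.uniform(0, side), rng.uniform(0, side)) for _ in range(n)]
        total += connectivity_sample(points, radius)
    return total / trials
```

Calling, for example, `estimate_connectivity(50, 2.0, 0.3)` gives a value between 0 and 1 for 50 nodes in a 2 × 2 square; swapping the disc rule for a scheme-specific edge test would be the natural way to compare schemes under the same placements.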

Real-Time Path Planning Based on Hybrid-VANET-Enhanced Transportation System

05/08/201902/07/2019 by admin

Abstract: Social networks have been recently employed as a source of information for event detection, with particular reference to road traffic congestion and car accidents. In this paper, we present a real-time monitoring system for traffic event detection from Twitter stream analysis. The system fetches tweets from Twitter according to several search criteria; processes tweets, by applying text mining techniques; and finally performs the classification of tweets. The aim is to assign the appropriate class label to each tweet, as related to a traffic event or not. The traffic detection system was employed for real-time monitoring of several areas of the Italian road network, allowing for detection of traffic events almost in real time, often before online traffic news web sites. We employed the support vector machine as a classification model, and we achieved an accuracy value of 95.75% by solving a binary classification problem (traffic versus nontraffic tweets). We were also able to discriminate if traffic is caused by an external event or not, by solving a multiclass classification problem and obtaining an accuracy value of 88.89%.

Index Terms: Traffic event detection, tweet classification, text mining, social sensing.

I. INTRODUCTION

Social network sites, also called micro-blogging services (e.g., Twitter, Facebook, Google+), have spread in recent years, becoming a new kind of real-time information channel. Their popularity stems from the characteristics of portability thanks to several social network applications for smartphones and tablets, ease of use, and real-time nature [1], [2]. People intensely use social networks to report (personal or public) real-life events happening around them or simply to express their opinion on a given topic, through a public message. Social networks allow people to create an identity and let them share it in order to build a community.
The resulting social network is then a basis for maintaining social relationships, finding users with similar interests, and locating content and knowledge entered by other users [3].

Manuscript received July 2, 2014; revised October 7, 2014 and December 16, 2014; accepted February 10, 2015. Date of publication March 10, 2015; date of current version July 31, 2015. This work was carried out in the framework of and was supported by the SMARTY project, funded by “Programma Operativo Regionale (POR) 2007–2013”, objective “Competitività regionale e occupazione” of the Tuscany Region. The Associate Editor for this paper was Q. Zhang.
E. D’Andrea is with the Research Center “E. Piaggio,” University of Pisa, 56122 Pisa, Italy (e-mail: eleonora.dandrea@for.unipi.it).
P. Ducange is with the Faculty of Engineering, eCampus University, 22060 Novedrate, Italy (e-mail: pietro.ducange@uniecampus.it).
B. Lazzerini and F. Marcelloni are with the Dipartimento di Ingegneria dell’Informazione, University of Pisa, 56122 Pisa, Italy (e-mail: b.lazzerini@iet.unipi.it; f.marcelloni@iet.unipi.it).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TITS.2015.2404431

The user message shared in social networks is called Status Update Message (SUM), and it may contain, apart from the text, meta-information such as timestamp, geographic coordinates (latitude and longitude), name of the user, links to other resources, hashtags, and mentions. Several SUMs referring to a certain topic or related to a limited geographic area may provide, if correctly analyzed, a great deal of valuable information about an event or a topic.
In fact, we may regard social network users as social sensors [4], [5], and SUMs as sensor information [6], as it happens with traditional sensors.

Recently, social networks and media platforms have been widely used as a source of information for the detection of events, such as traffic congestion, incidents, natural disasters (earthquakes, storms, fires, etc.), or other events. An event can be defined as a real-world occurrence that happens in a specific time and space [1], [7]. In particular, regarding traffic-related events, people often share by means of an SUM information about the current traffic situation around them while driving. For this reason, event detection from social networks is also often employed with Intelligent Transportation Systems (ITSs). An ITS is an infrastructure which, by integrating ICTs (Information and Communication Technologies) with transport networks, vehicles and users, allows improving safety and management of transport networks. ITSs provide, e.g., real-time information about weather, traffic congestion or regulation, or plan efficient (e.g., shortest, fast driving, least polluting) routes [4], [6], [8]–[14].

However, event detection from social networks analysis is a more challenging problem than event detection from traditional media like blogs, emails, etc., where texts are well formatted [2]. In fact, SUMs are unstructured and irregular texts; they contain informal or abbreviated words, misspellings or grammatical errors [1]. Due to their nature, they are usually very brief, thus becoming an incomplete source of information [2]. Furthermore, SUMs contain a huge amount of not useful or meaningless information [15], which has to be filtered. According to Pear Analytics,¹ it has been estimated that over 40% of all Twitter² SUMs (i.e., tweets) are pointless with no useful information for the audience, as they refer to the personal sphere [16].
For all of these reasons, in order to analyze the information coming from social networks, we exploit text mining techniques [17], which employ methods from the fields of data mining, machine learning, statistics, and Natural Language Processing (NLP) to extract meaningful information [18].

¹ http://www.pearanalytics.com/, 2009.
² https://twitter.com.
1524-9050 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 16, NO. 4, AUGUST 2015

More in detail, text mining refers to the process of automatic extraction of meaningful information and knowledge from unstructured text. The main difficulty encountered in dealing with problems of text mining is caused by the vagueness of natural language. In fact, people, unlike computers, are perfectly able to understand idioms, grammatical variations, slang expressions, or to contextualize a given word. On the contrary, computers have the ability, lacking in humans, to quickly process large amounts of information [19], [20].

The text mining process is summarized in the following. First, the information content of the document is converted into a structured form (vector space representation). In fact, most text mining techniques are based on the idea that a document can be faithfully represented by the set of words contained in it (bag-of-words representation [21]). According to this representation, each document $j$ of a collection of documents is represented as an $M$-dimensional vector $V_j = \{w(t_{j1}), \ldots, w(t_{ji}), \ldots, w(t_{jM})\}$, where $M$ is the number of words defined in the document collection, and $w(t_{ji})$ specifies the weight of the word $t_i$ in document $j$. The simplest weighting method assigns a binary value to $w(t_{ji})$, thus indicating the absence or the presence of the word $t_i$, while other methods assign a real value to $w(t_{ji})$.
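
The bag-of-words representation described above can be sketched as follows. This is only an illustration: the whitespace split, the function names, and the choice of relative term frequency as the real-valued weight are assumptions, not the weighting adopted by the system.

```python
from collections import Counter


def build_vocabulary(documents):
    """Collect the M distinct words of the collection in a fixed (sorted) order."""
    return sorted({word for doc in documents for word in doc.split()})


def binary_vector(document, vocabulary):
    """Binary weighting: w(t_ji) = 1 if word t_i occurs in document j, else 0."""
    words = set(document.split())
    return [1 if term in words else 0 for term in vocabulary]


def term_frequency_vector(document, vocabulary):
    """One possible real-valued weighting: relative frequency of each word."""
    tokens = document.split()
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[term] / total for term in vocabulary]


docs = ["traffico intenso sulla tangenziale", "coda sulla a1"]
vocab = build_vocabulary(docs)
vectors = [binary_vector(doc, vocab) for doc in docs]
```

Every document maps to a vector of the same length $M$ (here the six distinct words of the two example texts), which is what allows a standard classifier to operate on texts of different lengths.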
During the text mining process, several operations can be performed [21], depending on the specific goal, such as: i) linguistic analysis through the application of NLP techniques, indexing and statistical techniques, ii) text filtering by means of specific keywords, iii) feature extraction, i.e., conversion of textual features (e.g., words) into numeric features (e.g., weights) that a machine learning algorithm is able to process, and iv) feature selection, i.e., reduction of the number of features in order to take into account only the most relevant ones. Feature selection is particularly important, since one of the main problems in text mining is the high dimensionality of the feature space $\mathbb{R}^M$. Then, data mining and machine learning algorithms (e.g., support vector machines (SVMs), decision trees, neural networks, etc.) are applied to the documents in the vector space representation, to build classification, clustering or regression models. Finally, the results obtained by the model are interpreted by means of measures of effectiveness (e.g., statistical-based measures) to verify the accuracy achieved. Additionally, the obtained results may be improved, e.g., by modifying the values of the parameters used and repeating the whole process.

Among social network platforms, we took into account Twitter, as the majority of works in the literature regarding event detection focus on it. Twitter is nowadays the most popular micro-blogging service; it counts more than 600 million active users,³ sharing more than 400 million SUMs per day [1]. Regarding the aim of this paper, Twitter has several advantages over similar micro-blogging services. First, tweets are up to 140 characters, enhancing the real-time and news-oriented nature of the platform. In fact, the lifetime of tweets is usually very short, thus Twitter is the social network platform that is best suited to study SUMs related to real-time events [22].

³ http://www.statisticbrain.com/twitter-statistics
Second, each tweet can be directly associated with meta-information that constitutes additional information. Third, Twitter messages are public, i.e., they are directly available with no privacy limitations. For all of these reasons, Twitter is a good source of information for real-time event detection and analysis.

In this paper, we propose an intelligent system, based on text mining and machine learning algorithms, for real-time detection of traffic events from Twitter stream analysis. The system, after a feasibility study, has been designed and developed from the ground up as an event-driven infrastructure, built on a Service Oriented Architecture (SOA) [23]. The system exploits available technologies based on state-of-the-art techniques for text analysis and pattern classification. These technologies and techniques have been analyzed, tuned, adapted, and integrated in order to build the intelligent system. In particular, we present an experimental study, which has been performed for determining the most effective among different state-of-the-art approaches for text classification. The chosen approach was integrated into the final system and used for the on-the-field real-time detection of traffic events.

The paper has the following structure. Section II summarizes related work about event detection from Twitter stream analysis. Section III outlines the architecture of the proposed system for traffic detection, by describing the methodology used to collect, elaborate, and classify SUMs, with particular reference to SUMs extracted from the Twitter stream. Section IV describes the setup of the system. Section V presents the results achieved with different classification models and provides a comparison with similar works in the literature. Section VI presents the real-world monitoring application for real-time detection of traffic events. Finally, Section VII provides concluding remarks.

II. RELATED WORK

With reference to current approaches for using social media to extract useful information for event detection, we need to distinguish between small-scale events and large-scale events. Small-scale events (e.g., traffic, car crashes, fires, or local manifestations) usually have a small number of SUMs related to them, belong to a precise geographic location, and are concentrated in a small time interval. On the other hand, large-scale events (e.g., earthquakes, tornados, or the election of a president) are characterized by a huge number of SUMs, and by a wider temporal and geographic coverage [24]. Consequently, due to the smaller number of SUMs related to small-scale events, small-scale event detection is a non-trivial task. Several works in the literature deal with event detection from social networks. Many works deal with large-scale event detection [6], [25]–[28] and only a few works focus on small-scale events [9], [12], [24], [29]–[31].

Regarding large-scale event detection, Sakaki et al. [6] use Twitter streams to detect earthquakes and typhoons, by monitoring special trigger-keywords, and by applying an SVM as a binary classifier of positive events (earthquakes and typhoons) and negative events (non-events or other events). In [25], the authors present a method for detecting real-world events, such as natural disasters, by analyzing Twitter streams and by employing both NLP and term-frequency-based techniques. Chew et al. [26] analyze the content of tweets shared during the H1N1 (i.e., swine flu) outbreak, containing keywords and hashtags related to the H1N1 event to determine the kind of information exchanged by social media users. De Longueville et al. [27] analyze geo-tagged tweets to detect forest fire events and outline the affected area.

D’ANDREA et al.: REAL-TIME DETECTION OF TRAFFIC FROM TWITTER STREAM ANALYSIS

Regarding small-scale event detection, Agarwal et al.
[29] focus on the detection of fires in a factory from Twitter stream analysis, by using standard NLP techniques and a Naive Bayes (NB) classifier. In [30], information extracted from Twitter streams is merged with information from emergency networks to detect and analyze small-scale incidents, such as fires. Wanichayapong et al. [12] extract, using NLP techniques and syntactic analysis, traffic information from microblogs to detect and classify tweets containing place mentions and traffic information. Li et al. [31] propose a system, called TEDAS, to retrieve incident-related tweets. The system focuses on Crime and Disaster-related Events (CDE) such as shootings, thunderstorms, and car accidents, and aims to classify tweets as CDE events by exploiting a filtering based on keywords, spatial and temporal information, number of followers of the user, number of retweets, hashtags, links, and mentions. Sakaki et al. [9] extract, based on keywords, real-time driving information by analyzing Twitter’s SUMs, and use an SVM classifier to filter “noisy” tweets not related to road traffic events. Schulz et al. [24] detect small-scale car incidents from Twitter stream analysis, by employing semantic web technologies, along with NLP and machine learning techniques. They perform the experiments using SVM, NB, and RIPPER classifiers.

In this paper, we focus on a particular small-scale event, i.e., road traffic, and we aim to detect and analyze traffic events by processing users’ SUMs belonging to a certain area and written in the Italian language. To this aim, we propose a system able to fetch, elaborate, and classify SUMs as related to a road traffic event or not. To the best of our knowledge, few papers have been proposed for traffic detection using Twitter stream analysis. However, with respect to our work, all of them focus on languages different from Italian, employ different input features and/or feature selection algorithms, and consider only binary classifications.
In addition, a few works employ machine learning algorithms [9], [24], while the others rely on NLP techniques only. The proposed system may approach both binary and multi-class classification problems. As regards binary classification, we consider traffic-related tweets, and tweets not related to traffic. As regards multi-class classification, we split the traffic-related class into two classes, namely traffic congestion or crash, and traffic due to external event. In this paper, by external event we refer to a scheduled event (e.g., a football match, a concert), or to an unexpected event (e.g., a flash-mob, a political demonstration, a fire). In this way we aim to support traffic and city administrations in managing scheduled or unexpected events in the city.

Moreover, the proposed system could work together with other traffic sensors (e.g., loop detectors, cameras, infrared cameras) and ITS monitoring systems for the detection of traffic difficulties, providing a low-cost wide coverage of the road network, especially in those areas (e.g., urban and suburban) where traditional traffic sensors are missing.

Fig. 1. System architecture for traffic detection from Twitter stream analysis.

Concluding, the proposed ITS is characterized by the following strengths with respect to the current research aimed at detecting traffic events from social networks: i) it performs a multi-class classification, which recognizes non-traffic, traffic due to congestion or crash, and traffic due to external events; ii) it detects the traffic events in real-time; and iii) it is developed as an event-driven infrastructure, built on an SOA architecture. As regards the first strength, the proposed ITS could be a valuable tool for traffic and city administrations to regulate traffic and vehicular mobility, and to improve the management of scheduled or unexpected events.
As regards the second strength, the real-time detection capability allows obtaining reliable information about traffic events in a very short time, often before online news web sites and local newspapers. As far as the third strength is concerned, with the chosen architecture, we are able to directly notify the traffic event occurrence to the drivers registered to the system, without the need for them to access official news websites or radio traffic news channels to get traffic information. In addition, the SOA architecture makes it possible to exploit two important peculiarities, i.e., scalability of the service (e.g., by using a dedicated server for each geographic area), and easy integration with other services (e.g., other ITS services).

III. ARCHITECTURE OF THE TRAFFIC DETECTION SYSTEM

In this section, our traffic detection system based on Twitter stream analysis is presented. The system architecture is service-oriented and event-driven, and is composed of three main modules, namely: i) “Fetch of SUMs and Pre-processing”, ii) “Elaboration of SUMs”, iii) “Classification of SUMs”. The purpose of the proposed system is to fetch SUMs from Twitter, to process SUMs by applying a few text mining steps, and to assign the appropriate class label to each SUM. Finally, as shown in Fig. 1, by analyzing the classified SUMs, the system is able to notify the presence of a traffic event.

The main tools we have exploited for developing the system are: 1) Twitter’s API,⁴ which provides direct access to the public stream of tweets; 2) Twitter4J,⁵ a Java library that we used as a wrapper for Twitter’s API; 3) the Java API provided by Weka (Waikato Environment for Knowledge Analysis) [32], which we mainly employed for data pre-processing and text mining elaboration.

⁴ http://dev.twitter.com

We recall that both the “Elaboration of SUMs” and the “Classification of SUMs” modules require setting the optimal values of a few specific parameters, by means of a supervised learning stage. To this aim, we exploited a training set composed of a set of SUMs previously collected, elaborated, and manually labeled. Section IV describes in greater detail how the specific parameters of each module are set during the supervised learning stage.

In the following, we discuss in depth the elaboration made on the SUMs by each module of the traffic detection system.

A. Fetch of SUMs and Pre-Processing

The first module, “Fetch of SUMs and Pre-processing”, extracts raw tweets from the Twitter stream, based on one or more search criteria (e.g., geographic coordinates, keywords appearing in the text of the tweet). Each fetched raw tweet contains: the user id, the timestamp, the geographic coordinates, a retweet flag, and the text of the tweet. The text may contain additional information, such as hashtags, links, mentions, and special characters. In this paper, we took only Italian language tweets into account. However, the system can be easily adapted to cope with different languages.

After the SUMs have been fetched according to the specific search criteria, SUMs are pre-processed. In order to extract only the text of each raw tweet and remove all meta-information associated with it, a Regular Expression filter [33] is applied. More in detail, the meta-information discarded are: user id, timestamp, geographic coordinates, hashtags, links, mentions, and special characters. Finally, a case-folding operation is applied to the texts, in order to convert all characters to lower case.
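
A minimal sketch of this kind of pre-processing is given below, written in Python for brevity rather than with the Java tools named above. The regular expressions are illustrative assumptions (the paper does not list its exact patterns), and `preprocess` is a hypothetical name; it removes links, mentions, hashtags, and special characters from the tweet text and then applies case folding.

```python
import re

# Assumed patterns; the filter actually used in the system is not specified.
LINK = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
# Anything that is not a letter, digit, underscore, whitespace, or common punctuation.
SPECIAL = re.compile(r"[^\w\s.,;:!?'-]")
WHITESPACE = re.compile(r"\s+")


def preprocess(text):
    """Strip links, mentions, hashtags, and special characters, then lowercase."""
    for pattern in (LINK, MENTION, HASHTAG, SPECIAL):
        text = pattern.sub(" ", text)
    text = WHITESPACE.sub(" ", text).strip()
    return text.lower()


cleaned = preprocess("Incidente sulla A1! @utente #traffico http://t.co/xyz")
```

Punctuation is deliberately left in place here, on the assumption that it is removed later by the tokenizer; the order of the substitutions matters only in that links are removed before the special-character pass breaks them apart.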
At the end of this elaboration, each fetched SUM appears as a string, i.e., a sequence of characters. We denote the jth SUM pre-processed by the first module as $\mathrm{SUM}_j$, with $j = 1, \ldots, N$, where $N$ is the total number of fetched SUMs.

Footnote 5: http://twitter4j.org

B. Elaboration of SUMs

The second processing module, “Elaboration of SUMs”, is devoted to transforming the set of pre-processed SUMs, i.e., a set of strings, into a set of numeric vectors to be elaborated by the “Classification of SUMs” module. To this aim, some text mining techniques are applied in sequence to the pre-processed SUMs. In the following, the text mining steps performed in this module are described in detail:

a) tokenization is typically the first step of the text mining process, and consists in transforming a stream of characters into a stream of processing units called tokens (e.g., syllables, words, or phrases). During this step, other operations are usually performed, such as removal of punctuation and other non-text characters [18], and normalization of symbols (e.g., accents, apostrophes, hyphens, tabs and spaces). In the proposed system, the tokenizer removes all punctuation marks and splits each SUM into tokens corresponding to words (bag-of-words representation). At the end of this step, each $\mathrm{SUM}_j$ is represented as the sequence of words contained in it. We denote the jth tokenized SUM as $\mathrm{SUM}_j^{T} = \{t_{j1}^{T}, \ldots, t_{jh}^{T}, \ldots, t_{jH_j}^{T}\}$, where $t_{jh}^{T}$ is the hth token and $H_j$ is the total number of tokens in $\mathrm{SUM}_j^{T}$;

b) stop-word filtering consists in eliminating stop-words, i.e., words which provide little or no information to the text analysis. Common stop-words are articles, conjunctions, prepositions, pronouns, etc. Other stop-words are those having no statistical significance, that is, those that typically appear very often in sentences of the considered language (language-specific stop-words), or in the set of texts being analyzed (domain-specific stop-words), and can therefore be considered as noise [34].
The authors in [35] have shown that the 10 most frequent words in texts and documents of the English language account for about 20–30% of the tokens in a given document. In the proposed system, the stop-word list for the Italian language was freely downloaded from the Snowball Tartarus website (footnote 6) and extended with other ad hoc defined stop-words. At the end of this step, each SUM is thus reduced to a sequence of relevant tokens. We denote the jth stop-word filtered SUM as $\mathrm{SUM}_j^{SW} = \{t_{j1}^{SW}, \ldots, t_{jk}^{SW}, \ldots, t_{jK_j}^{SW}\}$, where $t_{jk}^{SW}$ is the kth relevant token and $K_j$, with $K_j \le H_j$, is the total number of relevant tokens in $\mathrm{SUM}_j^{SW}$. We recall that a relevant token is a token that does not belong to the set of stop-words;

c) stemming is the process of reducing each word (i.e., token) to its stem or root form, by removing its suffix. The purpose of this step is to group words with the same theme having closely related semantics. In the proposed system, the stemmer exploits the Snowball Tartarus Stemmer (footnote 7) for the Italian language, based on Porter’s algorithm [36]. Hence, at the end of this step each SUM is represented as a sequence of stems extracted from the tokens contained in it. We denote the jth stemmed SUM as $\mathrm{SUM}_j^{S} = \{t_{j1}^{S}, \ldots, t_{jl}^{S}, \ldots, t_{jL_j}^{S}\}$, where $t_{jl}^{S}$ is the lth stem and $L_j$, with $L_j \le K_j$, is the total number of stems in $\mathrm{SUM}_j^{S}$;

d) stem filtering consists in reducing the number of stems of each SUM. In particular, each SUM is filtered by removing from the set of stems the ones not belonging to the set of relevant stems. The set of $F$ relevant stems $RS = \{\hat{s}_1, \ldots, \hat{s}_f, \ldots, \hat{s}_F\}$ is identified during the supervised learning stage that will be discussed in Section IV. At the end of this step, each SUM is represented as a sequence of relevant stems. We denote the jth filtered SUM as $\mathrm{SUM}_j^{SF} = \{t_{j1}^{SF}, \ldots, t_{jp}^{SF}, \ldots, t_{jP_j}^{SF}\}$, where $t_{jp}^{SF} \in RS$ is the pth relevant stem and $P_j$, with $P_j \le L_j$ and $P_j \le F$, is the total number of relevant stems in $\mathrm{SUM}_j^{SF}$;

e) feature representation consists in building, for each SUM, the corresponding vector of numeric features. Indeed, in order to classify the SUMs, we have to represent them in the same feature space. In particular, we consider the F-dimensional set of features $X = \{X_1, \ldots, X_f, \ldots, X_F\}$ corresponding to the set of relevant stems. For each $\mathrm{SUM}_j^{SF}$ we define the vector $x_j = \{x_{j1}, \ldots, x_{jf}, \ldots, x_{jF}\}$ where each element is set according to the following formula:

$$x_{jf} = \begin{cases} w_f & \text{if stem } \hat{s}_f \in \mathrm{SUM}_j^{SF} \\ 0 & \text{otherwise.} \end{cases} \tag{1}$$

In (1), $w_f$ is the numeric weight associated with the relevant stem $\hat{s}_f$: we will discuss how this weight is computed in Section IV.

Footnote 6: http://snowball.tartarus.org/algorithms/italian/stop.txt
Footnote 7: http://snowball.tartarus.org/algorithms/italian/stemmer.html

In Fig. 2, we summarize all the steps applied to a sample tweet by the “Elaboration of SUMs” module.

Fig. 2. Steps of the text mining elaboration applied to a sample tweet.

C. Classification of SUMs

The third module, “Classification of SUMs”, assigns to each elaborated SUM a class label related to traffic events. Thus, the output of this module is a collection of N labeled SUMs. To label each SUM, a classification model is employed. The parameters of the classification model have been identified during the supervised learning stage. Actually, as will be discussed in Section V, different classification models have been considered and compared. The classifier that achieved the most accurate results was finally employed for the real-time monitoring with the proposed traffic detection system. The system continuously monitors a specific region and notifies the presence of a traffic event on the basis of a set of rules that can be defined by the system administrator.
For example, when the first tweet is recognized as a traffic-related tweet, the system may send a warning signal. Then, the actual notification of the traffic event may be sent after the identification of a certain number of tweets with the same label.

IV. SETUP OF THE SYSTEM

As stated previously, a supervised learning stage is required to perform the setup of the system. In particular, we need to identify the set of relevant stems, the weights associated with each of them, and the parameters that describe the classification models. We employ a collection of $N_{tr}$ labeled SUMs as training set. During the learning stage, each SUM is elaborated by applying the tokenization, stop-word filtering, and stemming steps. Then, the complete set of stems is built as follows:

$$CS = \bigcup_{j=1}^{N_{tr}} \mathrm{SUM}_j^{S} = \{s_1, \ldots, s_q, \ldots, s_Q\}. \tag{2}$$

$CS$ is the union of all the stems extracted from the $N_{tr}$ SUMs of the training set. We recall that $\mathrm{SUM}_j^{S}$ is the set of stems that describes the jth SUM after the stemming step in the training set.

Then, we compute the weight of each stem in $CS$, which allows us to establish the importance of each stem $s_q$ in the collection of SUMs of the training set, by using the Inverse Document Frequency (IDF) index as

$$w_q = IDF_q = \ln(N_{tr}/N_q), \tag{3}$$

where $N_q$ is the number of SUMs of the training set in which the stem $s_q$ occurs [37]. The IDF index is a simplified version of the TF-IDF (Term Frequency-IDF) index [38]–[40], where the TF part considers the frequency of a specific stem within each SUM. In fact, we heuristically found that the same stem seldom appears more than once in an SUM. On the other hand, we performed several experiments also with the TF-IDF index and we verified that the performance in terms of classification accuracy is similar to the one obtained by using only the IDF index.
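To make equations (2) and (3) concrete, the following Python sketch builds the complete set of stems and computes the IDF weight of each stem from a toy training set. The function name `idf_weights` and the sample stems are hypothetical and serve only as an illustration; they are not taken from the system or from its training data.

```python
import math
from collections import Counter


def idf_weights(stemmed_sums):
    """Compute w_q = ln(N_tr / N_q) for every stem in the training set.

    stemmed_sums: list of lists of stems, one inner list per training SUM.
    """
    n_tr = len(stemmed_sums)
    doc_freq = Counter()
    for stems in stemmed_sums:
        # set(): a stem counts once per SUM, however often it occurs in it
        doc_freq.update(set(stems))
    # The keys of doc_freq form the complete set of stems CS of equation (2).
    return {stem: math.log(n_tr / n_q) for stem, n_q in doc_freq.items()}


# Hypothetical stemmed training SUMs (illustrative values only).
training = [
    ["traffic", "cod"],
    ["traffic", "incident"],
    ["concert"],
    ["traffic"],
]
weights = idf_weights(training)
```

With this definition, a stem that occurs in every training SUM receives weight ln(1) = 0, while rarer stems receive larger weights.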
Thus, we decided to adopt the simpler IDF index as weight.

In order to select the set of relevant stems, a feature selection algorithm is applied. SUMs are described by a set $\{S_1, \ldots, S_q, \ldots, S_Q\}$ of $Q$ features, where each feature $S_q$ corresponds to the stem $s_q$. The possible values of feature $S_q$ are $w_q$ and 0.

Then, as suggested in [41], to evaluate the quality of each stem $s_q$, we employ a method based on the computation of the Information Gain (IG) value between feature $S_q$ and output $C = \{c_1, \ldots, c_r, \ldots, c_R\}$, where $c_r$ is one of the $R$ possible class labels (two or three in our case). The IG value between $S_q$ and $C$ is calculated as $IG(C, S_q) = H(C) - H(C|S_q)$, where $H(C)$ represents the entropy of $C$, and $H(C|S_q)$ represents the entropy of $C$ after the observation of feature $S_q$.

Finally, we identified the set of relevant stems $RS$ by selecting all the stems which have a positive IG value. We recall that the stem selection process based on IG values is a standard and effective method widely used in the literature [40], [42].

The last part of the supervised learning stage regards the identification of the most suited classification models and the setting of their structural parameters. We took into account several classification algorithms widely used in the literature for text classification tasks [43], namely, i) SVM [44], ii) NB [45], iii) C4.5 decision tree [46], iv) k-nearest neighbor (kNN) [47], and v) PART [48]. The learning algorithms used to build the aforementioned classifiers will be briefly discussed in the following section.

V. EVALUATION OF THE TRAFFIC DETECTION SYSTEM

In this section, we discuss the evaluation of the proposed system. We performed several experiments using two different datasets. For each dataset, we built and compared seven different classification models: SVM, NB, C4.5, kNN (with k equal to 1, 2, and 5), and PART. In the following, we describe how we generated the datasets to complete the setup of the system, and we recall the employed classification models.
Then, we present the achieved results, and the statistical metrics used to evaluate the performance of the classifiers. Finally, we provide a comparison with some results extracted from other works in the literature.

A. Description of the Datasets

We built two different datasets, i.e., a 2-class dataset, and a 3-class dataset. For each dataset, tweets in the Italian language were collected using the “Fetch of SUMs and Pre-processing” module by setting some search criteria (e.g., presence of keywords, geographic coordinates, date and time of posting). Then, the SUMs were manually labeled, by assigning the correct class label.

1) 2-Class Dataset: The first dataset consists of tweets belonging to two possible classes, namely i) road traffic-related tweets (traffic class), and ii) tweets not related to road traffic (non-traffic class). The tweets were fetched in a time span of about four hours from the same geographic area. First, we fetched candidate tweets for the traffic class by using the following search criteria:

- geographic area of origin of the tweet: Italy. We set the center of the area in Rome (latitude and longitude equal to 41° 53' 35" and 12° 28' 58", respectively) and we set a radius of about 600 km to cover approximately the whole country;
- time and date of posting: tweets belong to a time span of four evening hours of two weekend days of May 2013;
- keywords contained in the text of the tweet: we apply the or-operator on the set of keywords S1, composed of the three most frequently used traffic-related keywords, S1 = {“traffico” (traffic), “coda” (queue), “incidente” (crash)}, with the aim of selecting tweets containing at least one of the above keywords. The resulting condition can be expressed by:

  CondA: “traffico” or “coda” or “incidente”.

Then, we fetched the candidate tweets for the non-traffic class using the same search criteria for geographic area, and time and date, but without setting any keyword.
Obviously, this time, tweets containing traffic-related keywords from set S1, already found in the previous fetch, were discarded.

Finally, the tweets were manually labeled with two possible class labels, i.e., as related to a road traffic event (traffic), e.g., accidents, jams, queues, or not (non-traffic). In more detail, first we read, interpreted, and correctly assigned a traffic class label to each candidate traffic class tweet. Among all candidate traffic class tweets, we actually labeled 665 tweets with the traffic class. About 4% of candidate traffic class tweets were not labeled with the traffic class label. With the aim of correctly training the system, we added these tweets to the non-traffic class. Indeed, we also collected a number of tweets containing the traffic-related keywords defined in S1, but actually not concerning road traffic events. Such tweets are related to, e.g., illegal drug trade, network traffic, or organ trafficking. It is worth noting that, as happens in the English language, several words in the Italian language, e.g., “traffic” or “incident”, are suitable in several contexts. So, for instance, the events “traffico di droga” (drug trade), “traffico di organi” (organ trafficking), “incidente diplomatico” (diplomatic scandal), “traffico dati” (network traffic) could be easily mistaken for road traffic-related events.

Then, in order to obtain a balanced dataset, we randomly selected tweets from the candidate tweets of the non-traffic class until reaching 665 non-traffic class tweets, and we manually verified that the selected tweets did not belong to the traffic class. Thus, the final 2-class dataset consists of 1330 tweets and is balanced, i.e., it contains 665 tweets per class.

Table I shows the textual part of a selection of tweets fetched by the system with the corresponding, manually added, class label.
In Table I, tweets #1, #2 and #3 are examples of traffic class tweets, tweet #4 is an example of a non-traffic class tweet, tweets #5 and #6 are examples of tweets containing traffic-related keywords, but belonging to the non-traffic class. In the table, for an easier understanding, the keywords appearing in the text of each tweet are underlined. Table II shows some of the most important textual features (i.e., stems) and their meaning, related to the traffic class tweets, identified by the system for this dataset.

TABLE I. SOME EXAMPLES OF TWEETS AND CORRESPONDING CLASSES FOR THE 2-CLASS DATASET
TABLE II. SIGNIFICANT FEATURES RELATED TO THE TRAFFIC CLASS

2) 3-Class Dataset: The second dataset consists of tweets belonging to three possible classes. In this case we want to discriminate if traffic is caused by an external event (e.g., a football match, a concert, a flash-mob, a political demonstration, a fire) or not. Even though the current release of the system was not designed to identify the specific event, knowing that the traffic difficulty is caused by an external event could be useful to traffic and city administrations, for regulating traffic and vehicular mobility, or managing scheduled events in the city. In more detail, we took into account four possible external events, namely, i) matches, ii) processions, iii) music concerts, and iv) demonstrations. Thus, in this dataset the three possible classes are: i) traffic due to external event, ii) traffic congestion or crash, and iii) non-traffic. The tweets were fetched in a similar way as described before.
In more detail, first, we fetched candidate road traffic-related tweets due to an external event (traffic due to external event class) according to the following search criteria:

- geographic area of origin of the tweet: Italy, parameters set as in the case of the 2-class dataset;
- time and date of posting: parameters set as in the case of the 2-class dataset, but different hours of the same weekend days are used;
- keywords contained in the text of the tweet: for each of the aforementioned external events, we took into account only one keyword, thus obtaining the set S2 = {“partita” (match), “processione” (procession), “concerto” (concert), “manifestazione” (demonstration)}. Next we combined each keyword representing the external event with one of the traffic-related keywords from set S3 = {“traffico” (traffic), “coda” (queue)}. Finally, we applied the and-operator between each keyword from set S2 and the condition CondB expressed as:

  CondB: “traffico” or “coda”,

  thus obtaining the following conditions:

  CondC: CondB and “partita”,
  CondD: CondB and “processione”,
  CondE: CondB and “concerto”,
  CondF: CondB and “manifestazione”.

Then, we fetched candidate tweets related to traffic congestion, crashes, and jams (traffic congestion or crash class) by using the following search criteria:

- geographic area of origin of the tweet: Italy, parameters set as in the case of the 2-class dataset;
- time and date of posting: parameters set as in the case of the 2-class dataset, but different hours of the same weekend days are used;
- keywords contained in the text of the tweet: we combined the above-mentioned keywords from set S1 in three possible sets: S4 = {“traffico” (traffic), “incidente” (crash)}, S5 = {“incidente” (crash), “coda” (queue)}, and the already defined set S3.
  Then we used the and-operator to define the exploited conditions as follows:

  CondG: “traffico” and “incidente”,
  CondH: “traffico” and “coda”,
  CondI: “incidente” and “coda”.

Obviously, as done before, tweets containing external event-related keywords, already found in the previous fetch, were discarded. Further, we fetched the candidate tweets of the non-traffic class using the same search criteria for geographic area, and time and date, but without setting any keyword. Again, tweets already found in previous fetches were discarded.

Finally, the tweets were manually labeled with three possible class labels. We first labeled the candidate tweets of the traffic due to external event class (this set of tweets was the smallest one), and we identified 333 tweets actually associated with this class. Then, we randomly selected 333 tweets for each of the two remaining classes. Also in this case, we manually verified the correctness of the labels associated with the selected tweets. Finally, as done before, we also added to the non-traffic class tweets containing keywords related to traffic congestion and to external events but not concerning road-traffic events. The final 3-class dataset consists of 999 tweets and it is balanced, i.e., it has 333 tweets per class.

TABLE III. SOME EXAMPLES OF TWEETS AND CORRESPONDING CLASSES FOR THE 3-CLASS DATASET

Table III shows a selection of tweets fetched by the system for the 3-class dataset, with the corresponding, manually added, class label. In Table III, tweets #1, #2, #3 and #4 are examples of tweets belonging to the class traffic due to external event: in more detail, #1 is related to a procession event, #2 is related to a match event, #3 is related to a concert event, and #4 is related to a demonstration event.
Tweet #5 is an example of a tweet belonging to the class traffic congestion or crash, while tweets #6 and #7 are examples of non-traffic class tweets. Words underlined in the text of each tweet represent involved keywords.

B. Employed Classification Models

In the following we briefly describe the main properties of the employed and experimented classification models.

SVMs, introduced for the first time in [49], are discriminative classification algorithms based on a separating hyper-plane according to which new samples can be classified. The best hyper-plane is the one with the maximum margin, i.e., the largest minimum distance, from the training samples and is computed based on the support vectors (i.e., samples of the training set). The SVM classifier employed in this work is the implementation described in [44].

The NB classifier [45] is a probabilistic classification algorithm based on the application of Bayes’s theorem, and is characterized by a probability model which assumes independence among the input features. In other words, the model assumes that the presence of a particular feature is unrelated to the presence of any other feature.

The C4.5 decision tree algorithm [46] generates a classification decision tree by recursively dividing up the training data according to the values of the features. Non-terminal nodes of the decision tree represent tests on one or more features, while terminal nodes represent the predicted output, namely the class. In the resulting decision tree each path (from the root to a leaf) identifies a combination of feature values associated with a particular classification. At each level of the tree, the algorithm chooses the feature that most effectively splits the data, according to the highest information gain.

The kNN algorithm [50] belongs to the family of “lazy” classification algorithms.
The basic functioning principle is the following: each unseen sample is compared with a number of pre-classified training samples, and its similarity is evaluated according to a simple distance measure (e.g., we employed the normalized Euclidean distance), in order to find the associated output class. The parameter k allows specifying the number of neighbors, i.e., training samples to take into account for the classification. We focus on three kNN models with k equal to 1, 2, and 5. The kNN classifier employed in this work follows the implementation described in [47].

The PART algorithm [48] combines two rule generation methods, i.e., RIPPER [51] and C4.5 [46]. It infers classification rules by repeatedly building partial, i.e., incomplete, C4.5 decision trees and by using the separate-and-conquer rule learning technique [52].

C. Experimental Results

In this section, we present the classification results achieved by applying the classifiers mentioned in Section V-B to the two datasets described in Section V-A. For each classifier the experiments were performed using an n-fold stratified cross-validation methodology. In n-fold stratified cross-validation, the dataset is randomly partitioned into n folds and the classes in each fold are represented with the same proportion as in the original data. Then, the classification model is trained on n − 1 folds, and the remaining fold is used for testing the model. The procedure is repeated n times, using as test data each of the n folds exactly once. The n test results are finally averaged to produce an overall estimation.
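The partitioning scheme just described can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not the procedure of the toolkit used in the experiments: the function names `stratified_folds` and `train_test_splits` are hypothetical, and the round-robin assignment is only one simple way to keep class proportions equal across folds.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=10, seed=0):
    """Partition sample indices into n_folds folds, class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)  # random partition, controlled by the seed
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)  # deal indices round-robin
    return folds


def train_test_splits(labels, n_folds=10, seed=0):
    """Yield (train_indices, test_indices); each fold is the test set once."""
    folds = stratified_folds(labels, n_folds, seed)
    for i, test_idx in enumerate(folds):
        train_idx = [idx for k, fold in enumerate(folds) if k != i
                     for idx in fold]
        yield train_idx, test_idx


# Class sizes of the balanced 2-class dataset (665 tweets per class).
labels = ["traffic"] * 665 + ["non-traffic"] * 665
splits = list(train_test_splits(labels, n_folds=10, seed=1))
```

Changing `seed` produces a different random partition, so the whole procedure can be repeated on a new set of folds.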
We repeated an n-fold stratified cross-validation, with n = 10, two times, using two different seed values to randomly partition the data into folds.

TABLE IV. STATISTICAL METRICS

We recall that, for each fold, we consider a specific training set which is used in the supervised learning stage to learn both the pre-processing (i.e., the set of relevant stems and their weights) and the classification model parameters.

To evaluate the achieved results, we employed the most frequently used statistical metrics, i.e., precision, accuracy, recall, and F-score. To explain the meaning of the metrics, we will refer, for the sake of simplicity, to the case of a binary classification, i.e., positive class versus negative class. In fact, in the case of a multi-class classification, the metrics are computed per class and the overall statistical measure is simply the average of the per-class measures. The correctness of a classification can be evaluated according to four values: i) true positives (TP): the number of real positive samples correctly classified as positive; ii) true negatives (TN): the number of real negative samples correctly classified as negative; iii) false positives (FP): the number of real negative samples incorrectly classified as positive; iv) false negatives (FN): the number of real positive samples incorrectly classified as negative.

Based on the previous definitions, we can now formally define the employed statistical metrics and provide, in Table IV, the corresponding equations. Accuracy represents the overall effectiveness of the classifier and corresponds to the number of correctly classified samples over the total number of samples. Precision is the number of correctly classified samples of a class, i.e., positive class, over the number of samples classified as belonging to that class.
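As a hedged illustration, the four counts TP, TN, FP, and FN can be turned into the metrics with the standard textbook formulas, which we assume correspond to the equations of Table IV (the table body is not reproduced in this text). The sketch also computes recall and the F-score, which are defined in the sentences that follow. The function name `binary_metrics` and the sample counts are hypothetical.

```python
def binary_metrics(tp, tn, fp, fn, beta=1.0):
    """Standard binary-classification metrics from the four counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    b2 = beta * beta
    denom = b2 * precision + recall
    # Weighted harmonic mean of precision and recall (F1 when beta = 1).
    f_score = (1 + b2) * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }


# Made-up counts, for illustration only (not results from the paper).
metrics = binary_metrics(tp=90, tn=85, fp=15, fn=10)
```

For a multi-class problem, the same function can be applied once per class, treating that class as positive and all others as negative, and the per-class values can then be averaged.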
Recall is the number of correctly classified samples of a class, i.e., positive class, over the number of samples of that class; it represents the effectiveness of the classifier in identifying positive samples. The F-score (typically used with β = 1 for class-balanced datasets) is the weighted harmonic mean of precision and recall and it is used to compare different classifiers.

In the first experiment, we performed a classification of tweets using the 2-class dataset (R = 2) consisting of 1330 tweets, described in Section V-A. The aim is to assign a class label (traffic or non-traffic) to each tweet.

Table V shows the average classification results obtained by the classifiers on the 2-class dataset. In more detail, the table shows, for each classifier, the accuracy and the per-class values of recall, precision, and F-score. All the values are averaged over the 20 values obtained by repeating the 10-fold cross-validation two times. The best classifier proved to be the SVM, with an average accuracy of 95.75%.

As Table VI clearly shows, the results achieved by our SVM classifier appreciably outperform those obtained in similar works in the literature [9], [12], [24], [31], although they refer to different datasets. More precisely, Wanichayapong et al. [12] obtained an accuracy of 91.75% by using an approach that considers the presence of place mentions and special keywords in the tweet. Li et al. [31] achieved an accuracy of 80% for detecting incident-related tweets using Twitter-specific features, such as hashtags, mentions, URLs, and spatial and temporal information. Sakaki et al. [9] employed an SVM to identify heavy-traffic tweets and obtained an accuracy of 87%. Finally, Schulz et al. [24], by using SVM, RIPPER, and NB classifiers, obtained accuracies of 89.06%, 85.93%, and 86.25%, respectively. In the case of SVM, they used the following features: word n-grams, TF-IDF score, syntactic and semantic features. In the case of NB and RIPPER they employed the same set of features except semantic features.

In the second experiment, we performed a classification of tweets over three classes (R = 3), namely, traffic due to external event, traffic congestion or crash, and non-traffic, with the aim of discriminating the cause of traffic. Thus, we employed the 3-class dataset consisting of 999 tweets, described in Section V-A. We employed again the classifiers previously introduced and the obtained results are shown in Table VII. The best classifier proved to be again the SVM, with an average accuracy of 88.89%.

In order to verify whether there exist statistical differences among the values of accuracy achieved by the seven classification models, we performed a statistical analysis of the results. We took into account the model which obtains the best average accuracy, i.e., the SVM model. As suggested in [53], we applied non-parametric statistical tests: for each classifier we generated a distribution consisting of the 20 values of the accuracies on the test set obtained by repeating the 10-fold cross-validation two times. We statistically compared the results achieved by the SVM model with the ones achieved by the remaining models. We applied the Wilcoxon signed-rank test [54], which detects significant differences between two distributions. In all the tests, we used α = 0.05 as level of significance. Tables VIII and IX show the results of the Wilcoxon signed-rank test, related to the 2-class and the 3-class problems, respectively. In the tables, $R^+$ and $R^-$ denote, respectively, the sum of ranks for the folds in which the first model outperformed the second, and the sum of ranks for the opposite condition.
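The two rank sums can be computed directly from paired accuracy values, as in the following minimal Python sketch. It is our own illustration and rests on stated assumptions: folds with identical accuracies (zero differences) are dropped, tied absolute differences share the average rank, and the accuracy values are made up. The p-value itself is not computed here; a library routine such as SciPy's `scipy.stats.wilcoxon` can provide it.

```python
def signed_rank_sums(acc_first, acc_second):
    """Return (R+, R-) for paired accuracies of two models.

    R+ sums the ranks of the folds where the first model is better,
    R- the ranks of the folds where the second model is better.
    Assumption: zero differences are discarded before ranking.
    """
    diffs = [a - b for a, b in zip(acc_first, acc_second) if a != b]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied absolute differences.
        while (j + 1 < len(order)
               and abs(diffs[order[j + 1]]) == abs(diffs[order[i]])):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    r_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    r_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return r_plus, r_minus


# Made-up per-fold accuracies (percent), for illustration only.
r_plus, r_minus = signed_rank_sums([96, 95, 97, 94], [90, 96, 93, 94])
```

A large gap between the two sums in favor of the first model is what leads the test to reject the hypothesis of equivalence.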
Since the p-values are always lower than the level of significance, we can always reject the statistical hypothesis of equivalence. For this reason, we can state that the SVM model statistically outperforms all the other approaches on both problems.

VI. REAL-TIME DETECTION OF TRAFFIC EVENTS

The developed system was installed and tested for the real-time monitoring of several areas of the Italian road network, by means of the analysis of the Twitter stream coming from those areas. The aim is to perform a continuous monitoring of frequently busy roads and highways in order to detect possible traffic events in real-time or even in advance with respect to the traditional news media [55], [56]. The system is implemented as a service of a wider service-oriented platform to be developed in the context of the SMARTY project [23]. The service can be called by each user of the platform who wishes to know the traffic conditions in a certain area. In this section, we aim to show the effectiveness of our system in determining traffic events in a short time. We just present some results for the 2-class problem. For the setup of the system, we have employed as training set the overall dataset described in Section V-A. We adopt only the best performing classifier, i.e., the SVM classifier.

TABLE V. CLASSIFICATION RESULTS ON THE 2-CLASS DATASET (BEST VALUES IN BOLD)
TABLE VI. RESULTS OF THE CLASSIFICATION OF TWEETS IN OTHER WORKS IN THE LITERATURE
TABLE VII. CLASSIFICATION RESULTS ON THE 3-CLASS DATASET (BEST VALUES IN BOLD)
TABLE VIII. RESULTS OF THE WILCOXON SIGNED-RANK TEST ON THE ACCURACIES OBTAINED ON THE TEST SET FOR THE 2-CLASS DATASET
TABLE IX. RESULTS OF THE WILCOXON SIGNED-RANK TEST ON THE ACCURACIES OBTAINED ON THE TEST SET FOR THE 3-CLASS DATASET
TABLE X. REAL-TIME DETECTION OF TRAFFIC EVENTS
During the learning stage, we identified Q = 3227 features, which were reduced to F = 582 features after the feature selection step.

The system continuously performs the following operations: it i) fetches, with a time frequency of z minutes, tweets originated from a given area, containing the keywords resulting from CondA, ii) performs a real-time classification of the fetched tweets, iii) detects a possible traffic-related event, by analyzing the traffic class tweets from the considered area, and, if needed, sends one or more traffic warning signals with increasing intensity for that area. In more detail, a first low-intensity warning signal is sent when m traffic class tweets are found in the considered area in the same or in subsequent temporal windows. Then, as the number of traffic class tweets grows, the warning signal becomes more reliable, thus more intense. The value of m was set based on heuristic considerations, depending, e.g., on the traffic density of the monitored area. In the experiments we set m = 1. As regards the fetching frequency z, we heuristically found that z = 10 minutes represents a good compromise between fast event detection and system scalability. In fact, z should be set depending on the number of monitored areas and on the volume of tweets fetched.

To evaluate the effectiveness of our system, each detected traffic-related event needs to be appropriately validated.
Validation can be performed in different ways, which include: i) direct communication by a person who was present at the moment of the event; ii) reports drawn up by the police and/or local administrations (available only in case of incidents); iii) radio traffic news; iv) official real-time traffic news web sites; v) local newspapers (often the day after the event and only when the event is very significant).

Direct communication is possible only if a person is present at the event and can communicate this event to us. Although we have tried to sensitize a number of users, we did not obtain adequate feedback. Official reports are confidential: police and local administrations rarely grant access to these reports, and, when this permission is granted, reports can be consulted only after several days. Radio traffic news is in general quite precise in communicating traffic-related events in real time. Unfortunately, to monitor and store the events, we would need to dedicate a person or adopt some tool for audio analysis. We realized, however, that the traffic-related events communicated on the radio are always mentioned also on the official real-time traffic news web sites. Actually, on the radio, the speaker typically reads the news reported on the web sites. Local newspapers focus on local traffic-related events and often report events which are not published on official traffic news web sites.

In conclusion, official real-time traffic news web sites and local newspapers are the most reliable and effective sources of information for traffic-related events. Thus, we decided to analyze two of the most popular real-time traffic news web sites for the Italian road network, namely “CCISS Viaggiare informati” (footnote 8), managed by the Italian government Ministry for infrastructures and transports, and “Autostrade per l’Italia” (footnote 9), the official web site of the Italian highway road network.
Further, we examined local newspapers published in the zones where our system was able to detect traffic-related events.

Actually, it was really difficult to find realistic data to test the proposed system, basically for two reasons: on the one hand, we realized that real traffic events are not always reported in official news channels; on the other hand, situations of traffic slowdown may be detected by traditional traffic sensors but, at the same time, may not give rise to tweets. In particular, in relation to the latter reason, it is well known that drivers usually share a tweet about a traffic event only when the event is unexpected and really serious, i.e., it forces them to stop the car. So, for instance, they do not share a tweet in case of road works, minor traffic difficulties, or usual traffic jams (same place and same time). In fact, in correspondence to minor traffic jams we rarely find tweets coming from the affected area.

We have tried to build a meaningful set of traffic events, related to some major Italian cities, for which we found an official confirmation. The selected set includes events correctly identified by the proposed system and confirmed via official traffic news web sites or local newspapers. The set of traffic events, whose information is summarized in Table X, consists of 70 events detected by our system. The events are related both to highways and to urban roads, and were detected during September and early October 2014.

Table X shows the information about the event, the time of detection from Twitter’s stream fetched by our system, the time of detection from official news web sites or local newspapers, and the difference between these two times. In the table, positive differences indicate a late detection with respect to official news web sites, while negative differences indicate an early detection. The symbol “-” indicates that we found the official confirmation of the event by reading local newspapers several hours later.
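The sign convention of the time difference in Table X can be made concrete with a short sketch. The function name `detection_delay_minutes` and the timestamps below are hypothetical and are not taken from Table X. `None` stands in for the “-” symbol, i.e., an event confirmed only by local newspapers.

```python
from datetime import datetime
from typing import Optional


def detection_delay_minutes(twitter_time: datetime,
                            official_time: Optional[datetime]) -> Optional[float]:
    """Difference, in minutes, between the detection from the Twitter
    stream and the notification on an official news web site.

    Positive value: the system detected the event later (late detection).
    Negative value: the system detected the event earlier (early detection).
    None: no web-site timestamp is available (the "-" case in the table).
    """
    if official_time is None:
        return None
    return (twitter_time - official_time).total_seconds() / 60.0


# Hypothetical timestamps, for illustration only.
early = detection_delay_minutes(datetime(2014, 9, 15, 8, 10),
                                datetime(2014, 9, 15, 8, 25))   # -15.0
late = detection_delay_minutes(datetime(2014, 9, 15, 18, 42),
                               datetime(2014, 9, 15, 18, 30))   # 12.0
newspaper_only = detection_delay_minutes(datetime(2014, 9, 16, 7, 5), None)  # None
```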
More precisely, the system detects in advance 20 events out of the 59 confirmed by news web sites, and 11 events confirmed the day after by local newspapers. Regarding the 39 events not detected in advance, we can observe that 25 of them are detected within 15 minutes of their official notification, while the detection of the remaining 14 events occurs beyond 15 minutes but within 50 minutes. We wish to point out, however, that, even in the cases of late detection, our system directly and explicitly notifies the event occurrence to the drivers or passengers registered to the SMARTY platform, on which our system runs. By contrast, in order to get traffic information, drivers or passengers usually need to search and access the official news web sites, which may take some time and effort, or wait to get the information from the radio traffic news.

As future work, we are planning to integrate our system with an application for analyzing the official traffic news web sites, so as to capture traffic condition notifications in real time. Thus, our system will be able to signal traffic-related events, in the worst case, at the same time as the notifications on the web sites. Further, we are investigating the integration of our system into a more complex traffic detection infrastructure. This infrastructure may include both advanced physical sensors and social sensors such as streams of tweets. In particular, social sensors may provide a low-cost wide coverage of the road network, especially in those areas (e.g., urban and suburban) where traditional traffic sensors are missing.

Footnote 8: http://www.cciss.it/
Footnote 9: http://www.autostrade.it/autostrade-gis/gis.do

VII. CONCLUSION

In this paper, we have proposed a system for real-time detection of traffic-related events from Twitter stream analysis. The system, built on an SOA, is able to fetch and classify streams of tweets and to notify the users of the presence of traffic events.
Furthermore, the system is also able to discriminate whether or not a traffic event is due to an external cause, such as a football match, a procession, or a demonstration.

We have exploited available software packages and state-of-the-art techniques for text analysis and pattern classification. These technologies and techniques have been analyzed, tuned, adapted, and integrated in order to build the overall system for traffic event detection. Among the analyzed classifiers, we have shown the superiority of the SVMs, which achieved an accuracy of 95.75% for the 2-class problem and of 88.89% for the 3-class problem, in which we also considered the class of traffic due to an external event.

The best classification model has been employed for real-time monitoring of several areas of the Italian road network. We have shown the results of a monitoring campaign performed in September and early October 2014. We have discussed the capability of the system to detect traffic events almost in real time, often before online news web sites and local newspapers.

ACKNOWLEDGMENT

We would like to thank Fabio Cempini for the implementation of some parts of the system presented in this paper.

REFERENCES

[1] F. Atefeh and W. Khreich, “A survey of techniques for event detection in Twitter,” Comput. Intell., vol. 31, no. 1, pp. 132–164, 2015.
[2] P. Ruchi and K. Kamalakar, “ET: Events from tweets,” in Proc. 22nd Int. Conf. World Wide Web Comput., Rio de Janeiro, Brazil, 2013, pp. 613–620.
[3] A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, and B. Bhattacharjee, “Measurement and analysis of online social networks,” in Proc. 7th ACM SIGCOMM Conf. Internet Meas., San Diego, CA, USA, 2007, pp. 29–42.
[4] G. Anastasi et al., “Urban and social sensing for sustainable mobility in smart cities,” in Proc. IFIP/IEEE Int. Conf. Sustainable Internet ICT Sustainability, Palermo, Italy, 2013, pp. 1–4.
[5] A. Rosi et al., “Social sensors and pervasive services: Approaches and perspectives,” in Proc. IEEE Int. Conf. PERCOM Workshops, Seattle, WA, USA, 2011, pp. 525–530.
[6] T. Sakaki, M. Okazaki, and Y. Matsuo, “Tweet analysis for real-time event detection and earthquake reporting system development,” IEEE Trans. Knowl. Data Eng., vol. 25, no. 4, pp. 919–931, Apr. 2013.
[7] J. Allan, Topic Detection and Tracking: Event-Based Information Organization. Norwell, MA, USA: Kluwer, 2002.
[8] K. Perera and D. Dias, “An intelligent driver guidance tool using location based services,” in Proc. IEEE ICSDM, Fuzhou, China, 2011, pp. 246–251.
[9] T. Sakaki, Y. Matsuo, T. Yanagihara, N. P. Chandrasiri, and K. Nawa, “Real-time event extraction for driving information from social sensors,” in Proc. IEEE Int. Conf. CYBER, Bangkok, Thailand, 2012, pp. 221–226.
[10] B. Chen and H. H. Cheng, “A review of the applications of agent technology in traffic and transportation systems,” IEEE Trans. Intell. Transp. Syst., vol. 11, no. 2, pp. 485–497, Jun. 2010.
[11] A. Gonzalez, L. M. Bergasa, and J. J. Yebes, “Text detection and recognition on traffic panels from street-level imagery using visual appearance,” IEEE Trans. Intell. Transp. Syst., vol. 15, no. 1, pp. 228–238, Feb. 2014.
[12] N. Wanichayapong, W. Pruthipunyaskul, W. Pattara-Atikom, and P. Chaovalit, “Social-based traffic information extraction and classification,” in Proc. 11th Int. Conf. ITST, St. Petersburg, Russia, 2011, pp. 107–112.
[13] P. M. d’Orey and M. Ferreira, “ITS for sustainable mobility: A survey on applications and impact assessment tools,” IEEE Trans. Intell. Transp. Syst., vol. 15, no. 2, pp. 477–493, Apr. 2014.
[14] K. Boriboonsomsin, M. Barth, W. Zhu, and A. Vu, “Eco-routing navigation system based on multisource historical and real-time traffic information,” IEEE Trans. Intell. Transp. Syst., vol. 13, no. 4, pp. 1694–1704, Dec. 2012.
[15] J. Hurlock and M. L. Wilson, “Searching twitter: Separating the tweet from the chaff,” in Proc. 5th AAAI ICWSM, Barcelona, Spain, 2011, pp. 161–168.
[16] J. Weng and B.-S. Lee, “Event detection in Twitter,” in Proc. 5th AAAI ICWSM, Barcelona, Spain, 2011, pp. 401–408.
[17] S. Weiss, N. Indurkhya, T. Zhang, and F. Damerau, Text Mining: Predictive Methods for Analyzing Unstructured Information. Berlin, Germany: Springer-Verlag, 2004.
[18] A. Hotho, A. Nürnberger, and G. Paaß, “A brief survey of text mining,” LDV Forum-GLDV J. Comput. Linguistics Lang. Technol., vol. 20, no. 1, pp. 19–62, May 2005.
[19] V. Gupta, S. Gurpreet, and S. Lehal, “A survey of text mining techniques and applications,” J. Emerging Technol. Web Intell., vol. 1, no. 1, pp. 60–76, Aug. 2009.
[20] V. Ramanathan and T. Meyyappan, “Survey of text mining,” in Proc. Int. Conf. Technol. Bus. Manage., Dubai, UAE, 2013, pp. 508–514.
[21] M. W. Berry and M. Castellanos, Survey of Text Mining. New York, NY, USA: Springer-Verlag, 2004.
[22] H. Takemura and K. Tajima, “Tweet classification based on their lifetime duration,” in Proc. 21st ACM Int. CIKM, Shanghai, China, 2012, pp. 2367–2370.
[23] The Smarty project. [Online]. Available: http://www.smarty.toscana.it/
[24] A. Schulz, P. Ristoski, and H. Paulheim, “I see a car crash: Real-time detection of small scale incidents in microblogs,” in The Semantic Web: ESWC 2013 Satellite Events, vol. 7955. Berlin, Germany: Springer-Verlag, 2013, pp. 22–33.
[25] M. Krstajic, C. Rohrdantz, M. Hund, and A. Weiler, “Getting there first: Real-time detection of real-world incidents on Twitter,” in Proc. 2nd IEEE Work. Interactive Vis. Text Anal.: Task-Driven Anal. Soc. Media, IEEE VisWeek, Seattle, WA, USA, 2012.
[26] C. Chew and G. Eysenbach, “Pandemics in the age of Twitter: Content analysis of tweets during the 2009 H1N1 outbreak,” PLoS ONE, vol. 5, no. 11, pp. 1–13, Nov. 2010.
[27] B. De Longueville, R. S. Smith, and G. Luraschi, “OMG, from here, I can see the flames!: A use case of mining location based social networks to acquire spatio-temporal data on forest fires,” in Proc. Int. Work. LBSN, Seattle, WA, USA, 2009, pp. 73–80.
[28] J. Yin, A. Lampert, M. Cameron, B. Robinson, and R. Power, “Using social media to enhance emergency situation awareness,” IEEE Intell. Syst., vol. 27, no. 6, pp. 52–59, Nov./Dec. 2012.
[29] P. Agarwal, R. Vaithiyanathan, S. Sharma, and G. Shroff, “Catching the long-tail: Extracting local news events from Twitter,” in Proc. 6th AAAI ICWSM, Dublin, Ireland, Jun. 2012, pp. 379–382.
[30] F. Abel, C. Hauff, G.-J. Houben, R. Stronkman, and K. Tao, “Twitcident: Fighting fire with information from social web streams,” in Proc. ACM 21st Int. Conf. Comp. WWW, Lyon, France, 2012, pp. 305–308.
[31] R. Li, K. H. Lei, R. Khadiwala, and K. C.-C. Chang, “TEDAS: A Twitter-based event detection and analysis system,” in Proc. 28th IEEE ICDE, Washington, DC, USA, 2012, pp. 1273–1276.
[32] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, “The WEKA data mining software: An update,” SIGKDD Explor. Newsl., vol. 11, no. 1, pp. 10–18, Jun. 2009.
[33] M. Habibi, Real World Regular Expressions with Java 1.4. Berlin, Germany: Springer-Verlag, 2004.
[34] Y. Zhou and Z.-W. Cao, “Research on the construction and filter method of stop-word list in text preprocessing,” in Proc. 4th ICICTA, Shenzhen, China, 2011, vol. 1, pp. 217–221.
[35] W. Francis and H. Kucera, “Frequency analysis of English usage: Lexicon and grammar,” J. English Linguistics, vol. 18, no. 1, pp. 64–70, Apr. 1982.
[36] M. F. Porter, “An algorithm for suffix stripping,” Program: Electron. Library Inf. Syst., vol. 14, no. 3, pp. 130–137, 1980.
[37] G. Salton and C. Buckley, “Term-weighting approaches in automatic text retrieval,” Inf. Process. Manage., vol. 24, no. 5, pp. 513–523, Aug. 1988.
[38] L. M. Aiello et al., “Sensing trending topics in Twitter,” IEEE Trans. Multimedia, vol. 15, no. 6, pp. 1268–1282, Oct. 2013.
[39] C. Shang, M. Li, S. Feng, Q. Jiang, and J. Fan, “Feature selection via maximizing global information gain for text classification,” Knowl.-Based Syst., vol. 54, pp. 298–309, Dec. 2013.
[40] L. H. Patil and M. Atique, “A novel feature selection based on information gain using WordNet,” in Proc. SAI Conf., London, U.K., 2013, pp. 625–629.
[41] M. A. Hall and G. Holmes, “Benchmarking attribute selection techniques for discrete class data mining,” IEEE Trans. Knowl. Data Eng., vol. 15, no. 6, pp. 1437–1447, Nov./Dec. 2003.
[42] H. Uğuz, “A two-stage feature selection method for text categorization by using information gain, principal component analysis and genetic algorithm,” Knowl.-Based Syst., vol. 24, no. 7, pp. 1024–1032, Oct. 2011.
[43] Y. Aphinyanaphongs et al., “A comprehensive empirical comparison of modern supervised classification and feature selection methods for text categorization,” J. Assoc. Inf. Sci. Technol., vol. 65, no. 10, pp. 1964–1987, Oct. 2014.
[44] J. Platt, “Fast training of support vector machines using sequential minimal optimization,” in Advances in Kernel Methods: Support Vector Learning, B. Schoelkopf, C. J. C. Burges, and A. J. Smola, Eds. Cambridge, MA, USA: MIT Press, 1999, pp. 185–208.
[45] G. H. John and P. Langley, “Estimating continuous distributions in Bayesian classifiers,” in Proc. 11th Conf. Uncertainty Artif. Intell., San Mateo, CA, USA, 1995, pp. 338–345.
[46] J. R. Quinlan, C4.5: Programs for Machine Learning. San Mateo, CA, USA: Morgan Kaufmann, 1993.
[47] D. W. Aha, D. Kibler, and M. K. Albert, “Instance-based learning algorithms,” Mach. Learn., vol. 6, no. 1, pp. 37–66, Jan. 1991.
[48] E. Frank and I. H. Witten, “Generating accurate rule sets without global optimization,” in Proc. 15th ICML, Madison, WI, USA, 1998, pp. 144–151.
[49] C. Cortes and V. Vapnik, “Support-vector networks,” Mach. Learn., vol. 20, no. 3, pp. 273–297, Sep. 1995.
[50] T. M. Cover and P. E. Hart, “Nearest neighbour pattern classification,” IEEE Trans. Inf. Theory, vol. IT-13, no. 1, pp. 21–27, Jan. 1967.
[51] W. W. Cohen, “Fast effective rule induction,” in Proc. 12th ICML, Tahoe City, CA, USA, 1995, pp. 115–123.
[52] G. Pagallo and D. Haussler, “Boolean feature discovery in empirical learning,” Mach. Learn., vol. 5, no. 1, pp. 71–99, Mar. 1990.
[53] J. Derrac, S. Garcia, D. Molina, and F. Herrera, “A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms,” Swarm Evol. Comput., vol. 1, no. 1, pp. 3–18, Mar. 2011.
[54] F. Wilcoxon, “Individual comparisons by ranking methods,” Biometrics Bull., vol. 1, no. 6, pp. 80–83, Dec. 1945.
[55] H. Becker, M. Naaman, and L. Gravano, “Beyond trending topics: Real-world event identification on Twitter,” in Proc. 5th AAAI ICWSM, Barcelona, Spain, 2011, pp. 438–441.
[56] H. Kwak, C. Lee, H. Park, and S. Moon, “What is Twitter, a social network or a news media?” in Proc. ACM 19th Int. Conf. WWW, Raleigh, NC, USA, 2010, pp. 591–600.

Eleonora D’Andrea received the M.S. degree in computer engineering for enterprise management and the Ph.D. degree in information engineering from University of Pisa, Pisa, Italy, in 2009 and 2013, respectively. She is a Research Fellow with the Research Center “E. Piaggio,” University of Pisa. She has coauthored several papers in international journals and conference proceedings. Her main research interests include computational intelligence techniques for simulation and prediction, applied to various fields, such as energy consumption in buildings or energy production in solar photovoltaic installations.

Pietro Ducange received the M.Sc. degree in computer engineering and the Ph.D. degree in information engineering from University of Pisa, Pisa, Italy, in 2005 and 2009, respectively. He is an Associate Professor with the Faculty of Engineering, eCampus University, Novedrate, Italy. He has coauthored more than 30 papers in international journals and conference proceedings.
His main research interests focus on designing fuzzy rule-based systems with different tradeoffs between accuracy and interpretability by using multiobjective evolutionary algorithms. He currently serves the following international journals as a member of the Editorial Board: Soft Computing and International Journal of Swarm Intelligence and Evolutionary Computation.

Beatrice Lazzerini (M’98) is a Full Professor with the Department of Information Engineering, University of Pisa, Pisa, Italy. She has cofounded the Computational Intelligence Group in the Department of Information Engineering, University of Pisa. She has coauthored seven books and has published over 200 papers in international journals and conferences. She is a coeditor of two books. Her research interests are in the field of computational intelligence and its applications to pattern classification, pattern recognition, risk analysis, risk management, diagnosis, forecasting, and multicriteria decision making. She was involved and had roles of responsibility in several national and international research projects, conferences, and scientific events.

Francesco Marcelloni (M’06) received the Laurea degree in electronics engineering and the Ph.D. degree in computer engineering from University of Pisa, Pisa, Italy, in 1991 and 1996, respectively. He is an Associate Professor with University of Pisa. He has cofounded the Computational Intelligence Group in the Department of Information Engineering, University of Pisa, in 2002. He is also the Founder and Head of the Competence Centre on MObile Value Added Services (MOVAS). He has coedited three volumes and four journal Special Issues and is the (co)author of a book and of more than 190 papers in international journals, books, and conference proceedings.
His main research interests include multiobjective evolutionary fuzzy systems, situation-aware service recommenders, energy-efficient data compression and aggregation in wireless sensor nodes, and monitoring systems for energy efficiency in buildings. Currently, he is an Associate Editor for Information Sciences (Elsevier) and Soft Computing (Springer) and is on the Editorial Board of four other international journals.

Real-Time Detection of Traffic From Twitter Stream Analysis

05/08/201902/07/2019 by admin

Abstract: Social networks have been recently employed as a source of information for event detection, with particular reference to road traffic congestion and car accidents. In this paper, we present a real-time monitoring system for traffic event detection from Twitter stream analysis. The system fetches tweets from Twitter according to several search criteria; processes tweets by applying text mining techniques; and finally performs the classification of tweets. The aim is to assign the appropriate class label to each tweet, as related to a traffic event or not. The traffic detection system was employed for real-time monitoring of several areas of the Italian road network, allowing for detection of traffic events almost in real time, often before online traffic news web sites. We employed the support vector machine as a classification model, and we achieved an accuracy value of 95.75% by solving a binary classification problem (traffic versus nontraffic tweets). We were also able to discriminate whether traffic is caused by an external event or not, by solving a multiclass classification problem and obtaining an accuracy value of 88.89%.

Index Terms: Traffic event detection, tweet classification, text mining, social sensing.

I. INTRODUCTION

Social network sites, also called micro-blogging services (e.g., Twitter, Facebook, Google+), have spread in recent years, becoming a new kind of real-time information channel. Their popularity stems from their portability, thanks to several social network applications for smartphones and tablets, their ease of use, and their real-time nature [1], [2]. People intensely use social networks to report (personal or public) real-life events happening around them or simply to express their opinion on a given topic, through a public message. Social networks allow people to create an identity and let them share it in order to build a community.
The resulting social network is then a basis for maintaining social relationships, finding users with similar interests, and locating content and knowledge entered by other users [3].

Manuscript received July 2, 2014; revised October 7, 2014 and December 16, 2014; accepted February 10, 2015. Date of publication March 10, 2015; date of current version July 31, 2015. This work was carried out in the framework of and was supported by the SMARTY project, funded by “Programma Operativo Regionale (POR) 2007–2013”, objective “Competitività regionale e occupazione” of the Tuscany Region. The Associate Editor for this paper was Q. Zhang.
E. D’Andrea is with the Research Center “E. Piaggio,” University of Pisa, 56122 Pisa, Italy (e-mail: eleonora.dandrea@for.unipi.it).
P. Ducange is with the Faculty of Engineering, eCampus University, 22060 Novedrate, Italy (e-mail: pietro.ducange@uniecampus.it).
B. Lazzerini and F. Marcelloni are with the Dipartimento di Ingegneria dell’Informazione, University of Pisa, 56122 Pisa, Italy (e-mail: b.lazzerini@iet.unipi.it; f.marcelloni@iet.unipi.it).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TITS.2015.2404431

The user message shared in social networks is called Status Update Message (SUM), and it may contain, apart from the text, meta-information such as timestamp, geographic coordinates (latitude and longitude), name of the user, links to other resources, hashtags, and mentions. Several SUMs referring to a certain topic or related to a limited geographic area may provide, if correctly analyzed, a great deal of valuable information about an event or a topic.
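To make the notion of a SUM and its meta-information more tangible, the record below is a minimal sketch of how one might be represented in code. The class name `StatusUpdateMessage`, its field names, and the sample values are assumptions chosen for illustration. They do not reproduce the schema of the Twitter API or of the system described here.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional, Tuple


@dataclass
class StatusUpdateMessage:
    """Hypothetical container for a SUM: the text plus its meta-information."""
    text: str
    timestamp: datetime
    user_name: str
    # (latitude, longitude); None when the message is not geotagged.
    coordinates: Optional[Tuple[float, float]] = None
    links: List[str] = field(default_factory=list)
    hashtags: List[str] = field(default_factory=list)
    mentions: List[str] = field(default_factory=list)

    def is_geotagged(self) -> bool:
        """True when geographic coordinates are attached to the message."""
        return self.coordinates is not None


# Invented sample message, for illustration only.
example = StatusUpdateMessage(
    text="Coda in autostrada, tutto fermo",
    timestamp=datetime(2014, 9, 15, 8, 10),
    user_name="example_user",
    coordinates=(43.72, 10.40),
    hashtags=["traffico"],
)
```

Keeping the meta-information in separate typed fields makes it easy to filter messages by area or time before any text analysis takes place.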
In fact, we may regard social network users as social sensors [4], [5], and SUMs as sensor information [6], as it happens with traditional sensors.

Recently, social networks and media platforms have been widely used as a source of information for the detection of events, such as traffic congestion, incidents, natural disasters (earthquakes, storms, fires, etc.), or other events. An event can be defined as a real-world occurrence that happens in a specific time and space [1], [7]. In particular, regarding traffic-related events, people often use an SUM to share information about the current traffic situation around them while driving. For this reason, event detection from social networks is also often employed with Intelligent Transportation Systems (ITSs). An ITS is an infrastructure which, by integrating ICTs (Information and Communication Technologies) with transport networks, vehicles, and users, allows improving safety and management of transport networks. ITSs provide, e.g., real-time information about weather, traffic congestion or regulation, or plan efficient (e.g., shortest, fast driving, least polluting) routes [4], [6], [8]–[14].

However, event detection from social networks analysis is a more challenging problem than event detection from traditional media like blogs, emails, etc., where texts are well formatted [2]. In fact, SUMs are unstructured and irregular texts: they contain informal or abbreviated words, misspellings, or grammatical errors [1]. Due to their nature, they are usually very brief, thus becoming an incomplete source of information [2]. Furthermore, SUMs contain a huge amount of useless or meaningless information [15], which has to be filtered. According to Pear Analytics (footnote 1), it has been estimated that over 40% of all Twitter (footnote 2) SUMs (i.e., tweets) are pointless, with no useful information for the audience, as they refer to the personal sphere [16].
For all of these reasons, in order to analyze the information coming from social networks, we exploit text mining techniques [17], which employ methods from the fields of data mining, machine learning, statistics, and Natural Language Processing (NLP) to extract meaningful information [18].

Footnote 1: http://www.pearanalytics.com/, 2009.
Footnote 2: https://twitter.com.

1524-9050 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 16, NO. 4, AUGUST 2015

More in detail, text mining refers to the process of automatic extraction of meaningful information and knowledge from unstructured text. The main difficulty encountered in dealing with problems of text mining is caused by the vagueness of natural language. In fact, people, unlike computers, are perfectly able to understand idioms, grammatical variations, and slang expressions, or to contextualize a given word. On the contrary, computers have the ability, lacking in humans, to quickly process large amounts of information [19], [20].

The text mining process is summarized in the following. First, the information content of the document is converted into a structured form (vector space representation). In fact, most text mining techniques are based on the idea that a document can be faithfully represented by the set of words contained in it (bag-of-words representation [21]). According to this representation, each document j of a collection of documents is represented as an M-dimensional vector $V_j = \{w(t_{j1}), \ldots, w(t_{ji}), \ldots, w(t_{jM})\}$, where M is the number of words defined in the document collection, and $w(t_{ji})$ specifies the weight of the word $t_i$ in document j. The simplest weighting method assigns a binary value to $w(t_{ji})$, thus indicating the absence or the presence of the word $t_i$, while other methods assign a real value to $w(t_{ji})$.
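The bag-of-words representation just described can be illustrated with a short sketch. The helper names `build_vocabulary` and `vectorize`, the whitespace tokenization, and the three sample documents are assumptions made for this example. Raw term counts are used as one simple instance of a real-valued weight and are not necessarily the weighting adopted by the system.

```python
from collections import Counter
from typing import List


def build_vocabulary(docs: List[str]) -> List[str]:
    """Collect the M distinct words of the collection, in sorted order."""
    return sorted({word for doc in docs for word in doc.lower().split()})


def vectorize(doc: str, vocabulary: List[str], binary: bool = True) -> List[float]:
    """Map a document to its M-dimensional vector V_j.

    binary=True : w(t_ji) is 1.0 if word t_i occurs in the document, else 0.0.
    binary=False: w(t_ji) is the number of occurrences of t_i (a simple
                  real-valued weight, chosen here only for illustration).
    """
    counts = Counter(doc.lower().split())
    if binary:
        return [1.0 if counts[term] > 0 else 0.0 for term in vocabulary]
    return [float(counts[term]) for term in vocabulary]


docs = [
    "traffic jam on the highway",
    "the match starts at nine",
    "jam jam traffic",
]
vocabulary = build_vocabulary(docs)
# ['at', 'highway', 'jam', 'match', 'nine', 'on', 'starts', 'the', 'traffic']

binary_vector = vectorize(docs[2], vocabulary)
# [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]

count_vector = vectorize(docs[2], vocabulary, binary=False)
# [0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
```

Even in this toy collection, most entries of each vector are zero, which hints at why the dimensionality M becomes a practical concern for real tweet collections.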
During the text mining process, several operations can be performed [21], depending on the specific goal, such as: i) linguistic analysis through the application of NLP techniques, indexing, and statistical techniques; ii) text filtering by means of specific keywords; iii) feature extraction, i.e., conversion of textual features (e.g., words) into numeric features (e.g., weights) that a machine learning algorithm is able to process; and iv) feature selection, i.e., reduction of the number of features in order to take into account only the most relevant ones. Feature selection is particularly important, since one of the main problems in text mining is the high dimensionality of the feature space $\mathbb{R}^M$. Then, data mining and machine learning algorithms (e.g., support vector machines (SVMs), decision trees, neural networks) are applied to the documents in the vector space representation, to build classification, clustering, or regression models. Finally, the results obtained by the model are interpreted by means of measures of effectiveness (e.g., statistical-based measures) to verify the accuracy achieved. Additionally, the obtained results may be improved, e.g., by modifying the values of the parameters used and repeating the whole process.

Among social network platforms, we took into account Twitter, as the majority of works in the literature regarding event detection focus on it. Twitter is nowadays the most popular micro-blogging service; it counts more than 600 million active users (footnote 3: http://www.statisticbrain.com/twitter-statistics), sharing more than 400 million SUMs per day [1]. Regarding the aim of this paper, Twitter has several advantages over similar micro-blogging services. First, tweets are up to 140 characters, enhancing the real-time and news-oriented nature of the platform. In fact, the lifetime of tweets is usually very short; thus Twitter is the social network platform that is best suited to study SUMs related to real-time events [22].
Second, each tweet can be directly associated with meta-information that constitutes additional information. Third, Twitter messages are public, i.e., they are directly available with no privacy limitations. For all of these reasons, Twitter is a good source of information for real-time event detection and analysis.

In this paper, we propose an intelligent system, based on text mining and machine learning algorithms, for real-time detection of traffic events from Twitter stream analysis. The system, after a feasibility study, has been designed and developed from the ground up as an event-driven infrastructure, built on a Service Oriented Architecture (SOA) [23]. The system exploits available technologies based on state-of-the-art techniques for text analysis and pattern classification. These technologies and techniques have been analyzed, tuned, adapted, and integrated in order to build the intelligent system. In particular, we present an experimental study, which has been performed to determine the most effective among different state-of-the-art approaches for text classification. The chosen approach was integrated into the final system and used for the on-the-field real-time detection of traffic events.

The paper has the following structure. Section II summarizes related work about event detection from Twitter stream analysis. Section III outlines the architecture of the proposed system for traffic detection, by describing the methodology used to collect, elaborate, and classify SUMs, with particular reference to SUMs extracted from the Twitter stream. Section IV describes the setup of the system. Section V presents the results achieved with different classification models and provides a comparison with similar works in the literature. Section VI presents the real-world monitoring application for real-time detection of traffic events. Finally, Section VII provides concluding remarks.

II. RELATED WORK

With reference to current approaches for using social media to extract useful information for event detection, we need to distinguish between small-scale events and large-scale events. Small-scale events (e.g., traffic, car crashes, fires, or local manifestations) usually have a small number of SUMs related to them, belong to a precise geographic location, and are concentrated in a small time interval. On the other hand, large-scale events (e.g., earthquakes, tornados, or the election of a president) are characterized by a huge number of SUMs, and by a wider temporal and geographic coverage [24]. Consequently, due to the smaller number of SUMs related to small-scale events, small-scale event detection is a non-trivial task. Several works in the literature deal with event detection from social networks. Many works deal with large-scale event detection [6], [25]–[28] and only a few works focus on small-scale events [9], [12], [24], [29]–[31].

Regarding large-scale event detection, Sakaki et al. [6] use Twitter streams to detect earthquakes and typhoons, by monitoring special trigger-keywords, and by applying an SVM as a binary classifier of positive events (earthquakes and typhoons) and negative events (non-events or other events). In [25], the authors present a method for detecting real-world events, such as natural disasters, by analyzing Twitter streams and by employing both NLP and term-frequency-based techniques. Chew et al. [26] analyze the content of tweets shared during the H1N1 (i.e., swine flu) outbreak, containing keywords and hashtags related to the H1N1 event, to determine the kind of information exchanged by social media users. De Longueville et al. [27] analyze geo-tagged tweets to detect forest fire events and outline the affected area.

Regarding small-scale event detection, Agarwal et al.
[29] focus on the detection of fires in a factory from Twitter stream analysis, by using standard NLP techniques and a Naive Bayes (NB) classifier. In [30], information extracted from Twitter streams is merged with information from emergency networks to detect and analyze small-scale incidents, such as fires. Wanichayapong et al. [12] extract, using NLP techniques and syntactic analysis, traffic information from microblogs to detect and classify tweets containing place mentions and traffic information. Li et al. [31] propose a system, called TEDAS, to retrieve incident-related tweets. The system focuses on Crime and Disaster-related Events (CDE) such as shootings, thunderstorms, and car accidents, and aims to classify tweets as CDE events by exploiting a filtering based on keywords, spatial and temporal information, number of followers of the user, number of retweets, hashtags, links, and mentions. Sakaki et al. [9] extract, based on keywords, real-time driving information by analyzing Twitter’s SUMs, and use an SVM classifier to filter “noisy” tweets not related to road traffic events. Schulz et al. [24] detect small-scale car incidents from Twitter stream analysis, by employing semantic web technologies, along with NLP and machine learning techniques. They perform the experiments using SVM, NB, and RIPPER classifiers.

In this paper, we focus on a particular small-scale event, i.e., road traffic, and we aim to detect and analyze traffic events by processing users’ SUMs belonging to a certain area and written in the Italian language. To this aim, we propose a system able to fetch, elaborate, and classify SUMs as related to a road traffic event or not. To the best of our knowledge, few papers have been proposed for traffic detection using Twitter stream analysis. However, with respect to our work, all of them focus on languages different from Italian, employ different input features and/or feature selection algorithms, and consider only binary classifications.
In addition, a few works employ machine learning algorithms [9], [24], while the others rely on NLP techniques only. The proposed system may approach both binary and multi-class classification problems. As regards binary classification, we consider traffic-related tweets, and tweets not related with traffic. As regards multi-class classification, we split the traffic-related class into two classes, namely traffic congestion or crash, and traffic due to external event. In this paper, with external event we refer to a scheduled event (e.g., a football match, a concert), or to an unexpected event (e.g., a flash-mob, a political demonstration, a fire). In this way we aim to support traffic and city administrations for managing scheduled or unexpected events in the city.

Moreover, the proposed system could work together with other traffic sensors (e.g., loop detectors, cameras, infrared cameras) and ITS monitoring systems for the detection of traffic difficulties, providing a low-cost wide coverage of the road network, especially in those areas (e.g., urban and suburban) where traditional traffic sensors are missing.

Fig. 1. System architecture for traffic detection from Twitter stream analysis.

Concluding, the proposed ITS is characterized by the following strengths with respect to the current research aimed at detecting traffic events from social networks: i) it performs a multi-class classification, which recognizes non-traffic, traffic due to congestion or crash, and traffic due to external events; ii) it detects the traffic events in real-time; and iii) it is developed as an event-driven infrastructure, built on an SOA architecture. As regards the first strength, the proposed ITS could be a valuable tool for traffic and city administrations to regulate traffic and vehicular mobility, and to improve the management of scheduled or unexpected events.
For what concerns the second strength, the real-time detection capability allows obtaining reliable information about traffic events in a very short time, often before online news web sites and local newspapers. As far as the third strength is concerned, with the chosen architecture, we are able to directly notify the traffic event occurrence to the drivers registered to the system, without the need for them to access official news websites or radio traffic news channels, to get traffic information. In addition, the SOA architecture permits to exploit two important peculiarities, i.e., scalability of the service (e.g., by using a dedicated server for each geographic area), and easy integration with other services (e.g., other ITS services).

III. ARCHITECTURE OF THE TRAFFIC DETECTION SYSTEM

In this section, our traffic detection system based on Twitter streams analysis is presented. The system architecture is service-oriented and event-driven, and is composed of three main modules, namely: i) “Fetch of SUMs and Pre-processing”, ii) “Elaboration of SUMs”, iii) “Classification of SUMs”. The purpose of the proposed system is to fetch SUMs from Twitter, to process SUMs by applying a few text mining steps, and to assign the appropriate class label to each SUM. Finally, as shown in Fig. 1, by analyzing the classified SUMs, the system is able to notify the presence of a traffic event.

The main tools we have exploited for developing the system are: 1) Twitter’s API (footnote 4), which provides direct access to the public stream of tweets; 2) Twitter4J (footnote 5), a Java library that we used as a wrapper for Twitter’s API; 3) the Java API provided by Weka (Waikato Environment for Knowledge Analysis) [32], which we mainly employed for data pre-processing and text mining elaboration.

Footnote 4: http://dev.twitter.com

We recall that both the “Elaboration of SUMs” and the “Classification of SUMs” modules require setting the optimal values of a few specific parameters, by means of a supervised learning stage. To this aim, we exploited a training set composed by a set of SUMs previously collected, elaborated, and manually labeled. Section IV describes in greater detail how the specific parameters of each module are set during the supervised learning stage.

In the following, we discuss in depth the elaboration made on the SUMs by each module of the traffic detection system.

A. Fetch of SUMs and Pre-Processing

The first module, “Fetch of SUMs and Pre-processing”, extracts raw tweets from the Twitter stream, based on one or more search criteria (e.g., geographic coordinates, keywords appearing in the text of the tweet). Each fetched raw tweet contains: the user id, the timestamp, the geographic coordinates, a retweet flag, and the text of the tweet. The text may contain additional information, such as hashtags, links, mentions, and special characters. In this paper, we took only Italian language tweets into account. However, the system can be easily adapted to cope with different languages.

After the SUMs have been fetched according to the specific search criteria, SUMs are pre-processed. In order to extract only the text of each raw tweet and remove all meta-information associated with it, a Regular Expression filter [33] is applied. More in detail, the meta-information discarded are: user id, timestamp, geographic coordinates, hashtags, links, mentions, and special characters. Finally, a case-folding operation is applied to the texts, in order to convert all characters to lowercase.
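The pre-processing just described can be illustrated with a minimal Python sketch. The regular expressions and the function name `preprocess` are assumptions made for illustration; the paper only states that a regular-expression filter and case folding are applied, and does not give the actual patterns. The sketch operates on the tweet text alone, on the assumption that user id, timestamp, and coordinates are separate fields that are simply not passed on.

```python
import re

# Hypothetical patterns: the paper does not list the expressions it uses.
URL_RE = re.compile(r"https?://\S+")      # links
TAG_RE = re.compile(r"[@#]\w+")           # mentions and hashtags
SPECIAL_RE = re.compile(r"[^\w\s']")      # anything that is not a letter, digit, space or apostrophe


def preprocess(raw_text: str) -> str:
    """Strip links, mentions, hashtags and special characters, then lowercase."""
    text = URL_RE.sub(" ", raw_text)
    text = TAG_RE.sub(" ", text)
    text = SPECIAL_RE.sub(" ", text)
    # Collapse the whitespace left behind by the removals.
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower()


print(preprocess("Traffico BLOCCATO sulla A1! @mario #roma http://t.co/xyz"))
```

Because `\w` is Unicode-aware for Python strings, accented Italian letters survive the filter and are lowercased correctly.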
At the end of this elaboration, each fetched SUM appears as a string, i.e., a sequence of characters. We denote the jth SUM pre-processed by the first module as $\mathrm{SUM}_j$, with $j = 1, \ldots, N$, where $N$ is the total number of fetched SUMs.

Footnote 5: http://twitter4j.org

B. Elaboration of SUMs

The second processing module, “Elaboration of SUMs”, is devoted to transforming the set of pre-processed SUMs, i.e., a set of strings, in a set of numeric vectors to be elaborated by the “Classification of SUMs” module. To this aim, some text mining techniques are applied in sequence to the pre-processed SUMs. In the following, the text mining steps performed in this module are described in detail:

a) tokenization is typically the first step of the text mining process, and consists in transforming a stream of characters into a stream of processing units called tokens (e.g., syllables, words, or phrases). During this step, other operations are usually performed, such as removal of punctuation and other non-text characters [18], and normalization of symbols (e.g., accents, apostrophes, hyphens, tabs and spaces). In the proposed system, the tokenizer removes all punctuation marks and splits each SUM into tokens corresponding to words (bag-of-words representation). At the end of this step, each $\mathrm{SUM}_j$ is represented as the sequence of words contained in it. We denote the jth tokenized SUM as $\mathrm{SUM}^{T}_j = \{t^{T}_{j1}, \ldots, t^{T}_{jh}, \ldots, t^{T}_{jH_j}\}$, where $t^{T}_{jh}$ is the hth token and $H_j$ is the total number of tokens in $\mathrm{SUM}^{T}_j$;

b) stop-word filtering consists in eliminating stop-words, i.e., words which provide little or no information to the text analysis. Common stop-words are articles, conjunctions, prepositions, pronouns, etc. Other stop-words are those having no statistical significance, that is, those that typically appear very often in sentences of the considered language (language-specific stop-words), or in the set of texts being analyzed (domain-specific stop-words), and can therefore be considered as noise [34].
The authors in [35] have shown that the 10 most frequent words in texts and documents of the English language are about the 20–30% of the tokens in a given document. In the proposed system, the stop-word list for the Italian language was freely downloaded from the Snowball Tartarus website (footnote 6) and extended with other ad hoc defined stop-words. At the end of this step, each SUM is thus reduced to a sequence of relevant tokens. We denote the jth stop-word filtered SUM as $\mathrm{SUM}^{SW}_j = \{t^{SW}_{j1}, \ldots, t^{SW}_{jk}, \ldots, t^{SW}_{jK_j}\}$, where $t^{SW}_{jk}$ is the kth relevant token and $K_j$, with $K_j \le H_j$, is the total number of relevant tokens in $\mathrm{SUM}^{SW}_j$. We recall that a relevant token is a token that does not belong to the set of stop-words;

c) stemming is the process of reducing each word (i.e., token) to its stem or root form, by removing its suffix. The purpose of this step is to group words with the same theme having closely related semantics. In the proposed system, the stemmer exploits the Snowball Tartarus Stemmer (footnote 7) for the Italian language, based on the Porter’s algorithm [36]. Hence, at the end of this step each SUM is represented as a sequence of stems extracted from the tokens contained in it. We denote the jth stemmed SUM as $\mathrm{SUM}^{S}_j = \{t^{S}_{j1}, \ldots, t^{S}_{jl}, \ldots, t^{S}_{jL_j}\}$, where $t^{S}_{jl}$ is the lth stem and $L_j$, with $L_j \le K_j$, is the total number of stems in $\mathrm{SUM}^{S}_j$;

d) stem filtering consists in reducing the number of stems of each SUM. In particular, each SUM is filtered by removing from the set of stems the ones not belonging to the set of relevant stems. The set of $F$ relevant stems $RS = \{\hat{s}_1, \ldots, \hat{s}_f, \ldots, \hat{s}_F\}$ is identified during the supervised learning stage that will be discussed in Section IV. At the end of this step, each SUM is represented as a sequence of relevant stems. We denote the jth filtered SUM as $\mathrm{SUM}^{SF}_j = \{t^{SF}_{j1}, \ldots, t^{SF}_{jp}, \ldots, t^{SF}_{jP_j}\}$, where $t^{SF}_{jp} \in RS$ is the pth relevant stem and $P_j$, with $P_j \le L_j$ and $P_j \le F$, is the total number of relevant stems in $\mathrm{SUM}^{SF}_j$;

e) feature representation consists in building, for each SUM, the corresponding vector of numeric features. Indeed, in order to classify the SUMs, we have to represent them in the same feature space. In particular, we consider the F-dimensional set of features $X = \{X_1, \ldots, X_f, \ldots, X_F\}$ corresponding to the set of relevant stems. For each $\mathrm{SUM}^{SF}_j$ we define the vector $x_j = \{x_{j1}, \ldots, x_{jf}, \ldots, x_{jF}\}$ where each element is set according to the following formula:

$$x_{jf} = \begin{cases} w_f & \text{if stem } \hat{s}_f \in \mathrm{SUM}^{SF}_j \\ 0 & \text{otherwise.} \end{cases} \qquad (1)$$

In (1), $w_f$ is the numeric weight associated to the relevant stem $\hat{s}_f$: we will discuss how this weight is computed in Section IV.

Footnote 6: http://snowball.tartarus.org/algorithms/italian/stop.txt
Footnote 7: http://snowball.tartarus.org/algorithms/italian/stemmer.html

Fig. 2. Steps of the text mining elaboration applied to a sample tweet.

In Fig. 2, we summarize all the steps applied to a sample tweet by the “Elaboration of SUMs” module.

C. Classification of SUMs

The third module, “Classification of SUMs”, assigns to each elaborated SUM a class label related to traffic events. Thus, the output of this module is a collection of $N$ labeled SUMs. To the aim of labeling each SUM, a classification model is employed. The parameters of the classification model have been identified during the supervised learning stage. Actually, as it will be discussed in Section V, different classification models have been considered and compared. The classifier that achieved the most accurate results was finally employed for the real-time monitoring with the proposed traffic detection system. The system continuously monitors a specific region and notifies the presence of a traffic event on the basis of a set of rules that can be defined by the system administrator.
For example, when the first tweet is recognized as a traffic-related tweet, the system may send a warning signal. Then, the actual notification of the traffic event may be sent after the identification of a certain number of tweets with the same label.

IV. SETUP OF THE SYSTEM

As stated previously, a supervised learning stage is required to perform the setup of the system. In particular, we need to identify the set of relevant stems, the weights associated with each of them, and the parameters that describe the classification models. We employ a collection of $N_{tr}$ labeled SUMs as training set. During the learning stage, each SUM is elaborated by applying the tokenization, stop-word filtering, and stemming steps. Then, the complete set of stems is built as follows:

$$CS = \left( \bigcup_{j=1}^{N_{tr}} \mathrm{SUM}^{S}_j \right) = \{s_1, \ldots, s_q, \ldots, s_Q\}. \qquad (2)$$

$CS$ is the union of all the stems extracted from the $N_{tr}$ SUMs of the training set. We recall that $\mathrm{SUM}^{S}_j$ is the set of stems that describes the jth SUM after the stemming step in the training set.

Then, we compute the weight of each stem in $CS$, which allows us to establish the importance of each stem $s_q$ in the collection of SUMs of the training set, by using the Inverse Document Frequency (IDF) index as

$$w_q = IDF_q = \ln(N_{tr}/N_q), \qquad (3)$$

where $N_q$ is the number of SUMs of the training set in which the stem $s_q$ occurs [37]. The IDF index is a simplified version of the TF-IDF (Term Frequency-IDF) index [38]–[40], where the TF part considers the frequency of a specific stem within each SUM. In fact, we heuristically found that the same stem seldom appears more than once in an SUM. On the other hand, we performed several experiments also with the TF-IDF index and we verified that the performance in terms of classification accuracy is similar to the one obtained by using only the IDF index.
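As an illustration of (3) and of the feature mapping in (1), the following Python sketch computes IDF weights from a toy training set and builds the numeric vector of a SUM. The function names and the toy stems are hypothetical and are not taken from the paper or from the Weka API the authors used.

```python
import math
from collections import Counter


def idf_weights(stemmed_sums):
    """Eq. (3): w_q = ln(N_tr / N_q) for every stem of the training set."""
    n_tr = len(stemmed_sums)
    # N_q counts the SUMs that contain stem s_q, so each SUM contributes at most once.
    doc_freq = Counter(stem for stems in stemmed_sums for stem in set(stems))
    return {stem: math.log(n_tr / n_q) for stem, n_q in doc_freq.items()}


def feature_vector(sum_stems, relevant_stems, weights):
    """Eq. (1): x_jf = w_f if the relevant stem occurs in the SUM, else 0."""
    present = set(sum_stems)
    return [weights[stem] if stem in present else 0.0 for stem in relevant_stems]


# Toy training set of already stemmed SUMs (illustrative stems only).
training = [["traffic", "cod"], ["traffic", "incident"], ["partit"], ["cod", "autostrad"]]
weights = idf_weights(training)
print(feature_vector(["cod", "ignot"], ["traffic", "cod", "incident"], weights))
```

A stem that occurs in every training SUM receives weight ln(1) = 0, which matches the intuition that it carries no discriminating information.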
Thus, we decided to adopt the simpler IDF index as weight.

In order to select the set of relevant stems, a feature selection algorithm is applied. SUMs are described by a set $\{S_1, \ldots, S_q, \ldots, S_Q\}$ of $Q$ features, where each feature $S_q$ corresponds to the stem $s_q$. The possible values of feature $S_q$ are $w_q$ and 0.

Then, as suggested in [41], to evaluate the quality of each stem $s_q$, we employ a method based on the computation of the Information Gain (IG) value between feature $S_q$ and output $C = \{c_1, \ldots, c_r, \ldots, c_R\}$, where $c_r$ is one of the $R$ possible class labels (two or three in our case). The IG value between $S_q$ and $C$ is calculated as $IG(C, S_q) = H(C) - H(C|S_q)$, where $H(C)$ represents the entropy of $C$, and $H(C|S_q)$ represents the entropy of $C$ after the observation of feature $S_q$.

Finally, we identified the set of relevant stems $RS$ by selecting all the stems which have a positive IG value. We recall that the stem selection process based on IG values is a standard and effective method widely used in the literature [40], [42].

The last part of the supervised learning stage regards the identification of the most suited classification models and the setting of their structural parameters. We took into account several classification algorithms widely used in the literature for text classification tasks [43], namely, i) SVM [44], ii) NB [45], iii) C4.5 decision tree [46], iv) k-nearest neighbor (kNN) [47], and v) PART [48]. The learning algorithms used to build the aforementioned classifiers will be briefly discussed in the following section.

V. EVALUATION OF THE TRAFFIC DETECTION SYSTEM

In this section, we discuss the evaluation of the proposed system. We performed several experiments using two different datasets. For each dataset, we built and compared seven different classification models: SVM, NB, C4.5, kNN (with k equal to 1, 2, and 5), and PART. In the following, we describe how we generated the datasets to complete the setup of the system, and we recall the employed classification models.
Then, we present the achieved results, and the statistical metrics used to evaluate the performance of the classifiers. Finally, we provide a comparison with some results extracted from other works in the literature.

A. Description of the Datasets

We built two different datasets, i.e., a 2-class dataset, and a 3-class dataset. For each dataset, tweets in the Italian language were collected using the “Fetch of SUMs and Pre-processing” module by setting some search criteria (e.g., presence of keywords, geographic coordinates, date and time of posting). Then, the SUMs were manually labeled, by assigning the correct class label.

1) 2-Class Dataset: The first dataset consists of tweets belonging to two possible classes, namely i) road traffic-related tweets (traffic class), and ii) tweets not related with road traffic (non-traffic class). The tweets were fetched in a time span of about four hours from the same geographic area. First, we fetched candidate tweets for traffic class by using the following search criteria:

— geographic area of origin of the tweet: Italy. We set the center of the area in Rome (latitude and longitude equal to 41° 53′ 35″ and 12° 28′ 58″, respectively) and we set a radius of about 600 km to cover approximately the whole country;
— time and date of posting: tweets belong to a time span of four evening hours of two weekend days of May 2013;
— keywords contained in the text of the tweet: we apply the or-operator on the set of keywords S1, composed by the three most frequently used traffic-related keywords, S1 = {“traffico” (traffic), “coda” (queue), “incidente” (crash)}, with the aim of selecting tweets containing at least one of the above keywords. The resulting condition can be expressed by:
CondA: “traffico” or “coda” or “incidente”.

Then, we fetched the candidate tweets for non-traffic class using the same search criteria for geographic area, and time and date, but without setting any keyword.
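CondA can be read as a simple predicate over the text of a tweet. The Python sketch below is only an illustration: the function name `cond_a` and the whole-word matching rule are assumptions, since the paper delegates keyword matching to the Twitter search criteria and does not describe it further.

```python
import re

# Keyword set S1 as defined in the text.
S1 = ("traffico", "coda", "incidente")


def cond_a(text: str) -> bool:
    """CondA: the tweet contains "traffico" or "coda" or "incidente".

    Assumption: a keyword matches only as a whole word, case-insensitively.
    """
    words = set(re.findall(r"\w+", text.lower()))
    return any(keyword in words for keyword in S1)


print(cond_a("Coda di 3 km sulla A1"))       # candidate for the traffic class
print(cond_a("Che bella partita stasera"))   # no traffic-related keyword
```

A predicate of this kind only selects candidates: as the manual labeling step shows, a matching tweet may still turn out to be unrelated to road traffic.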
Obviously, this time, tweets containing traffic-related keywords from set S1, already found in the previous fetch, were discarded.

Finally, the tweets were manually labeled with two possible class labels, i.e., as related to road traffic event (traffic), e.g., accidents, jams, queues, or not (non-traffic). More in detail, first we read, interpreted, and correctly assigned a traffic class label to each candidate traffic class tweet. Among all candidate traffic class tweets, we actually labeled 665 tweets with the traffic class. About 4% of candidate traffic class tweets were not labeled with the traffic class label. With the aim of correctly training the system, we added these tweets to the non-traffic class. Indeed, we collected also a number of tweets containing the traffic-related keywords defined in S1, but actually not concerning road traffic events. Such tweets are related to, e.g., illegal drug trade, network traffic, or organ trafficking. It is worth noting that, as it happens in the English language, several words in the Italian language, e.g., “traffic” or “incident”, are suitable in several contexts. So, for instance, the events “traffico di droga” (drug trade), “traffico di organi” (organ trafficking), “incidente diplomatico” (diplomatic scandal), “traffico dati” (network traffic) could be easily mistaken for road traffic-related events.

Then, in order to obtain a balanced dataset, we randomly selected tweets from the candidate tweets of non-traffic class until reaching 665 non-traffic class tweets, and we manually verified that the selected tweets did not belong to the traffic class. Thus, the final 2-class dataset consists of 1330 tweets and is balanced, i.e., it contains 665 tweets per class.

Table I shows the textual part of a selection of tweets fetched by the system with the corresponding, manually added, class label.
In Table I, tweets #1, #2 and #3 are examples of traffic class tweets, tweet #4 is an example of a non-traffic class tweet, tweets #5 and #6 are examples of tweets containing traffic-related keywords, but belonging to the non-traffic class. In the table, for an easier understanding, the keywords appearing in the text of each tweet are underlined. Table II shows some of the most important textual features (i.e., stems) and their meaning, related to the traffic class tweets, identified by the system for this dataset.

TABLE I. SOME EXAMPLES OF TWEETS AND CORRESPONDING CLASSES FOR THE 2-CLASS DATASET

TABLE II. SIGNIFICANT FEATURES RELATED TO THE TRAFFIC CLASS

2) 3-Class Dataset: The second dataset consists of tweets belonging to three possible classes. In this case we want to discriminate if traffic is caused by an external event (e.g., a football match, a concert, a flash-mob, a political demonstration, a fire) or not. Even though the current release of the system was not designed to identify the specific event, knowing that the traffic difficulty is caused by an external event could be useful to traffic and city administrations, for regulating traffic and vehicular mobility, or managing scheduled events in the city. More in detail, we took into account four possible external events, namely, i) matches, ii) processions, iii) music concerts, and iv) demonstrations. Thus, in this dataset the three possible classes are: i) traffic due to external event, ii) traffic congestion or crash, and iii) non-traffic. The tweets were fetched in a similar way as described before.
More in detail, first, we fetched candidate road traffic-related tweets due to an external event (traffic due to external event class) according to the following search criteria:

— geographic area of origin of the tweet: Italy, parameters set as in the case of the 2-class dataset;
— time and date of posting: parameters set as in the case of the 2-class dataset, but different hours of the same weekend days are used;
— keywords contained in the text of the tweet: for each external event aforementioned, we took into account only one keyword, thus obtaining the set S2 = {“partita” (match), “processione” (procession), “concerto” (concert), “manifestazione” (demonstration)}. Next we combined each keyword representing the external event with one of the traffic-related keywords from set S3 = {“traffico” (traffic), “coda” (queue)}. Finally, we applied the and-operator between each keyword from set S2 and the condition CondB expressed as:
CondB: “traffico” or “coda”,
thus obtaining the following conditions:
CondC: CondB and “partita”,
CondD: CondB and “processione”,
CondE: CondB and “concerto”,
CondF: CondB and “manifestazione”.

Then, we fetched candidate tweets related to traffic congestion, crashes, and jams (traffic congestion or crash class) by using the following search criteria:

— geographic area of origin of the tweet: Italy, parameters set as in the case of the 2-class dataset;
— time and date of posting: parameters set as in the case of the 2-class dataset, but different hours of the same weekend days are used;
— keywords contained in the text of the tweet: we combined the above-mentioned keywords from set S1 in three possible sets: S4 = {“traffico” (traffic), “incidente” (crash)}, S5 = {“incidente” (crash), “coda” (queue)}, and the already defined set S3.
Then we used the and-operator to define the exploited conditions as follows:
CondG: “traffico” and “incidente”,
CondH: “traffico” and “coda”,
CondI: “incidente” and “coda”.

Obviously, as done before, tweets containing external event-related keywords, already found in the previous fetch, were discarded. Further, we fetched the candidate tweets of non-traffic class using the same search criteria for geographic area, and time and date, but without setting any keyword. Again, tweets already found in previous fetches were discarded.

Finally, the tweets were manually labeled with three possible class labels. We first labeled the candidate tweets of traffic due to external event class (this set of tweets was the smaller one), and we identified 333 tweets actually associated with this class. Then, we randomly selected 333 tweets for each of the two remaining classes. Also, in this case, we manually verified the correctness of the labels associated to the selected tweets. Finally, as done before, we added to the non-traffic class also tweets containing keywords related to traffic congestion and to external events but not concerning road-traffic events. The final 3-class dataset consists of 999 tweets and it is balanced, i.e., it has 333 tweets per class.

TABLE III. SOME EXAMPLES OF TWEETS AND CORRESPONDING CLASSES FOR THE 3-CLASS DATASET

Table III shows a selection of tweets fetched by the system for the 3-class dataset, with the corresponding, manually added, class label. In Table III, tweets #1, #2, #3 and #4 are examples of tweets belonging to the class traffic due to external event: in more detail, #1 is related to a procession event, #2 is related to a match event, #3 is related to a concert event, and #4 is related to a demonstration event.
Tweet #5 is an example of a tweet belonging to the class traffic congestion or crash, while tweets #6 and #7 are examples of non-traffic class tweets. Words underlined in the text of each tweet represent involved keywords.

B. Employed Classification Models

In the following we briefly describe the main properties of the employed and experimented classification models.

SVMs, introduced for the first time in [49], are discriminative classification algorithms based on a separating hyper-plane according to which new samples can be classified. The best hyper-plane is the one with the maximum margin, i.e., the largest minimum distance, from the training samples and is computed based on the support vectors (i.e., samples of the training set). The SVM classifier employed in this work is the implementation described in [44].

The NB classifier [45] is a probabilistic classification algorithm based on the application of the Bayes’s theorem, and is characterized by a probability model which assumes independence among the input features. In other words, the model assumes that the presence of a particular feature is unrelated to the presence of any other feature.

The C4.5 decision tree algorithm [46] generates a classification decision tree by recursively dividing up the training data according to the values of the features. Non-terminal nodes of the decision tree represent tests on one or more features, while terminal nodes represent the predicted output, namely the class. In the resulting decision tree each path (from the root to a leaf) identifies a combination of feature values associated with a particular classification. At each level of the tree, the algorithm chooses the feature that most effectively splits the data, according to the highest information gain.

The kNN algorithm [50] belongs to the family of “lazy” classification algorithms.
The basic functioning principle is the following: each unseen sample is compared with a number of pre-classified training samples, and its similarity is evaluated according to a simple distance measure (e.g., we employed the normalized Euclidean distance), in order to find the associated output class. The parameter k allows specifying the number of neighbors, i.e., training samples to take into account for the classification. We focus on three kNN models with k equal to 1, 2, and 5. The kNN classifier employed in this work follows the implementation described in [47].

The PART algorithm [48] combines two rule generation methods, i.e., RIPPER [51] and C4.5 [46]. It infers classification rules by repeatedly building partial, i.e., incomplete, C4.5 decision trees and by using the separate-and-conquer rule learning technique [52].

C. Experimental Results

In this section, we present the classification results achieved by applying the classifiers mentioned in Section V-B to the two datasets described in Section V-A. For each classifier the experiments were performed using an n-fold stratified cross-validation methodology. In n-fold stratified cross-validation, the dataset is randomly partitioned into n folds and the classes in each fold are represented with the same proportion as in the original data. Then, the classification model is trained on n − 1 folds, and the remaining fold is used for testing the model. The procedure is repeated n times, using as test data each of the n folds exactly once. The n test results are finally averaged to produce an overall estimation.
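The n-fold stratified procedure just described can be sketched in plain Python as follows. This is a minimal illustration, not the Weka routine used by the authors: the function names, the round-robin assignment of samples to folds, and the `evaluate` callback are assumptions introduced here.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=10, seed=0):
    """Split sample indices into n_folds with near-equal class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin over the folds.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


def cross_validate(labels, evaluate, n_folds=10, seed=0):
    """Average evaluate(train_idx, test_idx) over the folds, each used once as test set."""
    folds = stratified_folds(labels, n_folds, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        scores.append(evaluate(train_idx, test_idx))
    return sum(scores) / len(scores)


# Class sizes of the 2-class dataset described in Section V-A.
labels = ["traffic"] * 665 + ["non-traffic"] * 665
folds = stratified_folds(labels, n_folds=10, seed=1)
print([len(fold) for fold in folds])
```

In a real run, `evaluate` would train a classifier on `train_idx` (including the learning of relevant stems and weights) and return its accuracy on `test_idx`; changing `seed` yields a different random partition.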
We repeated an n-fold stratified cross-validation, with n = 10, for two times, using two different seed values to randomly partition the data into folds.

We recall that, for each fold, we consider a specific training set which is used in the supervised learning stage to learn both the pre-processing (i.e., the set of relevant stems and their weights) and the classification model parameters.

To evaluate the achieved results, we employed the most frequently used statistical metrics, i.e., precision, accuracy, recall, and F-score. To explain the meaning of the metrics, we will refer, for the sake of simplicity, to the case of a binary classification, i.e., positive class versus negative class. In fact, in the case of a multi-class classification, the metrics are computed per class and the overall statistical measure is simply the average of the per-class measures. The correctness of a classification can be evaluated according to four values: i) true positives (TP): the number of real positive samples correctly classified as positive; ii) true negatives (TN): the number of real negative samples correctly classified as negative; iii) false positives (FP): the number of real negative samples incorrectly classified as positive; iv) false negatives (FN): the number of real positive samples incorrectly classified as negative.

TABLE IV. STATISTICAL METRICS

Based on the previous definitions, we can now formally define the employed statistical metrics and provide, in Table IV, the corresponding equations. Accuracy represents the overall effectiveness of the classifier and corresponds to the number of correctly classified samples over the total number of samples. Precision is the number of correctly classified samples of a class, i.e., positive class, over the number of samples classified as belonging to that class.
Recall is the number of correctly classified samples of a class, i.e., positive class, over the number of samples of that class; it represents the effectiveness of the classifier to identify positive samples. The F-score (typically used with β = 1 for class-balanced datasets) is the weighted harmonic mean of precision and recall and it is used to compare different classifiers.

In the first experiment, we performed a classification of tweets using the 2-class dataset (R = 2) consisting of 1330 tweets, described in Section V-A. The aim is to assign a class label (traffic or non-traffic) to each tweet.

Table V shows the average classification results obtained by the classifiers on the 2-class dataset. More in detail, the table shows, for each classifier, the accuracy and the per-class values of recall, precision, and F-score. All the values are averaged over the 20 values obtained by repeating the 10-fold cross-validation twice. The best classifier turned out to be the SVM, with an average accuracy of 95.75%.

As Table VI clearly shows, the results achieved by our SVM classifier appreciably outperform those obtained in similar works in the literature [9], [12], [24], [31], although they refer to different datasets. More precisely, Wanichayapong et al. [12] obtained an accuracy of 91.75% by using an approach that considers the presence of place mentions and special keywords in the tweet. Li et al. [31] achieved an accuracy of 80% for detecting incident-related tweets using Twitter-specific features, such as hashtags, mentions, URLs, and spatial and temporal information. Sakaki et al. [9] employed an SVM to identify heavy-traffic tweets and obtained an accuracy of 87%. Finally, Schulz et al.
[24], by using SVM, RIPPER, and NB classifiers, obtained accuracies of 89.06%, 85.93%, and 86.25%, respectively. In the case of SVM, they used the following features: word n-grams, TF-IDF score, syntactic and semantic features. In the case of NB and RIPPER they employed the same set of features except semantic features.

In the second experiment, we performed a classification of tweets over three classes (R = 3), namely, traffic due to external event, traffic congestion or crash, and non-traffic, with the aim of discriminating the cause of traffic. Thus, we employed the 3-class dataset consisting of 999 tweets, described in Section V-A. We employed again the classifiers previously introduced and the obtained results are shown in Table VII. The best classifier turned out to be again the SVM, with an average accuracy of 88.89%.

In order to verify whether there exist statistical differences among the values of accuracy achieved by the seven classification models, we performed a statistical analysis of the results. We took into account the model which obtains the best average accuracy, i.e., the SVM model. As suggested in [53], we applied non-parametric statistical tests: for each classifier we generated a distribution consisting of the 20 values of the accuracies on the test set obtained by repeating the 10-fold cross-validation twice. We statistically compared the results achieved by the SVM model with the ones achieved by the remaining models. We applied the Wilcoxon signed-rank test [54], which detects significant differences between two distributions. In all the tests, we used α = 0.05 as level of significance. Tables VIII and IX show the results of the Wilcoxon signed-rank test, related to the 2-class and the 3-class problems, respectively. In the tables, R+ and R− denote, respectively, the sum of ranks for the folds in which the first model outperformed the second, and the sum of ranks for the opposite condition.
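To make the meaning of R+ and R− concrete, the following Python sketch computes the two rank sums from paired per-fold scores. It is an illustrative implementation under stated assumptions, not the procedure of [53] or [54] verbatim: zero differences are dropped and tied absolute differences receive their average rank, and the function name `signed_rank_sums` is hypothetical. The p-value computation is omitted.

```python
def signed_rank_sums(first, second):
    """Return (R+, R-) for paired scores of two models.

    Assumptions: pairs with equal scores are discarded, and tied absolute
    differences share the average of the ranks they span.
    """
    diffs = [a - b for a, b in zip(first, second) if a != b]
    ordered = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    start = 0
    while start < len(ordered):
        end = start
        # Extend the group over all tied absolute differences.
        while (end + 1 < len(ordered)
               and abs(diffs[ordered[end + 1]]) == abs(diffs[ordered[start]])):
            end += 1
        average_rank = (start + end) / 2 + 1  # ranks are 1-based
        for k in range(start, end + 1):
            ranks[ordered[k]] = average_rank
        start = end + 1
    r_plus = sum(rank for rank, d in zip(ranks, diffs) if d > 0)
    r_minus = sum(rank for rank, d in zip(ranks, diffs) if d < 0)
    return r_plus, r_minus


# Toy paired scores (integers, to avoid floating-point ties); not data from the paper.
print(signed_rank_sums([5, 7, 9, 4], [3, 8, 5, 4]))
```

A large R+ together with a small R− indicates that the first model wins on most folds and by the larger margins, which is the pattern the test evaluates for significance.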
Since the p-values are always lower than the level of significance, we can always reject the statistical hypothesis of equivalence. For this reason, we can state that the SVM model statistically outperforms all the other approaches on both problems.

IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 16, NO. 4, AUGUST 2015
D’ANDREA et al.: REAL-TIME DETECTION OF TRAFFIC FROM TWITTER STREAM ANALYSIS

[Table V: Classification results on the 2-class dataset (best values in bold)]
[Table VI: Results of the classification of tweets in other works in the literature]
[Table VII: Classification results on the 3-class dataset (best values in bold)]
[Table VIII: Results of the Wilcoxon signed-rank test on the accuracies obtained on the test set for the 2-class dataset]
[Table IX: Results of the Wilcoxon signed-rank test on the accuracies obtained on the test set for the 3-class dataset]
[Table X: Real-time detection of traffic events]

VI. REAL-TIME DETECTION OF TRAFFIC EVENTS

The developed system was installed and tested for the real-time monitoring of several areas of the Italian road network, by means of the analysis of the Twitter stream coming from those areas. The aim is to perform a continuous monitoring of frequently busy roads and highways in order to detect possible traffic events in real time or even in advance with respect to the traditional news media [55], [56]. The system is implemented as a service of a wider service-oriented platform to be developed in the context of the SMARTY project [23]. The service can be called by any user of the platform who wishes to know the traffic conditions in a certain area. In this section, we aim to show the effectiveness of our system in detecting traffic events in a short time. We present only some results for the 2-class problem. For the setup of the system, we have employed as training set the overall dataset described in Section V-A. We adopt only the best performing classifier, i.e., the SVM classifier.
During the learning stage, we identified Q = 3227 features, which were reduced to F = 582 features after the feature selection step.

The system continuously performs the following operations: it i) fetches, with a time frequency of z minutes, tweets originating from a given area and containing the keywords resulting from CondA, ii) performs a real-time classification of the fetched tweets, iii) detects a possible traffic-related event by analyzing the traffic class tweets from the considered area and, if needed, sends one or more traffic warning signals with increasing intensity for that area. In more detail, a first low-intensity warning signal is sent when m traffic class tweets are found in the considered area in the same or in subsequent temporal windows. Then, as the number of traffic class tweets grows, the warning signal becomes more reliable and thus more intense. The value of m was set based on heuristic considerations, depending, e.g., on the traffic density of the monitored area. In the experiments we set m = 1. As regards the fetching frequency z, we heuristically found that z = 10 minutes represents a good compromise between fast event detection and system scalability. In fact, z should be set depending on the number of monitored areas and on the volume of tweets fetched.

In order to evaluate the effectiveness of our system, each detected traffic-related event must be appropriately validated.
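The warning rule described above (a first warning once m traffic class tweets have been seen across the same or subsequent windows, with intensity growing as more arrive) can be sketched as follows. This is only an illustration under stated assumptions: the function name `warning_levels`, the choice of intensity as the running count divided by m, and the reset after a window with no traffic tweets are hypothetical and are not specified in the text.

```python
def warning_levels(traffic_counts, m=1):
    """Map per-window counts of traffic-class tweets to warning intensities.

    traffic_counts[i] is the number of tweets classified as "traffic" in the
    i-th fetch window (one window every z minutes) for one monitored area.
    Level 0 means no warning; higher values mean a more intense warning.
    """
    levels = []
    running = 0  # traffic tweets seen in the current run of consecutive windows
    for count in traffic_counts:
        if count == 0:
            running = 0  # assumption: a quiet window ends the current event
        else:
            running += count
        # Assumption: intensity grows by one step for every m traffic tweets.
        levels.append(running // m if running >= m else 0)
    return levels


# With m = 1, as in the experiments, a single traffic tweet raises a warning.
print(warning_levels([0, 1, 2, 0, 1], m=1))
```

With a larger m, the first warning is delayed until enough traffic tweets have accumulated, which matches the remark that m depends on the traffic density of the monitored area.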
Validation can be performed in different ways, which include: i) direct communication by a person who was present at the moment of the event; ii) reports drawn up by the police and/or local administrations (available only in case of incidents); iii) radio traffic news; iv) official real-time traffic news web sites; v) local newspapers (often the day after the event and only when the event is very significant).

Direct communication is possible only if a person is present at the event and can communicate this event to us. Although we have tried to sensitize a number of users, we did not obtain adequate feedback. Official reports are confidential: police and local administrations rarely allow access to these reports, and, when this permission is granted, reports can be consulted only after several days. Radio traffic news is in general quite precise in communicating traffic-related events in real time. Unfortunately, to monitor and store the events, we would have to dedicate a person or adopt some tool for audio analysis. We realized, however, that the traffic-related events communicated on the radio are always also mentioned on the official real-time traffic news web sites. In fact, on the radio, the speaker typically reads the news reported on the web sites. Local newspapers focus on local traffic-related events and often report events that are not published on official traffic news web sites.

In conclusion, official real-time traffic news web sites and local newspapers are the most reliable and effective sources of information for traffic-related events. Thus, we decided to analyze two of the most popular real-time traffic news web sites for the Italian road network, namely “CCISS Viaggiare informati” (footnote 8), managed by the Italian government Ministry for infrastructures and transports, and “Autostrade per l’Italia” (footnote 9), the official web site of the Italian highway road network.
Further, we examined local newspapers published in the zones where our system was able to detect traffic-related events.

It was in fact really difficult to find realistic data to test the proposed system, basically for two reasons: on the one hand, we have realized that real traffic events are not always reported in official news channels; on the other hand, situations of traffic slowdown may be detected by traditional traffic sensors but, at the same time, may not give rise to tweets. In particular, in relation to this latter reason, it is well known that drivers usually share a tweet about a traffic event only when the event is unexpected and really serious, i.e., it forces them to stop the car. So, for instance, they do not share a tweet in case of road works, minor traffic difficulties, or usual traffic jams (same place and same time). In fact, for minor traffic jams we rarely find tweets coming from the affected area.

We have tried to build a meaningful set of traffic events, related to some major Italian cities, for which we have found an official confirmation. The selected set includes events correctly identified by the proposed system and confirmed via official traffic news web sites or local newspapers. The set of traffic events, whose information is summarized in Table X, consists of 70 events detected by our system. The events are related both to highways and to urban roads, and were detected during September and early October 2014.

Table X shows the information about the event, the time of detection from Twitter’s stream fetched by our system, the time of detection from official news web sites or local newspapers, and the difference between these two times. In the table, positive differences indicate a late detection with respect to official news web sites, while negative differences indicate an early detection. The symbol “-” indicates that we found the official confirmation of the event by reading local newspapers several hours later.
More precisely, the system detects in advance 20 events out of 59 confirmed by news web sites, and 11 events confirmed the day after by local newspapers. Regarding the 39 events not detected in advance, we can observe that 25 of these events are detected within 15 minutes of their official notification, while the detection of the remaining 14 events occurs beyond 15 minutes but within 50 minutes. We wish to point out, however, that, even in the cases of late detection, our system directly and explicitly notifies the event occurrence to the drivers or passengers registered to the SMARTY platform, on which our system runs. In contrast, in order to get traffic information, drivers or passengers usually need to search and access the official news web sites, which may take some time and effort, or to wait for the information from the radio traffic news.

As future work, we are planning to integrate our system with an application for analyzing the official traffic news web sites, so as to capture traffic condition notifications in real time. Thus, our system will be able to signal traffic-related events, in the worst case, at the same time as the notifications on the web sites. Further, we are investigating the integration of our system into a more complex traffic detection infrastructure. This infrastructure may include both advanced physical sensors and social sensors such as streams of tweets. In particular, social sensors may provide a low-cost wide coverage of the road network, especially in those areas (e.g., urban and suburban) where traditional traffic sensors are missing.

Footnote 8: http://www.cciss.it/
Footnote 9: http://www.autostrade.it/autostrade-gis/gis.do

VII. CONCLUSION

In this paper, we have proposed a system for real-time detection of traffic-related events from Twitter stream analysis. The system, built on a SOA, is able to fetch and classify streams of tweets and to notify the users of the presence of traffic events.
Furthermore, the system is also able to discriminate whether or not a traffic event is due to an external cause, such as a football match, procession, or demonstration.

We have exploited available software packages and state-of-the-art techniques for text analysis and pattern classification. These technologies and techniques have been analyzed, tuned, adapted, and integrated in order to build the overall system for traffic event detection. Among the analyzed classifiers, we have shown the superiority of the SVMs, which have achieved an accuracy of 95.75% for the 2-class problem and of 88.89% for the 3-class problem, in which we have also considered the traffic due to external event class.

The best classification model has been employed for real-time monitoring of several areas of the Italian road network. We have shown the results of a monitoring campaign performed in September and early October 2014. We have discussed the capability of the system to detect traffic events almost in real time, often before online news web sites and local newspapers.

ACKNOWLEDGMENT

We would like to thank Fabio Cempini for the implementation of some parts of the system presented in this paper.

REFERENCES

[1] F. Atefeh and W. Khreich, “A survey of techniques for event detection in Twitter,” Comput. Intell., vol. 31, no. 1, pp. 132–164, 2015.
[2] P. Ruchi and K. Kamalakar, “ET: Events from tweets,” in Proc. 22nd Int. Conf. World Wide Web Comput., Rio de Janeiro, Brazil, 2013, pp. 613–620.
[3] A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, and B. Bhattacharjee, “Measurement and analysis of online social networks,” in Proc. 7th ACM SIGCOMM Conf. Internet Meas., San Diego, CA, USA, 2007, pp. 29–42.
[4] G. Anastasi et al., “Urban and social sensing for sustainable mobility in smart cities,” in Proc. IFIP/IEEE Int. Conf. Sustainable Internet ICT Sustainability, Palermo, Italy, 2013, pp. 1–4.
[5] A. Rosi et al., “Social sensors and pervasive services: Approaches and perspectives,” in Proc. IEEE Int. Conf. PERCOM Workshops, Seattle, WA, USA, 2011, pp. 525–530.
[6] T. Sakaki, M. Okazaki, and Y. Matsuo, “Tweet analysis for real-time event detection and earthquake reporting system development,” IEEE Trans. Knowl. Data Eng., vol. 25, no. 4, pp. 919–931, Apr. 2013.
[7] J. Allan, Topic Detection and Tracking: Event-Based Information Organization. Norwell, MA, USA: Kluwer, 2002.
[8] K. Perera and D. Dias, “An intelligent driver guidance tool using location based services,” in Proc. IEEE ICSDM, Fuzhou, China, 2011, pp. 246–251.
[9] T. Sakaki, Y. Matsuo, T. Yanagihara, N. P. Chandrasiri, and K. Nawa, “Real-time event extraction for driving information from social sensors,” in Proc. IEEE Int. Conf. CYBER, Bangkok, Thailand, 2012, pp. 221–226.
[10] B. Chen and H. H. Cheng, “A review of the applications of agent technology in traffic and transportation systems,” IEEE Trans. Intell. Transp. Syst., vol. 11, no. 2, pp. 485–497, Jun. 2010.
[11] A. Gonzalez, L. M. Bergasa, and J. J. Yebes, “Text detection and recognition on traffic panels from street-level imagery using visual appearance,” IEEE Trans. Intell. Transp. Syst., vol. 15, no. 1, pp. 228–238, Feb. 2014.
[12] N. Wanichayapong, W. Pruthipunyaskul, W. Pattara-Atikom, and P. Chaovalit, “Social-based traffic information extraction and classification,” in Proc. 11th Int. Conf. ITST, St. Petersburg, Russia, 2011, pp. 107–112.
[13] P. M. d’Orey and M. Ferreira, “ITS for sustainable mobility: A survey on applications and impact assessment tools,” IEEE Trans. Intell. Transp. Syst., vol. 15, no. 2, pp. 477–493, Apr. 2014.
[14] K. Boriboonsomsin, M. Barth, W. Zhu, and A. Vu, “Eco-routing navigation system based on multisource historical and real-time traffic information,” IEEE Trans. Intell. Transp. Syst., vol. 13, no. 4, pp. 1694–1704, Dec. 2012.
[15] J. Hurlock and M. L. Wilson, “Searching twitter: Separating the tweet from the chaff,” in Proc. 5th AAAI ICWSM, Barcelona, Spain, 2011, pp. 161–168.
[16] J. Weng and B.-S. Lee, “Event detection in Twitter,” in Proc. 5th AAAI ICWSM, Barcelona, Spain, 2011, pp. 401–408.
[17] S. Weiss, N. Indurkhya, T. Zhang, and F. Damerau, Text Mining: Predictive Methods for Analyzing Unstructured Information. Berlin, Germany: Springer-Verlag, 2004.
[18] A. Hotho, A. Nürnberger, and G. Paaß, “A brief survey of text mining,” LDV Forum-GLDV J. Comput. Linguistics Lang. Technol., vol. 20, no. 1, pp. 19–62, May 2005.
[19] V. Gupta, S. Gurpreet, and S. Lehal, “A survey of text mining techniques and applications,” J. Emerging Technol. Web Intell., vol. 1, no. 1, pp. 60–76, Aug. 2009.
[20] V. Ramanathan and T. Meyyappan, “Survey of text mining,” in Proc. Int. Conf. Technol. Bus. Manage., Dubai, UAE, 2013, pp. 508–514.
[21] M. W. Berry and M. Castellanos, Survey of Text Mining. New York, NY, USA: Springer-Verlag, 2004.
[22] H. Takemura and K. Tajima, “Tweet classification based on their lifetime duration,” in Proc. 21st ACM Int. CIKM, Shanghai, China, 2012, pp. 2367–2370.
[23] The Smarty project. [Online]. Available: http://www.smarty.toscana.it/
[24] A. Schulz, P. Ristoski, and H. Paulheim, “I see a car crash: Real-time detection of small scale incidents in microblogs,” in The Semantic Web: ESWC 2013 Satellite Events, vol. 7955. Berlin, Germany: Springer-Verlag, 2013, pp. 22–33.
[25] M. Krstajic, C. Rohrdantz, M. Hund, and A. Weiler, “Getting there first: Real-time detection of real-world incidents on Twitter,” in Proc. 2nd IEEE Work. Interactive Vis. Text Anal.: Task-Driven Anal. Soc. Media, IEEE VisWeek, Seattle, WA, USA, 2012.
[26] C. Chew and G. Eysenbach, “Pandemics in the age of Twitter: Content analysis of tweets during the 2009 H1N1 outbreak,” PLoS ONE, vol. 5, no. 11, pp. 1–13, Nov. 2010.
[27] B. De Longueville, R. S. Smith, and G. Luraschi, “OMG, from here, I can see the flames!: A use case of mining location based social networks to acquire spatio-temporal data on forest fires,” in Proc. Int. Work. LBSN, Seattle, WA, USA, 2009, pp. 73–80.
[28] J. Yin, A. Lampert, M. Cameron, B. Robinson, and R. Power, “Using social media to enhance emergency situation awareness,” IEEE Intell. Syst., vol. 27, no. 6, pp. 52–59, Nov./Dec. 2012.
[29] P. Agarwal, R. Vaithiyanathan, S. Sharma, and G. Shroff, “Catching the long-tail: Extracting local news events from Twitter,” in Proc. 6th AAAI ICWSM, Dublin, Ireland, Jun. 2012, pp. 379–382.
[30] F. Abel, C. Hauff, G.-J. Houben, R. Stronkman, and K. Tao, “Twitcident: fighting fire with information from social web streams,” in Proc. ACM 21st Int. Conf. Comp. WWW, Lyon, France, 2012, pp. 305–308.
[31] R. Li, K. H. Lei, R. Khadiwala, and K. C.-C. Chang, “TEDAS: A Twitter-based event detection and analysis system,” in Proc. 28th IEEE ICDE, Washington, DC, USA, 2012, pp. 1273–1276.
[32] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, “The WEKA data mining software: An update,” SIGKDD Explor. Newsl., vol. 11, no. 1, pp. 10–18, Jun. 2009.
[33] M. Habibi, Real World Regular Expressions with Java 1.4. Berlin, Germany: Springer-Verlag, 2004.
[34] Y. Zhou and Z.-W. Cao, “Research on the construction and filter method of stop-word list in text preprocessing,” in Proc. 4th ICICTA, Shenzhen, China, 2011, vol. 1, pp. 217–221.
[35] W. Francis and H. Kucera, “Frequency analysis of English usage: Lexicon and grammar,” J. English Linguistics, vol. 18, no. 1, pp. 64–70, Apr. 1982.
[36] M. F. Porter, “An algorithm for suffix stripping,” Program: Electron. Library Inf. Syst., vol. 14, no. 3, pp. 130–137, 1980.
[37] G. Salton and C. Buckley, “Term-weighting approaches in automatic text retrieval,” Inf. Process. Manage., vol. 24, no. 5, pp. 513–523, Aug. 1988.
[38] L. M. Aiello et al., “Sensing trending topics in Twitter,” IEEE Trans. Multimedia, vol. 15, no. 6, pp. 1268–1282, Oct. 2013.
[39] C. Shang, M. Li, S. Feng, Q. Jiang, and J. Fan, “Feature selection via maximizing global information gain for text classification,” Knowl.-Based Syst., vol. 54, pp. 298–309, Dec. 2013.
[40] L. H. Patil and M. Atique, “A novel feature selection based on information gain using WordNet,” in Proc. SAI Conf., London, U.K., 2013, pp. 625–629.
[41] M. A. Hall and G. Holmes, “Benchmarking attribute selection techniques for discrete class data mining,” IEEE Trans. Knowl. Data Eng., vol. 15, no. 6, pp. 1437–1447, Nov./Dec. 2003.
[42] H. Uğuz, “A two-stage feature selection method for text categorization by using information gain, principal component analysis and genetic algorithm,” Knowl.-Based Syst., vol. 24, no. 7, pp. 1024–1032, Oct. 2011.
[43] Y. Aphinyanaphongs et al., “A comprehensive empirical comparison of modern supervised classification and feature selection methods for text categorization,” J. Assoc. Inf. Sci. Technol., vol. 65, no. 10, pp. 1964–1987, Oct. 2014.
[44] J. Platt, “Fast training of support vector machines using sequential minimal optimization,” in Advances in Kernel Methods: Support Vector Learning, B. Schoelkopf, C. J. C. Burges, and A. J. Smola, Eds. Cambridge, MA, USA: MIT Press, 1999, pp. 185–208.
[45] G. H. John and P. Langley, “Estimating continuous distributions in Bayesian classifiers,” in Proc. 11th Conf. Uncertainty Artif. Intell., San Mateo, CA, USA, 1995, pp. 338–345.
[46] J. R. Quinlan, C4.5: Programs for Machine Learning. San Mateo, CA, USA: Morgan Kaufmann, 1993.
[47] D. W. Aha, D. Kibler, and M. K. Albert, “Instance-based learning algorithms,” Mach. Learn., vol. 6, no. 1, pp. 37–66, Jan. 1991.
[48] E. Frank and I. H. Witten, “Generating accurate rule sets without global optimization,” in Proc. 15th ICML, Madison, WI, USA, 1998, pp. 144–151.
[49] C. Cortes and V. Vapnik, “Support-vector networks,” Mach. Learn., vol. 20, no. 3, pp. 273–297, Sep. 1995.
[50] T. T. Cover and P. E. Hart, “Nearest neighbour pattern classification,” IEEE Trans. Inf. Theory, vol. IT-13, no. 1, pp. 21–27, Jan. 1967.
[51] W. W. Cohen, “Fast effective rule induction,” in Proc. 12th ICML, Tahoe City, CA, USA, 1995, pp. 115–123.
[52] G. Pagallo and D. Haussler, “Boolean feature discovery in empirical learning,” Mach. Learn., vol. 5, no. 1, pp. 71–99, Mar. 1990.
[53] J. Derrac, S. Garcia, D. Molina, and F. Herrera, “A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms,” Swarm Evol. Comput., vol. 1, no. 1, pp. 3–18, Mar. 2011.
[54] F. Wilcoxon, “Individual comparisons by ranking methods,” Biometrics Bull., vol. 1, no. 6, pp. 80–83, Dec. 1945.
[55] H. Becker, M. Naaman, and L. Gravano, “Beyond trending topics: Real-world event identification on Twitter,” in Proc. 5th AAAI ICWSM, Barcelona, Spain, 2011, pp. 438–441.
[56] H. Kwak, C. Lee, H. Park, and S. Moon, “What is Twitter, a social network or a news media?” in Proc. ACM 19th Int. Conf. WWW, Raleigh, NC, USA, 2010, pp. 591–600.

Eleonora D’Andrea received the M.S. degree in computer engineering for enterprise management and the Ph.D. degree in information engineering from University of Pisa, Pisa, Italy, in 2009 and 2013, respectively. She is a Research Fellow with the Research Center “E. Piaggio,” University of Pisa. She has coauthored several papers in international journals and conference proceedings. Her main research interests include computational intelligence techniques for simulation and prediction, applied to various fields, such as energy consumption in buildings or energy production in solar photovoltaic installations.

Pietro Ducange received the M.Sc. degree in computer engineering and the Ph.D. degree in information engineering from University of Pisa, Pisa, Italy, in 2005 and 2009, respectively. He is an Associate Professor with the Faculty of Engineering, eCampus University, Novedrate, Italy. He has coauthored more than 30 papers in international journals and conference proceedings.
His main research interests focus on designing fuzzy rule-based systems with different tradeoffs between accuracy and interpretability by using multiobjective evolutionary algorithms. He currently serves the following international journals as a member of the Editorial Board: Soft Computing and International Journal of Swarm Intelligence and Evolutionary Computation.

Beatrice Lazzerini (M’98) is a Full Professor with the Department of Information Engineering, University of Pisa, Pisa, Italy. She has cofounded the Computational Intelligence Group in the Department of Information Engineering, University of Pisa. She has coauthored seven books and has published over 200 papers in international journals and conferences. She is a coeditor of two books. Her research interests are in the field of computational intelligence and its applications to pattern classification, pattern recognition, risk analysis, risk management, diagnosis, forecasting, and multicriteria decision making. She was involved and had roles of responsibility in several national and international research projects, conferences, and scientific events.

Francesco Marcelloni (M’06) received the Laurea degree in electronics engineering and the Ph.D. degree in computer engineering from University of Pisa, Pisa, Italy, in 1991 and 1996, respectively. He is an Associate Professor with University of Pisa. He has cofounded the Computational Intelligence Group in the Department of Information Engineering, University of Pisa, in 2002. He is also the Founder and Head of the Competence Centre on MObile Value Added Services (MOVAS). He has coedited three volumes and four journal Special Issues and is the (co)author of a book and of more than 190 papers in international journals, books, and conference proceedings.
His main research interests include multiobjective evolutionary fuzzy systems, situation-aware service recommenders, energy-efficient data compression and aggregation in wireless sensor nodes, and monitoring systems for energy efficiency in buildings. Currently, he is an Associate Editor for Information Sciences (Elsevier) and Soft Computing (Springer) and is on the Editorial Board of four other international journals.

Public Integrity Auditing for Shared Dynamic Cloud Data with Group User Revocation

05/08/201902/07/2019 by admin

REAL-TIME BIG DATA ANALYTICAL ARCHITECTURE FOR REMOTE SENSING APPLICATION

ABSTRACT:

There is far more to real-time remote sensing Big Data than it seems at first, and extracting useful information efficiently poses major computational challenges in analyzing, aggregating, and storing remotely collected data. In view of these factors, there is a need for a system architecture that supports both real-time and offline data processing. In this paper, we propose a real-time Big Data analytical architecture for remote sensing satellite applications.

The proposed architecture comprises three main units:

1) Remote sensing Big Data acquisition unit (RSDU);

2) Data processing unit (DPU); and

3) Data analysis decision unit (DADU).

First, the RSDU acquires data from the satellite and sends these data to the Base Station, where initial processing takes place. Second, the DPU plays a vital role in the architecture for efficient processing of real-time Big Data by providing filtration, load balancing, and parallel processing. Third, the DADU is the upper-layer unit of the proposed architecture, which is responsible for compilation, storage of the results, and generation of decisions based on the results received from the DPU.

INTRODUCTION:

EXISTING SYSTEM:

Existing methods are often inapplicable on standard computers, because it is not desirable or even possible to load the entire image into memory before doing any processing. In this situation, it is necessary to load only part of the image and process it before saving the result to the disk and proceeding to the next part. This corresponds to the concept of on-the-flow processing. Remote sensing processing can be seen as a chain of events or steps, where each step is generally independent of the following ones and generally focuses on a particular domain. For example, the image can be radiometrically corrected to compensate for atmospheric effects and indices computed, before an object extraction based on these indices takes place.

The typical processing chain will process the whole image for each step, returning the final result after everything is done. For some processing chains, iterations between the different steps are required to find the correct set of parameters. Due to the variability of satellite images and the variety of the tasks that need to be performed, fully automated tasks are rare. Humans are still an important part of the loop. These concepts are linked in the sense that both rely on the ability to process only one part of the data.

In the case of simple algorithms, this is quite easy: the input is just split into different non-overlapping pieces that are processed one by one. But most algorithms do consider the neighborhood of each pixel. As a consequence, in most cases, the data will have to be split into partially overlapping pieces. The objective is to obtain the same result as if the original algorithm had processed the whole image in one go. Depending on the algorithm, this is unfortunately not always possible.
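The overlapping split described above can be sketched as follows. This is a minimal illustration, not the method of any particular toolkit: the 3x3 neighbourhood sum, the function names `box_sum` and `box_sum_tiled`, and the strip sizes are all assumptions chosen to keep the example small. Each strip is processed together with one extra row above and below (the overlap), and only the core rows are kept, so the tiled result matches the whole-image result.

```python
def box_sum(image):
    """3x3 neighbourhood sum; pixels outside the image count as zero."""
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            out[r][c] = sum(
                image[rr][cc]
                for rr in range(max(0, r - 1), min(height, r + 2))
                for cc in range(max(0, c - 1), min(width, c + 2))
            )
    return out


def box_sum_tiled(image, strip_rows=2, halo=1):
    """Apply box_sum strip by strip, loading `halo` extra rows on each side."""
    height = len(image)
    result = []
    for start in range(0, height, strip_rows):
        stop = min(start + strip_rows, height)
        lo = max(0, start - halo)       # first row actually loaded
        hi = min(height, stop + halo)   # one past the last row loaded
        piece = box_sum(image[lo:hi])   # process the strip plus its overlap
        # Keep only the core rows; the overlap rows are recomputed elsewhere.
        result.extend(piece[start - lo:start - lo + (stop - start)])
    return result


image = [[r * 4 + c for c in range(4)] for r in range(5)]
print(box_sum_tiled(image) == box_sum(image))
```

Setting `halo=0` reproduces the non-overlapping split, which gives wrong values along strip borders for this filter. A filter with a larger neighbourhood would need a proportionally larger overlap.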

DISADVANTAGES:

A reader that loads the image, or part of the image, into memory from the file on disk;

A filter that carries out local processing that does not require access to neighboring pixels (a simple threshold, for example); the processing can happen on CPU or GPU;

A filter that requires the values of neighboring pixels to compute the value of a given pixel (a convolution filter is a typical example); the processing can happen on CPU or GPU;

A writer that outputs the resulting in-memory image into a file on disk; note that the file could be written in several steps. This example illustrates how it is possible to compute only part of the image through the whole pipeline, incurring only minimal computation overhead.
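The reader, filter, and writer components listed above can be chained so that only one part of the image is in memory at a time. The following is a minimal sketch under stated assumptions: in-memory lists stand in for the files on disk, and the function names `read_strips`, `threshold`, and `write_strips` are hypothetical. Only the local (per-pixel) filter is shown here.

```python
def read_strips(source_rows, rows_per_strip):
    """Reader: yield the image a few rows at a time instead of all at once."""
    for start in range(0, len(source_rows), rows_per_strip):
        yield source_rows[start:start + rows_per_strip]


def threshold(strips, cutoff):
    """Local filter: each output pixel depends only on the same input pixel."""
    for strip in strips:
        yield [[1 if pixel >= cutoff else 0 for pixel in row] for row in strip]


def write_strips(strips, sink):
    """Writer: append each processed strip to the output as soon as it is ready."""
    for strip in strips:
        sink.extend(strip)
    return sink


# Stand-in for an image file: three rows of two pixels each.
image = [[10, 200], [90, 130], [255, 0]]
output = write_strips(threshold(read_strips(image, 2), 128), [])
print(output)
```

Because the stages are generators, the writer pulls one strip at a time through the whole chain, so the output is written in several steps and no stage ever holds the full image.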

PROPOSED SYSTEM:

We present a remote sensing Big Data analytical architecture, which is used to analyze real-time as well as offline data. First, the data are remotely preprocessed so that they are readable by the machines. Afterward, this useful information is transmitted to the Earth Base Station for further data processing. The Earth Base Station performs two types of processing: processing of real-time data and processing of offline data. Offline data are transmitted to an offline data-storage device, which allows later use of the data, whereas real-time data are transmitted directly to the filtration and load balancer server, where a filtration algorithm extracts the useful information from the Big Data.

The load balancer, in turn, balances the processing power by distributing the real-time data equally among the servers. The filtration and load-balancing server not only filters the data and balances the load, but also enhances the system efficiency. The filtered data are then processed by the parallel servers and sent to the data aggregation unit (if required, the processed data can be stored in the result storage device) for comparison purposes by the decision and analyzing server. The proposed architecture accepts remote access sensory data as well as direct access network data (e.g., GPRS, 3G, xDSL, or WAN). The proposed architecture and the algorithms are implemented and applied to remote sensing earth observatory data.

The proposed architecture is capable of dividing, load balancing, and parallel processing of only the useful data. Thus, it efficiently analyzes real-time remote sensing Big Data using an earth observatory system. Furthermore, the proposed architecture is capable of storing incoming raw data so that offline analysis can be performed on large stored dumps when required. Finally, a detailed analysis of remotely sensed earth observatory Big Data for land and sea areas is provided using .NET. In addition, various algorithms are proposed for each level of RSDU, DPU, and DADU to detect land as well as sea areas, in order to illustrate the working of the architecture.

ADVANTAGES:

The proposed architecture processes high-speed, large volumes of real-time remote sensing image data. It works on both the DPU and the DADU by taking data from a medical application.

To evaluate our architecture for offline as well as online traffic, we perform a simple analysis of remote sensing earth observatory data. We assume that the data are big in nature and difficult for a single server to handle.

The data arrive continuously from a satellite at high speed. Hence, special algorithms are needed to process, analyze, and make decisions from that Big Data. In this section, we analyze remote sensing data to find land, sea, or ice areas.

We have used the proposed architecture to perform the analysis and have proposed an algorithm for handling, processing, analyzing, and making decisions on remote sensing Big Data images.

HARDWARE & SOFTWARE REQUIREMENTS:

HARDWARE REQUIREMENT:

- Processor – Pentium IV
- Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Key Board – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA

SOFTWARE REQUIREMENTS:

Operating System : Windows XP or Win7
Front End : JAVA JDK 1.7
Back End : MySQL Server
Server : Apache Tomcat Server
Script : JSP Script
Document : MS-Office 2007

ARCHITECTURE DIAGRAM

MODULES:

DATA ANALYSIS DECISION UNIT (DADU):

DATA PROCESSING UNIT (DPU):

REMOTE SENSING APPLICATION RSDU:

FINDINGS AND DISCUSSION:

ALGORITHM DESIGN AND TESTING:

MODULES DESCRIPTION:

DATA PROCESSING UNIT (DPU):

In the data processing unit (DPU), the filtration and load balancer server (FLBS) has two basic responsibilities: filtration of data and load balancing of processing power. Filtration identifies the data that are useful for analysis: it lets only useful information through, whereas the rest of the data are blocked and discarded. This enhances the performance of the whole proposed system. The load-balancing part of the server divides the whole filtered data into parts and assigns them to various processing servers. The filtration and load-balancing algorithm varies from analysis to analysis; e.g., if only sea wave and temperature data need to be analyzed, the measurements of these data are filtered out and segmented into parts.
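The two responsibilities described above, filtration and load balancing, can be sketched as follows. This is a hedged illustration only: the record layout (a dictionary with a `kind` field), the kind labels, and the function names `filter_records` and `split_evenly` are hypothetical, and round-robin assignment is just one simple way to distribute the filtered data equally.

```python
def filter_records(records, wanted_kinds):
    """Filtration: keep only the measurements needed for this analysis."""
    return [record for record in records if record["kind"] in wanted_kinds]


def split_evenly(items, n_servers):
    """Load balancing: round-robin assignment, so segment sizes differ by at most one."""
    segments = [[] for _ in range(n_servers)]
    for index, item in enumerate(items):
        segments[index % n_servers].append(item)
    return segments


# Hypothetical incoming measurements; only sea wave and temperature data are wanted.
records = [
    {"kind": "sea_wave", "value": 1.2},
    {"kind": "land_cover", "value": 7.0},
    {"kind": "temperature", "value": 18.5},
    {"kind": "sea_wave", "value": 0.9},
    {"kind": "cloud", "value": 0.3},
]
useful = filter_records(records, {"sea_wave", "temperature"})
segments = split_evenly(useful, 2)
print([len(segment) for segment in segments])
```

Since the filtration criteria change from analysis to analysis, passing the wanted kinds as a parameter keeps the same balancing code reusable.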

Each processing server has its own algorithm implementation for processing the incoming segment of data from FLBS. Each processing server performs statistical calculations, measurements, and other mathematical or logical tasks to generate intermediate results for each segment of data. Since these servers perform tasks independently and in parallel, the performance of the proposed system is dramatically enhanced, and the results for each segment are generated in real time. The results generated by each server are then sent to the aggregation server for compilation, organization, and storage for further processing.
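The independent processing of segments and the later combination of partial results can be sketched as follows. This is a minimal illustration under stated assumptions: a thread pool stands in for the separate processing servers, the statistics chosen (count, total, minimum, maximum) are hypothetical examples of intermediate results, and every segment is assumed to be non-empty.

```python
from concurrent.futures import ThreadPoolExecutor


def process_segment(segment):
    """One processing server: compute intermediate statistics for its segment."""
    return {
        "count": len(segment),
        "total": sum(segment),
        "low": min(segment),
        "high": max(segment),
    }


def aggregate(partials):
    """Aggregation server: combine the partial results into one summary."""
    count = sum(p["count"] for p in partials)
    total = sum(p["total"] for p in partials)
    return {
        "count": count,
        "mean": total / count,
        "low": min(p["low"] for p in partials),
        "high": max(p["high"] for p in partials),
    }


def run(segments, workers=3):
    """Process all segments in parallel, then aggregate the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(process_segment, segments))
    return aggregate(partials)


print(run([[1, 2, 3], [4, 5], [6]]))
```

Sending counts and totals rather than per-segment means lets the aggregation step compute an exact overall mean even when the segments have different sizes.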

DATA ANALYSIS DECISION UNIT (DADU):

DADU contains three major portions: the aggregation and compilation server, the results storage server(s), and the decision-making server. When the results are ready for compilation, the processing servers in the DPU send their partial results to the aggregation and compilation server, since these partial results are not yet in organized and compiled form. The related results therefore need to be aggregated and organized into a proper form for further processing and storage. In the proposed architecture, the aggregation and compilation server is supported by various algorithms that compile, organize, store, and transmit the results. Again, the algorithm varies from requirement to requirement and depends on the analysis needs. The aggregation server stores the compiled and organized results in the results storage so that any server can use and process them at any time.

The aggregation server also sends a copy of the result to the decision-making server, which processes it to make decisions. The decision-making server is supported by decision algorithms, which query the result for different things and then make various decisions (e.g., in our analysis, we analyze land, sea, and ice, whereas other findings such as fires, storms, tsunamis, and earthquakes can also be detected). The decision algorithm must be strong and correct enough to efficiently produce results that uncover hidden information and support decisions. The decision part of the architecture is significant, since any small error in decision-making can degrade the efficiency of the whole analysis. Finally, DADU displays or broadcasts the decisions so that any application can use them in real time. The applications can be any business software, general-purpose community software, or other social networks that need those findings (i.e., decision-making).

REMOTE SENSING APPLICATION RSDU:

Remote sensing promotes the expansion of the earth observatory system as a cost-effective parallel data acquisition system that satisfies specific computational requirements. The Earth and Space Science Society originally approved this solution as the standard for parallel processing in this context. As Big Data acquisition improved, however, it was soon recognized that traditional data processing technologies could not provide sufficient power for processing such data. Parallel processing was therefore required to analyze the massive volume of Big Data efficiently. For that reason, the proposed RSDU is introduced in the remote sensing Big Data architecture; it gathers data from various satellites around the globe. The received raw data may be distorted by scattering and absorption by various atmospheric gases and dust particles. We assume that the satellite can correct the erroneous data.

To convert the raw data into image format and to enable effective data analysis, the remote sensing satellite preprocesses the data in many situations to integrate data from different sources, which not only decreases storage cost but also improves analysis accuracy. The data must be corrected by different methods to remove distortions caused by the motion of the platform relative to the earth, platform attitude, earth curvature, nonuniformity of illumination, variations in sensor characteristics, etc. The data are then transmitted to the Earth Base Station over a direct communication link for further processing. We divide the data processing procedure into two steps: real-time Big Data processing and offline Big Data processing. In offline data processing, the Earth Base Station transmits the data to the data center for storage, and the data are then used for future analyses. In real-time data processing, however, the data are transmitted directly to the filtration and load balancer server (FLBS), since storing incoming real-time data degrades the performance of real-time processing.

FINDINGS AND DISCUSSION:

Preprocessed and formatted data from satellite contains all or some of the following parts depending on the product.

1) Main product header (MPH): It includes the product's basic information, i.e., id, measurement and sensing time, orbit information, etc.

2) Special products head (SPH): It contains information specific to each product or product group, i.e., number of data set descriptors (DSD), directory of remaining data sets in the file, etc.

3) Annotation data sets (ADS): It contains quality information, time-tagged processing parameters, geolocation tie points, solar angles, etc.

4) Global annotation data sets (GADs): It contains scaling factors, offsets, calibration information, etc.

5) Measurement data set (MDS): It contains measurements or graphical parameters calculated from the measurements, including the quality flag and the time tag of each measurement. The image data are also stored in this part and are the main element of our analysis.

The MPH and SPH data are in ASCII format, whereas all the other data sets are in binary format. MDS, ADS, and GADs each consist of a sequence of records, with one or more data fields in each record. In our case, the MDS contains a number of records, and each record contains a number of fields. Each record of the MDS corresponds to one row of the satellite image, which is our main focus during analysis.
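To illustrate how a binary MDS made of fixed-size records could be read row by row, here is a minimal sketch. The record layout (an 8-byte time tag, a 1-byte quality flag, and a row of unsigned 16-bit pixels, big-endian) and the names `record_format` and `parse_mds` are assumptions made for the example; they are not taken from any product specification.

```python
import struct


def record_format(width):
    # Hypothetical layout, for illustration only:
    # 8-byte float time tag, 1-byte quality flag, `width` unsigned 16-bit pixels.
    return ">dB%dH" % width


def parse_mds(buffer, width):
    """Split a binary MDS buffer into records; each record is one image row."""
    fmt = record_format(width)
    size = struct.calcsize(fmt)
    rows = []
    for offset in range(0, len(buffer) - size + 1, size):
        time_tag, quality, *pixels = struct.unpack_from(fmt, buffer, offset)
        rows.append({"time": time_tag, "quality": quality, "pixels": pixels})
    return rows
```

Each returned dictionary corresponds to one image row, which is the unit the later processing steps operate on.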

ALGORITHM DESIGN AND TESTING:

Our algorithms are designed to process large volumes of high-speed, real-time remote sensing image data using the proposed architecture. They work on both the DPU and DADU, taking satellite data as input to identify land and sea areas in the data set. The set contains four simple algorithms, i.e., algorithm I, algorithm II, algorithm III, and algorithm IV, which run on the filtration and load balancer, the processing servers, the aggregation server, and the decision-making server, respectively.

Algorithm I, i.e., the filtration and load balancer algorithm (FLBA), runs on the filtration and load balancer to keep only the required data and discard all other information. It also provides load balancing by dividing the data into fixed-size blocks and sending them to the processing servers, i.e., one or more distinct blocks to each server. This filtration, division, and load balancing speeds up processing by ignoring unnecessary data and enabling parallel processing.

Algorithm II, i.e., the processing and calculation algorithm (PCA), processes the filtered data and is implemented on each processing server. It calculates various parameters that are used in the decision-making process. The calculated parameters are then sent to the aggregation server for further processing.

Algorithm III, i.e., the aggregation and compilation algorithm (ACA), stores, compiles, and organizes the results, which the decision-making server can use for land and sea area detection.

Algorithm IV, i.e., the decision-making algorithm (DMA), identifies land and sea areas by comparing the parameter results from the aggregation server with threshold values.
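The threshold comparison performed by the DMA can be sketched as follows. The threshold values, the choice of parameters (mean and SD), and the rule that land blocks exceed both thresholds are assumptions made purely for illustration; the source does not state the actual values or rule.

```python
# Hypothetical thresholds; the source does not give concrete values.
MEAN_THRESHOLD = 100.0
SD_THRESHOLD = 15.0


def classify_block(params, mean_threshold=MEAN_THRESHOLD, sd_threshold=SD_THRESHOLD):
    """Label one block 'land' or 'sea' by comparing its parameters with thresholds.

    Assumption for illustration only: a land block has both a higher mean
    and a higher standard deviation than a sea block.
    """
    above_mean = params["mean"] > mean_threshold
    above_sd = params["sd"] > sd_threshold
    return "land" if (above_mean and above_sd) else "sea"
```

In practice the decision rule would be chosen per analysis, since the text notes that the algorithms vary with the analysis requirement.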

IMPLEMENTATION:

Big Data, like cloud computing, covers diverse technologies. The input of Big Data comes from social networks (Facebook, Twitter, LinkedIn, etc.), Web servers, satellite imagery, sensory data, banking transactions, etc. Although Big Data architecture has emerged only very recently in scientific applications, numerous efforts toward Big Data analytics architecture can already be found in the literature. Among these, we propose a remote sensing Big Data architecture to analyze Big Data efficiently, as shown in Fig. 1. Fig. 1 delineates n satellites that obtain earth observatory Big Data images with sensors or conventional cameras, through which scenes are recorded using radiation. Special techniques are applied to process and interpret remote sensing imagery for the purpose of producing conventional maps, thematic maps, resource surveys, etc. We have divided the remote sensing Big Data architecture into three units: RSDU, DPU, and DADU.

In healthcare scenarios, medical practitioners gather massive volumes of data about patients, medical history, medications, and other details. These data are accumulated in drug-manufacturing companies. The nature of these data is very complex, and practitioners are sometimes unable to relate them to other information, which results in important information being missed. Employing advanced analytic techniques to organize and extract useful information from Big Data enables personalized medication, and advanced Big Data analytic techniques give insight into the hereditary causes of disease.

ALGORITHMS:

Algorithm I (FLBA) takes the satellite data or product, filters it, divides it into segments, and performs load balancing.
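A minimal sketch of the filter, divide, and load-balance steps is given below. The record representation (dictionaries), the round-robin assignment policy, and all function names are assumptions for illustration; the source only states that data are filtered, split into fixed-size blocks, and sent to the processing servers.

```python
def filter_records(records, wanted_fields):
    """Keep only the fields needed for the analysis; discard everything else."""
    return [{k: r[k] for k in wanted_fields if k in r} for r in records]


def split_into_blocks(items, block_size):
    """Divide the filtered data into fixed-size blocks (the last may be shorter)."""
    return [items[i:i + block_size] for i in range(0, len(items), block_size)]


def assign_blocks(blocks, n_servers):
    """Assign blocks to servers round-robin (an assumed balancing policy)."""
    assignment = {server: [] for server in range(n_servers)}
    for i, block in enumerate(blocks):
        assignment[i % n_servers].append(block)
    return assignment
```

With round-robin assignment each server receives one or more distinct blocks, and no server holds more than one block above any other.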

The processing algorithm (PCA) calculates results for different parameters for each incoming block and sends them to the next level. In step 1, it calculates the mean, the standard deviation (SD), the absolute difference, and the number of values that are greater than the maximum threshold. In the next step, the results are transmitted to the aggregation server.
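The per-block calculation in step 1 can be sketched as follows. The source does not define "absolute difference" precisely; this sketch assumes it means the mean absolute deviation from the block mean, and it uses the population standard deviation. The function name `process_block` is hypothetical.

```python
import statistics


def process_block(values, max_threshold):
    """Compute the per-block parameters that are passed on to aggregation."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population SD (an assumption)
    # Assumed reading of "absolute difference": mean absolute deviation.
    abs_diff = sum(abs(v - mean) for v in values) / len(values)
    above = sum(1 for v in values if v > max_threshold)
    return {"mean": mean, "sd": sd, "abs_diff": abs_diff, "above_threshold": above}
```

The returned dictionary is the intermediate result that each processing server would forward to the aggregation server.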

ACA collects the results from each processing server for each block Bi and then combines, organizes, and stores these results in an RDBMS database.
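As a rough illustration of the collect, organize, and store steps, the sketch below uses an in-memory SQLite database as a stand-in for the RDBMS (the project itself lists MySQL as the back end). The table layout and function names are assumptions.

```python
import sqlite3


def open_store():
    """Create an in-memory results table (stand-in for the RDBMS)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE results ("
        "block_id INTEGER PRIMARY KEY, server TEXT, mean REAL, sd REAL)"
    )
    return conn


def aggregate(conn, partial_results):
    """Store partial results given as (block_id, server, mean, sd) tuples.

    Results may arrive in any order because servers work in parallel.
    """
    conn.executemany(
        "INSERT OR REPLACE INTO results VALUES (?, ?, ?, ?)", partial_results
    )
    conn.commit()


def compiled_results(conn):
    """Return the results organized by block id."""
    query = "SELECT block_id, server, mean, sd FROM results ORDER BY block_id"
    return conn.execute(query).fetchall()
```

Ordering by block id restores the original block sequence regardless of which server finished first.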

CONCLUSION AND FUTURE:

In this paper, we proposed an architecture for real-time Big Data analysis for remote sensing applications. The architecture efficiently processes and analyzes real-time and offline remote sensing Big Data for decision-making. It is composed of three major units: 1) RSDU; 2) DPU; and 3) DADU. These units implement algorithms for each level of the architecture depending on the required analysis. The architecture is generic (application independent) and can be used for any type of remote sensing Big Data analysis. Furthermore, it filters, divides, and processes in parallel only the useful information, discarding all other data. These capabilities make it a better choice for real-time remote sensing Big Data analysis.

The algorithms proposed in this paper for each unit and subunit are used to analyze remote sensing data sets, which helps in better understanding land and sea areas. The proposed architecture allows researchers and organizations to perform any type of remote sensing Big Data analysis by developing algorithms for each level of the architecture according to their analysis requirements. For future work, we plan to extend the proposed architecture to make it compatible with Big Data analysis for all applications, e.g., sensors and social networking. We also plan to use the proposed architecture to perform complex analysis on earth observatory data for decision-making in real time, such as earthquake prediction, tsunami prediction, and fire detection.

REFERENCES:

[1] D. Agrawal, S. Das, and A. E. Abbadi, “Big Data and cloud computing: Current state and future opportunities,” in Proc. Int. Conf. Extending Database Technol. (EDBT), 2011, pp. 530–533.

[2] J. Cohen, B. Dolan, M. Dunlap, J. M. Hellerstein, and C. Welton, “Mad skills: New analysis practices for Big Data,” PVLDB, vol. 2, no. 2, pp. 1481–1492, 2009.

[3] J. Dean and S. Ghemawat, “Mapreduce: Simplified data processing on large clusters,” Commun. ACM, vol. 51, no. 1, pp. 107–113, 2008.

[4] H. Herodotou et al., “Starfish: A self-tuning system for Big Data analytics,” in Proc. 5th Int. Conf. Innovative Data Syst. Res. (CIDR), 2011, pp. 261–272.

[5] K. Michael and K. W. Miller, “Big Data: New opportunities and new challenges [guest editors’ introduction],” IEEE Comput., vol. 46, no. 6, pp. 22–24, Jun. 2013.

[6] C. Eaton, D. Deroos, T. Deutsch, G. Lapis, and P. C. Zikopoulos, Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data. New York, NY, USA: Mc Graw-Hill, 2012.

[7] R. D. Schneider, Hadoop for Dummies Special Edition. Hoboken, NJ, USA: Wiley, 2012.

Proof of Ownership In Deduplicated Storage With Mobile Device Efficiency

05/08/201902/07/2019 by admin

Cloud storage such as Dropbox and Bitcasa is one of the most popular cloud services. Currently, with the prevalence of mobile cloud computing, users can even collaboratively edit the newest version of documents and synchronize the newest files on their smart mobile devices. A remarkable feature of current cloud storage is its virtually infinite storage. To support unlimited storage, the cloud storage provider uses data deduplication techniques to reduce the data to be stored and therefore reduce the storage expense. Moreover, the use of data deduplication also helps significantly reduce the need for bandwidth and therefore improves the user experience. Nevertheless, in spite of the above benefits, data deduplication has inherent security weaknesses. The most severe is that an adversary may download a file without authorization using only the file hash. In this article we first review the previous solutions and identify their performance weaknesses. Then we propose an alternative design that achieves cloud server efficiency and especially mobile device efficiency.

1.2 INTRODUCTION

Mobile devices have become prevalent in recent years, and mobile computing has been a growing trend. Meanwhile, cloud computing is definitely the biggest revolution in recent decades. Many tasks, such as document editing and file backup, have been shifted from end devices to the cloud. Therefore, with the convergence of mobile computing and cloud computing, along with the recent development of the 5G communication standard that establishes more reliable and faster communication channels, mobile cloud computing (MCC) could be a rapidly growing field that deserves to be investigated and explored.

Deduplicated Storage in Mobile Cloud Computing

Among cloud services, cloud storage with the capability of file backup and synchronization could be the most popular service that enables mobile users to access their files everywhere. Dropbox (https://www.dropbox.com/) and Bitcasa (https://www.bitcasa.com/) are two examples that offer easy-to-use file backup and synchronization services. Several remarkable features of such cloud storage can be identified. It has high availability, which means that the user’s data will be replicated over cloud servers worldwide and is guaranteed to be accessible whenever the user has the need. It has the flexibility of a pay-as-you-go model, which means that the user can gain additional storage immediately whenever the user is willing to make an extra payment. The most important feature is that it has virtually infinite storage space, which means that the user can back up whatever he/she wants to upload to the cloud. A renowned example is Bitcasa, which offers “unlimited storage” that enables the user to upload virtually everything. Offering infinite storage space might cause a severe economic burden on the cloud storage provider.

However, a technique called data deduplication helps significantly reduce the cost of storage. Data deduplication has been widely implemented by cloud storage providers including Dropbox and Bitcasa. According to the report in [8] (http://www.snia.org), the use of data deduplication in business applications may reduce the data to be stored and thus achieve disk and bandwidth savings of more than 90 percent. The power of data deduplication comes from avoiding storing the same file multiple times. The storage saving is especially obvious when popular multimedia contents such as music and movies are considered. Replicated content creates an additional storage need the first time it is uploaded, but creates no extra storage need for subsequent uploads. In addition to the storage saving, if the data content is already in storage, then the replicated content does not need to be transmitted, achieving bandwidth saving. Data deduplication can be categorized into two types depending on where the deduplication takes place: server (cloud) side deduplication and client (user) side deduplication. Server side deduplication is simple: the server, after receiving the file, checks whether it already has a copy in storage. The server discards the received file if it does, or creates a new file in the storage if it does not.
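Server side deduplication as described above can be sketched in a few lines. The class name `DedupServer`, the use of SHA-256 as the file hash, and the `bytes_received` counter are assumptions made for the example.

```python
import hashlib


class DedupServer:
    """Toy server side deduplication: every upload is fully received first."""

    def __init__(self):
        self.files = {}           # file hash -> content, stored once
        self.owners = {}          # file hash -> set of users
        self.bytes_received = 0   # bandwidth actually consumed

    def upload(self, user, content):
        self.bytes_received += len(content)
        digest = hashlib.sha256(content).hexdigest()
        if digest not in self.files:
            self.files[digest] = content   # first copy: create a new file
        # Otherwise the received copy is simply discarded.
        self.owners.setdefault(digest, set()).add(user)
        return digest
```

Uploading the same content twice stores it once, yet `bytes_received` still counts both transfers, which is why this variant saves storage but not bandwidth.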

We can see that server side deduplication does not produce bandwidth saving because the server performs the deduplication after the file has been received. On the other hand, client side deduplication adopts a more aggressive method: the user calculates and sends the hash of the file before uploading the file. Upon receiving the hash, the server checks whether the hash is already in storage. If so, the user is asked to send nothing and the server associates the user with the existing file. Otherwise, the user is asked to upload the file. An illustrative example is shown in Fig. 2, where user 1 first uploads files F1 and F2 in Fig. 2a. Then the cloud knows from the hashes h(F1) and h(F3) sent by user 2 that a copy of F1 is already in storage, and sends user 2 a positive acknowledgment for F1 and a negative acknowledgment for F3. User 2, according to the acknowledgments, sends only F3, saving the transmission of F1. Public cloud storage services (e.g. Dropbox and Bitcasa) are more likely to adopt client side deduplication because of its storage and bandwidth savings. In particular, in addition to the reduced storage requirement, client side deduplication can also reduce the need for file transmission, reducing the waiting time for users and the energy consumption for the server. We particularly mention that even with the increased bandwidth of the coming 5G communication standard, the data rate of wireless links is still not comparable to that of wired links. Thus, if we consider mobile devices accessing cloud storage services, client side deduplication becomes an inevitable technique for MCC applications.
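The hash-first exchange of client side deduplication can be sketched as below, following the Fig. 2 scenario in which user 1 uploads F1 and F2 and user 2 then holds F1 and F3. The class and function names, and SHA-256 as the hash h, are assumptions for illustration.

```python
import hashlib


def h(content):
    return hashlib.sha256(content).hexdigest()


class Cloud:
    def __init__(self):
        self.files = {}          # file hash -> content
        self.owners = {}         # file hash -> set of users
        self.bytes_received = 0  # file bytes actually transmitted

    def check(self, user, digest):
        """Positive acknowledgment (True) if the file is already stored."""
        if digest in self.files:
            self.owners[digest].add(user)
            return True
        return False

    def upload(self, user, content):
        self.bytes_received += len(content)
        digest = h(content)
        self.files[digest] = content
        self.owners[digest] = {user}


def client_store(cloud, user, content):
    """Send the hash first; transmit the file only on a negative acknowledgment."""
    if not cloud.check(user, h(content)):
        cloud.upload(user, content)
```

When user 2 stores F1 and F3, only F3 is transmitted, since the cloud already holds F1 and merely links user 2 to it.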

1.3 LITERATURE SURVEY

DUPLESS: SERVER-AIDED ENCRYPTION FOR DEDUPLICATED STORAGE

AUTHOR: M. Bellare, S. Keelveedhi, and T. Ristenpart

PUBLISH: Proc. 22nd USENIX Conf. Sec. Symp., 2013, pp. 179–194.

EXPLANATION:

Cloud storage service providers such as Dropbox, Mozy, and others perform deduplication to save space by only storing one copy of each file uploaded. Should clients conventionally encrypt their files, however, savings are lost. Message-locked encryption (the most prominent manifestation of which is convergent encryption) resolves this tension. However it is inherently subject to brute-force attacks that can recover files falling into a known set. We propose an architecture that provides secure deduplicated storage resisting brute-force attacks, and realize it in a system called DupLESS. In DupLESS, clients encrypt under message-based keys obtained from a key-server via an oblivious PRF protocol. It enables clients to store encrypted data with an existing service, have the service perform deduplication on their behalf, and yet achieves strong confidentiality guarantees. We show that encryption for deduplicated storage can achieve performance and space savings close to that of using the storage service with plaintext data.
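The brute-force weakness of message-locked encryption mentioned above can be illustrated with a small sketch. It assumes the convergent key is the SHA-256 hash of the message and the deduplication tag is the hash of that key; these choices and the function names are illustrative and do not reproduce the DupLESS construction.

```python
import hashlib


def convergent_key(message):
    """The key is derived from the message itself, so equal files share a key."""
    return hashlib.sha256(message).digest()


def dedup_tag(message):
    """Deterministic tag the storage service uses to detect duplicates."""
    return hashlib.sha256(convergent_key(message)).hexdigest()


def brute_force(observed_tag, candidates):
    """Recover the message if it falls into a known candidate set."""
    for message in candidates:
        if dedup_tag(message) == observed_tag:
            return message
    return None
```

Because everything is derived deterministically from the message, anyone who can enumerate the likely messages can test each guess offline; this is the tension that a key-server-assisted design aims to resolve.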

FAST AND SECURE LAPTOP BACKUPS WITH ENCRYPTED DE-DUPLICATION

AUTHOR: P. Anderson and L. Zhang

PUBLISH: Proc. 24th Int. Conf. Large Installation Syst. Admin., 2010, pp. 29–40.

EXPLANATION:

Many people now store large quantities of personal and corporate data on laptops or home computers. These often have poor or intermittent connectivity, and are vulnerable to theft or hardware failure. Conventional backup solutions are not well suited to this environment, and backup regimes are frequently inadequate. This paper describes an algorithm which takes advantage of the data which is common between users to increase the speed of backups, and reduce the storage requirements. This algorithm supports client-end per-user encryption which is necessary for confidential personal data. It also supports a unique feature which allows immediate detection of common subtrees, avoiding the need to query the backup system for every file. We describe a prototype implementation of this algorithm for Apple OS X, and present an analysis of the potential effectiveness, using real data obtained from a set of typical users. Finally, we discuss the use of this prototype in conjunction with remote cloud storage, and present an analysis of the typical cost savings.

SECURE DEDUPLICATION WITH EFFICIENT AND RELIABLE CONVERGENT KEY MANAGEMENT

AUTHOR: J. Li, X. Chen, M. Li, J. Li, P. Lee, and W. Lou

PUBLISH: IEEE Trans. Parallel Distrib. Syst., http://oi.ieeecomputersociety.org/10.1109/TPDS.2013.284, 2013

EXPLANATION:

Data deduplication is a technique for eliminating duplicate copies of data, and has been widely used in cloud storage to reduce storage space and upload bandwidth. Promising as it is, an arising challenge is to perform secure deduplication in cloud storage. Although convergent encryption has been extensively adopted for secure deduplication, a critical issue of making convergent encryption practical is to efficiently and reliably manage a huge number of convergent keys. This paper makes the first attempt to formally address the problem of achieving efficient and reliable key management in secure deduplication. We first introduce a baseline approach in which each user holds an independent master key for encrypting the convergent keys and outsourcing them to the cloud. However, such a baseline key management scheme generates an enormous number of keys with the increasing number of users and requires users to dedicatedly protect the master keys. To this end, we propose Dekey, a new construction in which users do not need to manage any keys on their own but instead securely distribute the convergent key shares across multiple servers. Security analysis demonstrates that Dekey is secure in terms of the definitions specified in the proposed security model. As a proof of concept, we implement Dekey using the Ramp secret sharing scheme and demonstrate that Dekey incurs limited overhead in realistic environments.

CHAPTER 2

2.0 SYSTEM ANALYSIS

2.1 EXISTING SYSTEM:

Data deduplication is one of the important data compression techniques for eliminating duplicate copies of repeating data, and has been widely used in cloud storage to reduce the amount of storage space and save bandwidth. Previous deduplication systems cannot support differential authorization duplicate check, which is important in many applications. In an authorized deduplication system, each user is issued a set of privileges during system initialization. Each file uploaded to the cloud is also bound to a set of privileges that specify which kinds of users are allowed to perform the duplicate check and access the file.

Before submitting a duplicate check request for a file, the user needs to take this file and his own privileges as inputs. The user is able to find a duplicate for this file if and only if there is a copy of this file and a matched privilege stored in the cloud. Traditional deduplication systems based on convergent encryption, although providing confidentiality to some extent, do not support the duplicate check with differential privileges. In other words, no differential privileges have been considered in deduplication based on the convergent encryption technique. Realizing both deduplication and differential authorization duplicate check at the same time therefore seems contradictory.
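The "copy of the file and a matched privilege" condition can be sketched with a token that depends on both inputs. Using an HMAC of the file hash under a per-privilege key is an assumption made for illustration, as are the class and function names; the text above does not specify how the check is realized.

```python
import hashlib
import hmac


def token(privilege_key, content):
    """Duplicate-check token bound to both the file and one privilege (assumed design)."""
    file_hash = hashlib.sha256(content).digest()
    return hmac.new(privilege_key, file_hash, hashlib.sha256).hexdigest()


class AuthorizedStore:
    def __init__(self):
        self.tokens = set()

    def store(self, content, privilege_keys):
        """Register the file under every privilege allowed to find it."""
        for key in privilege_keys:
            self.tokens.add(token(key, content))

    def is_duplicate(self, content, privilege_key):
        """True only if the same file was stored with a matching privilege."""
        return token(privilege_key, content) in self.tokens
```

A user holding a different privilege key gets no match even for an identical file, which is the differential behavior the paragraph describes.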

2.1.1 DISADVANTAGES:

Deduplication systems cannot support differential authorization duplicate check.
One critical challenge of cloud storage services is the management of the ever increasing volume of data.
Users’ sensitive data are susceptible to both insider and outsider attacks.
Deduplication is sometimes impossible.

2.2 PROPOSED SYSTEM:

We propose an alternative design that strikes a balance between server side efficiency and user side efficiency. Before introducing the scheme’s details, we present two observations. First, the Merkle-tree-based POW schemes are I/O efficient at the server side because the Merkle tree root can be thought of as a compact summary of the file. Therefore, there is no need for the cloud to access the disk to retrieve the file. Second, the user side is computationally efficient in the three s-POW schemes because the user is required only to answer a few bits of the file. With these two observations, our design strategy is to use a probabilistic data structure for the compact summary of the file, in contrast to the deterministic data structure, the Merkle hash tree, in the POW schemes. The query challenge is also changed to random blocks, in contrast to the random bits in the s-POW schemes. An overview of the proposed POW scheme goes as follows.
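One way to picture this design strategy is a challenge-response over random blocks checked against a probabilistic summary. The sketch below assumes a Bloom filter as the probabilistic data structure, which is a choice made here for illustration; the block size, filter parameters, and all names are likewise hypothetical.

```python
import hashlib
import random

BLOCK = 4  # bytes per block; tiny on purpose for the demonstration


def blocks_of(data):
    return [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]


class BloomFilter:
    def __init__(self, m_bits=1024, k=4):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item):
        for i in range(self.k):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


def summarize(data):
    """Server: build the compact summary once, when the file is first stored."""
    bf = BloomFilter()
    for i, block in enumerate(blocks_of(data)):
        bf.add(i.to_bytes(4, "big") + block)
    return bf


def challenge(n_blocks, count, rng):
    """Server: pick random block indices to query."""
    return rng.sample(range(n_blocks), count)


def respond(data, indices):
    """User: answer with the requested blocks of the claimed file."""
    bs = blocks_of(data)
    return [bs[i] for i in indices]


def verify(bf, indices, answers):
    """Server: check the answers against the summary, without reading the file."""
    if len(indices) != len(answers):
        return False
    return all((i.to_bytes(4, "big") + b) in bf for i, b in zip(indices, answers))
```

A Bloom filter can return false positives, so acceptance is probabilistic; querying more blocks lowers the chance that a user without the file passes.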

2.2.1 ADVANTAGES:

The performance concerns of a POW scheme include the bandwidth requirement, the I/O overhead at both user and server sides, and the computation overhead at both sides. Of these, the I/O overhead is the least well known in POW design. More specifically, cloud storage usually has a storage hierarchy: the memory (primary storage) and disk (secondary storage).

The execution of a POW scheme might require the user and cloud to access the file stored in the disk multiple times. The server might also need to keep the verification object in either the memory or the disk to verify the user’s claim.

All of the above might result in a huge amount of I/O delay because of the access time gap between the memory and disk. In this article we focus only on the abuse of a file hash to gain the ownership of the file and aim to design a POW scheme with minimum performance overhead.

To prevent unauthorized access, a secure proof of ownership (POW) protocol is needed so that the user can prove ownership of the same file when a duplicate is found.
The overhead is minimal compared with normal convergent encryption and file upload operations.
Data confidentiality is maintained.
It is more secure than previously proposed techniques.

2.3 HARDWARE & SOFTWARE REQUIREMENTS:

2.3.1 HARDWARE REQUIREMENT:

- Processor – Pentium IV
- Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Key Board – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA

2.3.2 SOFTWARE REQUIREMENTS:

JAVA

Operating System : Windows XP or Win7
Front End : JAVA JDK 1.7
Back End : MYSQL Server
Server : Apache Tomcat Server
Script : JSP Script
Document : MS-Office 2007

CHAPTER 3

3.0 SYSTEM DESIGN:

Data Flow Diagram / Use Case Diagram / Flow Diagram:

The DFD is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on these data, and the output data generated by the system.

The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components. These components are the system process, the data used by the process, an external entity that interacts with the system and the information flows in the system.

DFD shows how the information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.

A DFD may be used to represent a system at any level of abstraction. A DFD may be partitioned into levels that represent increasing information flow and functional detail.

NOTATION:

SOURCE OR DESTINATION OF DATA:

External sources or destinations, which may be people or organizations or other entities

DATA STORE:

Here the data referenced by a process is stored and retrieved.

PROCESS:

People, procedures, or devices that produce data. The physical component is not identified.

DATA FLOW:

Data moves in a specific direction from an origin to a destination. The data flow is a “packet” of data.

MODELING RULES:

There are several common modeling rules when creating DFDs:

All processes must have at least one data flow in and one data flow out.
All processes should modify the incoming data, producing new forms of outgoing data.
Each data store must be involved with at least one data flow.
Each external entity must be involved with at least one data flow.
A data flow must be attached to at least one process.
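The structural rules above can be checked mechanically. The sketch below is a minimal checker under an assumed representation (a dictionary of node kinds and a list of source/target pairs); the rule that processes should modify their incoming data concerns content rather than structure and is not checked.

```python
def check_dfd(nodes, flows):
    """Return a list of rule violations.

    nodes: dict mapping name -> 'process', 'store', or 'entity'
    flows: list of (source, target) pairs
    """
    problems = []
    for name, kind in nodes.items():
        has_in = any(dst == name for _, dst in flows)
        has_out = any(src == name for src, _ in flows)
        if kind == "process" and not (has_in and has_out):
            problems.append("process '%s' needs one flow in and one flow out" % name)
        elif kind in ("store", "entity") and not (has_in or has_out):
            problems.append("%s '%s' has no data flow" % (kind, name))
    for src, dst in flows:
        if "process" not in (nodes.get(src), nodes.get(dst)):
            problems.append("flow %s -> %s is not attached to a process" % (src, dst))
    return problems
```

An empty list means the diagram satisfies the checked rules; each string otherwise names the offending node or flow.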

3.1 ARCHITECTURE DIAGRAM

3.2 DATAFLOW DIAGRAM

USER:

ADMIN:

UML DIAGRAMS:

3.2 USE CASE DIAGRAM:

3.3 CLASS DIAGRAM:

3.4 SEQUENCE DIAGRAM:

SENDER USER:

RECEIVER USER:

3.5 ACTIVITY DIAGRAM:

SENDER LOGIN:

RECEIVER LOGIN:

CHAPTER 4

4.0 IMPLEMENTATION:

MOBILE CLOUD COMPUTING:

Mobile Cloud Computing (MCC) is the combination of cloud computing, mobile computing, and wireless networks to bring rich computational resources to mobile users, network operators, and cloud computing providers. The ultimate goal of MCC is to enable the execution of rich mobile applications on a plethora of mobile devices, with a rich user experience. MCC provides business opportunities for mobile network operators as well as cloud providers. One definition describes MCC as follows: “A rich mobile computing technology that leverages unified elastic resources of varied clouds and network technologies toward unrestricted functionality, storage, and mobility to serve a multitude of mobile devices anywhere, anytime through the channel of Ethernet or Internet regardless of heterogeneous environments and platforms based on the pay-as-you-use principle.”

ARCHITECTURE:

MCC uses computational augmentation approaches by which resource-constrained mobile devices can utilize the computational resources of varied cloud-based resources. In MCC, there are four types of cloud-based resources, namely distant immobile clouds, proximate immobile computing entities, proximate mobile computing entities, and hybrid (a combination of the other three models). Giant clouds such as Amazon EC2 are in the distant immobile group, whereas cloudlets or surrogates are members of the proximate immobile computing entities. Smartphones, tablets, handheld devices, and wearable computing devices are part of the third group of cloud-based resources, the proximate mobile computing entities.

DIAGRAM:

In the MCC landscape, an amalgam of mobile computing, cloud computing, and communication networks (to augment smartphones) creates several complex challenges such as Mobile Computation Offloading, Seamless Connectivity, Long WAN Latency, Mobility Management, Context-Processing, Energy Constraint, Vendor/data Lock-in, Security and Privacy, Elasticity that hinder MCC success and adoption.

Although significant research and development in MCC is available in the literature, efforts in the following domains are still lacking:

Architectural issues: Reference architecture for heterogeneous MCC environment is a crucial requirement for unleashing the power of mobile computing towards unrestricted ubiquitous computing.
Energy-efficient transmission: MCC requires frequent transmissions between the cloud platform and mobile devices. Because of the stochastic nature of wireless networks, the transmission protocol should be carefully designed.
Context-awareness issues: Context-aware and socially-aware computing are inseparable traits of contemporary handheld computers. To achieve the vision of mobile computing among heterogeneous converged networks and computing devices, designing resource-efficient environment-aware applications is an essential need.
Live VM migration issues: Executing a resource-intensive mobile application via Virtual Machine (VM) migration-based application offloading involves encapsulating the application in a VM instance and migrating it to the cloud, which is a challenging task due to the additional overhead of deploying and managing VMs on mobile devices.
Mobile communication congestion issues: Mobile data traffic is growing tremendously because of ever-increasing user demand for cloud resources, which affects mobile network operators and calls for future efforts to enable smooth communication between mobile and cloud endpoints.
Trust, security, and privacy issues: Trust is an essential factor for the success of the burgeoning MCC paradigm.

PROOF OF OWNERSHIP:

An even more severe and direct security threat from using deduplicated cloud storage is that the adversary may gain ownership of files merely by eavesdropping on file hashes. A closer look at client side deduplication reveals that anyone in possession of the file hash can gain ownership of the file by uploading the file hash. More specifically, the cloud treats the request as a store request for a file already in storage, avoids the redundant file transmission, and then adds the user as an additional owner of the file. An illustrative example is shown in Fig. 3d. Such a situation is clearly undesirable because, in theory, the adversary cannot infer the file content via the hash.

However, in this case, once the adversary knows the hash, it is able to download the entire file content. On the other hand, in practice, users consider the hash harmless and in some cases publish hashes as timestamps. These publicly available hashes can then be abused to gain the file. This security weakness comes from using a static and short piece of information (the hash) as a way of claiming file ownership. Motivated by this observation, Halevi et al. [10] introduced the notion of proof of ownership (POW). A POW scheme is jointly executed by the cloud and the user such that the user can prove to the cloud that it is indeed in possession of the file.

4.1 ALGORITHM:

PUBLIC KEY INFRASTRUCTURE (PKI) AND PRIVATE KEY GENERATOR (PKG):

In cryptography, the ElGamal encryption system is an asymmetric key encryption algorithm for public-key cryptography which is based on the Diffie–Hellman key exchange. It was described by Taher Elgamal in 1985. ElGamal encryption is used in the free GNU Privacy Guard software, recent versions of PGP, and other cryptosystems. The DSA (Digital Signature Algorithm) is a variant of the ElGamal signature scheme, which should not be confused with ElGamal encryption. The security of the ElGamal scheme depends on the properties of the underlying group as well as any padding scheme used on the messages.

If the computational Diffie–Hellman assumption (CDH) holds in the underlying cyclic group G, then the encryption function is one-way. If the decisional Diffie–Hellman assumption (DDH) holds in G, then ElGamal achieves semantic security. Semantic security is not implied by the computational Diffie–Hellman assumption alone. See decisional Diffie–Hellman assumption for a discussion of groups where the assumption is believed to hold.

To achieve chosen-ciphertext security, the scheme must be further modified, or an appropriate padding scheme must be used. Depending on the modification, the DDH assumption may or may not be necessary.

Other schemes related to ElGamal which achieve security against chosen ciphertext attacks have also been proposed. The Cramer–Shoup cryptosystem is secure under chosen ciphertext attack assuming DDH holds for G. Its proof does not use the random oracle model. Another proposed scheme is DHAES whose proof requires an assumption that is weaker than the DDH assumption.
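To make the ElGamal description above concrete, the following is a minimal sketch of textbook ElGamal key generation, encryption, and decryption in Java. The class name, the toy group (the Mersenne prime p = 2^127 - 1 with g = 3), and the method names are assumptions chosen for illustration; the sketch uses no padding, so it is not chosen-ciphertext secure and is not suitable for production use.

```java
import java.math.BigInteger;
import java.security.SecureRandom;

public class ElGamalSketch {
    // Toy group for illustration only (assumption): p = 2^127 - 1 is prime, g = 3.
    static final BigInteger P = BigInteger.ONE.shiftLeft(127).subtract(BigInteger.ONE);
    static final BigInteger G = BigInteger.valueOf(3);
    static final SecureRandom RNG = new SecureRandom();

    // Picks a random exponent in [1, p - 2].
    static BigInteger randomExponent() {
        BigInteger k;
        do {
            k = new BigInteger(P.bitLength(), RNG);
        } while (k.signum() == 0 || k.compareTo(P.subtract(BigInteger.ONE)) >= 0);
        return k;
    }

    // Public key h = g^x mod p for the private key x.
    static BigInteger publicKey(BigInteger x) {
        return G.modPow(x, P);
    }

    // Encrypts m (1 <= m < p) under h; returns the pair (c1, c2) = (g^k, m * h^k).
    static BigInteger[] encrypt(BigInteger m, BigInteger h) {
        BigInteger k = randomExponent();              // fresh randomness per message
        BigInteger c1 = G.modPow(k, P);
        BigInteger c2 = m.multiply(h.modPow(k, P)).mod(P);
        return new BigInteger[] { c1, c2 };
    }

    // Recovers m = c2 * (c1^x)^-1 mod p using the private key x.
    static BigInteger decrypt(BigInteger[] c, BigInteger x) {
        BigInteger shared = c[0].modPow(x, P);        // equals h^k
        return c[1].multiply(shared.modInverse(P)).mod(P);
    }

    public static void main(String[] args) {
        BigInteger x = randomExponent();
        BigInteger h = publicKey(x);
        BigInteger m = BigInteger.valueOf(123456789L);
        BigInteger[] c = encrypt(m, h);
        System.out.println("decrypted = " + decrypt(c, x));
    }
}
```

Because a fresh exponent k is drawn for every message, encrypting the same plaintext twice yields different ciphertexts, which is the property that semantic security under DDH relies on.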

4.2 MODULES:

SECURE USER MODULES:

DEDUPLICATED STORAGE:

CHECK DEDUPLICATES:

APPLY POW SCHEME:

SECURE SEND KEY:

4.3 MODULE DESCRIPTION:

SECURE USER MODULES:

In this module, users are authenticated before they can access the details presented in the ontology system. Before accessing or searching the details, a user must have an account; otherwise, the user must register first.

Registration
File View
Encryption
Download
Upload Files
Encrypt and save to cloud

DEDUPLICATED STORAGE:

Client side deduplication incurs its own security weaknesses. First, the privacy of the file existence in the cloud may be compromised because the adversary may try to upload the candidate files to see whether the deduplication takes place. If the deduplication takes place, this will be an indicator of the file’s existence. Otherwise, the adversary may infer the file’s nonexistence. The situation becomes even worse when we consider the low-entropy files because the adversary may exhaustively create different files and upload the hashes to check the file’s existence. For example, a curious colleague may query his/her manager’s salary by uploading different salary sheets because the sheets are of a similar form, restricting the number of file contents to be tested.

CHECK DEDUPLICATES:

Data deduplication can be categorized into two types depending on where the deduplication takes place: server (cloud) side deduplication and client (user) side deduplication. Server side deduplication is simple: the server, after receiving the file, checks whether it already has a copy in storage. The server discards the received file if it does, or creates a new file in the storage if it does not. We can see that server side deduplication does not produce bandwidth saving because the server performs the deduplication after the file has been received. On the other hand, client side deduplication adopts a more aggressive method: the user calculates and sends the hash of the file before uploading the file. Upon receiving the hash, the server checks whether the hash is already in storage. If so, the user is asked to send nothing and the server associates the user with the existing file. Otherwise, the user is asked to upload the file. An illustrative example is shown in Fig. 2, where user 1 first uploads files F1 and F2 in Fig. 2a.

Then the cloud knows from the hashes h(F1) and h(F3) sent by user 2 that there is already a copy of F1 in storage, and sends a positive acknowledgment for F1 and a negative acknowledgment for F3 to user 2. User 2, according to the acknowledgments, sends only F3, saving the transmission of F1. Public cloud storage services (e.g. Dropbox and Bitcasa) are more likely to adopt client side deduplication because of its storage and bandwidth savings. In particular, in addition to the reduced storage requirement, client side deduplication can also reduce the need for file transmission, reducing the waiting time for users and the energy consumption of the server. We particularly mention that even with the increased bandwidth of the coming 5G communication standard, the data rate of wireless links is still not comparable to that of wired links. Thus, if we consider mobile devices accessing cloud storage services, client side deduplication becomes an indispensable technique for MCC applications.
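The hash-first exchange described above can be sketched in a few lines of Java. This is a simplified in-memory model rather than any real storage service's protocol: the class name, the method names, and the choice of SHA-256 as the file hash are all assumptions. The last step in main also illustrates the weakness discussed under PROOF OF OWNERSHIP, where knowing only the hash is enough to be registered as an owner.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class ClientSideDedupSketch {
    // Cloud state: file hash -> stored content, and file hash -> set of owners.
    private final Map<String, byte[]> storage = new HashMap<>();
    private final Map<String, Set<String>> owners = new HashMap<>();

    // Hex-encoded SHA-256 of the file content (hash choice is an assumption).
    static String sha256Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Step 1: the user sends only the hash. A positive acknowledgment (true) means
    // the file is already stored and the user has been added as an owner.
    boolean offerHash(String user, String hash) {
        if (!storage.containsKey(hash)) return false;
        owners.get(hash).add(user);
        return true;
    }

    // Step 2: after a negative acknowledgment the user uploads the file itself.
    void upload(String user, byte[] file) {
        String hash = sha256Hex(file);
        storage.putIfAbsent(hash, file.clone());
        owners.computeIfAbsent(hash, h -> new HashSet<>()).add(user);
    }

    // Whole client flow; returns true if the file had to be transmitted.
    boolean store(String user, byte[] file) {
        if (offerHash(user, sha256Hex(file))) return false;
        upload(user, file);
        return true;
    }

    boolean isOwner(String user, String hash) {
        Set<String> s = owners.get(hash);
        return s != null && s.contains(user);
    }

    public static void main(String[] args) {
        ClientSideDedupSketch cloud = new ClientSideDedupSketch();
        byte[] f1 = "contents of F1".getBytes(StandardCharsets.UTF_8);
        byte[] f3 = "contents of F3".getBytes(StandardCharsets.UTF_8);
        System.out.println("user1 transmits F1: " + cloud.store("user1", f1));
        System.out.println("user2 transmits F1: " + cloud.store("user2", f1));
        System.out.println("user2 transmits F3: " + cloud.store("user2", f3));
        // An eavesdropper who only learned h(F1) is accepted as an owner.
        String h1 = sha256Hex(f1);
        cloud.offerHash("eve", h1);
        System.out.println("eve owns F1: " + cloud.isOwner("eve", h1));
    }
}
```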

APPLY POW SCHEME:

The Merkle-tree-based POW schemes perform very well on the server side since only a small index (the tree root) needs to be stored in main memory. However, the proof of ownership is achieved by the user sending an authentication path of size O(log |f|) to the cloud, resulting in more communication overhead and computation load on the cloud. The I/O overhead of the user side is also increased, compared with other POW schemes, because the user needs to retrieve the entire file. On the other extreme, although the s-POW schemes have great computation and I/O efficiency on the user side, their I/O burden on the cloud is significantly increased since the cloud is required to retrieve random bits from secondary storage.

In this article we propose an alternative design that strikes a balance between server side efficiency and user side efficiency. Before introducing the scheme’s details, we present two observations. First, the Merkle-tree-based POW schemes are I/O efficient at the server side because the Merkle tree root can be thought of as a compact summary of the file. Therefore, there is no need for the cloud to access the disk to retrieve the file. Second, the user side is computationally efficient in the s-POW schemes because the user is only required to answer a few bits of the file. With the above two observations, our design strategy is to have a probabilistic data structure for the compact summary of the file, in contrast to the deterministic data structure, the Merkle hash tree, in the POW schemes. The query challenge is also modified to use random blocks, in contrast to the random bits in s-POW schemes. An overview of the proposed POW scheme goes as follows.
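As an illustration of the design strategy described above, and not the exact protocol of the proposed scheme, the following Java sketch assumes that the cloud summarizes each (block index, block) pair of a file in an in-memory Bloom filter, challenges a claimant with random block indices, and accepts only if every returned block is found in the filter. The class name, block size, filter size, number of hash functions, and the SHA-256 double-hashing construction are all hypothetical choices.

```java
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.BitSet;
import java.util.Random;

public class BloomPowSketch {
    static final int BLOCK = 64;       // block size in bytes (hypothetical)
    static final int BITS = 1 << 16;   // Bloom filter size in bits (hypothetical)
    static final int HASHES = 4;       // bit positions per element (hypothetical)

    private final BitSet filter = new BitSet(BITS);
    private final int blockCount;

    // Cloud side, at first upload: summarize every (index, block) pair in the filter.
    // Assumes a non-empty file.
    BloomPowSketch(byte[] file) {
        blockCount = (file.length + BLOCK - 1) / BLOCK;
        for (int i = 0; i < blockCount; i++) {
            for (int pos : positions(i, block(file, i))) filter.set(pos);
        }
    }

    // Returns block i of the file (the last block may be shorter; empty if out of range).
    static byte[] block(byte[] file, int i) {
        int from = i * BLOCK;
        if (from >= file.length) return new byte[0];
        return Arrays.copyOfRange(file, from, Math.min(file.length, from + BLOCK));
    }

    // Derives HASHES bit positions from SHA-256(index || block) by double hashing.
    static int[] positions(int index, byte[] block) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            md.update(ByteBuffer.allocate(4).putInt(index).array());
            ByteBuffer d = ByteBuffer.wrap(md.digest(block));
            int h1 = d.getInt();
            int h2 = d.getInt();
            int[] out = new int[HASHES];
            for (int j = 0; j < HASHES; j++) out[j] = Math.floorMod(h1 + j * h2, BITS);
            return out;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Cloud side: choose which blocks the claimant must present.
    int[] challenge(int count, Random rng) {
        int[] idx = new int[count];
        for (int j = 0; j < count; j++) idx[j] = rng.nextInt(blockCount);
        return idx;
    }

    // User side: answer with the requested blocks read from the local copy.
    static byte[][] respond(byte[] file, int[] idx) {
        byte[][] out = new byte[idx.length][];
        for (int j = 0; j < idx.length; j++) out[j] = block(file, idx[j]);
        return out;
    }

    // Cloud side: accept only if every (index, block) pair is in the filter.
    // No disk access is needed, only the in-memory filter.
    boolean verify(int[] idx, byte[][] blocks) {
        for (int j = 0; j < idx.length; j++) {
            for (int pos : positions(idx[j], blocks[j])) {
                if (!filter.get(pos)) return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] file = new byte[4096];
        new Random(1).nextBytes(file);
        BloomPowSketch cloud = new BloomPowSketch(file);
        int[] idx = cloud.challenge(8, new Random());
        System.out.println("honest user accepted: " + cloud.verify(idx, respond(file, idx)));
    }
}
```

Because a Bloom filter admits false positives, a claimant without the file could in principle be accepted with a small probability; in this sketch that probability shrinks as the filter grows or as more blocks are challenged.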

SECURE SEND KEY:

Once the key request is received, the sender can either send the key or decline the request. With this key and the request ID, which was generated when the key request was sent, the receiver can decrypt the message.

CHAPTER 5

5.0 SYSTEM STUDY:

5.1 FEASIBILITY STUDY:

Three key considerations involved in the feasibility analysis are

ECONOMICAL FEASIBILITY
TECHNICAL FEASIBILITY
SOCIAL FEASIBILITY

5.1.1 ECONOMICAL FEASIBILITY:

5.1.2 TECHNICAL FEASIBILITY

5.1.3 SOCIAL FEASIBILITY:

5.2 SYSTEM TESTING:

5.2.1 UNIT TESTING:

Description	Expected result
Test for application window properties.	All the properties of the windows are to be properly aligned and displayed.
Test for mouse operations.	All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions.

5.2.2 FUNCTIONAL TESTING:

Description	Expected result
Test for all modules.	All peers should communicate in the group.
Test for various peers in a distributed network framework, which displays all users available in the group.	The result after execution should be accurate.

5.2.3 NON-FUNCTIONAL TESTING:

Load testing
Performance testing
Usability testing
Reliability testing
Security testing

5.2.4 LOAD TESTING:

Description	Expected result
It is necessary to ascertain that the application behaves correctly under loads when ‘Server busy’ response is received.	Should designate another active node as a Server.

5.2.5 PERFORMANCE TESTING:

Description	Expected result
This is required to ensure that the application performs adequately: it must be capable of handling many peers, deliver its results in the expected time, and use an acceptable level of resources. It is an aspect of operational management.	Should handle large input values and produce accurate results in the expected time.

5.2.6 RELIABILITY TESTING:

Description	Expected result
This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in providing the application.	In case of failure of the server, an alternate server should take over the job.

5.2.7 SECURITY TESTING:

Description	Expected result
Checking that the user identification is authenticated.	In case of failure, the user should not be connected to the framework.
Check whether group keys in a tree are shared by all peers.	The peers should know group key in the same group.

5.2.8 WHITE BOX TESTING:

Description	Expected result
Exercise all logical decisions on their true and false sides.	All the logical decisions must be valid.
Execute all loops at their boundaries and within their operational bounds.	All the loops must be finite.
Exercise internal data structures to ensure their validity.	All the data structures must be valid.

5.2.9 BLACK BOX TESTING:

Description	Expected result
To check for incorrect or missing functions.	All the functions must be valid.
To check for interface errors.	The entire interface must function normally.
To check for errors in data structures or external database access.	Database updates and retrievals must complete successfully.
To check for initialization and termination errors.	All the functions and data structures must be initialized properly and terminated normally.

All the above system testing strategies are carried out because the development, documentation, and institutionalization of the proposed goals and related policies are essential.

CHAPTER 6

6.0 SOFTWARE DESCRIPTION:

6.1 JAVA TECHNOLOGY:

Java technology is both a programming language and a platform.

The Java Programming Language

The Java programming language is a high-level language that can be characterized by all of the following buzzwords:

- Simple
- Architecture neutral
- Object oriented
- Portable
- Distributed
- High performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure

6.2 THE JAVA PLATFORM:

The Java platform has two components:

The Java Virtual Machine (Java VM)
The Java Application Programming Interface (Java API)

You’ve already been introduced to the Java VM. It’s the base for the Java platform and is ported onto various hardware-based platforms.

The following figure depicts a program that’s running on the Java platform. As the figure shows, the Java API and the virtual machine insulate the program from the hardware.

6.3 WHAT CAN JAVA TECHNOLOGY DO?

The essentials: Objects, strings, threads, numbers, input and output, data structures, system properties, date and time, and so on.
Applets: The set of conventions used by applets.
Networking: URLs, TCP (Transmission Control Protocol), UDP (User Datagram Protocol) sockets, and IP (Internet Protocol) addresses.
Internationalization: Help for writing programs that can be localized for users worldwide. Programs can automatically adapt to specific locales and be displayed in the appropriate language.
Security: Both low level and high level, including electronic signatures, public and private key management, access control, and certificates.
Software components: Known as JavaBeans™, can plug into existing component architectures.
Object serialization: Allows lightweight persistence and communication via Remote Method Invocation (RMI).
Java Database Connectivity (JDBC™): Provides uniform access to a wide range of relational databases.

The Java platform also has APIs for 2D and 3D graphics, accessibility, servers, collaboration, telephony, speech, animation, and more. The following figure depicts what is included in the Java 2 SDK.

6.4 HOW WILL JAVA TECHNOLOGY CHANGE MY LIFE?

Get started quickly: Although the Java programming language is a powerful object-oriented language, it’s easy to learn, especially for programmers already familiar with C or C++.
Write less code: Comparisons of program metrics (class counts, method counts, and so on) suggest that a program written in the Java programming language can be four times smaller than the same program in C++.
Write better code: The Java programming language encourages good coding practices, and its garbage collection helps you avoid memory leaks. Its object orientation, its JavaBeans component architecture, and its wide-ranging, easily extendible API let you reuse other people’s tested code and introduce fewer bugs.
Develop programs more quickly: Your development time may be as much as twice as fast versus writing the same program in C++. Why? You write fewer lines of code and it is a simpler programming language than C++.
Avoid platform dependencies with 100% Pure Java: You can keep your program portable by avoiding the use of libraries written in other languages. The 100% Pure Java™ Product Certification Program has a repository of historical process manuals, white papers, brochures, and similar materials online.
Write once, run anywhere: Because 100% Pure Java programs are compiled into machine-independent byte codes, they run consistently on any Java platform.
Distribute software more easily: You can upgrade applets easily from a central server. Applets take advantage of the feature of allowing new classes to be loaded “on the fly,” without recompiling the entire program.

6.5 ODBC:

6.6 JDBC:

JDBC was announced in March of 1996. It was released for a 90 day public review that ended June 8, 1996. Because of user input, the final JDBC v1.0 specification was released soon after.

6.7 JDBC Goals:

The goals that were set for JDBC are important. They will give you some insight as to why certain classes and functionalities behave the way they do. The eight design goals for JDBC are as follows:

SQL Level API

SQL Conformance

JDBC must be implementable on top of common database interfaces

Provide a Java interface that is consistent with the rest of the Java system

Because of Java’s acceptance in the user community thus far, the designers feel that they should not stray from the current design of the core Java system.

Keep it simple

Use strong, static typing wherever possible

Strong typing allows more error checking to be done at compile time, so fewer errors appear at run time.

Keep the common cases simple

Finally, we decided to proceed with the implementation using Java Networking.

For dynamically updating the cache table, we use an MS Access database.

Java is two things: a programming language and a platform.

Java is a high-level programming language that is all of the following: simple, architecture-neutral, object-oriented, portable, distributed, high-performance, interpreted, multithreaded, robust, dynamic, and secure.

Compilation happens just once; interpretation occurs each time the program is executed. The figure illustrates how this works.

6.7 NETWORKING TCP/IP STACK:

The TCP/IP stack is shorter than the OSI one:

TCP is a connection-oriented protocol; UDP (User Datagram Protocol) is a connectionless protocol.

IP datagrams:

UDP:

UDP is also connectionless and unreliable. What it adds to IP is a checksum for the contents of the datagram and port numbers. These are used to give a client/server model – see later.

TCP:

TCP supplies logic to give a reliable connection-oriented protocol above IP. It provides a virtual circuit that two processes can use to communicate.
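The virtual circuit that TCP provides can be illustrated with Java's standard java.net classes. The sketch below is only an illustration: the class and method names are hypothetical, and it runs both endpoints in one process over the loopback interface, whereas a real client and server would normally be separate programs.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class TcpEchoSketch {
    // Server side: accept one connection and send the first line straight back.
    static void serveOnce(ServerSocket server) {
        try (Socket s = server.accept();
             BufferedReader in = new BufferedReader(new InputStreamReader(s.getInputStream()));
             PrintWriter out = new PrintWriter(s.getOutputStream(), true)) {
            out.println(in.readLine());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Client side: connect, send one line, and return the echoed reply.
    static String echoOnce(String message) {
        // Port 0 asks the operating system for any free port.
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            Thread serverThread = new Thread(() -> serveOnce(server));
            serverThread.start();
            try (Socket c = new Socket(InetAddress.getLoopbackAddress(), server.getLocalPort());
                 PrintWriter out = new PrintWriter(c.getOutputStream(), true);
                 BufferedReader in = new BufferedReader(new InputStreamReader(c.getInputStream()))) {
                out.println(message);       // bytes arrive in order, or the connection fails
                return in.readLine();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("echoed: " + echoOnce("hello over TCP"));
    }
}
```

The port number obtained from getLocalPort() plays the role of the port address that, together with the IP address, identifies the server end of the connection.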

Internet addresses

In order to use a service, you must be able to find it. The Internet uses an address scheme for machines so that they can be located. The address is a 32 bit integer which gives the IP address.

Network address:

Class A uses 8 bits for the network address, with 24 bits left over for other addressing. Class B uses 16-bit network addressing and class C uses 24-bit network addressing. Class D addresses are reserved for multicast and are not split into network and host parts.

Subnet address:

Internally, the UNIX network is divided into sub networks. Building 11 is currently on one sub network and uses 10-bit addressing, allowing 1024 different hosts.

Host address:

8 bits are finally used for host addresses within our subnet. This places a limit of 256 machines that can be on the subnet.

Total address:

The 32 bit address is usually written as 4 integers separated by dots.
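The addressing arithmetic above can be checked with a short Java sketch. The class and method names are hypothetical; the classful ranges used (first octet below 128 for class A, below 192 for class B, below 224 for class C) follow the traditional leading-bit convention.

```java
public class Ipv4Sketch {
    // Formats a 32-bit address as dotted decimal, most significant byte first.
    static String toDotted(int addr) {
        return ((addr >>> 24) & 0xFF) + "." + ((addr >>> 16) & 0xFF) + "."
                + ((addr >>> 8) & 0xFF) + "." + (addr & 0xFF);
    }

    // Parses dotted decimal back into a 32-bit integer.
    static int parse(String dotted) {
        String[] parts = dotted.split("\\.");
        if (parts.length != 4) throw new IllegalArgumentException("expected 4 octets");
        int addr = 0;
        for (String p : parts) {
            int octet = Integer.parseInt(p);
            if (octet < 0 || octet > 255) throw new IllegalArgumentException("bad octet " + p);
            addr = (addr << 8) | octet;
        }
        return addr;
    }

    // Number of network bits under classful addressing: A = 8, B = 16, C = 24.
    // Returns -1 for class D and E, which have no network/host split.
    static int classfulNetworkBits(int addr) {
        int first = (addr >>> 24) & 0xFF;
        if (first < 128) return 8;    // leading bit 0
        if (first < 192) return 16;   // leading bits 10
        if (first < 224) return 24;   // leading bits 110
        return -1;
    }

    // Number of distinct host values that fit in the given number of host bits.
    static long hostCount(int hostBits) {
        return 1L << hostBits;
    }

    public static void main(String[] args) {
        int addr = parse("130.95.1.7");   // example address, chosen arbitrarily
        System.out.println(toDotted(addr) + " has " + classfulNetworkBits(addr) + " network bits");
        System.out.println("8 host bits allow " + hostCount(8) + " host values");
        System.out.println("10 host bits allow " + hostCount(10) + " host values");
    }
}
```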

Port addresses

Sockets:

#include <sys/types.h>

#include <sys/socket.h>

int socket(int family, int type, int protocol);

6.8 JFREE CHART:

JFreeChart is a free 100% Java chart library that makes it easy for developers to display professional quality charts in their applications. JFreeChart’s extensive feature set includes:

A consistent and well-documented API, supporting a wide range of chart types;

A flexible design that is easy to extend, and targets both server-side and client-side applications;

Support for many output types, including Swing components, image files (including PNG and JPEG), and vector graphics file formats (including PDF, EPS and SVG);

JFreeChart is “open source” or, more specifically, free software. It is distributed under the terms of the GNU Lesser General Public Licence (LGPL), which permits use in proprietary applications.

6.8.1. Map Visualizations:

6.8.2. Time Series Chart Interactivity

6.8.3. Dashboards

6.8.4. Property Editors

CHAPTER 7

7.0 APPENDIX

7.1 SAMPLE SCREEN SHOTS:

7.2 SAMPLE SOURCE CODE:

CHAPTER 8

8.1 CONCLUSION AND FUTURE:

We propose an alternative POW design for the problem of unauthorized file downloading in deduplicated cloud storage. In our design, the use of a probabilistic data structure, the Bloom filter, primarily contributes to the overhead reduction. Since the Bloom filter has been used widely in various applications and is easy to implement, our proposed POW scheme is considered realistic and can be deployed in real-world cloud storage services. Despite the use of the Bloom filter in reducing the I/O needs, the size of the Bloom filter may grow with the number of files stored in the cloud. The Bloom filter may also become so large that it needs to be partitioned, with part of it stored on disk. Thus, one possible future research focus is to develop a more succinct data structure or devise a new index mechanism such that the index (the Bloom filter in this article) fits into memory even in the case of a huge number of files in the cloud.

Privacy-Preserving Detection of Sensitive Data Exposure

05/08/201902/07/2019 by admin

This paper presents an initiative data prefetching scheme on the storage servers in distributed file systems for cloud computing. In this prefetching technique, the client machines are not substantially involved in the process of data prefetching, but the storage servers can directly prefetch the data after analyzing the history of disk I/O access events, and then send the prefetched data to the relevant client machines proactively. To put this technique to work, the information about client nodes is piggybacked onto the real client I/O requests, and then forwarded to the relevant storage server. Next, two prediction algorithms have been proposed to forecast future block access operations for directing what data should be fetched on storage servers in advance.

Finally, the prefetched data can be pushed to the relevant client machine from the storage server. Through a series of evaluation experiments with a collection of application benchmarks, we have demonstrated that our presented initiative prefetching technique can benefit distributed file systems for cloud environments to achieve better I/O performance. In particular, configuration-limited client machines in the cloud are not responsible for predicting I/O access operations, which can definitely contribute to preferable system performance on them.

1.2 INTRODUCTION

The assimilation of distributed computing for search engines, multimedia websites, and data-intensive applications has brought about the generation of data at unprecedented speed. For instance, the amount of data created, replicated, and consumed in the United States may double every three years through the end of this decade. In general, the file system deployed in a distributed computing environment is called a distributed file system, which is typically used as a backend storage system to provide I/O services for various sorts of data-intensive applications in cloud computing environments. In fact, the distributed file system employs multiple distributed I/O devices by striping file data across the I/O nodes, and uses high aggregate bandwidth to meet the growing I/O requirements of distributed and parallel scientific applications.

However, because distributed file systems scale both numerically and geographically, the network delay is becoming the dominant factor in remote file system access [26], [34]. With regard to this issue, numerous data prefetching mechanisms have been proposed to hide the latency in distributed file systems caused by network communication and disk operations. In these conventional prefetching mechanisms, the client file system (which is a part of the file system and runs on the client machine) is supposed to predict future access by analyzing the history of past I/O accesses without any application intervention. After that, the client file system may send relevant I/O requests to storage servers to read the relevant data in. Consequently, applications that have intensive read workloads automatically benefit not only from better use of available bandwidth, but also from fewer file operations via batched I/O requests through prefetching.

On the other hand, mobile devices generally have limited processing power, battery life and storage, but cloud computing offers an illusion of infinite computing resources. To combine mobile devices and cloud computing into a new infrastructure, the mobile cloud computing research field emerged [45]. Namely, mobile cloud computing provides mobile applications with data storage and processing services in clouds, obviating the need for a powerful hardware configuration, because all resource-intensive computing can be completed in the cloud. Thus, conventional prefetching schemes are not the best-suited optimization strategies for distributed file systems to boost I/O performance in mobile clouds, since these schemes require the client file systems running on client machines to proactively issue prefetching requests after analyzing the access events they have recorded, which inevitably places a burden on the client nodes.

Furthermore, considering that disk I/O events alone can reveal the disk tracks, which offer critical information for performing I/O optimization tactics, certain prefetching techniques have been proposed in succession to read the data on the disk in advance after analyzing disk I/O traces. However, this kind of prefetching only works for local file systems, and the prefetched data is cached on the local machine to fulfill the application’s I/O requests passively. In brief, although block access history reveals the behavior of disk tracks, there are no prefetching schemes on storage servers in a distributed file system for yielding better system performance. The reason for this situation is the difficulty of modeling the block access history to generate block access patterns and of deciding the destination client machine to which the storage servers should push the prefetched data.

1.3 LITERATURE SURVEY

PARTIAL REPLICATION OF METADATA TO ACHIEVE HIGH METADATA AVAILABILITY IN PARALLEL FILE SYSTEMS

AUTHOR: J. Liao, Y. Ishikawa

PUBLISH: In the Proceedings of 41st International Conference on Parallel Processing (ICPP ’12), pp. 168–177, 2012.

EXPLANATION:

This paper presents PARTE, a prototype parallel file system with active/standby configured metadata servers (MDSs). PARTE replicates and distributes a part of files’ metadata to the corresponding metadata stripes on the storage servers (OSTs) with a per-file granularity, meanwhile the client file system (client) keeps certain sent metadata requests. If the active MDS has crashed for some reason, these client backup requests will be replayed by the standby MDS to restore the lost metadata. In case one or more backup requests are lost due to network problems or dead clients, the latest metadata saved in the associated metadata stripes will be used to construct consistent and up-to-date metadata on the standby MDS. Moreover, the clients and OSTs can work in both normal mode and recovery mode in the PARTE file system. This differs from conventional active/standby configured MDSs parallel file systems, which hang all I/O requests and metadata requests during restoration of the lost metadata. In the PARTE file system, previously connected clients can continue to perform I/O operations and relevant metadata operations, because OSTs work as temporary MDSs during that period by using the replicated metadata in the relevant metadata stripes. Through examination of experimental results, we show the feasibility of the main ideas presented in this paper for providing high availability metadata service with only a slight overhead effect on I/O performance. Furthermore, since previously connected clients are never hanged during metadata recovery, in contrast to conventional systems, a better overall I/O data throughput can be achieved with PARTE.

EVALUATING PERFORMANCE AND ENERGY IN FILE SYSTEM SERVER WORKLOADS

AUTHOR: P. Sehgal, V. Tarasov, E. Zadok

PUBLISH: the 8th USENIX Conference on File and Storage Technologies (FAST ’10), pp.253-266, 2010.

EXPLANATION:

Recently, power has emerged as a critical factor in designing components of storage systems, especially for power-hungry data centers. While there is some research into power-aware storage stack components, there are no systematic studies evaluating each component’s impact separately. This paper evaluates the file system’s impact on energy consumption and performance. We studied several popular Linux file systems, with various mount and format options, using the FileBench workload generator to emulate four server workloads: Web, database, mail, and file server. In case of a server node consisting of a single disk, CPU power generally exceeds disk-power consumption. However, file system design, implementation, and available features have a significant effect on CPU/disk utilization, and hence on performance and power. We discovered that default file system options are often suboptimal, and even poor. We show that a careful matching of expected workloads to file system types and options can improve power-performance efficiency by a factor ranging from 1.05 to 9.4 times.

FLEXIBLE, WIDE-AREA STORAGE FOR DISTRIBUTED SYSTEMS WITH WHEELFS

AUTHOR: J. Stribling, Y. Sovran, I. Zhang and R. Morris et al

PUBLISH: In Proceedings of the 6th USENIX symposium on Networked systems design and implementation (NSDI’09), USENIX Association, pp. 43–58, 2009.

EXPLANATION:

WheelFS is a wide-area distributed storage system intended to help multi-site applications share data and gain fault tolerance. WheelFS takes the form of a distributed file system with a familiar POSIX interface. Its design allows applications to adjust the tradeoff between prompt visibility of updates from other sites and the ability for sites to operate independently despite failures and long delays. WheelFS allows these adjustments via semantic cues, which provide application control over consistency, failure handling, and file and replica placement. WheelFS is implemented as a user-level file system and is deployed on PlanetLab and Emulab. Three applications (a distributed Web cache, an email service and large file distribution) demonstrate that WheelFS’s file system interface simplifies construction of distributed applications by allowing reuse of existing software. These applications would perform poorly with the strict semantics implied by a traditional file system interface, but by providing cues to WheelFS they are able to achieve good performance. Measurements show that applications built on WheelFS deliver comparable performance to services such as CoralCDN and BitTorrent that use specialized wide-area storage systems.

CHAPTER 2

2.0 SYSTEM ANALYSIS

2.1 EXISTING SYSTEM:

The file system deployed in a distributed computing environment is called a distributed file system, which is typically used as a backend storage system to provide I/O services for various sorts of data-intensive applications in cloud computing environments. In fact, the distributed file system employs multiple distributed I/O devices by striping file data across the I/O nodes, and uses high aggregate bandwidth to meet the growing I/O requirements of distributed and parallel scientific applications. The Sysbench benchmark is used to create OLTP workloads, since it is able to create OLTP workloads similar to those that exist in real systems. All the configured client file systems execute the same script, and each of them runs several threads that issue OLTP requests. Because Sysbench requires MySQL installed as a backend for OLTP workloads, we configured the mysqld process to use 16 cores of the storage servers. As a consequence, it is possible to measure the response time to client requests while handling the generated workloads.

2.1.1 DISADVANTAGES:

Network delay in numerically and geographically remote file system access

Mobile devices generally have limited processing power, battery life and storage

2.2 PROPOSED SYSTEM:

Certain prefetching techniques have been proposed in succession to read the data on the disk in advance after analyzing disk I/O traces, but this kind of prefetching only works for local file systems, and the prefetched data is cached on the local machine to fulfill the application’s I/O requests passively. In brief, although block access history reveals the behavior of disk tracks, there are no prefetching schemes on storage servers in a distributed file system for yielding better system performance. The reason for this situation is the difficulty of modeling the block access history to generate block access patterns and of deciding the destination client machine to which the storage servers should push the prefetched data. To yield attractive I/O performance in the distributed file system deployed in a mobile cloud environment or a cloud environment that has many resource-limited client machines, this paper presents an initiative data prefetching mechanism. The proposed mechanism first analyzes disk I/O tracks to predict the future disk I/O access so that the storage servers can fetch data in advance, and then forwards the prefetched data to the relevant client file systems for potential future use.

This paper makes the following two contributions:

1) Chaotic time series prediction and linear regression prediction to forecast disk I/O access. We have modeled the disk I/O access operations and classified them into two kinds of access patterns, i.e., the random access pattern and the sequential access pattern. To predict the future I/O access belonging to each access pattern as accurately as possible (note that the future I/O access indicates what data will be requested in the near future), two prediction algorithms are proposed: the chaotic time series prediction algorithm and the linear regression prediction algorithm, respectively.

2) Initiative data prefetching on storage servers. Prefetching proceeds without any intervention from client file systems, except that they piggyback their information onto relevant I/O requests to the storage servers. The storage servers log disk I/O access and classify access patterns after modeling disk I/O events. Next, by properly using the two proposed prediction algorithms, the storage servers predict the future disk I/O access to guide data prefetching. Finally, the storage servers proactively forward the prefetched data to the relevant client file systems to satisfy future application requests.
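The linear regression prediction named in the first contribution, which the text appears to pair with the sequential access pattern, can be illustrated with a minimal sketch. It fits a least-squares line to a short history of block offsets and extrapolates one step ahead. The class name `LinearOffsetPredictor`, the method `predictNext`, and the sample offsets are hypothetical; the source does not state how its regression is parameterized, so this is an illustration under those assumptions rather than the authors' implementation.

```java
/** Hypothetical sketch: least-squares line fitted to recent block offsets. */
final class LinearOffsetPredictor {

    private LinearOffsetPredictor() {
    }

    /**
     * Fits offset = intercept + slope * i over the access indices i = 0..n-1
     * and returns the value of the fitted line at i = n (the next access).
     */
    static double predictNext(long[] offsets) {
        int n = offsets.length;
        if (n < 2) {
            throw new IllegalArgumentException("need at least two observations");
        }
        double sumX = 0;
        double sumY = 0;
        double sumXY = 0;
        double sumXX = 0;
        for (int i = 0; i < n; i++) {
            sumX += i;
            sumY += offsets[i];
            sumXY += (double) i * offsets[i];
            sumXX += (double) i * i;
        }
        // Closed-form ordinary least squares for a single predictor.
        double slope = (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
        double intercept = (sumY - slope * sumX) / n;
        return intercept + slope * n;
    }

    public static void main(String[] args) {
        // Assumed sample history: a sequential scan with a stride of 8 blocks.
        long[] history = {1000, 1008, 1016, 1024};
        System.out.println(predictNext(history)); // prints 1032.0
    }
}
```

A storage server following this idea would round the predicted offset to a block boundary before issuing the read; that step is omitted here because the source gives no block size.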

2.2.1 ADVANTAGES:

Applications with read-intensive workloads can automatically make better use of the available bandwidth.

Fewer file operations, because I/O requests are batched through prefetching

Cloud computing offers an illusion of infinite computing resources

2.3 HARDWARE & SOFTWARE REQUIREMENTS:

2.3.1 HARDWARE REQUIREMENT:

- Processor – Pentium IV
- Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Key Board – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA

2.3.2 SOFTWARE REQUIREMENTS:

JAVA

Operating System : Windows XP or Win7
Front End : JAVA JDK 1.7
Script : JavaScript
Document : MS-Office 2007

CHAPTER 3

3.0 SYSTEM DESIGN:

Data Flow Diagram / Use Case Diagram / Flow Diagram:

The DFD is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on these data, and the output data generated by the system.

The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components. These components are the system processes, the data used by the processes, the external entities that interact with the system, and the information flows in the system.

A DFD shows how information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.

A DFD may be used to represent a system at any level of abstraction. It may be partitioned into levels that represent increasing information flow and functional detail.

NOTATION:

SOURCE OR DESTINATION OF DATA:

External sources or destinations, which may be people or organizations or other entities

DATA STORE:

Here the data referenced by a process is stored and retrieved.

PROCESS:

People, procedures, or devices that produce data. The physical component is not identified.

DATA FLOW:

Data moves in a specific direction from an origin to a destination. The data flow is a “packet” of data.

MODELING RULES:

There are several common modeling rules when creating DFDs:

All processes must have at least one data flow in and one data flow out.
All processes should modify the incoming data, producing new forms of outgoing data.
Each data store must be involved with at least one data flow.
Each external entity must be involved with at least one data flow.
A data flow must be attached to at least one process.
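The structural rules above lend themselves to an automatic check. The following is a minimal sketch, assuming a diagram is given as a map from node names to node kinds plus a list of directed flows; `DfdValidator`, `Kind`, and `Flow` are hypothetical names introduced for illustration. The rule that processes should modify incoming data is semantic and is not checked here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: structural checks for the DFD modeling rules. */
final class DfdValidator {

    enum Kind { PROCESS, DATA_STORE, EXTERNAL_ENTITY }

    /** A directed data flow between two named nodes. */
    static final class Flow {
        final String from;
        final String to;

        Flow(String from, String to) {
            this.from = from;
            this.to = to;
        }
    }

    /** Returns one message per violation; an empty list means the diagram passes. */
    static List<String> validate(Map<String, Kind> nodes, List<Flow> flows) {
        List<String> errors = new ArrayList<String>();
        Map<String, Integer> incoming = new HashMap<String, Integer>();
        Map<String, Integer> outgoing = new HashMap<String, Integer>();

        for (Flow flow : flows) {
            // Rule: a data flow must be attached to at least one process.
            if (nodes.get(flow.from) != Kind.PROCESS && nodes.get(flow.to) != Kind.PROCESS) {
                errors.add("flow " + flow.from + " -> " + flow.to + " is not attached to a process");
            }
            bump(outgoing, flow.from);
            bump(incoming, flow.to);
        }

        for (Map.Entry<String, Kind> node : nodes.entrySet()) {
            int in = count(incoming, node.getKey());
            int out = count(outgoing, node.getKey());
            if (node.getValue() == Kind.PROCESS) {
                // Rule: every process needs at least one flow in and one flow out.
                if (in == 0 || out == 0) {
                    errors.add("process " + node.getKey() + " needs a flow in and a flow out");
                }
            } else if (in + out == 0) {
                // Rule: every data store and external entity takes part in some flow.
                errors.add(node.getKey() + " is not involved in any data flow");
            }
        }
        return errors;
    }

    private static void bump(Map<String, Integer> counts, String key) {
        counts.put(key, count(counts, key) + 1);
    }

    private static int count(Map<String, Integer> counts, String key) {
        Integer value = counts.get(key);
        return value == null ? 0 : value;
    }

    public static void main(String[] args) {
        Map<String, Kind> nodes = new HashMap<String, Kind>();
        nodes.put("Client", Kind.EXTERNAL_ENTITY);
        nodes.put("Predict", Kind.PROCESS);
        List<Flow> flows = new ArrayList<Flow>();
        flows.add(new Flow("Client", "Predict"));
        // "Predict" has no outgoing flow, so one violation is reported.
        System.out.println(validate(nodes, flows));
    }
}
```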

3.1 ARCHITECTURE DIAGRAM

3.2 DATAFLOW DIAGRAM

UML DIAGRAMS:

3.2 USE CASE DIAGRAM:

3.3 CLASS DIAGRAM:

3.4 SEQUENCE DIAGRAM:

3.5 ACTIVITY DIAGRAM:

CHAPTER 4

4.0 IMPLEMENTATION:

I/O ACCESS PREDICTION

4.1 ALGORITHM

MARKOV MODEL PREDICTION ALGORITHM

LINEAR PREDICTION ALGORITHM

4.2 MODULES:

SERVER CLIENT MODULE:

DISTRIBUTED FILE SYSTEMS:

INITIATIVE DATA PREFETCHING:

ANALYSIS OF PREDICTIONS:

4.3 MODULE DESCRIPTION:

SERVER CLIENT MODULE:

DISTRIBUTED FILE SYSTEMS:

INITIATIVE DATA PREFETCHING:

ANALYSIS OF PREDICTIONS:

CHAPTER 5

5.0 SYSTEM STUDY:

5.1 FEASIBILITY STUDY:

Three key considerations involved in the feasibility analysis are

ECONOMICAL FEASIBILITY
TECHNICAL FEASIBILITY
SOCIAL FEASIBILITY

5.1.1 ECONOMICAL FEASIBILITY:

5.1.2 TECHNICAL FEASIBILITY

5.1.3 SOCIAL FEASIBILITY:

5.2 SYSTEM TESTING:

5.2.1 UNIT TESTING:

Description	Expected result
Test for application window properties.	All the properties of the windows are to be properly aligned and displayed.
Test for mouse operations.	All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions.

5.1.2 FUNCTIONAL TESTING:

Description	Expected result
Test for all modules.	All peers should communicate in the group.
Test that each peer in the distributed network framework displays all users available in the group.	Execution should produce the accurate result.

5.1.3 NON-FUNCTIONAL TESTING:

Load testing
Performance testing
Usability testing
Reliability testing
Security testing

5.1.4 LOAD TESTING:

Description	Expected result
It is necessary to ascertain that the application behaves correctly under loads when ‘Server busy’ response is received.	Should designate another active node as a Server.

5.1.5 PERFORMANCE TESTING:

Description	Expected result
This is required to assure that an application performs adequately, having the capability to handle many peers, delivering its results in the expected time, and using an acceptable level of resources; it is an aspect of operational management.	Should handle large input values and produce accurate results in the expected time.

5.1.6 RELIABILITY TESTING:

Description	Expected result
This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in providing the application.	In case of failure of the server, an alternate server should take over the job.

5.1.7 SECURITY TESTING:

Description	Expected result
Check that the user identification is authenticated.	In case of failure, the user should not be connected to the framework.
Check whether group keys in a tree are shared by all peers.	Peers in the same group should know the group key.

5.1.8 WHITE BOX TESTING:

Description	Expected result
Exercise all logical decisions on their true and false sides.	All the logical decisions must be valid.
Execute all loops at their boundaries and within their operational bounds.	All the loops must be finite.
Exercise internal data structures to ensure their validity.	All the data structures must be valid.

5.1.9 BLACK BOX TESTING:

Description	Expected result
To check for incorrect or missing functions.	All the functions must be valid.
To check for interface errors.	The entire interface must function normally.
To check for errors in data structures or external database access.	Database updates and retrievals must complete correctly.
To check for initialization and termination errors.	All the functions and data structures must be initialized properly and terminated normally.

All of the above system testing strategies are carried out, as the development, documentation, and institutionalization of the proposed goals and related policies are essential.

CHAPTER 6

6.0 SOFTWARE DESCRIPTION:

6.1 JAVA TECHNOLOGY:

Java technology is both a programming language and a platform.

The Java Programming Language

The Java programming language is a high-level language that can be characterized by all of the following buzzwords:

- Simple
- Architecture neutral
- Object oriented
- Portable
- Distributed
- High performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure

6.2 THE JAVA PLATFORM:

The Java platform has two components:

The Java Virtual Machine (Java VM)
The Java Application Programming Interface (Java API)

You’ve already been introduced to the Java VM. It’s the base for the Java platform and is ported onto various hardware-based platforms.

The following figure depicts a program that’s running on the Java platform. As the figure shows, the Java API and the virtual machine insulate the program from the hardware.

6.3 WHAT CAN JAVA TECHNOLOGY DO?

The essentials: Objects, strings, threads, numbers, input and output, data structures, system properties, date and time, and so on.
Applets: The set of conventions used by applets.
Networking: URLs, TCP (Transmission Control Protocol), UDP (User Datagram Protocol) sockets, and IP (Internet Protocol) addresses.
Internationalization: Help for writing programs that can be localized for users worldwide. Programs can automatically adapt to specific locales and be displayed in the appropriate language.
Security: Both low level and high level, including electronic signatures, public and private key management, access control, and certificates.
Software components: Known as JavaBeans™, can plug into existing component architectures.
Object serialization: Allows lightweight persistence and communication via Remote Method Invocation (RMI).
Java Database Connectivity (JDBC™): Provides uniform access to a wide range of relational databases.

The Java platform also has APIs for 2D and 3D graphics, accessibility, servers, collaboration, telephony, speech, animation, and more. The following figure depicts what is included in the Java 2 SDK.

6.4 HOW WILL JAVA TECHNOLOGY CHANGE MY LIFE?

Get started quickly: Although the Java programming language is a powerful object-oriented language, it’s easy to learn, especially for programmers already familiar with C or C++.
Write less code: Comparisons of program metrics (class counts, method counts, and so on) suggest that a program written in the Java programming language can be four times smaller than the same program in C++.
Write better code: The Java programming language encourages good coding practices, and its garbage collection helps you avoid memory leaks. Its object orientation, its JavaBeans component architecture, and its wide-ranging, easily extendible API let you reuse other people’s tested code and introduce fewer bugs.
Develop programs more quickly: Your development may be as much as twice as fast as writing the same program in C++. Why? You write fewer lines of code, and Java is a simpler programming language than C++.
Avoid platform dependencies with 100% Pure Java: You can keep your program portable by avoiding the use of libraries written in other languages. The 100% Pure Java™ Product Certification Program has a repository of historical process manuals, white papers, brochures, and similar materials online.
Write once, run anywhere: Because 100% Pure Java programs are compiled into machine-independent byte codes, they run consistently on any Java platform.
Distribute software more easily: You can upgrade applets easily from a central server. Applets take advantage of the feature of allowing new classes to be loaded “on the fly,” without recompiling the entire program.

6.5 ODBC:

6.6 JDBC:

JDBC was announced in March of 1996. It was released for a 90 day public review that ended June 8, 1996. Because of user input, the final JDBC v1.0 specification was released soon after.

6.7 JDBC Goals:

The goals that were set for JDBC are important. They will give you some insight as to why certain classes and functionalities behave the way they do. The eight design goals for JDBC are as follows:

SQL Level API

SQL Conformance

JDBC must be implementable on top of common database interfaces

Provide a Java interface that is consistent with the rest of the Java system

Because of Java’s acceptance in the user community thus far, the designers feel that they should not stray from the current design of the core Java system.

Keep it simple

Use strong, static typing wherever possible

Strong typing allows more error checking to be done at compile time, so fewer errors appear at runtime.

Keep the common cases simple

Finally, we decided to proceed with the implementation using Java Networking.

And for dynamically updating the cache table we go for MS Access database.

Java is two things: a programming language and a platform.

Java is a high-level programming language that is all of the following:

- Simple
- Architecture-neutral
- Object-oriented
- Portable
- Distributed
- High-performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure

Compilation happens just once; interpretation occurs each time the program is executed. The figure illustrates how this works.

6.7 NETWORKING TCP/IP STACK:

The TCP/IP stack is shorter than the OSI one:

TCP is a connection-oriented protocol; UDP (User Datagram Protocol) is a connectionless protocol.

IP datagrams:

UDP:

UDP is also connectionless and unreliable. What it adds to IP is a checksum for the contents of the datagram and port numbers. These are used to give a client/server model – see later.

TCP:

TCP supplies logic to give a reliable connection-oriented protocol above IP. It provides a virtual circuit that two processes can use to communicate.

Internet addresses

In order to use a service, you must be able to find it. The Internet uses an address scheme for machines so that they can be located. The address is a 32 bit integer which gives the IP address.

Network address:

Class A uses 8 bits for the network address with 24 bits left over for other addressing. Class B uses 16 bit network addressing. Class C uses 24 bit network addressing and class D uses all 32.
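The classful scheme described above can be read off the leading bits of the first octet. The sketch below shows one way to express that in Java; the class `Ipv4AddressClass` and its methods are hypothetical names. Following the text, class D is reported as using all 32 bits; such addresses are used for multicast groups rather than for a network/host split.

```java
/** Hypothetical sketch: classful IPv4 addressing from the first octet. */
final class Ipv4AddressClass {

    private Ipv4AddressClass() {
    }

    /** Returns 'A' to 'E' according to the leading bits of the first octet. */
    static char classOf(int firstOctet) {
        if (firstOctet < 0 || firstOctet > 255) {
            throw new IllegalArgumentException("octet out of range: " + firstOctet);
        }
        if (firstOctet < 128) {
            return 'A'; // leading bit 0
        }
        if (firstOctet < 192) {
            return 'B'; // leading bits 10
        }
        if (firstOctet < 224) {
            return 'C'; // leading bits 110
        }
        if (firstOctet < 240) {
            return 'D'; // leading bits 1110
        }
        return 'E'; // leading bits 1111, reserved
    }

    /** Number of network bits for the class of the given first octet, or -1 for class E. */
    static int networkBits(int firstOctet) {
        switch (classOf(firstOctet)) {
            case 'A':
                return 8;
            case 'B':
                return 16;
            case 'C':
                return 24;
            case 'D':
                return 32; // per the text above; no host part
            default:
                return -1;
        }
    }

    public static void main(String[] args) {
        int[] samples = {10, 172, 192, 224};
        for (int octet : samples) {
            System.out.println(octet + " -> class " + classOf(octet)
                    + ", network bits " + networkBits(octet));
        }
    }
}
```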

Subnet address:

Internally, the UNIX network is divided into sub networks. Building 11 is currently on one sub network and uses 10-bit addressing, allowing 1024 different hosts.

Host address:

8 bits are finally used for host addresses within our subnet. This places a limit of 256 machines that can be on the subnet.

Total address:

The 32 bit address is usually written as 4 integers separated by dots.
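The relationship between the 32-bit value and its dotted form can be made concrete with a short conversion sketch. `DottedQuad`, `toLong`, and `toDotted` are hypothetical names; a `long` is used because Java's `int` is signed and would show addresses above 127.255.255.255 as negative numbers.

```java
/** Hypothetical sketch: converting between dotted notation and the 32-bit value. */
final class DottedQuad {

    private DottedQuad() {
    }

    /** Packs four decimal octets, most significant first, into one 32-bit value. */
    static long toLong(String dotted) {
        String[] parts = dotted.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("expected four octets: " + dotted);
        }
        long value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("octet out of range: " + part);
            }
            value = (value << 8) | octet;
        }
        return value;
    }

    /** Splits a 32-bit value back into four octets separated by dots. */
    static String toDotted(long value) {
        return ((value >> 24) & 0xFF) + "."
                + ((value >> 16) & 0xFF) + "."
                + ((value >> 8) & 0xFF) + "."
                + (value & 0xFF);
    }

    public static void main(String[] args) {
        long packed = toLong("192.168.1.10");
        System.out.println(packed);           // prints 3232235786
        System.out.println(toDotted(packed)); // prints 192.168.1.10
    }
}
```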

Port addresses

Sockets:

#include <sys/types.h>

#include <sys/socket.h>

int socket(int family, int type, int protocol);
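For comparison with the C prototype above, Java exposes the same socket concept through the `java.net` classes. The following is a minimal sketch of a TCP round trip over the loopback interface, assuming a single client and a single line of text; `LoopbackEcho` and `echoOnce` are hypothetical names, and the server runs in a background thread only so that the example fits in one process.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

/** Hypothetical sketch: one TCP request/response over the loopback interface. */
final class LoopbackEcho {

    private LoopbackEcho() {
    }

    /** Sends one line to a local echo server and returns the line it sends back. */
    static String echoOnce(String message) throws IOException, InterruptedException {
        // Port 0 asks the operating system for any free port.
        final ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
        Thread serverThread = new Thread(new Runnable() {
            @Override
            public void run() {
                try (Socket peer = server.accept();
                     BufferedReader in = new BufferedReader(
                             new InputStreamReader(peer.getInputStream()));
                     PrintWriter out = new PrintWriter(peer.getOutputStream(), true)) {
                    out.println(in.readLine()); // echo the single line back
                } catch (IOException e) {
                    // Reached, for example, if the listening socket is closed
                    // before a client connects; ignored in this sketch.
                }
            }
        });
        serverThread.start();
        try (Socket client = new Socket(InetAddress.getLoopbackAddress(), server.getLocalPort());
             PrintWriter out = new PrintWriter(client.getOutputStream(), true);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(client.getInputStream()))) {
            out.println(message);
            return in.readLine();
        } finally {
            // Closing the listener first unblocks accept() if the client never connected.
            server.close();
            serverThread.join();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(echoOnce("hello"));
    }
}
```

Because TCP provides the reliable virtual circuit described earlier, the client can simply read the reply as a stream; a UDP version would use `DatagramSocket` and would have to handle lost datagrams itself.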

6.8 JFREE CHART:

JFreeChart is a free 100% Java chart library that makes it easy for developers to display professional quality charts in their applications. JFreeChart’s extensive feature set includes:

A consistent and well-documented API, supporting a wide range of chart types;

A flexible design that is easy to extend, and targets both server-side and client-side applications;

Support for many output types, including Swing components, image files (including PNG and JPEG), and vector graphics file formats (including PDF, EPS and SVG);

JFreeChart is “open source” or, more specifically, free software. It is distributed under the terms of the GNU Lesser General Public Licence (LGPL), which permits use in proprietary applications.

6.8.1. Map Visualizations:

6.8.2. Time Series Chart Interactivity

6.8.3. Dashboards

6.8.4. Property Editors

CHAPTER 7

7.0 APPENDIX

7.1 SAMPLE SCREEN SHOTS:

7.2 SAMPLE SOURCE CODE:

CHAPTER 8

8.1 CONCLUSION AND FUTURE WORK:

We have proposed, implemented and evaluated an initiative data prefetching approach on the storage servers for distributed file systems, which can be employed as a backend storage system in a cloud environment that may have certain resource-limited client machines. To be specific, the storage servers are capable of predicting future disk I/O access to guide fetching data in advance after analyzing the existing logs, and then they proactively push the prefetched data to relevant client file systems for satisfying future applications’ requests.

For the purpose of effectively modeling disk I/O access patterns and accurately forwarding the prefetched data, the information about client file systems is piggybacked onto relevant I/O requests and then transferred from client nodes to the corresponding storage server nodes. Therefore, the client file systems running on the client nodes neither log I/O events nor conduct I/O access prediction; consequently, the thin client nodes can focus on performing necessary tasks with limited computing capacity and energy endurance.

The initiative prefetching scheme can be applied in the distributed file system for a mobile cloud computing environment, in which there are many tablet computers and smart terminals. The current implementation of our proposed initiative prefetching scheme can classify only two access patterns and supports two corresponding prediction algorithms for predicting future disk I/O access. We are planning to work on classifying patterns for a wider range of application benchmarks in the future by utilizing the horizontal visibility graph technique. Applying network-delay-aware replica selection techniques to reduce network transfer time when prefetching data among several replicas is another task in our future work.

Privacy Policy Inference of User-Uploaded Images on Content Sharing Sites

05/08/201902/07/2019 by admin

With the increasing volume of images users share through social sites, maintaining privacy has become a major problem, as demonstrated by a recent wave of publicized incidents where users inadvertently shared personal information. In light of these incidents, the need of tools to help users control access to their shared content is apparent. Toward addressing this need, we propose an Adaptive Privacy Policy Prediction (A3P) system to help users compose privacy settings for their images. We examine the role of social context, image content, and metadata as possible indicators of users’ privacy preferences.

We propose a two-level framework which according to the user’s available history on the site, determines the best available privacy policy for the user’s images being uploaded. Our solution relies on an image classification framework for image categories which may be associated with similar policies, and on a policy prediction algorithm to automatically generate a policy for each newly uploaded image, also according to users’ social features. Over time, the generated policies will follow the evolution of users’ privacy attitude. We provide the results of our extensive evaluation over 5,000 policies, which demonstrate the effectiveness of our system, with prediction accuracies over 90 percent.

1.2 INTRODUCTION

Images are now one of the key enablers of users’ connectivity. Sharing takes place both among previously established groups of known people or social circles (e.g., Google+, Flickr or Picasa), and also increasingly with people outside the users’ social circles, for purposes of social discovery, to help them identify new peers and learn about peers’ interests and social surroundings. However, semantically rich images may reveal content-sensitive information. Consider a photo of a student’s 2012 graduation ceremony, for example.

It could be shared within a Google+ circle or Flickr group, but may unnecessarily expose the student’s family members and other friends. Sharing images within online content sharing sites, therefore, may quickly lead to unwanted disclosure and privacy violations. Further, the persistent nature of online media makes it possible for other users to collect rich aggregated information about the owner of the published content and the subjects in the published content. The aggregated information can result in unexpected exposure of one’s social environment and lead to abuse of one’s personal information.

Most content sharing websites allow users to enter their privacy preferences. Unfortunately, recent studies have shown that users struggle to set up and maintain such privacy settings. One of the main reasons provided is that given the amount of shared information this process can be tedious and error-prone. Therefore, many have acknowledged the need of policy recommendation systems which can assist users to easily and properly configure privacy settings. However, existing proposals for automating privacy settings appear to be inadequate to address the unique privacy needs of images due to the amount of information implicitly carried within images, and their relationship with the online environment wherein they are exposed.

1.3 LITERATURE SURVEY

TITLE NAME: SHEEPDOG: GROUP AND TAG RECOMMENDATION FOR FLICKR PHOTOS BY AUTOMATIC SEARCH-BASED LEARNING

AUTHOR: H.-M. Chen, M.-H. Chang, P.-C. Chang, M.-C. Tien, W. H. Hsu, and J.-L. Wu,

PUBLISH: Proc. 16th ACM Int. Conf. Multimedia, 2008, pp. 737–740.

EXPLANATION:

Online photo albums have been prevalent in recent years and have resulted in more and more applications developed to provide convenient functionalities for photo sharing. In this paper, we propose a system named SheepDog to automatically add photos into appropriate groups and recommend suitable tags for users on Flickr. We adopt concept detection to predict relevant concepts of a photo and probe into the issue about training data collection for concept classification. From the perspective of gathering training data by web searching, we introduce two mechanisms and investigate their performances of concept detection. Based on some existing information from Flickr, a ranking-based method is applied not only to obtain reliable training data, but also to provide reasonable group/tag recommendations for input photos. We evaluate this system with a rich set of photos and the results demonstrate the effectiveness of our work.

TITLE NAME: CONNECTING CONTENT TO COMMUNITY IN SOCIAL MEDIA VIA IMAGE CONTENT, USER TAGS AND USER COMMUNICATION

AUTHOR: M. D. Choudhury, H. Sundaram, Y.-R. Lin, A. John, and D. D. Seligmann

PUBLISH: Proc. IEEE Int. Conf. Multimedia Expo, 2009, pp.1238–1241.

EXPLANATION:

In this paper we develop a recommendation framework to connect image content with communities in online social media. The problem is important because users are looking for useful feedback on their uploaded content, but finding the right community for feedback is challenging for the end user. Social media are characterized by both content and community. Hence, in our approach, we characterize images through three types of features: visual features, user generated text tags, and social interaction (user communication history in the form of comments). A recommendation framework based on learning a latent space representation of the groups is developed to recommend the most likely groups for a given image. The model was tested on a large corpus of Flickr images comprising 15,689 images. Our method outperforms the baseline method, with a mean precision 0.62 and mean recall 0.69. Importantly, we show that fusing image content, text tags with social interaction features outperforms the case of only using image content or tags.

TITLE NAME: ANALYSING FACEBOOK FEATURES TO SUPPORT EVENT DETECTION FOR PHOTO-BASED FACEBOOK APPLICATIONS

AUTHOR: M. Rabbath, P. Sandhaus, and S. Boll,

PUBLISH: Proc. 2nd ACM Int. Conf. Multimedia Retrieval, 2012, pp. 11:1–11:8.

EXPLANATION:

Facebook witnesses an explosion of the number of shared photos: With 100 million photo uploads a day it creates as much as a whole Flickr every two months in terms of volume. Facebook also has one of the healthiest platforms to support third party applications, many of which deal with photos and related events. While it is essential for many Facebook applications, until now there is no easy way to detect and link photos that are related to the same events, which are usually distributed between friends and albums. In this work, we introduce an approach that exploits Facebook features to link photos related to the same event. In the current situation where the EXIF header of photos is missing in Facebook, we extract visual-based, tagged areas-based, friendship-based and structure-based features. We evaluate each of these features and use the results in our approach. We introduce and evaluate a semi-supervised probabilistic approach that takes into account the evaluation of these features. In this approach we create a lookup table of the initialization values of our model variables and make it available for other Facebook applications or researchers to use. The evaluation of our approach showed promising results and it outperformed the baseline method of using the unsupervised EM algorithm in estimating the parameters of a Gaussian mixture model. We also give two examples of the applicability of this approach to help Facebook applications in better serving the user.

CHAPTER 2

2.0 SYSTEM ANALYSIS

2.1 EXISTING SYSTEM:

Image content sharing environments such as Flickr or YouTube contain a large amount of private resources such as photos showing weddings, family holidays, and private parties. These resources can be of a highly sensitive nature, disclosing many details of the users’ private sphere. In order to support users in making privacy decisions in the context of image sharing, and to provide them with a better overview of privacy-related visual content available on the Web, techniques have been developed to automatically detect private images and to enable privacy-oriented image search.

To this end, we learn privacy classifiers trained on a large set of manually assessed Flickr photos, combining textual metadata of images with a variety of visual features. We employ the resulting classification models for specifically searching for private photos, and for diversifying query results to provide users with a better coverage of private and public content. Most content sharing websites allow users to enter their privacy preferences. Unfortunately, recent studies have shown that users struggle to set up and maintain such privacy settings.

One of the main reasons provided is that, given the amount of shared information, this process can be tedious and error-prone. Therefore, many have acknowledged the need for policy recommendation systems that can assist users in easily and properly configuring privacy settings.

2.1.1 DISADVANTAGES:

Sharing images within online content sharing sites, therefore, may quickly lead to unwanted disclosure and privacy violations.
Further, the persistent nature of online media makes it possible for other users to collect rich aggregated information about the owner of the published content and the subjects in the published content.
The aggregated information can result in unexpected exposure of one’s social environment and lead to abuse of one’s personal information.

2.2 PROPOSED SYSTEM:

We propose an Adaptive Privacy Policy Prediction (A3P) system which aims to provide users a hassle free privacy settings experience by automatically generating personalized policies. The A3P system handles user uploaded images, and factors in the following criteria that influence one’s privacy settings of images:

The impact of social environment and personal characteristics: Social context of users, such as their profile information and relationships with others may provide useful information regarding users’ privacy preferences. For example, users interested in photography may like to share their photos with other amateur photographers. Users who have several family members among their social contacts may share with them pictures related to family events. However, using common policies across all users or across users with similar traits may be too simplistic and not satisfy individual preferences.

Users may have drastically different opinions even on the same type of images. For example, a privacy adverse person may be willing to share all his personal images while a more conservative person may just want to share personal images with his family members. In light of these considerations, it is important to find the balancing point between the impact of social environment and users’ individual characteristics in order to predict the policies that match each individual’s needs.

The role of image’s content and metadata: In general, similar images often incur similar privacy preferences, especially when people appear in the images. For example, one may upload several photos of his kids and specify that only his family members are allowed to see these photos. He may upload some other photos of landscapes which he took as a hobby and for these photos, he may set privacy preference allowing anyone to view and comment the photos. Analyzing the visual content may not be sufficient to capture users’ privacy preferences. Tags and other metadata are indicative of the social context of the image, including where it was taken and why, and also provide a synthetic description of images, complementing the information obtained from visual content analysis.

2.2.1 ADVANTAGES:

The A3P-core focuses on analyzing each individual user’s own images and metadata, while the A3P-Social offers a community perspective of privacy setting recommendations for a user’s potential privacy improvement.

The A3P-core algorithm is parameterized based on user groups and factors in possible outliers, and a new A3P-social module develops the notion of social context to refine and extend the prediction power of the system.

We design the interaction flows between the two building blocks to balance the benefits from meeting personal characteristics and obtaining community advice.

2.3 HARDWARE & SOFTWARE REQUIREMENTS:

2.3.1 HARDWARE REQUIREMENT:

- Processor – Pentium IV
- Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Key Board – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA

2.3.2 SOFTWARE REQUIREMENTS:

Operating System : Windows XP or Win7
Front End : JAVA JDK 1.7
Back End : MySQL Server
Server : Apache Tomcat Server
Script : JSP Script
Document : MS-Office 2007

CHAPTER 3

3.0 SYSTEM DESIGN:

Data Flow Diagram / Use Case Diagram / Flow Diagram:

The DFD is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on these data, and the output data generated by the system.

The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components. These components are the system processes, the data used by the processes, the external entities that interact with the system, and the information flows in the system.

A DFD shows how information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.

A DFD may be used to represent a system at any level of abstraction. It may be partitioned into levels that represent increasing information flow and functional detail.

NOTATION:

SOURCE OR DESTINATION OF DATA:

External sources or destinations, which may be people or organizations or other entities

DATA STORE:

Here the data referenced by a process is stored and retrieved.

PROCESS:

People, procedures, or devices that produce data. The physical component is not identified.

DATA FLOW:

Data moves in a specific direction from an origin to a destination. The data flow is a “packet” of data.

MODELING RULES:

There are several common modeling rules when creating DFDs:

All processes must have at least one data flow in and one data flow out.
All processes should modify the incoming data, producing new forms of outgoing data.
Each data store must be involved with at least one data flow.
Each external entity must be involved with at least one data flow.
A data flow must be attached to at least one process.

3.1 ARCHITECTURE DIAGRAM

3.2 DATAFLOW DIAGRAM

ADMIN:

USER:

UML DIAGRAMS:

3.2 USE CASE DIAGRAM:

ADMIN:

USER:

3.3 CLASS DIAGRAM:

ADMIN:

USER:

3.4 SEQUENCE DIAGRAM:

ADMIN:

USER:

3.5 ACTIVITY DIAGRAM:

ADMIN:

USER:

CHAPTER 4

4.0 IMPLEMENTATION:

A3P-CORE

There are two major components in A3P-core: (i) Image classification and (ii) Adaptive policy prediction. For each user, his/her images are first classified based on content and metadata. Then, privacy policies of each category of images are analyzed for the policy prediction. Adopting a two-stage approach is more suitable for policy recommendation than applying the common one-stage data mining approaches to mine both image features and policies together. Recall that when a user uploads a new image, the user is waiting for a recommended policy.

The two-stage approach allows the system to employ the first stage to classify the new image and find the candidate sets of images for the subsequent policy recommendation. A one-stage mining approach, by contrast, would not be able to locate the right class of the new image, because its classification criteria need both image features and policies, whereas the policies of the new image are not available yet. Moreover, combining both image features and policies into a single classifier would lead to a system that is very dependent on the specific syntax of the policy. If a change in the supported policies were introduced, the whole learning model would need to change.

A3P-SOCIAL

The A3P-social employs a multi-criteria inference mechanism that generates representative policies by leveraging key information related to the user’s social context and his/her general attitude toward privacy. As mentioned earlier, A3P-social will be invoked by the A3P-core in two scenarios. One is when the user is new to a site and does not have enough images stored for the A3P-core to infer meaningful and customized policies. The other is when the system notices significant changes in the privacy trend of the user’s social circle, in which case the user may want to adjust his/her privacy settings accordingly. In what follows, we first present the types of social context considered by A3P-Social, and then present the policy recommendation process.

4.1 ALGORITHM

Our algorithm performs better for users with certain characteristics. Therefore, we study possible factors relevant to the performance of our algorithm. We used a least squares multiple regression analysis, regressing performance of the A3P-core to the following possible predictors:

4.2 MODULES:

WEB-BASED IMAGE SHARING SERVICES:

METADATA-BASED CLASSIFICATION:

CONTENT-BASED CLASSIFICATION:

ADAPTIVE POLICY PREDICTION:

4.3 MODULE DESCRIPTION:

WEB-BASED IMAGE SHARING SERVICES:

Sharing images within online content sharing sites may quickly lead to unwanted disclosure and privacy violations. Further, the persistent nature of online media makes it possible for other users to collect rich aggregated information about the owner of the published content and the subjects in the published content. The aggregated information can result in unexpected exposure of one’s social environment and lead to abuse of one’s personal information. We expected that the frequency of sharing pictures and the frequency of changing privacy settings would be significantly related to performance, but the results indicate that the frequency of social network use, the frequency of uploading images and the frequency of changing settings are not related to the performance of our algorithm in predicting privacy settings. This is a particularly useful result, as it indicates that our algorithm will perform equally well for users who frequently use social networks and share images on them and for users who may have limited access or limited information to share.

METADATA-BASED CLASSIFICATION:

We propose a hierarchical image classification which first classifies images based on their contents and then refines each category into subcategories based on their metadata. Images that do not have metadata will be grouped only by content. Such a hierarchical classification gives a higher priority to image content and minimizes the influence of missing tags. Note that some images may be included in multiple categories, as long as they contain the typical content features or metadata of those categories. The metadata-based classification groups images into subcategories under the aforementioned baseline categories.

The process consists of three main steps.

The third step is to find the subcategory that an image belongs to. This is an incremental procedure. At the beginning, the first image forms a subcategory by itself, and the representative hypernyms of the image become the subcategory’s representative hypernyms. Then, we compute the distance between the representative hypernyms of each new incoming image and those of each existing subcategory.
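The incremental procedure can be sketched as follows. This is an illustration under stated assumptions, not the project’s implementation: the text does not name the distance measure or the rule applied after the distances are computed, so the sketch assumes a Jaccard distance between hypernym sets, a hypothetical threshold of 0.5, and that an image joins the closest subcategory within the threshold or otherwise starts a new one.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of incremental subcategory assignment.
final class SubcategoryAssigner {
    // Assumed cut-off: beyond this distance an image starts a new subcategory.
    static final double THRESHOLD = 0.5;

    // Each subcategory is represented by its set of representative hypernyms.
    final List<Set<String>> subcategories = new ArrayList<>();

    // Jaccard distance (an assumed metric): 1 - |intersection| / |union|.
    static double distance(Set<String> a, Set<String> b) {
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        if (union.isEmpty()) return 0.0;
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return 1.0 - (double) common.size() / union.size();
    }

    // Returns the index of the subcategory in which the image is placed.
    int assign(Set<String> hypernyms) {
        int best = -1;
        double bestDistance = Double.MAX_VALUE;
        for (int i = 0; i < subcategories.size(); i++) {
            double d = distance(hypernyms, subcategories.get(i));
            if (d < bestDistance) {
                bestDistance = d;
                best = i;
            }
        }
        if (best >= 0 && bestDistance <= THRESHOLD) return best;
        // No close match (or no subcategory yet): the image forms its own subcategory.
        subcategories.add(new HashSet<>(hypernyms));
        return subcategories.size() - 1;
    }
}
```

The first call to `assign` always creates subcategory 0, which matches the description that the first image forms a subcategory by itself.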

CONTENT-BASED CLASSIFICATION:

Our approach to content-based classification is based on an efficient and yet accurate image similarity approach. Specifically, our classification algorithm compares image signatures defined based on quantified and sanitized version of Haar wavelet transformation. For each image, the wavelet transform encodes frequency and spatial information related to image color, size, invariant transform, shape, texture, symmetry, etc. Then, a small number of coefficients are selected to form the signature of the image. The content similarity among images is then determined by the distance among their image signatures.
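The signature idea can be illustrated with a simplified sketch. It assumes a one-dimensional Haar transform over a row of pixel intensities (a real image would need a two-dimensional transform per color channel), quantizes the retained coefficients to their sign, and counts mismatching positions as the distance. The class and method names are hypothetical and the exact signature definition used by the system is not given in the text.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical, simplified sketch of a Haar-wavelet image signature.
final class HaarSignature {
    // Full 1-D Haar decomposition; the input length must be a power of two.
    static double[] haar(double[] input) {
        double[] out = input.clone();
        double[] tmp = new double[out.length];
        for (int len = out.length; len > 1; len /= 2) {
            int half = len / 2;
            for (int i = 0; i < half; i++) {
                tmp[i] = (out[2 * i] + out[2 * i + 1]) / 2.0;        // pairwise average
                tmp[half + i] = (out[2 * i] - out[2 * i + 1]) / 2.0; // pairwise detail
            }
            System.arraycopy(tmp, 0, out, 0, len);
        }
        return out;
    }

    // Keeps the k largest-magnitude detail coefficients, quantized to +1 or -1.
    static int[] signature(double[] pixels, int k) {
        double[] c = haar(pixels);
        Integer[] idx = new Integer[c.length - 1];
        for (int i = 0; i < idx.length; i++) idx[i] = i + 1; // index 0 is the overall average
        Arrays.sort(idx, Comparator.comparingDouble((Integer i) -> -Math.abs(c[i])));
        int[] sig = new int[c.length];
        for (int j = 0; j < Math.min(k, idx.length); j++) {
            sig[idx[j]] = c[idx[j]] >= 0 ? 1 : -1;
        }
        return sig;
    }

    // Distance between two signatures: number of positions where they disagree.
    static int distance(int[] a, int[] b) {
        int d = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] != b[i]) d++;
        }
        return d;
    }
}
```

Keeping only a few quantized coefficients makes signatures small and cheap to compare, which is the property the paragraph above relies on.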

Our selected similarity criteria include texture, symmetry, shape (radial symmetry and phase congruency), and SIFT. We also account for color and size. We set the system to start from five generic image classes: (a) explicit (e.g., nudity, violence, drinking, etc.), (b) adults, (c) kids, (d) scenery (e.g., beach, mountains), and (e) animals. As a preprocessing step, we populate the five baseline classes by manually assigning to each class a number of images crawled from Google Images, resulting in about 1,000 images per class. Having a large image data set beforehand reduces the chance of misclassification. Then, we generate signatures of all the images and store them in the database.

To evaluate the accuracy of our content classifier, we conducted some preliminary tests. Precisely, we tested our classifier against a ground-truth data set, Image-net.org. In Image-net, over 10 million images are collected and classified according to the WordNet structure. For each image class, we use the first half of the images as the training data set and classify the next 800 images. The classification result was recorded as correct if the synset’s main search term or the direct hypernym was returned as a class. The average accuracy of our classifier is above 94 percent.

ADAPTIVE POLICY PREDICTION:

The policy prediction algorithm provides a predicted policy of a newly uploaded image to the user for his/her reference. More importantly, the predicted policy will reflect the possible changes of a user’s privacy concerns. The prediction process consists of three main phases: (i) policy normalization; (ii) policy mining; and (iii) policy prediction. The policy normalization is a simple decomposition process to convert a user policy into a set of atomic rules in which the data (D) component is a single-element set.
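Policy normalization, as described above, can be sketched in a few lines. The policy shape used here (a subject, a list of data items and an action) is a hypothetical simplification introduced for illustration; only the rule that each atomic rule carries a single-element data component comes from the text.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of policy normalization into atomic rules.
final class PolicyNormalizer {
    // Assumed policy shape: who (subject) may do what (action) on which data items.
    static final class Policy {
        final String subject;
        final List<String> data;
        final String action;
        Policy(String subject, List<String> data, String action) {
            this.subject = subject;
            this.data = data;
            this.action = action;
        }
    }

    // Splits a policy into atomic rules whose data component has exactly one element.
    static List<Policy> normalize(Policy policy) {
        List<Policy> atomic = new ArrayList<>();
        for (String item : policy.data) {
            atomic.add(new Policy(policy.subject, Collections.singletonList(item), policy.action));
        }
        return atomic;
    }
}
```

A policy that covers two data items therefore yields two atomic rules with the same subject and action.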

We propose a hierarchical mining approach for policy mining. Our approach leverages association rule mining techniques to discover popular patterns in policies. Policy mining is carried out within the same category of the new image because images in the same category are more likely under the similar level of privacy protection. The basic idea of the hierarchical mining is to follow a natural order in which a user defines a policy.

Given an image, a user usually first decides who can access the image, then thinks about what specific access rights (e.g., view only or download) should be given, and finally refines the access conditions, such as setting the expiration date. Correspondingly, the hierarchical mining first looks for popular subjects defined by the user, then looks for popular actions in the policies containing the popular subjects, and finally looks for popular conditions in the policies containing both the popular subjects and the popular actions.
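The subject, action, condition order can be illustrated with a small sketch. Note the substitution: instead of full association rule mining, it uses a plain frequency count at each level and keeps only the single most frequent value. The field layout `{subject, action, condition}` and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: frequency-based stand-in for hierarchical policy mining.
final class HierarchicalMiner {
    // Most frequent value of one field; ties go to the value that reaches the top count first.
    static String mostFrequent(List<String[]> policies, int field) {
        Map<String, Integer> counts = new HashMap<>();
        String best = null;
        int bestCount = 0;
        for (String[] p : policies) {
            int c = counts.merge(p[field], 1, Integer::sum);
            if (c > bestCount) {
                bestCount = c;
                best = p[field];
            }
        }
        return best;
    }

    // Keeps only the policies whose given field equals the given value.
    static List<String[]> filter(List<String[]> policies, int field, String value) {
        List<String[]> kept = new ArrayList<>();
        for (String[] p : policies) {
            if (p[field].equals(value)) kept.add(p);
        }
        return kept;
    }

    // Each policy is {subject, action, condition}; returns the popular triple.
    static String[] mine(List<String[]> policies) {
        String subject = mostFrequent(policies, 0);
        List<String[]> bySubject = filter(policies, 0, subject);
        String action = mostFrequent(bySubject, 1);
        List<String[]> byAction = filter(bySubject, 1, action);
        String condition = mostFrequent(byAction, 2);
        return new String[] {subject, action, condition};
    }
}
```

Each level only inspects the policies that survived the previous level, which mirrors the natural order in which a user defines a policy.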

CHAPTER 5

5.0 SYSTEM STUDY:

5.1 FEASIBILITY STUDY:

Three key considerations involved in the feasibility analysis are

ECONOMICAL FEASIBILITY
TECHNICAL FEASIBILITY
SOCIAL FEASIBILITY

5.1.1 ECONOMICAL FEASIBILITY:

5.1.2 TECHNICAL FEASIBILITY

5.1.3 SOCIAL FEASIBILITY:

5.2 SYSTEM TESTING:

5.2.1 UNIT TESTING:

Description	Expected result
Test for application window properties.	All the properties of the windows are to be properly aligned and displayed.
Test for mouse operations.	All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions.

5.2.2 FUNCTIONAL TESTING:

Description	Expected result
Test for all modules.	All peers should communicate in the group.
Test for various peers in a distributed network framework, displaying all users available in the group.	The result after execution should be accurate.

5.2.3 NON-FUNCTIONAL TESTING:

Load testing
Performance testing
Usability testing
Reliability testing
Security testing

5.2.4 LOAD TESTING:

Description	Expected result
It is necessary to ascertain that the application behaves correctly under loads when ‘Server busy’ response is received.	Should designate another active node as a Server.

5.2.5 PERFORMANCE TESTING:

Description	Expected result
This is required to ensure that the application performs adequately: it must be capable of handling many peers, deliver its results in the expected time, and use an acceptable level of resources. It is an aspect of operational management.	Should handle large input values and produce accurate results in the expected time.

5.2.6 RELIABILITY TESTING:

Description	Expected result
This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in providing the application.	In case of failure of the server, an alternate server should take over the job.

5.2.7 SECURITY TESTING:

Description	Expected result
Check that the user identification is authenticated.	In case of failure, the user should not be connected to the framework.
Check whether group keys in a tree are shared by all peers.	All peers in the same group should know the group key.

5.2.8 WHITE BOX TESTING:

Description	Expected result
Exercise all logical decisions on their true and false sides.	All the logical decisions must be valid.
Execute all loops at their boundaries and within their operational bounds.	All the loops must be finite.
Exercise internal data structures to ensure their validity.	All the data structures must be valid.

5.2.9 BLACK BOX TESTING:

Description	Expected result
To check for incorrect or missing functions.	All the functions must be valid.
To check for interface errors.	The entire interface must function normally.
To check for errors in data structures or external database access.	Database updates and retrievals must be carried out correctly.
To check for initialization and termination errors.	All the functions and data structures must be initialized properly and terminated normally.

All the above system testing strategies are carried out, as the development, documentation and institutionalization of the proposed goals and related policies are essential.

CHAPTER 6

6.0 SOFTWARE DESCRIPTION:

6.1 JAVA TECHNOLOGY:

Java technology is both a programming language and a platform.

The Java Programming Language

The Java programming language is a high-level language that can be characterized by all of the following buzzwords:

- Simple
- Architecture neutral
- Object oriented
- Portable
- Distributed
- High performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure

6.2 THE JAVA PLATFORM:

The Java platform has two components:

The Java Virtual Machine (Java VM)
The Java Application Programming Interface (Java API)

You’ve already been introduced to the Java VM. It’s the base for the Java platform and is ported onto various hardware-based platforms.

The following figure depicts a program that’s running on the Java platform. As the figure shows, the Java API and the virtual machine insulate the program from the hardware.

6.3 WHAT CAN JAVA TECHNOLOGY DO?

The essentials: Objects, strings, threads, numbers, input and output, data structures, system properties, date and time, and so on.
Applets: The set of conventions used by applets.
Networking: URLs, TCP (Transmission Control Protocol) and UDP (User Datagram Protocol) sockets, and IP (Internet Protocol) addresses.
Internationalization: Help for writing programs that can be localized for users worldwide. Programs can automatically adapt to specific locales and be displayed in the appropriate language.
Security: Both low level and high level, including electronic signatures, public and private key management, access control, and certificates.
Software components: Known as JavaBeans™, these can plug into existing component architectures.
Object serialization: Allows lightweight persistence and communication via Remote Method Invocation (RMI).
Java Database Connectivity (JDBC™): Provides uniform access to a wide range of relational databases.

The Java platform also has APIs for 2D and 3D graphics, accessibility, servers, collaboration, telephony, speech, animation, and more. The following figure depicts what is included in the Java 2 SDK.

6.4 HOW WILL JAVA TECHNOLOGY CHANGE MY LIFE?

Get started quickly: Although the Java programming language is a powerful object-oriented language, it’s easy to learn, especially for programmers already familiar with C or C++.
Write less code: Comparisons of program metrics (class counts, method counts, and so on) suggest that a program written in the Java programming language can be four times smaller than the same program in C++.
Write better code: The Java programming language encourages good coding practices, and its garbage collection helps you avoid memory leaks. Its object orientation, its JavaBeans component architecture, and its wide-ranging, easily extendible API let you reuse other people’s tested code and introduce fewer bugs.
Develop programs more quickly: Your development may be as much as twice as fast as writing the same program in C++. Why? You write fewer lines of code, and Java is a simpler programming language than C++.
Avoid platform dependencies with 100% Pure Java: You can keep your program portable by avoiding the use of libraries written in other languages. The 100% Pure Java™ Product Certification Program has a repository of historical process manuals, white papers, brochures, and similar materials online.
Write once, run anywhere: Because 100% Pure Java programs are compiled into machine-independent byte codes, they run consistently on any Java platform.
Distribute software more easily: You can upgrade applets easily from a central server. Applets take advantage of the feature of allowing new classes to be loaded “on the fly,” without recompiling the entire program.

6.5 ODBC:

6.6 JDBC:

JDBC was announced in March of 1996. It was released for a 90 day public review that ended June 8, 1996. Because of user input, the final JDBC v1.0 specification was released soon after.

6.7 JDBC Goals:

The goals that were set for JDBC are important. They will give you some insight as to why certain classes and functionalities behave the way they do. The eight design goals for JDBC are as follows:

SQL Level API

SQL Conformance

JDBC must be implementable on top of common database interfaces

Provide a Java interface that is consistent with the rest of the Java system

Because of Java’s acceptance in the user community thus far, the designers feel that they should not stray from the current design of the core Java system.

Keep it simple

Use strong, static typing wherever possible

Strong typing allows more error checking to be done at compile time, so fewer errors appear at runtime.

Keep the common cases simple

Finally, we decided to proceed with the implementation using Java Networking.

For dynamically updating the cache table, we use an MS Access database.

Java is two things: a programming language and a platform.

Java is a high-level programming language that is all of the following:

- Simple
- Architecture-neutral
- Object-oriented
- Portable
- Distributed
- High-performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure

Compilation happens just once; interpretation occurs each time the program is executed. The figure illustrates how this works.

6.7 NETWORKING TCP/IP STACK:

The TCP/IP stack is shorter than the OSI one:

TCP is a connection-oriented protocol; UDP (User Datagram Protocol) is a connectionless protocol.

IP datagrams:

UDP:

UDP is also connectionless and unreliable. What it adds to IP is a checksum for the contents of the datagram and port numbers. These are used to give a client/server model – see later.

TCP:

TCP supplies logic to give a reliable connection-oriented protocol above IP. It provides a virtual circuit that two processes can use to communicate.

Internet addresses

In order to use a service, you must be able to find it. The Internet uses an address scheme for machines so that they can be located. The address is a 32 bit integer which gives the IP address.

Network address:

Class A uses 8 bits for the network address, with 24 bits left over for other addressing. Class B uses 16-bit network addressing and class C uses 24-bit network addressing. Class D addresses are reserved for multicast groups rather than being split into network and host parts.

Subnet address:

Internally, the UNIX network is divided into sub networks. Building 11 is currently on one sub network and uses 10-bit addressing, allowing 1024 different hosts.

Host address:

8 bits are finally used for host addresses within our subnet. This places a limit of 256 machines that can be on the subnet.

Total address:

The 32 bit address is usually written as 4 integers separated by dots.
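The relationship between the 32-bit integer and the dotted notation can be shown with a short sketch. The class and method names are hypothetical and the sample address is arbitrary; each of the four integers is simply one byte of the address, most significant byte first.

```java
// Hypothetical sketch: converting between a 32-bit IPv4 address and dotted notation.
final class DottedQuad {
    // Formats the four bytes of the address, most significant first.
    static String toDotted(int address) {
        return ((address >>> 24) & 0xFF) + "." + ((address >>> 16) & 0xFF) + "."
                + ((address >>> 8) & 0xFF) + "." + (address & 0xFF);
    }

    // Parses "a.b.c.d" back into a 32-bit integer.
    static int fromDotted(String text) {
        String[] parts = text.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("expected four integers separated by dots");
        }
        int address = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("each integer must be between 0 and 255");
            }
            address = (address << 8) | octet;
        }
        return address;
    }

    public static void main(String[] args) {
        System.out.println(toDotted(fromDotted("192.168.0.1")));
    }
}
```

Because Java’s `int` is signed, addresses whose first integer is 128 or higher are stored as negative numbers; the unsigned shift `>>>` keeps the formatting correct.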

Port addresses

Sockets:

#include <sys/types.h>
#include <sys/socket.h>

int socket(int family, int type, int protocol);
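Since the project is implemented with Java Networking, a Java counterpart of the C call above may be helpful. The following sketch uses the standard `java.net.ServerSocket` and `java.net.Socket` classes to open a TCP connection over the loopback interface and echo one line. The class name `LoopbackEcho` and the one-shot echo behaviour are assumptions made for illustration, not code from the project.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: a one-shot TCP echo over the loopback interface.
final class LoopbackEcho {
    private static BufferedReader reader(Socket s) throws IOException {
        return new BufferedReader(new InputStreamReader(s.getInputStream(), StandardCharsets.UTF_8));
    }

    private static PrintWriter writer(Socket s) throws IOException {
        // The second argument enables auto-flush on println.
        return new PrintWriter(new OutputStreamWriter(s.getOutputStream(), StandardCharsets.UTF_8), true);
    }

    // Starts a server on an ephemeral port, sends one line to it and returns the reply.
    static String echoOnce(String message) {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            Thread serverThread = new Thread(() -> {
                try (Socket peer = server.accept()) {
                    writer(peer).println(reader(peer).readLine()); // echo a single line
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            serverThread.start();
            try (Socket client = new Socket(InetAddress.getLoopbackAddress(), server.getLocalPort())) {
                writer(client).println(message);
                String reply = reader(client).readLine();
                serverThread.join();
                return reply;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(echoOnce("hello"));
    }
}
```

Port 0 asks the operating system for any free port, so the sketch does not depend on a particular port number being available.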

6.8 JFREE CHART:

JFreeChart is a free 100% Java chart library that makes it easy for developers to display professional quality charts in their applications. JFreeChart’s extensive feature set includes:

A consistent and well-documented API, supporting a wide range of chart types;

A flexible design that is easy to extend, and targets both server-side and client-side applications;

Support for many output types, including Swing components, image files (including PNG and JPEG), and vector graphics file formats (including PDF, EPS and SVG);

JFreeChart is “open source” or, more specifically, free software. It is distributed under the terms of the GNU Lesser General Public Licence (LGPL), which permits use in proprietary applications.

6.8.1. Map Visualizations:

6.8.2. Time Series Chart Interactivity

6.8.3. Dashboards

6.8.4. Property Editors

CHAPTER 7

7.0 APPENDIX

7.1 SAMPLE SCREEN SHOTS:

7.2 SAMPLE SOURCE CODE:

CHAPTER 8

8.1 CONCLUSION

With A3P-Social, we achieve a much higher accuracy, demonstrating that simply considering privacy inclination is not enough and that “social context” truly matters. Precisely, the overall accuracy of A3P-social is above 95 percent. For 88.6 percent of the users, all predicted policies are correct, and the number of missed policies is 33 (out of over 2,600 predictions). We also note that in this case, there is no significant difference across image types.

We compared the performance of A3P-Social with alternative, popular recommendation methods. Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them. In our case, the vectors are the users’ attributes defining their social profile. The algorithm using cosine similarity scans all users’ profiles and computes the cosine similarity of the social contexts between the new user and the existing users. Then, it finds the two users with the highest similarity scores with respect to the candidate user and feeds the associated images to the remaining functions in the A3P-core.
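The cosine-similarity baseline can be sketched as follows. The encoding of a social profile as a numeric vector is an assumption; the text does not specify how user attributes are turned into numbers. The class and method names are hypothetical.

```java
// Hypothetical sketch of the cosine-similarity baseline described above.
final class CosineSimilarity {
    // Cosine of the angle between two equal-length vectors; 0 if either vector is all zeros.
    static double cosine(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) return 0.0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Indices of the two existing users most similar to the new user (-1 if missing).
    static int[] topTwo(double[] newUser, double[][] existing) {
        int first = -1;
        int second = -1;
        double firstScore = Double.NEGATIVE_INFINITY;
        double secondScore = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < existing.length; i++) {
            double s = cosine(newUser, existing[i]);
            if (s > firstScore) {
                second = first;
                secondScore = firstScore;
                first = i;
                firstScore = s;
            } else if (s > secondScore) {
                second = i;
                secondScore = s;
            }
        }
        return new int[] {first, second};
    }
}
```

Because the measure depends only on the angle between vectors, two profiles that differ only in overall scale are treated as identical.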

We have proposed an Adaptive Privacy Policy Prediction (A3P) system that helps users automate the privacy policy settings for their uploaded images. The A3P system provides a comprehensive framework to infer privacy preferences based on the information available for a given user. We also effectively tackled the issue of cold-start, leveraging social context information. Our experimental study proves that our A3P is a practical tool that offers significant improvements over current approaches to privacy.

Predicting Asthma-Related Emergency Department Visits Using Big Data

05/08/201902/07/2019 by admin

Asthma is one of the most prevalent and costly chronic conditions in the United States, and it cannot be cured. However, accurate and timely surveillance data could allow for timely and targeted interventions at the community or individual level. Current national asthma disease surveillance systems can have data availability lags of up to two weeks. Rapid progress has been made in gathering non-traditional, digital information to perform disease surveillance.

We introduce a novel method of using multiple data sources for predicting the number of asthma-related emergency department (ED) visits in a specific area. Twitter data, Google search interests and environmental sensor data were collected for this purpose. Our preliminary findings show that our model can predict the number of asthma ED visits based on near-real-time environmental and social media data with approximately 70% precision. The results can be helpful for public health surveillance, emergency department preparedness, and targeted patient interventions.

1.2 INTRODUCTION:

Asthma is one of the most prevalent and costly chronic conditions in the United States, with 25 million people affected. Asthma accounts for about two million emergency department (ED) visits, half a million hospitalizations, and 3,500 deaths, and incurs more than 50 billion dollars in direct medical costs annually. Moreover, asthma is a leading cause of lost productivity, with nearly 11 million missed school days and more than 14 million missed work days every year. Although asthma cannot be cured, many of its adverse events can be prevented by appropriate medication use and avoidance of environmental triggers. The prediction of population- and individual-level risk for asthma adverse events using accurate and timely surveillance data could guide timely and targeted interventions to reduce the societal burden of asthma. At the population level, current national asthma disease surveillance programs rely on weekly reports to the Centers for Disease Control and Prevention (CDC) of data collected from various local resources by state health departments.

Notoriously, such data have a lag-time of weeks, therefore providing retrospective information that is not amenable to proactive and timely preventive interventions. At the individual level, known predictors of asthma ED visits and hospitalizations include past acute care utilization, medication use, and sociodemographic characteristics. Common data sources for these variables include electronic medical records (EMR), medical insurance claims data, and population surveys, all of which, also, are subject to significant time lag. In an ongoing quality improvement project for asthma care, Parkland Center for Clinical Innovation (PCCI) researchers have built an asthma predictive model relying on a combination of EMR and claim data to predict the risk for asthma-related ED visits within three months of data collection [Unpublished reports from PCCI]. Although the model performance (C-statistic 72%) and prediction timeframe (three months) are satisfying, a narrower prediction timeframe potentially could provide additional risk-stratification for more efficiency and timeliness in resource deployment. For instance, resources might be prioritized to first serve patients at high risk for an asthma ED visit within 2 weeks of data collection, while being safely deferred for patients with a later predicted high-risk period.

Novel sources of timely data on population- and individual-level asthma activities are needed to provide additional temporal and geographical granularity to asthma risk stratification. Short of collecting information directly from individual patients (a time- and resource-intensive endeavor), readily available public data will have to be repurposed intelligently to provide the required information. There has been increasing interest in gathering non-traditional, digital information to perform disease surveillance. These include diverse datasets such as those stemming from social media, internet search, and environmental data. Twitter is an online social media platform that enables users to post and read 140-character messages called “tweets”. It is a popular data source for disease surveillance using social media since it can provide nearly instant access to real-time social opinions. More importantly, tweets are often tagged by geographic location and time stamps potentially providing information for disease surveillance.

Another notable non-traditional disease surveillance system has been a data-aggregating tool called Google Flu Trends, which uses aggregated search data to estimate flu activity. Google Trends was quite successful in its estimation of influenza-like illness. It is based on Google’s search engine, which tracks how often a particular search term is entered relative to the total search volume across a particular area. This enables access to the latest data from web search interest trends on a variety of topics, including diseases like asthma. Air pollutants are known triggers for asthma symptoms and exacerbations. The United States Environmental Protection Agency (EPA) provides access to monitored air quality data collected at outdoor sensors across the country, which could be used as a data source for asthma prediction. Meanwhile, as health reform progresses, the quantity and variety of health records being made available electronically are increasing dramatically. In contrast to traditional disease surveillance systems, these new data sources have the potential to enable health organizations to respond to chronic conditions, like asthma, in real time. This in turn implies that health organizations can appropriately plan for staffing and equipment availability in a flexible manner. They can also provide early warning signals to the people at risk for asthma adverse events, and enable timely, proactive, and targeted preventive and therapeutic interventions.

1.3 LITERATURE SURVEY:

USE OF HANGEUL TWITTER TO TRACK AND PREDICT HUMAN INFLUENZA INFECTION

AUTHOR: Kim, Eui-Ki, et al.

PUBLISH: PloS one vol. 8, no.7, e69305, 2013.

EXPLANATION:

Influenza epidemics arise through the accumulation of viral genetic changes. The emergence of new virus strains coincides with a higher level of influenza-like illness (ILI), which is seen as a peak of a normal season. Monitoring the spread of an epidemic influenza in populations is a difficult and important task. Twitter is a free social networking service whose messages can improve the accuracy of forecasting models by providing early warnings of influenza outbreaks. In this study, we have examined the use of information embedded in the Hangeul Twitter stream to detect rapidly evolving public awareness or concern with respect to influenza transmission and developed regression models that can track levels of actual disease activity and predict influenza epidemics in the real world. Our prediction model using a delay mode provides not only a real-time assessment of the current influenza epidemic activity but also a significant improvement in prediction performance at the initial phase of ILI peak when prediction is of most importance.

A NEW AGE OF PUBLIC HEALTH: IDENTIFYING DISEASE OUTBREAKS BY ANALYZING TWEETS

AUTHOR: Krieck, Manuela, Johannes Dreesman, Lubomir Otrusina, and Kerstin Denecke.

PUBLISH: In Proceedings of Health Web-Science Workshop, ACM Web Science Conference. 2011.

EXPLANATION:

Traditional disease surveillance is a very time-consuming reporting process. Cases of notifiable diseases are reported to the different levels in the national health care system before actions can be taken. However, early detection of disease activity followed by a rapid response is crucial to reduce the impact of epidemics. To address this challenge, alternative sources of information are investigated for disease surveillance. In this paper, the relevance of Twitter messages for outbreak detection is investigated from two directions. First, Twitter messages potentially related to disease outbreaks are retrospectively searched and analyzed. Second, incoming Twitter messages are assessed with respect to their relevance for outbreak detection. The studies show that Twitter messages can be, to a certain extent, highly relevant for the early detection of hints of public health threats. According to the German Protection against Infection Act (Infektionsschutzgesetz (IfSG), 2001), traditional disease surveillance relies on data from mandatory reporting of cases by physicians and laboratories. They inform local county health departments (Landkreis), which in turn report to state health departments (Land). At the end of the reporting pipeline, the national surveillance institute (Robert Koch Institute) is informed about the outbreak. These different stages of reporting take time and delay a timely reaction.

TOWARDS DETECTING INFLUENZA EPIDEMICS BY ANALYZING TWITTER MESSAGES

AUTHOR: Culotta, Aron.

PUBLISH: In Proceedings of the first workshop on social media analytics, pp. 115-122. ACM, 2010.

EXPLANATION:

Rapid response to a health epidemic is critical to reduce loss of life. Existing methods mostly rely on expensive surveys of hospitals across the country, typically with lag times of one to two weeks for influenza reporting, and even longer for less common diseases. In response, there have been several recently proposed solutions to estimate a population’s health from Internet activity, most notably Google’s Flu Trends service, which correlates search term frequency with influenza statistics reported by the Centers for Disease Control and Prevention (CDC). In this paper, we analyze messages posted on the micro-blogging site Twitter.com to determine if a similar correlation can be uncovered. We propose several methods to identify influenza-related messages and compare a number of regression models to correlate these messages with CDC statistics. Using over 500,000 messages spanning 10 weeks, we find that our best model achieves a correlation of .78 with CDC statistics by leveraging a document classifier to identify relevant messages.

CHAPTER 2

2.0 SYSTEM ANALYSIS

2.1 EXISTING SYSTEM:

With the increased availability of information on the Web in recent years, a new research area has developed, namely infodemiology. It can be defined as the “science of distribution and determinants of information in an electronic medium, specifically the Internet, or in a population, with the ultimate aim to inform public health and public policy”. As part of this research area, several kinds of data have been studied for their applicability in the context of disease surveillance. Google Flu Trends exploits search behavior to monitor current flu-related disease activity. Carneiro and Mylonakis showed that Google Flu Trends can detect regional outbreaks of influenza 7–10 days before conventional Centers for Disease Control and Prevention surveillance systems.

The relevance of Twitter messages for disease outbreak detection has already been reported; tweets in particular are useful for predicting outbreaks such as a Norovirus outbreak at a university. Chew et al. analysed Twitter messages during the 2009 influenza epidemic. They compared the use of the terms “H1N1” and “swine flu” over time. Furthermore, they analysed the content of the tweets (ten content concepts) and validated Twitter as a source of real-time content. They analysed the data via Infovigil, an infosurveillance system, using automated coding. To find out whether there is a relationship between automated and manual coding, the tweets were evaluated with Pearson’s correlation. Chew et al. found a significant correlation between the two codings in seven content concepts. It remains to be investigated whether this source might be of relevance for detecting disease outbreaks in Germany. Therefore, only German keywords are exploited to identify Twitter messages. Further, we are interested not only in influenza-like illnesses, as in the studies available so far, but also in other infectious diseases (e.g. Norovirus and Salmonella).
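For reference, Pearson’s correlation between two series, such as automated and manual coding counts, can be computed as in the sketch below. The class name is hypothetical, and any input passed to it is illustrative sample data rather than values from the cited studies.

```java
// Hypothetical sketch: Pearson's correlation coefficient between two equal-length series.
final class PearsonCorrelation {
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double cov = 0.0;
        double varX = 0.0;
        double varY = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;   // co-variation of the two series
            varX += dx * dx;  // variation of x around its mean
            varY += dy * dy;  // variation of y around its mean
        }
        // The result is NaN when either series is constant (zero variance).
        return cov / Math.sqrt(varX * varY);
    }
}
```

A value near +1 means the two series rise and fall together, a value near -1 means they move in opposite directions, and a value near 0 means there is no linear relationship.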

2.1.1 DISADVANTAGES:

Twitter messages have a common format:

[username] [text] [date time client]

The length is restricted to 140 characters. In terms of linguistics, each Twitter user can write as he or she likes, so messages range from complete sentences to lists of keywords. Hashtags, i.e. terms prefixed with a hash sign (e.g. #flu), denote topics and are primarily used by experienced users. Categorized by content, Twitter messages can:

• Provide information,

• Express opinions, or

• Report personal issues.

When information is provided, its authority normally cannot be determined, so it might be unverified. Opinions are often expressed with humor or sarcasm and may be highly contradictory in the emotions they express.
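The message properties described above (a 140-character limit and hashtags that mark topics) can be illustrated with a minimal Java sketch. The class and method names are hypothetical and not part of the original system; the length check counts UTF-16 code units, which is a simplification of how Twitter counts characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; names are assumptions, not the report's code.
class TweetTextSketch {
    static final int MAX_LENGTH = 140;

    // A hash sign followed by one or more word characters, e.g. "#flu".
    private static final Pattern HASHTAG = Pattern.compile("#(\\w+)");

    /** True if the text fits the 140-character restriction (UTF-16 units). */
    static boolean withinLimit(String text) {
        return text.length() <= MAX_LENGTH;
    }

    /** Returns the hashtags in the text, lower-cased and without the hash sign. */
    static List<String> hashtags(String text) {
        List<String> tags = new ArrayList<String>();
        Matcher m = HASHTAG.matcher(text);
        while (m.find()) {
            tags.add(m.group(1).toLowerCase(Locale.ROOT));
        }
        return tags;
    }

    public static void main(String[] args) {
        String text = "Home sick with #Flu and #fever";
        System.out.println(withinLimit(text) + " " + hashtags(text));
    }
}
```

Lower-casing the tags makes "#Flu" and "#flu" count as the same topic, which is usually what a keyword-based surveillance filter wants.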

2.2 PROPOSED SYSTEM:

Our proposed method leverages social media, internet search, and environmental air quality data to estimate ED visits for asthma in a relatively discrete geographic area (a metropolitan area) within a relatively short time period (days). To this end, we have gathered asthma-related ED visit data, social media data from Twitter, internet users’ search interests from Google, and pollution sensor data from the EPA, all from the same geographic area and time period, to create a model for predicting asthma-related ED visits. This work is different from extant studies that typically predict the spread of contagious diseases using social media such as Twitter. Unlike influenza or other viral diseases, asthma is a non-communicable health condition, and we demonstrate the utility and value of linking big data from diverse sources in developing predictive models for non-communicable diseases, with a specific focus on asthma.

Research studies have explored the use of novel data sources to propose rapid, cost-effective health status surveillance methodologies. Some of the early studies rely on document classification, suggesting that Twitter data can be highly relevant for early detection of public health threats. Others employ more complex linguistic analysis, such as the Ailment Topic Aspect Model, which is useful for syndromic surveillance. This type of analysis is useful for demonstrating the significance of social media as a promising new data source for health surveillance. Other recent studies have linked social media data with real-world disease incidence to generate actionable knowledge useful for making health care decisions. These include work that analyzed Twitter messages related to influenza and correlated them with reported CDC statistics, and work that validated Twitter as a real-time content, sentiment, and public attention trend-tracking tool. Collier employed supervised classifiers (SVM and Naive Bayes) to classify tweets into four self-reported protective behavior categories. This study adds to evidence supporting a high degree of correlation between pre-diagnostic social media signals and diagnostic influenza case data.

2.2.1 ADVANTAGES:

Our work uses a combination of data from multiple sources to predict the number of asthma-related ED visits in near real-time. In doing so, we exploit geographic information associated with each dataset. We describe the techniques to process multiple types of datasets, to extract signals from each, integrate, and feed into a prediction model using machine learning algorithms, and demonstrate the feasibility of such a prediction.

The main contributions of this work are:

• Analysis of tweets with respect to their relevance for disease surveillance,

• Content analysis and content classification of tweets,

• Linguistic analysis of disease-reporting twitter messages,

• Recommendations on search patterns for tweet search in the context of disease surveillance.

2.3 HARDWARE & SOFTWARE REQUIREMENTS:

2.3.1 HARDWARE REQUIREMENT:

- Processor – Pentium IV
- Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Key Board – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA

2.3.2 SOFTWARE REQUIREMENTS:

Operating System : Windows XP or Win7
Front End : JAVA JDK 1.7
Back End : MYSQL Server
Server : Apache Tomcat Server
Script : JSP Script
Document : MS-Office 2007

CHAPTER 3

3.0 SYSTEM DESIGN:

Data Flow Diagram / Use Case Diagram / Flow Diagram:

The DFD is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on these data, and the output data generated by the system.

The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components. These components are the system process, the data used by the process, an external entity that interacts with the system and the information flows in the system.

DFD shows how the information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.

A DFD may be used to represent a system at any level of abstraction. A DFD may be partitioned into levels that represent increasing information flow and functional detail.

NOTATION:

SOURCE OR DESTINATION OF DATA:

External sources or destinations, which may be people or organizations or other entities

DATA SOURCE:

Here the data referenced by a process is stored and retrieved.

PROCESS:

People, procedures, or devices that produce data; the physical component is not identified.

DATA FLOW:

Data moves in a specific direction from an origin to a destination. The data flow is a “packet” of data.

MODELING RULES:

There are several common modeling rules when creating DFDs:

All processes must have at least one data flow in and one data flow out.
All processes should modify the incoming data, producing new forms of outgoing data.
Each data store must be involved with at least one data flow.
Each external entity must be involved with at least one data flow.
A data flow must be attached to at least one process.
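The structural rules above can be checked mechanically. The following Java sketch is one possible way to do so, under the assumption that a diagram is represented as named nodes plus directed flows; all class, enum, and node names are hypothetical. The rule that a process should modify its incoming data is a semantic property and is not checked here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only; the representation is an assumption.
class DfdRulesSketch {
    enum Kind { PROCESS, DATA_STORE, EXTERNAL_ENTITY }

    /** A directed data flow between two named nodes. */
    static final class Flow {
        final String from;
        final String to;
        Flow(String from, String to) { this.from = from; this.to = to; }
    }

    /** Returns rule violations; an empty list means the diagram passes. */
    static List<String> validate(Map<String, Kind> nodes, List<Flow> flows) {
        List<String> problems = new ArrayList<String>();
        Set<String> hasIn = new HashSet<String>();
        Set<String> hasOut = new HashSet<String>();
        for (Flow f : flows) {
            hasOut.add(f.from);
            hasIn.add(f.to);
            // Rule: a data flow must be attached to at least one process.
            if (nodes.get(f.from) != Kind.PROCESS && nodes.get(f.to) != Kind.PROCESS) {
                problems.add("flow " + f.from + " -> " + f.to + " touches no process");
            }
        }
        for (Map.Entry<String, Kind> e : nodes.entrySet()) {
            String name = e.getKey();
            boolean in = hasIn.contains(name);
            boolean out = hasOut.contains(name);
            if (e.getValue() == Kind.PROCESS) {
                // Rule: every process needs at least one flow in and one flow out.
                if (!in || !out) {
                    problems.add("process " + name + " needs a flow in and a flow out");
                }
            } else if (!in && !out) {
                // Rule: data stores and external entities need at least one flow.
                problems.add(name + " is not involved in any data flow");
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        Map<String, Kind> nodes = new LinkedHashMap<String, Kind>();
        nodes.put("User", Kind.EXTERNAL_ENTITY);
        nodes.put("Filter Tweet", Kind.PROCESS);
        nodes.put("Tweet Store", Kind.DATA_STORE);
        List<Flow> flows = Arrays.asList(
                new Flow("User", "Filter Tweet"),
                new Flow("Filter Tweet", "Tweet Store"));
        System.out.println(validate(nodes, flows));
    }
}
```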

3.1 ARCHITECTURE DIAGRAM:

3.2 DATAFLOW DIAGRAM:

UML DIAGRAMS:

3.2 USE CASE DIAGRAM:

3.3 CLASS DIAGRAM:

3.4 SEQUENCE DIAGRAM:

3.5 ACTIVITY DIAGRAM:

[Activity diagram; recoverable node labels: Alert Email, Filter Tweet, Asthma Tweets, New Tweet, Friend Follow, Friends list]

CHAPTER 4

4.0 IMPLEMENTATION:

DISEASE CONTROL AND PREVENTION (CDC):

Current national asthma disease surveillance programs rely on weekly reports to the Centers for Disease Control and Prevention (CDC) of data collected from various local resources by state health departments [4]. Notoriously, such data have a lag time of weeks, therefore providing retrospective information that is not amenable to proactive and timely preventive interventions. At the individual level, known predictors of asthma ED visits and hospitalizations include past acute care utilization, medication use, and sociodemographic characteristics. Common data sources for these variables include electronic medical records (EMR), medical insurance claims data, and population surveys, all of which are also subject to significant time lag. In an ongoing quality improvement project for asthma care, Parkland Center for Clinical Innovation (PCCI) researchers have built an asthma predictive model relying on a combination of EMR and claims data to predict the risk for asthma-related ED visits within three months of data collection.

Although the model performance (C-statistic 72%) and prediction timeframe (three months) are satisfactory, a narrower prediction timeframe could potentially provide additional risk stratification for more efficiency and timeliness in resource deployment. For instance, resources might be prioritized to first serve patients at high risk for an asthma ED visit within 2 weeks of data collection, while being safely deferred for patients with a later predicted high-risk period. Novel sources of timely data on population- and individual-level asthma activities are needed to provide additional temporal and geographical granularity to asthma risk stratification. Short of collecting information directly from individual patients (a time- and resource-intensive endeavor), readily available public data will have to be repurposed intelligently to provide the required information.

4.1 ALGORITHM:

MACHINE LEARNING ALGORITHMS:

Our research objective is to leverage social media, internet search, and environmental air quality data to estimate ED visits for asthma in a relatively discrete geographic area (a metropolitan area) within a relatively short time period (days). To this end, we have gathered asthma related ED visits data, social media data from Twitter, internet users’ search interests from Google and pollution sensor data from the EPA, all from the same geographic area and time period, to create a model for predicting asthma related ED visits. This work is different from extant studies that typically predict the spread of contagious diseases using social media such as Twitter. Unlike influenza or other viral diseases, asthma is a non-communicable health condition and we demonstrate the utility and value of linking big data from diverse sources in developing predictive models for non-communicable diseases with a specific focus on asthma.

4.2 MODULES:

EMERGENCY DEPARTMENT VISITS:

ENVIRONMENTAL SENSORS (EMR):

OUR PREDICTION SENSOR DATA:

ASTHMA PREDICTION RESULTS:

4.3 MODULE DESCRIPTION:

EMERGENCY DEPARTMENT VISITS:

We introduce a novel method of using multiple data sources to predict the number of asthma-related emergency department (ED) visits in a specific area. Twitter data, Google search interests, and environmental sensor data were collected for this purpose. Asthma is a leading cause of lost productivity, with nearly 11 million missed school days and more than 14 million missed work days every year. Although asthma cannot be cured, many of its adverse events can be prevented by appropriate medication use and avoidance of environmental triggers.

The prediction of population- and individual-level risk for asthma adverse events using accurate and timely surveillance data could guide timely and targeted interventions, to reduce the societal burden of asthma. At the population level, current national asthma disease surveillance programs rely on weekly reports to the Centers for Disease Control and Prevention (CDC) of data collected from various local resources by state health departments. Notoriously, such data have a lag-time of weeks, therefore providing retrospective information that is not amenable to proactive and timely preventive interventions. At the individual level, known predictors of asthma ED visits and hospitalizations include past acute care utilization, medication use, and sociodemographic characteristics.

ENVIRONMENTAL SENSORS (EMR):

Common data sources for these variables include electronic medical records (EMR), medical insurance claims data, and population surveys, all of which are also subject to significant time lag. In an ongoing quality improvement project for asthma care, Parkland Center for Clinical Innovation (PCCI) researchers have built an asthma predictive model relying on a combination of EMR and claims data to predict the risk for asthma-related ED visits within three months of data collection [Unpublished reports from PCCI]. Although the model performance (C-statistic 72%) and prediction timeframe (three months) are satisfactory, a narrower prediction timeframe could potentially provide additional risk stratification for more efficiency and timeliness in resource deployment.

For instance, resources might be prioritized to first serve patients at high risk for an asthma ED visit within 2 weeks of data collection, while being safely deferred for patients with a later predicted high-risk period. Novel sources of timely data on population- and individual-level asthma activities are needed to provide additional temporal and geographical granularity to asthma risk stratification. Short of collecting information directly from individual patients (a time- and resource-intensive endeavor), readily available public data will have to be repurposed intelligently to provide the required information. There has been increasing interest in gathering non-traditional, digital information to perform disease surveillance. These include diverse datasets such as those stemming from social media, internet search, and environmental data. Twitter is an online social media platform that enables users to post and read 140-character messages called “tweets”. It is a popular data source for disease surveillance using social media since it can provide nearly instant access to real-time social opinions. More importantly, tweets are often tagged by geographic location and time stamps potentially providing information for disease surveillance.

OUR PREDICTION SENSOR DATA:

We first analyzed the association between the asthma-related ED visits and data from Twitter, Google trends, and Air Quality sensors, using the Pearson correlation coefficient. We also examined the association between asthma-related tweet counts and ED visit counts for abdominal pain/constipation patients, to control for non-asthma-specific variations in ED visit counts. Then, we designed and implemented a prediction model to estimate the incidence of asthma ED visits at CMC using a combination of independent variables from the above data sources.
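As a minimal illustration of the association measure named above, the following Java sketch computes the Pearson correlation coefficient between two equal-length daily series. The class name and the sample numbers are made up for illustration and are not data from the study.

```java
// Illustrative sketch only; sample values are invented.
class PearsonSketch {
    /**
     * Pearson correlation coefficient r between two equal-length series.
     * Returns NaN if either series is constant (zero variance).
     */
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("series must have equal length >= 2");
        }
        int n = x.length;
        double meanX = 0;
        double meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double cov = 0;
        double varX = 0;
        double varY = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        // The 1/n factors cancel, so the sums can be used directly.
        return cov / Math.sqrt(varX * varY);
    }

    public static void main(String[] args) {
        double[] dailyAsthmaTweets = {12, 15, 9, 20, 18}; // hypothetical counts
        double[] dailyEdVisits = {30, 34, 25, 41, 39};    // hypothetical counts
        System.out.printf("r = %.3f%n", pearson(dailyAsthmaTweets, dailyEdVisits));
    }
}
```

The same function can be applied to each candidate signal (tweets, search interest, individual pollutants) against the ED visit series to decide which inputs are worth feeding into a prediction model.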

Twitter offers streaming APIs to give developers and researchers low latency access to its global stream of data. Public streams, which can provide access to the public data flowing through Twitter, were used in this study. Studies have estimated that using Twitter’s Streaming API, researchers can expect to receive 1% of the tweets in near real-time. Twitter4j, an unofficial Java library for the Twitter API, was used to access tweet information from the Twitter Streaming API.

Two different Twitter data sets were collected in this study:

(1) The general twitter stream: a simple collection of JSON grabbed from the general Twitter stream. The general tweet counts were used to estimate the Twitter population and further normalize asthma tweet counts.

(2) The asthma-related stream: to collect only tweets containing any of 19 related keywords that were suggested by our clinical collaborators from PCCI. The asthma stream is limited to 1% of full tweets as well.
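The two streams described above can be sketched in plain Java as a keyword filter plus a normalization step. This is a hedged illustration, not the study's code: the three keywords are placeholders (the 19 keywords used in the study are not listed in the text), and normalizing to a rate per 10,000 general tweets is an assumption, since the text does not give the exact normalization.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative sketch only; keywords and the per-10k scale are assumptions.
class AsthmaStreamSketch {
    // Placeholder keywords, not the study's actual keyword list.
    static final Set<String> KEYWORDS =
            new HashSet<String>(Arrays.asList("asthma", "inhaler", "wheezing"));

    /** Case-insensitive substring match against the keyword set. */
    static boolean isAsthmaRelated(String text) {
        String lower = text.toLowerCase(Locale.ROOT);
        for (String keyword : KEYWORDS) {
            if (lower.contains(keyword)) {
                return true;
            }
        }
        return false;
    }

    /** Counts the tweets that match at least one keyword. */
    static long countAsthmaTweets(List<String> tweets) {
        long count = 0;
        for (String tweet : tweets) {
            if (isAsthmaRelated(tweet)) {
                count++;
            }
        }
        return count;
    }

    /** Asthma tweets per 10,000 general tweets collected in the same window. */
    static double ratePer10k(long asthmaCount, long generalCount) {
        if (generalCount <= 0) {
            throw new IllegalArgumentException("general count must be positive");
        }
        return 10000.0 * asthmaCount / generalCount;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "My asthma is acting up today",
                "Great game last night",
                "Forgot my INHALER at home");
        long asthma = countAsthmaTweets(sample);
        System.out.println(asthma + " asthma-related tweets, rate per 10k = "
                + ratePer10k(asthma, sample.size()));
    }
}
```

Dividing by the general tweet count keeps a day with unusually heavy overall Twitter traffic from looking like a spike in asthma activity.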

ASTHMA PREDICTION RESULTS:

Based on the results of the correlation analysis, asthma tweets, CO, NO2, and PM2.5 were selected as inputs to our prediction model. We report results only for the Decision Tree and Artificial Neural Network (ANN) techniques, as the Naive Bayes and SVM techniques did not yield good prediction results. First, a backward feature selection algorithm was used to examine whether the addition of Twitter data would improve the prediction. As shown in Table VI, combining air quality data with tweets resulted in higher prediction accuracy. Additionally, we evaluated prediction precision. Given that our prediction task is a three-way classification, each technique yielded different accuracy and/or precision in different classes (Table VII). Decision Tree performed well in predicting the “High” class, while ANN, after Adaptive Boosting, worked well for the “Low” class. Stacking the two techniques performed well for the “Medium” class.
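To make the two evaluation measures concrete, the sketch below computes overall accuracy and per-class precision for a three-way High/Medium/Low labelling. The class name and the example labels are hypothetical; they are not results from the study.

```java
// Illustrative sketch only; labels in main are invented.
class ClassMetricsSketch {
    /** Fraction of days whose predicted class equals the actual class. */
    static double accuracy(String[] actual, String[] predicted) {
        int correct = 0;
        for (int i = 0; i < actual.length; i++) {
            if (actual[i].equals(predicted[i])) {
                correct++;
            }
        }
        return (double) correct / actual.length;
    }

    /**
     * Precision for one class: correct predictions of the class divided by
     * all predictions of the class. Returns NaN if the class is never predicted.
     */
    static double precision(String[] actual, String[] predicted, String label) {
        int truePositives = 0;
        int predictedAsLabel = 0;
        for (int i = 0; i < actual.length; i++) {
            if (predicted[i].equals(label)) {
                predictedAsLabel++;
                if (actual[i].equals(label)) {
                    truePositives++;
                }
            }
        }
        return predictedAsLabel == 0 ? Double.NaN : (double) truePositives / predictedAsLabel;
    }

    public static void main(String[] args) {
        String[] actual = {"High", "High", "Low", "Medium"};
        String[] predicted = {"High", "Low", "Low", "Medium"};
        System.out.println("accuracy = " + accuracy(actual, predicted));
        for (String label : new String[] {"High", "Medium", "Low"}) {
            System.out.println(label + " precision = " + precision(actual, predicted, label));
        }
    }
}
```

Reporting precision per class is what allows one technique to be judged better for "High" days and another for "Low" days even when their overall accuracy is similar.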

The results of our analysis are promising because they perform with a fairly high level of accuracy overall. As noted in the introduction, traditional asthma ED visit models are useful for predicting events in a three-month window and have an accuracy of approximately 70%. It is to be noted that “traditional models” estimate a risk score for an asthma ED visit for each individual patient, whereas our “Twitter/Environmental data model” predicts the risk of the daily number of ED visits being High, Medium, or Low. The former is an individual-level risk model, while the latter is a population-level risk model. Our population-level asthma risk prediction model has the potential to complement current individual-level models, and may lead to a shorter time window and better accuracy of prediction. This in turn has implications for better planning and proactive treatment in specific geo-locations at specific time periods.

CHAPTER 5

5.0 SYSTEM STUDY:

5.1 FEASIBILITY STUDY:

Three key considerations involved in the feasibility analysis are

ECONOMICAL FEASIBILITY
TECHNICAL FEASIBILITY
SOCIAL FEASIBILITY

5.1.1 ECONOMICAL FEASIBILITY:

5.1.2 TECHNICAL FEASIBILITY

5.1.3 SOCIAL FEASIBILITY:

5.2 SYSTEM TESTING:

5.2.1 UNIT TESTING:

Description	Expected result
Test for application window properties.	All the properties of the windows are to be properly aligned and displayed.
Test for mouse operations.	All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions.

5.2.2 FUNCTIONAL TESTING:

Description	Expected result
Test for all modules.	All peers should communicate in the group.
Test for various peers in a distributed network framework, as it displays all users available in the group.	The result after execution should be accurate.

5.2.3 NON-FUNCTIONAL TESTING:

Load testing
Performance testing
Usability testing
Reliability testing
Security testing

5.2.4 LOAD TESTING:

Description	Expected result
It is necessary to ascertain that the application behaves correctly under load when a ‘Server busy’ response is received.	Should designate another active node as a server.

5.2.5 PERFORMANCE TESTING:

Description	Expected result
This is required to ensure that the application performs adequately: it must be able to handle many peers, deliver its results in the expected time, and use an acceptable level of resources. It is an aspect of operational management.	Should handle large input values and produce accurate results in the expected time.

5.2.6 RELIABILITY TESTING:

Description	Expected result
This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in providing the application.	In case of failure of the server, an alternate server should take over the job.

5.2.7 SECURITY TESTING:

Description	Expected result
Check that the user identification is authenticated.	In case of failure, the user should not be connected to the framework.
Check whether group keys in a tree are shared by all peers.	Peers in the same group should know the group key.

5.2.8 WHITE BOX TESTING:

Description	Expected result
Exercise all logical decisions on their true and false sides.	All the logical decisions must be valid.
Execute all loops at their boundaries and within their operational bounds.	All the loops must be finite.
Exercise internal data structures to ensure their validity.	All the data structures must be valid.

5.2.9 BLACK BOX TESTING:

Description	Expected result
To check for incorrect or missing functions.	All the functions must be valid.
To check for interface errors.	The entire interface must function normally.
To check for errors in data structures or external database access.	Database updates and retrieval must work correctly.
To check for initialization and termination errors.	All the functions and data structures must be initialized properly and terminated normally.

All the above system testing strategies are carried out, as the development, documentation, and institutionalization of the proposed goals and related policies are essential.

CHAPTER 6

6.0 SOFTWARE DESCRIPTION:

6.1 JAVA TECHNOLOGY:

Java technology is both a programming language and a platform.

The Java Programming Language

The Java programming language is a high-level language that can be characterized by all of the following buzzwords:

- Simple
- Architecture neutral
- Object oriented
- Portable
- Distributed
- High performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure

6.2 THE JAVA PLATFORM:

The Java platform has two components:

The Java Virtual Machine (Java VM)
The Java Application Programming Interface (Java API)

You’ve already been introduced to the Java VM. It’s the base for the Java platform and is ported onto various hardware-based platforms.

The following figure depicts a program that’s running on the Java platform. As the figure shows, the Java API and the virtual machine insulate the program from the hardware.

6.3 WHAT CAN JAVA TECHNOLOGY DO?

The essentials: Objects, strings, threads, numbers, input and output, data structures, system properties, date and time, and so on.
Applets: The set of conventions used by applets.
Networking: URLs, TCP (Transmission Control Protocol), UDP (User Datagram Protocol) sockets, and IP (Internet Protocol) addresses.
Internationalization: Help for writing programs that can be localized for users worldwide. Programs can automatically adapt to specific locales and be displayed in the appropriate language.
Security: Both low level and high level, including electronic signatures, public and private key management, access control, and certificates.
Software components: Known as JavaBeans™, can plug into existing component architectures.
Object serialization: Allows lightweight persistence and communication via Remote Method Invocation (RMI).
Java Database Connectivity (JDBC™): Provides uniform access to a wide range of relational databases.

The Java platform also has APIs for 2D and 3D graphics, accessibility, servers, collaboration, telephony, speech, animation, and more. The following figure depicts what is included in the Java 2 SDK.

6.4 HOW WILL JAVA TECHNOLOGY CHANGE MY LIFE?

Get started quickly: Although the Java programming language is a powerful object-oriented language, it’s easy to learn, especially for programmers already familiar with C or C++.
Write less code: Comparisons of program metrics (class counts, method counts, and so on) suggest that a program written in the Java programming language can be four times smaller than the same program in C++.
Write better code: The Java programming language encourages good coding practices, and its garbage collection helps you avoid memory leaks. Its object orientation, its JavaBeans component architecture, and its wide-ranging, easily extendible API let you reuse other people’s tested code and introduce fewer bugs.
Develop programs more quickly: Your development may be as much as twice as fast as writing the same program in C++. Why? You write fewer lines of code, and Java is a simpler programming language than C++.
Avoid platform dependencies with 100% Pure Java: You can keep your program portable by avoiding the use of libraries written in other languages. The 100% Pure Java™ Product Certification Program has a repository of historical process manuals, white papers, brochures, and similar materials online.
Write once, run anywhere: Because 100% Pure Java programs are compiled into machine-independent byte codes, they run consistently on any Java platform.
Distribute software more easily: You can upgrade applets easily from a central server. Applets take advantage of the feature of allowing new classes to be loaded “on the fly,” without recompiling the entire program.

6.5 ODBC:

6.6 JDBC:

JDBC was announced in March of 1996. It was released for a 90 day public review that ended June 8, 1996. Because of user input, the final JDBC v1.0 specification was released soon after.

6.7 JDBC Goals:

The goals that were set for JDBC are important. They will give you some insight as to why certain classes and functionalities behave the way they do. The eight design goals for JDBC are as follows:

SQL Level API

SQL Conformance

JDBC must be implementable on top of common database interfaces

Provide a Java interface that is consistent with the rest of the Java system

Because of Java’s acceptance in the user community thus far, the designers feel that they should not stray from the current design of the core Java system.

Keep it simple

Use strong, static typing wherever possible

Strong typing allows more error checking to be done at compile time, so fewer errors appear at runtime.

Keep the common cases simple

Finally, we decided to proceed with the implementation using Java networking.

For dynamically updating the cache table, we use an MS Access database.

Java is two things: a programming language and a platform.

Java is a high-level programming language that is all of the following:

- Simple
- Architecture-neutral
- Object-oriented
- Portable
- Distributed
- High-performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure

Compilation happens just once; interpretation occurs each time the program is executed. The figure illustrates how this works.

6.7 NETWORKING TCP/IP STACK:

The TCP/IP stack is shorter than the OSI one:

TCP is a connection-oriented protocol; UDP (User Datagram Protocol) is a connectionless protocol.

IP datagrams:

UDP:

UDP is also connectionless and unreliable. What it adds to IP is a checksum for the contents of the datagram and port numbers. These are used to give a client/server model – see later.
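The checksum mentioned above is a 16-bit one's-complement sum. The Java sketch below shows that arithmetic on an arbitrary byte array; it is a simplified illustration with a hypothetical class name, and it omits the pseudo-header (source and destination addresses, protocol, length) that a real UDP checksum also covers.

```java
// Illustrative sketch only; omits the UDP pseudo-header.
class InternetChecksumSketch {
    /** 16-bit one's-complement checksum over the given bytes. */
    static int checksum(byte[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i += 2) {
            int high = data[i] & 0xFF;
            // An odd trailing byte is padded with a zero low byte.
            int low = (i + 1 < data.length) ? (data[i + 1] & 0xFF) : 0;
            sum += (high << 8) | low;
        }
        // Fold any carries above 16 bits back into the low 16 bits.
        while ((sum >> 16) != 0) {
            sum = (sum & 0xFFFF) + (sum >> 16);
        }
        // The checksum is the one's complement of the folded sum.
        return (int) (~sum & 0xFFFF);
    }

    public static void main(String[] args) {
        byte[] data = {0x00, 0x01, (byte) 0xF2, 0x03,
                       (byte) 0xF4, (byte) 0xF5, (byte) 0xF6, (byte) 0xF7};
        System.out.printf("checksum = 0x%04x%n", checksum(data));
    }
}
```

A receiver can run the same sum over the data together with the transmitted checksum; if nothing was corrupted, the folded sum is 0xFFFF, so this function returns 0.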

TCP:

TCP supplies logic to give a reliable connection-oriented protocol above IP. It provides a virtual circuit that two processes can use to communicate.

Internet addresses

In order to use a service, you must be able to find it. The Internet uses an address scheme for machines so that they can be located. In IPv4, the address is a 32-bit integer, which gives the IP address.

Network address:

Class A uses 8 bits for the network address, with 24 bits left over for other addressing. Class B uses 16-bit network addressing and class C uses 24-bit network addressing. Class D, which is reserved for multicast, uses all 32 bits.
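Under classful addressing, the class of an IPv4 address follows from the leading bits of its first octet. The Java sketch below illustrates that mapping and the network-prefix lengths listed above; the class and method names are hypothetical.

```java
// Illustrative sketch only; names are assumptions.
class AddressClassSketch {
    /** Returns the classful address class (A to E) for a first octet 0..255. */
    static char addressClass(int firstOctet) {
        if (firstOctet < 0 || firstOctet > 255) {
            throw new IllegalArgumentException("octet out of range: " + firstOctet);
        }
        if (firstOctet < 128) return 'A'; // leading bit   0
        if (firstOctet < 192) return 'B'; // leading bits  10
        if (firstOctet < 224) return 'C'; // leading bits  110
        if (firstOctet < 240) return 'D'; // leading bits  1110 (multicast)
        return 'E';                       // leading bits  1111 (reserved)
    }

    /** Number of network-address bits for classes A, B, and C. */
    static int networkBits(char addressClass) {
        switch (addressClass) {
            case 'A': return 8;
            case 'B': return 16;
            case 'C': return 24;
            default:
                throw new IllegalArgumentException("no network/host split for class " + addressClass);
        }
    }

    public static void main(String[] args) {
        int[] samples = {10, 172, 192, 224};
        for (int octet : samples) {
            System.out.println(octet + " -> class " + addressClass(octet));
        }
    }
}
```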

Subnet address:

Internally, the UNIX network is divided into sub networks. Building 11 is currently on one sub network and uses 10-bit addressing, allowing 1024 different hosts.

Host address:

8 bits are finally used for host addresses within our subnet. This places a limit of 256 machines that can be on the subnet.

Total address:

The 32 bit address is usually written as 4 integers separated by dots.
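The dotted notation and the host counts quoted above are simple bit arithmetic. The following Java sketch, with hypothetical names, converts between a 32-bit integer and its dotted form and computes how many addresses a given number of host bits allows (2 to the power of the bit count).

```java
// Illustrative sketch only; names are assumptions.
class DottedQuadSketch {
    /** Formats a 32-bit address as four decimal octets separated by dots. */
    static String toDotted(int address) {
        return ((address >>> 24) & 0xFF) + "."
                + ((address >>> 16) & 0xFF) + "."
                + ((address >>> 8) & 0xFF) + "."
                + (address & 0xFF);
    }

    /** Parses dotted-decimal notation back into a 32-bit integer. */
    static int fromDotted(String dotted) {
        String[] parts = dotted.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("expected four octets: " + dotted);
        }
        int address = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("octet out of range: " + part);
            }
            address = (address << 8) | octet;
        }
        return address;
    }

    /** Number of distinct addresses available with the given number of host bits. */
    static long addressCount(int hostBits) {
        if (hostBits < 0 || hostBits > 32) {
            throw new IllegalArgumentException("host bits must be in 0..32");
        }
        return 1L << hostBits;
    }

    public static void main(String[] args) {
        System.out.println(toDotted(0xC0A8010A));
        System.out.println(addressCount(10) + " addresses with 10 host bits");
        System.out.println(addressCount(8) + " addresses with 8 host bits");
    }
}
```

In practice the all-zeros and all-ones host values are usually reserved for the network and broadcast addresses, so the number of usable hosts is two fewer than the raw count.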

Port addresses

Sockets:

#include <sys/types.h>

#include <sys/socket.h>

int socket(int family, int type, int protocol);

6.8 JFREE CHART:

JFreeChart is a free 100% Java chart library that makes it easy for developers to display professional quality charts in their applications. JFreeChart’s extensive feature set includes:

A consistent and well-documented API, supporting a wide range of chart types;

A flexible design that is easy to extend, and targets both server-side and client-side applications;

Support for many output types, including Swing components, image files (including PNG and JPEG), and vector graphics file formats (including PDF, EPS and SVG);

JFreeChart is “open source” or, more specifically, free software. It is distributed under the terms of the GNU Lesser General Public Licence (LGPL), which permits use in proprietary applications.

6.8.1. Map Visualizations:

6.8.2. Time Series Chart Interactivity

6.8.3. Dashboards

6.8.4. Property Editors

CHAPTER 7

7.0 APPENDIX

7.1 SAMPLE SCREEN SHOTS:

7.2 SAMPLE SOURCE CODE:

CHAPTER 8

8.1 CONCLUSION AND FUTURE:

In this study, we have provided preliminary evidence that social media and environmental data can be leveraged to accurately predict asthma ED visits at a population level. We are in the process of confirming these preliminary findings by collecting larger clinical datasets across different seasons and multiple hospitals. Our continued work is focused on extending this research to propose a temporal prediction model that analyzes the trends in tweets and air quality index changes, and estimates the time lag between these changes and the number of asthma ED visits.

We also are collecting air quality index data over a longer time period to examine the effects of seasonal variations. In addition, we would like to explore the effect of relevant data from other types of social media interactions, e.g. blogs and discussion forums, on our asthma visit prediction model. Additional studies are needed to examine how combining real-time or near-real-time social media and environmental data with more traditional data might affect the performance and timing of current individual-level prediction models for asthma, and eventually, for other chronic conditions. In future projects, we intend to extend our work to diseases with geographical and temporal variability, e.g., COPD and diabetes.

Performing Initiative Data Prefetching in Distributed File Systems for Cloud Computing

05/08/201902/07/2019 by admin

1.2 INTRODUCTION

1.3 LITERATURE SURVEY

PARTIAL REPLICATION OF METADATA TO ACHIEVE HIGH METADATA AVAILABILITY IN PARALLEL FILE SYSTEMS

AUTHOR: J. Liao, Y. Ishikawa

PUBLISH: In the Proceedings of 41st International Conference on Parallel Processing (ICPP ’12), pp. 168–177, 2012.

EXPLANATION:

EVALUATING PERFORMANCE AND ENERGY IN FILE SYSTEM SERVER WORKLOADS

AUTHOR: P. Sehgal, V. Tarasov, E. Zadok

PUBLISH: the 8th USENIX Conference on File and Storage Technologies (FAST ’10), pp.253-266, 2010.

EXPLANATION:

FLEXIBLE, WIDE-AREA STORAGE FOR DISTRIBUTED SYSTEMS WITH WHEELFS

AUTHOR: J. Stribling, Y. Sovran, I. Zhang and R. Morris et al

PUBLISH: In Proceedings of the 6th USENIX symposium on Networked systems design and implementation (NSDI’09), USENIX Association, pp. 43–58, 2009.

EXPLANATION:

CHAPTER 2

2.0 SYSTEM ANALYSIS

2.1 EXISTING SYSTEM:

2.1.1 DISADVANTAGES:

Network delay in numerically and geographically remote file system access

Mobile devices generally have limited processing power, battery life and storage

2.2 PROPOSED SYSTEM:

This paper makes the following two contributions:

2.2.1 ADVANTAGES:

Applications with intensive read workloads can automatically yield not only better use of available bandwidth, but also fewer file operations via batched I/O requests through prefetching.

Cloud computing offers an illusion of infinite computing resources

2.3 HARDWARE & SOFTWARE REQUIREMENTS:

2.3.1 HARDWARE REQUIREMENT:

- Processor – Pentium IV
- Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Key Board – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA

2.3.2 SOFTWARE REQUIREMENTS:

JAVA

Operating System : Windows XP or Win7
Front End : JAVA JDK 1.7
Script : Java Script
Document : MS-Office 2007

CHAPTER 3

3.0 SYSTEM DESIGN:

Data Flow Diagram / Use Case Diagram / Flow Diagram:

The DFD is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on these data, and the output data generated by the system.

The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components. These components are the system process, the data used by the process, an external entity that interacts with the system and the information flows in the system.

DFD shows how the information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.

A DFD may be used to represent a system at any level of abstraction. A DFD may be partitioned into levels that represent increasing information flow and functional detail.

NOTATION:

SOURCE OR DESTINATION OF DATA:

External sources or destinations, which may be people or organizations or other entities

DATA SOURCE:

Here the data referenced by a process is stored and retrieved.

PROCESS:

People, procedures, or devices that produce data; the physical component is not identified.

DATA FLOW:

Data moves in a specific direction from an origin to a destination. The data flow is a “packet” of data.

MODELING RULES:

There are several common modeling rules when creating DFDs:

All processes must have at least one data flow in and one data flow out.
All processes should modify the incoming data, producing new forms of outgoing data.
Each data store must be involved with at least one data flow.
Each external entity must be involved with at least one data flow.
A data flow must be attached to at least one process.

3.1 ARCHITECTURE DIAGRAM

3.2 DATAFLOW DIAGRAM

UML DIAGRAMS:

3.2 USE CASE DIAGRAM:

3.3 CLASS DIAGRAM:

3.4 SEQUENCE DIAGRAM:

3.5 ACTIVITY DIAGRAM:

CHAPTER 4

4.0 IMPLEMENTATION:

I/O ACCESS PREDICTION

4.1 ALGORITHM

MARKOV MODEL PREDICTION ALGORITHM

LINEAR PREDICTION ALGORITHM

4.2 MODULES:

SERVER CLIENT MODULE:

DISTRIBUTED FILE SYSTEMS:

INITIATIVE DATA PREFETCHING:

ANALYSIS OF PREDICTIONS:

4.3 MODULE DESCRIPTION:

SERVER CLIENT MODULE:

DISTRIBUTED FILE SYSTEMS:

INITIATIVE DATA PREFETCHING:

ANALYSIS OF PREDICTIONS:

CHAPTER 5

5.0 SYSTEM STUDY:

5.1 FEASIBILITY STUDY:

Three key considerations involved in the feasibility analysis are

ECONOMICAL FEASIBILITY
TECHNICAL FEASIBILITY
SOCIAL FEASIBILITY

5.1.1 ECONOMICAL FEASIBILITY:

5.1.2 TECHNICAL FEASIBILITY

5.1.3 SOCIAL FEASIBILITY:

5.2 SYSTEM TESTING:

5.2.1 UNIT TESTING:

Description	Expected result
Test for application window properties.	All the properties of the windows are to be properly aligned and displayed.
Test for mouse operations.	All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions.

5.1.2 FUNCTIONAL TESTING:

Description	Expected result
Test for all modules.	All peers should communicate in the group.
Test the various peers in a distributed network framework, which should display all users available in the group.	Execution should give an accurate result.

5.1.3 NON-FUNCTIONAL TESTING:

Load testing
Performance testing
Usability testing
Reliability testing
Security testing

5.1.4 LOAD TESTING:

Description	Expected result
It is necessary to ascertain that the application behaves correctly under loads when ‘Server busy’ response is received.	Should designate another active node as a Server.

5.1.5 PERFORMANCE TESTING:

Description	Expected result
This is required to assure that the application performs adequately: it can handle many peers, delivers its results in the expected time and uses an acceptable level of resources. It is an aspect of operational management.	Should handle large input values and produce accurate results in the expected time.

5.1.6 RELIABILITY TESTING:

Description	Expected result
This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in providing the application.	In case of failure of the server, an alternate server should take over the job.

5.1.7 SECURITY TESTING:

Description	Expected result
Checking that the user identification is authenticated.	In case of failure, the user should not be connected to the framework.
Check whether group keys in a tree are shared by all peers.	The peers should know group key in the same group.

5.1.8 WHITE BOX TESTING:

Description	Expected result
Exercise all logical decisions on their true and false sides.	All the logical decisions must be valid.
Execute all loops at their boundaries and within their operational bounds.	All the loops must be finite.
Exercise internal data structures to ensure their validity.	All the data structures must be valid.

5.1.9 BLACK BOX TESTING:

Description	Expected result
To check for incorrect or missing functions.	All the functions must be valid.
To check for interface errors.	The entire interface must function normally.
To check for errors in data structures or external database access.	Database updates and retrievals must complete correctly.
To check for initialization and termination errors.	All the functions and data structures must be initialized properly and terminated normally.

All of the above system testing strategies are carried out, as the development, documentation and institutionalization of the proposed goals and related policies are essential.

CHAPTER 6

6.0 SOFTWARE DESCRIPTION:

6.1 JAVA TECHNOLOGY:

Java technology is both a programming language and a platform.

The Java Programming Language

The Java programming language is a high-level language that can be characterized by all of the following buzzwords:

- Simple
- Architecture neutral
- Object oriented
- Portable
- Distributed
- High performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure

6.2 THE JAVA PLATFORM:

The Java platform has two components:

The Java Virtual Machine (Java VM)
The Java Application Programming Interface (Java API)

You’ve already been introduced to the Java VM. It’s the base for the Java platform and is ported onto various hardware-based platforms.

The following figure depicts a program that’s running on the Java platform. As the figure shows, the Java API and the virtual machine insulate the program from the hardware.

6.3 WHAT CAN JAVA TECHNOLOGY DO?

The essentials: Objects, strings, threads, numbers, input and output, data structures, system properties, date and time, and so on.
Applets: The set of conventions used by applets.
Networking: URLs, TCP (Transmission Control Protocol) and UDP (User Datagram Protocol) sockets, and IP (Internet Protocol) addresses.
Internationalization: Help for writing programs that can be localized for users worldwide. Programs can automatically adapt to specific locales and be displayed in the appropriate language.
Security: Both low level and high level, including electronic signatures, public and private key management, access control, and certificates.
Software components: Known as JavaBeans™, can plug into existing component architectures.
Object serialization: Allows lightweight persistence and communication via Remote Method Invocation (RMI).
Java Database Connectivity (JDBC™): Provides uniform access to a wide range of relational databases.

The Java platform also has APIs for 2D and 3D graphics, accessibility, servers, collaboration, telephony, speech, animation, and more. The following figure depicts what is included in the Java 2 SDK.

6.4 HOW WILL JAVA TECHNOLOGY CHANGE MY LIFE?

Get started quickly: Although the Java programming language is a powerful object-oriented language, it’s easy to learn, especially for programmers already familiar with C or C++.
Write less code: Comparisons of program metrics (class counts, method counts, and so on) suggest that a program written in the Java programming language can be four times smaller than the same program in C++.
Write better code: The Java programming language encourages good coding practices, and its garbage collection helps you avoid memory leaks. Its object orientation, its JavaBeans component architecture, and its wide-ranging, easily extendible API let you reuse other people’s tested code and introduce fewer bugs.
Develop programs more quickly: Your development time may be as much as twice as fast versus writing the same program in C++. Why? You write fewer lines of code and it is a simpler programming language than C++.
Avoid platform dependencies with 100% Pure Java: You can keep your program portable by avoiding the use of libraries written in other languages. The 100% Pure Java™ Product Certification Program has a repository of historical process manuals, white papers, brochures, and similar materials online.
Write once, run anywhere: Because 100% Pure Java programs are compiled into machine-independent byte codes, they run consistently on any Java platform.
Distribute software more easily: You can upgrade applets easily from a central server. Applets take advantage of the feature of allowing new classes to be loaded “on the fly,” without recompiling the entire program.

6.5 ODBC:

6.6 JDBC:

JDBC was announced in March of 1996. It was released for a 90 day public review that ended June 8, 1996. Because of user input, the final JDBC v1.0 specification was released soon after.

6.7 JDBC Goals:

The goals that were set for JDBC are important. They will give you some insight as to why certain classes and functionalities behave the way they do. The eight design goals for JDBC are as follows:

SQL Level API

SQL Conformance

JDBC must be implementable on top of common database interfaces

Provide a Java interface that is consistent with the rest of the Java system

Because of Java’s acceptance in the user community thus far, the designers feel that they should not stray from the current design of the core Java system.

Keep it simple

Use strong, static typing wherever possible

Strong typing allows more error checking to be done at compile time, so fewer errors appear at runtime.

Keep the common cases simple

Finally, we decided to proceed with the implementation using Java Networking.

For dynamically updating the cache table, we use an MS Access database.

Java is two things: a programming language and a platform.

Java is a high-level programming language that is all of the following:

- Simple
- Architecture-neutral
- Object-oriented
- Portable
- Distributed
- High-performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure

Compilation happens just once; interpretation occurs each time the program is executed. The figure illustrates how this works.

6.7 NETWORKING TCP/IP STACK:

The TCP/IP stack is shorter than the OSI one:

TCP is a connection-oriented protocol; UDP (User Datagram Protocol) is a connectionless protocol.

IP datagrams:

UDP:

UDP is also connectionless and unreliable. What it adds to IP is a checksum for the contents of the datagram and port numbers. These are used to give a client/server model – see later.

TCP:

TCP supplies logic to give a reliable connection-oriented protocol above IP. It provides a virtual circuit that two processes can use to communicate.
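The difference between the two transports can be sketched with the standard java.net classes. This is a minimal illustration, not part of the project's code: the class name LoopbackDemo and its two methods are hypothetical, and both exchanges run over the loopback interface on ports chosen by the operating system.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

class LoopbackDemo {

    // TCP: a connection is established first, then a line is echoed over the byte stream.
    static String tcpEcho(String msg) {
        try (final ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            Thread echo = new Thread(new Runnable() {
                public void run() {
                    try (Socket s = server.accept();
                         BufferedReader in = new BufferedReader(
                                 new InputStreamReader(s.getInputStream(), "UTF-8"));
                         PrintWriter out = new PrintWriter(s.getOutputStream(), true)) {
                        out.println(in.readLine());
                    } catch (Exception e) {
                        // demo only: errors on the echo side are ignored
                    }
                }
            });
            echo.start();
            try (Socket c = new Socket(server.getInetAddress(), server.getLocalPort());
                 PrintWriter out = new PrintWriter(c.getOutputStream(), true);
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(c.getInputStream(), "UTF-8"))) {
                out.println(msg);
                return in.readLine();
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // UDP: no connection; one self-contained datagram is addressed to a port.
    static String udpSend(String msg) {
        try (DatagramSocket receiver = new DatagramSocket(0, InetAddress.getLoopbackAddress());
             DatagramSocket sender = new DatagramSocket()) {
            receiver.setSoTimeout(2000); // delivery is not guaranteed, so do not wait forever
            byte[] data = msg.getBytes("UTF-8");
            sender.send(new DatagramPacket(data, data.length,
                    receiver.getLocalAddress(), receiver.getLocalPort()));
            DatagramPacket in = new DatagramPacket(new byte[512], 512);
            receiver.receive(in);
            return new String(in.getData(), 0, in.getLength(), "UTF-8");
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("TCP reply: " + tcpEcho("hello"));
        System.out.println("UDP datagram: " + udpSend("ping"));
    }
}
```

The TCP side needs an accepted connection before any data moves, while the UDP side simply addresses a datagram to a port and relies on a receive timeout because nothing guarantees delivery.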

Internet addresses

In order to use a service, you must be able to find it. The Internet uses an address scheme for machines so that they can be located. The address is a 32 bit integer which gives the IP address.

Network address:

Class A uses 8 bits for the network address with 24 bits left over for other addressing. Class B uses 16 bit network addressing. Class C uses 24 bit network addressing and class D uses all 32.

Subnet address:

Internally, the UNIX network is divided into sub networks. Building 11 is currently on one sub network and uses 10-bit addressing, allowing 1024 different hosts.

Host address:

8 bits are finally used for host addresses within our subnet. This places a limit of 256 machines that can be on the subnet.

Total address:

The 32 bit address is usually written as 4 integers separated by dots.
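As a small worked example of the address layout described above, the following sketch converts between a 32-bit integer and the dotted form and reports the classful address class from the leading bits. The class and method names (DottedQuad, parse, toDotted, addressClass) are illustrative assumptions, not part of the original material.

```java
class DottedQuad {

    // Packs "a.b.c.d" into one 32-bit integer, most significant octet first.
    static int parse(String dotted) {
        String[] parts = dotted.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("expected four octets: " + dotted);
        }
        int addr = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("octet out of range: " + part);
            }
            addr = (addr << 8) | octet;
        }
        return addr;
    }

    // Writes the 32-bit address as four integers separated by dots.
    static String toDotted(int addr) {
        return ((addr >>> 24) & 0xFF) + "." + ((addr >>> 16) & 0xFF) + "."
                + ((addr >>> 8) & 0xFF) + "." + (addr & 0xFF);
    }

    // Classful addressing: the class follows from the first octet.
    static char addressClass(int addr) {
        int first = (addr >>> 24) & 0xFF;
        if (first < 128) return 'A';   // leading bit 0: 8-bit network part
        if (first < 192) return 'B';   // leading bits 10: 16-bit network part
        if (first < 224) return 'C';   // leading bits 110: 24-bit network part
        if (first < 240) return 'D';   // leading bits 1110
        return 'E';                    // remaining range
    }

    public static void main(String[] args) {
        int addr = parse("172.16.0.1");
        System.out.println(toDotted(addr) + " is class " + addressClass(addr));
    }
}
```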

Port addresses

Sockets:

#include <sys/types.h>

#include <sys/socket.h>

int socket(int family, int type, int protocol);

6.8 JFREE CHART:

JFreeChart is a free 100% Java chart library that makes it easy for developers to display professional quality charts in their applications. JFreeChart’s extensive feature set includes:

A consistent and well-documented API, supporting a wide range of chart types;

A flexible design that is easy to extend, and targets both server-side and client-side applications;

Support for many output types, including Swing components, image files (including PNG and JPEG), and vector graphics file formats (including PDF, EPS and SVG);

JFreeChart is “open source” or, more specifically, free software. It is distributed under the terms of the GNU Lesser General Public Licence (LGPL), which permits use in proprietary applications.

6.8.1. Map Visualizations:

6.8.2. Time Series Chart Interactivity

6.8.3. Dashboards

6.8.4. Property Editors

CHAPTER 7

7.0 APPENDIX

7.1 SAMPLE SCREEN SHOTS:

7.2 SAMPLE SOURCE CODE:

CHAPTER 8

8.1 CONCLUSION AND FUTURE WORK:

Passive IP Traceback Disclosing the Locations of IP Spoofers From Path Backscatter

05/08/201902/07/2019 by admin

It has long been known that attackers may use forged source IP addresses to conceal their real locations. To capture the spoofers, a number of IP traceback mechanisms have been proposed. However, due to the challenges of deployment, no IP traceback solution has been widely adopted, at least at the Internet level. As a result, the mist over the locations of spoofers has never been dissipated.

This paper proposes passive IP traceback (PIT) that bypasses the deployment difficulties of IP traceback techniques. PIT investigates Internet Control Message Protocol error messages (named path backscatter) triggered by spoofing traffic, and tracks the spoofers based on publicly available information (e.g., topology). In this way, PIT can find the spoofers without any deployment requirement.

This paper illustrates the causes, collection, and the statistical results on path backscatter, demonstrates the processes and effectiveness of PIT, and shows the captured locations of spoofers through applying PIT on the path backscatter data set.

These results can help further reveal IP spoofing, which has long been studied but never well understood. Though PIT cannot work in all spoofing attacks, it may be the most useful mechanism to trace spoofers before an Internet-level traceback system has been deployed in practice.

1.2 INTRODUCTION

IP spoofing, in which attackers launch attacks with forged source IP addresses, has long been recognized as a serious security problem on the Internet. By using addresses that are assigned to others or not assigned at all, attackers can avoid exposing their real locations, enhance the effect of an attack, or launch reflection-based attacks. A number of notorious attacks rely on IP spoofing, including SYN flooding, SMURF, and DNS amplification. A DNS amplification attack which severely degraded the service of a Top Level Domain (TLD) name server has been reported. Though there has been a popular conventional wisdom that DoS attacks are launched from botnets and spoofing is no longer critical, the report of ARBOR at the NANOG 50th meeting shows that spoofing is still significant in observed DoS attacks. Indeed, based on the captured backscatter messages from UCSD Network Telescopes, spoofing activities are still frequently observed.

Capturing the origins of IP spoofing traffic is of great importance. As long as the real locations of spoofers are not disclosed, they cannot be deterred from launching further attacks. Even just approaching the spoofers, for example by determining the ASes or networks they reside in, allows attackers to be located in a smaller area and filters to be placed closer to the attacker before attacking traffic gets aggregated. Last but not least, identifying the origins of spoofing traffic can help build a reputation system for ASes, which would help push the corresponding ISPs to verify IP source addresses.

Instead of proposing another IP traceback mechanism with improved tracking capability, we propose a novel solution, named Passive IP Traceback (PIT), to bypass the challenges in deployment. Routers may fail to forward an IP spoofing packet due to various reasons, e.g., TTL exceeding. In such cases, the routers may generate an ICMP error message (named path backscatter) and send the message to the spoofed source address. Because the routers can be close to the spoofers, the path backscatter messages may potentially disclose the locations of the spoofers. PIT exploits these path backscatter messages to find the location of the spoofers. With the locations of the spoofers known, the victim can seek help from the corresponding ISP to filter out the attacking packets, or take other counterattacks. PIT is especially useful for the victims in reflection based spoofing attacks, e.g., DNS amplification attacks. The victims can find the locations of the spoofers directly from the attacking traffic.

In this article, we first illustrate the generation, types, collection, and security issues of path backscatter messages in Section III. Then, in Section IV, we present PIT, which tracks the location of the spoofers based on path backscatter messages together with the topology and routing information. We discuss how to apply PIT when both topology and routing are known, when only topology is known, and when neither is known. We also present two effective algorithms to apply PIT in large-scale networks. In the following section, we first show the statistical results on path backscatter messages. Then we evaluate the two key mechanisms of PIT which work without routing information. Finally, we give the tracking result when applying PIT on the path backscatter message dataset: a number of ASes in which spoofers are found.

Our work has the following contributions:

1) This is the first known article which deeply investigates path backscatter messages. These messages are valuable to help understand spoofing activities. Though Moore et al. [8] have exploited backscatter messages, which are generated by the targets of spoofing messages, to study Denial of Service (DoS), path backscatter messages, which are sent by intermediate devices rather than the targets, have not been used in traceback.

2) A practical and effective IP traceback solution based on path backscatter messages, i.e., PIT, is proposed. PIT bypasses the deployment difficulties of existing IP traceback mechanisms and is actually already in force. Because path backscatter messages are not generated with a stable probability, PIT cannot work in all attacks, but it does work in a number of spoofing activities. At least it may be the most useful traceback mechanism before an AS-level traceback system has been deployed in practice.

3) Through applying PIT on the path backscatter dataset, a number of locations of spoofers are captured and presented. Though this is not a complete list, it is the first known list disclosing the locations of spoofers.

1.3 LITERATURE SURVEY

DEFENSE AGAINST SPOOFED IP TRAFFIC USING HOP-COUNT FILTERING

PUBLICATION: IEEE/ACM Trans. Netw., vol. 15, no. 1, pp. 40–53, Feb. 2007.

AUTHORS: H. Wang, C. Jin, and K. G. Shin

EXPLANATION:

IP spoofing has often been exploited by Distributed Denial of Service (DDoS) attacks to: 1) conceal flooding sources and dilute localities in flooding traffic, and 2) coax legitimate hosts into becoming reflectors, redirecting and amplifying flooding traffic. Thus, the ability to filter spoofed IP packets near victim servers is essential to their own protection and prevention of becoming involuntary DoS reflectors. Although an attacker can forge any field in the IP header, he cannot falsify the number of hops an IP packet takes to reach its destination. More importantly, since the hop-count values are diverse, an attacker cannot randomly spoof IP addresses while maintaining consistent hop-counts. On the other hand, an Internet server can easily infer the hop-count information from the Time-to-Live (TTL) field of the IP header. Using a mapping between IP addresses and their hop-counts, the server can distinguish spoofed IP packets from legitimate ones. Based on this observation, we present a novel filtering technique, called Hop-Count Filtering (HCF), which builds an accurate IP-to-hop-count (IP2HC) mapping table to detect and discard spoofed IP packets. HCF is easy to deploy, as it does not require any support from the underlying network. Through analysis using network measurement data, we show that HCF can identify close to 90% of spoofed IP packets, and then discard them with little collateral damage. We implement and evaluate HCF in the Linux kernel, demonstrating its effectiveness with experimental measurements.
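The TTL-based inference described in this abstract can be illustrated with a short sketch. It is only an approximation of the idea: the list of candidate initial TTL values, the rule of picking the smallest candidate not below the observed TTL, and the names HopCount, inferHopCount and looksSpoofed are assumptions made for illustration, not details taken from the HCF paper or its Linux implementation.

```java
import java.util.HashMap;
import java.util.Map;

class HopCount {

    // Assumed set of common default initial TTL values (not given in the text above).
    static final int[] INITIAL_TTLS = {30, 32, 60, 64, 128, 255};

    // Hop count = assumed initial TTL minus the TTL observed at the server.
    static int inferHopCount(int observedTtl) {
        if (observedTtl < 0 || observedTtl > 255) {
            throw new IllegalArgumentException("TTL out of range: " + observedTtl);
        }
        for (int initial : INITIAL_TTLS) {
            if (observedTtl <= initial) {
                return initial - observedTtl;
            }
        }
        throw new IllegalStateException("unreachable for TTLs in 0..255");
    }

    // Flags a packet whose inferred hop count disagrees with the stored IP-to-hop-count entry.
    static boolean looksSpoofed(Map<String, Integer> ip2hc, String srcIp, int observedTtl) {
        Integer expected = ip2hc.get(srcIp);
        if (expected == null) {
            return false; // unknown source: this sketch makes no decision
        }
        return expected.intValue() != inferHopCount(observedTtl);
    }

    public static void main(String[] args) {
        Map<String, Integer> ip2hc = new HashMap<String, Integer>();
        ip2hc.put("192.0.2.7", 12);
        System.out.println(looksSpoofed(ip2hc, "192.0.2.7", 52));  // 64 - 52 = 12 hops
        System.out.println(looksSpoofed(ip2hc, "192.0.2.7", 120)); // 128 - 120 = 8 hops
    }
}
```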

DYNAMIC PROBABILISTIC PACKET MARKING FOR EFFICIENT IP TRACEBACK

PUBLICATION: Comput. Netw., vol. 51, no. 3, pp. 866–882, 2007.

AUTHORS: J. Liu, Z.-J. Lee, and Y.-C. Chung

EXPLANATION:

Recently, denial-of-service (DoS) attack has become a pressing problem due to the lack of an efficient method to locate the real attackers and ease of launching an attack with readily available source codes on the Internet. Traceback is a subtle scheme to tackle DoS attacks. Probabilistic packet marking (PPM) is a new way for practical IP traceback. Although PPM enables a victim to pinpoint the attacker’s origin to within 2–5 equally possible sites, it has been shown that PPM suffers from uncertainty under spoofed marking attack. Furthermore, the uncertainty factor can be amplified significantly under distributed DoS attack, which may diminish the effectiveness of PPM. In this work, we present a new approach, called dynamic probabilistic packet marking (DPPM), to further improve the effectiveness of PPM. Instead of using a fixed marking probability, we propose to deduce the traveling distance of a packet and then choose a proper marking probability. DPPM may completely remove uncertainty and enable victims to precisely pinpoint the attacking origin even under spoofed marking DoS attacks. DPPM supports incremental deployment. Formal analysis indicates that DPPM outperforms PPM in most aspects.

FLEXIBLE DETERMINISTIC PACKET MARKING: AN IP TRACEBACK SYSTEM TO FIND THE REAL SOURCE OF ATTACKS

PUBLICATION: IEEE Trans. Parallel Distrib. Syst., vol. 20, no. 4, pp. 567–580, Apr. 2009.

AUTHORS: Y. Xiang, W. Zhou, and M. Guo

EXPLANATION:

IP traceback is the enabling technology to control Internet crime. In this paper we present a novel and practical IP traceback system called Flexible Deterministic Packet Marking (FDPM) which provides a defense system with the ability to find out the real sources of attacking packets that traverse through the network. While a number of other traceback schemes exist, FDPM provides innovative features to trace the source of IP packets and can obtain better tracing capability than others. In particular, FDPM adopts a flexible mark length strategy to make it compatible to different network environments; it also adaptively changes its marking rate according to the load of the participating router by a flexible flow-based marking scheme. Evaluations on both simulation and real system implementation demonstrate that FDPM requires a moderately small number of packets to complete the traceback process; add little additional load to routers and can trace a large number of sources in one traceback process with low false positive rates. The built-in overload prevention mechanism makes this system capable of achieving a satisfactory traceback result even when the router is heavily loaded. It has been used to not only trace DDoS attacking packets but also enhance filtering attacking traffic.

CHAPTER 2

2.0 SYSTEM ANALYSIS

2.1 EXISTING SYSTEM:

In the existing IP marking approach, routers probabilistically write some encoding of partial path information into the packets during forwarding. A basic technique, the edge sampling algorithm, writes edge information into the packets. This scheme reserves two static fields the size of an IP address, start and end, and a static distance field in each packet. Each router updates these fields as follows. Each router marks the packet with a probability. When the router decides to mark the packet, it writes its own IP address into the start field and writes zero into the distance field. Otherwise, if the distance field is already zero, which indicates that the previous router marked the packet, it writes its own IP address into the end field, thus representing the edge between itself and the previous router.

If the router does not mark the packet, it always increments the distance field. Thus the distance field in the packet indicates the number of routers the packet has traversed from the router which marked the packet to the victim. The distance field should be updated using a saturating addition, meaning that the distance field is not allowed to wrap. The mandatory increment of the distance field is used to avoid spoofing by an attacker. Using such a scheme, any packet written by the attacker will have a distance field greater than or equal to the length of the real attack path. We call a router a false positive if it is in the reconstructed attack graph but not in the real attack graph. Similarly, we call a router a false negative if it is in the true attack graph but not in the reconstructed attack graph. We call a solution to the IP traceback problem robust if it has a very low rate of false negatives and false positives.
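The per-router marking step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the Packet class, the 8-bit bound on the distance field (MAX_DISTANCE) and the method name mark are hypothetical, and a real scheme would encode these fields inside the IP header rather than in an object.

```java
import java.util.Random;

class EdgeSampling {

    static final int MAX_DISTANCE = 255; // assumed width of the distance field (8 bits)

    // The three marking fields carried by each packet.
    static class Packet {
        String start = "";
        String end = "";
        int distance = 0;
    }

    // Marking procedure run by one router with marking probability p.
    static void mark(Packet w, String routerIp, double p, Random rng) {
        if (rng.nextDouble() < p) {
            // The router decides to mark: start a new edge sample at distance zero.
            w.start = routerIp;
            w.distance = 0;
        } else {
            if (w.distance == 0) {
                // The previous router marked the packet: complete the edge.
                w.end = routerIp;
            }
            // Mandatory increment, saturating so that the field never wraps.
            w.distance = Math.min(w.distance + 1, MAX_DISTANCE);
        }
    }

    public static void main(String[] args) {
        Random rng = new Random();
        Packet w = new Packet();
        String[] path = {"10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"};
        for (String router : path) {
            mark(w, router, 0.25, rng);
        }
        System.out.println(w.start + " -> " + w.end + " at distance " + w.distance);
    }
}
```

Because every non-marking router increments the field, an attacker who pre-fills the marking fields cannot produce a sample that appears closer to the victim than the first router on the real path.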

2.1.1 DISADVANTAGES:

The existing approach has a very high computation overhead for the victim to reconstruct the attack paths, and gives a large number of false positives when the denial-of-service attack originates from multiple attackers.

The existing approach can require days of computation to reconstruct the attack paths and give thousands of false positives even when there are only 25 distributed attackers. This approach is also vulnerable to compromised routers.

If a router is compromised, it can forge markings from other uncompromised routers and hence lead the reconstruction to wrong results. Even worse, the victim will not be able to tell that a router is compromised just from the information in the packets it receives.

2.2 PROPOSED SYSTEM:

We propose a novel solution, named Passive IP Traceback (PIT), to bypass the challenges in deployment. Routers may fail to forward an IP spoofing packet due to various reasons, e.g., TTL exceeding. In such cases, the routers may generate an ICMP error message (named path backscatter) and send the message to the spoofed source address. Because the routers can be close to the spoofers, the path backscatter messages may potentially disclose the locations of the spoofers. PIT exploits these path backscatter messages to find the location of the spoofers. With the locations of the spoofers known, the victim can seek help from the corresponding ISP to filter out the attacking packets, or take other counterattacks. PIT is especially useful for the victims in reflection based spoofing attacks, e.g., DNS amplification attacks. The victims can find the locations of the spoofers directly from the attacking traffic.

We present PIT, which tracks the location of the spoofers based on path backscatter messages together with the topology and routing information. We discuss how to apply PIT when both topology and routing are known, when only topology is known, and when neither is known. We also present two effective algorithms to apply PIT in large-scale networks. In the following section, we first show the statistical results on path backscatter messages. Then we evaluate the two key mechanisms of PIT which work without routing information. Finally, we give the tracking result when applying PIT on the path backscatter message dataset: a number of ASes in which spoofers are found.

2.2.1 ADVANTAGES:

1) This is the first known article which deeply investigates path backscatter messages. These messages are valuable to help understand spoofing activities. Though Moore et al. [8] have exploited backscatter messages, which are generated by the targets of spoofing messages, to study Denial of Service (DoS), path backscatter messages, which are sent by intermediate devices rather than the targets, have not been used in traceback.

2) A practical and effective IP traceback solution based on path backscatter messages, i.e., PIT, is proposed. PIT bypasses the deployment difficulties of existing IP traceback mechanisms and is actually already in force. Because path backscatter messages are not generated with a stable probability, PIT cannot work in all attacks, but it does work in a number of spoofing activities. At least it may be the most useful traceback mechanism before an AS-level traceback system has been deployed in practice.

3) Through applying PIT on the path backscatter dataset, a number of locations of spoofers are captured and presented. Though this is not a complete list, it is the first known list disclosing the locations of spoofers.

2.3 HARDWARE & SOFTWARE REQUIREMENTS:

2.3.1 HARDWARE REQUIREMENT:

- Processor – Pentium IV
- Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Key Board – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA

2.3.2 SOFTWARE REQUIREMENTS:

Operating System : Windows XP or Win7
Front End : JAVA JDK 1.7
Document : MS-Office 2007

CHAPTER 3

3.0 SYSTEM DESIGN:

Data Flow Diagram / Use Case Diagram / Flow Diagram:

The DFD is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on these data, and the output data generated by the system.

The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components. These components are the system process, the data used by the process, an external entity that interacts with the system and the information flows in the system.

DFD shows how the information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.

A DFD may be used to represent a system at any level of abstraction. It may be partitioned into levels that represent increasing information flow and functional detail.

NOTATION:

SOURCE OR DESTINATION OF DATA:

External sources or destinations, which may be people or organizations or other entities

DATA SOURCE:

Here the data referenced by a process is stored and retrieved.

PROCESS:

People, procedures or devices that produce data. The physical component is not identified.

DATA FLOW:

Data moves in a specific direction from an origin to a destination. The data flow is a “packet” of data.

MODELING RULES:

There are several common modeling rules when creating DFDs:

All processes must have at least one data flow in and one data flow out.
All processes should modify the incoming data, producing new forms of outgoing data.
Each data store must be involved with at least one data flow.
Each external entity must be involved with at least one data flow.
A data flow must be attached to at least one process.

3.1 ARCHITECTURE DIAGRAM

3.2 DATAFLOW DIAGRAM:

LEVEL 1

Base station

View request

Router check the node

Message send via router

LEVEL 2

Node

Exists

Send request

Receive message

Check IP Address & check verification node

Clear Spoofing Attacks

LEVEL 3

Router

IP Address

Router check the each node

Check verification same/diff node to each data

Response to node

Detect Spoofing Origin and send message to original node

3.2.1 UML DIAGRAMS:

3.2.2 USE CASE DIAGRAM:

Base station

Router

Create message

View request

Message send via router

Router check each node

Check verification same/diff node to each data

Response to client

Detect spoofing origin

Send message

Node

Send request

3.2.3 CLASS DIAGRAM:

Node

IP Address

Send request

View message ()

Base station

IP Address

View request

Send message via router

Socket connection ()

Send message ()

Router

IP Address

Router check the each node

Detect spoofing ()

Receive message ()

Response to node

Send message ()

3.2.4 SEQUENCE DIAGRAM:

Connection established

Send encoded data

Check verification

Form routing

Routing Finished

Detect Spoofing

Connection terminate

Source

Base station

Destination

Establish communication

Connection established

Receiving Ack

Data received

Routing Success

3.2.5 ACTIVITY DIAGRAM:

Node

Router

Check

Check verification same/diff node to each data

Router check the each node

Clear jamming and send message to node

Response to client

Yes Start msg receive

IP Address & View request

IP Address

Send request

Message Received

Base station

Message send via router

Router check the node

CHAPTER 4

4.0 IMPLEMENTATION:

4.1 ALGORITHM

We designed the algorithm specified in Fig. 6. The algorithm first finds a shortest path from r to od. Starting from the second vertex along the path, it checks whether removing the vertex disconnects r from od. When such a vertex c is found, the vertex is removed from G, and the set of all vertices that are still connected with r is the suspect set.
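The steps above can be sketched on an undirected graph stored as adjacency sets. This is one reading of the description, not the code behind Fig. 6: the class name SuspectSet, the helper methods, the use of breadth-first search for both the shortest path and the connectivity test, and the choice to keep r itself in the returned set are all assumptions.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.LinkedList;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class SuspectSet {

    static void addEdge(Map<String, Set<String>> g, String u, String v) {
        if (!g.containsKey(u)) g.put(u, new HashSet<String>());
        if (!g.containsKey(v)) g.put(v, new HashSet<String>());
        g.get(u).add(v);
        g.get(v).add(u);
    }

    // Breadth-first search from src that ignores the vertex 'removed' (null removes nothing).
    // Returns each reached vertex mapped to its BFS parent.
    static Map<String, String> bfs(Map<String, Set<String>> g, String src, String removed) {
        Map<String, String> parent = new LinkedHashMap<String, String>();
        Deque<String> queue = new ArrayDeque<String>();
        parent.put(src, null);
        queue.add(src);
        while (!queue.isEmpty()) {
            String u = queue.poll();
            Set<String> neighbours = g.get(u);
            if (neighbours == null) continue;
            for (String v : neighbours) {
                if (v.equals(removed) || parent.containsKey(v)) continue;
                parent.put(v, u);
                queue.add(v);
            }
        }
        return parent;
    }

    // r: reflecting router, od: original destination of the spoofing packet.
    static Set<String> suspects(Map<String, Set<String>> g, String r, String od) {
        Map<String, String> parent = bfs(g, r, null);
        if (!parent.containsKey(od)) {
            throw new IllegalArgumentException("od is not reachable from r");
        }
        // Rebuild a shortest path r ... od from the BFS parents.
        LinkedList<String> path = new LinkedList<String>();
        for (String v = od; v != null; v = parent.get(v)) {
            path.addFirst(v);
        }
        // From the second vertex on, look for a vertex c whose removal separates r from od.
        for (String c : path.subList(1, path.size())) {
            Map<String, String> reached = bfs(g, r, c);
            if (!reached.containsKey(od)) {
                return new TreeSet<String>(reached.keySet());
            }
        }
        return new TreeSet<String>(parent.keySet()); // only when r equals od
    }

    public static void main(String[] args) {
        Map<String, Set<String>> g = new HashMap<String, Set<String>>();
        addEdge(g, "a", "r");
        addEdge(g, "b", "r");
        addEdge(g, "r", "c");
        addEdge(g, "c", "od");
        System.out.println(suspects(g, "r", "od"));
    }
}
```

In the small graph built in main, removing c cuts r off from od, so the vertices still attached to r (a, b and r itself) form the suspect set.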

4.2 MODULES:

NETWORK SECURITY:

DENIAL OF SERVICE (DOS):

PATH BACKSCATTER:

IP SPOOFING METHOD:

IP TRACEBACK METHOD:

4.3 MODULE DESCRIPTION:

NETWORK SECURITY:

DENIAL OF SERVICE (DOS):

In computing, a denial-of-service (DoS) attack is an attempt to make a machine or network resource unavailable to its intended users, for example by temporarily or indefinitely interrupting or suspending the services of a host connected to the Internet. A distributed denial-of-service (DDoS) attack is one in which the attack comes from more than one, often thousands of, unique IP addresses. It is analogous to a group of people crowding the entry door or gate of a shop or business and not letting legitimate parties enter, thereby disrupting normal operations.

Criminal perpetrators of DoS attacks often target sites or services hosted on high-profile web servers, such as banks or credit card payment gateways, but motives of revenge, blackmail or activism can be behind other attacks. A denial-of-service attack is characterized by an explicit attempt by attackers to prevent legitimate users of a service from using that service. There are two general forms of DoS attacks: those that crash services and those that flood services.

The most serious attacks are distributed and in many or most cases involve forging of IP sender addresses (IP address spoofing) so that the location of the attacking machines cannot easily be identified, nor can filtering be done based on the source address.

PATH BACKSCATTER:

We previously presented a preliminary statistical result on path backscatter messages and discussed the possibility of tracing spoofers based on these messages. However, the generation and collection of path backscatter messages were not well investigated, and the traceback mechanisms were not designed. In this article, we make a thorough analysis of path backscatter messages, present the traceback mechanisms and give the traceback results.

Each message contains the source address of the reflecting device and the IP header of the original packet. Thus, from each path backscatter message, we can get 1) the IP address of the reflecting device, which is on the path from the attacker to the destination of the spoofing packet; 2) the IP address of the original destination of the spoofing packet. The original IP header also contains other valuable information, e.g., the remaining TTL of the spoofing packet. Note that because some network devices may rewrite addresses (e.g., NAT), the original source address and the destination address may have been altered.

IP SPOOFING METHOD:

Our tracking mechanisms have two limitations. The first is that the network topology and the mapping from the addresses of r and od to vertices must be known. The second is that the tracking is performed based on loose assumptions on paths. Thus, only when path backscatter messages come from a very special kind of vertex, i.e., a stub AS, can the spoofer be accurately located. In this section, we discuss how to overcome these limitations by using other information contained in path backscatter messages.

We found there are three special types of path backscatter messages which are more useful for tracing spoofers:

1) The path backscatter messages whose original hop count is 0 or 1. Such messages are generated 1 or 2 hops from the spoofers. They very likely come from the gateway of the spoofer.

2) The path backscatter messages whose type is ‘Redirect’. Such messages must be from a gateway of the spoofer.

3) The path backscatter messages whose original destination is a private address or unallocated address. Such messages are typically generated by the first DFZ router on the path from the spoofer to the original destination, e.g., the egress router of the AS in which the spoofer resides. Though such path backscatter messages are generated in very special cases, they are not rare. In particular, there are a large number of path backscatter messages whose original destination is a private address.
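The three message types listed above can be told apart with a small classifier. The following Java sketch is an illustration only: the class, enum and parameter names are hypothetical, the order in which the rules are tried is an arbitrary choice, and the sketch assumes that the ICMP type and the original hop count have already been extracted from the message. ICMP type 5 is the standard code for Redirect. Only private IPv4 destinations are detected, using `InetAddress.isSiteLocalAddress()`; recognising unallocated addresses would need registry data that is outside the scope of this sketch.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class BackscatterClassifier {

    /** ICMP message type 5 is Redirect. */
    static final int ICMP_REDIRECT = 5;

    /** Hypothetical labels for the three special message types. */
    public enum Kind { NEAR_SPOOFER, GATEWAY_REDIRECT, PRIVATE_DESTINATION, OTHER }

    /**
     * Classifies one path backscatter message.
     *
     * @param icmpType            ICMP type of the backscatter message
     * @param originalHopCount    hop count of the original packet (assumed given)
     * @param originalDestination destination of the original packet, as an IPv4 literal
     */
    public static Kind classify(int icmpType, int originalHopCount, String originalDestination) {
        if (icmpType == ICMP_REDIRECT) {
            return Kind.GATEWAY_REDIRECT;          // type 2) in the list above
        }
        if (originalHopCount == 0 || originalHopCount == 1) {
            return Kind.NEAR_SPOOFER;              // type 1) in the list above
        }
        try {
            // For an IP literal, getByName only parses the text (no DNS lookup).
            InetAddress dst = InetAddress.getByName(originalDestination);
            if (dst.isSiteLocalAddress()) {        // 10/8, 172.16/12, 192.168/16
                return Kind.PRIVATE_DESTINATION;   // type 3), private case only
            }
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("not an IP literal: " + originalDestination, e);
        }
        return Kind.OTHER;
    }

    public static void main(String[] args) {
        System.out.println(classify(5, 7, "8.8.8.8"));
        System.out.println(classify(11, 1, "8.8.8.8"));
        System.out.println(classify(3, 9, "192.168.1.10"));
    }
}
```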

IP TRACEBACK METHOD:

PIT is very different from any existing traceback mechanism. The main difference is that path backscatter messages are not generated with a fixed probability. Thus, we separate the evaluation into three parts: the first is the statistical results on path backscatter messages; the second is the evaluation of the traceback mechanisms, which takes into account the uncertainty of path backscatter generation, since the effectiveness of the mechanisms is determined by the structural features of the networks; the last is the result of applying the traceback mechanisms to the path backscatter message dataset.

In this article, we proposed Passive IP Traceback (PIT), which tracks spoofers based on path backscatter messages and publicly available information. We illustrated the causes, collection, and statistical results of path backscatter. We specified how to apply PIT when the topology and routing are both known, when the routing is unknown, and when neither of them is known. We presented two effective algorithms to apply PIT in large-scale networks and proved their correctness. We demonstrated the effectiveness of PIT based on deduction and simulation. We showed the captured locations of spoofers obtained by applying PIT to the path backscatter dataset. These results can help further reveal IP spoofing, which has long been studied but never well understood.

CHAPTER 5

5.0 SYSTEM STUDY:

5.1 FEASIBILITY STUDY:

Three key considerations are involved in the feasibility analysis:

ECONOMICAL FEASIBILITY
TECHNICAL FEASIBILITY
SOCIAL FEASIBILITY

5.1.1 ECONOMICAL FEASIBILITY:

5.1.2 TECHNICAL FEASIBILITY

5.1.3 SOCIAL FEASIBILITY:

5.2 SYSTEM TESTING:

5.2.1 UNIT TESTING:

Description	Expected result
Test for application window properties.	All the properties of the windows are to be properly aligned and displayed.
Test for mouse operations.	All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions.

5.2.2 FUNCTIONAL TESTING:

Description	Expected result
Test for all modules.	All peers should communicate in the group.
Test various peers in a distributed network framework, checking that all users available in the group are displayed.	Execution should produce accurate results.

5.2.3 NON-FUNCTIONAL TESTING:

Load testing
Performance testing
Usability testing
Reliability testing
Security testing

5.2.4 LOAD TESTING:

Description	Expected result
It is necessary to ascertain that the application behaves correctly under load when a ‘Server busy’ response is received.	Should designate another active node as a Server.

5.2.5 PERFORMANCE TESTING:

Description	Expected result
This is required to assure that the application performs adequately: it can handle many peers, delivers its results in the expected time and uses an acceptable level of resources. It is an aspect of operational management.	Should handle large input values and produce accurate results in the expected time.

5.2.6 RELIABILITY TESTING:

Description	Expected result
This checks that the server is rugged and reliable and can handle the failure of any of the components involved in providing the application.	In case of server failure, an alternate server should take over the job.

5.2.7 SECURITY TESTING:

Description	Expected result
Check that the user identification is authenticated.	In case of failure, the user should not be connected to the framework.
Check whether group keys in a tree are shared by all peers.	Peers in the same group should know the group key.

5.2.8 WHITE BOX TESTING:

Description	Expected result
Exercise all logical decisions on their true and false sides.	All the logical decisions must be valid.
Execute all loops at their boundaries and within their operational bounds.	All the loops must be finite.
Exercise internal data structures to ensure their validity.	All the data structures must be valid.

5.2.9 BLACK BOX TESTING:

Description	Expected result
To check for incorrect or missing functions.	All the functions must be valid.
To check for interface errors.	The entire interface must function normally.
To check for errors in data structures or external database access.	Database updates and retrieval must be performed.
To check for initialization and termination errors.	All the functions and data structures must be initialized properly and terminated normally.

All of the above system testing strategies are carried out because the development, documentation and institutionalization of the proposed goals and related policies are essential.

CHAPTER 6

6.0 SOFTWARE DESCRIPTION:

6.1 JAVA TECHNOLOGY:

Java technology is both a programming language and a platform.

The Java Programming Language

The Java programming language is a high-level language that can be characterized by all of the following buzzwords:

- Simple
- Architecture neutral
- Object oriented
- Portable
- Distributed
- High performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure

6.2 THE JAVA PLATFORM:

The Java platform has two components:

The Java Virtual Machine (Java VM)
The Java Application Programming Interface (Java API)

You’ve already been introduced to the Java VM. It’s the base for the Java platform and is ported onto various hardware-based platforms.

The following figure depicts a program that’s running on the Java platform. As the figure shows, the Java API and the virtual machine insulate the program from the hardware.

6.3 WHAT CAN JAVA TECHNOLOGY DO?

The essentials: Objects, strings, threads, numbers, input and output, data structures, system properties, date and time, and so on.
Applets: The set of conventions used by applets.
Networking: URLs, TCP (Transmission Control Protocol), UDP (User Datagram Protocol) sockets, and IP (Internet Protocol) addresses.
Internationalization: Help for writing programs that can be localized for users worldwide. Programs can automatically adapt to specific locales and be displayed in the appropriate language.
Security: Both low level and high level, including electronic signatures, public and private key management, access control, and certificates.
Software components: Known as JavaBeans™, can plug into existing component architectures.
Object serialization: Allows lightweight persistence and communication via Remote Method Invocation (RMI).
Java Database Connectivity (JDBC™): Provides uniform access to a wide range of relational databases.

The Java platform also has APIs for 2D and 3D graphics, accessibility, servers, collaboration, telephony, speech, animation, and more. The following figure depicts what is included in the Java 2 SDK.

6.4 HOW WILL JAVA TECHNOLOGY CHANGE MY LIFE?

Get started quickly: Although the Java programming language is a powerful object-oriented language, it’s easy to learn, especially for programmers already familiar with C or C++.
Write less code: Comparisons of program metrics (class counts, method counts, and so on) suggest that a program written in the Java programming language can be four times smaller than the same program in C++.
Write better code: The Java programming language encourages good coding practices, and its garbage collection helps you avoid memory leaks. Its object orientation, its JavaBeans component architecture, and its wide-ranging, easily extendible API let you reuse other people’s tested code and introduce fewer bugs.
Develop programs more quickly: Development may be as much as twice as fast as writing the same program in C++. Why? You write fewer lines of code, and Java is a simpler programming language than C++.
Avoid platform dependencies with 100% Pure Java: You can keep your program portable by avoiding the use of libraries written in other languages. The 100% Pure Java™ Product Certification Program has a repository of historical process manuals, white papers, brochures, and similar materials online.
Write once, run anywhere: Because 100% Pure Java programs are compiled into machine-independent byte codes, they run consistently on any Java platform.
Distribute software more easily: You can upgrade applets easily from a central server. Applets take advantage of the feature of allowing new classes to be loaded “on the fly,” without recompiling the entire program.

6.5 ODBC:

6.6 JDBC:

JDBC was announced in March of 1996. It was released for a 90 day public review that ended June 8, 1996. Because of user input, the final JDBC v1.0 specification was released soon after.

6.7 JDBC Goals:

The goals that were set for JDBC are important. They will give you some insight as to why certain classes and functionalities behave the way they do. The eight design goals for JDBC are as follows:

SQL Level API

SQL Conformance

JDBC must be implementable on top of common database interfaces

Provide a Java interface that is consistent with the rest of the Java system

Because of Java’s acceptance in the user community thus far, the designers feel that they should not stray from the current design of the core Java system.

Keep it simple

Use strong, static typing wherever possible

Strong typing allows more error checking to be done at compile time, so fewer errors appear at runtime.

Keep the common cases simple

Finally, we decided to proceed with the implementation using Java Networking.

For dynamically updating the cache table, we use an MS Access database.

Java has two aspects: a programming language and a platform.

Java is a high-level programming language that is all of the following:

Simple, architecture-neutral, object-oriented, portable, distributed, high-performance, interpreted, multithreaded, robust, dynamic, and secure.

Compilation happens just once; interpretation occurs each time the program is executed. The figure illustrates how this works.

[Figure: compilation and interpretation of a Java program. Labels: Java Program; Compilers; Interpreter; My Program.]

6.7 NETWORKING TCP/IP STACK:

The TCP/IP stack is shorter than the OSI one:

TCP is a connection-oriented protocol; UDP (User Datagram Protocol) is a connectionless protocol.

IP datagrams:

UDP:

UDP is also connectionless and unreliable. What it adds to IP is a checksum for the contents of the datagram and port numbers. These are used to give a client/server model – see later.

TCP:

TCP supplies logic to give a reliable connection-oriented protocol above IP. It provides a virtual circuit that two processes can use to communicate.

Internet addresses

In order to use a service, you must be able to find it. The Internet uses an address scheme for machines so that they can be located. The IP address of a machine is a 32-bit integer.

Network address:

Class A uses 8 bits for the network address with 24 bits left over for other addressing. Class B uses 16 bit network addressing. Class C uses 24 bit network addressing and class D uses all 32.

Subnet address:

Internally, the UNIX network is divided into sub networks. Building 11 is currently on one sub network and uses 10-bit addressing, allowing 1024 different hosts.

Host address:

8 bits are finally used for host addresses within our subnet. This places a limit of 256 machines that can be on the subnet.

Total address:

The 32 bit address is usually written as 4 integers separated by dots.
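The dotted notation and the address classes described above can be illustrated with a short Java sketch. The class and method names are hypothetical. The class boundaries used here follow the classic classful scheme, in which the leading bits of the first octet select the class (0 for A, 10 for B, 110 for C, 1110 for D).

```java
public class Ipv4Demo {

    /** Formats a 32-bit address as four decimal integers separated by dots. */
    public static String toDotted(int address) {
        return ((address >>> 24) & 0xFF) + "." + ((address >>> 16) & 0xFF) + "."
                + ((address >>> 8) & 0xFF) + "." + (address & 0xFF);
    }

    /** Parses dotted notation back into the 32-bit integer form. */
    public static int fromDotted(String dotted) {
        String[] parts = dotted.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("expected four octets: " + dotted);
        }
        int address = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("octet out of range: " + part);
            }
            address = (address << 8) | octet; // shift in one octet at a time
        }
        return address;
    }

    /** Returns the classful address class, decided by the first octet. */
    public static char addressClass(int address) {
        int first = (address >>> 24) & 0xFF;
        if (first < 128) return 'A'; // leading bit 0: 8-bit network part
        if (first < 192) return 'B'; // leading bits 10: 16-bit network part
        if (first < 224) return 'C'; // leading bits 110: 24-bit network part
        if (first < 240) return 'D'; // leading bits 1110
        return 'E';                  // remaining range
    }

    public static void main(String[] args) {
        int address = fromDotted("192.168.1.10");
        System.out.println(toDotted(address) + " is class " + addressClass(address));
    }
}
```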

Port addresses

Sockets:

#include <sys/types.h>

#include <sys/socket.h>

int socket(int family, int type, int protocol);
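The prototype above is the C interface. Since this project is implemented in Java, the same idea can be sketched with `java.net.ServerSocket` and `java.net.Socket`, which provide the reliable, connection-oriented TCP channel described earlier. The example below is a minimal sketch under stated assumptions: the class and method names are hypothetical, both ends run in one process on the loopback interface, and port 0 asks the operating system for any free port.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class LoopbackEcho {

    private static BufferedReader reader(Socket s) throws IOException {
        return new BufferedReader(new InputStreamReader(s.getInputStream(), StandardCharsets.UTF_8));
    }

    private static PrintWriter writer(Socket s) throws IOException {
        // Second argument true: println() flushes automatically.
        return new PrintWriter(new OutputStreamWriter(s.getOutputStream(), StandardCharsets.UTF_8), true);
    }

    /** Sends one line to a one-shot echo server on the loopback interface and returns the reply. */
    public static String echoOnce(String message) {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback)) {
            // Server side: accept a single connection and send back the line it receives.
            Thread serverThread = new Thread(() -> {
                try (Socket peer = server.accept();
                     BufferedReader in = reader(peer);
                     PrintWriter out = writer(peer)) {
                    out.println(in.readLine());
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            serverThread.start();

            // Client side: connect, send one line, read the echoed line.
            try (Socket client = new Socket(loopback, server.getLocalPort());
                 PrintWriter out = writer(client);
                 BufferedReader in = reader(client)) {
                out.println(message);
                String reply = in.readLine();
                serverThread.join();
                return reply;
            }
        } catch (IOException | InterruptedException e) {
            throw new IllegalStateException("loopback echo failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(echoOnce("hello"));
    }
}
```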

6.8 JFREE CHART:

JFreeChart is a free 100% Java chart library that makes it easy for developers to display professional quality charts in their applications. JFreeChart’s extensive feature set includes:

A consistent and well-documented API, supporting a wide range of chart types;

A flexible design that is easy to extend, and targets both server-side and client-side applications;

Support for many output types, including Swing components, image files (including PNG and JPEG), and vector graphics file formats (including PDF, EPS and SVG);

JFreeChart is “open source” or, more specifically, free software. It is distributed under the terms of the GNU Lesser General Public Licence (LGPL), which permits use in proprietary applications.

6.8.1. Map Visualizations:

6.8.2. Time Series Chart Interactivity

6.8.3. Dashboards

6.8.4. Property Editors

CHAPTER 7

7.0 APPENDIX

7.1 SAMPLE SCREEN SHOTS:

7.2 SAMPLE SOURCE CODE:

CHAPTER 8

8.1 CONCLUSION:

We try to dissipate the mist on the locations of spoofers by investigating path backscatter messages. In this article, we proposed Passive IP Traceback (PIT), which tracks spoofers based on path backscatter messages and publicly available information. We illustrated the causes, collection, and statistical results of path backscatter. We specified how to apply PIT when the topology and routing are both known, when the routing is unknown, and when neither of them is known.

We presented two effective algorithms to apply PIT in large-scale networks and proved their correctness. We demonstrated the effectiveness of PIT based on deduction and simulation. We showed the captured locations of spoofers obtained by applying PIT to the path backscatter dataset. These results can help further reveal IP spoofing, which has long been studied but never well understood.

Optimal Configuration of Network Coding in Ad Hoc Networks

05/08/201902/07/2019 by admin

Abstract:

This work analyzes the impact of network coding (NC) configuration on the performance of ad hoc networks, considering two significant factors, namely the throughput loss and the decoding loss, which are jointly treated as the overhead of NC. In particular, physical-layer NC and random linear NC are adopted in static and mobile ad hoc networks (MANETs), respectively. Furthermore, we characterize the goodput and the delay/goodput tradeoff in static networks; both are also analyzed in MANETs for different mobility models (i.e., the random independent and identically distributed mobility model and the random walk model) and transmission schemes.

Introduction:

Network coding was initially designed as a kind of source coding. Further studies showed that the capacity of wired networks can be improved by network coding (NC), which can fully utilize the network resources.

Due to this advantage, how to employ NC in wireless ad hoc networks has been intensively studied in recent years with the purpose of improving throughput and delay performance. The main difference between wired and wireless networks is that there is non-negligible interference between nodes in wireless networks.

Therefore, it is important to design NC for wireless ad hoc networks with interference in mind, in order to improve system performance such as the goodput and the delay/goodput tradeoff.

Existing System:

The probability that random linear NC (RLNC) is valid for a multicast connection problem on an arbitrary network with independent sources is at least (1 − d/q)^η, where η is the number of links with associated random coefficients, d is the number of receivers, and q is the size of the Galois field F_q.

Clearly, a large q is required to guarantee that a system with RLNC is valid. When the throughput loss and the decoding loss are taken into account, the traditional definition of throughput in ad hoc networks is no longer appropriate, since it ignores both the bits spent on NC coefficients and the linearly correlated packets that carry no useful data. Instead, this paper investigates the goodput and the delay/goodput tradeoff, which take into account only the successfully decoded data.
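To see why a large q is needed, the lower bound on the validity probability, read here as (1 − d/q) raised to the power η, can be evaluated numerically. The Java sketch below is an illustration only; the class name, the method name and the sample values of d, q and η are made up for the example.

```java
public class RlncBound {

    /**
     * Lower bound (1 - d/q)^eta on the probability that random linear NC is valid.
     *
     * @param d   number of receivers
     * @param q   size of the Galois field
     * @param eta number of links with associated random coefficients
     */
    public static double validityLowerBound(int d, long q, int eta) {
        if (q <= d) {
            return 0.0; // the bound carries no information when q does not exceed d
        }
        return Math.pow(1.0 - (double) d / q, eta);
    }

    public static void main(String[] args) {
        int d = 2;     // hypothetical number of receivers
        int eta = 10;  // hypothetical number of coded links
        for (long q : new long[] {4, 16, 256, 65536}) {
            System.out.printf("q = %d: bound = %.4f%n", q, validityLowerBound(d, q, eta));
        }
    }
}
```

For fixed d and η the bound increases towards 1 as q grows, which matches the observation that a small field gives only a weak guarantee.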

Moreover, if we treat the data size of each packet, the generation size (the number of packets that are combined by NC as a group), and the NC coefficient Galois field as the configuration of NC, it is necessary to find the scaling laws of the optimal configuration for a given network model and transmission scheme.

Disadvantages:

Throughput loss.
The decoding loss.
Time delay.

Proposed System:

The proposed system starts from the basic idea of NC and the scaling laws of the throughput loss and the decoding loss. Furthermore, some useful concepts and parameters are listed. Finally, we give the definitions of some network performance metrics.

Physical-layer network coding (PNC) is designed based on the channel state information (CSI) and the network topology. PNC is appropriate for static networks since the CSI and network topology are known in advance in the static case.

There are G nodes in one cell, and node i (i = 1, 2, . . . , G) holds packet x_i. All of the G packets are independent, and they belong to the same unicast session. The packets are transmitted simultaneously to a node i′ in the next cell. g_{ii′} is a complex number that represents the CSI between i and i′ in the frequency domain.

Advantages:

System minimizes data loss.
System reduces time delay.

Modules:

Network Topology:

The network consists of n static nodes distributed randomly and uniformly in a unit square area. These nodes are randomly grouped into S–D (source–destination) pairs.
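A topology of this kind can be generated with a few lines of Java. This is a minimal sketch, not the authors' simulation setup: the class and method names are hypothetical, and the pairing rule (shuffle the node indices and read them off two at a time, which requires an even n) is an assumption, since the text only says that the grouping is random.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class RandomTopology {

    /** Places n nodes independently and uniformly at random in the unit square. */
    public static double[][] placeNodes(int n, Random rng) {
        double[][] positions = new double[n][2];
        for (double[] p : positions) {
            p[0] = rng.nextDouble(); // x coordinate in [0, 1)
            p[1] = rng.nextDouble(); // y coordinate in [0, 1)
        }
        return positions;
    }

    /** Shuffles the node indices and reads them off two at a time as (source, destination). */
    public static int[][] pairNodes(int n, Random rng) {
        if (n % 2 != 0) {
            throw new IllegalArgumentException("n must be even for this pairing rule");
        }
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            order.add(i);
        }
        Collections.shuffle(order, rng);
        int[][] pairs = new int[n / 2][2];
        for (int k = 0; k < n / 2; k++) {
            pairs[k][0] = order.get(2 * k);     // source
            pairs[k][1] = order.get(2 * k + 1); // destination
        }
        return pairs;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        double[][] pos = placeNodes(6, rng);
        for (int[] p : pairNodes(6, rng)) {
            System.out.printf("S=%d (%.2f, %.2f) -> D=%d (%.2f, %.2f)%n",
                    p[0], pos[p[0]][0], pos[p[0]][1], p[1], pos[p[1]][0], pos[p[1]][1]);
        }
    }
}
```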

Transmission Model:

We adopt the protocol model, which is a simplified version of the physical model since it ignores long-distance interference and transmission. Moreover, it has been indicated that the physical model can be treated as the protocol model with respect to scaling laws when a transmission is allowed only if the signal-to-interference-plus-noise ratio is larger than a given threshold.

Transmission Schemes for Mobile Networks:

When the relay receives a new packet, it combines the packet it already holds with the received one using randomly selected coefficients and then generates a new packet. Simultaneous transmission in one cell is not allowed since it is hard for the receiver to obtain multiple CSI from different transmitters at the same time. Hence, we employ random linear NC for mobile models.
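The combining step described above, together with the matching decoding step at the destination, can be sketched in Java. This is an illustration under simplifying assumptions, not the scheme analyzed in the paper: arithmetic is done in the prime field of size q = 257 so that plain modular arithmetic suffices, packets are arrays of symbols in the range 0 to 256, and all class and method names are hypothetical. Decoding uses Gauss-Jordan elimination and fails (returns null) when the collected coefficient vectors are linearly dependent, which corresponds to the decoding loss mentioned earlier.

```java
import java.util.Random;

public class RlncSketch {

    static final int Q = 257; // assumed prime field size

    /** Modular exponentiation; used for inverses via Fermat's little theorem. */
    static int modPow(int base, int exp) {
        long result = 1, b = base % Q;
        while (exp > 0) {
            if ((exp & 1) == 1) result = result * b % Q;
            b = b * b % Q;
            exp >>= 1;
        }
        return (int) result;
    }

    /** Draws g random coefficients from {0, ..., Q-1}. */
    public static int[] randomCoefficients(int g, Random rng) {
        int[] c = new int[g];
        for (int i = 0; i < g; i++) c[i] = rng.nextInt(Q);
        return c;
    }

    /** One coded packet: sum over i of coeffs[i] * packets[i], symbol by symbol, mod Q. */
    public static int[] combine(int[] coeffs, int[][] packets) {
        int[] coded = new int[packets[0].length];
        for (int i = 0; i < packets.length; i++) {
            for (int s = 0; s < coded.length; s++) {
                coded[s] = (coded[s] + coeffs[i] * packets[i][s]) % Q;
            }
        }
        return coded;
    }

    /** Recovers the g original packets from g coded packets, or returns null
     *  if the coefficient vectors are linearly dependent. */
    public static int[][] decode(int[][] coeffs, int[][] coded) {
        int g = coeffs.length, len = coded[0].length;
        int[][] m = new int[g][g + len]; // augmented matrix [coefficients | coded data]
        for (int r = 0; r < g; r++) {
            System.arraycopy(coeffs[r], 0, m[r], 0, g);
            System.arraycopy(coded[r], 0, m[r], g, len);
        }
        for (int col = 0; col < g; col++) {
            int pivot = col;
            while (pivot < g && m[pivot][col] == 0) pivot++;
            if (pivot == g) return null; // no pivot: rows are dependent
            int[] tmp = m[col]; m[col] = m[pivot]; m[pivot] = tmp;
            int inv = modPow(m[col][col], Q - 2);
            for (int j = 0; j < g + len; j++) m[col][j] = m[col][j] * inv % Q;
            for (int r = 0; r < g; r++) {
                if (r == col || m[r][col] == 0) continue;
                int f = m[r][col];
                for (int j = 0; j < g + len; j++) {
                    m[r][j] = ((m[r][j] - f * m[col][j]) % Q + Q) % Q;
                }
            }
        }
        int[][] out = new int[g][len];
        for (int r = 0; r < g; r++) System.arraycopy(m[r], g, out[r], 0, len);
        return out;
    }

    public static void main(String[] args) {
        int[][] packets = {{72, 101, 108}, {1, 2, 3}, {255, 0, 7}};
        Random rng = new Random(1);
        int[][] coeffs = new int[3][];
        int[][] coded = new int[3][];
        for (int k = 0; k < 3; k++) {
            coeffs[k] = randomCoefficients(3, rng);
            coded[k] = combine(coeffs[k], packets);
        }
        int[][] decoded = decode(coeffs, coded);
        System.out.println(decoded == null ? "dependent coefficients, need more packets"
                                           : java.util.Arrays.deepToString(decoded));
    }
}
```

Each coded packet must carry its coefficient vector so that the destination can decode, and these extra symbols are one source of the throughput loss discussed in this article.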

Conclusion:

We analyzed the NC configuration in both static and mobile ad hoc networks to optimize the delay/goodput tradeoff and the goodput, taking into account the throughput loss and decoding loss of NC. These results reveal the impact of network scale on the NC system, which has not been studied in previous works. Moreover, we also compared the performance with that of the corresponding networks without NC.