SECURITY OPTIMIZATION OF DYNAMIC NETWORKS WITH PROBABILISTIC GRAPH MODELING AND LINEAR PROGRAMMING

By

A PROJECT REPORT

Submitted to the Department of Computer Science & Engineering in the FACULTY OF ENGINEERING & TECHNOLOGY

In partial fulfillment of the requirements for the award of the degree of

MASTER OF TECHNOLOGY IN COMPUTER SCIENCE & ENGINEERING

APRIL 2016
CERTIFICATE

Certified that this project report titled "SECURITY OPTIMIZATION OF DYNAMIC NETWORKS WITH PROBABILISTIC GRAPH MODELING AND LINEAR PROGRAMMING" is the bonafide work of Mr. _____________, who carried out the research under my supervision. Certified further that, to the best of my knowledge, the work reported herein does not form part of any other project report or dissertation on the basis of which a degree or award was conferred on an earlier occasion on this or any other candidate.

Signature of the Guide          Signature of the H.O.D
Name                            Name
DECLARATION

I hereby declare that the project work entitled "SECURITY OPTIMIZATION OF DYNAMIC NETWORKS WITH PROBABILISTIC GRAPH MODELING AND LINEAR PROGRAMMING", submitted to BHARATHIDASAN UNIVERSITY in partial fulfillment of the requirements for the award of the Degree of MASTER OF SCIENCE IN COMPUTER SCIENCE, is a record of original work done by me under the guidance of Prof. A. Vinayagam, M.Sc., M.Phil., M.E. To the best of my knowledge, the work reported here is not part of any other thesis or work on the basis of which a degree or award was conferred on an earlier occasion to me or any other candidate.

(Student Name)
(Reg. No.)

Place:
Date:
ACKNOWLEDGEMENT

I am extremely glad to present my project "SECURITY OPTIMIZATION OF DYNAMIC NETWORKS WITH PROBABILISTIC GRAPH MODELING AND LINEAR PROGRAMMING", which is a part of the third-semester curriculum of the Master of Science in Computer Science. I take this opportunity to express my sincere gratitude to those who helped me in bringing out this project work.

I would like to express my gratitude to my Director, Dr. K. ANANDAN, M.A.(Eco.), M.Ed., M.Phil.(Edn.), PGDCA., CGT., M.A.(Psy.), who gave me the opportunity to undertake this project.

I am highly indebted to the Co-ordinator, Prof. Muniappan, Department of Physics, and thank her from the depth of my heart for the valuable comments I received throughout my project.

I wish to express my deep sense of gratitude to my guide, Prof. A. Vinayagam, M.Sc., M.Phil., M.E., for her immense help and encouragement towards the successful completion of this project.

I also express my sincere thanks to all the staff members of the Department of Computer Science for their kind advice.

Last, but not the least, I express my deep gratitude to my parents and friends for their encouragement and support throughout the project.
CHAPTER 1

1.1 ABSTRACT

Securing the networks of large organizations is technically challenging due to their complex configurations and constraints. Managing these networks requires rigorous and comprehensive analysis tools: a network administrator needs to identify vulnerable configurations, as well as tools for hardening the networks. Such networks usually have dynamic and fluid structures, so one may have incomplete information about the connectivity and availability of hosts. In this paper, we address the problem of statically performing a rigorous assessment of a set of network security defense strategies, with the goal of reducing the probability of a successful large-scale attack in a dynamically changing and complex network architecture. We describe a probabilistic graph model and algorithms for analyzing the security of complex networks with the ultimate goal of reducing the probability of successful attacks. Our model naturally lends itself to a scalable, state-of-the-art optimization technique called sequential linear programming that is extensively applied and studied in various engineering problems. In comparison to related solutions based on attack graphs, our probabilistic model provides mechanisms for expressing uncertainties in network configurations, which has not been reported elsewhere. We have performed comprehensive experimental validation with real-world network configuration data of a sizable organization.
1.2 INTRODUCTION

BRACER: A Distributed Broadcast Protocol in Multi-Hop Cognitive Radio Ad Hoc Networks with Collision Avoidance

Yi Song and Jiang Xie, Senior Member, IEEE

Abstract—Broadcast
is an important operation in wireless ad hoc networks, where control information is usually propagated as broadcasts for the realization of most networking protocols. In traditional ad hoc networks, since the spectrum availability is uniform, broadcasts are delivered via a common channel which can be heard by all users in a network. However, in cognitive radio (CR) ad hoc networks, different unlicensed users may acquire different available channel sets. This non-uniform spectrum availability imposes special design challenges for broadcasting in CR ad hoc networks. In this paper, a fully-distributed Broadcast protocol in multi-hop Cognitive Radio ad hoc networks with collision avoidance, BRACER, is proposed. In our design, we consider practical scenarios in which each unlicensed user is not assumed to be aware of the global network topology, the spectrum availability information of other users, or time synchronization information. By intelligently downsizing the original available channel set and designing the broadcasting sequences and scheduling schemes, our proposed broadcast protocol can provide a very high successful broadcast ratio while achieving very short average broadcast delay. It can also avoid broadcast collisions. To the best of our knowledge, this is the first work that addresses the unique broadcasting challenges in multi-hop CR ad hoc networks with collision avoidance.

Index Terms—Cognitive radio ad hoc networks, distributed broadcast, channel hopping, broadcast collision avoidance

1 INTRODUCTION

COGNITIVE radio (CR) technology has been proposed as an enabling solution
to alleviate the spectrum underutilization problem [1]. With the capability of sensing the frequency bands in a time- and location-varying spectrum environment and adjusting the operating parameters based on the sensing outcome, CR technology allows an unlicensed user (or secondary user (SU)) to exploit those frequency bands unused by licensed users (or primary users) in an opportunistic manner [2]. Secondary users can form a CR infrastructure-based network or a CR ad hoc network. Recently, CR ad hoc networks have attracted plentiful research attention due to their various applications [3], [4].

Broadcast is an important operation in ad hoc networks, especially in distributed multi-hop multi-channel networks. Control information exchange among nodes, such as channel availability and routing information, is crucial for the realization of most networking protocols in an ad hoc network. This control information is often sent out as network-wide broadcasts, messages that are sent to all other nodes in a network. In addition, some exigent data packets such as emergency messages and alarm signals are also delivered as network-wide broadcasts [5]. Due to the importance of the broadcast operation, in this paper, we address the broadcasting issue in multi-hop CR ad hoc networks. Since broadcast messages often need to be disseminated to all destinations as quickly as possible, we aim to achieve a very high successful broadcast ratio and very short broadcast delay.

The broadcasting issue has been studied extensively in traditional ad hoc networks [6], [7], [8], [9]. However, unlike traditional single-channel or multi-channel ad hoc networks where the channel availability is uniform, in CR ad hoc networks different SUs may acquire different sets of available channels. This non-uniform channel availability imposes special design challenges for broadcasting in CR ad hoc networks.

First of all, for traditional single-channel and multi-channel ad hoc networks, due to the uniformity of channel availability, all nodes can tune to the same channel. Thus, broadcast messages can be conveyed through a single common channel which can be heard by all nodes in a network. However, in CR ad hoc networks, a common channel available to all nodes may not exist. More importantly, before any control information is exchanged, a SU is unaware of the available channels of its neighboring nodes. Therefore, broadcasting messages on a global common channel is not feasible in CR ad hoc networks.

To further illustrate the challenges of broadcasting in CR ad hoc networks, we consider the single-hop scenario shown in Fig. 1, where node A is the source node. For traditional single-channel and multi-channel ad hoc networks, as shown in Fig. 1a, nodes can tune to the same channel (e.g., channel 1) for broadcasting. Thus, node A only needs one time slot to let all its neighboring nodes receive the broadcast message in an error-free environment. However, in CR ad hoc networks where the channel availability is
heterogeneous and SUs are unaware of the available channels of each other, as shown in Fig. 1b, node A may
have to use multiple channels for broadcasting and may not be able to finish the broadcast within one time slot. In fact, the exact broadcast delay for all single-hop neighboring nodes to successfully receive the broadcast message in CR ad hoc networks relies on various factors (e.g., channel availability and the number of neighboring nodes) and it is random. Furthermore, since multiple channels may be used for broadcasting and the exact time for all single-hop neighboring nodes to successfully receive the broadcast message is random, avoiding broadcast collisions (i.e., a node receiving multiple copies of the broadcast message simultaneously) is much more complicated in CR ad hoc networks, as compared to traditional ad hoc networks. In traditional ad hoc networks, numerous broadcast scheduling schemes have been proposed to reduce the probability of broadcast collisions while optimizing the network performance [10], [11], [12], [13], [14], [15]. All these proposals rest on the assumption that all nodes use a single channel for broadcasting and the exact delay for a single-hop broadcast is one time slot. However, in CR ad hoc networks, without the information about the channel used for broadcasting and the exact delay for a single-hop broadcast, predicting when and on which channel a broadcast collision occurs is extremely difficult. Hence, designing a broadcast protocol which can avoid broadcast collisions, as well as provide a high successful broadcast ratio and short broadcast delay, is a very challenging issue for multi-hop CR ad hoc networks under practical scenarios. Simply extending existing broadcast protocols to CR ad hoc networks cannot yield the optimal performance.

Currently, research on broadcasting in multi-hop CR ad hoc networks is still in its infant stage. There are only limited papers addressing the broadcasting issue in CR ad hoc networks [16], [17], [18], [19]. However, in [16] and [17], the global network topology and the available channel information of all SUs are assumed to be known. Additionally, in [17], a common signaling channel for the whole network is employed, which is also not practical. These two papers adopt impractical assumptions which make them inadequate for use in practical scenarios. In [18], a Quality-of-Service (QoS)-based broadcast protocol under blind information is proposed. However, this scheme does not consider optimizing the network performance. Moreover, it ignores the broadcast collision issue. Other proposals aiming to locally establish a common control channel may also be considered for broadcasting [20], [21], [22], [23]. However, these proposals need a-priori channel availability information of all SUs, which is usually obtained via broadcasts. In addition, although some schemes on channel hopping in CR networks can be used for finding a common channel between two nodes [24], [25], [26], they still suffer various limitations and cannot be used in broadcast scenarios. In [24] and [25], the proposed channel hopping schemes cannot guarantee rendezvous under some special circumstances. In addition, one of the proposed schemes in [24] only works when two SUs have exactly the same available channel sets. Furthermore, in [26], a jump-stay based channel hopping algorithm is proposed for guaranteed rendezvous. However, the expected rendezvous time for the asymmetric model (i.e., different users have different available channels) is of polynomial complexity with respect to the total number of channels. Thus, it is unsuitable for broadcast scenarios in CR ad hoc networks, where channel availability is usually non-uniform and short broadcast delay is often required. Other channel hopping algorithms explained in [27] require tight time synchronization, which is also not feasible before any control information is exchanged.

In this paper, a fully-distributed broadcast protocol in a multi-hop CR ad hoc network, BRACER, is
proposed. We consider practical scenarios in our design: 1) no global or local common control channel is assumed to exist; 2) the global network topology is not known; 3) the channel information of any other SUs is not known; 4) the available channel sets of different SUs are not assumed to be the same; and 5) tight time synchronization is not required. Our proposed BRACER protocol can provide a very high successful delivery ratio while achieving very short broadcast delay. It can also avoid broadcast collisions. To the best of our knowledge, this is the first work that addresses the broadcasting challenges specifically in multi-hop CR ad hoc networks with a solution for broadcast collision avoidance.

The remainder of this paper is organized as follows. In Section 2, the proposed broadcast protocol for multi-hop CR ad hoc networks, BRACER, is presented. The derivation of an important system parameter is given in Section 3. Two implementation issues of the proposed protocol are further discussed in Section 4. Simulation results are shown in Section 5, followed by the conclusions in Section 6.

Fig. 1. The single-hop broadcast scenario.

2 THE PROPOSED BRACER PROTOCOL

In this section, we introduce the proposed broadcast protocol for multi-hop CR ad hoc networks, BRACER. There are three components of the proposed BRACER protocol: 1) the construction of the broadcasting sequences; 2) the distributed broadcast scheduling scheme; and 3) the broadcast collision avoidance scheme. We assume that a time-slotted system is adopted for SUs, where the length of a time slot is long enough to transmit a broadcast packet [28]. In addition, we assume that the locations of SUs are static. We also assume that each SU knows the locations of all its two-hop neighbors. We claim that this is a more valid assumption than knowledge of the global network topology. We provide a detailed discussion of this issue in Section 4. In the rest of the paper, we use the term "sender" to indicate a SU who has just received a message and will rebroadcast the message. In addition, we use the term "receiver" to indicate a SU who has not received the message. The notations used in our protocol design are listed in Table 1.

2.1 Construction of the Broadcasting Sequences

The broadcasting sequences are the
sequences of channels over which a sender and its receivers hop for successful broadcasts. First of all, we consider the single-hop broadcast scenario. As explained in Section 1, due to the non-uniform channel availability in CR ad hoc networks, a SU sender may have to use multiple channels for broadcasting in order to let all its neighboring nodes receive the broadcast message. Accordingly, the neighboring nodes may also have to listen to multiple channels in order to receive the broadcast message. Hence, the first design issue for a broadcast protocol is which channels should be used for broadcasting. One possible method is to broadcast on all the available channels of the SU sender. However, this method is quite costly in terms of the broadcast delay when the number of available channels is large. Therefore, we propose to select a subset of available channels from the original available channel set of each SU. First, the available channels of each SU are ranked based on the channel indexes. Then, each SU selects the first w channels from the ranked channel list and forms a downsized available channel set. The value of w needs to be carefully designed to ensure that at least one common channel exists between the downsized available channel sets of the SU sender and each of its neighboring nodes. The detailed derivation process to obtain a proper w is given in Section 3. Based on the derivation process, each SU can calculate the value of w for itself and its one-hop neighbors before a broadcast starts.

On the other hand, the second issue is the sequences of the channels by which a sender and its receivers hop for successful broadcasts. In this paper, we design different broadcasting sequences for a SU sender and its receivers to guarantee a successful broadcast in the single-hop scenario as long as they have at least one common channel. The sender hops and broadcasts a message on each channel in
a time slot following its own sequence. On the other hand, the receiver hops and listens on each channel following its own sequence. The pseudo-code for constructing the broadcasting sequences is shown in Algorithms 1 and 2, where w(v) is the initial w of node v.

Algorithm 1: Construction of the Broadcasting Sequence BS_v for a SU Sender v.
Input: w(v), L_v. Output: BS_v.
1  randomize the order of the elements in L_v;
2  BS_v <- empty;                      /* initialization */
3  i <- 1;
4  while i <= w(v)^2 do
5      BS_v(i) <- L_v((i mod w(v)) + 1);
6      i <- i + 1;                     /* repeat L_v for w(v) times */
7  return BS_v;

Algorithm 2: Construction of the Broadcasting Sequence RS_v for a SU Receiver v.
Input: w(v), L_v. Output: RS_v.
1  randomize the order of the elements in L_v;
2  RS_v <- empty;                      /* initialization */
3  i <- 1;
4  while i <= w(v) do                  /* repeat for every element in L_v */
5      j <- 1;
6      while j <= w(v) do              /* repeat an element for w(v) times */
7          RS_v((i - 1) * w(v) + j) <- L_v(i);
8          j <- j + 1;
9      i <- i + 1;
10 return RS_v;
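To make the two constructions concrete, the following is a minimal Python sketch of Algorithms 1 and 2. The function names and the use of Python are illustrative choices, not part of the original protocol specification; only the indexing logic mirrors the pseudo-code above.

```python
import random

def sender_sequence(channels, w):
    """Algorithm 1 (sketch): the sender hops over its w downsized channels
    in a random order, repeating the whole list w times (w*w slots)."""
    downsized = channels[:w]             # first w ranked channels
    random.shuffle(downsized)
    return [downsized[i % w] for i in range(w * w)]

def receiver_sequence(channels, w):
    """Algorithm 2 (sketch): the receiver stays on each of its w downsized
    channels for w consecutive slots (w*w slots in total)."""
    downsized = channels[:w]
    random.shuffle(downsized)
    return [downsized[i] for i in range(w) for _ in range(w)]

# Example mirroring Fig. 2: sender set {1, 2} (w = 2), receiver set {2, 3, 4} (w = 3).
print(sender_sequence([1, 2], 2))        # e.g. [2, 1, 2, 1]
print(receiver_sequence([2, 3, 4], 3))   # e.g. [4, 4, 4, 3, 3, 3, 2, 2, 2]
```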
From Algorithms 1 and 2, a SU sender hops periodically over its w available channels for w periods (i.e., w^2 time slots). Each receiver stays on one of its w available channels for w time slots and then repeats this for every channel among the w available channels. Fig. 2 gives an example illustrating the construction of the broadcasting sequences for SU senders and receivers. In Fig. 2, the downsized available channel sets of a sender and a receiver are {1, 2} and {2, 3, 4}, respectively. Based on Algorithm 1, the broadcasting sequence of the sender is {2, 1, 2, 1}. Similarly, based on Algorithm 2, the broadcasting sequence of the receiver is {4, 4, 4, 3, 3, 3, 2, 2, 2}. Since a sender usually does not know the length of the broadcasting sequence of the receiver, it broadcasts the message following its broadcasting sequence for ⌊M^2/w^2⌋ + 1 cycles, where M is the total number of channels. In this way, the total number of time slots over which the sender broadcasts is guaranteed to be longer than one cycle of the receiver's broadcasting sequence. As shown in Fig. 2, the shaded part represents a successful broadcast.

TABLE 1
Notations Used in the Protocol
N(v): the set of neighboring nodes of node v
N(N(v)): the set of neighbors of the neighboring nodes of node v
d(v, u): the Euclidean distance between nodes v and u
r_c: the radius of the transmission range of each node
|.|: the number of elements in a set
L_v: the downsized available channel set of node v
w(v): the size of the downsized available channel set of node v
C: the set of the initial w values of intermediate nodes
BS_v: the broadcasting sequence for a sender v
RS_v: the broadcasting sequence for a receiver v
DS_v: the default sequence of a sender v
st_v: the starting time slot of a sender v
rt_v: the time slot in which a receiver v receives the message
R_v: the random number assigned to a receiver v by its sender

Fig. 2. An example of the broadcasting sequences.

Since every SU calculates the initial value of w based
on its local information and the derivation process in Section 3, different SUs may obtain different values of w. We further denote w_s and w_r as the w used by the sender and the receiver, respectively, to construct their broadcasting sequences. Note that w_s and w_r may not necessarily be the same as the initial w calculated by each SU; they also depend on the initial w of the neighboring nodes. The following theorem gives an upper bound on the single-hop broadcast delay.

Theorem 1. If w_s ≤ w_r, the single-hop broadcast is guaranteed to succeed within w_r^2 time slots as long as the sender and the receiver have at least one common channel between their downsized available channel sets.

Proof. Based on Algorithm 1, a SU sender broadcasts on all the channels in its downsized available channel set in w_s consecutive time slots. Based on Algorithm 2, a SU receiver listens to every channel in its downsized available channel set for w_r consecutive time slots. If w_s ≤ w_r, then during the w_r consecutive time slots in which the SU receiver stays on the same channel, every channel of the SU sender must appear at least once. Thus, as long as the SU sender and the receiver have at least one common channel, there must exist a time slot in which the sender and the receiver hop onto the same channel during one cycle of the broadcasting sequence of the receiver (i.e., w_r^2 slots). Since we let the total number of time slots over which the sender broadcasts be longer than one cycle of the receiver's broadcasting sequence, the broadcast is guaranteed to be successful. □
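As a sanity check of Theorem 1, the short Python sketch below samples random downsized channel sets with w_s ≤ w_r and verifies that, whenever the two sets intersect, the sender and the receiver meet on a common channel within w_r^2 slots. It is an illustrative test harness only; the channel pool size and set sizes are assumptions, not values from the paper.

```python
import random

def sender_seq(chs):                      # Algorithm 1 sketch (fixed order)
    w = len(chs)
    return [chs[i % w] for i in range(w * w)]

def receiver_seq(chs):                    # Algorithm 2 sketch (fixed order)
    w = len(chs)
    return [c for c in chs for _ in range(w)]

def meets_within(tx_set, rx_set):
    """True if sender and receiver share a channel in some slot within
    one receiver cycle (w_r^2 slots); the sender keeps repeating its cycle."""
    s, r = sender_seq(tx_set), receiver_seq(rx_set)
    return any(s[t % len(s)] == r[t] for t in range(len(r)))

random.seed(1)
for _ in range(10_000):
    pool = list(range(1, 11))             # assumed 10-channel pool
    w_r = random.randint(2, 5)
    w_s = random.randint(1, w_r)          # enforce w_s <= w_r
    rx = random.sample(pool, w_r)
    tx = random.sample(pool, w_s)
    if set(tx) & set(rx):                 # at least one common channel
        assert meets_within(tx, rx)       # the Theorem 1 guarantee
print("Theorem 1 held in all sampled cases")
```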
Then, how should w_s and w_r be determined? From Theorem 1, w_s ≤ w_r is a sufficient condition for a successful single-hop broadcast. Therefore, in order to satisfy this condition, a proper w_r needs to be selected by any SU that has not received the broadcast message, to ensure the reception of the broadcast message sent from any potential neighbor. Since w_r depends on w_s, and a SU receiver usually does not know which neighboring node is sending until it receives the broadcast message, it selects the largest initial w of all its one-hop neighbors as its w_r. That is, for a SU receiver v, w_r(v) = max{w(u) | u ∈ N(v)}. On the other hand, the sender uses its calculated initial w as w_s to broadcast. Therefore, the w_s selected by the actual sender is bound to be smaller than or equal to this w_r. Thus, according to Theorem 1, the single-hop broadcast is a guaranteed success as long as the sender and its receiver have at least one common channel between their downsized available channel sets.

To illustrate the operation discussed above, we consider the multi-hop scenario shown in Fig. 3. The initial w calculated by each SU before the broadcast starts, based on its local information, is shown. Every node also calculates the initial w of its one-hop neighbors. Without loss of generality, node A is assumed to be the source node. Based on Theorem 1, the values of w_r employed by each receiver can be obtained. For instance, since node B knows the initial w of its neighbors (i.e., w(A) = 3, w(D) = 4, and w(F) = 4), it selects the largest initial w as its own w_r (i.e., w_r(B) = 4). Similarly, we have w_r(C) = 4, w_r(D) = 3, w_r(E) = 4, and w_r(F) = 5. Then, all nodes except node A use their w_r to construct the broadcasting sequences based on Algorithm 2. On the other hand, since each sender uses its calculated initial w as w_s, we have w_s(A) = 3, w_s(B) = 3, w_s(C) = 5, w_s(D) = 4, w_s(E) = 2, and w_s(F) = 4. Then, if a node needs to broadcast a message, it uses its w_s to construct the broadcasting sequence based on Algorithm 1.

2.2 The Distributed Broadcast Scheduling Scheme
Next, we consider the broadcast scheduling issue in the multi-hop broadcast scenario. The goal of the proposed distributed broadcast scheduling scheme is to intelligently select SU nodes for rebroadcasting in order to achieve the shortest broadcast delay.

First, Fig. 4 shows the simulation results obtained using the parameters given in Section 5. From Fig. 4, we observe that the single-hop broadcast delay increases when w increases. Therefore, in a multi-hop broadcast scenario, if there are multiple intermediate nodes with the same child node, the intermediate node with the smallest w is selected to rebroadcast. If there is more than one intermediate node with the smallest w, all these nodes should rebroadcast, and a broadcast collision avoidance scheme (explained in detail in Section 2.3) is executed before they rebroadcast the message. The pseudo-code of the proposed scheduling scheme is shown in Algorithm 3, where node v has just received the broadcast message from node q and needs to decide whether to rebroadcast. Node q includes the calculated initial w of its one-hop neighbors in the broadcast message. Algorithm 3 indicates that each SU should know the locations of its one-hop neighbors (in order to obtain N(v)) and its two-hop neighbors (in order to obtain N(q) and d(u, k)). Once a node receives the message, it executes Algorithm 3 to decide whether or not it should rebroadcast. If it needs to rebroadcast, it uses its calculated initial w as w_s to construct the broadcasting sequence based on Algorithm 1. Thus, the message deliveries in the scenario of Fig. 3 are indicated by the arrows.

Fig. 3. A multi-hop broadcast scenario.

Fig. 4. The single-hop broadcast delay when w_s = w_r = w.
Algorithm 3: The Pseudo-Code of the Broadcast Scheduling Scheme for a SU Sender v.
Input: q, N(v), N(N(v)), {w(u) | u ∈ N(q)}. Output: decision on rebroadcasting.
1  C <- {w(v)};
2  if {k | k ∈ (N(v) - N(v) ∩ N(q))} ≠ ∅ then                 /* v has at least one receiver */
3      foreach k do
4          if {u | u ∈ N(q), d(u, k) ≤ r_c, u ≠ v} ≠ ∅ then   /* there are multiple paths from q to k */
5              foreach u do
6                  C <- {C, w(u)};
7              if w(v) = min C and |{e | e = min C}| = 1 then  /* v is the only node with the smallest w */
8                  return TRUE;
9              else if w(v) = min C and |{e | e = min C}| > 1 then  /* v is one of multiple nodes with the same smallest w */
10                 run Algorithm 4;
11                 return TRUE;
12             else
13                 return FALSE;                               /* v does not rebroadcast */
14         else
15             return TRUE;                                    /* v rebroadcasts the message */
16     else
17         return FALSE;
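The rebroadcast decision can be expressed compactly in Python. The sketch below assumes simple dictionaries for neighbor sets, positions, and initial w values (all names and the example data are illustrative); it follows the logic of Algorithm 3, with the small simplification that the node the message came from is excluded from the receiver set.

```python
import math

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def schedule_rebroadcast(v, q, nbrs, pos, w_init, r_c):
    """Sketch of Algorithm 3: node v has just received the message from q.
    Returns 'REBROADCAST', 'REBROADCAST_WITH_AVOIDANCE' (run Algorithm 4 first),
    or 'DROP'."""
    # Neighbors of v not already covered by q (q itself excluded for simplicity).
    receivers = nbrs[v] - (nbrs[v] & nbrs[q]) - {q}
    if not receivers:
        return 'DROP'                                 # v has no receiver of its own
    C = [w_init[v]]                                   # multiset of competing initial w values
    contested = False
    for k in receivers:
        # Other intermediate nodes (neighbors of q) that can also reach k.
        rivals = [u for u in nbrs[q] if u != v and dist(pos[u], pos[k]) <= r_c]
        if rivals:
            contested = True
            C += [w_init[u] for u in rivals]
    if not contested:
        return 'REBROADCAST'                          # v is the only path to its receivers
    if w_init[v] == min(C):
        ties = C.count(min(C))
        return 'REBROADCAST' if ties == 1 else 'REBROADCAST_WITH_AVOIDANCE'
    return 'DROP'                                     # some rival has a smaller w

# Tiny made-up example:
nbrs = {'A': {'B', 'C'}, 'B': {'A', 'C', 'D'}, 'C': {'A', 'B', 'D'}, 'D': {'B', 'C'}}
pos  = {'A': (0, 0), 'B': (1, 0), 'C': (0, 1), 'D': (1, 1)}
w    = {'A': 3, 'B': 2, 'C': 2, 'D': 4}
print(schedule_rebroadcast('B', 'A', nbrs, pos, w, r_c=1.2))  # REBROADCAST_WITH_AVOIDANCE
```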
From the above design, it is noted that each SU (whether sending or receiving) follows the same rules, and no centralized entity or prior information about the sender is required. Thus, the proposed broadcast scheduling scheme is fully distributed. In addition, since the node with the smallest w is selected for rebroadcasting, the broadcast delay is the shortest. Moreover, because only a subset of the intermediate nodes is selected to rebroadcast, the number of intermediate nodes that need to forward the message is reduced. Thus, the probability that multiple senders broadcast to the same receiver simultaneously is reduced. Hence, the proposed broadcast scheduling scheme also contributes to broadcast collision avoidance.

2.3 The Broadcast Collision Avoidance Scheme
From Algorithm 3, if there are multiple intermediate nodes with the same child node, only the intermediate node with the smallest w should rebroadcast. However, if more than one intermediate node has the same smallest w, all these intermediate nodes should rebroadcast, and a broadcast collision may occur if these nodes deliver the messages on the same channel at the same time. For instance, in the example shown in Fig. 5, where node A is the source node, nodes B and C have the same w, which may lead to a broadcast collision when they rebroadcast simultaneously.

Fig. 5. The broadcast scenario where a broadcast collision may occur.

Most broadcast collision avoidance methods in traditional ad hoc networks assign different time slots to different intermediate nodes to avoid simultaneous transmissions. However, as explained in Section 1, these methods cannot be applied to CR ad hoc networks because the exact time at which the intermediate nodes receive the broadcast message is random. As a result, assigning different time slots to different intermediate nodes is very challenging. In addition, since the intermediate nodes use multiple channels for broadcasting, the channel on which a broadcast collision occurs is also unknown. To the best of our knowledge, no existing collision avoidance scheme can address these challenges in CR ad hoc networks.

In this paper, we propose a broadcast collision avoidance scheme for CR ad hoc networks. The main idea is to prohibit intermediate nodes from rebroadcasting on the same channel at the same time. Our proposed broadcast collision avoidance scheme works in a scenario where the intermediate nodes have the same parent node, as shown in Fig. 5. The procedure of the proposed broadcast collision avoidance scheme is summarized as follows:

Step 1: generating a default sequence. When a source node (e.g., node A in Fig. 5) broadcasts the message, it includes its own original available channel set in the message. Hence, if an intermediate node receives the message, it obtains the original available channel information of its parent node. Then, the intermediate node uses the first w available channels of its parent node to generate a default sequence, where w is its own calculated initial w (which may not be the same as the initial w of its parent node). If a channel in the default sequence is not available for this intermediate node, a void channel is assigned to replace the corresponding channel. For instance, if nodes B and C both obtain w = 3 and the original available channels of nodes A, B, and C are {1, 2, 3, 4, 5}, {2, 3, 4, 5}, and {1, 3, 4, 6}, respectively, nodes B and C only use the first three available channels of node A to generate their default sequences. Therefore, the default sequence of node B is {0, 2, 3} and the default sequence of node C is {1, 0, 3}, where 0 denotes a void channel. A node does not send anything on a void channel.

Step 2: circularly shifting the default sequence by a random number. Apart from the available channel set, the source node also includes a distinctive integer for each intermediate node v, randomly selected from [1, w(v)]. If there are more than w(v) intermediate nodes, the parent node randomly selects w(v) of them and assigns each a random integer. Only those intermediate nodes that acquire a random integer will rebroadcast the packet. Then, each intermediate node generates a new sequence from its default sequence using circular shift and the random integer. If we
denote the default sequence as DS and the random integer as R, the intermediate node performs a circular shift on DS R times (it makes no difference whether the shift is to the right or to the left). For instance, if nodes B and C get 3 and 1 as their random integers, respectively, the new sequences they generate by left circular shifting are {0, 2, 3} and {0, 3, 1}, respectively.

Step 3: forming the broadcasting sequence. Denote the starting time slot of the source node's broadcasting sequence as st and the time slot in which an intermediate node receives the broadcast message as rt. The source node includes its st in the broadcast message. Then, the intermediate node performs a circular shift on the new sequence generated in Step 2 for another (rt - st + 1) times. It repeats that sequence w(v) times to form a cycle of its broadcasting sequence.

The pseudo-code of the broadcast collision avoidance scheme is shown in Algorithm 4, where q is the source node and Circshift() is the circular shift function. To further elaborate the scheme, Fig. 6 shows an example of the proposed broadcast collision avoidance scheme. Without loss of generality, the starting time slot of the source node is 1. While nodes B and C have not yet received the broadcast message, they hop through the channels based on the broadcasting sequences generated by Algorithm 2. Then, nodes B and C receive the broadcast message at time slots 4 and 1, respectively. Based on Algorithm 4, and if the random integers for nodes B and C are 3 and 1, respectively, node B forms the broadcasting sequence {2, 3, 0, 2, 3, 0, 2, 3, 0} and node C forms the broadcasting sequence {3, 1, 0, 3, 1, 0, 3, 1, 0}. They then start rebroadcasting from time slots 5 and 2, respectively, using these broadcasting sequences. The underlined channels are those a node would hop on if it started from time slot 1.

Algorithm 4: The Pseudo-Code of the Broadcast Collision Avoidance Scheme for a SU v.
Input: q, L_q, L_v, st_q, rt_v, R_v, w(v). Output: BS'_v.
1  BS'_v <- empty;                          /* initialization */
2  i <- 1;
3  l <- 1;
4  while i <= w(v) do                       /* generating a default sequence */
5      j <- 1;
6      while j <= w(v) do
7          if L_v(i) = L_q(j) then
8              DS_v(j) <- L_q(j);
           j <- j + 1;
       i <- i + 1;
9  T_v <- Circshift(DS_v, R_v);             /* circular shifting */
10 while l <= w(v)^2 do                     /* forming a broadcast sequence */
11     BS'_v(l) <- T_v((l + (rt_v - st_q) + 1) mod w(v));
12     l <- l + 1;
13 return BS'_v;
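A small Python sketch of Steps 1-3 is given below. It reproduces the worked example above (nodes B and C with parent A); the helper names and the modular-shift convention are assumptions made for illustration, since the pseudo-code leaves some indexing details implicit.

```python
def default_sequence(parent_channels, own_channels, w):
    """Step 1: take the parent's first w channels; keep a channel only if
    the intermediate node also has it, otherwise insert a void channel (0)."""
    own = set(own_channels)
    return [c if c in own else 0 for c in parent_channels[:w]]

def circular_shift(seq, k):
    """Left circular shift by k positions."""
    k %= len(seq)
    return seq[k:] + seq[:k]

def collision_free_sequence(parent_channels, own_channels, w, R, st, rt):
    """Steps 1-3: build one cycle (w*w slots) of the rebroadcast sequence."""
    ds = default_sequence(parent_channels, own_channels, w)   # Step 1
    shifted = circular_shift(ds, R)                           # Step 2
    shifted = circular_shift(shifted, rt - st + 1)            # Step 3 (extra shift)
    return shifted * w                                        # repeat w times

# Worked example from the text: A's channels {1,2,3,4,5}, w = 3.
A = [1, 2, 3, 4, 5]
B = [2, 3, 4, 5]        # node B: random integer 3, receives at slot 4
C = [1, 3, 4, 6]        # node C: random integer 1, receives at slot 1
print(collision_free_sequence(A, B, w=3, R=3, st=1, rt=4))  # [2, 3, 0, 2, 3, 0, 2, 3, 0]
print(collision_free_sequence(A, C, w=3, R=1, st=1, rt=1))  # [3, 1, 0, 3, 1, 0, 3, 1, 0]
```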
Therefore, by constructing the broadcasting sequences from the same channel set (the channel set of the common parent node, node A) but circularly shifting them by different amounts for different nodes, the intermediate nodes are guaranteed not to send on the same channel at the same time. Thus, broadcast collisions can be avoided. In addition, the proposed broadcast collision avoidance scheme still works when the intermediate nodes are not synchronized: they can be synchronized based on the time stamp received from the common parent node. In this way, the time slots of the intermediate nodes are perfectly aligned, and broadcast collisions are resolved. A tradeoff of the proposed broadcast collision avoidance scheme is that fewer available channels are used for broadcasting, because some void channels may be assigned. However, the benefit (e.g., the increase in the successful broadcast ratio) gained from eliminating broadcast collisions outweighs the loss of a very small number of channels. Hence, the only issue left is the derivation of the initial w, which is introduced in Section 3.

Fig. 6. An example of the proposed broadcast collision avoidance scheme.

2.4 Protocol Flow Chart

In this section, we summarize the procedure of the proposed BRACER protocol. Fig. 7 shows the flow chart of the BRACER protocol. As shown in Fig. 7, before a broadcast starts, every SU node first calculates its own initial w and the initial w of its one-hop neighboring nodes using the two-hop location information. If this node is the source node, it uses its own initial w as its w_s and constructs the broadcasting sequence based on Algorithm 1. Then, it hops and broadcasts a message on each channel during one time slot following its sequence. On the other hand, if this node is not the source node, it is by default a receiver. It then uses the maximum w of its one-hop neighboring nodes as its w_r and constructs the broadcasting sequence based on Algorithm 2. It hops and listens on each channel during one time slot following its sequence. If the node receives the broadcast message from a sender, it runs the broadcast scheduling scheme based on Algorithm 3 to determine whether it needs to rebroadcast this message. If it needs to rebroadcast and there is only one smallest w, it uses its own w as w_s and runs Algorithm 1 to rebroadcast. If it needs to rebroadcast and there is more than one smallest w, it runs the broadcast collision avoidance scheme based on Algorithm 4 to rebroadcast the message.

Fig. 7. The flow chart of the proposed BRACER protocol.
3 THE DERIVATION OF THE VALUE OF w

In this section, we first introduce the network model we consider. Then, based on this model, we present the derivation process for the size of the downsized available channel set, w.

3.1 The Network Model

In this paper, we consider a CR ad hoc network where N SUs and K primary users (PUs) co-exist in an a × a area. PUs are evenly distributed within the area. The SUs opportunistically access M licensed channels. Each SU has a circular transmission range with a radius of r_c. The SUs within the transmission range are considered the neighboring nodes of the corresponding SU. That is, only when a SU receiver is within the transmission range of a SU transmitter is the signal-to-noise ratio (SNR) at the SU receiver considered acceptable for reliable communications. In addition, apart from broadcast collisions, other factors may also contribute to packet errors (e.g., channel quality, modulation schemes, and coding rate). However, in this paper, we only consider broadcast collisions as the cause of packet errors. We claim that this is a valid assumption in most broadcast scenarios [6], [7], [8], [9], [10], [11], [12], [14], [15], [16], [17], [29], [30].

Each SU also has a circular sensing range with a radius of r_s. That is, if a PU is currently active within the sensing range of a SU, the corresponding SU is able to detect its appearance. Since different SUs have different local sensing ranges which include different PUs, their acquired available channels may be different [31], [32]. In addition, because the available channels of a SU are obtained based on the sensing outcome within the sensing range, a SU is not allowed to communicate with other SUs outside its sensing range, since it might mistakenly use a channel occupied by a PU, which would result in interference to the PU. Therefore, in this paper, we assume that r_c ≤ r_s.

In this paper, we model the PU activity as an ON/OFF process, where the length of the ON period is the length of a PU packet. The lengths of the ON period and the OFF period can follow arbitrary distributions. We assume that each PU randomly selects a channel from the spectrum band to transmit one packet, which consists of multiple time slots. Moreover, because PUs at different locations can claim any channels for communications, the packets on the same channel do not necessarily belong to the same PU. This is a more practical scenario, as compared to some papers which assume that each channel is associated with a different PU. Under such a practical scenario, only those PUs that are within the sensing range of a SU and are active during the broadcast process contribute to the unavailable channels of the SU [18].

3.2 The Derivation of the Value of w

As explained in Section 2, the value of w is essential to ensure a successful single-hop broadcast. Denote the probability of a successful single-hop broadcast as P_succ(w), which is a function of w. Our goal is to obtain an appropriate w that satisfies the condition P_succ(w) ≥ 1 − ε, where ε is a small pre-defined value. From Theorem 1, the condition that at least one common channel exists between the downsized available channel sets of a SU pair is a necessary condition for a successful single-hop broadcast. Therefore, if we
denote the source SU of a single-hop broadcast as S_0 and the neighbors of S_0 as {S_1, S_2, ..., S_H}, where H is the number of neighbors, then P_succ(w) is equal to the probability that there is at least one common channel between S_0 and each of its neighbors in their downsized available channel sets.

3.2.1 The Single-Pair Scenario

We first calculate the probability that there is at least one common channel between the downsized available channel sets of S_0 and one of its neighbors S_i. The relative locations of the two SUs and their sensing ranges are shown in Fig. 8a. As illustrated in Fig. 8a, the sensing ranges are divided into three areas: A_1, A_2, and A_3. Note that PUs in different areas have a different impact on the channel availability of the two SUs. For instance, if a PU is active within A_3, the channel used by this PU is unavailable for both SUs. However, if a PU is active within A_1, the channel used by this PU is only unavailable for S_0. Thus, we first calculate the probability that a channel is available within each area, P_k, k ∈ [1, 2, 3]. The size of the total network area is denoted as A_L (i.e., A_L = a^2). Since the locations of the PUs are evenly distributed, the probability that p PUs are within A_k is

$$\Pr(p) = \binom{K}{p} \left(\frac{A_k}{A_L}\right)^p \left(\frac{A_L - A_k}{A_L}\right)^{K-p}, \qquad (1)$$

where $\binom{K}{p}$ represents the number of combinations of K items taken p at a time. In addition, we define the probability that a PU is active, r, as

$$r = \frac{E[\text{ON duration}]}{E[\text{ON duration}] + E[\text{OFF duration}]}, \qquad (2)$$

where E[·] represents the expectation of the random variable.

Fig. 8. The single-hop broadcast scenario.

Therefore, given that there are p PUs within A_k, the probability that b of them are active is

$$\Pr(b \mid p) = \binom{p}{b} r^b (1-r)^{p-b}. \qquad (3)$$

Furthermore, given that there are p PUs and b active PUs within A_k, the probability that there are c available channels is denoted as Pr(c | p, b). Since the number of available channels is only related to the number of active PUs, c is independent of p. In addition, since an active PU randomly selects a channel from the M channels in the band, Pr(c | p, b) is equivalent to the probability that there are exactly c empty boxes when b balls are randomly put into a total of M boxes and a box can hold more than one ball (because we do not limit a channel to only one PU). Thus, Pr(c | p, b) can be expressed as

$$\Pr(c \mid p, b) = \frac{\binom{M}{c} (M-c)!\, S(b, M-c)}{M^b}, \qquad c \in [\max(0, M-b), M], \qquad (4)$$

where S(b, M−c) is the Stirling number of the second kind, defined as

$$S(b, M-c) = \frac{1}{(M-c)!} \sum_{i=0}^{M-c} (-1)^i \binom{M-c}{i} (M-c-i)^b. \qquad (5)$$

Hence, the probability that there are c available channels and there are p PUs and b active PUs within A_k is the product of (1), (3), and (4). Then, the probability that a channel is available within A_k is obtained from (6):

$$P_k = \frac{1}{M} \sum_{p=0}^{K} \sum_{b=0}^{p} \sum_{c=\max(0, M-b)}^{M} c\, \frac{\binom{M}{c} (M-c)!\, S(b, M-c)}{M^b} \binom{p}{b} r^b (1-r)^{p-b} \binom{K}{p} \left(\frac{A_k}{A_L}\right)^p \left(\frac{A_L - A_k}{A_L}\right)^{K-p}. \qquad (6)$$
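To show how (1)-(6) are evaluated in practice, the following Python sketch computes P_k numerically. The parameter values (M, K, r, and the area ratio A_k/A_L) are illustrative assumptions, not the paper's experimental settings, and exact integer combinatorics make the sketch suitable only for small M and K.

```python
from math import comb, factorial

def stirling2(b, n):
    """Stirling number of the second kind S(b, n), cf. Eq. (5)."""
    if n == 0:
        return 1 if b == 0 else 0
    return sum((-1) ** i * comb(n, i) * (n - i) ** b for i in range(n + 1)) // factorial(n)

def pr_c_given_b(c, b, M):
    """Eq. (4): probability of exactly c empty boxes (available channels)
    when b balls (active PUs) are thrown into M boxes (channels)."""
    return comb(M, c) * factorial(M - c) * stirling2(b, M - c) / M ** b

def channel_available_prob(area_ratio, M, K, r):
    """Eq. (6): probability P_k that a given channel is available in an area
    covering a fraction area_ratio = A_k / A_L of the network."""
    total = 0.0
    for p in range(K + 1):                                   # PUs inside the area, Eq. (1)
        pr_p = comb(K, p) * area_ratio ** p * (1 - area_ratio) ** (K - p)
        for b in range(p + 1):                               # active PUs, Eq. (3)
            pr_b = comb(p, b) * r ** b * (1 - r) ** (p - b)
            for c in range(max(0, M - b), M + 1):            # available channels, Eq. (4)
                total += c * pr_c_given_b(c, b, M) * pr_b * pr_p
    return total / M                                         # expected fraction of idle channels

# Illustrative parameters (assumed): 10 channels, 40 PUs, activity r = 0.9,
# and an area A_k covering 5% of the network.
print(round(channel_available_prob(0.05, M=10, K=40, r=0.9), 4))
```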
Next, we consider the relationship between the downsized available channel sets of the two SUs. In our derivation, we only consider the scenario where the sender and its receiver have the same w (i.e., w_s = w_r). If w_r > w_s, the channels after the first w_s channels do not affect the number of common channels, so the derivation process is the same. Fig. 9 shows an example of the channel availability status of two SUs when w(S_0) = 3, where a shaded square indicates an idle channel and a white square indicates a busy channel. A square with a cross means that a channel can be either idle or busy. Since each SU only selects the first w available channels to form a downsized available channel set, the availability status of the channels after the first w available channels is not specified. Then, without loss of generality, we denote t and h as the indexes of the last available channel in the downsized available channel sets of S_0 and S_i, respectively. We first assume that t ≤ h. Hence, from channel 1 to t, there are four possible scenarios for every channel in terms of its availability for the two SUs. They are: 1) the channel is available for both SUs (denoted as C1); 2) the channel is unavailable for both SUs (denoted as C2); 3) the channel is only available for S_0 (denoted as C3); and 4) the channel is only available for S_i (denoted as C4). In addition, from channel t + 1 to h (if t < h), there are two possible scenarios: 1) the channel is available for S_i but can be in any status for S_0 (denoted as C5), and 2) the channel is unavailable for S_i but can be in any status for S_0 (denoted as C6). Based on Fig. 8a, the probabilities of the above six scenarios can be obtained: 1) P_C1 = P_1 P_2 P_3; 2) P_C2 = (1 − P_3) + (1 − P_1)(1 − P_2) P_3; 3) P_C3 = P_1 (1 − P_2) P_3; 4) P_C4 = (1 − P_1) P_2 P_3; 5) P_C5 = P_C1 + P_C4; and 6) P_C6 = P_C2 + P_C3.
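The six scenario probabilities follow directly from P_1, P_2, and P_3, as in the short sketch below. The numeric values of P_1-P_3 are placeholders; in the protocol they come from Eq. (6) applied to areas A_1-A_3.

```python
def scenario_probs(P1, P2, P3):
    """Probabilities of the six per-channel scenarios C1-C6 for a SU pair,
    given the availability probabilities P1, P2, P3 of areas A1, A2, A3."""
    pc1 = P1 * P2 * P3                          # available for both SUs
    pc2 = (1 - P3) + (1 - P1) * (1 - P2) * P3   # unavailable for both SUs
    pc3 = P1 * (1 - P2) * P3                    # available only for S0
    pc4 = (1 - P1) * P2 * P3                    # available only for Si
    pc5 = pc1 + pc4                             # available for Si, any status for S0
    pc6 = pc2 + pc3                             # unavailable for Si, any status for S0
    return pc1, pc2, pc3, pc4, pc5, pc6

# Placeholder values for illustration only:
probs = scenario_probs(0.7, 0.7, 0.8)
print(probs, "sum of C1..C4 =", round(sum(probs[:4]), 6))   # C1..C4 partition a channel's status
```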
Denote Z(0, i) as the number of common channels between S_0 and S_i in their downsized available channel sets. In order to obtain Pr(Z(0, i) = z), we need to consider all the combinations of the channel status for every channel from channel 1 to h. There are two possible cases: 1) t = h and 2) t < h. For the first case, channel h is a common channel between the two SUs. In addition, from channel 1 to channel h − 1, there must be z − 1 channels in scenario C1, h − 2w + z channels in C2, and w − z channels in each of C3 and C4. Since t = h, no channel is in scenario C5 or C6. Thus, the probability that there are z (z > 0) common channels in the first case is

$$P'(h) = \binom{h-1}{z-1} \binom{h-z}{w-z} \binom{h-w}{w-z} P_{C1}^{z} P_{C2}^{h-2w+z} P_{C3}^{w-z} P_{C4}^{w-z}. \qquad (7)$$

For the second case, since t < h, the common available channels can only lie between channel 1 and channel t. We denote the number of available channels for S_i from channel 1 to t as x. Thus, from channel 1 to t, similar to the first case, there are z channels in C1, t − w − x + z channels in C2, w − z channels in C3, and x − z channels in C4. In addition, from channel t + 1 to h, there are w − x channels in C5 and h − t − w + x channels in C6. Therefore, the probability that there are in total z common channels in this configuration is obtained from (8):

$$P''_1(h) = P_{C1}^{z} P_{C3}^{w-z} \sum_{t=w}^{h-1} \sum_{x=\max(0,\,w+t-h)}^{t-w} \binom{t-1}{w-1} \binom{w}{z} \binom{t-w}{x-z} \binom{h-t-1}{w-x-1} P_{C4}^{x-z} P_{C2}^{\,t-w-x+z} (P_{C1}+P_{C4})^{w-x} (P_{C2}+P_{C3})^{h-t-w+x}. \qquad (8)$$

If we switch S_0 and S_i in Fig. 9, we can obtain the probability for the dual case. Hence, the probability that there are z common channels in the second case is expressed in (9):

$$P''(h) = P_{C1}^{z} P_{C3}^{w-z} \sum_{t=w}^{h-1} \sum_{x=\max(0,\,w+t-h)}^{t-w} \binom{t-1}{w-1} \binom{w}{z} \binom{t-w}{x-z} \binom{h-t-1}{w-x-1} P_{C4}^{x-z} P_{C2}^{\,t-w-x+z} (P_{C1}+P_{C4})^{w-x} (P_{C2}+P_{C3})^{h-t-w+x} + P_{C1}^{z} P_{C4}^{w-z} \sum_{t=w}^{h-1} \sum_{x=\max(0,\,w+t-h)}^{t-w} \binom{t-1}{w-1} \binom{w}{z} \binom{t-w}{x-z} \binom{h-t-1}{w-x-1} P_{C3}^{x-z} P_{C2}^{\,t-w-x+z} (P_{C1}+P_{C3})^{w-x} (P_{C2}+P_{C4})^{h-t-w+x}. \qquad (9)$$

Fig. 9. An example of the channel availability status when w(S_0) = 3.

Therefore, the probability that there are z common channels among the first w available channels of each SU is

$$\Pr(Z(0, i) = z) = \sum_{h=2w-z}^{M} \big[ P'(h) + P''(h) \big]. \qquad (10)$$

Thus, the probability of a successful single-hop broadcast from S_0 to S_i is

$$P_{succ}(w) = 1 - \Pr(Z(0, i) = 0). \qquad (11)$$
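Evaluating (7)-(10) in closed form is somewhat involved, so the following sketch instead estimates P_succ(w) by Monte Carlo simulation: it repeatedly samples PU placements and activity following the network model of Section 3.1, derives each SU's available channel list, and checks whether the two downsized sets intersect. K, r, a, and r_s follow the values quoted for Fig. 10a; M and the trial count are assumed, so the numbers are only indicative of the trend.

```python
import math
import random

def estimate_p_succ(w, M=10, K=40, r=0.9, a=10.0, r_s=2.0, trials=10_000):
    """Monte Carlo estimate of P_succ(w) for one SU pair placed r_s apart."""
    s0 = (a / 2 - r_s / 2, a / 2)            # the two neighboring SUs
    s1 = (a / 2 + r_s / 2, a / 2)
    hits = 0
    for _ in range(trials):
        busy0, busy1 = set(), set()
        for _ in range(K):                                   # uniformly placed PUs
            x, y = random.uniform(0, a), random.uniform(0, a)
            if random.random() < r:                          # PU is active
                ch = random.randrange(M)                     # occupies one random channel
                if math.dist((x, y), s0) <= r_s:
                    busy0.add(ch)
                if math.dist((x, y), s1) <= r_s:
                    busy1.add(ch)
        avail0 = [c for c in range(M) if c not in busy0][:w]  # first w available channels
        avail1 = [c for c in range(M) if c not in busy1][:w]
        if set(avail0) & set(avail1):
            hits += 1
    return hits / trials

for w in (1, 2, 3, 4):
    print(w, round(estimate_p_succ(w), 3))   # P_succ(w) grows quickly with w
```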
Fig. 10a shows the analytical and simulation results of P_succ(w) in the single-pair scenario under various w and different M. To obtain these results, the number of PUs is K = 40 and the probability that a PU is active is r = 0.9. In addition, the side length of the network area is a = 10 (unit length), and the two neighboring SUs are at the border of each other's sensing range, where r_s = 2 (unit length). As shown in Fig. 10a, the simulation results match extremely well with the analytical results.

3.2.2 The Multi-Pair Scenario

We extend the above results to a multi-pair scenario, as shown in Fig. 8b, where S_i and S_j are two neighbors of S_0. Based on combinatorial mathematics, the probability of a successful broadcast in the multi-pair scenario shown in Fig. 8b is

$$P_{succ}(w) = 1 - \Pr(Z(0, i) = 0) - \Pr(Z(0, j) = 0) + \Pr(Z(0, i, j) = 0), \qquad (12)$$

where Pr(Z(0, i, j) = 0) is the probability that neither S_i nor S_j has any common channel with S_0 in their downsized available channel sets. Since the other two terms in (12) (i.e., Pr(Z(0, i) = 0) and Pr(Z(0, j) = 0)) can be obtained from (10), we only need to calculate Pr(Z(0, i, j) = 0). To calculate Pr(Z(0, i, j) = 0), we use the same idea as in the single-pair scenario. That is, we consider S_i and S_j together as one new neighboring node. The sensing range of the new neighboring node is the union of the sensing ranges of the two original nodes (i.e., the shaded area in Fig. 8b). Therefore, we can obtain new P_1, P_2, and P_3 for the multi-pair scenario based on the new size of the sensing range. Moreover, the probabilities of every channel-status scenario can also be obtained accordingly. Therefore, by using (7)-(10), we can calculate Pr(Z(0, i, j) = 0). Then, given the locations of its H neighbors, each SU can obtain the probability of a successful single-hop broadcast by performing the same procedure iteratively H times. Finally, by letting P_succ(w) ≥ 1 − ε, a proper w can be acquired for S_0. Fig. 10b shows the analytical and simulation results of P_succ(w) in the two-pair scenario under various w and different M. From Fig. 10b, the simulation results match very well with the analytical results.
4 DISCUSSION ON THE PROPOSED BRACER PROTOCOL

It is noted that our proposed BRACER protocol is particularly designed for broadcast scenarios in multi-hop CR ad hoc networks without a common control channel. As described in Sections 1 and 2, there are two implementation issues that are essential to the performance of our proposed distributed broadcast protocol: 1) the two-hop location information; and 2) the time synchronization. In this section, we provide a further discussion of these two issues.

4.1 Two-Hop Location Information

From Section 2, in our proposed BRACER protocol, every SU node needs the location information of its two-hop neighboring nodes in order to calculate the size of the downsized available channel sets of its one-hop neighboring nodes. Even though the localization issue for CR ad hoc networks is outside the scope of this paper, we introduce several solutions for obtaining the two-hop location information. Generally speaking, the location information in a traditional ad hoc network can be obtained either from external positioning techniques (e.g., the Global Positioning System (GPS) [33]) or from localization algorithms that do not use external positioning techniques [34], [35]. Hence, GPS is one option for obtaining the location information of the two-hop neighboring nodes in CR ad hoc networks. However, GPS requires additional hardware and consumes extra energy, which may not be efficient in CR ad hoc networks where cost and power constraints often apply.

On the other hand, a number of localization algorithms for CR ad hoc networks that do not rely on GPS have been proposed [36], [37]. In these works, the legacy localization algorithms proposed for traditional ad hoc networks, such as time-of-arrival (TOA)-based, angle-of-arrival (AOA)-based, and received-signal-strength (RSS)-based methods, are improved and adopted in CR ad hoc networks. These localization algorithms often require assistance from certain special nodes with known location information (named reference nodes). However, all these algorithms ignore the control message exchange issue between the reference nodes and the regular nodes in CR ad hoc networks. The control message exchange issue is either not considered or is simplified by using a common control channel. Based on Section 1, transmitting messages on a global common channel without any additional control information is not feasible in CR ad hoc networks. Therefore, in order to receive the control message containing the location information from the reference nodes, a communication mechanism that does not rely on any other control information (i.e., operates under blind information) between the reference nodes and the regular nodes is needed. As mentioned before, in [18], a QoS-based broadcast protocol under blind information is proposed. We can use this scheme as the communication scheme between the reference nodes and the regular nodes to obtain the two-hop location information. Since the broadcast protocol proposed in [18] can only support QoS provisioning, the successful broadcast ratio and average broadcast delay of this scheme for the whole network are not optimized. Therefore, this scheme is suitable for use only in the early stage of a broadcast procedure. After every node in the network acquires the two-hop location information, the proposed BRACER protocol can be executed.

Fig. 10. Analytical and simulation results of P_succ(w) under various w and different M.

4.2 Time Synchronization

From Section 1, an advantage of our proposed BRACER protocol is
that it does not require tight time synchronization. This advantage is important, since tight time synchronization is extremely difficult to achieve in a real ad hoc network system. In this paper, we define tight time synchronization as the scenario where the time slots of different nodes are precisely aligned. This means that the proposed BRACER protocol can guarantee the successful reception of a whole broadcast message even if the time slots of the sender and the receiver have an offset. Denote the length of the offset as d. Without loss of generality, d is less than a time slot. Based on Theorem 1, in order to guarantee a successful single-hop broadcast, w_s must be smaller than or equal to w_r. Thus, we consider the time synchronization issue under the following two scenarios.

4.2.1 Scenario I

w_s is strictly smaller than w_r. If w_s < w_r and the sender and the receiver have at least one common channel between their downsized available channel sets, we have the following theorem:

Theorem 2. If w_s < w_r, the single-hop broadcast is a guaranteed success within w_r^2 time slots even if the time slots of the sender and the receiver have an offset.

Proof. Similar to the proof of Theorem 1, if w_s < w_r, then during the w_r consecutive time slots for which the receiver stays on the same channel, every channel of the sender must appear at least once. More importantly, since d is less than a time slot, at least one whole time slot of the common channel between the sender and the receiver must be completely covered by the w_r consecutive time slots on the common channel. That is, the receiver can hear a whole time slot of the common channel when the sender broadcasts the message. Thus, a successful single-hop broadcast is guaranteed. □

Fig. 11 shows an example of Scenario I, where w_s < w_r. We assume that the time slots of the sender are ahead of those of the receiver by an offset of d. As illustrated in Fig. 11, in the 9th slot of the sender's broadcasting sequence, the sender and the receiver are on the same channel (i.e., channel 2). In addition, this time slot is completely covered by the three consecutive time slots in which the receiver is on channel 2. Hence, the broadcast message can be successfully received by the receiver.

4.2.2 Scenario II

w_s is equal to w_r. If w_s = w_r,
there are two sub-cases: 1) Case 1: a time slot of the common channel is completely covered by the w_r consecutive time slots of the receiver on the same channel; and 2) Case 2: a time slot of the common channel is only partially covered by the w_r consecutive time slots of the receiver on the same channel. Fig. 12 shows an example of Case 1 in Scenario II. Similar to Scenario I, the broadcast message can still be successfully received even if an offset exists.

On the other hand, Fig. 13 shows an example of Case 2 in Scenario II. This case occurs when the time slot of the common channel of the sender is partially covered by the first and the last time slots of the w_r consecutive time slots of the receiver. From communication theory, if a node only receives part of a packet, it cannot decode this packet correctly and will drop it at the physical (PHY) layer. Thus, even if the sender and the receiver have a common channel, the receiver cannot successfully receive the broadcast message within w_r^2 time slots in Case 2.

Fig. 11. An example of Scenario I when time slots are unsynchronized.

Fig. 12. An example of Case 1 in Scenario II when time slots are unsynchronized.

Fig. 13. An example of Case 2 in Scenario II when time slots are unsynchronized.

We provide two simple modifications of our proposed BRACER protocol for this case. The first is that the receiver always shifts the whole cycle of the broadcasting sequence one slot forward or one slot backward after it has hopped for one cycle (i.e., w_r^2 time slots) without receiving the broadcast message. At the same time, the total number of time slots over which the sender broadcasts needs to be longer than three cycles of the receiver's broadcasting sequence. That is, the sender broadcasts the message following its broadcasting sequence for ⌊3M^2/w_s^2⌋ + 1 cycles. In this way, Case 2 becomes Case 1. Then, even if the receiver does not receive the message within one cycle, it can still successfully receive the message in the following cycle, as shown in Fig. 14. On the other hand, the second modification is that the receiver v selects w_r(v) to be max{w(u) | u ∈ N(v)} + 1, where N(v) is the set of the neighboring nodes of the receiver v. Therefore, the w_r of the receiver is always larger than the w_s used by the sender. In this way, Case 2 becomes Scenario I. Based on Theorem 2, the successful broadcast is guaranteed within w_r^2 time slots, as shown in Fig. 15. To sum up, from the above analysis, our proposed BRACER protocol can be used in
an environment where tight timesynchronization is not required.5 PERFORMANCE EVALUATIONIn this section, we evaluate the
performance of the proposedbroadcast protocol. We consider two types of PU
trafficmodels in the simulation [38]. The first PU traffic model isdiscrete-time,
where the PU packet inter-arrival time followsthe biased-geometric distribution
[39]. The second PUtraffic model is continuous-time, where the PU packet
interarrivaltime follows the Pareto distribution [39]. We assumethat the
probability that a PU is active is fixed (i.e., r ¼ 0:9).In addition, the side length of
the network area a ¼ 10 (unitlength). We assume that the
radius of the sensing range andthe transmission range are the same (i.e., rs ¼ rc ¼ 2 (unitlength)). In this paper, we mainly investigate the
followingtwo performance metrics: 1) successful broadcast ratio: theprobability that all nodes in a network successfully
receivethe broadcast message and 2) average broadcast delay: theaverage duration from the moment a broadcast starts tothe
moment the last node receives the broadcast message.In addition, we compare our
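As a concrete illustration of these two metrics, the sketch below computes them from a set of simulation runs; the per-run records (whether every node received the message, and when the last node received it) are hypothetical inputs, not part of the protocol itself.

```java
/** Illustrative computation of the two metrics defined above from
 *  hypothetical per-run simulation records (not part of BRACER itself). */
public class BroadcastMetrics {

    /** One simulation run: whether every node received the broadcast message,
     *  and the slot in which the last node received it. */
    static class Run {
        final boolean allNodesReceived;
        final int lastReceptionSlot;
        Run(boolean allNodesReceived, int lastReceptionSlot) {
            this.allNodesReceived = allNodesReceived;
            this.lastReceptionSlot = lastReceptionSlot;
        }
    }

    /** Successful broadcast ratio: fraction of runs in which all nodes got the message. */
    static double successfulBroadcastRatio(Run[] runs) {
        int ok = 0;
        for (Run r : runs) if (r.allNodesReceived) ok++;
        return (double) ok / runs.length;
    }

    /** Average broadcast delay (in slots), averaged over the successful runs. */
    static double averageBroadcastDelay(Run[] runs) {
        int ok = 0;
        long total = 0;
        for (Run r : runs) if (r.allNodesReceived) { ok++; total += r.lastReceptionSlot; }
        return ok == 0 ? Double.NaN : (double) total / ok;
    }

    public static void main(String[] args) {
        Run[] runs = { new Run(true, 12), new Run(true, 15), new Run(false, 0) };
        System.out.printf("ratio = %.2f, delay = %.1f slots%n",
                successfulBroadcastRatio(runs), averageBroadcastDelay(runs));
    }
}
```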
In addition, we compare our proposed broadcast protocol with five other schemes: 1) Random+Flooding: each SU randomly selects a channel to hop and uses flooding (i.e., an SU is obligated to rebroadcast once it receives the message); 2) Sequence+Flooding (1/3 of our design): each SU downsizes its available channel set, constructs broadcasting sequences based on our scheme, and uses flooding; 3) Sequence+Schedule (2/3 of our design): each SU constructs broadcasting sequences based on our scheme and uses our broadcast scheduling scheme; 4) Basic QoS Scheme: each SU uses the basic scheme of the QoS-based broadcast protocol to broadcast [18]; and 5) JS+Flooding: each SU uses the jump-stay scheme [26] to construct the broadcasting sequences and uses flooding.

5.1 Successful Broadcast Ratio

Since the single-hop successful broadcast ratio depends on w, which is related to a pre-defined value, we set that value to 0.001; in fact, it can be an arbitrarily small value. Thus, based on Section 3, each SU calculates a proper w before the broadcast starts in our scheme, the Sequence+Flooding scheme, and the Sequence+Schedule scheme. Tables 2 and 3 show the simulation results of the successful broadcast ratio under different numbers of SUs and PUs, where the value in the upper cell is for the discrete-time PU traffic and the lower cell is for the continuous-time PU traffic. In Table 2, M = 20 and K = 40. In Table 3, M = 20 and N = 20. As shown in Tables 2 and 3, the successful broadcast ratio is higher than 99 percent under our proposed broadcast protocol in all scenarios.
In addition, the proposed broadcast protocol outperforms the other schemes in terms of a higher successful broadcast ratio. Since the jump-stay scheme requires that the i-th available channel in the available channel set is also channel i, it cannot utilize the technique in our scheme to downsize the original available channel set. In addition, the jump-stay scheme can guarantee rendezvous within 6MP(P − G), where P is the smallest prime number larger than M and G is the number of common channels between two SUs. Thus, in order to ensure a successful broadcast, each SU broadcasts the message for 6MP(P − G) slots. However, 6MP(P − G) is usually a very large number when M is large.

Fig. 14. An example of the first way of modification for Case 2 in Scenario II when time slots are unsynchronized.
Fig. 15. An example of the second way of modification for Case 2 in Scenario II when time slots are unsynchronized.

TABLE 2
Successful Broadcast Ratio under Different Numbers of SUs (M = 20, K = 40)
(upper value: discrete-time PU traffic; lower value: continuous-time PU traffic)

Scheme              N = 5    N = 10   N = 15   N = 20   N = 25
Random+Flooding     0.8801   0.8180   0.8100   0.8726   0.8821
                    0.8630   0.9148   0.9075   0.8698   0.8708
Sequence+Flooding   0.9849   0.9839   0.9828   0.9823   0.9863
                    0.9762   0.9769   0.9777   0.9773   0.9719
Sequence+Schedule   0.9859   0.9864   0.9823   0.9857   0.9855
                    0.9812   0.9845   0.9849   0.9876   0.9861
Basic QoS Scheme    0.8915   0.9022   0.8543   0.9314   0.9317
                    0.8739   0.8386   0.8952   0.8498   0.8624
Proposed Scheme     0.9991   0.9973   0.9969   0.9982   0.9909
                    0.9994   0.9959   0.9954   0.9967   0.9951

TABLE 3
Successful Broadcast Ratio under Different Numbers of PUs (M = 20, N = 20)
(upper value: discrete-time PU traffic; lower value: continuous-time PU traffic)

Scheme              K = 20   K = 30   K = 40   K = 50   K = 60
Random+Flooding     0.8189   0.8326   0.8842   0.9208   0.8907
                    0.7980   0.8738   0.9191   0.9139   0.8849
Sequence+Flooding   0.9866   0.9863   0.9823   0.9819   0.9871
                    0.9742   0.9765   0.9773   0.9711   0.9797
Sequence+Schedule   0.9868   0.9872   0.9857   0.9881   0.9872
                    0.9874   0.9885   0.9876   0.9833   0.9850
Basic QoS Scheme    0.9502   0.9167   0.9314   0.8222   0.7884
                    0.8950   0.8921   0.8498   0.8792   0.8463
Proposed Scheme     0.9978   0.9976   0.9982   0.9951   0.9921
                    0.9946   0.9941   0.9967   0.9977   0.9969
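To get a sense of scale (the number of common channels G is a hypothetical value here, since it depends on the particular pair of SUs): with M = 20, the smallest prime larger than M is P = 23, and taking G = 5 gives

```latex
6MP(P-G) = 6 \times 20 \times 23 \times (23-5) = 49{,}680 \ \text{time slots},
```

which is orders of magnitude larger than the broadcast delays reported in Tables 4 and 5.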
Hence, to better illustrate the trade-off between the successful broadcast ratio and the broadcast delay, we compare our scheme with JS+Flooding in Section 5.2.

5.2 Average Broadcast Delay

Tables 4 and 5 show the simulation results of the average broadcast delay under different numbers of SUs and PUs. Similarly to the successful broadcast ratio, in Table 4, M = 20 and K = 40. In Table 5, M = 20 and N = 20. As shown in Tables 4 and 5, the proposed broadcast protocol outperforms the other schemes in terms of a shorter average broadcast delay. Furthermore, Figs. 16 and 17 show the average broadcast delay under different numbers of channels when N = 10 and K = 40. As explained in Section 1, besides our proposed scheme, we also compare with JS+Flooding and with our scheme without downsizing the available channel set (i.e., w = M). It is shown that even though the successful broadcast ratio is similar, the broadcast delay under JS+Flooding is much longer than under our proposed scheme.

To sum up, our proposed broadcast protocol outperforms Random+Flooding in terms of a higher successful broadcast ratio and a shorter broadcast delay. It also outperforms JS+Flooding in terms of a shorter broadcast delay. In addition, even with the trade-off in our proposed broadcast collision avoidance scheme, as explained in Section 2.3, and its limited overhead, our proposed scheme and the schemes that use a part of our design (e.g., Sequence+Flooding) can still achieve better performance than Random+Flooding for both metrics and than JS+Flooding for the broadcast delay.

5.3 The Impact of Unsynchronized Time Slots

From the discussion in Section 4.2, our proposed BRACER protocol has the advantage that tight time synchronization is not required. Accordingly, we provide two modifications of our proposed protocol when time slots are unsynchronized. In this section, we evaluate the impact of unsynchronized time slots on the performance of the proposed BRACER protocol.

Figs. 18 and 19 show the single-hop successful broadcast ratio and the average broadcast delay under different scenarios. In the first modification, we let w_s = w_r = w, whereas in the second modification, we let w_s = w and w_r = w + 1. It is shown that unsynchronized scenarios usually lead to a lower successful broadcast ratio and a longer average broadcast delay than the synchronized scenario. However, with the modifications of our proposed protocol, the low successful broadcast ratio can be significantly improved. From the figures, we can see that the second modification outperforms the first modification in terms of a higher successful broadcast ratio. However, it also results in a longer average broadcast delay than the first modification.
Fig. 16. Successful broadcast ratio under different numbers of channels.
Fig. 17. Average broadcast delay under different numbers of channels.

TABLE 4
Average Broadcast Delay under Different Numbers of SUs (M = 20, K = 40)
(upper value: discrete-time PU traffic; lower value: continuous-time PU traffic)

Delay (unit: slots)  N = 5    N = 10   N = 15   N = 20   N = 25
Random+Flooding      19.781   26.483   28.003   29.252   31.203
                     20.981   23.765   27.686   33.153   32.883
Sequence+Flooding    8.458    11.168   12.744   14.243   15.909
                     7.712    11.799   12.903   14.534   17.257
Sequence+Schedule    7.811    10.995   13.324   13.896   15.823
                     7.155    11.457   13.553   14.551   15.078
Basic QoS Scheme     15.576   19.642   26.447   22.745   24.599
                     16.093   23.164   21.698   26.834   32.078
Proposed Scheme      7.066    10.532   12.259   13.353   15.198
                     6.545    11.097   12.786   13.639   14.801

TABLE 5
Average Broadcast Delay under Different Numbers of PUs (M = 20, N = 20)
(upper value: discrete-time PU traffic; lower value: continuous-time PU traffic)

Delay (unit: slots)  K = 20   K = 30   K = 40   K = 50   K = 60
Random+Flooding      29.189   31.459   25.737   25.361   24.243
                     34.547   30.629   27.617   28.424   26.399
Sequence+Flooding    13.918   14.886   14.243   14.649   14.259
                     14.413   13.958   14.534   14.867   14.389
Sequence+Schedule    12.747   14.206   13.896   14.361   14.014
                     13.652   14.086   14.551   14.521   14.237
Basic QoS Scheme     25.148   25.187   22.745   27.182   28.533
                     29.111   24.931   26.834   24.639   24.907
Proposed Scheme      12.322   13.555   13.352   14.279   13.597
                     13.249   13.401   13.639   13.335   13.471

Fig. 18. The impact of unsynchronized time slots on the single-hop successful broadcast ratio.
Furthermore, when w > 5, the performance of the two modifications is very close to the unsynchronized scenario without modification. This is because when w is large enough, more than one common channel exists between the sender and the receiver. Thus, there is at least one time slot on a common channel that is completely covered by the w_r consecutive time slots. Hence, the receiver can successfully receive the message without any modification.

Figs. 20 and 21 show the multi-hop successful broadcast ratio and average broadcast delay under different scenarios. It is illustrated in Fig. 20 that when the number of SUs is small (e.g., N < 20), the synchronized scenario outperforms all the unsynchronized scenarios in terms of a higher successful broadcast ratio. This is because when N is small, each SU usually selects a small w for broadcasting. Thus, from Fig. 18, the successful broadcast ratio of the unsynchronized scenarios is lower than that of the synchronized scenario. However, when N is large (e.g., N > 20), the unsynchronized scenarios with both modifications outperform the synchronized scenario in terms of a higher successful broadcast ratio. This is because when N is large, a receiver often has more than one sender. These senders broadcast the message on different channels to the receiver. Thus, the impact of unsynchronized time slots is diminished.

Fig. 19. The impact of unsynchronized time slots on the single-hop average broadcast delay.
Fig. 20. The impact of unsynchronized time slots on the multi-hop successful broadcast ratio.
Fig. 21. The impact of unsynchronized time slots on the multi-hop average broadcast delay.

5.4 Broadcast Collision Analysis

In this section, we evaluate the performance of broadcast collisions for our proposed BRACER protocol. Since broadcast collisions usually lead to a waste of network resources, they should be efficiently avoided. In this paper, we use the average number of broadcast collisions in a broadcast procedure per SU node as the performance metric.

Fig. 22 shows the average number of broadcast collisions under different numbers of channels. It is illustrated that the Proposed Scheme outperforms the Sequence+Flooding and Sequence+Schedule schemes in terms of fewer broadcast collisions on average. This means that the broadcast collision avoidance scheme in the Proposed Scheme can effectively avoid broadcast collisions. In addition, the Proposed Scheme also incurs fewer broadcast collisions than the Random+Flooding scheme when M ≤ 20. That is, the Random+Flooding scheme performs better than the Proposed Scheme only when M is very large. This is because in the Random+Flooding scheme, each sender randomly selects an available channel in the band to broadcast. If the number of channels is large, the probability that two senders select the same channel is fairly low. However, when M is small, the Random+Flooding scheme leads to the highest number of broadcast collisions among the four schemes (e.g., M = 5). Even though the Random+Flooding scheme causes the fewest broadcast collisions when M is large, the successful broadcast ratio and average broadcast delay of the Random+Flooding scheme are not acceptable, as shown in Tables 2, 3, 4, and 5. Additionally, the Sequence+Schedule scheme performs better than the Sequence+Flooding scheme, as shown in Fig. 22. This means that our proposed distributed broadcast scheduling scheme also contributes to the collision avoidance.

Fig. 22. Average number of broadcast collisions under different numbers of channels when N = 10.

5.5 Overhead Analysis
Overhead is an important metric to evaluate the efficiency of a broadcast protocol. To evaluate the impact of overhead, we use normalized overhead as the performance metric [40], [41]. Normalized overhead is defined as the ratio of the total broadcast packets (in bits) propagated by every node in the network to the total broadcast packets (in bits) received by the receivers [40], [41].

We denote the length of the original broadcast packet as L_b. Based on Section 2.2, extra information needs to be added to the original broadcast packet in order to realize the proposed BRACER protocol. The extra information in a broadcast packet mainly consists of three parts. First of all, as mentioned in Section 2.2, the sender should include the calculated initial w of its one-hop neighbors in the broadcast message. Second, as described in Section 2.3, the sender should include its own channel availability information and the starting time slot of its broadcasting sequence in the message. Thirdly, the sender should include random integers for the intermediate nodes that need to rebroadcast to the same node. Thus, if we define the length of the initial w, the starting time slot, and the random integer as 8 bits each, the length of the total extra information in a broadcast packet (in bits) for a node is

Q = 8N_a + M + 8 + 8N_b,    (13)

where N_a is the number of one-hop neighbors of the node and N_b is the number of intermediate nodes that need to rebroadcast to the same node. Therefore, the total length of a broadcast packet of the proposed BRACER protocol is L_b + Q.
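As a small illustration of this bookkeeping (the neighbor counts and the totals in main() are hypothetical values, not taken from the paper), the sketch below computes Q from Eq. (13), the resulting packet length L_b + Q, and the normalized-overhead ratio as defined above.

```java
/** Illustrative per-node overhead bookkeeping for the BRACER packet format. */
public class OverheadSketch {

    /** Extra information length in bits, following Eq. (13): Q = 8*Na + M + 8 + 8*Nb. */
    static int extraBits(int neighborsNa, int channelsM, int rebroadcastersNb) {
        return 8 * neighborsNa + channelsM + 8 + 8 * rebroadcastersNb;
    }

    /** Normalized overhead: total bits propagated by all nodes divided by
     *  total bits received by the receivers. */
    static double normalizedOverhead(long bitsPropagated, long bitsReceived) {
        return (double) bitsPropagated / bitsReceived;
    }

    public static void main(String[] args) {
        int lb = 192;                   // AODV RREQ length used later in the text as L_b
        int q = extraBits(4, 20, 2);    // hypothetical: Na = 4 neighbors, M = 20, Nb = 2
        int packetBits = lb + q;        // total BRACER broadcast packet length
        System.out.println("Q = " + q + " bits, packet = " + packetBits + " bits");

        // Hypothetical totals from one broadcast procedure: 25 transmissions, 40 receptions.
        System.out.printf("normalized overhead = %.2f%n",
                normalizedOverhead(25L * packetBits, 40L * packetBits));
    }
}
```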
Fig. 23 shows the normalized overhead under different lengths of the original broadcast packet. We set the range of the original broadcast packet length to [50, 500] bits. Since broadcast packets are control packets, which are often very short, they mainly fall in this range. In addition, we compare our proposed scheme with the Sequence+Flooding and Sequence+Schedule schemes. The Random+Flooding scheme does not require the two-hop location information, so we exclude it for a fair comparison. From Section 2, the lengths of the extra information in a broadcast packet for the Sequence+Flooding and Sequence+Schedule schemes are Q = 0 and Q = 8N_a, respectively. Thus, the Proposed Scheme has the longest broadcast packets among the three schemes. Even though the Proposed Scheme carries the longest extra information in a packet, it outperforms the other two schemes in terms of lower normalized overhead, as shown in Fig. 23. The Proposed Scheme can achieve up to 10.6 and 12.5 percent less normalized overhead than the Sequence+Flooding and Sequence+Schedule schemes, respectively.

Fig. 24 shows the normalized overhead under different numbers of SUs. We use the AODV route request (RREQ) packet as a typical original broadcast packet (i.e., L_b = 192 bits) [42]. From Fig. 24, it is shown that the proposed BRACER broadcast protocol outperforms the other two schemes in terms of lower normalized overhead under various numbers of SUs. More importantly, when the number of SUs increases by 400 percent, the normalized overhead of the Proposed Scheme only increases by 115 percent. Thus, the scalability of the proposed BRACER protocol is satisfactory.

Fig. 23. Normalized overhead under different lengths of the original broadcast packet.
Fig. 24. Normalized overhead under different numbers of SUs when L_b = 192 bits.

6 CONCLUSION

In this paper, the broadcasting challenges specific to multi-hop CR ad hoc networks under practical scenarios with collision avoidance have been addressed for the first time. A fully-distributed broadcast protocol named BRACER is proposed without the existence of a global or local common control channel. By intelligently downsizing the original available channel set and designing the broadcasting sequences and broadcast scheduling schemes, our proposed broadcast protocol can provide a very high successful broadcast ratio while achieving very short broadcast delay. In addition, it can also avoid broadcast collisions. Simulation results show that our proposed BRACER protocol outperforms other possible broadcast schemes in terms of a higher successful broadcast ratio and a shorter average broadcast delay.

ACKNOWLEDGMENTS

This work was supported in part by the US National Science Foundation (NSF) under Grants No. CNS-0953644, CNS-1218751, and CNS-1343355. The authors would like to thank the anonymous reviewers for their constructive comments, which greatly improved the quality of this work.

[41] …, "networks," IEEE Pers. Commun., vol. 8, no. 1, pp. 16–28, Feb. 2001.
[42] C. E. Perkins, E. M. Belding-Royer, and S. Das, "Ad hoc on-demand distance vector (AODV) routing," Request for Comments (RFC) 3561, Internet Eng. Task Force (IETF), Jul. 2003.
Project Ideas
Statistical Dissemination Control in Large Machine-to-Machine Communication Networks
Cloud-based machine-to-machine (M2M) communications have emerged to achieve ubiquitous and autonomous data transportation for future daily life in the cyber-physical world. In light of the need for network characterizations, we analyze the connected M2M network in the machine swarm with geometric random graph topology, including its degree distribution, network diameter, and average distance (i.e., hops). Without the need for end-to-end information, which would incur catastrophic complexity, information dissemination appears to be an effective delivery approach in the machine swarm. To fully understand practical data transportation, a G/G/1 queuing network model is exploited to obtain the average end-to-end delay and the maximum achievable system throughput.
Furthermore,
as real applications may require dependable networking performance across the
swarm, quality of service (QoS) along with large network diameter creates a new
intellectual challenge. We extend the concept of small-world network to form
shortcuts among data aggregators as infrastructure-swarm two-tier heterogeneous
network architecture, and then leverage the statistical concept of network control
instead of precise network optimization, to innovatively achieve QoS
guarantees. Simulation results further confirm that the proposed heterogeneous network architecture effectively controls delay guarantees in a statistical way and facilitates a new design paradigm in reliable M2M communications.
1.2 INTRODUCTION:
Cloud-based machine-to-machine (M2M) communications have emerged to enable services through interaction between the cyber and physical worlds, achieving ubiquitous and autonomous data transportation among objects and the surrounding environment in our daily lives. A wireless network involving so many machines that end-to-end information cannot be made available at each machine is referred to as a large M2M network, and such networks are gaining importance in next-generation wireless systems. Since these machines have only short-range communication capabilities, multi-hop networking is a must for information dissemination over the machine swarm. Connectivity and low delivery latency in the machine swarm are consequently crucial to achieve reliable communications.
However, since the characteristics of large networks are not completely understood, effective traffic control for message delivery remains an open problem; a proper routing control scheme with a quality-of-service (QoS) guarantee on end-to-end delay becomes an urgent need to practically facilitate M2M communications. This is even more challenging due to the scalability of multi-hop ad hoc networks and the requirement of energy-efficient and spectrally efficient operation for each machine. To investigate routing mechanisms for large-scale networks, network topology can be studied through random network analysis, which provides a comprehensive view of network structure and functions from a complex-networks perspective. Aiming at social communities mediated by network technologies, prior work reviews the historical research on community analysis and community discovery methods in social media.
Prior work develops an unbiased sampling of users in an online social network by crawling the social graph, and further examines multiple underlying relations in such networks by introducing random-walk sampling. For social-network-related research, information-centric networking has been proposed, as it brings advantages to both the network operator and the end users. Exploring various research challenges in context management, a context management architecture suitable for social networking systems enhanced with pervasive features has been presented. Through a survey of current routing solutions, the trend toward social-based routing protocols, classified by the employed network graph, has been discussed.
In addition, employing social network analysis in message delivery pioneers the methodology of exploiting the small-world phenomenon of social networks for navigation, successfully creating transmissions with less delay. The small-world phenomenon plays a crucial role in social networks: it states that each individual in such a network links to others by a short chain of acquaintances, and it has great potential for improving spectral and energy efficiency by shortening the end-to-end delay. Prior work also presents a thorough examination of the average message delivery time for small-world networks in the continuum limit. Via random network analysis, the properties of the giant component in wireless multi-hop networks have been studied, while other work provides a heterogeneous structure for such networks and conducts throughput and delay analysis. Furthermore, the concepts of rumor and gossip routing algorithms are also widely employed in sensor networks, disconnected delay-tolerant MANETs, and generalized complex networks, which respectively provide social network analysis for information flow and epidemic information dissemination.
In this paper, inspired by small-world
phenomenon, we connect data aggregators (DAs) to machine swarm and propose a promising
two-tier heterogeneous architecture with the DAs' small-world network for statistical traffic
control in large M2M communication networks. To address efficient dissemination
control for routing and QoS such as surveillance applications, we first
analytically supply the condition to establish connected M2M networks and
explore some essential geometric properties (i.e., degree distribution, network
diameter, and average distance) for the networks. Analytic bounds of average
distance characterize the average number of hops that machines’ packets need to
traverse over the swarm, thus dominating the QoS guarantee capability for
reliable communications. Furthermore, through G/G/1 (i.e., both inter-arrival
time and service time distributions of a traffic queue are arbitrary
distributions) queuing network model for traffic modeling, the practical data
transportation takes place in connected M2M networks. Both the average
end-to-end delay and maximum achievable throughput per machine from information
dissemination in machine swarm multi-hop networking are examined.
1.3 LITERATURE SURVEY
TOWARD UBIQUITOUS MASSIVE ACCESS IN 3GPP MACHINE-TO-MACHINE COMMUNICATIONS
AUTHOR: S. Lien, K. C. Chen, and Y. Lin,
PUBLISH: IEEE Commun. Mag., vol. 49, no. 4, pp. 66–74, Apr. 2011.
EXPLANATION:
To enable full
mechanical automation where each smart device can play multiple roles among
sensor, decision maker, and action executor, it is essential to construct
scrupulous connections among all devices. Machine-to-machine communications
thus emerge to achieve ubiquitous communications among all devices. With the
merit of providing higher-layer connections, scenarios of 3GPP have been
regarded as the promising solution facilitating M2M communications, which is
being standardized as an emphatic application to be supported by LTE-Advanced.
However, distinct features in M2M communications create diverse challenges from
those in human-to-human communications. To deeply understand M2M communications
in 3GPP, in this article, we provide an overview of the network architecture
and features of M2M communications in 3GPP, and identify potential issues on
the air interface, including physical layer transmissions, the random access
procedure, and radio resources allocation supporting the most critical QoS
provisioning. An effective solution is further proposed to provide QoS
guarantees to facilitate M2M applications with inviolable hard timing
constraints.
SMALL-WORLD NETWORKS EMPOWERED LARGE MACHINE-TO-MACHINE COMMUNICATIONS
AUTHOR: L. Gu, S. C. Lin, and K. C. Chen
PUBLISH: IEEE WCNC, 2013, pp. 1–6.
EXPLANATION:
Cloud-based
machine-to-machine communications emerge to facilitate services through linkage
between cyber and physical worlds. In addition to great challenges in a large
network of machine/sensor swarm, effective network architecture involving
interconnection of wireless infrastructure and multi-hop ad hoc networking in
the machine swarm remains open. Inspired by the small-world phenomenon in
social networks, we may establish a short-cut path under heterogeneous network
architecture through wireless infrastructure and cloud, by connecting to data
aggregators or access points in the machine swarm, such that end-to-end delay
can be significantly reduced. Our mathematical analysis on network diameter and
average delay, along with verifications by simulations, demonstrate spectral
and energy efficiency of our proposed heterogeneous network architecture in
large machine-to-machine communication networks.
COGNITIVE MACHINE-TO-MACHINE COMMUNICATIONS: VISIONS AND POTENTIALS FOR THE SMART GRID
AUTHOR: Y. Zhang et al.,
PUBLISH: IEEE Netw., vol. 26, no. 3, pp. 6–13, May/Jun. 2012.
EXPLANATION:
Visual capability
introduced to Wireless Sensor Networks (WSNs) render many novel applications
that would otherwise be infeasible. However, unlike legacy WSNs which are commercially
deployed in applications, visual sensor networks create additional research
problems that delay the real-world implementations. Conveying real-time video
streams over resource constrained sensor hardware remains to be a challenging
task. As a remedy, we propose a fairness-based approach to enhance the event
reporting and detection performance of the Video Surveillance Sensor Networks.
Instead of achieving fairness only for flows or for nodes as investigated in
the literature, we concentrate on the whole application requirement.
Accordingly, our Event-Based Fairness (EBF) scheme aims at fair resource
allocation for the application level messaging units called events. We identify
the crucial network-wide resources as the in-queue processing turn of the
frames and the channel access opportunities of the nodes. We show that fair
treatment of events, as opposed to regular flow of frames, results in enhanced
performance in terms of the number of frames reported per event and the
reporting latency. EBF is a robust mechanism that can be used as a stand-alone
or as a complementary method to other possible performance enhancement methods
for video sensor networks implemented at other communication layers.
CHAPTER 2
2.0 SYSTEM ANALYSIS
2.1 EXISTING SYSTEM:
In existing approaches investigated in the literature, machine-to-machine communications emerge to facilitate services through linkage between the cyber and physical worlds. In addition to the great challenges in a large network of machine/sensor swarm, an effective network architecture involving interconnection of wireless infrastructure and multi-hop ad hoc networking in the machine swarm remains open. Inspired by the small-world phenomenon in social networks, a short-cut path may be established under a heterogeneous network architecture.
Previous discussions reveal an existing tradeoff, yet heterogeneous schemes are able to provide promising guaranteed throughput even under a strict QoS demand with tight τ. Moreover, Fig. 8 further provides an exhaustive throughput comparison among different scenarios to complete the evaluation. While the QoS-guaranteed throughput is upper-bounded by the maximum achievable throughput, a large throughput improvement is provided by the heterogeneous architecture as compared with the plain machine swarm.
QoS-aware fair resource allocation is applied to the application-level messaging units, called events. The crucial network-wide resources are identified as the in-queue processing turn of the frames and the channel access opportunities of the nodes. Fair treatment of events, as opposed to a regular flow of frames, results in enhanced performance in terms of the number of frames reported per event and the reporting latency, and it can be used as a stand-alone method or as a complement to other performance enhancement methods for video sensor networks implemented at other communication layers.
2.1.1 DISADVANTAGES:
- For a single source-destination pair, there exist a source machine, a destination machine, and several relay machines that forward traffic from the source to the destination.
- Without loss of generality, it is assumed that packet sequences follow a general arrival process and a general service-time distribution, and each transmission link is modeled as a queue.
- Such a queue represents a queuing system with a single server and infinite buffer size, in which inter-arrival times have a general (i.e., arbitrary) distribution and service times have a (possibly different) general distribution.
2.2 PROPOSED SYSTEM:
Machine-to-machine (M2M) communications emerge to operate autonomously and link interactions between the Internet cyber world and physical systems. We present the technological scenario of M2M communications, consisting of the wireless infrastructure to the cloud and a machine swarm of tremendous numbers of devices. Related technologies toward practical realization are explored to complete the fundamental understanding and engineering knowledge of this new communication and networking technology front. We connect data aggregators (DAs) to the machine swarm and propose a promising two-tier heterogeneous architecture with the DAs' small-world network for statistical traffic control in large M2M communication networks, addressing efficient dissemination control for routing and QoS in applications such as surveillance.
We first analytically supply the condition to establish connected M2M networks and explore some essential geometric properties (i.e., degree distribution, network diameter, and average distance) for the networks. Analytic bounds of average distance characterize the average number of hops that machines’ packets need to traverse over the swarm, thus dominating the QoS guarantee capability for reliable communications. Furthermore, through G/G/1 (i.e., both inter-arrival time and service time distributions of a traffic queue are arbitrary distributions) queuing network model for traffic modeling, the practical data transportation takes place in connected M2M networks.
Aiming at statistical performance in
large M2M networks, we propose a statistical control mechanism for the networks
by establishing the heterogeneous network architecture and exploiting statistical
QoS guarantees for end-to-end transmissions without the need of feedback control
at each link. By forming DA’s network with small-world property and linking
machines to DAs, this novel heterogeneous architecture significantly improves
the performance of end-to-end traffic within a tolerable delay and makes dependable
communications possible by guaranteeing traffic QoS, with extremely simple network
operation for each machine.
2.2.1 ADVANTAGES:
- To understand the geometric properties of large M2M networks and thus benchmark performance, we first analytically examine network connectivity, degree distribution, network diameter, and average distance under a Poisson point process (PPP) machine distribution.
- Introducing queuing network theory into such network analysis for practical data transportation, the average delay and achievable throughput for message delivery in connected M2M networks are analytically obtained as well as the QoS guaranteed throughput in real applications.
- Standing on hereby established analysis, statistical dissemination control is proposed that incorporates DA’s network with machine swarm (or sensor swarm) for favorable heterogeneous network architecture.
- Due to infeasible end-to-end information exchange and subsequent precise control, we exploit statistical QoS guarantees over two-tier heterogeneous network architecture to exhibit remarkable enhancement of system performance, and to facilitate the merits of small-world phenomenon into engineering reality.
2.3 HARDWARE & SOFTWARE REQUIREMENTS:
2.3.1 HARDWARE REQUIREMENT:
- Processor – Pentium IV
- Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Key Board – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA
2.3.2 SOFTWARE REQUIREMENTS:
- Operating System : Windows XP or Win7
- Front End : JAVA JDK 1.7
- Script : Java Script
- Tools : Netbeans 7
- Document : MS-Office 2007
CHAPTER 3
3.0 SYSTEM DESIGN:
Data Flow Diagram / Use Case Diagram / Flow Diagram:
- The DFD is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on these data, and the output data generated by the system.
- The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components. These components are the system process, the data used by the process, an external entity that interacts with the system and the information flows in the system.
- DFD shows how the information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.
- DFD is also known as bubble chart. A DFD may be used to represent a system at any level of abstraction. DFD may be partitioned into levels that represent increasing information flow and functional detail.
NOTATION:
SOURCE OR DESTINATION OF DATA:
External sources or destinations, which may be people or organizations or other entities
DATA SOURCE:
Here the data referenced by a process is stored and retrieved.
PROCESS:
People, procedures, or devices that produce data; the physical component is not identified.
DATA FLOW:
Data moves in a specific direction from an origin to a destination. The data flow is a “packet” of data.
There are several common modeling rules when creating DFDs:
- All processes must have at least one data flow in and one data flow out.
- All processes should modify the incoming data, producing new forms of outgoing data.
- Each data store must be involved with at least one data flow.
- Each external entity must be involved with at least one data flow.
- A data flow must be attached to at least one process.
3.1 ARCHITECTURE DIAGRAM
3.2 DATAFLOW DIAGRAM
UML DIAGRAMS:
3.3 USE CASE DIAGRAM:
3.4 CLASS DIAGRAM:
3.5 SEQUENCE DIAGRAM:
3.6 ACTIVITY DIAGRAM:
CHAPTER 4
4.0 IMPLEMENTATION:
GEOMETRIC RANDOM GRAPH (GRG) :
An M2M communication network consists of a tremendous number of self-organized machines/sensors and enables autonomous connections among different applications for ubiquitous communications over such a large swarm system. To put this scenario into practice, providing connectivity accompanied by reliable transportation is a must for such a large network. In the following, we highlight the relevant research and introduce the M2M network model using a geometric random graph (GRG) as its topology, since its local clustering property makes it suitable for benchmarking large wireless ad hoc sensor networks.
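A minimal sketch of this model is given below: machines are dropped in a square area according to a Poisson point process, and two machines are linked whenever they are within communication range r. The intensity, area size, and range used in main() are hypothetical values chosen only for illustration.

```java
import java.util.Random;

/** Minimal geometric-random-graph sketch: PPP machine placement in an
 *  a-by-a square, with links between machines closer than range r. */
public class GeometricRandomGraph {

    /** Drop machines according to a Poisson point process of the given intensity. */
    static double[][] dropMachines(double intensity, double a, Random rng) {
        // The machine count is Poisson(intensity * a * a); sample it by counting
        // unit-rate exponential inter-arrival steps that fit below the mean.
        double mean = intensity * a * a;
        int n = 0;
        for (double s = -Math.log(rng.nextDouble()); s < mean; s -= Math.log(rng.nextDouble())) n++;
        double[][] pos = new double[n][2];
        for (int i = 0; i < n; i++) {
            pos[i][0] = a * rng.nextDouble();
            pos[i][1] = a * rng.nextDouble();
        }
        return pos;
    }

    /** Average node degree of the GRG with communication range r. */
    static double averageDegree(double[][] pos, double r) {
        long links = 0;
        for (int i = 0; i < pos.length; i++)
            for (int j = i + 1; j < pos.length; j++)
                if (Math.hypot(pos[i][0] - pos[j][0], pos[i][1] - pos[j][1]) <= r) links++;
        return pos.length == 0 ? 0 : 2.0 * links / pos.length;
    }

    public static void main(String[] args) {
        double[][] machines = dropMachines(1.0, 10.0, new Random(7)); // hypothetical parameters
        System.out.println("machines = " + machines.length
                + ", average degree = " + averageDegree(machines, 2.0));
    }
}
```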
Without the need for end-to-end information, which would incur catastrophic complexity, information dissemination becomes the only practical way to deliver messages in the machine swarm. We exploit an open G/G/1 queuing network model for the delay and throughput analysis of M2M networks. Furthermore, the diffusion approximation is used to analyze the queuing network. Our analytical methodology deals with wireless networks whose inter-arrival and service time distributions are general, providing closed-form expressions for the end-to-end delay and the maximum achievable throughput per node. In the following, to fully understand practical data transportation, we present the traffic model and an equivalent queuing network model in connected M2M networks.
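The sketch below is not the paper's diffusion approximation; it uses the well-known Kingman (Allen–Cunneen style) G/G/1 waiting-time approximation to show how a mean end-to-end delay over h hops can be estimated from per-hop arrival and service statistics. All numeric parameters in main() are hypothetical.

```java
/** Hedged sketch: Kingman-style G/G/1 approximation for per-hop delay,
 *  summed over hops to estimate a mean end-to-end delay. This is a standard
 *  textbook approximation, not the diffusion analysis used in the paper. */
public class GG1DelaySketch {

    /** Mean time in a G/G/1 queue (waiting + service).
     *  lambda: arrival rate, mu: service rate, ca2/cs2: squared coefficients
     *  of variation of inter-arrival and service times. Requires lambda < mu. */
    static double meanSojournTime(double lambda, double mu, double ca2, double cs2) {
        double rho = lambda / mu;                  // per-hop utilization
        double service = 1.0 / mu;                 // mean service time
        double waiting = (rho / (1.0 - rho)) * ((ca2 + cs2) / 2.0) * service;
        return waiting + service;
    }

    /** End-to-end delay over h statistically identical hops. */
    static double endToEndDelay(int hops, double lambda, double mu, double ca2, double cs2) {
        return hops * meanSojournTime(lambda, mu, ca2, cs2);
    }

    public static void main(String[] args) {
        // Hypothetical numbers: 6-hop route, arrival rate 0.5 packets/slot,
        // service rate 0.8 packets/slot, moderately bursty arrivals.
        System.out.printf("mean end-to-end delay = %.2f slots%n",
                endToEndDelay(6, 0.5, 0.8, 1.5, 1.0));
    }
}
```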
4.1 ALGORITHM
M2M ROUTING ALGORITHM:
Using the M2M routing algorithm, this paper studies the asymptotic performance of several statistical QoS requirements, such as end-to-end delay and maximum throughput, as well as the throughput under guaranteed delay, for a general forwarding scheme in an M2M network. More importantly, our previous work focuses on obtaining the traffic performance under a specific scenario setting, which simplifies the analysis but fails to maintain the same level of transmission quality when the scenario changes, e.g., when the network topology or the traffic pattern becomes different.
Proposed algorithms solve this challenge
through statistical dissemination control by leveraging the heterogeneous
network architecture. In particular, the upper layer of DAs’ network enables
shortcut transmissions to reduce the excess end-to-end delay from the long
route transmissions in the lower layer of machine swarm. A comprehensive
performance analysis upon such a heterogeneous architecture is also included in
this paper. With these accomplishments, we provide an original and significant
paradigm to facilitate M2M communications, practically realizing information
dissemination control to meet the need of time sensitive applications in next-generation
wireless standards.
4.2 MODULES:
NETWORK TOPOLOGY DESIGN:
SERVER CLIENT MODULE:
STATISTICAL QOS GUARANTEE:
M2M COMMUNICATION CONTROL:
END-TO-END DELAY ANALYSIS:
4.3 MODULE DESCRIPTION:
NETWORK TOPOLOGY DESIGN:
This module implements the wireless mesh based topology design: all nodes are placed at particular distances from each other, and packet data is transmitted and received fully wirelessly, without any cables. The distance between each node and the wireless sensors is computed against the transmission range so that all nodes become physically interconnected. The sink is at the center of the circular sensing area.
This module also handles node creation: more than 20 nodes are placed at particular distances, with wireless sensors placed in the intermediate area. Each node knows its location relative to the sink, and each node is programmed with the total number of nodes in the network.
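A small placement sketch for this module is shown below; the node count, area radius, and transmission range are placeholder values, not the project's configured settings. Nodes are placed in a circular area with the sink at the center, and each node records its distance to the sink and its neighbors within transmission range.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Minimal topology sketch: nodes in a circular area with the sink at the center;
 *  each node knows its distance to the sink and its in-range neighbors. */
public class TopologySketch {

    public static void main(String[] args) {
        int nodes = 20;            // placeholder node count
        double radius = 10.0;      // placeholder radius of the circular sensing area
        double txRange = 4.0;      // placeholder transmission range
        Random rng = new Random(1);

        double[][] pos = new double[nodes][2];
        for (int i = 0; i < nodes; i++) {
            // Uniform placement inside the circle (the sink is at the origin).
            double r = radius * Math.sqrt(rng.nextDouble());
            double theta = 2 * Math.PI * rng.nextDouble();
            pos[i][0] = r * Math.cos(theta);
            pos[i][1] = r * Math.sin(theta);
        }

        for (int i = 0; i < nodes; i++) {
            double distToSink = Math.hypot(pos[i][0], pos[i][1]);
            List<Integer> neighbors = new ArrayList<Integer>();
            for (int j = 0; j < nodes; j++)
                if (j != i && Math.hypot(pos[i][0] - pos[j][0], pos[i][1] - pos[j][1]) <= txRange)
                    neighbors.add(j);
            System.out.printf("node %2d: distance to sink = %.2f, neighbors = %s%n",
                    i, distToSink, neighbors);
        }
    }
}
```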
SERVER CLIENT MODULE:
Client-server computing or networking is a distributed application architecture that partitions tasks or workloads between service providers (servers) and service requesters, called clients. Often clients and servers operate over a computer network on separate hardware. A server machine is a high-performance host that runs one or more server programs which share their resources with clients. A client does not share its own resources but requests a server's content or service functions; clients therefore initiate communication sessions with servers, which await (listen for) incoming requests.
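As a minimal illustration of this module (the port number and the message are placeholders; this is not the project's actual implementation), a Java socket server that listens for incoming requests and a client that initiates the session could look like this:

```java
import java.io.*;
import java.net.*;

/** Minimal client-server sketch: the server listens for an incoming request,
 *  the client initiates the session, sends a line, and reads the reply. */
public class ClientServerSketch {

    static void runServer(int port) throws IOException {
        try (ServerSocket server = new ServerSocket(port);
             Socket client = server.accept();                     // await an incoming request
             BufferedReader in = new BufferedReader(new InputStreamReader(client.getInputStream()));
             PrintWriter out = new PrintWriter(client.getOutputStream(), true)) {
            out.println("ACK: " + in.readLine());                 // echo the client's message
        }
    }

    static String runClient(String host, int port, String message) throws IOException {
        try (Socket socket = new Socket(host, port);
             PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
             BufferedReader in = new BufferedReader(new InputStreamReader(socket.getInputStream()))) {
            out.println(message);                                 // the client initiates the exchange
            return in.readLine();
        }
    }

    public static void main(String[] args) throws Exception {
        Thread server = new Thread(new Runnable() {
            public void run() {
                try { runServer(5000); } catch (IOException e) { e.printStackTrace(); }
            }
        });
        server.start();
        Thread.sleep(200);                                        // give the server time to bind
        System.out.println(runClient("localhost", 5000, "hello")); // prints "ACK: hello"
        server.join();
    }
}
```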
STATISTICAL QOS GUARANTEE:
M2M COMMUNICATION CONTROL:
We consider machine-to-machine communication with a low data rate and energy cost, machine-to-DA communication with a medium data rate, and DA-to-DA communication with a high data rate. We adopt the related values shown in Table II and set up the experiment as follows. 1 Mb of data is sent from the source machine to the destination machine in both the plain machine swarm and the heterogeneous architecture separately. Moreover, the DAs' communication capabilities are characterized by the number of machines z that can be served simultaneously by each single DA.
As the number of machines within a DA's capability increases linearly, the required number of DAs for the heterogeneous architecture drops exponentially. This suggests that a few powerful DAs are preferable to many DAs with limited capability. Furthermore, Fig. 10 shows the average end-to-end delay with respect to different area sizes of the Metropolis. As the area size increases (and so does the number of machines in each block), the heterogeneous architecture supports much lower traffic delay than the plain machine swarm.
For example, with an area size of 60 km² and 10^8 machines, the delay from the heterogeneous architecture is 115 s as compared to 2,500 s from the swarm. Moreover, the linear curves in the log scale of Fig. 10(b) confirm our asymptotic results and suggest that the heterogeneous architecture outperforms the plain machine swarm with about 95% delay reduction for 10 billion machines. To conclude, by efficiently connecting a few DAs to construct small-world shortcuts, the proposed statistical control accompanied with the heterogeneous architecture resolves the undependable end-to-end transmissions.
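The roughly 95 percent figure quoted above is consistent with the 60 km² example:

```latex
1 - \frac{115\ \text{s}}{2500\ \text{s}} \approx 0.954 \quad (\text{about a } 95\% \text{ delay reduction}).
```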
END-TO-END DELAY ANALYSIS:
We compare the performance of the proposed heterogeneous network architecture with the plain machine swarm. Simulation results confirm that the heterogeneous architecture achieves remarkable delay reduction as well as high throughput gain with only a few DAs installed, which favors practical implementation in large M2M networks. All simulation parameters and value settings are listed in Table I. In particular, to ensure that every packet can be sent from its source to the corresponding destination, a connected M2M network is first established via the proposed analysis (i.e., by selecting the appropriate machine communication range r with respect to the total machine number n). When a source machine generates a packet, it routes the packet to a specific destination, uniformly selected among the other machines.
Moreover, for the plain machine swarm, the source simply hops forward based on sensing and relaying; for the heterogeneous architecture, it employs dissemination without selecting a particular DA. In the following, we first evaluate the average distance to the DAs and the end-to-end distance for the plain machine swarm and the heterogeneous architecture. Next, the end-to-end packet delay, the maximum system throughput, and the throughput under guaranteed delay are thoroughly examined for the different architectures and compared against simulation validation; the Metropolis setting is established to bring our design to an even more practical stage.
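One standard way to pick the communication range r so that a geometric random graph on n uniformly placed machines is connected with high probability is the classical connectivity threshold r ≈ sqrt((ln n + c)/(π n)) for a unit-area region, scaled by the side length of the deployment area. The constant c and the parameters in main() below are illustrative assumptions, not values from the paper.

```java
/** Hedged sketch: choose the machine communication range r from the classical
 *  geometric-random-graph connectivity threshold, scaled to an a-by-a area. */
public class RangeSelection {

    /** r such that pi * r^2 * n / a^2 = ln(n) + c, i.e. the expected number of
     *  neighbors per machine is about ln(n) + c (c is a safety margin). */
    static double connectivityRange(int n, double sideLength, double c) {
        return sideLength * Math.sqrt((Math.log(n) + c) / (Math.PI * n));
    }

    public static void main(String[] args) {
        int n = 500;              // hypothetical machine count
        double a = 10.0;          // hypothetical side length (unit length)
        System.out.printf("suggested r = %.3f unit length%n", connectivityRange(n, a, 1.0));
    }
}
```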
CHAPTER 8
8.1 CONCLUSION AND FUTURE WORK:
In this paper, we resolve the most critical challenge on providing statistical control for reliable information dissemination over large M2M communication networks. Examining network topology of M2M networks, the geometric properties of such large networks are well studied to analytically characterize message delivery over connected M2M networks.
Moreover, by leveraging the queuing network model, practical data transportation is modeled, and both the average end-to-end delay and the maximum achievable throughput for these connected networks become accessible. Based on the above explorations, the promising statistical control with a sophisticated small-world network of data aggregators, and thus the heterogeneous architecture, is proposed to establish shortcut paths among machine communications.
Performance evaluation verifies that, instead of exploiting a long concatenation of multi-hop transmissions in the machine swarm, our heterogeneous network architecture enables machines to communicate through an overlaid ultra-fast "highway", like the shortcuts in small-world networks, with the desired throughput. This is particularly crucial for next-generation networks with tremendous numbers of machines. Therefore, we successfully achieve reliable communications via our proposed methodology and facilitate novel traffic control in M2M communication networks.
Shared Authority Based Privacy-Preserving Authentication Protocol in Cloud Computing
Shared Authority Based Privacy-PreservingAuthentication Protocol in Cloud ComputingHong Liu, Student Member, IEEE, Huansheng Ning, Senior Member, IEEE,Qingxu Xiong, Member, IEEE, and Laurence T. Yang, Member, IEEEAbstract—Cloud computing is an emerging data interactive paradigm to realize users’ data remotely stored in an online cloudserver. Cloud services provide great conveniences for the users to enjoy the on-demand cloud applications without considering thelocal infrastructure limitations. During the data accessing, different users may be in a collaborative relationship, and thus datasharing becomes significant to achieve productive benefits. The existing security solutions mainly focus on the authentication torealize that a user’s privative data cannot be illegally accessed, but neglect a subtle privacy issue during a user challenging thecloud server to request other users for data sharing. The challenged access request itself may reveal the user’s privacy no matterwhether or not it can obtain the data access permissions. In this paper, we propose a shared authority based privacy-preservingauthentication protocol (SAPA) to address above privacy issue for cloud storage. In the SAPA, 1) shared access authority isachieved by anonymous access request matching mechanism with security and privacy considerations (e.g., authentication, dataanonymity, user privacy, and forward security); 2) attribute based access control is adopted to realize that the user can only accessits own data fields; 3) proxy re-encryption is applied to provide data sharing among the multiple users. Meanwhile, universalcomposability (UC) model is established to prove that the SAPA theoretically has the design correctness. It indicates that theproposed protocol is attractive for multi-user collaborative cloud applications.Index Terms—Cloud computing, authentication protocol, privacy preservation, shared authority, universal composabilityÇ1 INTRODUCTIONCLOUD computing is a promising information technologyarchitecture for both enterprises and individuals. Itlaunches an attractive data storage and interactive paradigmwith obvious advantages, including on-demand selfservices,ubiquitous network access, and location independentresource pooling [1]. Towards the cloud computing, atypical service architecture is anything as a service (XaaS),in which infrastructures, platform, software, and others areapplied for ubiquitous interconnections. Recent studieshave been worked to promote the cloud computing evolvetowards the internet of services [2], [3]. Subsequently, securityand privacy issues are becoming key concerns with theincreasing popularity of cloud services. Conventional securityapproaches mainly focus on the strong authenticationto realize that a user can remotely access its own data in ondemandmode. Along with the diversity of the applicationrequirements, users may want to access and share each other’sauthorized data fields to achieve productive benefits,which brings new security and privacy challenges for thecloud storage.An example is introduced to identify the main motivation.In the cloud storage based supply chain management,there are various interest groups (e.g., supplier, carrier, andretailer) in the system. Each group owns its users which arepermitted to access the authorized data fields, and differentusers own relatively independent access authorities. Itmeans that any two users from diverse groups shouldaccess different data fields of the same file. 
Security Optimization of Dynamic Networks with Probabilistic Graph Modeling and Linear Programming
Large organizations need rigorous security tools for analyzing potential vulnerabilities in their networks. However, managing large-scale networks with complex configurations is technically challenging. For example, organizational networks are usually dynamic, with frequent configuration changes such as changes in the availability and connectivity of hosts and other devices, and services added to or removed from the network. Network administrators also need to respond to newly discovered vulnerabilities by applying patches and modifications to the network configuration and security policies, or by utilizing defensive security resources to minimize the risk from external attacks. For instance, to prevent a remote attack targeting a host, it is useful to analyze the candidate defensive strategies when choosing installation and runtime parameters for one or several intrusion prevention systems. To facilitate a scalable security analysis of organizational networks, attack graphs were proposed. Attack graphs show possible attack paths with respect to a particular network setting, and they provide the necessary elements for modeling and improving the security of the network.
Existing work utilizes attack graphs for analyzing security risks by quantifying attack graphs using a variety of techniques, such as Bayesian belief propagation, basic laws of probability, and vertex ranking algorithms. These models lack a systematic and scalable computation of optimized network configurations. Current attack graph quantification models assume a network with known and fixed configurations in terms of the connectivity, availability, and policies of the network services and components, disregarding the dynamic nature of modern networks. Moreover, except for a few attempts, previous work has solely focused on computing a numerical representation of the risk without addressing the more challenging problem of risk management and reduction.
In this paper, we present a rigorous probabilistic model that measures the security risk as the probability of success in an attack. Our probabilistic model, referred to as the success measurement model, has three main features: (i) a rigorous and scalable model with a clear probabilistic semantic, (ii) computation of risk probabilities with the goal of finding the maximum attack capabilities, and (iii) consideration of dynamic network features and the availability of mobile devices in the network. As an application of our success measurement model, we formalize the problem of utilizing network security resources as an optimization problem with the goal of computing an optimal placement of security products across a network. Our new contribution is to define this optimization problem and to provide an efficient algorithm based on a standard technique called sequential linear programming. Our algorithm is proven to converge, and it is scalable to large networks with thousands of components and attack paths.
Our contributions in this paper include:
• A scalable probabilistic model that uses a Bernoulli model to measure the risk in terms of the probability of success to achieve an attack goal.
• An efficient security optimization model, generated based on a quantified attack graph, to compute an optimal placement of security products according to organizational and technical constraints.
• Modeling dynamic network features for a realistic and accurate analysis of the risk associated with modern networks.
The results of our experiments confirm three key properties of our model. First, the vulnerability values computed from our model are accurate. Our manual inspection of the results confirms that the probability values obtained in the experiments correlate to the vulnerabilities of components in the network. Second, our security improvement method efficiently finds the optimal placement of security products subject to constraints. Third, we quantify the additional vulnerabilities introduced by mobile devices of a dynamic network. Our results indicate that an infected mobile device within the trusted region creates a preferred attack direction towards the attack target, which increases the chance of success at the target host. Our implementation efficiently computes the probabilities throughout large attack graphs with quadratic execution performance.
1.3 LITERATURE SURVEY
DYNAMIC SECURITY RISK MANAGEMENT USING BAYESIAN ATTACK GRAPHS
AUTHOR: N. Poolsappasit, R. Dewri, and I. Ray
PUBLISH: IEEE Transactions on Dependable and Secure Computing, vol. 9, no. 1, pp. 61–74, Jan 2012.
EXPLANATION:
Security risk assessment and mitigation are two vital processes that need to be executed to maintain a productive IT infrastructure. On one hand, models such as attack graphs and attack trees have been proposed to assess the cause-consequence relationships between various network states, while on the other hand, different decision problems have been explored to identify the minimum-cost hardening measures. However, these risk models do not help reason about the causal dependencies between network states. Further, the optimization formulations ignore the issue of resource availability while analyzing a risk model. In this paper, we propose a risk management framework using Bayesian networks that enables a system administrator to quantify the chances of network compromise at various levels. We show how to use this information to develop a security mitigation and management plan. In contrast to other similar models, this risk model lends itself to dynamic analysis during the deployed phase of the network. A multi-objective optimization platform provides the administrator with all trade-off information required to make decisions in a resource-constrained environment.
TIME-EFFICIENT AND COST EFFECTIVE NETWORK HARDENING USING ATTACK GRAPHS
AUTHOR: M. Albanese, S. Jajodia, and S. Noel
PUBLISH: Dependable Systems and Networks (DSN), 2012 42nd Annual IEEE/IFIP International Conference on, June 2012
EXPLANATION:
Attack graph analysis has been established as a powerful tool for analyzing network vulnerability. However, previous approaches to network hardening look for exact solutions and thus do not scale. Further, hardening elements have been treated independently, which is inappropriate for real environments. For example, the cost for patching many systems may be nearly the same as for patching a single one. Or patching a vulnerability may have the same effect as blocking traffic with a firewall, while blocking a port may deny legitimate service. By failing to account for such hardening interdependencies, the resulting recommendations can be unrealistic and far from optimal. Instead, we formalize the notion of hardening strategy in terms of allowable actions, and define a cost model that takes into account the impact of interdependent hardening actions. We also introduce a near-optimal approximation algorithm that scales linearly with the size of the graphs, which we validate experimentally.
MINIMUM-COST NETWORK HARDENING USING ATTACK GRAPHS
AUTHOR: L. Wang, S. Noel, and S. Jajodia
PUBLISH: Computer Communications, vol. 29, no. 18, pp. 3812–3824, Nov. 2006. [Online]. Available: http://dx.doi.org/10.1016/j.comcom.2006.06.018
EXPLANATION:
In defending one’s network against cyber attack, certain vulnerabilities may seem acceptable risks when considered in isolation. But an intruder can often infiltrate a seemingly well-guarded network through a multi-step intrusion, in which each step prepares for the next. Attack graphs can reveal the threat by enumerating possible sequences of exploits that can be followed to compromise given critical resources. However, attack graphs do not directly provide a solution to remove the threat. Finding a solution by hand is error-prone and tedious, particularly for larger and less secure networks whose attack graphs are overly complicated. In this paper, we propose a solution to automate the task of hardening a network against multi-step intrusions. Unlike existing approaches whose solutions require removing exploits, our solution is comprised of initially satisfied conditions only. Our solution is thus more enforceable, because the initial conditions can be independently disabled, whereas exploits are usually consequences of other exploits and hence cannot be disabled without removing the causes. More specifically, we first represent given critical resources as a logic proposition of initial conditions. We then simplify the proposition to make hardening options explicit. Among the options we finally choose solutions with the minimum cost. The key improvements over the preliminary version of this paper include a formal framework of the minimum network hardening problem, and an improved one-pass algorithm in deriving the logic proposition while avoiding logic loops.
CHAPTER 2
2.0 SYSTEM ANALYSIS
2.1 EXISTING SYSTEM:
Existing work utilizes attack graphs for analyzing security risks by quantifying attack graphs using a variety of techniques, such as Bayesian belief propagation, basic laws of probability, and vertex ranking algorithms. These models lack a systematic and scalable computation of optimized network configurations. Current attack graph quantification models assume a network with known and fixed configurations in terms of the connectivity, availability, and policies of the network services and components, disregarding the dynamic nature of modern networks. Moreover, except for a few attempts, previous work has solely focused on computing a numerical representation of the risk without addressing the more challenging problem of risk management and reduction.
Security risk assessment and mitigation are two vital processes that need to be executed to maintain a productive IT infrastructure. On one hand, models such as attack graphs and attack trees have been proposed to assess the cause-consequence relationships between various network states, while on the other hand, different decision problems have been explored to identify the minimum-cost hardening measures. However, these risk models do not help reason about the causal dependencies between network states.
Further, the optimization formulations ignore the issue of resource availability while analyzing a risk model. A risk management framework using Bayesian networks enables a system administrator to quantify the chances of network compromise at various levels and to use this information to develop a security mitigation and management plan. In contrast to other similar models, this risk model lends itself to dynamic analysis during the deployed phase of the network. A multi-objective optimization platform provides the administrator with all trade-off information required to make decisions in a resource-constrained environment.
2.1.1 DISADVANTAGES:
- Except for a few attempts, previous work has solely focused on computing a numerical representation of the risk without addressing the more challenging problem of risk management and reduction.
- Current attack graph quantification models assume a network with known and fixed configurations in terms of the connectivity, availability, and policies of the network services and components, disregarding the dynamic nature of modern networks.
- None of the previous work considers the effect of device availability on open networks. Furthermore, the computation of optimized network configurations and improvements, as presented in our work, has not been previously studied.
- Bayesian methods are powerful in computing unobserved facts, such as predicting possible threats. However, it remains unclear how Bayesian methods can be used to support variability in attackers’ decisions, device availability, and the effect of mobile devices.
2.2 PROPOSED SYSTEM:
We present a rigorous probabilistic model that measures the security risk as the probability of success in an attack. Our new contribution is to define the security resource placement optimization problem and to provide an efficient algorithm based on a standard technique called sequential linear programming. Our algorithm is proven to converge, and it is scalable to large networks with thousands of components and attack paths.
Our experiments confirm three key properties of our model. First, the vulnerability values computed from our model are accurate. Our manual inspection of the results confirms that the probability values obtained in the experiments correlate to the vulnerabilities of components in the network. Second, our security improvement method efficiently finds the optimal placement of security products subject to constraints. Third, we quantify the additional vulnerabilities introduced by mobile devices of a dynamic network. Our results indicate that an infected mobile device within the trusted region creates a preferred attack direction towards the attack target, which increases the chance of success at the target host. Our implementation efficiently computes the probabilities throughout large attack graphs with quadratic execution performance.
2.2.1 ADVANTAGES:
Our probabilistic model, referred to as the success measurement model, has the following main features:
- A rigorous and scalable model with a clear probabilistic semantic, and computation of risk probabilities with the goal of finding the maximum attack capabilities.
- Efficient security optimization model, generated based on a quantified attack graph, to compute an optimal placement of security products according to organizational and technical constraints.
- Considering dynamic network features and the availability of mobile devices in the network. As an application of our success measurement model, we formalize the problem of utilizing network security resources as an optimization problem with the goal of computing an optimal placement of security products across a network.
- Modeling dynamic network features for a realistic and accurate analysis of the risk associated with modern networks.
2.3 HARDWARE & SOFTWARE REQUIREMENTS:
2.3.1 HARDWARE REQUIREMENT:
- Processor – Pentium IV
- Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Key Board – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA
2.3.2 SOFTWARE REQUIREMENTS:
- Operating System : Windows XP or Win7
- Front End : JAVA JDK 1.7
- Back End : MS-Access 2007
- Document : MS-Office 2007
CHAPTER 3
3.0 SYSTEM DESIGN:
Data Flow Diagram / Use Case Diagram / Flow Diagram:
- The DFD is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on these data, and the output data generated by the system.
- The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components. These components are the system process, the data used by the process, an external entity that interacts with the system and the information flows in the system.
- DFD shows how the information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.
- A DFD may be used to represent a system at any level of abstraction, and may be partitioned into levels that represent increasing information flow and functional detail.
NOTATION:
SOURCE OR DESTINATION OF DATA:
External sources or destinations, which may be people or organizations or other entities
DATA STORE:
Here the data referenced by a process is stored and retrieved.
PROCESS:
People, procedures, or devices that produce data; the physical component is not identified.
DATA FLOW:
Data moves in a specific direction from an origin to a destination. The data flow is a “packet” of data.
There are several common modeling rules when creating DFDs:
- All processes must have at least one data flow in and one data flow out.
- All processes should modify the incoming data, producing new forms of outgoing data.
- Each data store must be involved with at least one data flow.
- Each external entity must be involved with at least one data flow.
- A data flow must be attached to at least one process.
3.1 ARCHITECTURE DIAGRAM
3.2 DATAFLOW DIAGRAM
UML DIAGRAMS:
3.3 USE CASE DIAGRAM:
3.4 CLASS DIAGRAM:
3.5 SEQUENCE DIAGRAM:
3.6 ACTIVITY DIAGRAM:
CHAPTER 4
4.0 IMPLEMENTATION:
ECSA ATTACK MODEL
Our probabilistic quantification model, referred to as the success measurement model, quantifies the vulnerabilities of networked components and resources by computing the expected chance of successful attack (ECSA) at every attack step, where each step is represented by an attack graph node. Our security improvement model uses the computed probabilities from the success measurement model to find optimal security defense strategies given a set of available options. The success measurement model requires three sets of inputs: a set of attack steps, a set of network configurations and potential vulnerabilities, and a set of ground facts. The first set includes the steps necessary to execute a targeted attack in a network.
These steps represent intermediate attack goals, such as compromising a machine that has internal connectivity with a targeted server. In addition, the attack steps also describe the various parallel choices available to an attacker when achieving a specific target. The second set includes the network configurations and vulnerability data that collectively provide host software installations, inter-host connectivity, running services and connections, and known or potential software vulnerabilities. The third set contains the ground fact values that describe the vulnerability, availability, and connectivity of various network configurations.
In our implementation, the first two sets of inputs (i.e., the attack steps and the network configuration data) are taken from dependency attack graphs. The system administrators use vulnerability assessment tools to explore the configurations and vulnerability data in their networks. The output of such an assessment is provided as an input to attack graph generation tools. Attack graph generation tools (such as MulVAL) often include customized predefined attack step rules that are applied to the configurations and vulnerability data of a network and produce a plain (that is, not quantified) attack graph.
Our model develops a set of ground fact values to bootstrap the computation of success probabilities throughout an attack graph. The output of the computation based on our success measurement model is the input to the security optimization model (Figure 1). Using the security improvement model, we transform the quantified attack graph from the success measurement model into a mathematical program.
The resulting mathematical program includes an additional set of data that represent various network security defense strategies. In the tool that we developed, the security administrators simply feed this information as logical predicates such as ips_installed(T, E), which describes a potential installation of an intrusion prevention system of type T and security effectiveness E. The effectiveness value E is a score estimated by the system administrator based on prior experiences and available effectiveness data.
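As an illustration only, the Java sketch below shows one plausible in-memory representation of a quantified attack graph node and of an administrator-supplied defense option corresponding to a predicate such as ips_installed(T, E). The class and field names are hypothetical and are not taken from the report or from MulVAL; they merely make the three kinds of inputs described above concrete.

import java.util.ArrayList;
import java.util.List;

// Hypothetical node of a dependency attack graph annotated for ECSA computation.
class AttackNode {
    enum Kind { LEAF, AND, OR }            // LEAF = ground fact; AND/OR = derived attack step
    final String label;                    // e.g., "execCode(webServer, root)"
    final Kind kind;
    final List<AttackNode> dependencies = new ArrayList<>();  // preconditions of this step
    double groundFactProbability;          // initial belief for LEAF nodes (Bernoulli parameter)
    double ecsa;                           // expected chance of successful attack, filled in later

    AttackNode(String label, Kind kind) { this.label = label; this.kind = kind; }
}

// Hypothetical representation of the logical predicate ips_installed(T, E):
// a candidate IPS deployment of type T with administrator-estimated effectiveness E.
class DefenseOption {
    final String ipsType;        // T
    final double effectiveness;  // E, a score estimated from prior experience and effectiveness data
    final double cost;           // used later by the optimization model's constraints

    DefenseOption(String ipsType, double effectiveness, double cost) {
        this.ipsType = ipsType;
        this.effectiveness = effectiveness;
        this.cost = cost;
    }
}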
We present our success measurement model to compute the expected chance of a successful attack on a network with respect to the attack’s ultimate goal. We first present the definitions of the expected chance of a successful attack (ECSA), followed by the description of an efficient method to compute ECSA values. Our success measurement model computes probabilities as a function of initial belief probabilities without the need for specifying the conditional probabilities required by Bayes’ theorem. Our model measures the success of an attacker based on the attack dependencies determined by a logical attack graph.
4.1 ALGORITHM
GNU LINEAR PROGRAMMING KIT
We implemented a tool for our computational procedures (Section 4.3) in Java (with approximately 3,500 lines of code). We use the GNU Linear Programming Kit (GLPK), a well-known open-source linear programming API, for our SLP-based procedure. Our tool parses an attack graph input file (obtained from MulVAL), computes the ECSA values according to various parameters, and performs security improvement analysis based on a set of improvement options and constraints.
We demonstrate the performance of our implementation. For each graph, we repeat the corresponding experiment to measure the time to compute the final expected chance of a successful attack at the graph’s root vertex. We compute ECSA values for the target graphs using our tool. We run our tool as a single-threaded program on a machine with a 2.4 GHz Intel Core i7 processor and 8 GB of DDR3 memory. All our experiments converged within at most 20 iterations towards the solution. On average, 87.99% of the execution time for Procedure 2 is spent on the Taylor expansion, of which on average 78.27% is spent on symbolic differentiation performed using the DJep Java library for symbolic operations. The Taylor expansion is parallelizable and scales with the number of vertices, and hence can be done efficiently offline.
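The report does not list the exact GLPK calls used by the tool. As a hedged sketch only, and assuming the standard glpk-java bindings (package org.gnu.glpk), a single linearized subproblem with one decision variable bounded between 0 and 1 could be set up and solved roughly as follows; the variable name and objective coefficient are arbitrary examples.

import org.gnu.glpk.GLPK;
import org.gnu.glpk.GLPKConstants;
import org.gnu.glpk.glp_prob;
import org.gnu.glpk.glp_smcp;

public class LpStepSketch {
    public static void main(String[] args) {
        // Create an LP with a single variable x in [0, 1] and maximize 0.7 * x.
        glp_prob lp = GLPK.glp_create_prob();
        GLPK.glp_set_prob_name(lp, "linearized-subproblem");
        GLPK.glp_set_obj_dir(lp, GLPKConstants.GLP_MAX);

        GLPK.glp_add_cols(lp, 1);
        GLPK.glp_set_col_name(lp, 1, "x");
        GLPK.glp_set_col_bnds(lp, 1, GLPKConstants.GLP_DB, 0.0, 1.0);
        GLPK.glp_set_obj_coef(lp, 1, 0.7);   // coefficient taken from the current linearization

        glp_smcp parm = new glp_smcp();
        GLPK.glp_init_smcp(parm);
        int ret = GLPK.glp_simplex(lp, parm); // returns 0 on success
        if (ret == 0) {
            System.out.println("x = " + GLPK.glp_get_col_prim(lp, 1));
        }
        GLPK.glp_delete_prob(lp);
    }
}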
SLP LINEAR ALGORITHM
For a network configuration w, let Gw be the corresponding attack graph. The complete procedure to compute the ECSA values of nodes (Definition 2) for an attack graph (Definition 1) is given next. To prepare the attack graph for computation, we execute the following procedure. Our procedure is based on a technique called sequential linear programming (SLP). SLP is a standard technique for solving nonlinear optimization problems, which is found to be computationally efficient and converges to an optimal solution.
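To make the overall flow of sequential linear programming concrete, the outline below shows one common form of the iteration: repeatedly linearize the nonlinear objective and constraints around the current point, solve the resulting LP, and stop when the iterates no longer move. The helper methods linearizeAround and solveLinearProgram are hypothetical placeholders (the latter could, for instance, be backed by the GLPK calls sketched above); this is a generic SLP outline, not the report’s exact procedure.

// Hedged outline of a sequential linear programming (SLP) loop; helper methods are hypothetical.
class SlpSolverSketch {
    static final int MAX_ITERATIONS = 20;
    static final double TOLERANCE = 1e-6;

    double[] solve(double[] initialGuess) {
        double[] current = initialGuess.clone();
        for (int iter = 0; iter < MAX_ITERATIONS; iter++) {
            LinearProgram lp = linearizeAround(current);   // first-order (Taylor) approximation
            double[] next = solveLinearProgram(lp);        // e.g., via GLPK's simplex solver
            if (maxAbsDifference(current, next) < TOLERANCE) {
                return next;                               // iterates converged
            }
            current = next;
        }
        return current;
    }

    static double maxAbsDifference(double[] a, double[] b) {
        double max = 0.0;
        for (int i = 0; i < a.length; i++) {
            max = Math.max(max, Math.abs(a[i] - b[i]));
        }
        return max;
    }

    // Placeholders for the problem-specific pieces.
    static class LinearProgram { /* objective coefficients, constraint matrix, bounds */ }
    LinearProgram linearizeAround(double[] point) { throw new UnsupportedOperationException(); }
    double[] solveLinearProgram(LinearProgram lp) { throw new UnsupportedOperationException(); }
}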
4.2 MODULES:
NETWORK SECURITY:
PROBABILISTIC MODEL:
GENERATING ATTACK GRAPH:
SECURITY OPTIMIZATION:
4.3 MODULE DESCRIPTION:
NETWORK SECURITY:
Network-accessible decoy resources may be deployed in a network as surveillance and early-warning tools, since such decoys are not normally accessed for legitimate purposes. Techniques used by attackers who attempt to compromise these decoy resources are studied during and after an attack in order to keep an eye on new exploitation techniques. Such analysis may be used to further tighten the security of the actual network being protected by the decoys. Decoy resources can also direct an attacker's attention away from legitimate servers, encouraging attackers to spend their time and energy on the decoy server while distracting their attention from the data on the real server. Like a decoy server, a decoy host is set up with intentional vulnerabilities; its purpose is also to invite attacks so that the attacker's methods can be studied and that information can be used to increase network security.
PROBABILISTIC MODEL:
Our probabilistic model, referred to as the success measurement model, has three main features: (i) it is a rigorous and scalable model with clear probabilistic semantics, (ii) it computes risk probabilities with the goal of finding the maximum attack capabilities, and (iii) it considers dynamic network features and the availability of mobile devices in the network.
Our probabilistic quantification model, referred to as the success measurement model, quantifies the vulnerabilities of networked components and resources by computing the expected chance of a successful attack (ECSA) at every attack step, which is represented by an attack graph node. Our security improvement model uses the computed probabilities from the success measurement model to find optimal security defense strategies given a set of available options.
A key requirement of probabilistic risk assessment is to accurately capture attack step dependencies and correlations. Attack dependencies in the form of attack preconditions are intrinsically captured by our model, because we base our analysis on attack graphs that are formed from the dependency relations among the nodes. Therefore, the probabilities of success are computed by considering the dependency relations determined in an attack graph.
- The focus of our experiments is to demonstrate the practicality, feasibility, and accuracy of the model.
- Our experiments include novel features such as analyzing networks with less studied but potentially vulnerable devices such as mobile devices and networked printers. To the best of our knowledge, the experiments in the network analysis literature lack this level of detail.
- Our model gives system administrators a solid analysis of the security of their networks, which assists in the actual implementation of security features to reduce the possibility of a successful attack.
GENERATING ATTACK GRAPH:
An attack graph may have several goal nodes whose dependency is a logical disjunction. In reality, this disjunction indicates that there are multiple attack choices for an attacker towards a specific attack goal. For instance, consider a server that has a local privilege escalation vulnerability (which is exploitable remotely in a multistep attack) and that runs a network service with multiple remote vulnerabilities. An attacker must exploit one (or more) of these vulnerabilities to gain privileges on the target server. In the absence of observable evidence, one needs to compute the ECSA of a goal node with a function that correctly captures the probabilities of such attack choices. Our approach is to computationally determine attack choice probabilities according to various attack patterns.
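As a purely illustrative example of weighting attack choices at such a disjunctive goal node, assume the two exploit paths in the scenario above have success values p1 and p2 and attacker-preference weights w1 and w2 that sum to one; a weighted combination then reflects the attacker's choice. The specific values and the simple weighted sum below are assumptions for illustration only, not the choice function actually used by the model.

// Illustrative weighting of attack choices at a disjunctive (OR) goal node.
// The weights model the attacker's preference between two exploit paths; the
// actual choice function used by the model is defined in the text, not here.
public class AttackChoiceExample {
    public static void main(String[] args) {
        double pLocalEscalation = 0.4;  // success value of the multistep local exploit
        double pRemoteService   = 0.7;  // success value of the remote service exploit
        double wLocal  = 0.3;           // attacker preference weights, summing to 1
        double wRemote = 0.7;

        double goalValue = wLocal * pLocalEscalation + wRemote * pRemoteService;
        System.out.println("Weighted OR-node value: " + goalValue); // 0.61
    }
}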
SECURITY OPTIMIZATION:
To achieve our main research goal of reducing the probability of a successful attack, and thus optimizing the overall security of the network, we model this problem as an optimization problem. Further, we model an important feature: the availability of machines in the network. In this section we describe these two contributions of our work, as summarized below.
To optimize the security of the network given a set of security hardening products (e.g., a host-based firewall), we compute an optimal distribution of these resources subject to given placement constraints. Using the rigorous probabilistic model introduced in Section 4.1, this is the first work in which a logical attack graph (Definition 1) is transformed into a system of linear and nonlinear equations with the global objective of reducing the probability of success on the graph's ultimate attack goal. This transformation is performed efficiently and naturally, and it directly captures our research goal.
Machine availability and the effect of mobile devices:
Our work is the first to show how to represent and assess devices with variable availability (frequently joining and leaving the network), which is one of the characteristics of mobile devices with variable connectivity. Given limited resources for hardening an organizational network, it is important to install a single security hardening product, or a combination of them, so that the expected chance of a successful attack on the network is minimized. To find the best placement of a set of security products in a network, we extend the attack graph to define a security product as a special fact node referred to as an improvement node, which is a fact node that represents a security hardening product, service, practice, or policy. The objective of solving the problem of optimal placement of security products is to compute the effects of various placements of one or more improvement nodes subject to certain constraints, and to choose the placement that minimizes the attack goal's ECSA value.
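A brute-force sketch of the placement idea is shown below: every subset of improvement nodes that satisfies a simple budget constraint is evaluated, and the placement minimizing the goal's value is kept. This is only an illustration of the objective; the actual tool solves the placement problem through the mathematical program described above rather than by enumeration, and the evaluate function here is a hypothetical stand-in for recomputing the goal's ECSA under a given placement.

import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Illustration of optimal placement of improvement nodes by exhaustive search.
// The real tool encodes this as a mathematical program instead of enumerating,
// and 'evaluate' stands in for recomputing the attack goal's ECSA for a placement.
public class PlacementSketch {

    static List<Integer> bestPlacement(int optionCount, int budget,
                                       Function<List<Integer>, Double> evaluate) {
        List<Integer> best = new ArrayList<>();
        double bestValue = evaluate.apply(best);
        // Enumerate all subsets of the improvement options.
        for (int mask = 1; mask < (1 << optionCount); mask++) {
            List<Integer> chosen = new ArrayList<>();
            for (int i = 0; i < optionCount; i++) {
                if ((mask & (1 << i)) != 0) chosen.add(i);
            }
            if (chosen.size() > budget) continue;          // placement constraint
            double value = evaluate.apply(chosen);
            if (value < bestValue) { bestValue = value; best = chosen; }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy evaluation: each chosen option halves the goal's success chance.
        Function<List<Integer>, Double> evaluate =
                chosen -> 0.8 * Math.pow(0.5, chosen.size());
        System.out.println("Best placement: " + bestPlacement(3, 2, evaluate));
    }
}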
CHAPTER 5
5.0 SYSTEM STUDY:
5.1 FEASIBILITY STUDY:
The feasibility of the project is analyzed in this phase, and a business proposal is put forth with a very general plan for the project and some cost estimates. During system analysis, the feasibility study of the proposed system is carried out to ensure that the proposed system is not a burden to the company. For feasibility analysis, some understanding of the major requirements for the system is essential.
Three key considerations involved in the feasibility analysis are
- ECONOMICAL FEASIBILITY
- TECHNICAL FEASIBILITY
- SOCIAL FEASIBILITY
5.1.1 ECONOMICAL FEASIBILITY:
This study is carried out to check the economic impact that the system will have on the organization. The amount of funds that the company can pour into the research and development of the system is limited, and the expenditures must be justified. The developed system is well within the budget, which was achieved because most of the technologies used are freely available; only the customized products had to be purchased.
5.1.2 TECHNICAL FEASIBILITY
This study is carried out to check the technical feasibility, that is, the technical requirements of the system. Any system developed must not place a high demand on the available technical resources, as this would lead to high demands being placed on the client. The developed system must have modest requirements, as only minimal or no changes are required for implementing this system.
5.1.3 SOCIAL FEASIBILITY:
This aspect of the study checks the level of acceptance of the system by the user. This includes the process of training the user to use the system efficiently. The user must not feel threatened by the system, but must instead accept it as a necessity. The level of acceptance by the users solely depends on the methods that are employed to educate the users about the system and to make them familiar with it. Their level of confidence must be raised so that they are also able to offer constructive criticism, which is welcomed, as they are the final users of the system.
5.2 SYSTEM TESTING:
Testing is the process of checking whether the developed system works according to the original objectives and requirements. It is a set of activities that can be planned in advance and conducted systematically. Testing is vital to the success of the system. System testing makes the logical assumption that if all the parts of the system are correct, the overall goal will be successfully achieved. Inadequate or absent testing leads to errors that may not appear until many months later.
This creates two problems: the time lag between the cause and the appearance of the problem, and the effect of the system errors on the files and records within the system. A small system error can conceivably explode into a much larger problem. Effective testing early in the process translates directly into long-term cost savings from a reduced number of errors. Another reason for system testing is its utility as a user-oriented vehicle before implementation. The best program is worthless if it does not produce the correct outputs.
5.2.1 UNIT TESTING:
Description | Expected result |
Test for application window properties. | All the properties of the windows are to be properly aligned and displayed. |
Test for mouse operations. | All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions. |
A program represents the logical elements of a system. For a program to run satisfactorily, it must compile and process test data correctly and tie in properly with other programs. Achieving an error-free program is the responsibility of the programmer. Program testing checks for two types of errors: syntax and logic. A syntax error is a program statement that violates one or more rules of the language in which it is written. An improperly defined field dimension or omitted keywords are common syntax errors. These errors are shown through error messages generated by the computer. For logic errors, the programmer must examine the output carefully.
5.2.2 FUNCTIONAL TESTING:
Functional testing of an application is used to prove that the application delivers correct results, using enough inputs to give an adequate level of confidence that it will work correctly for all sets of inputs. The functional testing will need to prove that the application works for each client type and that the personalization functions work correctly. When a program is tested, the actual output is compared with the expected output. When there is a discrepancy, the sequence of instructions must be traced to determine the problem. The process is facilitated by breaking the program into self-contained portions, each of which can be checked at certain key points. The idea is to compare program values against desk-calculated values to isolate the problems.
Description | Expected result |
Test for all modules. | All peers should communicate in the group. |
Test for various peers in a distributed network framework, as it displays all users available in the group. | The result after execution should give the accurate result. |
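As a small illustration of comparing actual output with expected output, the JUnit-style test below checks a trivial helper; the class and method names (Calculator, add) are invented for this example and do not come from the project code, and JUnit 4 is assumed to be on the classpath.

import static org.junit.Assert.assertEquals;
import org.junit.Test;

// Minimal functional-test illustration: compare actual output with expected output.
// 'Calculator' and 'add' are hypothetical names used only for this example.
public class CalculatorFunctionalTest {

    static class Calculator {
        int add(int a, int b) { return a + b; }
    }

    @Test
    public void addReturnsExpectedSum() {
        Calculator calculator = new Calculator();
        int actual = calculator.add(2, 3);
        assertEquals(5, actual);   // a discrepancy here would point to a logic error
    }
}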
5.2.3 NON-FUNCTIONAL TESTING:
Non-functional software testing encompasses a rich spectrum of testing strategies, describing the expected results for every test case. It uses symbolic analysis techniques. This testing is used to check that an application will work in the operational environment. Non-functional testing includes:
- Load testing
- Performance testing
- Usability testing
- Reliability testing
- Security testing
5.2.4 LOAD TESTING:
An important tool for implementing system tests is a load generator. A load generator is essential for testing quality requirements such as performance and stress. A load can be a real load; that is, the system can be put under real usage by having actual users connected to it, who will generate the test input data for the system test.
Description | Expected result |
It is necessary to ascertain that the application behaves correctly under loads when ‘Server busy’ response is received. | Should designate another active node as a Server. |
5.2.5 PERFORMANCE TESTING:
Performance tests are utilized in order to determine the widely defined performance of the software system such as execution time associated with various parts of the code, response time and device utilization. The intent of this testing is to identify weak points of the software system and quantify its shortcomings.
Description | Expected result |
This is required to assure that the application performs adequately, having the capability to handle many peers, delivering its results in the expected time and using an acceptable level of resources; it is an aspect of operational management. | Should handle large input values, and produce accurate results in the expected time. |
5.2.6 RELIABILITY TESTING:
Software reliability is the ability of a system or component to perform its required functions under stated conditions for a specified period of time, and it is ensured in this testing. Reliability can be expressed as the ability of the software to reveal defects under testing conditions, according to the specified requirements. It is the probability that a software system will operate without failure under given conditions for a given time interval, and it focuses on the behavior of the software element. It forms a part of software quality control.
Description | Expected result |
This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in provide the application. | In case of failure of the server an alternate server should take over the job. |
5.2.7 SECURITY TESTING:
Security testing evaluates system characteristics that relate to the availability, integrity and confidentiality of the system data and services. Users/Clients should be encouraged to make sure their security needs are very clearly known at requirements time, so that the security issues can be addressed by the designers and testers.
Description | Expected result |
Checking that the user identification is authenticated. | In case of failure, it should not be connected to the framework. |
Check whether group keys in a tree are shared by all peers. | The peers in the same group should know the group key. |
5.2.8 WHITE BOX TESTING:
White box testing, sometimes called glass-box testing, is a test-case design method that uses the control structure of the procedural design to derive test cases. Using the white box testing method, the software engineer can derive test cases. White box testing focuses on the inner structure of the software to be tested.
Description | Expected result |
Exercise all logical decisions on their true and false sides. | All the logical decisions must be valid. |
Execute all loops at their boundaries and within their operational bounds. | All the loops must be finite. |
Exercise internal data structures to ensure their validity. | All the data structures must be valid. |
5.2.9 BLACK BOX TESTING:
Black box testing, also called behavioral testing, focuses on the functional requirements of the software. That is, black box testing enables the software engineer to derive sets of input conditions that will fully exercise all functional requirements for a program. Black box testing is not an alternative to white box techniques; rather, it is a complementary approach that is likely to uncover a different class of errors than white box methods. Black box testing attempts to find errors by focusing on the inputs, outputs, and principal functions of a software module. The starting point of black box testing is either a specification or code. The contents of the box are hidden, and the tested software should produce the desired results.
Description | Expected result |
To check for incorrect or missing functions. | All the functions must be valid. |
To check for interface errors. | The entire interface must function normally. |
To check for errors in data structures or external database access. | Database updates and retrieval must be performed correctly. |
To check for initialization and termination errors. | All the functions and data structures must be initialized properly and terminated normally. |
All the above system testing strategies are carried out, as the development, documentation, and institutionalization of the proposed goals and related policies are essential.
CHAPTER 6
6.0 SOFTWARE DESCRIPTION:
6.1 JAVA TECHNOLOGY:
Java technology is both a programming language and a platform.
The Java Programming Language
The Java programming language is a high-level language that can be characterized by all of the following buzzwords:
- Simple
- Architecture neutral
- Object oriented
- Portable
- Distributed
- High performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure
With most programming languages, you either compile or interpret a program so that you can run it on your computer. The Java programming language is unusual in that a program is both compiled and interpreted. With the compiler, first you translate a program into an intermediate language called Java byte codes —the platform-independent codes interpreted by the interpreter on the Java platform. The interpreter parses and runs each Java byte code instruction on the computer. Compilation happens just once; interpretation occurs each time the program is executed. The following figure illustrates how this works.
You can think of Java byte codes as the machine code instructions for the Java Virtual Machine (Java VM). Every Java interpreter, whether it’s a development tool or a Web browser that can run applets, is an implementation of the Java VM. Java byte codes help make “write once, run anywhere” possible. You can compile your program into byte codes on any platform that has a Java compiler. The byte codes can then be run on any implementation of the Java VM. That means that as long as a computer has a Java VM, the same program written in the Java programming language can run on Windows 2000, a Solaris workstation, or on an iMac.
6.2 THE JAVA PLATFORM:
A platform is the hardware or software environment in which a program runs. We’ve already mentioned some of the most popular platforms like Windows 2000, Linux, Solaris, and MacOS. Most platforms can be described as a combination of the operating system and hardware. The Java platform differs from most other platforms in that it’s a software-only platform that runs on top of other hardware-based platforms.
The Java platform has two components:
- The Java Virtual Machine (Java VM)
- The Java Application Programming Interface (Java API)
You’ve already been introduced to the Java VM. It’s the base for the Java platform and is ported onto various hardware-based platforms.
The Java API is a large collection of ready-made software components that provide many useful capabilities, such as graphical user interface (GUI) widgets. The Java API is grouped into libraries of related classes and interfaces; these libraries are known as packages. The next section, "What Can Java Technology Do?", highlights what functionality some of the packages in the Java API provide.
The following figure depicts a program that’s running on the Java platform. As the figure shows, the Java API and the virtual machine insulate the program from the hardware.
Native code is code that, after compilation, runs on a specific hardware platform. As a platform-independent environment, the Java platform can be a bit slower than native code. However, smart compilers, well-tuned interpreters, and just-in-time byte code compilers can bring performance close to that of native code without threatening portability.
6.3 WHAT CAN JAVA TECHNOLOGY DO?
The most common types of programs written in the Java programming language are applets and applications. If you’ve surfed the Web, you’re probably already familiar with applets. An applet is a program that adheres to certain conventions that allow it to run within a Java-enabled browser.
However, the Java programming language is not just for writing cute, entertaining applets for the Web. The general-purpose, high-level Java programming language is also a powerful software platform. Using the generous API, you can write many types of programs.
An application is a standalone program that runs directly on the Java platform. A special kind of application known as a server serves and supports clients on a network. Examples of servers are Web servers, proxy servers, mail servers, and print servers. Another specialized program is a servlet.
A servlet can almost be thought of as an applet that runs on the server side. Java Servlets are a popular choice for building interactive web applications, replacing the use of CGI scripts. Servlets are similar to applets in that they are runtime extensions of applications. Instead of working in browsers, though, servlets run within Java Web servers, configuring or tailoring the server.
How does the API support all these kinds of programs? It does so with packages of software components that provide a wide range of functionality. Every full implementation of the Java platform gives you the following features:
- The essentials: Objects, strings, threads, numbers, input and output, data structures, system properties, date and time, and so on.
- Applets: The set of conventions used by applets.
- Networking: URLs, TCP (Transmission Control Protocol), UDP (User Datagram Protocol) sockets, and IP (Internet Protocol) addresses.
- Internationalization: Help for writing programs that can be localized for users worldwide. Programs can automatically adapt to specific locales and be displayed in the appropriate language.
- Security: Both low level and high level, including electronic signatures, public and private key management, access control, and certificates.
- Software components: Known as JavaBeans™, these can plug into existing component architectures.
- Object serialization: Allows lightweight persistence and communication via Remote Method Invocation (RMI).
- Java Database Connectivity (JDBC™): Provides uniform access to a wide range of relational databases.
The Java platform also has APIs for 2D and 3D graphics, accessibility, servers, collaboration, telephony, speech, animation, and more. The following figure depicts what is included in the Java 2 SDK.
6.4 HOW WILL JAVA TECHNOLOGY CHANGE MY LIFE?
We can’t promise you fame, fortune, or even a job if you learn the Java programming language. Still, it is likely to make your programs better and requires less effort than other languages. We believe that Java technology will help you do the following:
- Get started quickly: Although the Java programming language is a powerful object-oriented language, it’s easy to learn, especially for programmers already familiar with C or C++.
- Write less code: Comparisons of program metrics (class counts, method counts, and so on) suggest that a program written in the Java programming language can be four times smaller than the same program in C++.
- Write better code: The Java programming language encourages good coding practices, and its garbage collection helps you avoid memory leaks. Its object orientation, its JavaBeans component architecture, and its wide-ranging, easily extendible API let you reuse other people’s tested code and introduce fewer bugs.
- Develop programs more quickly: Your development time may be as much as twice as fast versus writing the same program in C++. Why? You write fewer lines of code and it is a simpler programming language than C++.
- Avoid platform dependencies with 100% Pure Java: You can keep your program portable by avoiding the use of libraries written in other languages. The 100% Pure Java™ Product Certification Program has a repository of historical process manuals, white papers, brochures, and similar materials online.
- Write once, run anywhere: Because 100% Pure Java programs are compiled into machine-independent byte codes, they run consistently on any Java platform.
- Distribute software more easily: You can upgrade applets easily from a central server. Applets take advantage of the feature of allowing new classes to be loaded “on the fly,” without recompiling the entire program.
6.5 ODBC:
Microsoft Open Database Connectivity (ODBC) is a standard programming interface for application developers and database systems providers. Before ODBC became a de facto standard for Windows programs to interface with database systems, programmers had to use proprietary languages for each database they wanted to connect to. Now, ODBC has made the choice of the database system almost irrelevant from a coding perspective, which is as it should be. Application developers have much more important things to worry about than the syntax that is needed to port their program from one database to another when business needs suddenly change.
Through the ODBC Administrator in Control Panel, you can specify the particular database that is associated with a data source that an ODBC application program is written to use. Think of an ODBC data source as a door with a name on it. Each door will lead you to a particular database. For example, the data source named Sales Figures might be a SQL Server database, whereas the Accounts Payable data source could refer to an Access database. The physical database referred to by a data source can reside anywhere on the LAN.
The ODBC system files are not installed on your system by Windows 95. Rather, they are installed when you set up a separate database application, such as SQL Server Client or Visual Basic 4.0. When the ODBC icon is installed in Control Panel, it uses a file called ODBCINST.DLL. It is also possible to administer your ODBC data sources through a stand-alone program called ODBCADM.EXE. There is a 16-bit and a 32-bit version of this program, and each maintains a separate list of ODBC data sources.
From a programming perspective, the beauty of ODBC is that the application can be written to use the same set of function calls to interface with any data source, regardless of the database vendor. The source code of the application doesn’t change whether it talks to Oracle or SQL Server. We only mention these two as an example. There are ODBC drivers available for several dozen popular database systems. Even Excel spreadsheets and plain text files can be turned into data sources. The operating system uses the Registry information written by ODBC Administrator to determine which low-level ODBC drivers are needed to talk to the data source (such as the interface to Oracle or SQL Server). The loading of the ODBC drivers is transparent to the ODBC application program. In a client/server environment, the ODBC API even handles many of the network issues for the application programmer.
The advantages of this scheme are so numerous that you are probably thinking there must be some catch. The only disadvantage of ODBC is that it isn't as efficient as talking directly to the native database interface. ODBC has had many detractors make the charge that it is too slow. Microsoft has always claimed that the critical factor in performance is the quality of the driver software that is used. In our humble opinion, this is true. The availability of good ODBC drivers has improved a great deal recently. And anyway, the criticism about performance is somewhat analogous to those who said that compilers would never match the speed of pure assembly language. Maybe not, but the compiler (or ODBC) gives you the opportunity to write cleaner programs, which means you finish sooner. Meanwhile, computers get faster every year.
6.6 JDBC:
In an effort to set an independent database standard API for Java, Sun Microsystems developed Java Database Connectivity, or JDBC. JDBC offers a generic SQL database access mechanism that provides a consistent interface to a variety of RDBMSs. This consistent interface is achieved through the use of "plug-in" database connectivity modules, or drivers. If a database vendor wishes to have JDBC support, he or she must provide the driver for each platform that the database and Java run on.
To gain a wider acceptance of JDBC, Sun based JDBC’s framework on ODBC. As you discovered earlier in this chapter, ODBC has widespread support on a variety of platforms. Basing JDBC on ODBC will allow vendors to bring JDBC drivers to market much faster than developing a completely new connectivity solution.
JDBC was announced in March of 1996. It was released for a 90 day public review that ended June 8, 1996. Because of user input, the final JDBC v1.0 specification was released soon after.
The remainder of this section will cover enough information about JDBC for you to know what it is about and how to use it effectively. This is by no means a complete overview of JDBC. That would fill an entire book.
6.7 JDBC Goals:
Few software packages are designed without goals in mind. JDBC is no exception: its many goals drove the development of the API. These goals, in conjunction with early reviewer feedback, have finalized the JDBC class library into a solid framework for building database applications in Java.
The goals that were set for JDBC are important. They will give you some insight as to why certain classes and functionalities behave the way they do. The eight design goals for JDBC are as follows:
- SQL Level API
The designers felt that their main goal was to define a SQL interface for Java. Although not the lowest database interface level possible, it is at a low enough level for higher-level tools and APIs to be created. Conversely, it is at a high enough level for application programmers to use it confidently. Attaining this goal allows for future tool vendors to “generate” JDBC code and to hide many of JDBC’s complexities from the end user.
- SQL Conformance
SQL syntax varies as you move from database vendor to database vendor. In an effort to support a wide variety of vendors, JDBC will allow any query statement to be passed through it to the underlying database driver. This allows the connectivity module to handle non-standard functionality in a manner that is suitable for its users.
- JDBC must be implementable on top of common database interfaces
The JDBC SQL API must “sit” on top of other common SQL level APIs. This goal allows JDBC to use existing ODBC level drivers by the use of a software interface. This interface would translate JDBC calls to ODBC and vice versa.
- Provide a Java interface that is consistent with the rest of the Java system
Because of Java’s acceptance in the user community thus far, the designers feel that they should not stray from the current design of the core Java system.
- Keep it simple
This goal probably appears in all software design goal listings. JDBC is no exception. Sun felt that the design of JDBC should be very simple, allowing for only one method of completing a task per mechanism. Allowing duplicate functionality only serves to confuse the users of the API.
- Use strong, static typing wherever possible
Strong typing allows for more error checking to be done at compile time; also, fewer errors appear at runtime.
- Keep the common cases simple
Because, more often than not, the usual SQL calls used by the programmer are simple SELECTs, INSERTs, DELETEs, and UPDATEs, these queries should be simple to perform with JDBC (a minimal JDBC query is sketched below). However, more complex SQL statements should also be possible.
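The sketch below shows what such a minimal JDBC query might look like. The connection URL, credentials, and the users table with id and name columns are placeholders invented for this illustration; the real URL and schema depend on the driver and database actually in use.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Minimal JDBC SELECT example. The URL, credentials, and the 'users' table
// with 'id' and 'name' columns are placeholders for illustration only.
public class SimpleJdbcQuery {
    public static void main(String[] args) {
        String url = "jdbc:odbc:SalesFigures";   // placeholder; depends on the driver in use
        try (Connection connection = DriverManager.getConnection(url, "user", "password");
             PreparedStatement statement =
                     connection.prepareStatement("SELECT name FROM users WHERE id = ?")) {
            statement.setInt(1, 1);
            try (ResultSet results = statement.executeQuery()) {
                while (results.next()) {
                    System.out.println(results.getString("name"));
                }
            }
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }
}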
Finally, we decided to proceed with the implementation using Java Networking, and for dynamically updating the cache table we use an MS Access database.
Java has two things: a programming language and a platform. Java is a high-level programming language that is simple, architecture-neutral, object-oriented, portable, distributed, high-performance, interpreted, multithreaded, robust, dynamic, and secure. Java is also unusual in that each Java program is both compiled and interpreted. With a compiler, you translate a Java program into an intermediate language called Java byte codes, the platform-independent code instructions that are passed to and run on the computer. Compilation happens just once; interpretation occurs each time the program is executed. The figure illustrates how this works.
6.8 NETWORKING TCP/IP STACK:
The TCP/IP stack is shorter than the OSI one:
TCP is a connection-oriented protocol; UDP (User Datagram Protocol) is a connectionless protocol.
IP datagrams:
The IP layer provides a connectionless and unreliable delivery system. It considers each datagram independently of the others. Any association between datagrams must be supplied by the higher layers. The IP layer supplies a checksum that includes its own header. The header includes the source and destination addresses. The IP layer handles routing through an internet. It is also responsible for breaking up large datagrams into smaller ones for transmission and reassembling them at the other end.
UDP:
UDP is also connectionless and unreliable. What it adds to IP is a checksum for the contents of the datagram and port numbers. These are used to give a client/server model – see later.
TCP:
TCP supplies logic to give a reliable connection-oriented protocol above IP. It provides a virtual circuit that two processes can use to communicate.
Internet addresses
In order to use a service, you must be able to find it. The Internet uses an address scheme for machines so that they can be located. The address is a 32 bit integer which gives the IP address.
Network address:
Class A uses 8 bits for the network address with 24 bits left over for other addressing. Class B uses 16 bit network addressing. Class C uses 24 bit network addressing and class D uses all 32.
Subnet address:
Internally, the UNIX network is divided into sub networks. Building 11 is currently on one sub network and uses 10-bit addressing, allowing 1024 different hosts.
Host address:
8 bits are finally used for host addresses within our subnet. This places a limit of 256 machines that can be on the subnet.
Total address:
The 32 bit address is usually written as 4 integers separated by dots.
Port addresses
A service exists on a host, and is identified by its port. This is a 16 bit number. To send a message to a server, you send it to the port for that service of the host that it is running on. This is not location transparency! Certain of these ports are “well known”.
Sockets:
A socket is a data structure maintained by the system
to handle network connections. A socket is created using the call socket
. It returns an integer that is like a file descriptor.
In fact, under Windows, this handle can be used with Read File
and Write File
functions.
#include <sys/types.h>
#include <sys/socket.h>
int socket(int family, int type, int protocol);
Here “family” will be AF_INET for IP communications, protocol will be zero, and type will depend on whether TCP or UDP is used. Two processes wishing to communicate over a network create a socket each. These are similar to two ends of a pipe – but the actual pipe does not yet exist.
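Since the implementation proceeds with Java networking rather than the C calls shown above, a minimal Java equivalent is sketched below: a server that accepts one client and echoes a single line back. The port number 5000 is arbitrary and chosen only for this illustration.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

// Minimal Java networking sketch: a TCP server that echoes one line to one client.
// The port number 5000 is arbitrary.
public class EchoOnce {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(5000);
             Socket client = server.accept();                       // wait for one client
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(client.getInputStream()));
             PrintWriter out = new PrintWriter(client.getOutputStream(), true)) {
            String line = in.readLine();                            // read one line
            out.println("echo: " + line);                           // echo it back
        }
    }
}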
6.9 JFREECHART:
JFreeChart is a free 100% Java chart library that makes it easy for developers to display professional quality charts in their applications. JFreeChart’s extensive feature set includes:
- A consistent and well-documented API, supporting a wide range of chart types;
- A flexible design that is easy to extend, and targets both server-side and client-side applications;
- Support for many output types, including Swing components, image files (including PNG and JPEG), and vector graphics file formats (including PDF, EPS and SVG);
JFreeChart is “open source” or, more specifically, free software. It is distributed under the terms of the GNU Lesser General Public Licence (LGPL), which permits use in proprietary applications.
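A minimal example of producing a chart with JFreeChart is sketched below, assuming the JFreeChart 1.0.x API (ChartFactory, DefaultPieDataset, ChartUtilities) is on the classpath; the dataset values and the output file name are arbitrary choices for this illustration.

import java.io.File;
import org.jfree.chart.ChartFactory;
import org.jfree.chart.ChartUtilities;
import org.jfree.chart.JFreeChart;
import org.jfree.data.general.DefaultPieDataset;

// Minimal JFreeChart sketch (assumes the 1.0.x API): build a small pie dataset,
// create a chart, and save it as a PNG file. Values and file name are arbitrary.
public class PieChartExample {
    public static void main(String[] args) throws Exception {
        DefaultPieDataset dataset = new DefaultPieDataset();
        dataset.setValue("Servers", 45);
        dataset.setValue("Workstations", 35);
        dataset.setValue("Mobile devices", 20);

        JFreeChart chart = ChartFactory.createPieChart(
                "Hosts by category", dataset, true, true, false);
        ChartUtilities.saveChartAsPNG(new File("hosts.png"), chart, 640, 480);
    }
}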
6.9.1 Map Visualizations:
Charts showing values that relate to geographical areas. Some examples include: (a) population density in each state of the United States, (b) income per capita for each country in Europe, (c) life expectancy in each country of the world. The tasks in this project include: Sourcing freely redistributable vector outlines for the countries of the world, states/provinces in particular countries (USA in particular, but also other areas);
Creating an appropriate dataset interface (plus a default implementation) and a renderer, and integrating these with the existing XYPlot class in JFreeChart; testing, documenting, testing some more, documenting some more.
6.9.2 Time Series Chart Interactivity
Implement a new (to JFreeChart) feature for interactive time series charts — to display a separate control that shows a small version of ALL the time series data, with a sliding “view” rectangle that allows you to select the subset of the time series data to display in the main chart.
6.9.3 Dashboards
There is currently a lot of interest in dashboard displays. Create a flexible dashboard mechanism that supports a subset of JFreeChart chart types (dials, pies, thermometers, bars, and lines/time series) that can be delivered easily via both Java Web Start and an applet.
6.9.4 Property Editors
The property editor mechanism in JFreeChart only handles a small subset of the properties that can be set for charts. Extend (or reimplement) this mechanism to provide greater end-user control over the appearance of the charts.
CHAPTER 8
8.1 CONCLUSION & FUTURE WORK:
In this work we formalized, implemented, and evaluated a new probabilistic model for measuring the security threats in large enterprise networks. The novelty of our work is the ability to quantitatively analyze the chance of successful attack in the presence of uncertainties about the configuration of a dynamic network and routes of potential attacks.
The results of our experiments confirm three key properties of our model. First, the vulnerability values computed from our model are accurate. Our manual inspection of the results confirms that the probability values obtained in the experiments correlate to the vulnerabilities of components in the network. Second, our security improvement method efficiently finds the optimal placement of security products subject to constraints. Third, we quantify the additional vulnerabilities introduced by mobile devices of a dynamic network.
Our results indicate that an infected mobile device within the trusted region creates a preferred attack direction towards the attack target, which increases the chance of success at the target host. Our implementation efficiently computes the probabilities throughout large attack graphs with a quadratic execution performance.
For future work, we plan to utilize and extend our success measurement model and optimal security placement algorithm to solve more complex network security optimization problems. For instance, an important issue is noise elimination in the initial belief set of values; solving this problem will lead to more accurate results.
Secure and Distributed Data Discovery and Dissemination in Wireless Sensor Networks
—A data discovery and dissemination protocol for wireless sensor networks (WSNs) is responsible for updatingconfiguration parameters of, and distributing management commands to, the sensor nodes. All existing data discovery anddissemination protocols suffer from two drawbacks. First, they are based on the centralized approach; only the base station candistribute data items. Such an approach is not suitable for emergent multi-owner-multi-user WSNs. Second, those protocols werenot designed with security in mind and hence adversaries can easily launch attacks to harm the network. This paper proposes thefirst secure and distributed data discovery and dissemination protocol named DiDrip. It allows the network owners to authorizemultiple network users with different privileges to simultaneously and directly disseminate data items to the sensor nodes.Moreover, as demonstrated by our theoretical analysis, it addresses a number of possible security vulnerabilities that we haveidentified. Extensive security analysis show DiDrip is provably secure. We also implement DiDrip in an experimental network ofresource-limited sensor nodes to show its high efficiency in practice.Index Terms—Distributed data discovery and dissemination, security, wireless sensor networks, efficiencyÇ1 INTRODUCTIONAFTER a wireless sensor network (WSN) is deployed,there is usually a need to update buggy/old small programsor parameters stored in the sensor nodes. This can beachieved by the so-called data discovery and dissemination protocol,which facilitates a source to inject small programs,commands, queries, and configuration parameters to sensornodes. Note that it is different from the code disseminationprotocols (also referred to as data dissemination or reprogrammingprotocols) [1], [2], which distribute large binariesto reprogram the whole network of sensors. For example,efficiently disseminating a binary file of tens of kilobytesrequires a code dissemination protocol while disseminatingseveral 2-byte configuration parameters requires data discoveryand dissemination protocol. Considering the sensornodes could be distributed in a harsh environment,remotely disseminating such small data to the sensor nodesthrough the wireless channel is a more preferred and practicalapproach than manual intervention.In the literature, several data discovery and disseminationprotocols [3], [4], [5], [6] have been proposed for WSNs.Among them, DHV [3], DIP [5] and Drip [4] are regarded asthe state-of-the-art protocols and have been included in theTinyOS distributions. All proposed protocols assume thatthe operating environment of the WSN is trustworthy andhas no adversary. However, in reality, adversaries exist andimpose threats to the normal operation of WSNs [7], [8].This issue has only been addressed recently by [7] whichidentifies the security vulnerabilities of Drip and proposesan effective solution.More importantly, all existing data discovery and disseminationprotocols [3], [4], [5], [6], [7] employ the centralizedapproach in which, as shown in the top sub-figurein Fig. 1, data items can only be disseminated by the basestation. Unfortunately, this approach suffers from the singlepoint of failure as dissemination is impossible whenthe base station is not functioning or when the connectionbetween the base station and a node is broken. In addition,the centralized approach is inefficient, non-scalable, andvulnerable to security attacks that can be launched anywherealong the communication path [2]. 
Even worse,some WSNs do not have any base station at all. For example,for a WSN monitoring human trafficking in acountry’s border or a WSN deployed in a remote area tomonitor illicit crop cultivation, a base station becomes anattractive target to be attacked. For such networks, datadissemination is better to be carried out by authorized networkusers in a distributed manner.Additionally, distributed data discovery and disseminationis an increasingly relevant matter in WSNs, especiallyin the emergent context of shared sensor networks, wheresensing/communication infrastructures from multiple ownerswill be shared by applications from multiple users. Forexample, large scale sensor networks are built in recent_ D. He is with the School of Computer Science and Engineering, South ChinaUniversity of Technology, Guangzhou 510006, P.R. China, and also withthe College of Computer Science and Technology, Zhejiang University,Hangzhou 310027, P.R. China. E-mail: hedaojinghit@gmail.com._ S. Chan is with the Department of Electronic Engineering, City Universityof Hong Kong, Hong Kong SAR, P. R. China.E-mail: eeschan@cityu.edu.hk._ M. Guizani is with Qatar University, Qatar. E-mail: mguizani@ieee.org._ H. Yang is with the School of Computer Science and Engineering, Universityof Electronic Science and Technology of China, P. R. China.E-mail: haomyang@uestc.edu.cn._ B. Zhou is with the College of Computer Science, Zhejiang University,P. R. China. E-mail: zby@zju.edu.cn.Manuscript received 31 Dec. 2013; revised 29 Mar. 2014; accepted 31 Mar.2014. Date of publication 10 Apr. 2014; date of current version 6 Mar. 2015.recommended for acceptance by V. B. Misic.For information on obtaining reprints of this article, please send e-mail to:reprints@ieee.org, and reference the Digital Object Identifier below.Digital Object Identifier no. 10.1109/TPDS.2014.2316830IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 26, NO. 4, APRIL 2015 11291045-9219 _ 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.projects such as Geoss [9], NOPP [10] and ORION [11].These networks are owned by multiple owners and used byvarious authorized third-party users. Moreover, it isexpected that network owners and different users may havedifferent privileges of dissemination. In this context, distributedoperation by networks owners and users with differentprivileges will be a crucial issue, for which efficient solutionsare still missing.Motivated by the above observations, this paper has thefollowing main contributions:1) The need of distributed data discovery and disseminationprotocols is not completely new, but previouswork did not address this need. We study the functionalrequirements of such protocols, and set theirdesign objectives. Also, we identify the security vulnerabilitiesin previously proposed protocols.2) Based on the design objectives, we propose DiDrip.It is the first distributed data discovery and disseminationprotocol, which allows network owners andauthorized users to disseminate data items intoWSNs without relying on the base station. Moreover,our extensive analysis demonstrates thatDiDrip satisfies the security requirements of theprotocols of its kind. 
In particular, we apply theprovable security technique to formally provethe authenticity and integrity of the disseminateddata items in DiDrip.3) We demonstrate the efficiency of DiDrip in practiceby implementing it in an experimental WSN withresource-limited sensor nodes. This is also the firstimplementation of a secure and distributed data discoveryand dissemination protocol.The rest of this paper is structured as follows. In Section2, we first survey the existing data discovery anddissemination protocols, and then discuss their securityweaknesses. Section 3 describes the requirements for asecure and distributed extension of such protocols. Section4 presents the network, trust and adversary models.Section 5 describes DiDrip in details. Section 6 providestheoretical analysis of the security properties of DiDrip.Section 7 describes the implementation and experimentalresults of DiDrip via real sensor platforms. Finally,Section 8 concludes this paper.2 SECURITY VULNERABILITIES IN DATADISCOVERY AND DISSEMINATION2.1 Review of Existing ProtocolsThe underlying algorithm of both DIP and Drip is Trickle[12]. Initially, Trickle requires each node to periodicallybroadcast a summary of its stored data. When a node hasreceived an older summary, it sends an update to thatsource. Once all nodes have consistent data, the broadcastinterval is increased exponentially to save energy. However,if a node receives a new summary, it will broadcast thismore quickly. In other words, Trickle can disseminatenewly injected data very quickly. Among the existing protocols,Drip is the simplest one and it runs an independentinstance of Trickle for each data item.In practice, each data item is identified by a unique keyand its freshness is indicated by a version number. Forexample, for Drip, DIP and DHV, each data item is representedby a 3-tuple <key; version; data> , where key is usedto uniquely identify a data item, version indicates the freshnessof the data item (the larger the version, the fresher thedata), and data is the actual disseminated data (e.g., command,query or parameter).2.2 Security VulnerabilitiesAn adversary can first place some intruder nodes in the networkand then use them to alter the data being disseminatedor forge a data item. This may result in some importantparameters being erased or the entire network beingrebooted with wrong data. For example, consider a new dataitem (key, version, data) being disseminated. When anintruder node receives this new data item, it can broadcast amalicious data item (key, version_, data_), where version_ >version. If data_ is set to 0, the parameter identified by key willbe erased from all sensor nodes. Alternatively, if data_ is differentfrom data, all sensor nodes will update the parameteraccording to this forged data item. Note that the aboveattacks can also be launched if an adversary compromisessome nodes and has access to their key materials.In addition, since nodes executing Trickle are required toforward all new data items that they receive, an adversarycan launch denial-of-service (DoS) attacks to sensor nodesby injecting a large amount of bogus data items. As a result,the processing and energy resources of nodes are expendedFig. 1. System overview of centralized and distributed data discovery and dissemination approaches.1130 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 26, NO. 4, APRIL 2015to process and forward these bogus data items, rather thanon the intended functions. 
Any data discovery and disseminationprotocol based on Trickle or its variants is vulnerableto such a DoS attack.3 REQUIREMENTS AND DESIGN CONSIDERATIONA secure and distributed data discovery and disseminationprotocol should satisfy the following requirements:1) Distributed. Multiple authorized users should beallowed to simultaneously disseminate data itemsinto the WSN without relying on the base station.2) Supporting different user privileges. To provide flexibility,each user may be assigned a certain privilegelevel by the network owner. For example, a user canonly disseminate data items to a set of sensor nodeswith specific identities and/or in a specific localizedarea. Another example is that a user just has the privilegeto disseminate data items identified by somespecific keys.3) Authenticity and integrity of data items. A sensor nodeonly accepts data items disseminated by authorizedusers. Also, a sensor should be able to ensure thatreceived data items have not been modified duringthe dissemination process.4) User accountability. User accountability must be providedsince bad user behaviors and insider attacksshould be audited and pinpointed. That is, a sendershould not be able to deny the distribution of a dataitem. At the same time, an adversary cannot impersonateany legitimate user even if it has compromisedthe network owner or the other legitimateusers. In many applications, accountability is desirableas it enables collection of users’ activities. Forexample, from the dissemination record in sensornodes, the network owner can find out who disseminatesmost data. This requires the sensor nodes to beable to associate each disseminated data with thecorresponding user’s identity.5) Node compromise tolerance. The protocol should beresilient to node compromise attack no matter howmany nodes have been compromised, as long as thesubset of non-compromised nodes can still form aconnected graph with the trusted source.6) User collusion tolerance. Even if an adversary has compromisedsome users, a benign node should notgrant the adversary any privilege level beyond thatof the compromised users.7) DoS attacks resistance. The functions of the WSNshould not be disrupted by DoS attacks.8) Freshness. A node should be able to differentiatewhether an incoming data item is the newestversion.9) Low energy overhead. Most sensor nodes have limitedresources. Thus, it is very important that the securityfunctions incur low energy overhead, which can bedecomposed to communication and computationoverhead.10) Scalability. The protocol should be efficient even forlarge-scale WSNs with thousands of sensors andlarge user population.11) Dynamic participation. New sensor nodes and userscan be dynamically added to the network.In order to ensure security, each step of the existing datadiscovery and dissemination protocol runs should be identifiedand then protected. In other words, although code disseminationprotocols may share the same securityrequirements as listed above, their security solutions needto be designed in accordance with their characteristics. Consideringthe well known open-source code disseminationprotocol Deluge [1] as an example. Deluge uses an epidemicprotocol based on a page-by-page dissemination strategyfor efficient advertisement of metadata. A code image isdivided into fixed-size pages, and each page is further splitinto same-size packets. 
Due to such a way of decomposingcode images into packets, our proposed protocol is notapplicable for securing Deluge.The primary challenge of providing security functions inWSNs is the limited capabilities of sensor nodes in terms ofcomputation, energy and storage. For example, to provideauthentication function to disseminated data, a commonlyused solution is digital signature. That is, users digitallysign each packet individually and nodes need to verify thesignature before processing it. However, such an asymmetricmechanism incurs significant computational and communicationoverhead and is not applicable to sensor nodes.To address this problem, TESLA and its various extensionshave been proposed [13], [14], which are based on thedelayed disclosure of authentication keys, i.e., the key usedto authenticate a message is disclosed in the next message.Unfortunately, due to the authentication delay, these mechanismsare vulnerable to a flooding attack which causeseach sensor node to buffer all forged data items until thedisclosed key is received.Another possible approach to authentication is by symmetrickey cryptography. However, this approach is vulnerableto node compromise attack because once a node iscompromised, the globally shared secret keys are revealed.Here we choose digital signatures over other forms forupdate packet authentication. That is, the network ownerassigns to each network user a public/private key pair thatallows the user to digitally sign data items and thus authenticateshimself/herself to the sensor nodes. We propose twohybrid approaches to reduce the computation and communicationcost. These methods combine digital signature withefficient data Merkle hash tree and data hash chain, respectively.The main idea is that signature generation and verificationare carried out over multiple packets instead ofindividual packet. In this way, the computation cost perpacket is significantly reduced. Since elliptic curve cryptography(ECC) is computational and communication efficientcompared with the traditional public key cryptography,DiDrip is based on ECC.To prevent the network owner from impersonatingusers, user certificates are issued by a certificate authority ofa public key infrastructure (PKI), e.g., local police office.4 NETWORK, TRUST AND THREAT MODELS4.1 Network ModelAs shown in the bottom subfigure in Fig. 1, a generalWSN comprises a large number of sensor nodes. It isHE ET AL.: SECURE AND DISTRIBUTED DATA DISCOVERY AND DISSEMINATION IN WIRELESS SENSOR NETWORKS 1131administrated by the owner and accessible by many users.The sensor nodes are usually resource constrained withrespect to memory space, computation capability, bandwidth,and power supply. Thus, a sensor node can onlyperform a limited number of public key cryptographicoperations during the lifetime of its battery. The networkusers use some mobile devices to disseminate data itemsinto the network. The network owner is responsible for generatingkeying materials. It can be offline and is assumed tobe uncompromisable.4.2 Trust ModelNetworks users are assigned dissemination privileges bythe trusted authority in a PKI on behalf of the networkowner. However, the network owner may, for various reasons,impersonate network users to disseminate data items.4.3 Threat ModelThe adversary considered in this paper is assumed to becomputationally resourceful and can launch a wide rangeof attacks, which can be classified as external or insiderattacks. In external attacks, the adversary has no control ofany sensor node in the network. 
Instead, it would eavesdropfor sensitive information, inject forged messages,launch replay attack, wormhole attacks, DoS attacks andimpersonate valid sensor nodes. The communication channelmay also be jammed by the adversary, but this can onlylast for a certain period of time after which the adversarywill be detected and removed.By compromising either network users or sensor nodes,the adversary can launch insider attacks to the network. Thecompromised entities are regarded as insiders because theyare members of the network until they are identified. Theadversary controls these entities to attack the network in arbitraryways. For instance, they could be instructed to disseminatefalse or harmful data, launch attacks such as Sybil attacksor DoS attacks, and be non-cooperative with other nodes.5 DIDRIPReferring to the lower sub-figure in Fig. 1, DiDrip consists offour phases, system initialization, user joining, packet preprocessingand packet verification. For our basic protocol,in system initialization phase, the network owner creates itspublic and private keys, and then loads the public parameterson each node before the network deployment. In theuser joining phase, a user gets the dissemination privilegethrough registering to the network owner. In packet preprocessingphase, if a user enters the network and wants todisseminate some data items, he/she will need to constructthe data dissemination packets and then send them to thenodes. In the packet verification phase, a node verifies eachreceived packet. If the result is positive, it updates the dataaccording to the received packet. In the following, eachphase is described in detail. The notations used in thedescription are listed in Table 1. The information processingflow of DiDrip is illustrated in Fig. 2.5.1 System Initialization PhaseIn this phase, an ECC is set up. The network owner carriesout the following steps to derive a private key x and somepublic parameters fy; Q; p; q; hð:Þg. It selects an elliptic curveE over GFðpÞ, where p is a big prime number. Here Qdenotes the base point ofE while q is also a big prime numberand represents the order of Q. It then selects the private keyx 2 GFðqÞ and computes the public key y ¼ xQ. After that,the public parameters are preloaded in each node of the network.We consider 160-bit ECC as an example. In this case, yandQ are both 320 bits long while p and q are 160 bits long.5.2 User Joining PhaseThis phase is invoked when a user with the identity UIDj,say Uj, hopes to obtain privilege level. User Uj chooses theprivate key SKj 2 GFðqÞ and computes the public keyPKj ¼ SKj_Q. Here the length of UIDj is set to 2 bytes, inthis case, it can support 65,536 users. Similarly, assume that160-bit ECC is used, PKj and SKj are 320 bits and 160 bitslong, respectively. Then user Uj sends a 3-tuple <UIDj;Prij; PKj > to the network owner, where Prij denotes thedissemination privilege of user Uj. Upon receiving this message,the network owner generates the certificate Certj. Aform of the certificate consists of the following contents:Certj ¼ fUIDj; PKj; Prij; SIGxfhðUIDjkPKjkPrijÞg, wherethe length of Prij is set to 6 bytes, thus the length of Certj is88 bytes.TABLE 1NotationsFig. 2. Information processing flow in DiDrip.1132 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 26, NO. 4, APRIL 20155.3 Packet Pre-Processing PhaseAssume that a user, say Uj, enters the WSN and wants todisseminate n data items: di ¼ fkeyi; versioni; dataig, i ¼ 1;2; . . .; n. 
For the construction of the packets of the respective data, we have two methods, i.e., the data hash chain and the Merkle hash tree [15].

For the data hash chain approach, a packet, say Pi, is composed of the packet header, di, and the hash value of packet Pi+1 (i.e., Hi+1 = h(Pi+1)), which is used to verify the next packet, where i = 1, ..., n - 1. Here each cryptographic hash Hi is calculated over the full packet Pi, not just the data portion di, thereby establishing a chain of hashes. After that, user Uj uses his/her private key SKj to run an ECDSA sign operation to sign the hash value of the first data packet h(P1), and then creates an advertisement packet P0, which consists of the packet header, the user certificate Certj, h(P1) and the signature SIGSKj{h(P1)}. Similarly, the network owner assigns a predefined key to identify this advertisement packet.

With the Merkle hash tree method, user Uj builds a Merkle hash tree from the n data items in the following way. All the data items are treated as the leaves of the tree. A new set of internal nodes at the upper level is formed; each internal node is computed as the hash value of the concatenation of its two child nodes. This process is continued until the root node Hroot is formed, resulting in a Merkle hash tree with depth D = log2(n). Before disseminating the n data items, user Uj signs the root node with his/her private key SKj and then transmits the advertisement packet P0 comprising the user certificate Certj, Hroot and SIGSKj{Hroot}. Subsequently, user Uj disseminates each data item along with the appropriate internal nodes for verification purposes. Note that, as described above, the user certificate Certj contains the user identity information UIDj and the dissemination privilege Prij. Before the network deployment, the network owner assigns a predefined key to identify this advertisement packet.

5.4 Packet Verification Phase
When a sensor node, say Sj, receives a packet either from an authorized user or from its one-hop neighbours, it first checks the packet's key field.
1) If this is an advertisement packet (P0 = {Certj, h(P1), SIGSKj{h(P1)}} for the data hash chain method, while P0 = {Certj, root, SIGSKj{root}} for the Merkle hash tree method), node Sj first pays attention to the legality of the dissemination privilege Prij. For example, node Sj needs to check whether its own identity is included in the node identity set of Prij. If the result is positive, node Sj uses the public key y of the network owner to run an ECDSA verify operation to authenticate the certificate. If the certificate Certj is valid, node Sj authenticates the signature. If yes, for the data hash chain method (respectively, the Merkle hash tree method), node Sj stores <UIDj, H1> (respectively, <UIDj, root>) included in the advertisement packet; otherwise, node Sj simply discards the packet.
2) Otherwise, it is a data packet Pi, where i = 1, 2, ..., n. Node Sj executes the following procedure. For the data hash chain method, node Sj checks the authenticity and integrity of Pi by comparing the hash value of Pi with Hi, which has been received in the same round and verified. If the result is positive and the version number is new, node Sj then updates the data identified by the key stored in Pi and replaces its stored <round, Hi> by <round, Hi+1> (here Hi+1 is included in packet Pi); otherwise, Pi is discarded. For the Merkle hash tree method, node Sj checks the authenticity and integrity of Pi through the already verified root node received in the same round.
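The two packet-construction methods just described, and the hash checks they enable, can be sketched compactly. The Python fragment below uses only hashlib; the packet layout, header bytes and helper names are illustrative assumptions, and the ECDSA signature over h(P1) or Hroot is omitted.

import hashlib

def h(data):
    return hashlib.sha1(data).digest()

def build_hash_chain(data_items):
    # P_i = header || d_i || H_{i+1} with H_{i+1} = h(P_{i+1}); built backwards from P_n.
    packets = [None] * len(data_items)
    next_hash = b""                      # P_n carries no successor hash
    for i in range(len(data_items) - 1, -1, -1):
        packets[i] = b"HDR" + data_items[i] + next_hash
        next_hash = h(packets[i])        # becomes H_i, to be embedded in P_{i-1}
    return packets, next_hash            # next_hash is now h(P_1), which the user signs

def merkle_root(data_items):
    # Hash the leaves pairwise upwards until a single root remains (depth about log2(n)).
    level = [h(d) for d in data_items]
    while len(level) > 1:
        if len(level) % 2:               # duplicate the last node if the level is odd
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]                      # Hroot, which the user signs

items = [b"key1|v1|data1", b"key2|v1|data2", b"key3|v1|data3", b"key4|v1|data4"]
packets, h_p1 = build_hash_chain(items)
root = merkle_root(items)

A node's accept-or-discard decision in the verification phase is then driven by exactly these checks: comparing h(Pi) against the stored Hi for the chain, or recomputing the path to the stored root for the tree.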
If theresult is positive and the version number is new, nodeSj then updates the data identified by the key storedin Pi; otherwise, Pi is discarded.Remark: To prevent the network owner from impersonatingusers, system initialization and issue of user certificatescan be carried out by the certificate authority of aPKI rather than the network owner.Comparing the two methods, the data hash chainmethod incurs less communication overhead than the Merklehash tree method. In the data hash chain method, onlyone hash value of a packet is included in each packet. Onthe contrary, in the Merkle hash tree method, D (the treedepth) hash values are included in each packet. However, alimitation of the data hash tree method is that it just workswell in networks with in-sequence packet delivery. Such alimitation does not exist in the Merkle hash tree methodsince it allows each packet to be immediately authenticatedupon its arrival at a node. Therefore, the choice of eachmethod depends on this characteristic of the WSNs.5.5 EnhancementsWe can enhance the efficiency and security of DiDrip byadding additional mechanisms. Readers are referred to theAppendix of this paper for details.6 SECURITY ANALYSIS OF DIDRIPIn the following, we will analyze the security of DiDrip toverify that the security requirements mentioned in Section 3are satisfied.Distributed. As described in Section 5.2, in order to passthe signature verification of sensor nodes, each user has tosubmit his/her private key and dissemination privilege tothe network owner for registration. In addition, as describedabove, authorized users are able to carry out disseminationin a distributed manner.Supporting different user privileges. Activities of networkusers can be restricted by setting user privilege Prij, whichis contained in the user certificate. Since each user certificateis generated based on Prij, it will not pass the signature verificationat sensor nodes if Prij is modified. Thus, only thenetwork owner can modify Prij and then updates the certificateaccordingly.Authenticity and integrity of data items. With the Merklehash tree method (rsp. the data hash chain method), anauthorized user signs the root of the Merkle hash tree (rsp.the hash value of the first data packet hðP1Þ) with his/her privatekey. Using the network owner’s public key, each sensornode can authenticate the user certificate and obtains theuser’s public key. Then, using the user’s public key, eachnode can authenticate the root of the Merkle hash tree (rsp.HE ET AL.: SECURE AND DISTRIBUTED DATA DISCOVERY AND DISSEMINATION IN WIRELESS SENSOR NETWORKS 1133the hash value of the first data packet hðP1Þ). Subsequently,each node can authenticate other data packets based on theMerkle hash tree (rsp. the data hash chain). With the assumptionthat the network owner cannot be compromised, it isguaranteed that any forged or modified data items can beeasily detected by the authentication process.User accountability. Users’ identities and their disseminationactivities are exposed to sensor nodes. Thus, sensornodes can report such records to the network owner periodically.Since each user certificate is generated according tothe user identity, except the network owner, no one canmodify the user identity contained in the user certificatewhich passes the authentication. Therefore, users cannotrepudiate their activities.Node compromise and user collusion tolerance. As describedabove, for basic protocol, only the public parameters arepreloaded in each node. 
Even for the improved protocol,the public-key/dissemination-privilege pair of each networkuser is loaded into the nodes. Therefore, no matterhow many sensor nodes are compromised, the adversaryjust obtains the public parameters and the public-key/dissemination-privilege pair of each user. Clearly, the adversarycannot launch any attack by compromising sensornodes. As described in Section 5, even if some users collude,a benign node will not grant any dissemination privilegethat is beyond those of colluding users.Resistance to DoS attacks. There are DoS attacks against basicDiDrip by exploiting: (1) authentication delays, (2) the expensivesignature verifications, and (3) the Trickle algorithm.First, with the use of Merkle hash tree or data hash chain,each node can efficiently authenticate a data packet by afew hash operations. Second, using the message specificpuzzle approach, each node can efficiently verify a puzzlesolution to filter a fake signature message and to forward adata packet using Trickle without waiting for signature verification.Therefore, all the above DoS attacks are defended.DiDrip can successfully defeat all three types of DoSattacks even if there are compromised network users andsensor nodes. Indeed, without the private key and the unreleasedpuzzle keys of the network users, even an insideattacker cannot forge any signature/data packets.Ensurance of freshness. If the privilege of a user allows him/her to disseminate data items to his/her own set of nodes, theversion number in each item can ensure the freshness ofDiDrip. On the other hand, if a node receives data items frommultiple users, the version number can be replaced by a timestampto indicate the freshness of a data item. More specifically,a timestamp is attached into the root of the Merkle hashtree (or the hash value of the first data packet).Scalability. Different from centralized approaches, anauthorized user can enter the network and then disseminatesdata items into the targeted sensor nodes. Moreover,as to be demonstrated by our experiments in a testbed with24 TelosB motes in the next section, the security functions inour protocol have low impact on propagation delay. Notethat the increase in propagation delay is dominated by thesignature verification time incurred at the one-hop neighboringnodes of the authorized user. Thus, the proposedprotocol is efficient even in a large-scale WSN with thousandsof sensor nodes. Also, as shown in Section 5.2, ourprotocol can support a large number of users.Also, as described above, DiDrip can achieve dynamic participation.Moreover, in the next section, our implementationresults will demonstrate DiDrip has lowenergy overhead.In the following, we give the formal proof of the authenticityand integrity of the disseminated data items in DiDripbased on the three assumptions below:Assumption 1: There exist pseudo-random functionswhich are polynomially indistinguishable from truly randomfunctions.Assumption 2: There exist target collision-resistance(TCR) hash functions [16], where if for all probabilistic-polynomial-time (PPT) adversaries, say A, A have negligibleprobability in winning the following game: A first choose amessage m, and then A are given a random function hð:Þ. Towin, A must output m0 6¼m such that hðm0Þ ¼ hðmÞ. Notethat in our scheme, the TCR hash function can be implementedby the common hash functions, such as SHA-1.Assumption 3: ECC signature is existentially unforgeableunder adaptive chosen-message attacks. 
Note thatin our scheme, ECC signature can use the standardECDSA of 160 bits.Theorem 1. DiDrip achieves the authenticity and integrity of dataitems, assuming the indistinguishability between pseudo-randomnessand true randomness, and assuming that hð:Þ is aTCR hash function and ECC signature is existentially unforgeableunder adaptive chosen-message attacks.Proof. Our theorem follows from Theorem A.1 in [16], andthus here we only give a proof sketch briefly. To beginwith, we assume ECC signature generation and verificationguarantee authenticity and integrity of signed messages,and every receiver has obtained an authentic copyof the legitimate sender’s public key. And we also assumethat hð:Þ is a TCR hash function. Therefore, here the securityof DiDrip is proven based on the indistinguishabilitybetween pseudo-randomness and true randomness.First, there exists a PPT adversary A, which can defeatauthenticity of data items in DiDrip. This means that Acontrols the communication links and manages, withnon-negligible probability, to deliver a message m to areceiver R, such that the sender S has not sent m but Raccepts m as authentic and coming from S. Then there isa PPT adversary B that uses A to break the indistinguishabilitybetween pseudo-randomness and true randomnesswith non-negligible advantage. That is, B getsaccess to h (as an oracle) and can tell with non-negligibleprobability if h is a pseudorandom function (PRFð:Þ) orif h is a totally random function.To this end, B can query on inputs x of its choice and beanswered with hðxÞ. Hence first B simulates for A a networkwith a sender S and a receiver R. Then B works byrunning A in the way similar to that in [17]. Namely, Bchooses a number l 2 f1; . . .; ng at random, where n is thetotal number of packets to be sent in the data dissemination.Note that B hopes thatAwill forge the lth packet Pl.B grants access to the oracle h, which is either PRFð:Þor an ideal random function. B can adaptively query anarbitrarily chosen x to the oracle and get the outputwhich is either PRFðxÞ or a random value uniformlyselected from f0; 1g_. After performing polynomiallymany queries, B finally makes the decision of whether or1134 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 26, NO. 4, APRIL 2015not the oracle is PRFð:Þ or the ideal random function. Asa result, B wins the game if the decision is correct.We argue that B succeeds with non-negligible probabilitysketchily. If h is a truly random function then A hasonly negligible probability to successfully forge thepacket Pl in the data items. Therefore, if h is random thenB makes the wrong decision only with negligible probability.On the other hand, we have assumed that if theauthentication is done using PRFð:Þ then A forges somepacket with non-negligible probability _. It follows that ifh is PRFð:Þ then B makes the right decision with probabilityat least _=l (which is also non-negligible).In addition, if the adversary A is able to cause areceiver R to accept a forged packet Pl, it implies theadversary A is able to find a collision on hðP0lÞ ¼ hðPlÞ inthe packet Pl_1. However, according to Assumption 2,hð:Þ is a TCR hash function. Moreover, due to the unforgeabilityof ECC signature (Assumption 3), it is impossiblethat A hands R a forged initial packet from S (Fordata hash chain approach, A forges the signatureSIGSKj ðP1Þ; For Merkle hash tree method, A forges thesignature SIGSKj ðrootÞ. Therefore, the above contradictionsmeans also that the authenticity and integrity ofdata items in DiDrip. 
tu7 IMPLEMENTATION AND PERFORMANCEEVALUATIONWe evaluate DiDrip by implementing all components on anexperimental test-bed. Also, we choose Drip for performancecomparison.7.1 Implementation and Experimental SetupWe have written programs that execute the functions of thenetwork owner, user and sensor node. The network ownerand user side programs are C programs using OpenSSL [18]and running on laptop PCs (with 2 GB RAM) under Ubuntu11.04 environment with different computational power.Also, the sensor node side programs are written in nesCand run on resource-limited motes (MicaZ and TelosB). TheMicaZ mote features an 8-bit 8-MHz Atmel microcontrollerwith 4-kB RAM, 128-kB ROM, and 512 kB of flash memory.Also, the TelosB mote has an 8-MHz CPU, 10-kB RAM, 48-kB ROM, 1MB of flash memory, and an 802.15.4/ZigBeeradio. Our motes run TinyOS 2.x. Additionally, SHA-1 isused, and the key sizes of ECC are set to 128 bits, 160 and192 bits, respectively. Throughout this paper, unlessotherwise stated, all experiments on PCs (respectively, sensornodes) were repeated 100,000 times (respectively,1,000 times) for each measurement in order to obtain accurateaverage results.To implement DiDrip with the data hash chain method(rsp. the Merkle hash tree method), the following functionalitiesare added to the user side program of Drip: constructionof data hash chain (rsp. Merkle hash tree) of around of dissemination data, generation of the signaturepacket and all data packets. For obtaining version numberof each data item, the DisseminatorC and DisseminatorPmodules in the Drip nesC library has been modified toprovide an interface called DisseminatorVersion. Moreover,the proposed hash tree method is implemented withoutand with using the message specific puzzle approachpresented in Appendix, resulting in two implementationsof DiDrip; DiDrip1 and DiDrip2. In DiDrip1, when a nodereceives a signature/data packet with a new version number,it authenticates the packet before broadcasting it to itsnext-hop neighbours. On the other hand, in DiDrip2, anode only checks the puzzle solution in the packet beforebroadcasting the packets. We summarize the pros andcons of all related protocols in Table 2.Based on the design of DiDrip, we implement the verificationfunction for signature and data packets based onthe ECDSA verify function and SHA-1 hash function ofTinyECC 2.0 library [19] and add them to the Drip nesClibrary. Also, in our experiment, when a network user(i.e., a laptop computer) disseminates data items, it firstsends them to the serial port of a specific sensor node inthe network which is referred to as repeater. Then, therepeater carries out the dissemination on behalf of theuser using DiDrip.Similar to [7], we use a circuit to accurately measure thepower consumption of various cryptographic operationsexecuted in a mote. The Tektronix TDS 3034C digital oscilloscopeaccurately measures the voltage Vr across the resistor.Denoting the battery voltage as Vb (which is 3 volts in ourexperiments), the voltage across the mote Vm is then Vb _ Vr.Once Vr is measured, the current through the circuit I canbe obtained by using Ohm’s law. The power consumed bythe mote is then VmI. By also measuring the execution timeof the cryptographic operation, we can obtain the energyconsumption of the operation by multiplying the powerand execution time.7.2 Evaluation ResultsThe following metrics are used to evaluate DiDrip; memoryoverhead, execution time of cryptographic operations andpropagation delay, and energy overhead. 
The memoryoverhead measures the required data space in the implementation.The propagation delay is defined as the timefrom construction of a data hash chain until the parameterson all sensor nodes corresponding to a round of disseminateddata items are updated.TABLE 2The Pros and Cons of All Related ProtocolsHE ET AL.: SECURE AND DISTRIBUTED DATA DISCOVERY AND DISSEMINATION IN WIRELESS SENSOR NETWORKS 1135Table 3 shows the execution times of some importantoperations in DiDrip. For example, the execution times forthe system initialization phase and signing a random20-byte message (i.e., the output of SHA-1 function) are1.608 and 0.6348 ms on a 1.8-GHz Laptop PC, respectively.Thus, if SHA-1 is used, generating a user certificate or signinga message takes 0.6348 ms on a 1.8-GHz Laptop PC.Fig. 3 shows the execution times of SHA-1 hash function(extracted from TinyECC 2.0 [19]) on MicaZ and TelosBmotes. The inputs to the hash function are randomly generatednumbers with length varying from 24 to 156 bytes inincrements of 6 bytes. Note that, in our protocol, the hashfunction is applied to an entire packet. There are several reasonsthat, possibly, a packet contains a few tens of bytes.First, the advertisement packet has additional informationsuch as certificate and signature. Second, with the Merklehash tree method, each packet contains the disseminateddata item along with the related internal nodes of the treefor verification purpose. Third, although several bytes is atypical size of a data item, sometimes a disseminated dataitem may be a bit larger. Moreover, for sensors with IEEE802.15.4 compliant radios, the maximum payload size is 102bytes for each packet. Therefore, we have chosen a widerrange of input size to SHA-1 to provide readers a more completepicture of the performance. We perform the sameexperiment 10,000 times and take an average over them. Forexample, the execution times on a MicaZ mote for inputs of54 bytes, 114 bytes, and 156 bytes are 9.6788, 18.947, and28.0515 ms, respectively. Also, the execution times on aTelosB mote for inputs of 54, 114, and 156 bytes are 5.7263,10.7529, and 15.629 ms, respectively.To measure the execution time of public key cryptography,as shown in Table 41, we have implemented the ECCverification operation (with a random 20-byte number asthe output) of TinyECC 2.0 library [19] on MicaZ and TelosBmotes. For example, it is measured that the signature verificationtimes are 2.436 and 3.955 seconds, which are 252 and691 times longer than SHA-1 hash operation with a 54-byterandom number as input on MicaZ and TelosB motes,respectively. It can be seen packet authentication based onthe Merkle hash tree (or data hash chain) is much more efficient.Therefore, it is confirmed that DiDrip is suitable forsensor nodes with limited resources.Next, we compare the energy consumption of SHA-1 hashfunction and ECC verification under the condition that theradio of the mote is turned off. When a MicaZ mote is used inthe circuit, Vr ¼ 138 mV, I ¼ 6.7779 mA, Vm ¼ 2.8620 V,P ¼ 19.3983 mW. When a TelosB mote is used, Vr ¼ 38 mV,I ¼ 1.8664 mA, Vm ¼ 2.9620 V, P ¼ 5.5283 mW. With theexecution time obtained from Fig. 3, the energy consumptionon the motes due to the SHA-1 operation can be determined.For example, the energy consumption of SHA-1 operationwith a random 54-byte number as input on MicaZ andTelosB motes are 0.18775 and 0.03166 mJ, respectively. 
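The energy figures quoted above follow directly from Ohm's law and the measured execution times: P = Vm·I with Vm = Vb - Vr, and E = P·t. The short Python sketch below reproduces the SHA-1 numbers from the quantities reported in the text; only the arithmetic is shown, with the measured voltages, currents and times taken from above.

def energy_mj(vb_volts, vr_volts, current_ma, time_ms):
    # P = Vm * I with Vm = Vb - Vr; V * mA gives mW, and mW * ms gives microjoules
    vm = vb_volts - vr_volts
    power_mw = vm * current_ma
    return power_mw * time_ms / 1000.0   # convert microjoules to millijoules

# MicaZ: Vb = 3 V, Vr = 138 mV, I = 6.7779 mA, SHA-1 over 54 bytes takes 9.6788 ms
print(round(energy_mj(3.0, 0.138, 6.7779, 9.6788), 5))   # 0.18775 mJ, as reported
# TelosB: Vb = 3 V, Vr = 38 mV, I = 1.8664 mA, SHA-1 over 54 bytes takes 5.7263 ms
print(round(energy_mj(3.0, 0.038, 1.8664, 5.7263), 5))   # about 0.03166 mJ, as reported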
Also,the energy consumption of ECC signature verificationoperation on MicaZ and TelosB motes are 2835.2555 and1316.1777 mJ, respectively.Next, the impact of security functions on the propagationdelay is investigated in an experimental network as shownin Fig. 4. The network has 24 TelosB nodes arranged in a4 _ 6 grid. The distance between each node is about 35 cm,TABLE 3Running Time for Each Phase of the Basic Protocol of DiDrip (Except the Sensor Node Verification Phase)Fig. 3. The execution times of SHA-1 hash function on MicaZ and TelosBmotes.TABLE 4Running Time for ECC Signature VerificationFig. 4. The 4 _ 6 grid network of TelosB motes for measuring propagationdelay.1. Note that ECC-160 is faster than ECC-128, because the columnwidth of ECC-160 is set to 5 for hybrid multiplication optimizationwhile that of ECC-128 is set to 4.1136 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 26, NO. 4, APRIL 2015and the transmission power is configured to be the lowestlevel so that only one-hop neighbours are covered in thetransmission range. The repeater is acted by the node locatingat the vertex of the grid.In the experiments, the packet delivery rate from the networkuser is 5 packets/s. The lengths of round and datafields in a data item are set to 4 bits and 2 bytes, respectively.A hash function with 8-byte truncated output is usedto construct data hash chains. An ECC-160 signature is 40bytes long. Each experiment is repeated 20 times to obtainan average measurement. Figs. 5 and 6 plot the averagepropagation delays of Drip, DiDrip1, and DiDrip2 when thedata hash chain and Merkle hash tree methods areemployed, respectively. It can be seen that the propagationdelay almost increases linearly with the number of dataitems per round for all three protocols. Moreover, the securityfunctions in DiDrip2 have low impact on propagationdelay. For these five experiments of the data hash chainmethod, DiDrip2 is just 3.448, 4.158, 3.222, 3.919 and 2.855 smore than that of Drip, respectively. Note that the increasein propagation delay is dominated by the signature verificationtime incurred at the one-hop neighboring nodes of thebase station. This is because each node carries out signatureverification only after forwarding data packets (with validpuzzle solutions).Table 5 shows the memory (ROM and RAM) usage ofDiDrip2 (with the data hash chain method) on MicaZ andTelosB motes for the case of four data items per round. Thecode size of Drip and a set of verification functions fromTinyECC (secp128r1, secp160r1 and secp192r1, which areimplementations based on various elliptic curves accordingto the Standards for Efficient Cryptography Group) areincluded for comparison. For example, the size of DiDripimplementation corresponds to 26.18 and 56.82 percent of theRAMand ROMcapacities of TelosB, respectively. Clearly, theROM and RAM consumption of DiDrip is more than that ofDrip because of the extra security functions. Moreover, it canbe seen thatmajority of the increasedROMis due to TinyECC.8 CONCLUSION AND FUTURE WORKIn this paper, we have identified the security vulnerabilitiesin data discovery and dissemination when used inWSNs, which have not been addressed in previousresearch. Also, none of those approaches support distributedoperation. Therefore, in this paper, a secure and distributeddata discovery and dissemination protocolnamed DiDrip has been proposed. 
Besides analyzing thesecurity of DiDrip, this paper has also reported the evaluationresults of DiDrip in an experimental network ofresource-limited sensor nodes, which shows that DiDripis feasible in practice. We have also given a formal proofof the authenticity and integrity of the disseminated dataitems in DiDrip. Also, due to the open nature of wirelesschannels, messages can be easily intercepted. Thus, in thefuture work, we will consider how to ensure data confidentialityin the design of secure and distributed datadiscovery and dissemination protocols.APPENDIXFURTHER IMPROVEMENT OF DIDRIP SECURITY ANDEFFICIENCYBy the basic DiDrip protocol, we can achieve secure and distributeddata discovery and dissemination. To furtherenhance the protocol, here we propose two modifications toFig. 5. Propagation delay comparison of three protocols when the datahash chain method is employed.Fig. 6. Propagation delay comparison of three protocols when the Merklehash tree method is employed.TABLE 5Code Sizes (Bytes) on MicaZ and TelosB MotesHE ET AL.: SECURE AND DISTRIBUTED DATA DISCOVERY AND DISSEMINATION IN WIRELESS SENSOR NETWORKS 1137improve the efficiency and security of DiDrip. For brevity,only those parts of the basic protocol that require changeswill be presented.Avoiding the Generation, Transmission andVerification of CertificatesThere are some efficiency problems caused by the generation,transmission, and verification of certificates. First, it isnot efficient in communication, as the certificate has to betransmitted along with the advertisement packet acrossevery hop as the message propagates in the WSN. A largeper-message overhead will result in more energy consumptionon each sensor node. Second, to authenticate each advertisementpacket, it always takes two expensive signatureverification operations because the certificate should alwaysbe authenticated first. To address these challenges, a feasibleapproach is that before the network deployment, the publickey/dissemination-privilege pair of each network user isloaded into the sensor nodes by the network owner. Once anew user joins the network after the network deployment,the network owner can notify the sensor nodes of the user’spublic key/dissemination privilege through using the privatekey of itself. The detailed description is as follows.User Joining PhaseAccording to the basic protocol of DiDrip, user Uj generatesits public and private keys and sends a 3-tuple<UIDj; Prij; PKj > to the network owner. When the networkowner receives the 3-tuple, it no longer generates thecertificate Certj. Instead, it signs the 3-tuple with its privatekey and sends it to the sensor nodes. Finally, each nodestores the 3-tuple.Packet Pre-Processing PhaseThe user certificate Certj stored in packet P0 is replaced byUIDj.Packet Verification PhaseIf this is an advertisement packet, according to the receivedidentity UIDj, node Sj first picks up the dissemination privilegePrij from its storage and then pays attention to thelegality of Prij. If the result is positive, node Sj uses the publickey PKj from its storage to run an ECDSA verify operationto authenticate the signature; otherwise, node Sjsimply discards the packet. Note that node Sj does not needto authenticate the certificate.As described above, the public-key/dissemination-privilegepair <UIDj; Prij; PKj > of each network user is just2 þ 6 þ 40 ¼ 48 bytes. 
Therefore, assuming the protocol supports 500 network users, the code size is about 23 KB. We consider resource-limited sensor nodes such as TelosB motes as examples. The 1-MB flash memory is enough for storing these public parameters.

Message Specific Puzzle Approach for Resistance to DoS Attacks
DiDrip uses a digital signature to bootstrap the authentication of a round of data discovery and dissemination. This authentication is vulnerable to DoS attacks. That is, an adversary may flood a lot of illegal signature messages (i.e., advertisement messages in this paper) to the sensor nodes to exhaust their resources and render them less capable of processing the legitimate signature messages. Such an attack can be defended against by applying the message specific puzzle approach [2]. This approach requires each signature message to contain a puzzle solution. When a node receives a signature message, it first checks that the puzzle solution is correct before verifying the signature. There are two characteristics of the puzzles. First, the puzzles are difficult to solve, but their solutions are easy to verify. Second, there is a tight time limit to solve a puzzle. This discourages adversaries from launching the DoS attack even if they are computationally powerful. More details about this approach can be found in [2].

Another advantage of applying the message specific puzzle is to reduce the dissemination delay, which is the time for a disseminated packet to reach all nodes in a WSN. Recall that in step 1.a) of the packet verification phase, when a node receives the signature packet, it first carries out the signature verification before using the Trickle algorithm to broadcast the signature packet. This means that the dissemination delay depends on the signature verification time tsv. On the other hand, when the message specific puzzle approach is applied, a node can just verify the validity of the puzzle solution before broadcasting the signature packet. Then, the dissemination delay only depends on the puzzle solution verification time tpv. Since tpv is much smaller than tsv, the dissemination delay is significantly reduced. Moreover, the reduction in dissemination delay is proportional to the network size. This is demonstrated by the experiments presented in Section 7.2.
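The puzzle idea can be illustrated with a generic hash puzzle: the sender searches for a value whose hash (together with the message) has a required number of leading zero bits, while a receiver checks a candidate solution with a single hash. The Python sketch below is only in the spirit of the message specific puzzle of [2] (the actual construction there additionally binds puzzles to a one-way key chain); the difficulty value and helper names are illustrative assumptions.

import hashlib

DIFFICULTY_BITS = 16      # assumed difficulty; a real deployment would tune this

def leading_zero_bits(digest):
    value = int.from_bytes(digest, "big")
    return len(digest) * 8 - value.bit_length()

def solve_puzzle(message):
    # Sender side: brute-force search, deliberately expensive.
    solution = 0
    while True:
        digest = hashlib.sha1(message + solution.to_bytes(8, "big")).digest()
        if leading_zero_bits(digest) >= DIFFICULTY_BITS:
            return solution
        solution += 1

def verify_puzzle(message, solution):
    # Receiver side: a single hash, so forged signature packets are filtered cheaply.
    digest = hashlib.sha1(message + solution.to_bytes(8, "big")).digest()
    return leading_zero_bits(digest) >= DIFFICULTY_BITS

advertisement = b"advertisement packet bytes"
s = solve_puzzle(advertisement)
assert verify_puzzle(advertisement, s)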
Receiver Cooperation in Topology Control for Wireless Ad-Hoc Networks
Abstract—We propose employing receiver cooperation in centralizedtopology control to improve energy efficiency as well asnetwork connectivity. The idea of transmitter cooperation hasbeen widely considered in topology control to improve networkconnectivity or energy efficiency. However, receiver cooperationhas not previously been considered in topology control. In particular,we show that we can improve both connectivity and energy efficiencyif we employ receiver cooperation in addition to transmittercooperation. Consequently, we conclude that a system based bothon transmitter and receiver cooperation is generally superior toone based only on transmitter cooperation. We also show that theincrease in network connectivity caused by employing transmittercooperation in addition to receiver cooperation is at the expense ofsignificantly increased energy consumption. Consequently, systemdesigners may opt for receiver-only cooperation in cases for whichenergy efficiency is of the highest priority or when connectivityincrease is no longer a serious concern.Index Terms—Ad-hoc network, energy efficiency, multi-hopcommunications, network connectivity, receiver cooperation,topology control, transmitter cooperation.I. INTRODUCTIONTHE wireless ad-hoc network has been receiving growingattention during the last decade for its various advantagessuch as instant deployment and reconfiguration capability. Ingeneral, a node in a wireless ad-hoc network suffers fromconnectivity instability because of channel quality variation andlimited battery lifespan. Therefore, an efficient algorithm forcontrolling the communication links among nodes is essentialfor the construction of a wireless ad-hoc network. In a topologycontrol scheme, communication links among nodes are definedto achieve certain desired properties for connectivity, energyconsumption, mobility, network capacity, security, and so on.In this paper, we propose topology control schemes that aimManuscript received February 23, 2014; revised July 24, 2014 and November12, 2014; accepted November 12, 2014. Date of publication December 4, 2014;date of current version April 7, 2015. Part of this work was presented at IEEEWCNC, Shanghai, China, April 2013. This work was supported in part byBasic Science Research Program through the National Research Foundationof Korea (NRF) funded by the Ministry of Education (NRF-2010-0025062 andNRF-2013R1A1A2011098). The associate editor coordinating the review ofthis paper and approving it for publication was M. Elkashlan. (Correspondingauthors: Do-Sik Yoo and Seong-Jun Oh).K. Moon, W. Lee, and S.-J. Oh are with Korea University, Seoul 136-701,Korea (e-mail: keith@korea.ac.kr; wlee@korea.ac.kr; seongjun@korea.ac.kr).D.-S. Yoo is with the Department of Electronic and Electrical Engineering,Hongik University, Seoul 121-791, Korea (e-mail: yoodosik@hongik.ac.kr).Color versions of one or more of the figures in this paper are available onlineat http://ieeexplore.ieee.org.Digital Object Identifier 10.1109/TWC.2014.2374617to increase the energy efficiency and the network connectivitysimultaneously.In a wireless ad-hoc network, two nodes that are not directlyconnected may possibly communicate with each other throughso-called multi-hop communications [1], [2]. By employingmulti-hop communication, a node in a wireless ad-hoc networkcan extend its communication range through cascaded multihoplinks and eliminate some dispensable links to reduce the totalrequired power. 
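The power saving offered by multi-hop relaying can be seen with a two-line calculation: under a path loss exponent k > 1, the transmit power needed to cover a distance d grows roughly like d^k, so relaying over two hops of length d/2 needs less total power than one direct hop. The Python sketch below illustrates this with an assumed exponent; the constant factors of the path loss model are omitted.

K = 3.5            # assumed path loss exponent (k > 1)
d = 100.0          # source to destination distance, arbitrary units

direct_power = d ** K                 # proportional to the power of a single direct hop
two_hop_power = 2 * (d / 2) ** K      # relay at the midpoint: two hops of length d/2

print(two_hop_power / direct_power)   # 2 * (1/2)**K, about 0.177 for k = 3.5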
Various efforts have been made to study howthe links must be maintained and how much power must be associatedwith each of those links for optimal network operationsdepending on the situation at hand. For example, Kirousis et al.[3] and Clementi et al. [4] studied the problem of minimizingthe sum power consumption of the nodes in an ad-hoc networkand showed that this problem is nondeterministic polynomialtime(NP) hard. Because the sum power minimization problemis NP hard, the authors in [4] proposed a heuristic solution forpractical ad-hoc networks. Ramanathan and Rosales-Hain, in[5], proposed two topology control schemes that minimize themaximum transmission power of each node with bi-directionaland directional strong connectivities, respectively. When thenumber of participating nodes is very large, it is crucial toreduce the transmission delay due to multi-hop transmissions.To maintain the total transmission delay within a tolerable limit,Zhang et al. studied delay-constrained ad-hoc networks in [6]and Huang et al. proposed a novel topology control scheme in[7] by predicting node movement.In [3]–[7], it was assumed that there exists a centralizedsystem controlling nodes so that global information such asnode positions and synchronization timing is known by eachnode in advance. However, such an assumption can be toostrong, especially in the case of ad-hoc networks. For thisreason, a distributed approach has been widely considered [8]–[11], where each node has to make its decision based on the informationit has collected from nearby neighbor nodes. Li et al.proposed a distributed topology control scheme in [8] andproved that the distributed topology control scheme preservesthe network connectivity compared with a centralized one.Because the topology control schemes in [3]–[8] guarantee onlyone connected neighbor for each node, the network connectivitycan be broken even when only a single link is disconnected.Accordingly, a reliable distributed topology control scheme thatguarantees at least k-neighbors was proposed in [9]. The resultin [9] was extended to a low computational complexity schemein [10], to a mobility guaranteeing scheme in [11], and to anenergy saving scheme in [12].1536-1276 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.MOON et al.: RECEIVER COOPERATION IN TOPOLOGY CONTROL FOR WIRELESS AD-HOC NETWORKS 1859In [13], the concept of cooperative communications was firstemployed in centralized topology control, where it was shownthat cooperative communications can dramatically reduce thesum power consumption in broadcast network. Cardei et al.applied the idea of [13] to wireless ad-hoc networks in [14].Yu et al. further showed that cooperative communications canextend the communication range of each node with only amarginal increment in power consumption so that networkconnectivity is increased in an energy efficient manner [15],[16]. Because of these various advantages, the idea of cooperativecommunications has been widely considered in recentstudies on topology control to maximize capacity [17], improverouting efficiency [18], and mitigate interference from nearbynodes [19], [20]. The idea of cooperative communications inthese previous works [13]–[20] is realized in the followingway. First, a transmitting node sends a message to its neighbornodes (called helper nodes). 
After the helper nodes decode themessage, they (as well as the transmitting node in some cases)retransmit the message to a receiving node, and the receivingnode decodes the message by combining the signals frommultiple nodes. Therefore, strictly speaking, only the conceptof transmitter cooperation has been employed, and receivercooperation has not been considered.In this paper, we propose to employ the idea of receivercooperation in centralized topology control schemes, possiblyin combination with transmitter cooperation, to increase thenetwork connectivity in an energy efficient way. Consequently,we propose two centralized topology control schemes, onebased solely on receiver cooperation, and the other basedboth on transmitter and receiver cooperation. For comparisonwith proposed schemes, we consider a cooperative topologycontrol scheme in [16] that is based solely on transmittercooperation. We show, through extensive simulations, that wecan improve both network connectivity and energy efficiencyif we employ receiver cooperation in addition to transmittercooperation. We conclude that the system based both ontransmitter and receiver cooperation is generally superior tothat based only on transmitter cooperation. We also showthat the system based solely on receiver cooperation is asenergy efficient as one based both on transmitter and receivercooperation despite a slight decrease in network connectivity.Although the system based both on transmitter and receivercooperation achieves higher network connectivity than onebased only on receiver cooperation, we show that the additionalconnectivity increase requires significantly increased energyconsumption. For this reason, system designers may opt forreceiver-only cooperation, if energy efficiency is of the highestpriority or connectivity increase is no longer of seriousconcern.The remainder of this paper is organized as follows.In Section II, we describe the channel model consideredthroughout this paper. In Section III, we explain the topologycontrol scheme without cooperation that underlies the twocooperative topology control schemes considered in this paper.The two cooperative topology control schemes are then describedin Section IV. Furthermore, the performance of the twocooperative topology control schemes are numerically analyzedin Section V. Finally, we draw conclusions in Section VI.II. SYSTEM MODELIn this section, we describe the system model consideredthroughout this paper. We consider a network V ≡{v1, v2, . . . , vn} consisting of n nodes that are assumed to beuniformly distributed over a certain region in R2. The nodesare assumed to communicate with one another by transmittingsignals over a wireless channel with given bandwidth W. Weassume that the physical location of each node does not changewith time.To model a practical wireless channel, we assume that thepath loss PL(di j) between nodes vi and vj is given byPL(di j)[dB] = PLd0 +10k log_di jd0_+2loghi j +Xó+c. (1)Here, PLd0 is the reference path loss at unit distance d0 obtainedfrom the free space path loss model [21], and k denotes the pathloss exponent that represents how quickly the transmit powerattenuates as a function of the distance. The variables di j andhi j respectively denote the distance and the randomly varyingfast fading coefficient between vi and vj . 
In addition, X_sigma is a random variable introduced to account for the shadowing effect. We assume that h_ij and X_sigma vary independently from packet to packet, but remain constant during each packet duration. We assume further that h_ij^2 follows a chi-square distribution with two degrees of freedom and X_sigma follows a normal distribution with zero mean and standard deviation sigma. Finally, the variable c is the offset correction factor between the mathematical model and field measurement. We note that the values of PL_d0, d0, k, sigma, and c vary depending on the channel scenario, urban or suburban [22]. For given PL_d0, d0, k, sigma, and c, when node vi transmits a signal to node vj with power Pi, the received signal to noise ratio (SNR) gamma_ij(Pi) is given as

gamma_ij(Pi) = Pi / (N_0,j * W * 10^(0.1*PL(d_ij))),   (2)

where N_0,j denotes the one-sided noise power spectral density at vj. Throughout this paper, we assume that the maximum transmit power of each node is given by Pmax.

As the final issue in the system model, we briefly discuss network synchronization. Communication in a completely asynchronous manner is impossible, or at least very difficult to achieve. In fact, synchronization can be a particularly important issue in ad-hoc networks [23]–[25]. In this paper, we assume that symbol-level synchronization is maintained among participating nodes. Although detailed synchronization techniques are not the main focus of this paper, we briefly describe how the issue of synchronization can be resolved with existing methods. Synchronization techniques have been reported that can achieve time errors of around 3 to 7 microseconds. At such a level of synchronization, it is desirable to maintain a symbol duration longer than 50 microseconds, which corresponds to a symbol rate of up to 20 kilo-symbols per second. A symbol rate of 20 kilo-symbols with rudimentary binary phase shift keying (BPSK) modulation results in a data rate of only 20 kbps, which is not very high. However, we can employ multi-carrier techniques such as orthogonal frequency division multiplexing (OFDM) to increase the data rate while maintaining or reducing the symbol rate. For example, if we employ an OFDM system with 512 subcarriers, the data rate can be increased to about 10 Mbps using a simple BPSK sub-carrier modulation scheme. Consequently, even with existing techniques such as the OFDM scheme and the synchronization algorithms proposed in [25], it is possible to maintain the symbol-level synchronization required to implement the algorithms proposed in this paper. (Fig. 1: A pictorial representation of G = (V,E) with V = {v1, ..., v8} and E = {(v1;v2)NN, (v1;v3)NN, (v4;v5)NN, (v4;v6)NN, (v4;v7)NN}.)

III. NODE-TO-NODE TOPOLOGY CONTROL
In this section, we explain a topology control scheme, which we refer to as the node-to-node topology control (NNTC) scheme, that is based solely on node-to-node communication links. To describe the NNTC scheme, we first consider the concept of a wireless communication link between two nodes and its related definitions. In this paper, a wireless link between two nodes is said to exist if the received SNR exceeds a certain threshold, meaning that the packet error probability is below a certain level (corresponding to the threshold). More formally, we say that there exists a node-to-node (N-N) link from node vi to node vj if and only if

f(gamma_ij(Pi)) <= alpha_tau,   (3)

for a certain transmit power Pi <= Pmax from vi.
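The link test in (1)–(3) can be sketched numerically as follows. Here f is the packet error probability function introduced formally just below; its exponential form and all parameter values in this Python fragment are illustrative assumptions, not values used in the paper.

import math, random

# Assumed channel and system parameters (illustrative only)
PL_D0, D0, K, SIGMA, C = 40.0, 1.0, 3.5, 4.0, 0.0
N0, W, P_MAX, ALPHA_TAU = 4e-21, 1e6, 1.0, 0.1

def path_loss_db(d):
    # Eq. (1): PL_d0 + 10*k*log10(d/d0) + 2*log10(h) + X_sigma + c
    h2 = random.expovariate(1.0)               # fast fading power draw
    x_sigma = random.gauss(0.0, SIGMA)         # shadowing term in dB
    return PL_D0 + 10 * K * math.log10(d / D0) + 2 * math.log10(math.sqrt(h2) + 1e-12) + x_sigma + C

def snr(p_tx, pl_db):
    # Eq. (2): gamma = P / (N0 * W * 10**(0.1 * PL))
    return p_tx / (N0 * W * 10 ** (0.1 * pl_db))

def packet_error_prob(gamma):
    # Assumed monotonically decreasing f; stands in for the real coding/modulation curve
    return math.exp(-gamma / 50.0)

def nn_link_exists(d):
    # Eq. (3): an N-N link exists if f(gamma(P)) <= alpha_tau for some P <= Pmax;
    # since f is decreasing in P, it suffices to test at P_MAX.
    return packet_error_prob(snr(P_MAX, path_loss_db(d))) <= ALPHA_TAU

print(nn_link_exists(30.0), nn_link_exists(300.0))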
Here, f : R+ -> [0,1] denotes the packet error probability function associated with the given coding and modulation scheme and alpha_tau is the given threshold on the packet error probability, which we call the error threshold hereafter. We assume that f is a monotonically decreasing continuous function and that all the nodes share the same packet error probability function f. (In many previous works on topology control [14]–[17], (3) is equivalently written as gamma_ij(Pi) >= SNR_tau = f^(-1)(alpha_tau). However, to consider the receiver cooperation scheme in a unified framework, we directly consider the packet error probability function f.)

When there exists a uni-directional N-N link from vi to vj, the power Pi that satisfies (3) with equality, which we denote by PNN(vi -> vj), is called the minimum N-N routable power of the N-N link from vi to vj. We note that PNN(vi -> vj) directly follows from the definition:

PNN(vi -> vj) = N_0,j * W * f^(-1)(alpha_tau) * 10^(0.1*PL(d_ij)).   (4)

If both the uni-directional N-N links from vi to vj and from vj to vi exist, we say that there exists an N-N bi-directional link, or simply an N-N link, between the two nodes vi and vj, which we denote by (vi; vj)NN. The minimum N-N round-trip power PNN(vi, vj) of the bi-directional N-N link (vi; vj)NN is defined as the sum of the two uni-directional minimum N-N routable powers, namely, as

PNN(vi, vj) = PNN(vi -> vj) + PNN(vj -> vi).   (5)

We note that there are some situations in which two nodes vi and vj can communicate with each other even if there is no N-N link between vi and vj. For example, we consider the case in which there are two N-N links (v1; v2)NN and (v1; v3)NN. In this case, v2 and v3 can exchange a message through v1 even if there is no N-N link between v2 and v3. To route a message through multiple N-N links, all available N-N links should be known to the nodes. To reduce the routing complexity, only some of the existing N-N links are used for communications in practice. By eliminating redundant links, we can simplify the message routing protocol and save the power consumed for exchanging reference signals such as pilot and channel information [26], [27]. We denote the set of N-N links to be used for routing by E. Consequently, (vi; vj)NN in E means that there exists an N-N link (vi; vj)NN and this N-N link is to be used for routing. Here, we note that (vi; vj)NN not in E does not necessarily mean that there is no N-N link between vi and vj. In graph theory, the combination G = (V,E) of V and E is called a graph with vertex set V and edge set E. In the remainder of this paper, nodes and links shall also be referred to as vertexes and edges, respectively. For a given E, if (vi; vj)NN is in E, vi is said to be a neighbor of vj and vice versa. We denote by N(vi|E) the set of neighbors of vi. For illustration, we consider the graph G = (V,E) with V = {v1, v2, ..., v8} and E = {(v1; v2)NN, (v1; v3)NN, (v4; v5)NN, (v4; v6)NN, (v4; v7)NN}, which compactly describes the situation in Fig. 1. In this example, v5, v6 and v7 are neighbors of v4; therefore, N(v4|E) = {v5, v6, v7}. Here, we note that v5 is not a neighbor of v7; however, it is possible for v5 to send a message to v7 if (v4; v5)NN and (v4; v7)NN are cascaded. Likewise, if vi and vj can send a message bi-directionally using a single N-N edge or multiple cascaded N-N edges, we say that vi and vj are connected by N-N edges. The maximal set of nodes connected by N-N edges in E is referred to as a cluster. For notational convenience, a given cluster {vi1, vi2, ..., vim} is denoted by Omega_max{i1,i2,...,im}. For instance, in Fig.
1, there arethree clusters {v1, v2, v3}, {v4, v5, v6, v7}, and {v8}, which aredenoted by Ù3, Ù7, and Ù8, respectively. As shown in thisMOON et al.: RECEIVER COOPERATION IN TOPOLOGY CONTROL FOR WIRELESS AD-HOC NETWORKS 1861Fig. 2. Steps to construct the edge set E for a given node distribution V. (a) Identification of all N-N links. (b) A typical example of a spanning forest ofGL = (V,L).example, several clusters can exist for a given graph.We denotethe set of all clusters by V . We note that V = {Ù3,Ù7,Ù8} inthe above example.We now describe precisely how the set E of N-N edges tobe used for routing in the NNTC scheme is constructed. For agiven node set V, the set L of all existing N-N links and the setV of clusters defined by the graph GL = (V,L) are identified.Next, the edge set E is defined as a subset of L such that thegraph G = (V,E) also leads to the same cluster set V as graphGL = (V,L). Several candidate algorithms exist that can buildE such as breath-first search (BFS) [28] and depth-first search(DFS) [29]. In this paper, we use the minimum-weight spanningforest (MSF) algorithm that aims to build a sparse edge setusing the optimal average power required for network structureconstruction [1], [8], [15], [16]. In the MSF algorithm, first a setTÙ called a minimum spanning tree (MST), is defined for eachcluster Ù ∈ V . After obtaining all the MSTs, the set FV , calledthe minimum spanning forest of V, is defined as the union of allthe MSTs, namely, asFV = _Ù∈VTÙ, (6)which is defined to be edge set E in the NNTC scheme.It now remains to describe how the MST TÙ is obtained foreach cluster Ù ∈ V . If Ù is a singleton, then TÙ is defined to bethe empty set /0. If Ù contains more than one node, to obtain TÙ,it is necessary to consider the set L|Ù of all edges that connectnodes in Ù. For instance, we consider the example depictedin Fig. 2(a) in which the network consists of three singletonclusters and nine non-singleton clusters. For a non-singletoncluster Ù encircled by a red colored line, the edge set L|Ù isdefined as the set of all edges inside the red circle. We call asubset T of L|Ù a spanning tree of Ù if and only if there are nocycles (loops) in T and if any two nodes in Ù are connected byedges in T. For example, the edge set of each cluster depictedin Fig. 2(b) is a spanning tree of that cluster. Among all theexisting spanning trees of Ù, the one that leads to the minimumedge-weight sum is referred to as the MST TÙ of Ù. Here, theminimum N-N round-trip power PNN(vi, vj) of the N-N link isused for the weight of each edge (vi, vj)NN ∈ L.We note that transmission through the link in FV is not completelyerror-free, but has a packet error probability of áô. However,in the following, we assume that the communication linkin FV is error-free, possibly with the help of an automatic repeatand request (ARQ) scheme. Clearly, the repeated transmissionwill consume additional energy. However, even with the simplestARQ scheme, the average required energy to complete asuccessful transmission is increased from a single transmission(with packet error rate áô) by a factor of 1/(1−áô) [30]. Wenote that the factor 1/(1−áô) is reasonably close to 1 if áô ischosen to be small, say, less than 0.1. Therefore, if áô is sufficientlysmall, the additional cost for error-free communicationis only a small fraction of the total cost and hence is negligible.IV. 
COOPERATIVE TOPOLOGY CONTROLWe note that inter-cluster communication, namely, communicationbetween nodes belonging to different clusters is notpossible solely through cascaded N-N links. To make interclustercommunications possible, [16] employed the idea oftransmitter cooperation in which multiple nodes in one clustersimultaneously transmit the same message to a single node inanother cluster. In [16], to keep the additional complexity due tothe employment of cooperative transmission manageable, it wasassumed that a pair of nodes belonging to two communicatingclusters were pre-assigned so that communications between thetwo clusters could only happen between these two nodes withthe help of nodes in their neighborhoods. We note that notonly the neighboring nodes around the transmitting node butalso the nodes around the receiving node can help to establishinter-cluster communications. Consequently, in this paper, wepropose to employ receiver cooperation in which the interclustercommunication is regarded as successful if the receivingnode or any of the neighboring nodes succeeds in receiving the1862 IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS, VOL. 14, NO. 4, APRIL 2015message correctly. If the neighboring nodes only around thereceiving node participate in the cooperation, the establishedlink between two clusters is referred to as the node-to-cluster(N-C) link. Furthermore, if neighboring nodes around boththe transmitting and receiving nodes participate in the linkestablishment, the inter-cluster communication link is calledcluster-to-cluster (C-C) link.In this section, we describe two centralized cooperativetopology control schemes based on N-C and C-C links thatare referred to as node-to-cluster topology control (NCTC) andcluster-to-cluster topology control (CCTC) schemes, respectively.In each of these cooperative topology control schemes,cooperative links are employed to connect the clusters obtainedfrom the graph G = (V,E) described in Section III. Consequently,the network configuration defined in a cooperativetopology control scheme is described by four sets, namely, theset V of nodes, the set E of edges used for routing in the NNTCscheme, the set V of clusters defined by the graph G = (V,E),and the set E of cooperative edges. For this reason, the networkconfigurations defined in the NCTC and CCTC schemes areidentified by GNC = (V,E,V ,ENC) and GCC = (V,E,V ,ECC),respectively. Here, ENC and ECC consist only of N-C and C-Cedges, respectively.A. NCTCIn this subsection, we describe how the network configurationGNC = (V,E,V ,ENC) corresponding to the NCTC schemeis defined. Given graph G = (V,E) and corresponding clusterset V , the edge set ENC is obtained in three steps. First, theset LNC of all N-C links connecting clusters in V is identified.Next, for each N-C link in LNC, the weight of the link isdefined as the minimum power required to establish it. Finally,the desired edge set ENC is defined as the MSF of the graphGLNC = (V ,LNC).To describe the NCTC scheme, we first define the nodeto-cluster (N-C) link. For more concrete understanding ofN-C link, we consider a simple example of receiver cooperationbetween two clusters Ù3 ={v1, v2, v3} and Ù7 ={v4, v5, v6, v7}.For illustration, we assume that the inter-cluster communicationlink between two clusters is established if the error probabilityis less than or equal to 0.1. We assume that the decoding errorprobabilities at nodes v4, v5, v6 and v7 are, respectively, given as0.3, 0.4, 0.8, and 0.9 when v1 sends a message with maximumpower. 
Consequently, node v1 and a node in Ù7 cannot establishinter-cluster communications between Ù3 and Ù7 throughN-N links. However, if any of the nodes in Ù7 succeed incorrectly decoding the message, the message can be routed toany of the desired nodes in Ù7. If such receiver cooperation isemployed, communication fails only when all four nodes v4, v5,v6, and v7 fail to decode the message at the same time. We notethat such a probability is 0.3×0.4×0.8×0.9 = 0.0864 < 0.1.For this reason, we say that cooperative communication linkbetween Ù3 and Ù7 is established.In the above example, all nodes in the receiving cluster tryto decode the transmitted message. However, if the size of thereceiving cluster is large, the routing protocol and maintenancecost can become very burdensome. For this reason, we assumethat a certain receiving node and its one hop neighbors participatein the receiver cooperation. To be more precise, for a givenpair of clusters, a certain node is selected from each clusterand the signal is assumed to be transmitted from either of thesetwo nodes and then received by the other node and its one-hopneighbors.We note that there exists a more aggressive method of receivercooperation than the one described above. For example,the bridge node can achieve a huge combining gain if thehelper nodes transmit observed soft information rather thandecoded bits. However, the transmission of the observed datagenerally consumes large amount of energy and bandwidth.Consequently, a sufficiently fine quantization must be consideredto employ soft combining. Because this problem ishighly complex, we assume in this paper that the helper nodesdecode the message and deliver it to the bridge node. However,considering the importance of this problem, serious researchemploying soft combining schemes should be pursued.For a more formal description, we consider two non-emptyclusters Ùl and Ùm from the given graph G = (V,E) defined inthe NNTC scheme. We formally define the concept of an N-Clink as follows.Definition 1: Let vbl∈ Ùl and vbm∈ Ùm. Then, we saythat there exists a bi-directional N-C link, or simply, a N-Clink denoted by (vbl ,N(vbl|L); vbm,N(vbm|L))NC between Ùland Ùm, if and only ifÐvr∈{vbm}∪N(vbm|L)f_ãbl r(Pbl )_≤ áô (7)andÐvr∈{vbl}∪N(vbl|L)f (ãbmr(Pbm)) ≤ áô (8)for some Pbl≤ Pmax and Pbm≤ Pmax.Here, L denotes the set of all N-N links described inSection III. In other words, all one-hop neighbors of the receivingnode are assumed to participate in receiver cooperationregardless whether they belong to E. We note that the errorprobability between helper and bridge node is assumed tobe zero, as mentioned in Section III. For a given N-C link(vbl ,N(vbl|L); vbm,N(vbm|L))NC, nodes vbl and vbm and setsN(vbl|L) and N(vbm|L) are called the bridge nodes and helpersets, respectively.In Definition 1, we note that the sum of the Pbl and Pbm valuesthat satisfy (7) and (8) with equality is the minimum total transmissionpower required to make round-trip communicationbetween Ùl and Ùm through (vbl ,N(vbl|L); vbm,N(vbm|L))NC.Because the sum Pbl +Pbm depends on the choice of the N-Clink, it is natural to choose the N-C link that minimizes the sumpower Pbl +Pbm. The minimized sum power shall be referred toas the minimum N-C round-trip power and the correspondingN-C link as the minimum power N-C link between Ùl and Ùm.We denote by PNC(Ùl ,Ùm) the minimum N-C round-trip powerbetween Ùl and Ùm.We now describe how we establish communications betweenÙl and Ùm. 
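Before turning to the routing description that follows, the feasibility test behind Definition 1 can be made concrete with a short sketch. This is not the authors' implementation; it only restates the rule that receiver cooperation fails exactly when every node in the bridge-plus-helper set fails to decode, so the product of the individual packet error probabilities is compared against the chosen threshold, and, among the feasible candidates, the link with the smallest round-trip power is kept. The function names, threshold, and candidate powers are illustrative.

```python
from math import prod

def nc_link_exists(error_probs, threshold=0.1):
    """Receiver cooperation fails only if *every* node in the bridge-plus-helper
    set fails to decode, so the N-C link is declared feasible when the product
    of the individual packet error probabilities does not exceed the threshold."""
    return prod(error_probs) <= threshold

def minimum_power_nc_link(candidates):
    """candidates: iterable of (bridge_l, bridge_m, P_bl, P_bm) tuples, each
    already satisfying the feasibility conditions in both directions.
    Returns the candidate with the smallest round-trip power P_bl + P_bm."""
    return min(candidates, key=lambda link: link[2] + link[3])

# Worked example from the text: decoding error probabilities at v4, v5, v6, v7
# when v1 transmits at maximum power.
probs_cluster7 = [0.3, 0.4, 0.8, 0.9]
print(prod(probs_cluster7))            # ≈ 0.0864
print(nc_link_exists(probs_cluster7))  # True -> cooperative link established

# Illustrative candidate links with made-up round-trip powers (mW).
candidates = [("v1", "v5", 120.0, 95.0), ("v2", "v4", 80.0, 110.0)]
print(minimum_power_nc_link(candidates))  # ('v2', 'v4', 80.0, 110.0)
```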
First, let vbl∈ Ùl and vvm∈ Ùm be the bridgenodes of the minimum power N-C link between Ùl and Ùmand let Hl and Hm be the helper sets of the link. We nowMOON et al.: RECEIVER COOPERATION IN TOPOLOGY CONTROL FOR WIRELESS AD-HOC NETWORKS 1863Fig. 3. Steps to construct the edge set ENC for the given graph G = (V,E). (a) Identification of all N-C links. (b) A typical example of a spanning forest ofGLNC = (V ,LNC).assume that a source node vs in Ùl −{vbl} attempts to senda message to destination node vd in Ùm −{vbm}. In this case,vs sends the message to bridge node vbl through cascaded N-Nedges, and then bridge node vbl transmits the message to Ùm.The message sent from vbl is then decoded at bridge nodevbm and all the nodes in the helper set Hm. Because of thedefinition of the N-C link, the message must be decoded, withnegligible failure rate, at least at one node in {vbm} ∪ Hm.Because Hm consists only of the one hop neighbors of vbm, thenodes that successfully decode the message can be determinedby vbm with little overhead. After determining the nodes thatsuccessfully decoded the message, vbm delivers the message totarget destination node vd through the cascaded N-N edges.Finally, we describe how the edge set ENC is constructedin the NCTC scheme. First, the minimum power N-C link isidentified for each pair of clusters between which N-C linksexist. Let LNC denote the set of the minimum power N-C linksobtained as the result. For each (vbl ,Hl ; vbm,Hm)NC ∈ LNC,the weight is then defined as the corresponding minimum N-Cround-trip power. After computing all the weights of LNC, thesparse edge set ENC is defined as theMSF of GLNC =(V ,LNC).Note that the MSF construction procedure described inSection III can be directly applied here by substituting V andL with V and LNC, respectively. In Fig. 3, the procedure isillustrated. For instance, Fig. 3(a) indicates all the minimumpower N-C links between clusters by solid red lines andFig. 3(b) illustrates the shape of a typical spanning forest thatdoes not include any loops. Likewise, after finding all thespanning forests of GLNC = (V ,LNC), the one that minimizesthe sum weight is defined as the MSF ENC. After obtainingthe ENC, the desired final graph GNC = (V,E,V ,ENC) for theNCTC scheme is constructed.B. CCTCIn this subsection, we describe the CCTC scheme and explainhow the network configuration GCC = (V,E,V ,ECC) correspondingto the CCTC scheme is defined. We first explainthe concept of a cluster-to-cluster (C-C) link and the relatedrouting protocol with a simple example. We assume that sourcenode vs ∈ Ùl attempts to send a message to destination nodevd ∈ Ùm. In this case, vs sends a message through cascadedN-N edges to a pre-defined bridge node vbl . After receivingthe message, vbl disseminates the message to the nodes in apre-defined helper set Hl . After decoding the message, vbl andvhl∈ Hl simultaneously transmit the message to Ùm in thenext time frame. In Ùm, a pre-defined bridge node vbm andthe nodes in a pre-defined helper set Hm attempt to decode themessage with the multiple signal replicas from the transmitters.If the maximum ratio combiner (MRC) [31] is employed at thereceiving node vr ∈ {vbm}∪Hm, the combined average receivedSNR ¯ãr at vr can be written as¯ãr = ãbl r(Pbl)+ Óvhl∈Hlãhl r(Phl ), (9)and the decoding error probability at vr is given as f (¯ãr). Toestablish the symbol combining in (9), the same signals fromthe multiple transmitters should be received at the same timeas assumed in [13]. 
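The combining rule in (9) and the decodability condition for a C-C link can be sketched as follows. The packet error model below is a placeholder (a simple decreasing exponential), not the simulated convolutional-code error curve used in the paper, and the SNR values are illustrative; only the structure follows the text: sum the per-transmitter SNRs at each receiving node, then require the product of the resulting error probabilities over the receiving set to stay below the threshold.

```python
import math

def packet_error(snr_linear):
    """Placeholder packet error model, monotonically decreasing in SNR.
    The paper instead uses an error curve obtained by simulation."""
    return math.exp(-snr_linear)

def combined_snr(bridge_snr, helper_snrs):
    """Maximum ratio combining: the effective SNR at a receiving node is the
    sum of the SNRs of the simultaneously transmitted replicas, cf. (9)."""
    return bridge_snr + sum(helper_snrs)

def cc_link_decodable(received, threshold=0.1):
    """received: list of (bridge_snr, helper_snrs) pairs, one entry per node in
    the receiving set {v_bm} U H_m. The message is considered decodable if the
    product of the per-node error probabilities stays below the threshold."""
    failure = 1.0
    for bridge_snr, helper_snrs in received:
        failure *= packet_error(combined_snr(bridge_snr, helper_snrs))
    return failure <= threshold

# Illustrative numbers only: two receiving nodes, each combining the bridge
# signal with two helper replicas.
links = [(1.5, [0.8, 0.4]), (0.9, [1.1, 0.3])]
print(cc_link_decodable(links))
```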
We note that problems related to timesynchronization were discussed in Section II. Similarly to thecase for N-C links, we say that the message is decodable, withnegligible failure rate, at least at one node in {vbm}∪Hm ifÐvr∈{vbm}∪Hmf (¯ãr) ≤ áô (10)with small enough áô, where f (·) denotes the common packeterror probability function for given received SNR, as defined inSection II. If the inequality (10) holds, we say that there exists aC-C link from Ùl to Ùm. Once the message is decoded at nodesin {vbm}∪Hm, the message is delivered to destination node vdthrough cascaded N-N edges to complete the routing procedure.To maintain the C-C link power efficiently, it is necessaryto choose appropriately the node pair (vbl , vbm), the helper set1864 IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS, VOL. 14, NO. 4, APRIL 2015(Hl ,Hm), and the transmission power from each transmittingnode to minimize the power consumption. However, the computationalcomplexity makes such an optimization algorithmhardly feasible not only in practical systems but also in simulationenvironments [14]. For this reason, it is widely assumedthat nodes participating in transmitter cooperation use the samepower [15], [16]. Consequently, we adopt the same assumptionwhen designing the CCTC scheme.For a more formal description, we consider two non-emptyclusters Ùl and Ùm from a given graph G = (E,V). We definethe concept of a C-C link in the following definition.Definition 2: Let vbl∈ Ùl , vbm∈ Ùm, Hl ⊂ N(vbl|L), andHm ⊂ N(vbm|L). Then, we say that there exists a bi-directionalC-C link, or simply, a C-C link denoted by (vbl ,Hl ; vbm,Hm)CCbetween Ùl and Ùm if and only ifÐvr∈{vbm}∪Hmf⎛⎝ãbl r(Pcl)+ Óvhl∈Hlãhl r(Pcl )⎞⎠ ≤ áô, (11)andÐvr∈{vbl}∪Hlfãbmr(Pcm)+ Óvhm∈Hmãhmr(Pcm)_≤ áô (12)for some Pcl≤ Pmax and Pcm≤ Pmax.Here, Pcl and Pcm denote the common transmission powersof transmitting nodes in Ùl and Ùm, respectively. For a givenC-C link (vbl ,Hl ; vbm,Hm)CC, the nodes vbl and vbm are calledthe bridge nodes and the sets Hl and Hm are called the helpersets between Ùl and Ùm. Such terminology is the same for ofN-C links. However, in the case of C-C links, the nodes in thehelper set participate not only in receiver cooperation but alsoin transmitter cooperation.In Definition 2, we note that the total transmission powerminimally required to make round-trip communication betweenÙl and Ùm is given by (|Hl |+1)Pcl +(|Hm|+1)Pcm using thevalues for Pcl and Pcm that satisfy (11) and (12) with equality.Here, |X| denotes the cardinality of set X. We also note that therequired total transmission power (|Hl |+1)Pcl +(|Hm|+1)Pcmvaries depending on the choice of the C-C link. Consequently, itis natural to choose the C-C link that leads to the smallest totalrequired transmission power. The smallest total required transmissionpower and the corresponding C-C link are referred to asthe minimum C-C round-trip power and minimum power C-Clink between Ùl and Ùm, respectively. We denote the minimumC-C round-trip power between Ùl and Ùm by PCC(Ùl ,Ùm).We now describe how the edge set ECC is constructed inthe CCTC scheme. We note that the procedure for obtainingECC is essentially the same as that for obtaining ENC. Therefore,we describe it with brevity. First, the set LCC of all theminimum power C-C links between clusters is identified. Foreach (vbl ,Hl ; vbm,Hm)CC ∈ LCC, the weight is defined as thecorresponding minimum C-C round-trip power. After computingall the weights of LCC, the sparse edge set ECC is definedas the MSF of GLCC = (V ,LCC). 
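Since E, ENC, and ECC are all obtained as minimum-weight spanning forests over a weighted set of candidate links, the shared construction step can be illustrated with a standard Kruskal-style sketch using a union-find structure. This is one common way to build an MSF, not necessarily the exact procedure used by the authors; the cluster names and weights below are illustrative.

```python
def minimum_spanning_forest(nodes, links):
    """Kruskal's algorithm with union-find.

    nodes: iterable of hashable node (or cluster) identifiers.
    links: iterable of (weight, u, v) tuples, e.g. minimum round-trip powers.
    Returns the forest as a list of (weight, u, v) edges; components that no
    candidate link joins simply remain separate trees (singletons get no edge)."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path compression
            v = parent[v]
        return v

    forest = []
    for weight, u, v in sorted(links):      # consider cheapest links first
        ru, rv = find(u), find(v)
        if ru != rv:                        # adding the edge creates no cycle
            parent[ru] = rv
            forest.append((weight, u, v))
    return forest

# Toy example: three clusters with candidate cooperative links weighted by
# their minimum round-trip power (illustrative values).
clusters = ["C3", "C7", "C8"]
candidate_links = [(2.5, "C3", "C7"), (4.0, "C7", "C8"), (5.5, "C3", "C8")]
print(minimum_spanning_forest(clusters, candidate_links))
```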
After obtaining ECC, thedesired final graph GCC = (V,E,V ,ECC) for CCTC scheme isconstructed.Next, we briefly remark on the additional receiver processingcosts required for the NCTC and CCTC schemes. Comparedto the transmitter cooperative topology control scheme in [16],additional decoding power is required in the NCTC and CCTCschemes because of multiple-node decoding. This additionaldecoding increases not only the power consumption, but alsothe overall system complexity. Furthermore, each receivinghelper node should report the received message decodabilityto the bridge node, which increases system overhead. Thereare some analytical studies on receiving power consumption[32], [33] and overhead [34] because it could be a critical issuein the case of ad-hoc networks. However, we note that thedecoding power consumption and related overhead are heavilydependent on the receiving strategy. For example, one can chosea receiving strategy in which the receiving helper nodes decodethe message in the order of channel conditions until a successfuldecoding node appears. In this case, the average decodingpower consumption and system complexity can be reduced.In addition, the serach for the optimal receiving strategy ishighly non-trivial and requires serious and independent study.However, despite its importance, in this primary effort ontopology control, we do not consider such issues any furtherto keep the problem tractable.Finally, we briefly consider the impact of mobility on the proposedtopology control schemes. Unfortunately, the proposedschemes are basically inapplicable except when the mobilityis very low. When a node moves, three situations can happen.First, in some situations in which only minor movement isinvolved, there may be no changes in the network topologyexcept for the configurations inside the cluster to which themoved node belongs. Second, in other situations, the clusterto which the moved node originally belonged, must be dividedinto more than one cluster. Finally, in still other situations,some clusters could be unified into one cluster by the N-Nlinks newly defined by the node movement. In the first case,the mobility problem is relatively simple. If the moved node isnot a bridge or helper node, the moved node could be simplyattached to the nearby cluster. On the other hand, if the movednode is a bridge or helper node, the bridge and/or helper nodesof the corresponding cooperative link are changed to one of thealternatives among the pre-stored alternative bridge and helpernodes. However, if there is no alternative bridge and/or helpernode or if the second or the third situation occurs, clusters andcooperative edges should be redefined. In addition, if severalnodes move at the same time, the second and third situationsmay happen more frequently and this is why the proposedschemes are applicable only when the mobility is very low.V. PERFORMANCE EVALUATIONAND NUMERICAL RESULTSIn this section, we analyze through simulations the performanceof the two proposed centralized topology controlschemes, namely, the NCTC and CNTC schemes, and comparethem to the NNTC scheme and cooperative topology controlscheme in [16] that is based solely on transmitter cooperation.For convenience, we call the topology control scheme in[16] the cluster-to-node topology control scheme (CNTC). 
ToMOON et al.: RECEIVER COOPERATION IN TOPOLOGY CONTROL FOR WIRELESS AD-HOC NETWORKS 1865TABLE ISIMULATION CONFIGURATION PARAMETERSour best knowledge, the CNTC scheme achieves the highestconnectivity with a power requirement that is onl marginallygreater than other existing topology control schemes. In thissection, we show that the proposed NCTC scheme providesbetter energy efficiency with marginal connectivity loss and theCCTC scheme allows both better energy efficiency and higherconnectivity than the CNTC scheme.A. Simulation ConfigurationThe system performance is evaluated through simulations inthis paper. Although analytic evaluation is generally more desirable,the performance of topology control schemes is very hardto analyze. To the best of our knowledge, only some analyticalresults have been obtained for the case of non-cooperativecommunications among an infinite number of nodes [35], [36]and previous studies [13]–[16] on cooperative topology controlschemes have only been evaluated through numerical simulations.For this reason, we study the performance throughsimulations. However, we provide partial analytical reasoningwhenever possible. Furthermore, to improve the value of theresults, we reflect practical situations as much as possiblein simulation configuration by employing channel parametersbased on actual field measurement [22] and the design parametersin the 3GPP standard [37].To describe the system configuration used for performanceevaluation, we need to specify the values of various parameters,which we divide into two categories: channel parameters andsystem design parameters. The channel parameters includethe reference path loss PLd0 , path loss exponent k, shadowingrandom variable Xó, offset correction factor c, and noisepower spectral density N0,i. First, we assumed that N0,i, i =1, . . . ,n, were identically given as −174 dBm/Hz, the noisepower spectral density at the room temperature. For the otherchannel parameters PLd0 , k, Xó, and c, we consider two setsof values, given in Table I, that represent suburban and urbanscenarios [22].The system design parameters considered in this section arethe number of nodes n, simulation area A, error threshold áô,packet error function f , and maximum transmit power Pmax.Parameters n and A are closely related to the node density,which determines the number of nodes participating in thecooperation. Therefore, we varied n and A to observe how theperformance is influenced by the node density. The choice oferror function f depends on the error correction coding schemeemployed. In this study, we assume that a convolutional codewith a constraint length of two is used as the error correctioncoding scheme with a packet length of 1,024 [38]. Hence, weused the actual packet error rate obtained through extensivesimulations with the aforementioned convolutional code for thepacket error function f . For the choice of áô, we used 10−2,a value often adopted as the target packet error rate in manysituations. Finally, we assumed that the node power Pi is limitedby Pmax = 250 mW, and Pi is uniformly distributed over a10 MHz bandwidth. Detailed values of the above channel andsystem parameters are summarized in Table I.B. ConnectivityTo compare the level of performance achievable with theproposed topology control schemes, we first consider a metriccalled connectivity to measure the average proportion of nodesconnected to a node. 
Before proceeding with the formal definition of the connectivity metric, we observe that the performance of a given topology control scheme depends not only on the values of n and A but also on the distribution of these n nodes over area A. For this reason, we assume that n (≥ 2) nodes are randomly and uniformly distributed over a given area A in the following discussion.

To formally define connectivity, we first denote the set of all nodes connected to node vi by R(vi). We note that the set R(vi) depends on the choice of topology control scheme. For instance, in the NNTC scheme, R(vi) is the set of all nodes connected to vi by an N-N edge. On the other hand, in a cooperative topology control scheme, R(vi) consists of all the nodes that are connected through cascaded N-N and cascaded cooperative edges. Therefore, the connectivity Γ (of a given topology control scheme) is defined as

Γ = (1/n) E[ Σ_{i=1}^{n} |R(vi)| / (n − 1) ],   (13)

where |R(vi)| denotes the cardinality of R(vi). Here, the expectation E[·] is taken because the cardinality |R(vi)| depends on how the nodes are distributed over a given area. We note that |R(vi)|/(n − 1) is the proportion of nodes that are connected to vi, and hence Γ is the expected value of its arithmetic mean. For notational convenience, the connectivities of the CCTC, NCTC, NNTC, and CNTC schemes are denoted by ΓCC, ΓNC, ΓNN, and ΓCN, respectively.

In Fig. 4, the connectivity for various topology control schemes is shown as a function of the number of nodes n for three different areas and two different environments.
[Fig. 4. Connectivity as a function of the number of nodes for various topology control schemes in various communication environments. (a) Urban. (b) Suburban.]
Most importantly, we observe that ΓCC ≥ ΓCN ≥ ΓNC ≥ ΓNN for all values of n and A and for any environment considered. We clearly see that either transmitter or receiver cooperation improves connectivity. The fact that the CCTC scheme achieves the highest connectivity is hardly surprising; hence, what we actually need to observe is how the NCTC and CNTC schemes perform in comparison to it. In particular, since ΓNC ≤ ΓCN, we conclude that transmitter cooperation is more effective than receiver cooperation at achieving connectivity.

C. Power Consumption

So far, we have observed that the CCTC scheme achieves the highest connectivity and that the connectivity gap between the CNTC and CCTC schemes is not large. In fact, it is not more than 8% in most cases. Consequently, it is possible to say that the CNTC scheme is a good alternative to the CCTC scheme if we consider connectivity only. However, the CNTC scheme is not as efficient as the CCTC scheme in terms of power consumption. Before proceeding with the analysis of power consumption, we define ÊCC to be the set of cluster pairs corresponding to the edges in ECC. In other words, (Ωl, Ωm) ∈ ÊCC if and only if the edge set ECC contains the C-C edge between Ωl and Ωm.
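Returning briefly to the connectivity metric of (13), the per-placement quantity can be sketched as follows: the reachable sets R(vi) are the connected components induced by the N-N and cooperative edges (all treated here as bidirectional), and averaging the resulting value over many random node placements approximates the expectation. The edge list is supplied abstractly; producing it from a particular topology control scheme is outside this sketch.

```python
from collections import defaultdict

def reachable_sets(nodes, edges):
    """Connected components over the given edge list (N-N plus cooperative
    edges, all treated as bidirectional). R(v) is the component of v,
    excluding v itself."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(adj[v] - comp)
        seen |= comp
        components.append(comp)
    return {v: comp - {v} for comp in components for v in comp}

def connectivity(nodes, edges):
    """Single-placement value of (1/n) * sum_i |R(v_i)|/(n-1); averaging this
    over many random placements approximates the expectation in (13)."""
    n = len(nodes)
    reach = reachable_sets(nodes, edges)
    return sum(len(reach[v]) for v in nodes) / (n * (n - 1))

nodes = [1, 2, 3, 4, 5]
edges = [(1, 2), (2, 3)]           # {1, 2, 3} connected, 4 and 5 isolated
print(connectivity(nodes, edges))  # 6/20 = 0.3
```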
In a similar way, we denote the sets of thecluster pairs corresponding to edges in ENC and ECN by ˆ ENCand ˆ ECN, respectively.To quantitatively compare the power consumption of theCCTC and CNTC schemes, we now consider the following twoquantities¯PCC =1nE⎡⎣ Óð∈ˆECC∩ˆECNPCC(ð)⎤⎦ (14)and¯PCN =1nE⎡⎣ Óð∈ˆECC∩ˆECNPCN(ð)⎤⎦, (15)where PCN(ð) denotes the minimum C-N round-trip powerbetween the pair ð of clusters, similarly to PCC(ð) and PNC(ð)as defined in Section IV.We note that these quantities representthe average power required per each node to establish cooperativeedges between clusters in ˆ ECC ∩ ˆ ECN. Consequently, bycomparing ¯PCC and ¯PCN, we intend to compare the power requiredfor the CCTC and CNTC schemes to establish commoncooperative edges.Before proceeding with the evaluation of ¯PCC and ¯PCN, wefirst note that the two sets ˆ ECC− ˆ ECN and ˆ ECN − ˆ ECC of clusterpairs are not necessarily empty. Because the CCTC schemeemploys receiver cooperation in addition to transmitter cooperation,it appears reasonable to expect ˆ ECC − ˆ ECN to containsome sizable number of cooperative edges and ˆ ENC − ˆ ECC tobe empty. In fact, the average number of elements in ˆ ECC −ˆ ECN reaches as much as 25% of that of ˆ ECC ∩ˆ ECN in manysituations. However, interestingly, ˆ ECN − ˆ ECC is not necessarilyempty. This is because of the employment of MSF algorithm,that removes some redundant links. In other words, in CCTCschemes, some links used in the CNTC scheme are eliminatedby applying the MSF algorithms in some rare situations. Fromour numerical analysis, we found that the average cardinalityof ˆ ECN − ˆ ECC sometimes reaches as much as 8% of that ofˆ ECC∩ ˆ ECN. However, in most cases, the set ˆ ECN− ˆ ECC is emptyand hence ˆ ECC ∩ ˆ ECN is the same as ˆ ECN.Fig. 5(a) illustrates how the values of ¯PCC and ¯PCN changeas a function of the number of nodes n. We note that ¯PCCfirst increases as n increases and then decreases after n reachesa certain value. A similar tendency can be found in ¯PCN. Toexplain this non-monotonic performance of ¯PCC and ¯PCN, wedefine two quantitiesFCC =E_Óð∈ˆECC∩ˆECNPCC(ð)_E_|ˆ ECC ∩ ˆ ECN|_ (16)andFCN =E_Óð∈ˆECC∩ˆECNPCN(ð)_E_|ˆ ECC ∩ ˆ ECN|_ , (17)MOON et al.: RECEIVER COOPERATION IN TOPOLOGY CONTROL FOR WIRELESS AD-HOC NETWORKS 1867Fig. 5. The average additional power required per each node to establish cooperative edges in CCTC and CNTC schemes. (a) ¯PCN and ¯PCC. (b) ¯PCN over ¯PCC.to describe the average power consumed to establish a C-C linkand a C-N link, respectively. As a result, ¯PCC and ¯PCN can berewritten as¯PCC =1n·FCC ·N (18)and¯PCN =1n·FCN ·N , (19)where N = E[|ˆ ECC ∩ ˆ ECN|].While we cannot provide fully analytical behaviors of thequantities ¯PCC and ¯PCN, which is very difficult, it will be meaningfulto consider their qualitative behaviors. First, we note thatthe quantities FCC and FCN are mainly affected by the distancebetween clusters. It is natural to expect that the average clusterto-cluster distance will decrease with an increased number ofnodes n. However, the average cluster-to-cluster distance decreasesas a very slowly varying function of n, particularly aftern reaches a certain critical value. This is because two clustersare merged into one if the distance between them becomes tooclose. As a consequence, FCC and FCN decrease very slowly asn increases. For example, the minimum observed value of FCCwas only about 25% lower than the maximum observed value inthe simulation performed for an urban 2 × 2 km situation wheren ranged from 10 to 100. 
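A minimal sketch of how (14)–(19) could be estimated from simulation output follows, assuming each trial yields, for every cooperative cluster pair, the minimum round-trip power under the CCTC and CNTC schemes. The powers below are made-up illustrative values; only the averaging over the common edge set and the division by n follow the definitions above. Dividing the total by the number of common edges instead recovers the per-edge averages FCC and FCN used in the decomposition of (18) and (19).

```python
def average_common_edge_power(trials, n):
    """trials: list of (p_cc, p_cn) pairs, where each element is a dict mapping
    a cluster pair to its minimum round-trip power under the CCTC and CNTC
    schemes respectively. Returns estimates of (P_CC, P_CN) as in (14)-(15),
    i.e. the per-node power averaged over the common cooperative edges."""
    sum_cc = sum_cn = 0.0
    for p_cc, p_cn in trials:
        common = p_cc.keys() & p_cn.keys()   # cluster pairs present in both edge sets
        sum_cc += sum(p_cc[pair] for pair in common)
        sum_cn += sum(p_cn[pair] for pair in common)
    t = len(trials)
    return sum_cc / (n * t), sum_cn / (n * t)

# Two illustrative trials with made-up powers (mW) for a few cluster pairs.
trials = [
    ({("A", "B"): 30.0, ("B", "C"): 45.0}, {("A", "B"): 55.0, ("B", "C"): 80.0}),
    ({("A", "B"): 25.0},                   {("A", "B"): 60.0, ("C", "D"): 40.0}),
]
print(average_common_edge_power(trials, n=20))
```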
Because the quantities FCC and FCNare relatively unaffected by the variation of n, the behaviors of¯PCC and ¯PCN can possibly be accounted for by the behaviors ofthe average number of elementsN in ˆ ECC∩ ˆ ECN, which, in fact,varies very significantly as n varies. Let us observe, when thenode density is sufficiently low, that N increases as n increases,since increased n results in an increased number of clustersand then in an increased number of edges. However, when thenode density is high enough, adding nodes no longer makes thenumber of clusters larger because the addition of nodes nowresults in cluster unification. For this reason, N first increasesup to a certain critical value of n and then decreases againas n grows further. However, it is very difficult to predict thebehavior of N in a fully analytical manner, since N depends ontoo many factors such as node distribution, channel and fadingmodels, error probability function, and so on. As far as weknow, only a few analytical results [35], [36] have been derivedfor non-cooperative communications with an infinite number ofnodes and none for general cases or cooperative environments.We now discuss the simulation results of comparing ¯PCC and¯PCN. Because FCC and FCN vary slowly as functions of n, thevariations of ¯PCC and ¯PCN are dominantly determined by 1/nand N . When n = 10, N is almost zero since a very smallnumber of clusters exist and they are located too far away.As n increases up to a certain value, the number of clustersincreases so that the chance of cooperative communication alsoincreases. In this region, N grows faster than n, therefore, ¯PCCbecomes larger. On the other hand, if n exceeds a certain value,the number of clusters decreases, and eventually, it goes to one.Therefore, N quickly converges to zero with growing n, andthis is why ¯PCC decreases. In Fig. 5(a), we next observe that ¯PCCis always smaller than ¯PCN. To quantify the difference betweenthe two values, we illustrate the values of ¯PCN/¯PCC in Fig. 5(b),where we clearly see that ¯PCN is about 10–100% larger than¯PCC. From this figure, we clearly see that the CCTC schemerequires significantly less power than the CNTC scheme toestablish the same cooperative edges.Here, the question arises as to how the NCTC schemecompares to the CCTC scheme in terms of power consumption.First, we can compare the amount of power required for theCCTC and NCTC schemes to establish common cooperativeedges. In a similar comparison in Fig. 5, we noted that ¯PCCis significantly smaller than ¯PCN. However, in the case of theCCTC and NCTC schemes, there is virtually no differencebetween the powers required to establish common cooperativeedges. This is related to the assumption that the nodesparticipating in the cooperative transmission use the same1868 IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS, VOL. 14, NO. 4, APRIL 2015Fig. 6. The relative amount of power required to establish one more additional cooperative edge with the CCTC scheme in comparison with the NCTC andCNTC schemes. (a) Urban. (b) Suburban.transmission power as in CCTC scheme. Because of this constrainton the transmission power, only one node is selected,even in the CCTC scheme, to transmit signals almost alwayswhenever the cooperative edge is contained in both ˆ ECC andˆ ENC. 
Therefore, it can be said that the NCTC scheme is almostas efficient as the CCTC scheme in terms of power consumption.Consequently, if the connectivity is of less priority thanthe power consumption or if the situation is such that theconnectivities of CCTC and NCTC are almost the same valuesbecause of a very high node density, the NCTC scheme can beconsidered to be a good alternative to the CCTC scheme. This isparticularly so because the average power required to establisha cooperative edge in ˆ ECC− ˆ ENC is significantly larger, in manycases, than the power required to establish cooperative edgein ˆ ENC.To illustrate this, we consider the metric ñCCNC defined asñCCNC =DCCNCKNC(20)in whichDCCNC =E_Óð∈ˆECC−ˆENCPCC(ð)_E_|ˆ ECC − ˆ ENC|_ (21)andKNC =E_Óð∈ˆENCPNC(ð)_E_|ˆ ENC|_ . (22)We note that DCCNC denotes the power required to establish oneC-C link that can not be established in NCTC scheme andthat KCCNC is the power consumption required for one N-C link.Consequently, the metric ñCCNC measures the relative amount ofpower required to establish one more additional cooperativeedge using the CCTC scheme in comparison to the NCTCscheme. In a similar manner, we define the metric ñCCCN byñCCCN =E_Óð∈ˆECC−ˆECNPCC(ð)_E_|ˆ ECC − ˆ ECN|_ ÷E_Óð∈ˆECNPCN(ð)_E_|ˆ ECN|_ (23)=DCCCNKCN(24)to quantify the relative amount of power required to establishone more additional cooperative edge using the CCTC schemein comparison to the CNTC scheme.In Fig. 6, we plot ñCCNC and ñCCCN as functions of n. Here,we first observe that the numerical values of ñCCNC and ñCCCNare around 3 and 1.2, respectively, for all cases considered.We note that, as mentioned in the explanation of Fig. 5, thepower consumed to establish a single cooperative link decreaseswith growing n so that DCCNC, DCCCN, KNC, and KCN are alldecreasing functions of n. In addition, we note that the powerrequired to establish a cooperative link is mainly affected bythe number of transmitting nodes and the transmitting powerof each node. We also note that the cooperative link betweentwo clusters is established by only a small number of nodeslocated near the boundary of each cluster, even when the clustersize is very large. This means that the number of transmittingnodes is almost constant, regardless of n. Therefore, the rateof decreasing power consumption is primarily affected by thetransmitting power of each node, which is closely related to thedistance between clusters. Because the configuration of clustersis identically given by the NNTC scheme, as n increases, thedecreasing rate of the power required to establish cooperativelinks is relatively similar for all three cooperative schemes,namely, the NCTC, CNTC, and CCTC schemes. For thisMOON et al.: RECEIVER COOPERATION IN TOPOLOGY CONTROL FOR WIRELESS AD-HOC NETWORKS 1869reason, the ratios DCCNC/KNC and DCCCN/KCN remain roughly thesame regardless of the value of n.We next observe that the values of ñCCNC, plotted by solid purplelines, are always around three. This means that to establishan edge that cannot be established in the NCTC scheme, theCCTC scheme requires about three times the power requiredto establish an edge in the NCTC scheme, regardless of thescenario and node density considered. Combining this resultwith the connectivity result in Fig. 4, we gain an importantinsight into the system design. 
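The ratio metrics of (20)–(24) admit a similarly small sketch. One simplification should be noted: the code computes a per-trial ratio, whereas the definitions average the numerator and denominator over many placements before dividing; the powers are illustrative.

```python
def relative_extra_edge_cost(p_cc, p_base):
    """p_cc: dict {cluster pair: minimum C-C round-trip power} for the CCTC scheme.
    p_base: the same for the baseline scheme (NCTC or CNTC).
    Returns D / K, with D the average power of edges present only in the CCTC
    edge set and K the average power of a baseline edge, cf. (20)-(24)."""
    extra = [p_cc[pair] for pair in p_cc.keys() - p_base.keys()]
    if not extra or not p_base:
        return float("nan")   # no extra edges (or empty baseline) in this trial
    d = sum(extra) / len(extra)
    k = sum(p_base.values()) / len(p_base)
    return d / k

p_cc = {("A", "B"): 90.0, ("B", "C"): 120.0, ("C", "D"): 150.0}
p_nc = {("A", "B"): 40.0, ("B", "C"): 50.0}
print(relative_extra_edge_cost(p_cc, p_nc))   # 150 / 45 ≈ 3.3
```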
When n = 50, the connectivity of the CCTC scheme is almost twice that of the NCTC scheme. Therefore, a three-fold increase in power consumption could be a reasonable choice if connectivity is of the highest priority. However, when n = 100, by employing the CCTC scheme one would achieve only a 0.13% increase in connectivity, while three times more power would still be required. Therefore, some system designers may prefer the NCTC scheme to the CCTC scheme, for instance, where power efficiency is of the highest priority or a connectivity increase is not an issue. In contrast, the ratio ρ^CC_CN, plotted by the dotted green line, is about 1.2 in all cases. This means that only 20% more power is required to add a new cooperative edge using the CCTC scheme that cannot be established in the CNTC scheme. Consequently, one can replace the CNTC scheme with the CCTC scheme without a serious power consumption burden, regardless of node density.

VI. CONCLUSION

In this paper, we proposed to employ receiver cooperation in topology control to improve energy efficiency as well as network connectivity. In particular, we proposed two centralized topology control schemes, one based solely on receiver cooperation and the other based on both transmitter and receiver cooperation. For comparison, we also considered a topology control scheme that is based solely on transmitter cooperation. By extensive simulation, we showed that both connectivity and energy efficiency can be improved if receiver cooperation is employed in addition to transmitter cooperation. Consequently, it is generally more desirable to employ both receiver and transmitter cooperation than transmitter cooperation only. We also showed that the increase in network connectivity obtained by employing transmitter cooperation in addition to receiver cooperation comes at the expense of significantly increased energy consumption. For this reason, we conclude that a system based only on receiver cooperation could prove to be a good alternative to one based on both receiver and transmitter cooperation, if energy efficiency is of the highest priority or the increase in connectivity is no longer of serious concern.
Real-Time Detection of Traffic From Twitter Stream Analysis
Abstract—Social networks have been recently employed as asource of information for event detection, with particular referenceto road traffic congestion and car accidents. In this paper, wepresent a real-time monitoring system for traffic event detectionfrom Twitter stream analysis. The system fetches tweets fromTwitter according to several search criteria; processes tweets, byapplying text mining techniques; and finally performs the classificationof tweets. The aim is to assign the appropriate class label toeach tweet, as related to a traffic event or not. The traffic detectionsystem was employed for real-time monitoring of several areas ofthe Italian road network, allowing for detection of traffic eventsalmost in real time, often before online traffic news web sites. Weemployed the support vector machine as a classification model,and we achieved an accuracy value of 95.75% by solving a binaryclassification problem (traffic versus nontraffic tweets). We werealso able to discriminate if traffic is caused by an external event ornot, by solving a multiclass classification problem and obtainingan accuracy value of 88.89%.Index Terms—Traffic event detection, tweet classification, textmining, social sensing.I. INTRODUCTIONSOCIAL network sites, also called micro-blogging services(e.g., Twitter, Facebook, Google+), have spread in recentyears, becoming a new kind of real-time information channel.Their popularity stems from the characteristics of portabilitythanks to several social networks applications for smartphonesand tablets, easiness of use, and real-time nature [1], [2]. Peopleintensely use social networks to report (personal or public) reallifeevents happening around them or simply to express theiropinion on a given topic, through a public message. Socialnetworks allow people to create an identity and let them shareit in order to build a community. The resulting social networkis then a basis for maintaining social relationships, findingManuscript received July 2, 2014; revised October 7, 2014 and December 16,2014; accepted February 10, 2015. Date of publication March 10, 2015; date ofcurrent version July 31, 2015. This work was carried out in the frameworkof and was supported by the SMARTY project, funded by “ProgrammaOperativo Regionale (POR) 2007–2013”—objective “Competitività regionalee occupazione” of the Tuscany Region. The Associate Editor for this paper wasQ. Zhang.E. D’Andrea is with the Research Center “E. Piaggio,” University of Pisa,56122 Pisa, Italy (e-mail: eleonora.dandrea@for.unipi.it).P. Ducange is with the Faculty of Engineering, eCampus University, 22060Novedrate, Italy (e-mail: pietro.ducange@uniecampus.it).B. Lazzerini and F. Marcelloni are with the Dipartimento di Ingegneriadell’Informazione, University of Pisa, 56122 Pisa, Italy (e-mail: b.lazzerini@iet.unipi.it; f.marcelloni@iet.unipi.it).Color versions of one or more of the figures in this paper are available onlineat http://ieeexplore.ieee.org.Digital Object Identifier 10.1109/TITS.2015.2404431users with similar interests, and locating content and knowledgeentered by other users [3].The user message shared in social networks is called StatusUpdate Message (SUM), and it may contain, apart from thetext, meta-information such as timestamp, geographic coordinates(latitude and longitude), name of the user, links to otherresources, hashtags, and mentions. 
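A minimal container for an SUM and the meta-information listed above might look as follows; the field names are illustrative and do not correspond to the schema of any particular API or wrapper.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class StatusUpdateMessage:
    """A status update message (SUM) together with the meta-information
    typically attached to the free text."""
    text: str
    user: str
    timestamp: str                                      # e.g. an ISO-8601 string
    coordinates: Optional[Tuple[float, float]] = None   # (latitude, longitude)
    hashtags: List[str] = field(default_factory=list)
    mentions: List[str] = field(default_factory=list)
    links: List[str] = field(default_factory=list)

sum_example = StatusUpdateMessage(
    text="Long queue on the ring road this morning #traffic",
    user="commuter42",
    timestamp="2015-02-10T08:15:00Z",
    coordinates=(43.7, 10.4),
    hashtags=["traffic"],
)
print(sum_example.user, sum_example.hashtags)
```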
Several SUMs referring toa certain topic or related to a limited geographic area may provide,if correctly analyzed, great deal of valuable informationabout an event or a topic. In fact, we may regard social networkusers as social sensors [4], [5], and SUMs as sensor information[6], as it happens with traditional sensors.Recently, social networks and media platforms have beenwidely used as a source of information for the detection ofevents, such as traffic congestion, incidents, natural disasters(earthquakes, storms, fires, etc.), or other events. An eventcan be defined as a real-world occurrence that happens in aspecific time and space [1], [7]. In particular, regarding trafficrelatedevents, people often share by means of an SUM informationabout the current traffic situation around them whiledriving. For this reason, event detection from social networksis also often employed with Intelligent Transportation Systems(ITSs). An ITS is an infrastructure which, by integrating ICTs(Information and Communication Technologies) with transportnetworks, vehicles and users, allows improving safety and managementof transport networks. ITSs provide, e.g., real-timeinformation about weather, traffic congestion or regulation, orplan efficient (e.g., shortest, fast driving, least polluting) routes[4], [6], [8]–[14].However, event detection from social networks analysis isa more challenging problem than event detection from traditionalmedia like blogs, emails, etc., where texts are wellformatted[2]. In fact, SUMs are unstructured and irregulartexts, they contain informal or abbreviated words, misspellingsor grammatical errors [1]. Due to their nature, they are usuallyvery brief, thus becoming an incomplete source of information[2]. Furthermore, SUMs contain a huge amount of not usefulor meaningless information [15], which has to be filtered.According to Pear Analytics,1 it has been estimated that over40% of all Twitter2 SUMs (i.e., tweets) is pointless with nouseful information for the audience, as they refer to the personalsphere [16]. For all of these reasons, in order to analyze theinformation coming from social networks, we exploit text miningtechniques [17], which employ methods from the fields of1http://www.pearanalytics.com/, 2009.2https://twitter.com.1524-9050 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.2270 IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 16, NO. 4, AUGUST 2015data mining, machine learning, statistics, and Natural LanguageProcessing (NLP) to extract meaningful information [18].More in detail, text mining refers to the process of automaticextraction of meaningful information and knowledge from unstructuredtext. The main difficulty encountered in dealing withproblems of text mining is caused by the vagueness of naturallanguage. In fact, people, unlike computers, are perfectly able tounderstand idioms, grammatical variations, slang expressions,or to contextualize a given word. On the contrary, computershave the ability, lacking in humans, to quickly process largeamounts of information [19], [20].The text mining process is summarized in the following.First, the information content of the document is convertedinto a structured form (vector space representation). 
In fact,most of text mining techniques are based on the idea that adocument can be faithfully represented by the set of wordscontained in it (bag-of-words representation [21]). Accordingto this representation, each document j of a collection ofdocuments is represented as an M-dimensional vector Vj ={w(tj1), . . . , w(tji), . . . , w(tjM)}, where M is the number ofwords defined in the document collection, and w(tji) specifiesthe weight of the word ti in document j. The simplest weightingmethod assigns a binary value to w(tji), thus indicating theabsence or the presence of the word ti, while other methodsassign a real value to w(tji). During the text mining process,several operations can be performed [21], depending on the specificgoal, such as: i) linguistic analysis through the applicationof NLP techniques, indexing and statistical techniques, ii) textfiltering by means of specific keywords, iii) feature extraction,i.e., conversion of textual features (e.g., words) in numericfeatures (e.g., weights), that a machine learning algorithm isable to process, and iv) feature selection, i.e., reduction of thenumber of features in order to take into account only the mostrelevant ones. The feature selection is particularly important,since one of the main problems in text mining is the highdimensionality of the feature space _M. Then, data miningand machine learning algorithms (i.e., support vector machines(SVMs), decision trees, neural networks, etc.) are applied tothe documents in the vector space representation, to build classification,clustering or regression models. Finally, the resultsobtained by the model are interpreted by means of measuresof effectiveness (e.g., statistical-based measures) to verify theaccuracy achieved. Additionally, the obtained results may beimproved, e.g., by modifying the values of the parameters usedand repeating the whole process.Among social networks platforms, we took into accountTwitter, as the majority of works in the literature regardingevent detection focus on it. Twitter is nowadays the mostpopular micro-blogging service; it counts more than 600 millionactive users,3 sharing more than 400 million SUMs perday [1]. Regarding the aim of this paper, Twitter has severaladvantages over the similar micro-blogging services. First,tweets are up to 140 characters, enhancing the real-time andnews-oriented nature of the platform. In fact, the life-time oftweets is usually very short, thus Twitter is the social network3http://www.statisticbrain.com/twitter-statisticsplatform that is best suited to study SUMs related to real-timeevents [22]. Second, each tweet can be directly associated withmeta-information that constitutes additional information. Third,Twitter messages are public, i.e., they are directly available withno privacy limitations. For all of these reasons, Twitter is a goodsource of information for real-time event detection and analysis.In this paper, we propose an intelligent system, based on textmining and machine learning algorithms, for real-time detectionof traffic events from Twitter stream analysis. The system,after a feasibility study, has been designed and developed fromthe ground as an event-driven infrastructure, built on a ServiceOriented Architecture (SOA) [23]. The system exploits availabletechnologies based on state-of-the-art techniques for textanalysis and pattern classification. These technologies and techniqueshave been analyzed, tuned, adapted, and integrated inorder to build the intelligent system. 
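The bag-of-words vector space representation described earlier in this section can be sketched in a few lines, here with the simplest binary weighting (1 if the word occurs in the document, 0 otherwise); real-valued weights such as the IDF values used later by the system slot into the same position.

```python
def vocabulary(documents):
    """Collect the M distinct words of the document collection, in a fixed order."""
    return sorted({word for doc in documents for word in doc.split()})

def bag_of_words_vector(document, vocab):
    """Binary bag-of-words representation: component f is 1 if word f occurs in
    the document and 0 otherwise (a real-valued weight such as IDF can replace
    the 1 without changing the structure)."""
    words = set(document.split())
    return [1 if term in words else 0 for term in vocab]

docs = ["heavy traffic on the bridge", "concert downtown tonight"]
vocab = vocabulary(docs)
print(vocab)
print(bag_of_words_vector(docs[0], vocab))
```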
In particular, we present anexperimental study, which has been performed for determiningthe most effective among different state-of-the-art approachesfor text classification. The chosen approach was integrated intothe final system and used for the on-the-field real-time detectionof traffic events.The paper has the following structure. Section II summarizesrelated work about event detection from social Twitter streamanalysis. Section III outlines the architecture of the proposedsystem for traffic detection, by describing the methodologyused to collect, elaborate, and classify SUMs, with particularreference to SUMs extracted from the Twitter stream.Section IV describes the setup of the system. Section V presentsthe results achieved with different classification models andprovides a comparison with similar works in the literature.Section VI presents the real-world monitoring application forreal-time detection of traffic events. Finally, Section VII providesconcluding remarks.II. RELATED WORKWith reference to current approaches for using social mediato extract useful information for event detection, we need todistinguish between small-scale events and large-scale events.Small-scale events (e.g., traffic, car crashes, fires, or localmanifestations) usually have a small number of SUMs relatedto them, belong to a precise geographic location, and areconcentrated in a small time interval. On the other hand, largescaleevents (e.g., earthquakes, tornados, or the election of apresident) are characterized by a huge number of SUMs, and bya wider temporal and geographic coverage [24]. Consequently,due to the smaller number of SUMs related to small-scaleevents, small-scale event detection is a non-trivial task. Severalworks in the literature deal with event detection from socialnetworks. Many works deal with large-scale event detection [6],[25]–[28] and only a few works focus on small-scale events [9],[12], [24], [29]–[31].Regarding large-scale event detection, Sakaki et al. [6] useTwitter streams to detect earthquakes and typhoons, by monitoringspecial trigger-keywords, and by applying an SVM as abinary classifier of positive events (earthquakes and typhoons)and negative events (non-events or other events). In [25],the authors present a method for detecting real-world events,D’ANDREA et al.: REAL-TIME DETECTION OF TRAFFIC FROM TWITTER STREAM ANALYSIS 2271such as natural disasters, by analyzing Twitter streams andby employing both NLP and term-frequency-based techniques.Chew et al. [26] analyze the content of tweets shared during theH1N1 (i.e., swine flu) outbreak, containing keywords and hashtagsrelated to the H1N1 event to determine the kind of informationexchanged by social media users. De Longueville et al.[27] analyze geo-tagged tweets to detect forest fire events andoutline the affected area.Regarding small-scale event detection, Agarwal et al. [29]focus on the detection of fires in a factory from Twitter streamanalysis, by using standard NLP techniques and a Naive Bayes(NB) classifier. In [30], information extracted from Twitterstreams is merged with information from emergency networksto detect and analyze small-scale incidents, such as fires.Wanichayapong et al. [12] extract, using NLP techniques andsyntactic analysis, traffic information from microblogs to detectand classify tweets containing place mentions and trafficinformation. Li et al. [31] propose a system, called TEDAS, toretrieve incident-related tweets. 
The system focuses on Crimeand Disaster-related Events (CDE) such as shootings, thunderstorms,and car accidents, and aims to classify tweets asCDE events by exploiting a filtering based on keywords, spatialand temporal information, number of followers of the user,number of retweets, hashtags, links, and mentions. Sakaki et al.[9] extract, based on keywords, real-time driving informationby analyzing Twitter’s SUMs, and use an SVM classifierto filter “noisy” tweets not related to road traffic events.Schulz et al. [24] detect small-scale car incidents from Twitterstream analysis, by employing semantic web technologies,along with NLP and machine learning techniques. They performthe experiments using SVM, NB, and RIPPER classifiers.In this paper, we focus on a particular small-scale event, i.e.,road traffic, and we aim to detect and analyze traffic eventsby processing users’ SUMs belonging to a certain area andwritten in the Italian language. To this aim, we propose a systemable to fetch, elaborate, and classify SUMs as related to a roadtraffic event or not. To the best of our knowledge, few papershave been proposed for traffic detection using Twitter streamanalysis. However, with respect to our work, all of them focuson languages different from Italian, employ different inputfeatures and/or feature selection algorithms, and consider onlybinary classifications. In addition, a few works employ machinelearning algorithms [9], [24], while the others rely on NLPtechniques only. The proposed system may approach both binaryand multi-class classification problems. As regards binaryclassification, we consider traffic-related tweets, and tweets notrelated with traffic. As regards multi-class classification, wesplit the traffic-related class into two classes, namely trafficcongestion or crash, and traffic due to external event. In thispaper, with external event we refer to a scheduled event (e.g.,a football match, a concert), or to an unexpected event (e.g.,a flash-mob, a political demonstration, a fire). In this way weaim to support traffic and city administrations for managingscheduled or unexpected events in the city.Moreover, the proposed system could work together withother traffic sensors (e.g., loop detectors, cameras, infraredcameras) and ITS monitoring systems for the detection of trafficdifficulties, providing a low-cost wide coverage of the roadFig. 1. System architecture for traffic detection from Twitter stream analysis.network, especially in those areas (e.g., urban and suburban)where traditional traffic sensors are missing.Concluding, the proposed ITS is characterized by the followingstrengths with respect to the current research aimed atdetecting traffic events from social networks: i) it performs amulti-class classification, which recognizes non-traffic, trafficdue to congestion or crash, and traffic due to external events;ii) it detects the traffic events in real-time; and iii) it is developedas an event-driven infrastructure, built on an SOA architecture.As regards the first strength, the proposed ITS could be a valuabletool for traffic and city administrations to regulate trafficand vehicular mobility, and to improve the management ofscheduled or unexpected events. For what concerns the secondstrength, the real-time detection capability allows obtaining reliableinformation about traffic events in a very short time, oftenbefore online news web sites and local newspapers. 
As far as thethird strength is concerned, with the chosen architecture, we areable to directly notify the traffic event occurrence to the driversregistered to the system, without the need for them to access officialnews websites or radio traffic news channels, to get trafficinformation. In addition, the SOA architecture permits to exploittwo important peculiarities, i.e., scalability of the service(e.g., by using a dedicated server for each geographic area), andeasy integration with other services (e.g., other ITS services).III. ARCHITECTURE OF THE TRAFFIC DETECTION SYSTEMIn this section, our traffic detection system based onTwitter streams analysis is presented. The system architectureis service-oriented and event-driven, and is composed of threemain modules, namely: i) “Fetch of SUMs and Pre-processing”,ii) “Elaboration of SUMs”, iii) “Classification of SUMs”. Thepurpose of the proposed system is to fetch SUMs from Twitter,to process SUMs by applying a few text mining steps, andto assign the appropriate class label to each SUM. Finally, asshown in Fig. 1, by analyzing the classified SUMs, the systemis able to notify the presence of a traffic event.The main tools we have exploited for developing the systemare: 1) Twitter’s API,4 which provides direct access to the4http://dev.twitter.com2272 IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 16, NO. 4, AUGUST 2015public stream of tweets; 2) Twitter4J,5 a Java library that weused as a wrapper for Twitter’s API; 3) the Java API providedbyWeka (Waikato Environment for Knowledge Analysis) [32],which we mainly employed for data pre-processing and textmining elaboration.We recall that both the “Elaboration of SUMs” and the“Classification of SUMs” modules require setting the optimalvalues of a few specific parameters, by means of a supervisedlearning stage. To this aim, we exploited a training setcomposed by a set of SUMs previously collected, elaborated,and manually labeled. Section IV describes in greater detailhow the specific parameters of each module are set during thesupervised learning stage.In the following, we discuss in depth the elaboration madeon the SUMs by each module of the traffic detection system.A. Fetch of SUMs and Pre-ProcessingThe first module, “Fetch of SUMs and Pre-processing”,extracts raw tweets from the Twitter stream, based on one ormore search criteria (e.g., geographic coordinates, keywordsappearing in the text of the tweet). Each fetched raw tweet contains:the user id, the timestamp, the geographic coordinates,a retweet flag, and the text of the tweet. The text may containadditional information, such as hashtags, links, mentions, andspecial characters. In this paper, we took only Italian languagetweets into account. However, the system can be easily adaptedto cope with different languages.After the SUMs have been fetched according to the specificsearch criteria, SUMs are pre-processed. In order to extract onlythe text of each raw tweet and remove all meta-informationassociated with it, a Regular Expression filter [33] is applied.More in detail, the meta-information discarded are: user id,timestamp, geographic coordinates, hashtags, links, mentions,and special characters. Finally, a case-folding operation isapplied to the texts, in order to convert all characters to lowercase. At the end of this elaboration, each fetched SUM appearsas a string, i.e., a sequence of characters. We denote the jthSUM pre-processed by the first module as SUMj , with j =1, . . . , N, where N is the total number of fetched SUMs.B. 
Elaboration of SUMsThe second processing module, “Elaboration of SUMs”, isdevoted to transforming the set of pre-processed SUMs, i.e., aset of strings, in a set of numeric vectors to be elaborated bythe “Classification of SUMs” module. To this aim, some textmining techniques are applied in sequence to the pre-processedSUMs. In the following, the text mining steps performed in thismodule are described in detail:a) tokenization is typically the first step of the text miningprocess, and consists in transforming a stream of charactersinto a stream of processing units called tokens (e.g.,syllables, words, or phrases). During this step, other operationsare usually performed, such as removal of punctua-5http://twitter4j.orgtion and other non-text characters [18], and normalizationof symbols (e.g., accents, apostrophes, hyphens, tabs andspaces). In the proposed system, the tokenizer removesall punctuation marks and splits each SUM into tokenscorresponding to words (bag-of-words representation). Atthe end of this step, each SUMj is represented as thesequence of words contained in it. We denote the jthtokenized SUM as SUMTj =_tTj1, . . . , tTjh, . . . , tTjHj_,where tTjh is the hth token and Hj is the total numberof tokens in SUMTj ;b) stop-word filtering consists in eliminating stop-words,i.e., words which provide little or no information to thetext analysis. Common stop-words are articles, conjunctions,prepositions, pronouns, etc. Other stop-words arethose having no statistical significance, that is, those thattypically appear very often in sentences of the consideredlanguage (language-specific stop-words), or in the set oftexts being analyzed (domain-specific stop-words), andcan therefore be considered as noise [34]. The authorsin [35] have shown that the 10 most frequent wordsin texts and documents of the English language areabout the 20–30% of the tokens in a given document.In the proposed system, the stop-word list for the Italianlanguage was freely downloaded from the SnowballTartarus website6 and extended with other ad hoc definedstop-words. At the end of this step, each SUMis thus reduced to a sequence of relevant tokens. Wedenote the jth stop-word filtered SUM as SUMSW_ j =tSWj1 , . . . , tSWjk , . . . , tSWjKj_, where tSWjk is the kth relevanttoken and Kj , with Kj ≤ Hj , is the total numberof relevant tokens in SUMSWj . We recall that a relevanttoken is a token that does not belong to the set of stopwords;c) stemming is the process of reducing each word (i.e.,token) to its stem or root form, by removing its suffix. Thepurpose of this step is to group words with the same themehaving closely related semantics. In the proposed system,the stemmer exploits the Snowball Tartarus Stemmer7 forthe Italian language, based on the Porter’s algorithm [36].Hence, at the end of this step each SUM is represented asa sequence of stems extracted from the tokens containedin it. We denote the jth stemmed SUM as SUMS_ j =tSj1, . . . , tSjl, . . . , tSjLj_, where tSjl is the lth stem and Lj ,with Lj ≤ Kj , is the total number of stems in SUMSj ;d) stem filtering consists in reducing the number of stems ofeach SUM. In particular, each SUM is filtered by removingfrom the set of stems the ones not belonging to theset of relevant stems. The set of F relevant stems RS ={ˆs1, . . . , ˆsf , . . . , ˆsF } is identified during the supervisedlearning stage that will be discussed in Section IV.At the end of this step, each SUM is represented asa sequence of relevant stems. 
C. Classification of SUMs

The third module, "Classification of SUMs", assigns to each elaborated SUM a class label related to traffic events. Thus, the output of this module is a collection of N labeled SUMs. To label each SUM, a classification model is employed. The parameters of the classification model have been identified during the supervised learning stage. Actually, as will be discussed in Section V, different classification models have been considered and compared. The classifier that achieved the most accurate results was finally employed for the real-time monitoring with the proposed traffic detection system. The system continuously monitors a specific region and notifies the presence of a traffic event on the basis of a set of rules that can be defined by the system administrator. For example, when the first tweet is recognized as a traffic-related tweet, the system may send a warning signal. Then, the actual notification of the traffic event may be sent after the identification of a certain number of tweets with the same label.

IV. SETUP OF THE SYSTEM

As stated previously, a supervised learning stage is required to perform the setup of the system. In particular, we need to identify the set of relevant stems, the weights associated with each of them, and the parameters that describe the classification models. We employ a collection of N_tr labeled SUMs as training set. During the learning stage, each SUM is elaborated by applying the tokenization, stop-word filtering, and stemming steps. Then, the complete set of stems is built as follows:

$$ CS = \bigcup_{j=1}^{N_{tr}} \mathrm{SUM}^{S}_j = \{ s_1, \ldots, s_q, \ldots, s_Q \}. \qquad (2) $$

CS is the union of all the stems extracted from the N_tr SUMs of the training set. We recall that SUM^S_j is the set of stems that describes the jth SUM of the training set after the stemming step.

Then, we compute the weight of each stem in CS, which allows us to establish the importance of each stem s_q in the collection of SUMs of the training set, by using the Inverse Document Frequency (IDF) index as

$$ w_q = \mathrm{IDF}_q = \ln(N_{tr}/N_q), \qquad (3) $$

where N_q is the number of SUMs of the training set in which the stem s_q occurs [37]. The IDF index is a simplified version of the TF-IDF (Term Frequency-IDF) index [38]-[40], where the TF part considers the frequency of a specific stem within each SUM. In fact, we heuristically found that the same stem seldom appears more than once in a SUM. On the other hand, we also performed several experiments with the TF-IDF index and verified that the performance in terms of classification accuracy is similar to the one obtained by using only the IDF index. Thus, we decided to adopt the simpler IDF index as weight.
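The following is a minimal Java sketch of the weight computation of Eqs. (2)-(3): CS is obtained as the union of the stems of the training SUMs, and each stem s_q receives the IDF weight w_q = ln(N_tr / N_q). Each training SUM is represented here simply as the set of its stems after the stemming step; the class name IdfWeighting is an assumption.

import java.util.*;

// Sketch of the IDF weighting of Section IV.
public class IdfWeighting {

    public static Map<String, Double> computeIdfWeights(List<Set<String>> stemmedTrainingSums) {
        int nTr = stemmedTrainingSums.size();
        Map<String, Integer> docFrequency = new HashMap<>();   // N_q for each stem s_q in CS
        for (Set<String> sumStems : stemmedTrainingSums) {
            for (String stem : sumStems) {
                docFrequency.merge(stem, 1, Integer::sum);
            }
        }
        Map<String, Double> weights = new HashMap<>();         // w_q = ln(N_tr / N_q)
        for (Map.Entry<String, Integer> e : docFrequency.entrySet()) {
            weights.put(e.getKey(), Math.log((double) nTr / e.getValue()));
        }
        return weights;
    }
}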
In order to select the set of relevant stems, a feature selection algorithm is applied. SUMs are described by a set {S_1, ..., S_q, ..., S_Q} of Q features, where each feature S_q corresponds to the stem s_q. The possible values of feature S_q are w_q and 0.

Then, as suggested in [41], to evaluate the quality of each stem s_q we employ a method based on the computation of the Information Gain (IG) value between feature S_q and the output C = {c_1, ..., c_r, ..., c_R}, where c_r is one of the R possible class labels (two or three in our case). The IG value between S_q and C is calculated as IG(C, S_q) = H(C) − H(C|S_q), where H(C) represents the entropy of C, and H(C|S_q) represents the entropy of C after the observation of feature S_q. Finally, we identify the set of relevant stems RS by selecting all the stems with a positive IG value (a minimal sketch of this computation is reported at the end of this section). We recall that the stem selection process based on IG values is a standard and effective method widely used in the literature [40], [42].

The last part of the supervised learning stage regards the identification of the most suited classification models and the setting of their structural parameters. We took into account several classification algorithms widely used in the literature for text classification tasks [43], namely, i) SVM [44], ii) NB [45], iii) C4.5 decision tree [46], iv) k-nearest neighbor (kNN) [47], and v) PART [48]. The learning algorithms used to build these classifiers will be briefly discussed in the following section.
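Below is a minimal sketch of the information-gain criterion mentioned above, assuming each stem is represented by its binary presence in the training SUMs; the entropies are computed in bits and a stem is retained in RS when its IG is strictly positive. The class name InfoGainSelection is hypothetical.

import java.util.*;

// Sketch of IG(C, S_q) = H(C) - H(C | S_q) for one stem, given its presence/absence
// in each training SUM and the class label of each training SUM.
public class InfoGainSelection {

    private static double entropy(Collection<Integer> counts, int total) {
        double h = 0.0;
        for (int c : counts) {
            if (c > 0) {
                double p = (double) c / total;
                h -= p * Math.log(p) / Math.log(2);
            }
        }
        return h;
    }

    // presence[j] = true if the stem occurs in the j-th training SUM, labels[j] = its class
    public static double informationGain(boolean[] presence, String[] labels) {
        int n = labels.length;
        Map<String, Integer> classCounts = new HashMap<>();
        Map<String, Integer> classCountsPresent = new HashMap<>();
        Map<String, Integer> classCountsAbsent = new HashMap<>();
        int nPresent = 0;
        for (int j = 0; j < n; j++) {
            classCounts.merge(labels[j], 1, Integer::sum);
            if (presence[j]) {
                nPresent++;
                classCountsPresent.merge(labels[j], 1, Integer::sum);
            } else {
                classCountsAbsent.merge(labels[j], 1, Integer::sum);
            }
        }
        int nAbsent = n - nPresent;
        double hC = entropy(classCounts.values(), n);
        double hCgivenS = 0.0;
        if (nPresent > 0) hCgivenS += ((double) nPresent / n) * entropy(classCountsPresent.values(), nPresent);
        if (nAbsent > 0)  hCgivenS += ((double) nAbsent / n) * entropy(classCountsAbsent.values(), nAbsent);
        return hC - hCgivenS;   // the stem is kept in RS if this value is positive
    }
}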
V. EVALUATION OF THE TRAFFIC DETECTION SYSTEM

In this section, we discuss the evaluation of the proposed system. We performed several experiments using two different datasets. For each dataset, we built and compared seven different classification models: SVM, NB, C4.5, kNN (with k equal to 1, 2, and 5), and PART. In the following, we describe how we generated the datasets to complete the setup of the system, and we recall the employed classification models. Then, we present the achieved results, together with the statistical metrics used to evaluate the performance of the classifiers. Finally, we provide a comparison with some results extracted from other works in the literature.

A. Description of the Datasets

We built two different datasets, i.e., a 2-class dataset and a 3-class dataset. For each dataset, tweets in the Italian language were collected using the "Fetch of SUMs and Pre-processing" module by setting some search criteria (e.g., presence of keywords, geographic coordinates, date and time of posting). Then, the SUMs were manually labeled by assigning the correct class label.

1) 2-Class Dataset: The first dataset consists of tweets belonging to two possible classes, namely i) road traffic-related tweets (traffic class), and ii) tweets not related to road traffic (non-traffic class). The tweets were fetched in a time span of about four hours from the same geographic area. First, we fetched candidate tweets for the traffic class by using the following search criteria:

— geographic area of origin of the tweet: Italy. We set the center of the area in Rome (latitude and longitude equal to 41° 53' 35" and 12° 28' 58", respectively) and we set a radius of about 600 km to cover approximately the whole country;

— time and date of posting: tweets belong to a time span of four evening hours of two weekend days of May 2013;

— keywords contained in the text of the tweet: we apply the or-operator on the set of keywords S1, composed of the three most frequently used traffic-related keywords, S1 = {"traffico" (traffic), "coda" (queue), "incidente" (crash)}, with the aim of selecting tweets containing at least one of the above keywords. The resulting condition can be expressed as:

CondA: "traffico" or "coda" or "incidente".

Then, we fetched the candidate tweets for the non-traffic class using the same search criteria for geographic area, and time and date, but without setting any keyword. Obviously, this time, tweets containing traffic-related keywords from set S1, already found in the previous fetch, were discarded.

Finally, the tweets were manually labeled with two possible class labels, i.e., as related to a road traffic event (traffic), e.g., accidents, jams, queues, or not (non-traffic). More in detail, first we read, interpreted, and correctly assigned a traffic class label to each candidate traffic class tweet. Among all candidate traffic class tweets, we actually labeled 665 tweets with the traffic class. About 4% of the candidate traffic class tweets were not labeled with the traffic class label. With the aim of correctly training the system, we added these tweets to the non-traffic class. Indeed, we also collected a number of tweets containing the traffic-related keywords defined in S1, but actually not concerning road traffic events. Such tweets are related to, e.g., illegal drug trade, network traffic, or organ trafficking. It is worth noting that, as it happens in the English language, several words in the Italian language, e.g., "traffic" or "incident", are suitable in several contexts. So, for instance, the events "traffico di droga" (drug trade), "traffico di organi" (organ trafficking), "incidente diplomatico" (diplomatic scandal), and "traffico dati" (network traffic) could easily be mistaken for road traffic-related events.

Then, in order to obtain a balanced dataset, we randomly selected tweets from the candidate tweets of the non-traffic class until reaching 665 non-traffic class tweets, and we manually verified that the selected tweets did not belong to the traffic class. Thus, the final 2-class dataset consists of 1330 tweets and is balanced, i.e., it contains 665 tweets per class.

Table I shows the textual part of a selection of tweets fetched by the system with the corresponding, manually added, class label. In Table I, tweets #1, #2 and #3 are examples of traffic class tweets, tweet #4 is an example of a non-traffic class tweet, and tweets #5 and #6 are examples of tweets containing traffic-related keywords but belonging to the non-traffic class. In the table, for an easier understanding, the keywords appearing in the text of each tweet are underlined. Table II shows some of the most important textual features (i.e., stems) related to the traffic class tweets and their meaning, as identified by the system for this dataset.

Table I. Some examples of tweets and corresponding classes for the 2-class dataset.
Table II. Significant features related to the traffic class.
2) 3-Class Dataset: The second dataset consists of tweets belonging to three possible classes. In this case we want to discriminate whether traffic is caused by an external event (e.g., a football match, a concert, a flash-mob, a political demonstration, a fire) or not. Even though the current release of the system was not designed to identify the specific event, knowing that the traffic difficulty is caused by an external event could be useful to traffic and city administrations for regulating traffic and vehicular mobility, or for managing scheduled events in the city. More in detail, we took into account four possible external events, namely, i) matches, ii) processions, iii) music concerts, and iv) demonstrations. Thus, in this dataset the three possible classes are: i) traffic due to external event, ii) traffic congestion or crash, and iii) non-traffic. The tweets were fetched in a similar way as described before. More in detail, first we fetched candidate road traffic-related tweets due to an external event (traffic due to external event class) according to the following search criteria:

— geographic area of origin of the tweet: Italy, parameters set as in the case of the 2-class dataset;

— time and date of posting: parameters set as in the case of the 2-class dataset, but different hours of the same weekend days are used;

— keywords contained in the text of the tweet: for each external event mentioned above, we took into account only one keyword, thus obtaining the set S2 = {"partita" (match), "processione" (procession), "concerto" (concert), "manifestazione" (demonstration)}. Next we combined each keyword representing the external event with one of the traffic-related keywords from set S3 = {"traffico" (traffic), "coda" (queue)}. Finally, we applied the and-operator between each keyword from set S2 and the condition CondB expressed as:

CondB: "traffico" or "coda",

thus obtaining the following conditions:

CondC: CondB and "partita",
CondD: CondB and "processione",
CondE: CondB and "concerto",
CondF: CondB and "manifestazione".

Then, we fetched candidate tweets related to traffic congestion, crashes, and jams (traffic congestion or crash class) by using the following search criteria:

— geographic area of origin of the tweet: Italy, parameters set as in the case of the 2-class dataset;

— time and date of posting: parameters set as in the case of the 2-class dataset, but different hours of the same weekend days are used;

— keywords contained in the text of the tweet: we combined the above-mentioned keywords from set S1 into three possible sets: S4 = {"traffico" (traffic), "incidente" (crash)}, S5 = {"incidente" (crash), "coda" (queue)}, and the already defined set S3. Then we used the and-operator to define the exploited conditions as follows:

CondG: "traffico" and "incidente",
CondH: "traffico" and "coda",
CondI: "incidente" and "coda".

Obviously, as done before, tweets containing external-event-related keywords already found in the previous fetch were discarded. Further, we fetched the candidate tweets of the non-traffic class using the same search criteria for geographic area, and time and date, but without setting any keyword. Again, tweets already found in previous fetches were discarded.

Finally, the tweets were manually labeled with three possible class labels. We first labeled the candidate tweets of the traffic due to external event class (this set of tweets was the smallest one), and we identified 333 tweets actually associated with this class. Then, we randomly selected 333 tweets for each of the two remaining classes.
Also in this case, we manually verified the correctness of the labels associated with the selected tweets. Finally, as done before, we added to the non-traffic class also tweets containing keywords related to traffic congestion and to external events but not concerning road-traffic events. The final 3-class dataset consists of 999 tweets and is balanced, i.e., it has 333 tweets per class.

Table III shows a selection of tweets fetched by the system for the 3-class dataset, with the corresponding, manually added, class label. In Table III, tweets #1, #2, #3 and #4 are examples of tweets belonging to the class traffic due to external event: more in detail, #1 is related to a procession event, #2 to a match event, #3 to a concert event, and #4 to a demonstration event. Tweet #5 is an example of a tweet belonging to the class traffic congestion or crash, while tweets #6 and #7 are examples of non-traffic class tweets. Words underlined in the text of each tweet represent the involved keywords.

Table III. Some examples of tweets and corresponding classes for the 3-class dataset.

B. Employed Classification Models

In the following we briefly describe the main properties of the employed classification models.

SVMs, introduced for the first time in [49], are discriminative classification algorithms based on a separating hyper-plane according to which new samples can be classified. The best hyper-plane is the one with the maximum margin, i.e., the largest minimum distance from the training samples, and is computed based on the support vectors (i.e., samples of the training set). The SVM classifier employed in this work is the implementation described in [44].

The NB classifier [45] is a probabilistic classification algorithm based on the application of Bayes' theorem, and is characterized by a probability model which assumes independence among the input features. In other words, the model assumes that the presence of a particular feature is unrelated to the presence of any other feature.

The C4.5 decision tree algorithm [46] generates a classification decision tree by recursively dividing up the training data according to the values of the features. Non-terminal nodes of the decision tree represent tests on one or more features, while terminal nodes represent the predicted output, namely the class. In the resulting decision tree each path (from the root to a leaf) identifies a combination of feature values associated with a particular classification. At each level of the tree, the algorithm chooses the feature that most effectively splits the data, according to the highest information gain.

The kNN algorithm [50] belongs to the family of "lazy" classification algorithms. Its basic functioning principle, sketched in the code below, is the following: each unseen sample is compared with a number of pre-classified training samples, and its similarity is evaluated according to a simple distance measure (we employed the normalized Euclidean distance), in order to find the associated output class. The parameter k specifies the number of neighbors, i.e., the training samples taken into account for the classification. We focus on three kNN models with k equal to 1, 2, and 5. The kNN classifier employed in this work follows the implementation described in [47].
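The following is a minimal Java sketch of the kNN rule just described. It assumes that the feature values have already been normalized (as done by the normalized Euclidean distance of the employed implementation) and breaks ties in the majority vote arbitrarily; the class name KnnClassifier is an assumption.

import java.util.*;

// Sketch of k-nearest-neighbor classification with Euclidean distance on normalized features.
public class KnnClassifier {

    private final double[][] trainX;   // training feature vectors
    private final String[] trainY;     // training class labels
    private final int k;               // number of neighbors (1, 2, or 5 in the experiments)

    public KnnClassifier(double[][] trainX, String[] trainY, int k) {
        this.trainX = trainX;
        this.trainY = trainY;
        this.k = k;
    }

    private static double distance(double[] a, double[] b) {
        double d = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            d += diff * diff;
        }
        return Math.sqrt(d);
    }

    public String classify(double[] x) {
        // sort training sample indices by distance to the unseen sample
        Integer[] idx = new Integer[trainX.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingDouble(i -> distance(trainX[i], x)));
        // majority vote among the k nearest neighbors
        Map<String, Integer> votes = new HashMap<>();
        for (int i = 0; i < Math.min(k, idx.length); i++) {
            votes.merge(trainY[idx[i]], 1, Integer::sum);
        }
        return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
    }
}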
The PART algorithm [48] combines two rule generation methods, i.e., RIPPER [51] and C4.5 [46]. It infers classification rules by repeatedly building partial, i.e., incomplete, C4.5 decision trees and by using the separate-and-conquer rule learning technique [52].

C. Experimental Results

In this section, we present the classification results achieved by applying the classifiers mentioned in Section V-B to the two datasets described in Section V-A. For each classifier the experiments were performed using an n-fold stratified cross-validation methodology. In n-fold stratified cross-validation, the dataset is randomly partitioned into n folds and the classes in each fold are represented with the same proportion as in the original data. Then, the classification model is trained on n − 1 folds, and the remaining fold is used for testing the model. The procedure is repeated n times, using each of the n folds exactly once as test data. The n test results are finally averaged to produce an overall estimation. We repeated an n-fold stratified cross-validation, with n = 10, two times, using two different seed values to randomly partition the data into folds.

We recall that, for each fold, we consider a specific training set which is used in the supervised learning stage to learn both the pre-processing parameters (i.e., the set of relevant stems and their weights) and the classification model parameters.

To evaluate the achieved results, we employed the most frequently used statistical metrics, i.e., accuracy, precision, recall, and F-score. To explain the meaning of the metrics, we will refer, for the sake of simplicity, to the case of a binary classification, i.e., positive class versus negative class. In the case of a multi-class classification, the metrics are computed per class and the overall statistical measure is simply the average of the per-class measures. The correctness of a classification can be evaluated according to four values: i) true positives (TP): the number of real positive samples correctly classified as positive; ii) true negatives (TN): the number of real negative samples correctly classified as negative; iii) false positives (FP): the number of real negative samples incorrectly classified as positive; and iv) false negatives (FN): the number of real positive samples incorrectly classified as negative.

Based on the previous definitions, we can now formally define the employed statistical metrics and provide, in Table IV, the corresponding equations. Accuracy represents the overall effectiveness of the classifier and corresponds to the number of correctly classified samples over the total number of samples. Precision is the number of correctly classified samples of a class, i.e., the positive class, over the number of samples classified as belonging to that class. Recall is the number of correctly classified samples of a class, i.e., the positive class, over the number of samples of that class; it represents the effectiveness of the classifier in identifying positive samples. The F-score (typically used with β = 1 for class-balanced datasets) is the weighted harmonic mean of precision and recall, and it is used to compare different classifiers.

Table IV. Statistical metrics.
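Since the equations of Table IV are not reproduced here, the following sketch restates the per-class metrics from the four confusion-matrix counts; with β = 1 the F-score reduces to the harmonic mean of precision and recall. The counts used in the main method are made-up numbers, only to show the calls.

// Sketch of the statistical metrics of Table IV for a binary (positive vs. negative) problem.
public class ClassificationMetrics {

    public static double accuracy(int tp, int tn, int fp, int fn) {
        return (double) (tp + tn) / (tp + tn + fp + fn);
    }

    public static double precision(int tp, int fp) {
        return (double) tp / (tp + fp);
    }

    public static double recall(int tp, int fn) {
        return (double) tp / (tp + fn);
    }

    public static double fScore(double precision, double recall, double beta) {
        double b2 = beta * beta;
        return (1 + b2) * precision * recall / (b2 * precision + recall);
    }

    public static void main(String[] args) {
        double p = precision(640, 30);   // illustrative counts, not results from the paper
        double r = recall(640, 25);
        System.out.printf("precision=%.3f recall=%.3f F1=%.3f%n", p, r, fScore(p, r, 1.0));
    }
}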
In the first experiment, we performed a classification of tweets using the 2-class dataset (R = 2) consisting of 1330 tweets, described in Section V-A. The aim is to assign a class label (traffic or non-traffic) to each tweet. Table V shows the average classification results obtained by the classifiers on the 2-class dataset. More in detail, the table shows, for each classifier, the accuracy and the per-class values of recall, precision, and F-score. All the values are averaged over the 20 values obtained by repeating the 10-fold cross-validation two times. The best classifier turned out to be the SVM, with an average accuracy of 95.75%.

Table V. Classification results on the 2-class dataset (best values in bold).

As Table VI clearly shows, the results achieved by our SVM classifier appreciably outperform those obtained in similar works in the literature [9], [12], [24], [31], although they refer to different datasets. More precisely, Wanichayapong et al. [12] obtained an accuracy of 91.75% by using an approach that considers the presence of place mentions and special keywords in the tweet. Li et al. [31] achieved an accuracy of 80% for detecting incident-related tweets using Twitter-specific features, such as hashtags, mentions, URLs, and spatial and temporal information. Sakaki et al. [9] employed an SVM to identify heavy-traffic tweets and obtained an accuracy of 87%. Finally, Schulz et al. [24], by using SVM, RIPPER, and NB classifiers, obtained accuracies of 89.06%, 85.93%, and 86.25%, respectively. In the case of the SVM, they used the following features: word n-grams, TF-IDF score, syntactic and semantic features. In the case of NB and RIPPER they employed the same set of features except the semantic features.

Table VI. Results of the classification of tweets in other works in the literature.

In the second experiment, we performed a classification of tweets over three classes (R = 3), namely, traffic due to external event, traffic congestion or crash, and non-traffic, with the aim of discriminating the cause of traffic. Thus, we employed the 3-class dataset consisting of 999 tweets, described in Section V-A. We employed again the classifiers previously introduced, and the obtained results are shown in Table VII. The best classifier turned out to be again the SVM, with an average accuracy of 88.89%.

Table VII. Classification results on the 3-class dataset (best values in bold).

In order to verify whether there exist statistical differences among the values of accuracy achieved by the seven classification models, we performed a statistical analysis of the results. We took into account the model which obtains the best average accuracy, i.e., the SVM model. As suggested in [53], we applied non-parametric statistical tests: for each classifier we generated a distribution consisting of the 20 values of the accuracy on the test set obtained by repeating the 10-fold cross-validation two times. We statistically compared the results achieved by the SVM model with those achieved by the remaining models. We applied the Wilcoxon signed-rank test [54], which detects significant differences between two distributions. In all the tests, we used α = 0.05 as the level of significance. Tables VIII and IX show the results of the Wilcoxon signed-rank test related to the 2-class and the 3-class problems, respectively. In the tables, R+ and R− denote, respectively, the sum of ranks for the folds in which the first model outperformed the second, and the sum of ranks for the opposite condition. Since the p-values are always lower than the level of significance, we can always reject the statistical hypothesis of equivalence. For this reason, we can state that the SVM model statistically outperforms all the other approaches on both problems.

Table VIII. Results of the Wilcoxon signed-rank test on the accuracies obtained on the test set for the 2-class dataset.
Table IX. Results of the Wilcoxon signed-rank test on the accuracies obtained on the test set for the 3-class dataset.
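For completeness, the following is a minimal sketch of the Wilcoxon signed-rank test applied to two paired accuracy distributions (e.g., the 20 per-fold accuracies of two classifiers). It returns the rank sums R+ and R− and the z statistic of the normal approximation, which is an assumption of this sketch rather than necessarily the exact procedure used in the paper; at α = 0.05 (two-sided) the hypothesis of equivalence is rejected when |z| > 1.96. Zero differences are discarded and tied absolute differences receive average ranks.

import java.util.*;

// Sketch of the Wilcoxon signed-rank test for paired samples.
public class WilcoxonSignedRank {

    // returns {R+, R-, z}
    public static double[] test(double[] a, double[] b) {
        List<double[]> diffs = new ArrayList<>();          // each entry: {|d|, sign(d)}
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            if (d != 0.0) diffs.add(new double[]{Math.abs(d), Math.signum(d)});
        }
        diffs.sort(Comparator.comparingDouble(x -> x[0]));
        int n = diffs.size();
        double[] ranks = new double[n];
        for (int i = 0; i < n; ) {
            int j = i;
            while (j < n && diffs.get(j)[0] == diffs.get(i)[0]) j++;
            double avgRank = (i + 1 + j) / 2.0;             // average rank for tied |d|
            for (int t = i; t < j; t++) ranks[t] = avgRank;
            i = j;
        }
        double rPlus = 0.0, rMinus = 0.0;
        for (int i = 0; i < n; i++) {
            if (diffs.get(i)[1] > 0) rPlus += ranks[i]; else rMinus += ranks[i];
        }
        double mean = n * (n + 1) / 4.0;
        double sd = Math.sqrt(n * (n + 1) * (2.0 * n + 1) / 24.0);
        double z = (Math.min(rPlus, rMinus) - mean) / sd;   // normal approximation
        return new double[]{rPlus, rMinus, z};
    }
}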
VI. REAL-TIME DETECTION OF TRAFFIC EVENTS

The developed system was installed and tested for the real-time monitoring of several areas of the Italian road network, by means of the analysis of the Twitter stream coming from those areas. The aim is to perform a continuous monitoring of frequently busy roads and highways in order to detect possible traffic events in real time, or even in advance with respect to the traditional news media [55], [56]. The system is implemented as a service of a wider service-oriented platform to be developed in the context of the SMARTY project [23]. The service can be called by each user of the platform who desires to know the traffic conditions in a certain area. In this section, we aim to show the effectiveness of our system in determining traffic events in a short time. We present some results only for the 2-class problem. For the setup of the system, we employed as training set the overall dataset described in Section V-A. We adopt only the best performing classifier, i.e., the SVM classifier. During the learning stage, we identified Q = 3227 features, which were reduced to F = 582 features after the feature selection step.

The system continuously performs the following operations: i) it fetches, with a time frequency of z minutes, the tweets originated from a given area and containing the keywords resulting from CondA; ii) it performs a real-time classification of the fetched tweets; and iii) it detects a possible traffic-related event by analyzing the traffic class tweets from the considered area and, if needed, sends one or more traffic warning signals with increasing intensity for that area. More in detail, a first low-intensity warning signal is sent when m traffic class tweets are found in the considered area in the same or in subsequent temporal windows. Then, as the number of traffic class tweets grows, the warning signal becomes more reliable, and thus more intense. The value of m was set based on heuristic considerations, depending, e.g., on the traffic density of the monitored area. In the experiments we set m = 1. As regards the fetching frequency z, we heuristically found that z = 10 minutes represents a good compromise between fast event detection and system scalability. In fact, z should be set depending on the number of monitored areas and on the volume of tweets fetched. A sketch of this notification rule is given below.
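The sketch assumes two hypothetical interfaces that abstract the fetching and the classification of SUMs; the class and method names are not the actual system API. Every z minutes (z = 10 in the experiments) the tweets of the last temporal window are classified, and a warning of increasing intensity is issued once at least m traffic-class tweets have been observed.

import java.util.*;

// Sketch of the real-time notification rule described above.
public class TrafficMonitor {

    interface TweetSource { List<String> fetchLatest(); }             // SUMs of the last window
    interface TrafficClassifier { boolean isTrafficRelated(String sum); }

    private final TweetSource source;
    private final TrafficClassifier classifier;
    private final int m;                 // traffic-class tweets needed to trigger the first warning
    private int trafficTweetCount = 0;   // accumulated over the same and subsequent windows

    public TrafficMonitor(TweetSource source, TrafficClassifier classifier, int m) {
        this.source = source;
        this.classifier = classifier;
        this.m = m;
    }

    // invoked every z minutes for the monitored area
    public void processWindow() {
        for (String sum : source.fetchLatest()) {
            if (classifier.isTrafficRelated(sum)) trafficTweetCount++;
        }
        if (trafficTweetCount >= m) {
            // the intensity grows with the number of traffic-class tweets observed so far
            System.out.println("Traffic warning, intensity level " + (trafficTweetCount - m + 1));
        }
    }
}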
With the aim of evaluating the effectiveness of our system, each detected traffic-related event needs to be appropriately validated. Validation can be performed in different ways, which include: i) direct communication by a person who was present at the moment of the event; ii) reports drawn up by the police and/or local administrations (available only in case of incidents); iii) radio traffic news; iv) official real-time traffic news web sites; and v) local newspapers (often the day after the event and only when the event is very significant).

Direct communication is possible only if a person is present at the event and can communicate it to us. Although we have tried to sensitize a number of users, we did not obtain adequate feedback. Official reports are confidential: police and local administrations rarely allow access to these reports, and, when this permission is granted, the reports can be consulted only after several days. Radio traffic news is in general quite precise in communicating traffic-related events in real time. Unfortunately, to monitor and store the events, we would have to dedicate a person or adopt some tool for audio analysis. We realized, however, that the traffic-related events communicated on the radio are always mentioned also on the official real-time traffic news web sites. Actually, on the radio, the speaker typically reads the news reported on the web sites. Local newspapers focus on local traffic-related events and often report events which are not published on official traffic news web sites. Concluding, official real-time traffic news web sites and local newspapers are the most reliable and effective sources of information for traffic-related events. Thus, we decided to analyze two of the most popular real-time traffic news web sites for the Italian road network, namely "CCISS Viaggiare informati" (http://www.cciss.it/), managed by the Italian government Ministry for infrastructures and transports, and "Autostrade per l'Italia" (http://www.autostrade.it/autostrade-gis/gis.do), the official web site of the Italian highway road network. Further, we examined local newspapers published in the zones where our system was able to detect traffic-related events.

Actually, it was really difficult to find realistic data to test the proposed system, basically for two reasons: on the one hand, we have realized that real traffic events are not always notified in official news channels; on the other hand, situations of traffic slowdown may be detected by traditional traffic sensors but, at the same time, may not give rise to tweets. In particular, in relation to this latter reason, it is well known that drivers usually share a tweet about a traffic event only when the event is unexpected and really serious, i.e., when it forces them to stop the car. So, for instance, they do not share a tweet in case of road works, minor traffic difficulties, or usual traffic jams (same place and same time). In fact, in correspondence to minor traffic jams we rarely find tweets coming from the affected area.

We have tried to build a meaningful set of traffic events, related to some major Italian cities, for which we have found an official confirmation. The selected set includes events correctly identified by the proposed system and confirmed via official traffic news web sites or local newspapers. The set of traffic events, whose information is summarized in Table X, consists of 70 events detected by our system. The events are related both to highways and to urban roads, and were detected during September and early October 2014.

Table X. Real-time detection of traffic events.

Table X shows the information about each event, the time of detection from the Twitter stream fetched by our system, the time of detection from official news web sites or local newspapers, and the difference between these two times. In the table, positive differences indicate a late detection with respect to official news web sites, while negative differences indicate an early detection. The symbol "-" indicates that we found the official confirmation of the event by reading local newspapers several hours later. More precisely, the system detects in advance 20 events out of 59 confirmed by news web sites, and 11 events confirmed the day after by local newspapers.
Regarding the 39 events not detected in advance, we can observe that 25 of them were detected within 15 minutes from their official notification, while the detection of the remaining 14 events occurred beyond 15 minutes but within 50 minutes. We wish to point out, however, that, even in the cases of late detection, our system directly and explicitly notifies the event occurrence to the drivers or passengers registered to the SMARTY platform, on which our system runs. On the contrary, in order to get traffic information, drivers or passengers usually need to search and access the official news web sites, which may take some time and effort, or to wait for the information from the radio traffic news.

As future work, we are planning to integrate our system with an application for analyzing the official traffic news web sites, so as to capture traffic condition notifications in real time. Thus, our system will be able to signal traffic-related events, in the worst case, at the same time as the notifications on the web sites. Further, we are investigating the integration of our system into a more complex traffic detection infrastructure. This infrastructure may include both advanced physical sensors and social sensors such as streams of tweets. In particular, social sensors may provide a low-cost wide coverage of the road network, especially in those areas (e.g., urban and suburban) where traditional traffic sensors are missing.

VII. CONCLUSION

In this paper, we have proposed a system for the real-time detection of traffic-related events from Twitter stream analysis. The system, built on an SOA, is able to fetch and classify streams of tweets and to notify the users of the presence of traffic events. Furthermore, the system is also able to discriminate whether a traffic event is due to an external cause, such as a football match, a procession, or a demonstration, or not.

We have exploited available software packages and state-of-the-art techniques for text analysis and pattern classification. These technologies and techniques have been analyzed, tuned, adapted, and integrated in order to build the overall system for traffic event detection. Among the analyzed classifiers, we have shown the superiority of the SVMs, which achieved an accuracy of 95.75% for the 2-class problem, and of 88.89% for the 3-class problem, in which we have also considered the traffic due to external event class.

The best classification model has been employed for the real-time monitoring of several areas of the Italian road network. We have shown the results of a monitoring campaign performed in September and early October 2014. We have discussed the capability of the system of detecting traffic events almost in real time, often before online news web sites and local newspapers.

ACKNOWLEDGMENT

We would like to thank Fabio Cempini for the implementation of some parts of the system presented in this paper.

REFERENCES

[1] F. Atefeh and W. Khreich, "A survey of techniques for event detection in Twitter," Comput. Intell., vol. 31, no. 1, pp. 132–164, 2015.
[2] P. Ruchi and K. Kamalakar, "ET: Events from tweets," in Proc. 22nd Int. Conf. World Wide Web Comput., Rio de Janeiro, Brazil, 2013, pp. 613–620.
[3] A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, and B. Bhattacharjee, "Measurement and analysis of online social networks," in Proc. 7th ACM SIGCOMM Conf. Internet Meas., San Diego, CA, USA, 2007, pp. 29–42.
[4] G. Anastasi et al., "Urban and social sensing for sustainable mobility in smart cities," in Proc. IFIP/IEEE Int.
Conf. Sustainable Internet ICTSustainability, Palermo, Italy, 2013, pp. 1–4.[5] A. Rosi et al., “Social sensors and pervasive services: Approaches andperspectives,” in Proc. IEEE Int. Conf. PERCOM Workshops, Seattle,WA, USA, 2011, pp. 525–530.[6] T. Sakaki, M. Okazaki, and Y.Matsuo, “Tweet analysis for real-time eventdetection and earthquake reporting system development,” IEEE Trans.Knowl. Data Eng., vol. 25, no. 4, pp. 919–931, Apr. 2013.[7] J. Allan, Topic Detection and Tracking: Event-Based InformationOrganization. Norwell, MA, USA: Kluwer, 2002.[8] K. Perera and D. Dias, “An intelligent driver guidance tool using locationbased services,” in Proc. IEEE ICSDM, Fuzhou, China, 2011,pp. 246–251.2282 IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 16, NO. 4, AUGUST 2015[9] T. Sakaki, Y. Matsuo, T. Yanagihara, N. P. Chandrasiri, and K. Nawa,“Real-time event extraction for driving information from social sensors,”in Proc. IEEE Int. Conf. CYBER, Bangkok, Thailand, 2012,pp. 221–226.[10] B. Chen and H. H. Cheng, “A review of the applications of agent technologyin traffic and transportation systems,” IEEE Trans. Intell. Transp.Syst., vol. 11, no. 2, pp. 485–497, Jun. 2010.[11] A. Gonzalez, L. M. Bergasa, and J. J. Yebes, “Text detection and recognitionon traffic panels from street-level imagery using visual appearance,”IEEE Trans. Intell. Transp. Syst., vol. 15, no. 1, pp. 228–238,Feb. 2014.[12] N. Wanichayapong, W. Pruthipunyaskul, W. Pattara-Atikom, andP. Chaovalit, “Social-based traffic information extraction and classification,”in Proc. 11th Int. Conf. ITST, St. Petersburg, Russia, 2011,pp. 107–112.[13] P. M. d’Orey and M. Ferreira, “ITS for sustainable mobility: A surveyon applications and impact assessment tools,” IEEE Trans. Intell. Transp.Syst., vol. 15, no. 2, pp. 477–493, Apr. 2014.[14] K. Boriboonsomsin, M. Barth, W. Zhu, and A. Vu, “Eco-routing navigationsystem based on multisource historical and real-time trafficinformation,” IEEE Trans. Intell. Transp. Syst., vol. 13, no. 4,pp. 1694–1704, Dec. 2012.[15] J. Hurlock and M. L. Wilson, “Searching twitter: Separating the tweetfrom the chaff,” in Proc. 5th AAAI ICWSM, Barcelona, Spain, 2011,pp. 161–168.[16] J. Weng and B.-S. Lee, “Event detection in Twitter,” in Proc. 5th AAAIICWSM, Barcelona, Spain, 2011, pp. 401–408.[17] S. Weiss, N. Indurkhya, T. Zhang, and F. Damerau, Text Mining: PredictiveMethods for Analyzing Unstructured Information. Berlin, Germany:Springer-Verlag, 2004.[18] A. Hotho, A. Nürnberger, and G. Paaß, “A brief survey of text mining,”LDV Forum-GLDV J. Comput. Linguistics Lang. Technol., vol. 20, no. 1,pp. 19–62, May 2005.[19] V. Gupta, S. Gurpreet, and S. Lehal, “A survey of text mining techniquesand applications,” J. Emerging Technol. Web Intell., vol. 1, no. 1,pp. 60–76, Aug. 2009.[20] V. Ramanathan and T. Meyyappan, “Survey of text mining,” in Proc. Int.Conf. Technol. Bus. Manage., Dubai, UAE, 2013, pp. 508–514.[21] M.W. Berry and M. Castellanos, Survey of Text Mining. NewYork,NY,USA: Springer-Verlag, 2004.[22] H. Takemura and K. Tajima, “Tweet classification based on their lifetimeduration,” in Proc. 21st ACM Int. CIKM, Shanghai, China, 2012,pp. 2367–2370.[23] The Smarty project. [Online]. Available: http://www.smarty.toscana.it/[24] A. Schulz, P. Ristoski, and H. Paulheim, “I see a car crash: Real-timedetection of small scale incidents in microblogs,” in The Semantic Web:ESWC 2013 Satellite Events, vol. 7955. Berlin, Germany: Springer-Verlag, 2013, pp. 22–33.[25] M. Krstajic, C. Rohrdantz, M. 
Hund, and A. Weiler, “Getting there first:Real-time detection of real-world incidents on Twitter” in Proc. 2nd IEEEWork Interactive Vis. Text Anal.—Task-Driven Anal. Soc. Media IEEEVisWeek,” Seattle, WA, USA, 2012.[26] C. Chew and G. Eysenbach, “Pandemics in the age of Twitter: Contentanalysis of tweets during the 2009 H1N1 outbreak,” PLoS ONE, vol. 5,no. 11, pp. 1–13, Nov. 2010.[27] B. De Longueville, R. S. Smith, and G. Luraschi, “OMG, from here, I cansee the flames!: A use case of mining location based social networks toacquire spatio-temporal data on forest fires,” in Proc. Int. Work. LBSN,2009 Seattle, WA, USA, pp. 73–80.[28] J. Yin, A. Lampert, M. Cameron, B. Robinson, and R. Power, “Usingsocial media to enhance emergency situation awareness,” IEEE Intell.Syst., vol. 27, no. 6, pp. 52–59, Nov./Dec. 2012.[29] P. Agarwal, R. Vaithiyanathan, S. Sharma, and G. Shro, “Catching thelong-tail: Extracting local news events from Twitter,” in Proc. 6th AAAIICWSM, Dublin, Ireland, Jun. 2012, pp. 379–382.[30] F. Abel, C. Hauff, G.-J. Houben, R. Stronkman, and K. Tao,“Twitcident: fighting fire with information from social web streams,”in Proc. ACM 21st Int. Conf. Comp. WWW, Lyon, France, 2012,pp. 305–308.[31] R. Li, K. H. Lei, R. Khadiwala, and K. C.-C. Chang, “TEDAS: A Twitterbasedevent detection and analysis system,” in Proc. 28th IEEE ICDE,Washington, DC, USA, 2012, pp. 1273–1276.[32] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, andI. H. Witten, “The WEKA data mining software: An update,” SIGKDDExplor. Newsl., vol. 11, no. 1, pp. 10–18, Jun. 2009.[33] M. Habibi, Real World Regular Expressions with Java 1.4. Berlin,Germany: Springer-Verlag, 2004.[34] Y. Zhou and Z.-W. Cao, “Research on the construction and filter methodof stop-word list in text preprocessing,” in Proc. 4th ICICTA, Shenzhen,China, 2011, vol. 1, pp. 217–221.[35] W. Francis and H. Kucera, “Frequency analysis of English usage:Lexicon and grammar,” J. English Linguistics, vol. 18, no. 1, pp. 64–70,Apr. 1982.[36] M. F. Porter, “An algorithm for suffix stripping,” Program: Electron.Library Inf. Syst., vol. 14, no. 3, pp 130–137, 1980.[37] G. Salton and C. Buckley, “Term-weighting approaches in automatic textretrieval,” Inf. Process. Manage., vol. 24, no. 5, pp. 513–523, Aug. 1988.[38] L. M. Aiello et al., “Sensing trending topics in Twitter,” IEEE Trans.Multimedia, vol. 15, no. 6, pp. 1268–1282, Oct. 2013.[39] C. Shang, M. Li, S. Feng, Q. Jiang, and J. Fan, “Feature selection viamaximizing global information gain for text classification,” Knowl.-BasedSyst., vol. 54, pp. 298–309, Dec. 2013.[40] L. H. Patil and M. Atique, “A novel feature selection based on informationgain using WordNet,” in Proc. SAI Conf., London, U.K., 2013,pp. 625–629.[41] M. A. Hall and G. Holmes. “Benchmarking attribute selection techniquesfor discrete class data mining,” IEEE Trans. Knowl. Data Eng., vol. 15,no. 6, pp. 1437–1447, Nov./Dec. 2003.[42] H. U˘guz, “A two-stage feature selection method for text categorization byusing information gain, principal component analysis and genetic algorithm,”Knowl.-Based Syst., vol. 24, no. 7, pp. 1024–1032, Oct. 2011.[43] Y. Aphinyanaphongs et al., “A comprehensive empirical comparisonof modern supervised classification and feature selection methods fortext categorization,” J. Assoc. Inf. Sci. Technol., vol. 65, no. 10,pp. 1964–1987, Oct. 2014.[44] J. Platt, “Fast training of support vector machines using sequentialminimal optimization,” in Advances in Kernel Methods: Support VectorLearning, B. Schoelkopf, C. 
J. C. Burges and A. J. Smola, Eds.Cambridge, MA, USA, MIT Press, 1999, pp. 185–208.[45] G. H. John and P. Langley, “Estimating continuous distributions inBayesian classifiers,” in Proc. 11th Conf. Uncertainty Artif. Intell.,San Mateo, CA, 1995, pp. 338–345.[46] J. R. Quinlan, C4.5: Programs for Machine Learning. San Mateo, CA,USA: Morgan Kaufmann, 1993.[47] D. W. Aha, D. Kibler, and M. K. Albert, “Instance-based learningalgorithms,” Mach. Learn., vol. 6, no. 1, pp. 37–66, Jan. 1991.[48] E. Frank and I. H. Witten, “Generating accurate rule sets withoutglobal optimization,” in Proc. 15th ICML, Madison, WI, USA, 1998,pp. 144–151.[49] C. Cortes and V. Vapnik, “Support-vector networks,” Mach. Learn.,vol. 20, no. 3, pp. 273–297, Sep. 1995.[50] T. T. Cover and P. E. Hart, “Nearest neighbour pattern classification,”IEEE Trans. Inf. Theory, vol. IT-13, no. 1, pp. 21–27, Jan. 1967.[51] W. W. Cohen, “Fast effective rule induction,” in Proc. 12th ICML, TahoeCity, CA, USA, 1995, pp. 115–123.[52] G. Pagallo and D. Haussler, “Boolean feature discovery in empiricallearning,” Mach. Learn., vol. 5, no. 1, pp. 71–99, Mar. 1990.[53] J. Derrac, S. Garcia, D. Molina, and F. Herrera, “A practical tutorial onthe use of nonparametric statistical tests as a methodology for comparingevolutionary and swarm intelligence algorithms,” Swarm Evol. Comput.,vol. 1, no. 1, pp. 3–18, Mar. 2011.[54] F. Wilcoxon, “Individual comparisons by ranking methods,” BiometricsBull. , vol. 1, no. 6, pp. 80–83, Dec. 1945.[55] H. Becker, M. Naaman, and L. Gravano, “Beyond trending topics:Real-world event identification on Twitter,” in Proc. 5th AAAI ICWSM,Barcelona, Spain, 2011, pp. 438–441.[56] H. Kwak, C. Lee, H. Park, and S. Moon, “What is Twitter, a social networkor a news media?” in Proc. ACM 19th Int. Conf. WWW, Raleigh, NY,USA, 2010, pp. 591–600.Eleonora D’Andrea received the M.S. degree incomputer engineering for enterprise managementand the Ph.D. degree in information engineeringfrom University of Pisa, Pisa, Italy, in 2009 and 2013,respectively.She is a Research Fellow with the Research Center“E. Piaggio,” University of Pisa. She has coauthoredseveral papers in international journals and conferenceproceedings. Her main research interests includecomputational intelligence techniques for simulationand prediction, applied to various fields, suchas energy consumption in buildings or energy production in solar photovoltaicinstallations.D’ANDREA et al.: REAL-TIME DETECTION OF TRAFFIC FROM TWITTER STREAM ANALYSIS 2283Pietro Ducange received the M.Sc. degree in computerengineering and the Ph.D. degree in informationengineering from University of Pisa, Pisa, Italy,in 2005 and 2009, respectively.He is an Associate Professor with the Faculty ofEngineering, eCampus University, Novedrate, Italy.He has coauthored more than 30 papers in internationaljournals and conference proceedings. Hismain research interests focus on designing fuzzyrule-based systems with different tradeoffs betweenaccuracy and interpretability by using multiobjectiveevolutionary algorithms. He currently serves the following international journalsas a member of the Editorial Board: Soft Computing and InternationalJournal of Swarm Intelligence and Evolutionary Computation.Beatrice Lazzerini (M’98) is a Full Professor withthe Department of Information Engineering, Universityof Pisa, Pisa, Italy. She has cofounded theComputational Intelligence Group in the Departmentof Information Engineering, University of Pisa. 
Shehas coauthored seven books and has published over200 papers in international journals and conferences.She is a coeditor of two books. Her research interestsare in the field of computational intelligence and itsapplications to pattern classification, pattern recognition,risk analysis, risk management, diagnosis,forecasting, and multicriteria decision making. She was involved and hadroles of responsibility in several national and international research projects,conferences, and scientific events.Francesco Marcelloni (M’06) received the Laureadegree in electronics engineering and the Ph.D. degreein computer engineering from University ofPisa, Pisa, Italy, in 1991 and 1996, respectively.He is an Associate Professor with University ofPisa. He has cofounded the Computational IntelligenceGroup in the Department of Information Engineering,University of Pisa, in 2002. He is alsothe Founder and Head of the Competence Centreon MObile Value Added Services (MOVAS). Hehas coedited three volumes and four journal SpecialIssues and is the (co)author of a book and of more than 190 papers ininternational journals, books, and conference proceedings. His main researchinterests include multiobjective evolutionary fuzzy systems, situation-awareservice recommenders, energy-efficient data compression and aggregation inwireless sensor nodes, and monitoring systems for energy efficiency in buildings.Currently, he is an Associate Editor for Information Sciences (Elsevier)and Soft Computing (Springer) and is on the Editorial Board of four otherinternational journals.
Real-Time Detection of Traffic From Twitter Stream Analysis
Abstract—Social networks have been recently employed as a source of information for event detection, with particular reference to road traffic congestion and car accidents. In this paper, we present a real-time monitoring system for traffic event detection from Twitter stream analysis. The system fetches tweets from Twitter according to several search criteria; processes tweets, by applying text mining techniques; and finally performs the classification of tweets. The aim is to assign the appropriate class label to each tweet, as related to a traffic event or not. The traffic detection system was employed for real-time monitoring of several areas of the Italian road network, allowing for detection of traffic events almost in real time, often before online traffic news web sites. We employed the support vector machine as a classification model, and we achieved an accuracy value of 95.75% by solving a binary classification problem (traffic versus non-traffic tweets). We were also able to discriminate if traffic is caused by an external event or not, by solving a multiclass classification problem and obtaining an accuracy value of 88.89%.

Index Terms—Traffic event detection, tweet classification, text mining, social sensing.

I. INTRODUCTION

SOCIAL network sites, also called micro-blogging services (e.g., Twitter, Facebook, Google+), have spread in recent years, becoming a new kind of real-time information channel. Their popularity stems from their portability, thanks to the many social network applications for smartphones and tablets, their ease of use, and their real-time nature [1], [2]. People intensely use social networks to report (personal or public) real-life events happening around them, or simply to express their opinion on a given topic, through a public message. Social networks allow people to create an identity and let them share it in order to build a community. The resulting social network is then a basis for maintaining social relationships, finding users with similar interests, and locating content and knowledge entered by other users [3].

The user message shared in social networks is called Status Update Message (SUM), and it may contain, apart from the text, meta-information such as timestamp, geographic coordinates (latitude and longitude), name of the user, links to other resources, hashtags, and mentions.
Several SUMs referring toa certain topic or related to a limited geographic area may provide,if correctly analyzed, great deal of valuable informationabout an event or a topic. In fact, we may regard social networkusers as social sensors [4], [5], and SUMs as sensor information[6], as it happens with traditional sensors.Recently, social networks and media platforms have beenwidely used as a source of information for the detection ofevents, such as traffic congestion, incidents, natural disasters(earthquakes, storms, fires, etc.), or other events. An eventcan be defined as a real-world occurrence that happens in aspecific time and space [1], [7]. In particular, regarding trafficrelatedevents, people often share by means of an SUM informationabout the current traffic situation around them whiledriving. For this reason, event detection from social networksis also often employed with Intelligent Transportation Systems(ITSs). An ITS is an infrastructure which, by integrating ICTs(Information and Communication Technologies) with transportnetworks, vehicles and users, allows improving safety and managementof transport networks. ITSs provide, e.g., real-timeinformation about weather, traffic congestion or regulation, orplan efficient (e.g., shortest, fast driving, least polluting) routes[4], [6], [8]–[14].However, event detection from social networks analysis isa more challenging problem than event detection from traditionalmedia like blogs, emails, etc., where texts are wellformatted[2]. In fact, SUMs are unstructured and irregulartexts, they contain informal or abbreviated words, misspellingsor grammatical errors [1]. Due to their nature, they are usuallyvery brief, thus becoming an incomplete source of information[2]. Furthermore, SUMs contain a huge amount of not usefulor meaningless information [15], which has to be filtered.According to Pear Analytics,1 it has been estimated that over40% of all Twitter2 SUMs (i.e., tweets) is pointless with nouseful information for the audience, as they refer to the personalsphere [16]. For all of these reasons, in order to analyze theinformation coming from social networks, we exploit text miningtechniques [17], which employ methods from the fields of1http://www.pearanalytics.com/, 2009.2https://twitter.com.1524-9050 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.2270 IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 16, NO. 4, AUGUST 2015data mining, machine learning, statistics, and Natural LanguageProcessing (NLP) to extract meaningful information [18].More in detail, text mining refers to the process of automaticextraction of meaningful information and knowledge from unstructuredtext. The main difficulty encountered in dealing withproblems of text mining is caused by the vagueness of naturallanguage. In fact, people, unlike computers, are perfectly able tounderstand idioms, grammatical variations, slang expressions,or to contextualize a given word. On the contrary, computershave the ability, lacking in humans, to quickly process largeamounts of information [19], [20].The text mining process is summarized in the following.First, the information content of the document is convertedinto a structured form (vector space representation). 
In fact,most of text mining techniques are based on the idea that adocument can be faithfully represented by the set of wordscontained in it (bag-of-words representation [21]). Accordingto this representation, each document j of a collection ofdocuments is represented as an M-dimensional vector Vj ={w(tj1), . . . , w(tji), . . . , w(tjM)}, where M is the number ofwords defined in the document collection, and w(tji) specifiesthe weight of the word ti in document j. The simplest weightingmethod assigns a binary value to w(tji), thus indicating theabsence or the presence of the word ti, while other methodsassign a real value to w(tji). During the text mining process,several operations can be performed [21], depending on the specificgoal, such as: i) linguistic analysis through the applicationof NLP techniques, indexing and statistical techniques, ii) textfiltering by means of specific keywords, iii) feature extraction,i.e., conversion of textual features (e.g., words) in numericfeatures (e.g., weights), that a machine learning algorithm isable to process, and iv) feature selection, i.e., reduction of thenumber of features in order to take into account only the mostrelevant ones. The feature selection is particularly important,since one of the main problems in text mining is the highdimensionality of the feature space _M. Then, data miningand machine learning algorithms (i.e., support vector machines(SVMs), decision trees, neural networks, etc.) are applied tothe documents in the vector space representation, to build classification,clustering or regression models. Finally, the resultsobtained by the model are interpreted by means of measuresof effectiveness (e.g., statistical-based measures) to verify theaccuracy achieved. Additionally, the obtained results may beimproved, e.g., by modifying the values of the parameters usedand repeating the whole process.Among social networks platforms, we took into accountTwitter, as the majority of works in the literature regardingevent detection focus on it. Twitter is nowadays the mostpopular micro-blogging service; it counts more than 600 millionactive users,3 sharing more than 400 million SUMs perday [1]. Regarding the aim of this paper, Twitter has severaladvantages over the similar micro-blogging services. First,tweets are up to 140 characters, enhancing the real-time andnews-oriented nature of the platform. In fact, the life-time oftweets is usually very short, thus Twitter is the social network3http://www.statisticbrain.com/twitter-statisticsplatform that is best suited to study SUMs related to real-timeevents [22]. Second, each tweet can be directly associated withmeta-information that constitutes additional information. Third,Twitter messages are public, i.e., they are directly available withno privacy limitations. For all of these reasons, Twitter is a goodsource of information for real-time event detection and analysis.In this paper, we propose an intelligent system, based on textmining and machine learning algorithms, for real-time detectionof traffic events from Twitter stream analysis. The system,after a feasibility study, has been designed and developed fromthe ground as an event-driven infrastructure, built on a ServiceOriented Architecture (SOA) [23]. The system exploits availabletechnologies based on state-of-the-art techniques for textanalysis and pattern classification. These technologies and techniqueshave been analyzed, tuned, adapted, and integrated inorder to build the intelligent system. 
In particular, we present anexperimental study, which has been performed for determiningthe most effective among different state-of-the-art approachesfor text classification. The chosen approach was integrated intothe final system and used for the on-the-field real-time detectionof traffic events.The paper has the following structure. Section II summarizesrelated work about event detection from social Twitter streamanalysis. Section III outlines the architecture of the proposedsystem for traffic detection, by describing the methodologyused to collect, elaborate, and classify SUMs, with particularreference to SUMs extracted from the Twitter stream.Section IV describes the setup of the system. Section V presentsthe results achieved with different classification models andprovides a comparison with similar works in the literature.Section VI presents the real-world monitoring application forreal-time detection of traffic events. Finally, Section VII providesconcluding remarks.II. RELATED WORKWith reference to current approaches for using social mediato extract useful information for event detection, we need todistinguish between small-scale events and large-scale events.Small-scale events (e.g., traffic, car crashes, fires, or localmanifestations) usually have a small number of SUMs relatedto them, belong to a precise geographic location, and areconcentrated in a small time interval. On the other hand, largescaleevents (e.g., earthquakes, tornados, or the election of apresident) are characterized by a huge number of SUMs, and bya wider temporal and geographic coverage [24]. Consequently,due to the smaller number of SUMs related to small-scaleevents, small-scale event detection is a non-trivial task. Severalworks in the literature deal with event detection from socialnetworks. Many works deal with large-scale event detection [6],[25]–[28] and only a few works focus on small-scale events [9],[12], [24], [29]–[31].Regarding large-scale event detection, Sakaki et al. [6] useTwitter streams to detect earthquakes and typhoons, by monitoringspecial trigger-keywords, and by applying an SVM as abinary classifier of positive events (earthquakes and typhoons)and negative events (non-events or other events). In [25],the authors present a method for detecting real-world events,D’ANDREA et al.: REAL-TIME DETECTION OF TRAFFIC FROM TWITTER STREAM ANALYSIS 2271such as natural disasters, by analyzing Twitter streams andby employing both NLP and term-frequency-based techniques.Chew et al. [26] analyze the content of tweets shared during theH1N1 (i.e., swine flu) outbreak, containing keywords and hashtagsrelated to the H1N1 event to determine the kind of informationexchanged by social media users. De Longueville et al.[27] analyze geo-tagged tweets to detect forest fire events andoutline the affected area.Regarding small-scale event detection, Agarwal et al. [29]focus on the detection of fires in a factory from Twitter streamanalysis, by using standard NLP techniques and a Naive Bayes(NB) classifier. In [30], information extracted from Twitterstreams is merged with information from emergency networksto detect and analyze small-scale incidents, such as fires.Wanichayapong et al. [12] extract, using NLP techniques andsyntactic analysis, traffic information from microblogs to detectand classify tweets containing place mentions and trafficinformation. Li et al. [31] propose a system, called TEDAS, toretrieve incident-related tweets. 
The system focuses on Crimeand Disaster-related Events (CDE) such as shootings, thunderstorms,and car accidents, and aims to classify tweets asCDE events by exploiting a filtering based on keywords, spatialand temporal information, number of followers of the user,number of retweets, hashtags, links, and mentions. Sakaki et al.[9] extract, based on keywords, real-time driving informationby analyzing Twitter’s SUMs, and use an SVM classifierto filter “noisy” tweets not related to road traffic events.Schulz et al. [24] detect small-scale car incidents from Twitterstream analysis, by employing semantic web technologies,along with NLP and machine learning techniques. They performthe experiments using SVM, NB, and RIPPER classifiers.In this paper, we focus on a particular small-scale event, i.e.,road traffic, and we aim to detect and analyze traffic eventsby processing users’ SUMs belonging to a certain area andwritten in the Italian language. To this aim, we propose a systemable to fetch, elaborate, and classify SUMs as related to a roadtraffic event or not. To the best of our knowledge, few papershave been proposed for traffic detection using Twitter streamanalysis. However, with respect to our work, all of them focuson languages different from Italian, employ different inputfeatures and/or feature selection algorithms, and consider onlybinary classifications. In addition, a few works employ machinelearning algorithms [9], [24], while the others rely on NLPtechniques only. The proposed system may approach both binaryand multi-class classification problems. As regards binaryclassification, we consider traffic-related tweets, and tweets notrelated with traffic. As regards multi-class classification, wesplit the traffic-related class into two classes, namely trafficcongestion or crash, and traffic due to external event. In thispaper, with external event we refer to a scheduled event (e.g.,a football match, a concert), or to an unexpected event (e.g.,a flash-mob, a political demonstration, a fire). In this way weaim to support traffic and city administrations for managingscheduled or unexpected events in the city.Moreover, the proposed system could work together withother traffic sensors (e.g., loop detectors, cameras, infraredcameras) and ITS monitoring systems for the detection of trafficdifficulties, providing a low-cost wide coverage of the roadFig. 1. System architecture for traffic detection from Twitter stream analysis.network, especially in those areas (e.g., urban and suburban)where traditional traffic sensors are missing.Concluding, the proposed ITS is characterized by the followingstrengths with respect to the current research aimed atdetecting traffic events from social networks: i) it performs amulti-class classification, which recognizes non-traffic, trafficdue to congestion or crash, and traffic due to external events;ii) it detects the traffic events in real-time; and iii) it is developedas an event-driven infrastructure, built on an SOA architecture.As regards the first strength, the proposed ITS could be a valuabletool for traffic and city administrations to regulate trafficand vehicular mobility, and to improve the management ofscheduled or unexpected events. For what concerns the secondstrength, the real-time detection capability allows obtaining reliableinformation about traffic events in a very short time, oftenbefore online news web sites and local newspapers. 
As far as thethird strength is concerned, with the chosen architecture, we areable to directly notify the traffic event occurrence to the driversregistered to the system, without the need for them to access officialnews websites or radio traffic news channels, to get trafficinformation. In addition, the SOA architecture permits to exploittwo important peculiarities, i.e., scalability of the service(e.g., by using a dedicated server for each geographic area), andeasy integration with other services (e.g., other ITS services).III. ARCHITECTURE OF THE TRAFFIC DETECTION SYSTEMIn this section, our traffic detection system based onTwitter streams analysis is presented. The system architectureis service-oriented and event-driven, and is composed of threemain modules, namely: i) “Fetch of SUMs and Pre-processing”,ii) “Elaboration of SUMs”, iii) “Classification of SUMs”. Thepurpose of the proposed system is to fetch SUMs from Twitter,to process SUMs by applying a few text mining steps, andto assign the appropriate class label to each SUM. Finally, asshown in Fig. 1, by analyzing the classified SUMs, the systemis able to notify the presence of a traffic event.The main tools we have exploited for developing the systemare: 1) Twitter’s API,4 which provides direct access to the4http://dev.twitter.com2272 IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 16, NO. 4, AUGUST 2015public stream of tweets; 2) Twitter4J,5 a Java library that weused as a wrapper for Twitter’s API; 3) the Java API providedbyWeka (Waikato Environment for Knowledge Analysis) [32],which we mainly employed for data pre-processing and textmining elaboration.We recall that both the “Elaboration of SUMs” and the“Classification of SUMs” modules require setting the optimalvalues of a few specific parameters, by means of a supervisedlearning stage. To this aim, we exploited a training setcomposed by a set of SUMs previously collected, elaborated,and manually labeled. Section IV describes in greater detailhow the specific parameters of each module are set during thesupervised learning stage.In the following, we discuss in depth the elaboration madeon the SUMs by each module of the traffic detection system.A. Fetch of SUMs and Pre-ProcessingThe first module, “Fetch of SUMs and Pre-processing”,extracts raw tweets from the Twitter stream, based on one ormore search criteria (e.g., geographic coordinates, keywordsappearing in the text of the tweet). Each fetched raw tweet contains:the user id, the timestamp, the geographic coordinates,a retweet flag, and the text of the tweet. The text may containadditional information, such as hashtags, links, mentions, andspecial characters. In this paper, we took only Italian languagetweets into account. However, the system can be easily adaptedto cope with different languages.After the SUMs have been fetched according to the specificsearch criteria, SUMs are pre-processed. In order to extract onlythe text of each raw tweet and remove all meta-informationassociated with it, a Regular Expression filter [33] is applied.More in detail, the meta-information discarded are: user id,timestamp, geographic coordinates, hashtags, links, mentions,and special characters. Finally, a case-folding operation isapplied to the texts, in order to convert all characters to lowercase. At the end of this elaboration, each fetched SUM appearsas a string, i.e., a sequence of characters. We denote the jthSUM pre-processed by the first module as SUMj , with j =1, . . . , N, where N is the total number of fetched SUMs.B. 
Elaboration of SUMsThe second processing module, “Elaboration of SUMs”, isdevoted to transforming the set of pre-processed SUMs, i.e., aset of strings, in a set of numeric vectors to be elaborated bythe “Classification of SUMs” module. To this aim, some textmining techniques are applied in sequence to the pre-processedSUMs. In the following, the text mining steps performed in thismodule are described in detail:a) tokenization is typically the first step of the text miningprocess, and consists in transforming a stream of charactersinto a stream of processing units called tokens (e.g.,syllables, words, or phrases). During this step, other operationsare usually performed, such as removal of punctua-5http://twitter4j.orgtion and other non-text characters [18], and normalizationof symbols (e.g., accents, apostrophes, hyphens, tabs andspaces). In the proposed system, the tokenizer removesall punctuation marks and splits each SUM into tokenscorresponding to words (bag-of-words representation). Atthe end of this step, each SUMj is represented as thesequence of words contained in it. We denote the jthtokenized SUM as SUMTj =_tTj1, . . . , tTjh, . . . , tTjHj_,where tTjh is the hth token and Hj is the total numberof tokens in SUMTj ;b) stop-word filtering consists in eliminating stop-words,i.e., words which provide little or no information to thetext analysis. Common stop-words are articles, conjunctions,prepositions, pronouns, etc. Other stop-words arethose having no statistical significance, that is, those thattypically appear very often in sentences of the consideredlanguage (language-specific stop-words), or in the set oftexts being analyzed (domain-specific stop-words), andcan therefore be considered as noise [34]. The authorsin [35] have shown that the 10 most frequent wordsin texts and documents of the English language areabout the 20–30% of the tokens in a given document.In the proposed system, the stop-word list for the Italianlanguage was freely downloaded from the SnowballTartarus website6 and extended with other ad hoc definedstop-words. At the end of this step, each SUMis thus reduced to a sequence of relevant tokens. Wedenote the jth stop-word filtered SUM as SUMSW_ j =tSWj1 , . . . , tSWjk , . . . , tSWjKj_, where tSWjk is the kth relevanttoken and Kj , with Kj ≤ Hj , is the total numberof relevant tokens in SUMSWj . We recall that a relevanttoken is a token that does not belong to the set of stopwords;c) stemming is the process of reducing each word (i.e.,token) to its stem or root form, by removing its suffix. Thepurpose of this step is to group words with the same themehaving closely related semantics. In the proposed system,the stemmer exploits the Snowball Tartarus Stemmer7 forthe Italian language, based on the Porter’s algorithm [36].Hence, at the end of this step each SUM is represented asa sequence of stems extracted from the tokens containedin it. We denote the jth stemmed SUM as SUMS_ j =tSj1, . . . , tSjl, . . . , tSjLj_, where tSjl is the lth stem and Lj ,with Lj ≤ Kj , is the total number of stems in SUMSj ;d) stem filtering consists in reducing the number of stems ofeach SUM. In particular, each SUM is filtered by removingfrom the set of stems the ones not belonging to theset of relevant stems. The set of F relevant stems RS ={ˆs1, . . . , ˆsf , . . . , ˆsF } is identified during the supervisedlearning stage that will be discussed in Section IV.At the end of this step, each SUM is represented asa sequence of relevant stems. 
We denote the jth filtered SUM as $SUM^{SF}_j = \{t^{SF}_{j1}, \ldots, t^{SF}_{jp}, \ldots, t^{SF}_{jP_j}\}$, where $t^{SF}_{jp} \in RS$ is the pth relevant stem and $P_j$, with $P_j \le L_j$ and $P_j \le F$, is the total number of relevant stems in $SUM^{SF}_j$;
(Footnotes 6, 7: http://snowball.tartarus.org/algorithms/italian/stop.txt and http://snowball.tartarus.org/algorithms/italian/stemmer.html.)
e) feature representation consists in building, for each SUM, the corresponding vector of numeric features. Indeed, in order to classify the SUMs, we have to represent them in the same feature space. In particular, we consider the F-dimensional set of features $X = \{X_1, \ldots, X_f, \ldots, X_F\}$ corresponding to the set of relevant stems. For each $SUM^{SF}_j$ we define the vector $x_j = \{x_{j1}, \ldots, x_{jf}, \ldots, x_{jF}\}$, where each element is set according to the following formula:
$$x_{jf} = \begin{cases} w_f & \text{if stem } \hat{s}_f \in SUM^{SF}_j \\ 0 & \text{otherwise} \end{cases} \quad (1)$$
In (1), $w_f$ is the numeric weight associated with the relevant stem $\hat{s}_f$; we discuss how this weight is computed in Section IV.
Fig. 2. Steps of the text mining elaboration applied to a sample tweet.
In Fig. 2, we summarize all the steps applied to a sample tweet by the "Elaboration of SUMs" module.
C. Classification of SUMs
The third module, "Classification of SUMs", assigns to each elaborated SUM a class label related to traffic events. Thus, the output of this module is a collection of N labeled SUMs. To label each SUM, a classification model is employed. The parameters of the classification model have been identified during the supervised learning stage. As will be discussed in Section V, different classification models have been considered and compared. The classifier that achieved the most accurate results was finally employed for the real-time monitoring with the proposed traffic detection system. The system continuously monitors a specific region and notifies the presence of a traffic event on the basis of a set of rules that can be defined by the system administrator. For example, when the first tweet is recognized as a traffic-related tweet, the system may send a warning signal. Then, the actual notification of the traffic event may be sent after the identification of a certain number of tweets with the same label.
IV. SETUP OF THE SYSTEM
As stated previously, a supervised learning stage is required to perform the setup of the system. In particular, we need to identify the set of relevant stems, the weights associated with each of them, and the parameters that describe the classification models. We employ a collection of $N_{tr}$ labeled SUMs as training set. During the learning stage, each SUM is elaborated by applying the tokenization, stop-word filtering, and stemming steps. Then, the complete set of stems is built as follows:
$$CS = \bigcup_{j=1}^{N_{tr}} SUM^{S}_j = \{s_1, \ldots, s_q, \ldots, s_Q\}. \quad (2)$$
CS is the union of all the stems extracted from the $N_{tr}$ SUMs of the training set. We recall that $SUM^{S}_j$ is the set of stems that describes the jth SUM after the stemming step in the training set.
Then, we compute the weight of each stem in CS, which allows us to establish the importance of each stem $s_q$ in the collection of SUMs of the training set, by using the Inverse Document Frequency (IDF) index as
$$w_q = IDF_q = \ln(N_{tr}/N_q), \quad (3)$$
where $N_q$ is the number of SUMs of the training set in which the stem $s_q$ occurs [37]. The IDF index is a simplified version of the TF-IDF (Term Frequency-IDF) index [38]–[40], where the TF part considers the frequency of a specific stem within each SUM.
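As a minimal sketch of the training-time elaboration just formalized in Equations (1), (2), and (3), the following code builds the stem set CS, computes the IDF weights w_q = ln(N_tr/N_q), and maps a single SUM onto its feature vector x_j. The stop-word list is abbreviated and the stem method is a placeholder for the Snowball Italian stemmer cited above, so this is an illustration of the described steps, not the authors' implementation.

import java.util.*;

public class ElaborationSketch {
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("il", "la", "di", "e", "a"));

    // Placeholder for the Snowball Italian stemmer used by the system.
    static String stem(String token) { return token; }

    // Tokenization + stop-word filtering + stemming for one pre-processed SUM.
    static Set<String> elaborate(String sum) {
        Set<String> stems = new HashSet<>();
        for (String token : sum.toLowerCase().split("[^\\p{L}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) stems.add(stem(token));
        }
        return stems;
    }

    // IDF weights over the training set: w_q = ln(N_tr / N_q), as in Eq. (3).
    static Map<String, Double> idfWeights(List<String> trainingSums) {
        Map<String, Integer> df = new HashMap<>();
        for (String sum : trainingSums) {
            for (String s : elaborate(sum)) df.merge(s, 1, Integer::sum);
        }
        Map<String, Double> w = new HashMap<>();
        int nTr = trainingSums.size();
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            w.put(e.getKey(), Math.log((double) nTr / e.getValue()));
        }
        return w;
    }

    // Feature vector of Eq. (1): x_jf = w_f if the relevant stem occurs in the SUM, 0 otherwise.
    static double[] featureVector(String sum, List<String> relevantStems, Map<String, Double> w) {
        Set<String> stems = elaborate(sum);
        double[] x = new double[relevantStems.size()];
        for (int f = 0; f < relevantStems.size(); f++) {
            String s = relevantStems.get(f);
            x[f] = stems.contains(s) ? w.getOrDefault(s, 0.0) : 0.0;
        }
        return x;
    }
}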
In fact, we heuristically found that the same stem seldomappears more than once in an SUM. On the other hand, we performedseveral experiments also with the TF-IDF index and we2274 IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 16, NO. 4, AUGUST 2015verified that the performance in terms of classification accuracyis similar to the one obtained by using only the IDF index. Thus,we decided to adopt the simpler IDF index as weight.In order to select the set of relevant stems, a feature selectionalgorithm is applied. SUMs are described by a set{S1, . . . , Sq, . . . , SQ} of Q features, where each feature Sqcorresponds to the stem sq. The possible values of feature Sqare wq and 0.Then, as suggested in [41], to evaluate the quality of eachstem sq, we employ a method based on the computation ofthe Information Gain (IG) value between feature Sq and outputC = {c1, . . . , cr, . . . , cR}, where cr is one of the R possibleclass labels (two or three in our case). The IG value between Sqand C is calculated as IG(C, Sq) = H(C) − H(C|Sq), whereH(C) represents the entropy of C, and H(C|Sq) represents theentropy of C after the observation of feature Sq.Finally, we identified the set of relevant stems RS by selectingall the stems which have a positive IG value. We recall thatthe stem selection process based on IG values is a standard andeffective method widely used in the literature [40], [42].The last part of the supervised learning stage regards theidentification of the most suited classification models and thesetting of their structural parameters. We took into accountseveral classification algorithms widely used in the literaturefor text classification tasks [43], namely, i) SVM [44], ii) NB[45], iii) C4.5 decision tree [46], iv) k-nearest neighbor (kNN)[47], and v) PART [48]. The learning algorithms used to buildthe aforementioned classifiers will be briefly discussed in thefollowing section.V. EVALUATION OF THE TRAFFIC DETECTION SYSTEMIn this section, we discuss the evaluation of the proposedsystem. We performed several experiments using two differentdatasets. For each dataset, we built and compared seven differentclassification models: SVM, NB, C4.5, kNN (with k equalto 1, 2, and 5), and PART. In the following, we describe howwe generated the datasets to complete the setup of the system,and we recall the employed classification models. Then, wepresent the achieved results, and the statistical metrics used toevaluate the performance of the classifiers. Finally, we providea comparison with some results extracted from other works inthe literature.A. Description of the DatasetsWe built two different datasets, i.e., a 2-class dataset, and a3-class dataset. For each dataset, tweets in the Italian languagewere collected using the “Fetch of SUMs and Pre-processing”module by setting some search criteria (e.g., presence of keywords,geographic coordinates, date and time of posting). Then,the SUMs were manually labeled, by assigning the correct classlabel.1) 2-Class Dataset: The first dataset consists of tweetsbelonging to two possible classes, namely i) road traffic-relatedtweets (traffic class), and ii) tweets not related with road traffic(non-traffic class). The tweets were fetched in a time span ofabout four hours from the same geographic area. First, wefetched candidate tweets for traffic class by using the followingsearch criteria:— geographic area of origin of the tweet: Italy. 
We setthe center of the area in Rome (latitude and longitudeequal to 41◦ 53’ 35” and 12◦ 28’ 58”, respectively)and we set a radius of about 600 km to cover approximatelythe whole country;— time and date of posting: tweets belong to a timespan of four evening hours of two weekend days ofMay 2013;— keywords contained in the text of the tweet: we applythe or-operator on the set of keywords S1, composedby the three most frequently used traffic-relatedkeywords, S1 = {“traffico”(traffic), “coda”(queue),“incidente”(crash)} , with the aim of selecting tweetscontaining at least one of the above keywords. Theresulting condition can be expressed by:CondA: “traffico” or “coda” or “incidente”.Then, we fetched the candidate tweets for non-traffic classusing the same search criteria for geographic area, and timeand date, but without setting any keyword. Obviously, this time,tweets containing traffic-related keywords from set S1, alreadyfound in the previous fetch, were discarded.Finally, the tweets were manually labeled with two possibleclass labels, i.e., as related to road traffic event (traffic), e.g.,accidents, jams, queues, or not (non-traffic). More in detail,first we read, interpreted, and correctly assigned a traffic classlabel to each candidate traffic class tweet. Among all candidatetraffic class tweets, we actually labeled 665 tweets with thetraffic class. About 4% of candidate traffic class tweets werenot labeled with the traffic class label.With the aim of correctlytraining the system, we added these tweets to the non-trafficclass. Indeed, we collected also a number of tweets containingthe traffic-related keywords defined in S1, but actually notconcerning road traffic events. Such tweets are related to, e.g.,illegal drug trade, network traffic, or organ trafficking. It isworth noting that, as it happens in the English language, severalwords in the Italian language, e.g., “traffic” or “incident”, aresuitable in several contexts. So, for instance, the events “trafficodi droga” (drug trade), “traffico di organi” (organ trafficking),“incidente diplomatico” (diplomatic scandal), “traffico dati”(network traffic) could be easily mistaken for road trafficrelatedevents.Then, in order to obtain a balanced dataset, we randomlyselected tweets from the candidate tweets of non-traffic classuntil reaching 665 non-traffic class tweets, and we manuallyverified that the selected tweets did not belong to the trafficclass. Thus, the final 2-class dataset consists of 1330 tweets andis balanced, i.e., it contains 665 tweets per class.Table I shows the textual part of a selection of tweets fetchedby the system with the corresponding, manually added, classlabel. In Table I, tweets #1, #2 and #3 are examples of trafficclass tweets, tweet #4 is an example of a non-traffic class tweet,tweets #5 and #6 are examples of tweets containing trafficrelatedkeywords, but belonging to the non-traffic class. In thetable, for an easier understanding, the keywords appearing inD’ANDREA et al.: REAL-TIME DETECTION OF TRAFFIC FROM TWITTER STREAM ANALYSIS 2275TABLE ISOME EXAMPLES OF TWEETS AND CORRESPONDING CLASSES FOR THE 2-CLASS DATASETTABLE IISIGNIFICANT FEATURES RELATED TO THE TRAFFIC CLASSthe text of each tweet are underlined. Table II shows some of themost important textual features (i.e., stems) and their meaning,related to the traffic class tweets, identified by the system forthis dataset.2) 3-Class Dataset: The second dataset consists of tweetsbelonging to three possible classes. 
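Before detailing the 3-class dataset, the fetch and pre-processing steps just described for the 2-class dataset can be sketched with the Twitter4J wrapper mentioned in Section III. This is a hedged illustration: exact Twitter4J method signatures vary across library versions, the coordinates and radius are the ones reported above for the area centered on Rome, and the regular expressions only approximate the pre-processing filter of the real system.

import java.util.ArrayList;
import java.util.List;
import twitter4j.*;

public class TrafficTweetFetcher {
    // Strips links, mentions, hashtags and special characters, then folds case,
    // approximating the "Fetch of SUMs and Pre-processing" module.
    static String preprocess(String raw) {
        return raw.replaceAll("(?i)https?://\\S+", " ")   // links
                  .replaceAll("[@#]\\w+", " ")            // mentions and hashtags
                  .replaceAll("[^\\p{L}\\s]", " ")        // special characters
                  .toLowerCase()
                  .replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) throws TwitterException {
        Twitter twitter = TwitterFactory.getSingleton();   // credentials via twitter4j.properties
        // CondA: "traffico" or "coda" or "incidente"; Italian tweets around Rome, ~600 km radius
        Query query = new Query("traffico OR coda OR incidente");
        query.setLang("it");
        query.setGeoCode(new GeoLocation(41.893, 12.483), 600.0, Query.KILOMETERS);

        List<String> sums = new ArrayList<>();
        for (Status status : twitter.search(query).getTweets()) {
            sums.add(preprocess(status.getText()));
        }
        sums.forEach(System.out::println);
    }
}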
In this case we want todiscriminate if traffic is caused by an external event (e.g., a footballmatch, a concert, a flash-mob, a political demonstration,a fire) or not. Even though the current release of the systemwas not designed to identify the specific event, knowing thatthe traffic difficulty is caused by an external event could beuseful to traffic and city administrations, for regulating trafficand vehicular mobility, or managing scheduled events in thecity. More in detail, we took into account four possible externalevents, namely, i) matches, ii) processions, iii) music concerts,and iv) demonstrations. Thus, in this dataset the three possibleclasses are: i) traffic due to external event, ii) traffic congestionor crash, and iii) non-traffic. The tweets were fetched in asimilar way as described before. More in detail, first, we fetchedcandidate road traffic-related tweets due to an external event(traffic due to external event class) according to the followingsearch criteria:— geographic area of origin of the tweet: Italy, parametersset as in the case of the 2-class dataset;— time and date of posting: parameters set as in the caseof the 2-class dataset, but different hours of the sameweekend days are used;— keywords contained in the text of the tweet: foreach external event aforementioned, we took into accountonly one keyword, thus obtaining the set S2 ={“partita”(match), “processione” (procession), “concerto”(concert), “manifestazione” (demonstration)}.Next we combined each keyword representing theexternal event with one of the traffic-related keywordsfrom set S3 = {“traffico”(traffic), “coda”(queue)}.Finally, we applied the and-operator between eachkeyword from set S2 and the conditionCondB expressed as:CondB: “traffico” or “coda”,thus obtaining the following conditions:CondC: CondB and “partita”,CondD: CondB and “processione”CondE: CondB and “concerto”,CondF : CondB and “manifestazione”.Then, we fetched candidate tweets related to traffic congestion,crashes, and jams (traffic congestion or crash class) byusing the following search criteria:— geographic area of origin of the tweet: Italy, parametersset as as in the case of the 2-class dataset;— time and date of posting: parameters set as in the caseof the 2-class dataset, but different hours of the sameweekend days are used;— keywords contained in the text of the tweet: we combinedthe mentioned above keywords from set S1 inthree possible sets: S4={“traffico”(traffic), “incidente”(crash)}, S5 = {“incidente”(crash), coda(queue)},and the already defined set S3. Then we used theand-operator to define the exploited conditions asfollows:CondG: “traffico” and “incidente”,CondH: “traffico” and “coda”,CondI : “incidente” and “coda”.Obviously, as done before, tweets containing external eventrelatedkeywords, already found in the previous fetch, werediscarded. Further, we fetched the candidate tweets of nontrafficclass using the same search criteria for geographic area,and time and date, but without setting any keyword. Again,tweets already found in previous fetches were discarded.Finally, the tweets were manually labeled with three possibleclass labels. We first labeled the candidate tweets of trafficdue to external event class (this set of tweets was the smallerone), and we identified 333 tweets actually associated with thisclass. Then, we randomly selected 333 tweets for each of the2276 IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 16, NO. 
4, AUGUST 2015TABLE IIISOME EXAMPLES OF TWEETS AND CORRESPONDING CLASSES FOR THE 3-CLASS DATASETtwo remaining classes. Also, in this case, we manually verifiedthe correctness of the labels associated to the selected tweets.Finally, as done before, we added to the non-traffic class alsotweets containing keywords related to traffic congestion and toexternal events but not concerning road-traffic events. The final3-class dataset consists of 999 tweets and it is balanced, i.e., ithas 333 tweets per class.Table III shows a selection of tweets fetched by the systemfor the 3-class dataset, with the corresponding, manually added,class label. In Table III, tweets #1, #2, #3 and #4 are examplesof tweets belonging to the class traffic due to external event: inmore detail, #1 is related to a procession event, #2 is relatedto a match event, #3 is related to a concert event, and #4is related to a demonstration event. Tweet #5 is an exampleof a tweet belonging to the class traffic congestion or crash,while tweets #6 and #7 are examples of non-traffic class tweets.Words underlined in the text of each tweet represent involvedkeywords.B. Employed Classification ModelsIn the following we briefly describe the main properties ofthe employed and experimented classification models.SVMs, introduced for the first time in [49], are discriminativeclassification algorithms based on a separating hyper-planeaccording to which new samples can be classified. The besthyper-plane is the one with the maximum margin, i.e., thelargest minimum distance, from the training samples and iscomputed based on the support vectors (i.e., samples of thetraining set). The SVM classifier employed in this work is theimplementation described in [44].The NB classifier [45] is a probabilistic classification algorithmbased on the application of the Bayes’s theorem, andis characterized by a probability model which assumes independenceamong the input features. In other words, the modelassumes that the presence of a particular feature is unrelated tothe presence of any other feature.The C4.5 decision tree algorithm [46] generates a classificationdecision tree by recursively dividing up the training dataaccording to the values of the features. Non-terminal nodesof the decision tree represent tests on one or more features,while terminal nodes represent the predicted output, namely theclass. In the resulting decision tree each path (from the rootto a leaf) identifies a combination of feature values associatedwith a particular classification. At each level of the tree, thealgorithm chooses the feature that most effectively splits thedata, according to the highest information gain.The kNN algorithm [50] belongs to the family of “lazy”classification algorithms. The basic functioning principle is thefollowing: each unseen sample is compared with a number ofpre-classified training samples, and its similarity is evaluatedaccording to a simple distance measure (e.g., we employed thenormalized Euclidean distance), in order to find the associatedoutput class. The parameter k allows specifying the number ofneighbors, i.e., training samples to take into account for theclassification. We focus on three kNN models with k equal to1, 2, and 5. The kNN classifier employed in this work followsthe implementation described in [47].The PART algorithm [48] combines two rule generationmethods, i.e., RIPPER [51] and C4.5 [46]. 
It infers classificationrules by repeatedly building partial, i.e., incomplete,C4.5 decision trees and by using the separate-and-conquer rulelearning technique [52].C. Experimental ResultsIn this section, we present the classification results achievedby applying the classifiers mentioned in Section V-B to thetwo datasets described in Section V-A. For each classifier theexperiments were performed using an n-fold stratified crossvalidationmethodology. In n-fold stratified cross-validation,the dataset is randomly partitioned into n folds and the classesin each fold are represented with the same proportion as inthe original data. Then, the classification model is trained onn − 1 folds, and the remaining fold is used for testing themodel. The procedure is repeated n times, using as test dataeach of the n folds exactly once. The n test results are finallyaveraged to produce an overall estimation. We repeated ann-fold stratified cross-validation, with n = 10, for two times,using two different seed values to randomly partition the datainto folds.D’ANDREA et al.: REAL-TIME DETECTION OF TRAFFIC FROM TWITTER STREAM ANALYSIS 2277TABLE IVSTATISTICAL METRICSWe recall that, for each fold, we consider a specific trainingset which is used in the supervised learning stage to learnboth the pre-processing (i.e., the set of relevant stems and theirweights) and the classification model parameters.To evaluate the achieved results, we employed the mostfrequently used statistical metrics, i.e., precision, accuracy,recall, and F-score. To explain the meaning of the metrics,we will refer, for the sake of simplicity, to the case of abinary classification, i.e., positive class versus negative class.In fact, in the case of a multi-class classification, the metricsare computed per class and the overall statistical measure issimply the average of the per-class measures. The correctness ofa classification can be evaluated according to four values: i) truepositives (TP): the number of real positive samples correctlyclassified as positive; ii) true negatives (TN): the number ofreal negative samples correctly classified as negative; iii) falsepositives (FP): the number of real negative samples incorrectlyclassified as positive; iv) false negatives (FN): the number ofreal positive samples incorrectly classified as negative.Based on the previous definitions, we can now formallydefine the employed statistical metrics and provide, in Table IV,the corresponding equations. Accuracy represents the overalleffectiveness of the classifier and corresponds to the number ofcorrectly classified samples over the total number of samples.Precision is the number of correctly classified samples of aclass, i.e., positive class, over the number of samples classifiedas belonging to that class. Recall is the number of correctlyclassified samples of a class, i.e., positive class, over the numberof samples of that class; it represents the effectiveness of theclassifier to identify positive samples. The F-score (typicallyused with β = 1 for class-balanced datasets) is the weightedharmonic mean of precision and recall and it is used to comparedifferent classifiers.In the first experiment, we performed a classification oftweets using the 2-class dataset (R = 2) consisting of 1330tweets, described in Section V-A. The aim is to assign a classlabel (traffic or non-traffic) to each tweet.Table V shows the average classification results obtained bythe classifiers on the 2-class dataset. 
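Assuming the labeled SUMs have already been converted into feature vectors and saved in Weka's ARFF format (the file name and the position of the class attribute below are illustrative assumptions), the repeated stratified 10-fold cross-validation and the per-class metrics described above could be reproduced with the Weka Java API roughly as follows; this is a sketch of the protocol, not the authors' experimental code.

import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.SMO;
import weka.classifiers.lazy.IBk;
import weka.classifiers.rules.PART;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ClassifierComparison {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("tweets-2class.arff");   // hypothetical dataset file
        data.setClassIndex(data.numAttributes() - 1);              // class label as last attribute

        Classifier[] models = { new SMO(), new NaiveBayes(), new J48(),
                                new IBk(1), new IBk(2), new IBk(5), new PART() };

        for (Classifier model : models) {
            for (int seed : new int[] {1, 2}) {                    // 2 x 10-fold stratified CV
                Evaluation eval = new Evaluation(data);
                eval.crossValidateModel(model, data, 10, new Random(seed));
                // per-class precision/recall/F-score shown here for class index 0
                System.out.printf("%s seed=%d acc=%.2f%% P=%.3f R=%.3f F=%.3f%n",
                        model.getClass().getSimpleName(), seed, eval.pctCorrect(),
                        eval.precision(0), eval.recall(0), eval.fMeasure(0));
            }
        }
    }
}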
More in detail, the tableshows for each classifier, the accuracy, and the per-class valueof recall, precision, and F-score. All the values are averagedover the 20 values obtained by repeating two times the 10-foldcross validation. The best classifier resulted to be the SVM withan average accuracy of 95.75%.As Table VI clearly shows, the results achieved by our SVMclassifier appreciably outperform those obtained in similarworks in the literature [9], [12], [24], [31] despite they refer todifferent datasets. More precisely, Wanichayapong et al. [12]obtained an accuracy of 91.75% by using an approach thatconsiders the presence of place mentions and special keywordsin the tweet. Li et al. [31] achieved an accuracy of 80% fordetecting incident-related tweets using Twitter specific features,such as hashtags, mentions, URLs, and spatial and temporalinformation. Sakaki et al. [9] employed an SVM to identifyheavy-traffic tweets and obtained an accuracy of 87%. Finally,Schulz et al. [24], by using SVM, RIPPER, and NB classifiers,obtained accuracies of 89.06%, 85.93%, and 86.25%, respectively.In the case of SVM, they used the following features:word n-grams, TF-IDF score, syntactic and semantic features.In the case of NB and RIPPER they employed the same set offeatures except semantic features.In the second experiment, we performed a classificationof tweets over three classes (R = 3), namely, traffic due toexternal event, traffic congestion or crash, and non-traffic, withthe aim of discriminating the cause of traffic. Thus, we employedthe 3-class dataset consisting of 999 tweets, describedin Section V-A. We employed again the classifiers previouslyintroduced and the obtained results are shown in Table VII.The best classifier resulted to be again SVM with an averageaccuracy of 88.89%.In order to verify if there exist statistical differences amongthe values of accuracy achieved by the seven classificationmodels, we performed a statistical analysis of the results. Wetook into account the model which obtains the best averageaccuracy, i.e., the SVM model. As suggested in [53], we appliednon-parametric statistical tests: for each classifier we generateda distribution consisting of the 20 values of the accuracieson the test set obtained by repeating two times the 10-foldcross validation. We statistically compared the results achievedby the SVM model with the ones achieved by the remainingmodels. We applied the Wilcoxon signed-rank test [54], whichdetects significant differences between two distributions. In allthe tests, we used α = 0.05 as level of significance. Tables VIIIand IX show the results of the Wilcoxon signed-rank test, relatedto the 2-class and the 3-class problems, respectively. In thetables R+ and R− denote, respectively, the sum of ranks for thefolds in which the first model outperformed the second, andthe sum of ranks for the opposite condition. Since the p-valuesare always lower than the level of significance we can alwaysreject the statistical hypothesis of equivalence. For this reason,we can state that the SVM model statistically outperforms allthe other approaches on both the problems.VI. REAL-TIME DETECTION OF TRAFFIC EVENTSThe developed system was installed and tested for the realtimemonitoring of several areas of the Italian road network,by means of the analysis of the Twitter stream coming fromthose areas. 
The aim is to perform a continuous monitoring offrequently busy roads and highways in order to detect possibletraffic events in real-time or even in advance with respect to thetraditional news media [55], [56]. The system is implemented as2278 IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 16, NO. 4, AUGUST 2015TABLE VCLASSIFICATION RESULTS ON THE 2-CLASS DATASET (BEST VALUES IN BOLD)TABLE VIRESULTS OF THE CLASSIFICATION OF TWEETS IN OTHER WORKS IN THE LITERATURETABLE VIICLASSIFICATION RESULTS ON THE 3-CLASS DATASET (BEST VALUES IN BOLD)TABLE VIIIRESULTS OF THE WILCOXON SIGNED-RANK TEST ON THE ACCURACIESOBTAINED ON THE TEST SET FOR THE 2-CLASS DATASETa service of a wider service-oriented platform to be developedin the context of the SMARTY project [23]. The service canbe called by each user of the platform, who desires to knowTABLE IXRESULTS OF THE WILCOXON SIGNED-RANK TEST ON THE ACCURACIESOBTAINED ON THE TEST SET FOR THE 3-CLASS DATASETthe traffic conditions in a certain area. In this section, weaim to show the effectiveness of our system in determiningtraffic events in short time. We just present some results for theD’ANDREA et al.: REAL-TIME DETECTION OF TRAFFIC FROM TWITTER STREAM ANALYSIS 2279TABLE XREAL-TIME DETECTION OF TRAFFIC EVENTS2-class problem. For the setup of the system, we have employedas training set the overall dataset described in Section V-A.We adopt only the best performing classifier, i.e., the SVMclassifier. During the learning stage, we identified Q = 3227features, which were reduced to F = 582 features after thefeature selection step.2280 IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 16, NO. 4, AUGUST 2015TABLE X(CONTINUED.) REAL-TIME DETECTION OF TRAFFIC EVENTSThe system continuously performs the following operations:it i) fetches, with a time frequency of z minutes, tweets originatedfrom a given area, containing the keywords resulting fromCondA, ii) performs a real-time classification of the fetchedtweets, iii) detects a possible traffic-related event, by analyzingthe traffic class tweets from the considered area, and, if needed,sends one or more traffic warning signals with increasingintensity for that area. More in detail, a first low-intensitywarning signal is sent when m traffic class tweets are foundin the considered area in the same or in subsequent temporalwindows. Then, as the number of traffic class tweets grows,the warning signal becomes more reliable, thus more intense.The value of m was set based on heuristic considerations,depending, e.g., on the traffic density of the monitored area.In the experiments we set m = 1. As regards the fetching frequencyz, we heuristically found that z = 10 minutes representsa good compromise between fast event detection and systemscalability. In fact, z should be set depending on the number ofmonitored areas and on the volume of tweets fetched.With the aim of evaluating the effectiveness of our system,we need that each detected traffic-related event is appropriatelyvalidated. Validation can be performed in different wayswhich include: i) direct communication by a person, who waspresent at the moment of the event, ii) reports drawn up by thepolice and/or local administrations (available only in case ofincidents), iii) radio traffic news; iv) official real-time trafficnews web sites; v) local newspapers (often the day after theevent and only when the event is very significant).Direct communication is possible only if a person is presentat the event and can communicate this event to us. 
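Going back to the fetch, classify, and notify cycle described at the beginning of this section, its scheduling logic can be skeletonized as follows; fetchAndClassify and sendWarning are hypothetical placeholders for the system's fetch/classification modules and for the SMARTY notification service, and the constants simply mirror the z = 10 minutes and m = 1 values reported above.

import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class TrafficMonitor {
    private static final int Z_MINUTES = 10;   // fetching frequency chosen heuristically
    private static final int M_THRESHOLD = 1;  // traffic-class tweets needed for a warning

    private int trafficTweetCount = 0;

    // Placeholder: fetch tweets for the monitored area and return the predicted class labels.
    private List<String> fetchAndClassify() { return List.of(); }

    // Placeholder: push a warning, with growing intensity, to registered users.
    private void sendWarning(int intensity) {
        System.out.println("Traffic warning, intensity " + intensity);
    }

    public void start() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            for (String label : fetchAndClassify()) {
                if ("traffic".equals(label)) trafficTweetCount++;
            }
            if (trafficTweetCount >= M_THRESHOLD) {
                // the more traffic-class tweets accumulate, the more intense the warning
                sendWarning(trafficTweetCount);
            }
        }, 0, Z_MINUTES, TimeUnit.MINUTES);
    }

    public static void main(String[] args) { new TrafficMonitor().start(); }
}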
Although wehave tried to sensitize a number of users, we did not obtain anadequate feedback. Official reports are confidential: police andlocal administrations barely allow accessing to these reports,and, when this permission is granted, reports can be consultedonly after several days. Radio traffic news are in general quiteprecise in communicating traffic-related events in real time. Unfortunately,to monitor and store the events, we should dedicatea person or adopt some tool for audio analysis. We realizedhowever that the traffic-related events communicated on theradio are always mentioned also in the official real-time trafficnews web sites. Actually, on the radio, the speaker typicallyreads the news reported on the web sites. Local newspapersfocus on local traffic-related events and often provide eventswhich are not published on official traffic news web sites.Concluding, official real-time traffic news web sites and localnewspapers are the most reliable and effective sources of informationfor traffic-related events. Thus, we decided to analyzetwo of the most popular real-time traffic news web sites for theD’ANDREA et al.: REAL-TIME DETECTION OF TRAFFIC FROM TWITTER STREAM ANALYSIS 2281Italian road network, namely “CCISS Viaggiare informati”,8managed by the Italian government Ministry for infrastructuresand transports, and “Autostrade per l’Italia”,9 the official website of Italian highway road network. Further, we examinedlocal newspapers published in the zones where our system wasable to detect traffic-related events.Actually, it was really difficult to find realistic data to test theproposed system, basically for two reasons: on the one hand, wehave realized that real traffic events are not always notified inofficial news channels; on the other hand, situations of trafficslowdown may be detected by traditional traffic sensors but,at the same time, may not give rise to tweets. In particular,in relation to this latter reason, it is well known that driversusually share a tweet about a traffic event only when theevent is unexpected and really serious, i.e., it forces to stopthe car. So, for instance, they do not share a tweet in caseof road works, minor traffic difficulties, or usual traffic jams(same place and same time). In fact, in correspondence tominor traffic jams we rarely find tweets coming from the affectedarea.We have tried to build a meaningful set of traffic events,related to some major Italian cities, of which we have found anofficial confirmation. The selected set includes events correctlyidentified by the proposed system and confirmed via officialtraffic news web sites or local newspapers. The set of trafficevents, whose information is summarized in Table X, consistsof 70 events detected by our system. The events are relatedboth to highways and to urban roads, and were detected duringSeptember and early October 2014.Table X shows the information about the event, the time ofdetection from Twitter’s stream fetched by our system, the timeof detection from official news websites or local newspapers,and the difference between these two times. In the table, positivedifferences indicate a late detection with respect to officialnews web sites, while negative differences indicate an earlydetection. The symbol “-” indicates that we found the officialconfirmation of the event by reading local newspapers severalhours late. More precisely, the system detects in advance 20events out of 59 confirmed by news web sites, and 11 eventsconfirmed the day after by local newspapers. 
Regarding the39 events not detected in advance we can observe that 25 ofsuch events are detected within 15 minutes from their officialnotification, while the detection of the remaining 14 eventsoccurs beyond 15 minutes but within 50 minutes. We wish topoint out, however, that, even in the cases of late detection, oursystem directly and explicitly notifies the event occurrence tothe drivers or passengers registered to the SMARTY platform,on which our system runs. On the contrary, in order to get trafficinformation, the drivers or passengers usually need to searchand access the official news websites, which may take sometime and effort, or to wait for getting the information from theradio traffic news.As future work, we are planning to integrate our systemwith an application for analyzing the official traffic news websites, so as to capture traffic condition notifications in real-time.8http://www.cciss.it/9http://www.autostrade.it/autostrade-gis/gis.doThus, our system will be able to signal traffic-related eventsin the worst case at the same time of the notifications on theweb sites. Further, we are investigating the integration of oursystem into a more complex traffic detection infrastructure.This infrastructure may include both advanced physical sensorsand social sensors such as streams of tweets. In particular, socialsensors may provide a low-cost wide coverage of the roadnetwork, especially in those areas (e.g., urban and suburban)where traditional traffic sensors are missing.VII. CONCLUSIONIn this paper, we have proposed a system for real-timedetection of traffic-related events from Twitter stream analysis.The system, built on a SOA, is able to fetch and classify streamsof tweets and to notify the users of the presence of trafficevents. Furthermore, the system is also able to discriminate if atraffic event is due to an external cause, such as football match,procession and manifestation, or not.We have exploited available software packages and state-ofthe-art techniques for text analysis and pattern classification.These technologies and techniques have been analyzed, tuned,adapted and integrated in order to build the overall systemfor traffic event detection. Among the analyzed classifiers, wehave shown the superiority of the SVMs, which have achievedaccuracy of 95.75%, for the 2-class problem, and of 88.89%for the 3-class problem, in which we have also considered thetraffic due to external event class.The best classification model has been employed for realtimemonitoring of several areas of the Italian road network.Wehave shown the results of a monitoring campaign, performed inSeptember and early October 2014. We have discussed the capabilityof the system of detecting traffic events almost in realtime,often before online news web sites and local newspapers.ACKNOWLEDGMENTWe would like to thank Fabio Cempini for the implementationof some parts of the system presented in this paper.REFERENCES[1] F. Atefeh and W. Khreich, “A survey of techniques for event detection inTwitter,” Comput. Intell., vol. 31, no. 1, pp. 132–164, 2015.[2] P. Ruchi and K. Kamalakar, “ET: Events from tweets,” in Proc. 22ndInt. Conf. World Wide Web Comput., Rio de Janeiro, Brazil, 2013,pp. 613–620.[3] A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, andB. Bhattacharjee, “Measurement and analysis of online social networks,”in Proc. 7th ACM SIGCOMM Conf. Internet Meas., San Diego, CA,USA, 2007, pp. 29–42.[4] G. Anastasi et al., “Urban and social sensing for sustainable mobilityin smart cities,” in Proc. IFIP/IEEE Int. 
Conf. Sustainable Internet ICTSustainability, Palermo, Italy, 2013, pp. 1–4.[5] A. Rosi et al., “Social sensors and pervasive services: Approaches andperspectives,” in Proc. IEEE Int. Conf. PERCOM Workshops, Seattle,WA, USA, 2011, pp. 525–530.[6] T. Sakaki, M. Okazaki, and Y.Matsuo, “Tweet analysis for real-time eventdetection and earthquake reporting system development,” IEEE Trans.Knowl. Data Eng., vol. 25, no. 4, pp. 919–931, Apr. 2013.[7] J. Allan, Topic Detection and Tracking: Event-Based InformationOrganization. Norwell, MA, USA: Kluwer, 2002.[8] K. Perera and D. Dias, “An intelligent driver guidance tool using locationbased services,” in Proc. IEEE ICSDM, Fuzhou, China, 2011,pp. 246–251.2282 IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 16, NO. 4, AUGUST 2015[9] T. Sakaki, Y. Matsuo, T. Yanagihara, N. P. Chandrasiri, and K. Nawa,“Real-time event extraction for driving information from social sensors,”in Proc. IEEE Int. Conf. CYBER, Bangkok, Thailand, 2012,pp. 221–226.[10] B. Chen and H. H. Cheng, “A review of the applications of agent technologyin traffic and transportation systems,” IEEE Trans. Intell. Transp.Syst., vol. 11, no. 2, pp. 485–497, Jun. 2010.[11] A. Gonzalez, L. M. Bergasa, and J. J. Yebes, “Text detection and recognitionon traffic panels from street-level imagery using visual appearance,”IEEE Trans. Intell. Transp. Syst., vol. 15, no. 1, pp. 228–238,Feb. 2014.[12] N. Wanichayapong, W. Pruthipunyaskul, W. Pattara-Atikom, andP. Chaovalit, “Social-based traffic information extraction and classification,”in Proc. 11th Int. Conf. ITST, St. Petersburg, Russia, 2011,pp. 107–112.[13] P. M. d’Orey and M. Ferreira, “ITS for sustainable mobility: A surveyon applications and impact assessment tools,” IEEE Trans. Intell. Transp.Syst., vol. 15, no. 2, pp. 477–493, Apr. 2014.[14] K. Boriboonsomsin, M. Barth, W. Zhu, and A. Vu, “Eco-routing navigationsystem based on multisource historical and real-time trafficinformation,” IEEE Trans. Intell. Transp. Syst., vol. 13, no. 4,pp. 1694–1704, Dec. 2012.[15] J. Hurlock and M. L. Wilson, “Searching twitter: Separating the tweetfrom the chaff,” in Proc. 5th AAAI ICWSM, Barcelona, Spain, 2011,pp. 161–168.[16] J. Weng and B.-S. Lee, “Event detection in Twitter,” in Proc. 5th AAAIICWSM, Barcelona, Spain, 2011, pp. 401–408.[17] S. Weiss, N. Indurkhya, T. Zhang, and F. Damerau, Text Mining: PredictiveMethods for Analyzing Unstructured Information. Berlin, Germany:Springer-Verlag, 2004.[18] A. Hotho, A. Nürnberger, and G. Paaß, “A brief survey of text mining,”LDV Forum-GLDV J. Comput. Linguistics Lang. Technol., vol. 20, no. 1,pp. 19–62, May 2005.[19] V. Gupta, S. Gurpreet, and S. Lehal, “A survey of text mining techniquesand applications,” J. Emerging Technol. Web Intell., vol. 1, no. 1,pp. 60–76, Aug. 2009.[20] V. Ramanathan and T. Meyyappan, “Survey of text mining,” in Proc. Int.Conf. Technol. Bus. Manage., Dubai, UAE, 2013, pp. 508–514.[21] M.W. Berry and M. Castellanos, Survey of Text Mining. NewYork,NY,USA: Springer-Verlag, 2004.[22] H. Takemura and K. Tajima, “Tweet classification based on their lifetimeduration,” in Proc. 21st ACM Int. CIKM, Shanghai, China, 2012,pp. 2367–2370.[23] The Smarty project. [Online]. Available: http://www.smarty.toscana.it/[24] A. Schulz, P. Ristoski, and H. Paulheim, “I see a car crash: Real-timedetection of small scale incidents in microblogs,” in The Semantic Web:ESWC 2013 Satellite Events, vol. 7955. Berlin, Germany: Springer-Verlag, 2013, pp. 22–33.[25] M. Krstajic, C. Rohrdantz, M. 
Public Integrity Auditing for Shared Dynamic Cloud Data with Group User Revocation
REAL-TIME BIG DATA ANALYTICAL ARCHITECTURE FOR REMOTE
SENSING APPLICATION
ABSTRACT:
In today's era, there is far more to real-time remote sensing Big Data than first appears, and extracting useful information from remotely collected data in an efficient manner poses major computational challenges in analyzing, aggregating, and storing it. Keeping the above factors in view, there is a need to design a system architecture that supports both real-time and offline data processing. In this paper, we propose a real-time Big Data analytical architecture for remote sensing satellite applications.
The proposed architecture comprises three main units:
1) Remote sensing Big Data acquisition unit (RSDU);
2) Data processing unit (DPU); and
3) Data analysis decision unit (DADU).
First, the RSDU acquires data from the satellite and sends it to the Base Station, where initial processing takes place. Second, the DPU plays a vital role in the architecture for efficient processing of real-time Big Data by providing filtration, load balancing, and parallel processing. Third, the DADU is the upper-layer unit of the proposed architecture, which is responsible for compilation, storage of the results, and generation of decisions based on the results received from the DPU.
INTRODUCTION:
EXISTING SYSTEM:
Existing methods are often inapplicable on standard computers because it is not desirable, or even possible, to load the entire image into memory before doing any processing. In this situation, it is necessary to load only part of the image and process it before saving the result to disk and proceeding to the next part. This corresponds to the concept of on-the-flow processing. Remote sensing processing can be seen as a chain of events or steps; each step is generally independent from the following ones and generally focuses on a particular domain. For example, the image can be radiometrically corrected to compensate for atmospheric effects and indices computed, before an object extraction based on these indices takes place.
The typical processing chain will process the whole image for each step, returning the final result after everything is done. For some processing chains, iterations between the different steps are required to find the correct set of parameters. Due to the variability of satellite images and the variety of the tasks that need to be performed, fully automated tasks are rare. Humans are still an important part of the loop. These concepts are linked in the sense that both rely on the ability to process only one part of the data.
In the case of simple algorithms, this is
quite easy: the input is just split into different non-overlapping pieces that
are processed one by one. But most algorithms do consider the neighborhood of
each pixel. As a consequence, in most cases, the data will have to be split
into partially overlapping pieces. The objective is to obtain the same result
as the original algorithm as if the processing was done in one go. Depending on
the algorithm, this is unfortunately not always possible.
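The following Java sketch illustrates this splitting idea under stated assumptions: the tile size, the one-pixel halo, and the 3x3 mean filter are illustrative choices, not details from the text above. Each tile is copied out together with its overlapping border, filtered independently, and only the interior is written back, so the output matches a single-pass run of the same filter.

// Sketch only: splits a 2-D image into tiles with a halo (overlap) so that a
// neighborhood filter gives the same result as processing the whole image at once.
public class TiledMeanFilter {
    static final int TILE = 256;  // assumed tile size
    static final int HALO = 1;    // a 3x3 neighborhood needs a 1-pixel overlap

    public static double[][] filter(double[][] img) {
        int h = img.length, w = img[0].length;
        double[][] out = new double[h][w];
        for (int ty = 0; ty < h; ty += TILE) {
            for (int tx = 0; tx < w; tx += TILE) {
                // Tile bounds plus halo, clamped to the image borders.
                int y0 = Math.max(ty - HALO, 0), x0 = Math.max(tx - HALO, 0);
                int y1 = Math.min(ty + TILE + HALO, h), x1 = Math.min(tx + TILE + HALO, w);
                double[][] piece = new double[y1 - y0][x1 - x0];
                for (int y = y0; y < y1; y++)
                    for (int x = x0; x < x1; x++)
                        piece[y - y0][x - x0] = img[y][x];
                double[][] filtered = meanFilter(piece);
                // Write back only the interior (the tile itself, not the halo).
                for (int y = ty; y < Math.min(ty + TILE, h); y++)
                    for (int x = tx; x < Math.min(tx + TILE, w); x++)
                        out[y][x] = filtered[y - y0][x - x0];
            }
        }
        return out;
    }

    // Simple 3x3 mean filter used as the stand-in "local processing" step.
    static double[][] meanFilter(double[][] a) {
        int h = a.length, w = a[0].length;
        double[][] r = new double[h][w];
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++) {
                double sum = 0; int n = 0;
                for (int dy = -1; dy <= 1; dy++)
                    for (int dx = -1; dx <= 1; dx++) {
                        int yy = y + dy, xx = x + dx;
                        if (yy >= 0 && yy < h && xx >= 0 && xx < w) { sum += a[yy][xx]; n++; }
                    }
                r[y][x] = sum / n;
            }
        return r;
    }
}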
DISADVANTAGES:
- A reader that loads the image, or part of the image in memory from the file on disk;
- A filter which carries out a local processing that does not require access to neighboring pixels (a simple threshold for example), the processing can happen on CPU or GPU;
- A filter that requires the value of neighboring pixels to compute the value of a given pixel (a convolution filter is a typical example), the processing can happen on CPU or GPU;
- A writer to output the resulting image in memory into a file on disk; note that the file could be written in several steps. We will illustrate in this example how it is possible to compute part of the image through the whole pipeline, incurring only minimal computation overhead; a small sketch of such a pipeline follows this list.
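A minimal reader-filter-writer sketch of such a pipeline is given below. It assumes a raw 8-bit image file and a hypothetical 64 KB chunk size; each chunk is read, passed through a simple threshold filter, and written out before the next chunk is loaded, so the whole image is never held in memory.

// Sketch only (hypothetical raw 8-bit image file and chunk size): a minimal
// reader -> threshold filter -> writer pipeline with on-the-flow processing.
import java.io.*;

public class StreamingThreshold {
    public static void run(File in, File out, int threshold) throws IOException {
        byte[] chunk = new byte[64 * 1024];                  // assumed chunk size
        try (InputStream r = new BufferedInputStream(new FileInputStream(in));
             OutputStream w = new BufferedOutputStream(new FileOutputStream(out))) {
            int n;
            while ((n = r.read(chunk)) > 0) {                // reader: one part of the image
                for (int i = 0; i < n; i++) {                // filter: simple threshold
                    int pixel = chunk[i] & 0xFF;
                    chunk[i] = (byte) (pixel >= threshold ? 255 : 0);
                }
                w.write(chunk, 0, n);                        // writer: file written in steps
            }
        }
    }
}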
PROPOSED SYSTEM:
We present a remote sensing Big Data analytical architecture, which is used to analyze real-time as well as offline data. At first, the data are remotely preprocessed so that they become readable by the machines. Afterward, this useful information is transmitted to the Earth Base Station for further data processing. The Earth Base Station performs two types of processing: processing of real-time data and processing of offline data. In the case of offline data, the data are transmitted to an offline data-storage device. The incorporation of an offline data-storage device allows later reuse of the data, whereas the real-time data are directly transmitted to the filtration and load balancer server, where a filtration algorithm is employed that extracts the useful information from the Big Data.
On the other hand, the load balancer balances the processing power by distributing the real-time data equally among the servers. The filtration and load-balancing server not only filters and balances the load but also enhances the overall system efficiency. Furthermore, the filtered data are processed by the parallel servers and sent to the data aggregation unit (if required, the processed data can be stored in the result storage device) for comparison purposes by the decision and analyzing server. The proposed architecture supports remote-access sensory data as well as direct-access network data (e.g., GPRS, 3G, xDSL, or WAN). The proposed architecture and algorithms are implemented and applied to remote sensing earth observatory data.
The proposed architecture has the capability of dividing, load balancing, and parallel processing of only the useful data. Thus, it efficiently analyzes real-time remote sensing Big Data from an earth observatory system. Furthermore, the proposed architecture has the capability of storing incoming raw data to perform offline analysis on large stored dumps when required. Finally, a detailed analysis of remotely sensed earth observatory Big Data for land and sea areas is provided using .NET. In addition, various algorithms are proposed for each level of RSDU, DPU, and DADU to detect land as well as sea areas and to elaborate the working of the architecture.
ADVANTAGES:
The proposed architecture processes high-speed, large amounts of real-time remote sensory image data. It works on both the DPU and the DADU, taking satellite data as input.
Using our architecture for offline as well as online traffic, we perform a simple analysis on remote sensing earth observatory data. We assume that the data are big in nature and difficult to handle for a single server.
The data arrive continuously from a satellite at high speed. Hence, special algorithms are needed to process and analyze that Big Data and to make decisions from it. Here, in this section, we analyze remote sensing data for finding land, sea, or ice areas.
We have used the proposed architecture to perform
analysis and proposed an algorithm for handling, processing, analyzing, and
decision-making for remote sensing Big Data images using our proposed
architecture.
HARDWARE & SOFTWARE REQUIREMENTS:
HARDWARE REQUIREMENT:
- Processor – Pentium IV
- Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Key Board – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA
SOFTWARE REQUIREMENTS:
- Operating System : Windows XP or Win7
- Front End : JAVA JDK 1.7
- Back End : MYSQL Server
- Server : Apache Tomcat Server
- Script : JSP Script
- Document : MS-Office 2007
ARCHITECTURE DIAGRAM
MODULES:
DATA ANALYSIS DECISION UNIT (DADU):
DATA PROCESSING UNIT (DPU):
REMOTE SENSING APPLICATION RSDU:
FINDINGS AND DISCUSSION:
ALGORITHM DESIGN AND TESTING:
MODULES DESCRIPTION:
DATA PROCESSING UNIT (DPU):
In the data processing unit (DPU), the filtration and load balancer server has two basic responsibilities: filtration of data and load balancing of processing power. Filtration identifies the data useful for analysis, since it only allows useful information through, whereas the rest of the data are blocked and discarded. Hence, it enhances the performance of the whole proposed system. The load-balancing part of the server divides the filtered data into parts and assigns them to various processing servers. The filtration and load-balancing algorithm varies from analysis to analysis; e.g., if there is only a need to analyze sea wave and temperature data, only the measurements of these data are filtered out and segmented into parts.
Each processing server has its own algorithm implementation for processing the incoming segments of data from the filtration and load balancer server (FLBS). Each processing server performs statistical calculations and measurements, and carries out other mathematical or logical tasks, to generate intermediate results for each segment of data. Since these servers perform their tasks independently and in parallel, the performance of the proposed system is dramatically enhanced, and the results for each segment are generated in real time. The results generated by each server are then sent to the aggregation server for compilation, organization, and storage for further processing.
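The sketch below illustrates this filtration and load-balancing step. The valid value range kept by the filter, the block size, and the round-robin assignment to servers are illustrative assumptions rather than details fixed by the description above.

// Sketch only: drop values outside an assumed valid range, cut the remaining
// data into fixed-size blocks, and assign the blocks to processing servers in
// round-robin fashion so they can be handled in parallel.
import java.util.*;

public class FlbsSketch {
    public static List<List<double[]>> filterAndBalance(double[] data, int servers,
                                                        double minValid, double maxValid) {
        int blockSize = 1024;                              // assumed block size
        List<List<double[]>> perServer = new ArrayList<>();
        for (int s = 0; s < servers; s++) perServer.add(new ArrayList<double[]>());

        List<Double> useful = new ArrayList<>();
        for (double v : data) {
            if (v >= minValid && v <= maxValid) useful.add(v);   // filtration step
        }
        int block = 0;
        for (int i = 0; i < useful.size(); i += blockSize) {
            int end = Math.min(i + blockSize, useful.size());
            double[] b = new double[end - i];
            for (int j = i; j < end; j++) b[j - i] = useful.get(j);
            perServer.get(block % servers).add(b);               // round-robin load balancing
            block++;
        }
        return perServer;
    }
}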
DATA ANALYSIS DECISION UNIT (DADU):
DADU contains three major portions: the aggregation and compilation server, the result storage server(s), and the decision-making server. When the results are ready for compilation, the processing servers in the DPU send the partial results to the aggregation and compilation server; since these partial results are not yet in an organized and compiled form, the related results must be aggregated, organized into a proper form for further processing, and stored. In the proposed architecture, the aggregation and compilation server is supported by various algorithms that compile, organize, store, and transmit the results. Again, the algorithm varies from requirement to requirement and depends on the analysis needs. The aggregation server stores the compiled and organized results in the result storage so that any server can use them for processing at any time.
The aggregation server also sends a copy of the results to the decision-making server, which processes them to make decisions. The decision-making server is supported by decision algorithms, which query various properties of the results and then make decisions (e.g., in our analysis we identify land, sea, and ice, whereas other findings such as fires, storms, tsunamis, and earthquakes can also be detected). The decision algorithm must be robust and correct enough to efficiently produce results, discover hidden patterns, and make decisions. The decision part of the architecture is significant, since any small error in decision-making can degrade the efficiency of the whole analysis. DADU finally displays or broadcasts the decisions so that any application can utilize them in real time. The applications can be any business software, general-purpose community software, or other social networks that need those findings (i.e., the decision-making output).
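A compact sketch of this aggregation and decision step is shown below. The weighted-mean aggregation of partial results and the intensity thresholds used to label a region as sea, land, or ice are illustrative assumptions, not values taken from the paper.

// Sketch only: combine per-block partial results into one aggregated value,
// then compare it with hypothetical thresholds to label the region.
import java.util.*;

public class DaduSketch {
    // One partial result per processing server: mean of a block and its size.
    static class Partial { double mean; int count; Partial(double m, int c) { mean = m; count = c; } }

    static double aggregate(List<Partial> partials) {
        double sum = 0; int total = 0;
        for (Partial p : partials) { sum += p.mean * p.count; total += p.count; }
        return total == 0 ? 0 : sum / total;              // overall mean across all blocks
    }

    static String decide(double aggregatedMean) {
        double seaThreshold = 60, landThreshold = 120;    // hypothetical intensity thresholds
        if (aggregatedMean < seaThreshold) return "SEA";
        if (aggregatedMean < landThreshold) return "LAND";
        return "ICE";
    }

    public static void main(String[] args) {
        List<Partial> partials = Arrays.asList(new Partial(40, 1000), new Partial(55, 800));
        System.out.println(decide(aggregate(partials)));  // prints SEA for this sample
    }
}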
REMOTE SENSING APPLICATION RSDU:
Remote sensing promotes the expansion of the earth observatory system as a cost-effective parallel data acquisition system that satisfies specific computational requirements. The Earth and Space Science Society originally approved this solution as the standard for parallel processing in this particular application; for improved Big Data acquisition, it was soon recognized that traditional data processing technologies could not provide sufficient power for processing such data. Therefore, parallel processing of the massive volume of data was required to efficiently analyze the Big Data. For that reason, the proposed RSDU is introduced in the remote sensing Big Data architecture; it gathers data from various satellites around the globe. It is possible that the received raw data are distorted by scattering and absorption by various atmospheric gases and dust particles. We assume that the satellite can correct the erroneous data.
However, to convert the raw data into image format, the remote sensing satellite uses effective data analysis; it also preprocesses the data under many conditions to integrate data from different sources, which not only decreases storage cost but also improves analysis accuracy. The data must be corrected using different methods to remove distortions caused by the motion of the platform relative to the earth, platform attitude, earth curvature, nonuniformity of illumination, variations in sensor characteristics, etc. The data are then transmitted to the Earth Base Station for further processing using a direct communication link. We divide the data processing procedure into two steps: real-time Big Data processing and offline Big Data processing. In the case of offline data processing, the Earth Base Station transmits the data to the data center for storage, and these data are then used for future analyses. However, in real-time data processing, the data are directly transmitted to the filtration and load balancer server (FLBS), since storing the incoming real-time data would degrade the performance of real-time processing.
FINDINGS AND DISCUSSION:
Preprocessed and formatted data from the satellite contain all or some of the following parts, depending on the product.
1) Main product header (MPH): It includes the product's basic information, i.e., ID, measurement and sensing time, orbit information, etc.
2) Specific product header (SPH): It contains information specific to each product or product group, i.e., the number of data set descriptors (DSD), a directory of the remaining data sets in the file, etc.
3) Annotation data sets (ADS): They contain quality information, time-tagged processing parameters, geolocation tie points, solar angles, etc.
4) Global annotation data sets (GADS): They contain scaling factors, offsets, calibration information, etc.
5) Measurement data sets (MDS): They contain measurements or graphical parameters calculated from the measurement, including the quality flag and the time-tag measurement as well. The image data are also stored in this part and are the main element of our analysis.
The MPH and SPH data are in ASCII format, whereas all the other data sets are in binary format. The MDS, ADS, and GADS consist of sequences of records, with one or more fields of data for each record. In our case, the MDS contains a number of records, and each record contains a number of fields. Each record of the MDS corresponds to one row of the satellite image, which is our main focus during the analysis.
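For illustration only, the product layout described above can be mirrored by simple container types such as the following; the field names are hypothetical and the real satellite product format is not reproduced here.

// Sketch only: hypothetical container types mirroring the product layout
// (ASCII headers plus binary data sets, one MDS record per image row).
import java.util.*;

public class SatelliteProduct {
    Map<String, String> mph = new LinkedHashMap<>();            // main product header (ASCII key/value)
    Map<String, String> sph = new LinkedHashMap<>();            // specific product header (ASCII)
    List<byte[]> annotationDataSets = new ArrayList<>();        // ADS records (binary)
    List<byte[]> globalAnnotationDataSets = new ArrayList<>();  // GADS records (binary)
    List<MdsRecord> measurementDataSet = new ArrayList<>();     // MDS: one record per image row

    static class MdsRecord {
        long timeTag;        // time tag of the measurement
        int qualityFlag;     // quality flag field
        short[] pixels;      // one row of the satellite image
    }
}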
ALGORITHM DESIGN AND TESTING:
Our algorithms are proposed to process high-speed, large amounts of real-time remote sensory image data using the proposed architecture. They work on both the DPU and the DADU, taking satellite data as input to identify land and sea areas in the data set. The set contains four simple algorithms, i.e., algorithm I, algorithm II, algorithm III, and algorithm IV, which work on the filtration and load balancer, the processing servers, the aggregation server, and the decision-making server, respectively. Algorithm I, i.e., the filtration and load balancer algorithm (FLBA), works on the filtration and load balancer to keep only the required data and discard all other information. It also provides load balancing by dividing the data into fixed-size blocks and sending them to the processing servers, i.e., one or more distinct blocks to each server. This filtration, dividing, and load-balancing task speeds up performance by neglecting unnecessary data and by enabling parallel processing. Algorithm II, i.e., the processing and calculation algorithm (PCA), processes the filtered data and is implemented on each processing server. It provides the various parameter calculations that are used in the decision-making process. The parameter calculation results are then sent to the aggregation server for further processing. Algorithm III, i.e., the aggregation and compilation algorithm (ACA), stores, compiles, and organizes the results, which can be used by the decision-making server for land and sea area detection. Algorithm IV, i.e., the decision-making algorithm (DMA), identifies land and sea areas by comparing the parameter results received from the aggregation server with threshold values.
IMPLEMENTATION:
Big Data, like cloud computing, covers diverse technologies. The input of Big Data comes from social networks (Facebook, Twitter, LinkedIn, etc.), Web servers, satellite imagery, sensory data, banking transactions, etc. Despite the very recent emergence of Big Data architectures in scientific applications, numerous efforts toward Big Data analytics architectures can already be found in the literature. Among numerous others, we propose a remote sensing Big Data architecture to analyze Big Data in an efficient manner, as shown in Fig. 1. Fig. 1 delineates n satellites that obtain earth observatory Big Data images with sensors or conventional cameras through which sceneries are recorded using radiation. Special techniques are applied to process and interpret remote sensing imagery for the purpose of producing conventional maps, thematic maps, resource surveys, etc. We have divided the remote sensing Big Data architecture into three main units: RSDU, DPU, and DADU.
In healthcare scenarios, medical practitioners gather massive volumes of data about patients, medical histories, medications, and other details. These data are also accumulated by drug-manufacturing companies. The nature of these data is very complex, and sometimes the practitioners are unable to relate them to other information, which results in important information being missed. Employing advanced analytic techniques to organize and extract useful information from such Big Data can lead to personalized medication; advanced Big Data analytic techniques also give insight into the hereditary causes of disease.
ALGORITHMS:
This algorithm takes the satellite data or product, filters and divides them into segments, and performs load balancing.
The processing algorithm calculates results for different parameters for each incoming block and sends them to the next level. In step 1, the mean, standard deviation (SD), absolute difference, and the number of values greater than the maximum threshold are calculated. In the next step, the results are transmitted to the aggregation server.
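A sketch of these per-block calculations is given below; the maximum threshold is passed in as a parameter, and the absolute difference is taken here as the difference between the largest and smallest values in the block, which is an assumption since the text does not define it precisely.

// Sketch only: per-block statistics produced by a processing server, namely
// mean, standard deviation, absolute difference (assumed max minus min), and
// the count of values above the maximum threshold.
public class PcaSketch {
    public static double[] processBlock(double[] block, double maxThreshold) {
        double sum = 0, min = Double.MAX_VALUE, max = -Double.MAX_VALUE;
        int above = 0;
        for (double v : block) {
            sum += v;
            if (v < min) min = v;
            if (v > max) max = v;
            if (v > maxThreshold) above++;
        }
        double mean = sum / block.length;
        double sq = 0;
        for (double v : block) sq += (v - mean) * (v - mean);
        double sd = Math.sqrt(sq / block.length);
        double absDiff = Math.abs(max - min);
        // Intermediate result sent on to the aggregation server.
        return new double[] { mean, sd, absDiff, above };
    }
}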
The ACA collects the results from each processing server for each block Bi and then combines, organizes, and stores these results in an RDBMS database.
CONCLUSION AND FUTURE:
In this paper, we proposed an architecture for real-time Big Data analysis for remote sensing applications; the architecture efficiently processes and analyzes real-time and offline remote sensing Big Data for decision-making. The proposed architecture is composed of three major units: 1) RSDU; 2) DPU; and 3) DADU. These units implement algorithms for each level of the architecture depending on the required analysis. The proposed real-time Big Data architecture is generic (application independent) and can be used for any type of remote sensing Big Data analysis. Furthermore, filtering, dividing, and parallel processing of only the useful information are performed while all other extra data are discarded. These capabilities make the architecture a better choice for real-time remote sensing Big Data analysis.
The algorithms proposed in this paper for each unit and its subunits are used to analyze remote sensing data sets, which helps in a better understanding of land and sea areas. The proposed architecture is open to researchers and organizations for any type of remote sensory Big Data analysis; they can develop algorithms for each level of the architecture depending on their analysis requirements. For future work, we plan to extend the proposed architecture to make it compatible with Big Data analysis for all applications, e.g., sensors and social networking. We also plan to use the proposed architecture to perform complex analysis on earth observatory data for real-time decision-making, such as earthquake prediction, tsunami prediction, fire detection, etc.
REFERENCES:
[1] D. Agrawal, S. Das, and A. E. Abbadi, “Big Data and cloud computing: Current state and future opportunities,” in Proc. Int. Conf. Extending Database Technol. (EDBT), 2011, pp. 530–533.
[2] J. Cohen, B. Dolan, M. Dunlap, J. M. Hellerstein, and C. Welton, “Mad skills: New analysis practices for Big Data,” PVLDB, vol. 2, no. 2, pp. 1481–1492, 2009.
[3] J. Dean and S. Ghemawat, “Mapreduce: Simplified data processing on large clusters,” Commun. ACM, vol. 51, no. 1, pp. 107–113, 2008.
[4] H. Herodotou et al., “Starfish: A self-tuning system for Big Data analytics,” in Proc. 5th Int. Conf. Innovative Data Syst. Res. (CIDR), 2011, pp. 261–272.
[5] K. Michael and K. W. Miller, “Big Data: New opportunities and new challenges [guest editors’ introduction],” IEEE Comput., vol. 46, no. 6, pp. 22–24, Jun. 2013.
[6] C. Eaton, D. Deroos, T. Deutsch, G. Lapis, and P. C. Zikopoulos, Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data. New York, NY, USA: Mc Graw-Hill, 2012.
[7] R. D. Schneider, Hadoop for Dummies Special Edition. Hoboken, NJ, USA: Wiley, 2012.
Proof of Ownership In Deduplicated Storage With Mobile Device Efficiency
Cloud storage such as Dropbox and
Bitcasa is one of the most popular cloud services. Currently, with the
prevalence of mobile cloud computing, users can even collaboratively edit the
newest version of documents and synchronize the newest files on their smart
mobile devices. A remarkable feature of current cloud storage is its virtually infinite
storage. To support unlimited storage, the cloud storage provider uses data
deduplication techniques to reduce the data to be stored and therefore reduce
the storage expense. Moreover, the use of data deduplication also helps
significantly reduce the need for bandwidth and therefore improve the user
experience. Nevertheless, in spite of the above benefits, data deduplication has inherent security weaknesses. Among them, the most severe is that an adversary may perform unauthorized file downloading using only the file hash. In this article, we first review the previous solutions and identify their performance weaknesses. Then we propose an alternative design that achieves cloud server efficiency and, especially, mobile device efficiency.
1.2 INTRODUCTION
Mobile devices have become prevalent in recent years, and mobile computing has been a growing trend. Meanwhile, cloud computing is definitely the biggest revolution in recent decades. Many tasks, such as document editing and file backup, have been shifted from end devices to the cloud. Therefore, with the convergence of mobile computing and cloud computing, along with the recent development of the 5G communication standard that establishes more reliable and faster communication channels, mobile cloud computing (MCC) could be a rapidly growing field that deserves to be investigated and explored.
Deduplicated Storage in Mobile Cloud Computing
Among cloud services, cloud storage with the capability of file backup and synchronization could be the most popular service that enables mobile users to access their files everywhere. Dropbox (https://www.dropbox.com/) and Bitcasa (https://www.bitcasa.com/) are two examples that offer easy-to-use file backup and synchronization services. Several remarkable features of such cloud storage can be identified. It has high availability, which means that the user's data are replicated over cloud servers worldwide and are guaranteed to be accessible whenever the user has the need. It has flexibility in a pay-as-you-go model, which means that the user can gain additional storage immediately whenever the user is willing to make an extra payment. The most important feature is that it has virtually infinite storage space, which means that the user can back up whatever he/she wants to upload to the cloud. A renowned example is Bitcasa, which offers "unlimited storage" that enables the user to upload virtually everything. Offering infinite storage space might cause a severe economic burden on the cloud storage provider.
However, a technique called data deduplication helps significantly reduce the cost of storage. Data deduplication has been widely implemented by cloud storage providers including Dropbox and Bitcasa. According to the report in [8] (http://www.snia.org), the use of data deduplication in business applications may reduce the data to be stored and thus achieve disk and bandwidth savings of more than 90 percent. The power of data deduplication comes from avoiding storing the same file multiple times. The storage saving is especially obvious when popular multimedia contents such as music and movies are considered. The replicated contents create an additional storage need the first time they are uploaded, but create no extra storage need for subsequent uploads. In addition to storage saving, if the data content is already in storage, then the replicated content does not need to be transmitted, achieving bandwidth saving. Data deduplication can be categorized into two types depending on where the deduplication takes place: server (cloud) side deduplication and client (user) side deduplication. Server side deduplication is simple: the server, after receiving the file, checks whether it already has a copy in storage. The server discards the received file if it does, or creates a new file in the storage if it does not.
We can see that server side deduplication does not produce bandwidth saving, because the server performs the deduplication after the file has been received. On the other hand, client side deduplication adopts a more aggressive method: the user calculates and sends the hash of the file before uploading the file. Upon receiving the hash, the server checks whether the hash is already in storage. If so, the user is asked to send nothing, and the server associates the user with the existing file; otherwise, the user is asked to upload the file. An illustrative example is shown in Fig. 2, where user 1 first uploads files F1 and F2 in Fig. 2a. The cloud then knows from the hashes h(F1) and h(F2) sent by user 2 that there is already a copy of F1 in storage, and sends a positive acknowledgment and a negative acknowledgment to user 2. User 2, according to the acknowledgments, sends only F3, saving the transmission of F1. Public cloud storage services (e.g., Dropbox and Bitcasa) are more likely to adopt client side deduplication because of its storage and bandwidth savings. In particular, in addition to the reduced storage requirement, client side deduplication can also reduce the need for file transmission, reducing the waiting time for users and the energy consumption of the server. We particularly note that even with the increased bandwidth of the coming 5G communication standard, the data rate of wireless links is still not comparable to that of wired links. Thus, if we consider mobile devices accessing cloud storage services, client side deduplication becomes an inevitable technique for MCC applications.
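The following sketch shows the client-side check in miniature, with the cloud modelled as an in-memory map from hash to file content (a hypothetical API, not a real provider's interface): the client sends a SHA-256 hash first, and the file bytes are transmitted only when no copy is already stored.

// Sketch only: client side deduplication with the cloud modelled as a map.
import java.security.MessageDigest;
import java.util.*;

public class ClientSideDedupSketch {
    static Map<String, byte[]> cloudStore = new HashMap<>();      // hash -> file content
    static Map<String, Set<String>> owners = new HashMap<>();     // hash -> owning users

    static String sha256Hex(byte[] data) throws Exception {
        byte[] d = MessageDigest.getInstance("SHA-256").digest(data);
        StringBuilder sb = new StringBuilder();
        for (byte b : d) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    // Returns true if the upload was skipped because of deduplication.
    static boolean upload(String user, byte[] file) throws Exception {
        String h = sha256Hex(file);
        boolean duplicate = cloudStore.containsKey(h);             // server-side hash check
        if (!duplicate) cloudStore.put(h, file);                   // only then transmit the file
        if (!owners.containsKey(h)) owners.put(h, new HashSet<String>());
        owners.get(h).add(user);                                   // user becomes an owner
        return duplicate;
    }

    public static void main(String[] args) throws Exception {
        byte[] f1 = "file F1".getBytes("UTF-8");
        upload("user1", f1);
        System.out.println(upload("user2", f1));   // true: F1 is deduplicated, nothing re-sent
    }
}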
1.3 LITERATURE SURVEY
DUPLESS: SERVER-AIDED ENCRYPTION FOR DEDUPLICATED STORAGE
AUTHOR: M. Bellare, S. Keelveedhi, and T. Ristenpart
PUBLISH: Proc. 22nd USENIX Conf. Sec. Symp., 2013, pp. 179–194.
EXPLANATION:
Cloud storage service providers
such as Dropbox, Mozy, and others perform deduplication to save space by only
storing one copy of each file uploaded. Should clients conventionally encrypt
their files, however, savings are lost. Message-locked encryption (the most
prominent manifestation of which is convergent encryption) resolves this
tension. However it is inherently subject to brute-force attacks that can
recover files falling into a known set. We propose an architecture that
provides secure deduplicated storage resisting brute-force attacks, and realize
it in a system called DupLESS. In DupLESS, clients encrypt under message-based
keys obtained from a key-server via an oblivious PRF protocol. It enables
clients to store encrypted data with an existing service, have the service
perform deduplication on their behalf, and yet achieves strong confidentiality
guarantees. We show that encryption for deduplicated storage can achieve
performance and space savings close to that of using the storage service with
plaintext data.
FAST AND SECURE LAPTOP BACKUPS WITH ENCRYPTED DE-DUPLICATION
AUTHOR: P. Anderson and L. Zhang
PUBLISH: Proc. 24th Int. Conf. Large Installation Syst. Admin., 2010, pp. 29–40.
EXPLANATION:
Many people now store large quantities of personal and corporate data on laptops or home computers. These often have poor or intermittent connectivity, and are vulnerable to theft or hardware failure. Conventional backup solutions are not well suited to this environment, and backup regimes are frequently inadequate. This paper describes an algorithm which takes advantage of the data which is common between users to increase the speed of backups, and reduce the storage requirements. This algorithm supports client-end per-user encryption which is necessary for confidential personal data. It also supports a unique feature which allows immediate detection of common subtrees, avoiding the need to query the backup system for every file. We describe a prototype implementation of this algorithm for Apple OS X, and present an analysis of the potential effectiveness, using real data obtained from a set of typical users. Finally, we discuss the use of this prototype in conjunction with remote cloud storage, and present an analysis of the typical cost savings.
SECURE DEDUPLICATION WITH EFFICIENT AND RELIABLE CONVERGENT KEY MANAGEMENT
AUTHOR: J. Li, X. Chen, M. Li, J. Li, P. Lee, and W. Lou
PUBLISH: IEEE Trans. Parallel Distrib. Syst., http://oi.ieeecomputersociety.org/10.1109/TPDS.2013.284, 2013
EXPLANATION:
Data deduplication is a technique for eliminating duplicate copies of data, and has been widely used in cloud storage to reduce storage space and upload bandwidth. Promising as it is, an arising challenge is to perform secure deduplication in cloud storage. Although convergent encryption has been extensively adopted for secure deduplication, a critical issue of making convergent encryption practical is to efficiently and reliably manage a huge number of convergent keys. This paper makes the first attempt to formally address the problem of achieving efficient and reliable key management in secure deduplication. We first introduce a baseline approach in which each user holds an independent master key for encrypting the convergent keys and outsourcing them to the cloud. However, such a baseline key management scheme generates an enormous number of keys with the increasing number of users and requires users to dedicatedly protect the master keys. To this end, we propose Dekey , a new construction in which users do not need to manage any keys on their own but instead securely distribute the convergent key shares across multiple servers. Security analysis demonstrates that Dekey is secure in terms of the definitions specified in the proposed security model. As a proof of concept, we implement Dekey using the Ramp secret sharing scheme and demonstrate that Dekey incurs limited overhead in realistic environments.
CHAPTER 2
2.0 SYSTEM ANALYSIS
2.1 EXISTING SYSTEM:
Data deduplication is one of the most important data compression techniques for eliminating duplicate copies of repeating data, and it has been widely used in cloud storage to reduce the amount of storage space and save bandwidth. Previous deduplication systems cannot support differential authorization duplicate checks, which are important in many applications. In such an authorized deduplication system, each user is issued a set of privileges during system initialization. Each file uploaded to the cloud is also bound to a set of privileges that specify which kinds of users are allowed to perform the duplicate check and access the file.
Before submitting a duplicate-check request for a file, the user needs to take this file and his own privileges as inputs. The user is able to find a duplicate for this file if and only if there is a copy of this file and a matched privilege stored in the cloud. Traditional deduplication systems based on convergent encryption, although providing confidentiality to some extent, do not support duplicate checks with differential privileges. In other words, no differential privileges have been considered in deduplication based on the convergent encryption technique. It seems contradictory to realize both deduplication and differential authorization duplicate checks at the same time.
2.1.1 DISADVANTAGES:
- Deduplication systems cannot support differential authorization duplicate checks.
- One critical challenge of cloud storage services is the management of the ever-increasing volume of data.
- Users' sensitive data are susceptible to both insider and outsider attacks.
- Sometimes deduplication is impossible.
2.2 PROPOSED SYSTEM:
We propose an alternative design that strikes a balance between server side efficiency and user side efficiency. Before introducing the scheme's details, we present two observations. First, the Merkle-tree-based POW schemes are I/O efficient at the server side because the Merkle tree root can be thought of as a compact summary of the file; therefore, there is no need for the cloud to access the disk to retrieve the file. Second, the user side is computationally efficient in the three s-POW schemes because the user is required only to return a few bits of the file. With the above two observations, our design strategy is to use a probabilistic data structure as the compact summary of the file, in contrast to the deterministic data structure, the Merkle hash tree, in the POW schemes. The query challenge is also modified to random blocks, in contrast to the random bits in the s-POW schemes. An overview of the proposed POW scheme goes as follows.
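As a rough sketch only, the random-block challenge idea can be illustrated as follows; the block size, the number of challenged blocks, and the use of per-block digests as the server-side summary are assumptions for illustration and not the exact construction proposed here.

// Sketch only: the cloud challenges a few random block indices, the user
// answers with the corresponding blocks, and the cloud verifies the answers
// against digests kept in its compact summary, without reading the file.
import java.security.MessageDigest;
import java.util.*;

public class RandomBlockPowSketch {
    static final int BLOCK = 4096;          // assumed block size
    static final int CHALLENGES = 5;        // assumed number of challenged blocks

    static byte[] block(byte[] file, int i) {
        int from = Math.min(i * BLOCK, file.length);
        int to = Math.min(from + BLOCK, file.length);
        return Arrays.copyOfRange(file, from, to);
    }

    // Summary kept by the cloud when the file is first uploaded: one digest per block.
    static List<byte[]> summarize(byte[] file) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        List<byte[]> digests = new ArrayList<>();
        int blocks = (file.length + BLOCK - 1) / BLOCK;
        for (int i = 0; i < blocks; i++) digests.add(md.digest(block(file, i)));
        return digests;
    }

    // The cloud picks random block indices and verifies the user's answers.
    static boolean verify(List<byte[]> summary, byte[] claimedFile) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        Random rnd = new Random();
        for (int c = 0; c < CHALLENGES; c++) {
            int i = rnd.nextInt(summary.size());                   // challenge: block index i
            byte[] answer = block(claimedFile, i);                 // user returns block i only
            if (!Arrays.equals(md.digest(answer), summary.get(i))) return false;
        }
        return true;                                               // ownership accepted
    }
}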
2.2.1 ADVANTAGES:
Performance factors of a POW scheme include the bandwidth requirement, the I/O overhead at both the user and server sides, and the computation overhead at both sides; among these, the I/O overhead is the least considered in existing POW designs. More specifically, cloud storage usually has a storage hierarchy: the memory (primary storage) and the disk (secondary storage).
The execution of a POW scheme might require the user and cloud to access the file stored in the disk multiple times. The server might also need to keep the verification object in either the memory or the disk to verify the user’s claim.
All of the above might result in a huge amount of I/O delay because of the access-time gap between the memory and the disk. In this article, we focus only on the abuse of a file hash to gain ownership of a file, and we aim to design a POW scheme with minimal performance overhead.
- To prevent unauthorized access, a secure proof of ownership (POW) protocol is also needed to provide the proof that the user indeed owns the same file when a duplicate is found.
- It keeps the overhead minimal compared with normal convergent encryption and file upload operations.
- Data confidentiality is maintained.
- It is more secure compared with previously proposed techniques.
2.3 HARDWARE & SOFTWARE REQUIREMENTS:
2.3.1 HARDWARE REQUIREMENT:
- Processor – Pentium IV
- Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Key Board – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA
2.3.2 SOFTWARE REQUIREMENTS:
JAVA
- Operating System : Windows XP or Win7
- Front End : JAVA JDK 1.7
- Back End : MYSQL Server
- Server : Apache Tomcat Server
- Script : JSP Script
- Document : MS-Office 2007
CHAPTER 3
3.0 SYSTEM DESIGN:
Data Flow Diagram / Use Case Diagram / Flow Diagram:
- The DFD is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on these data, and the output data generated by the system.
- The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components. These components are the system process, the data used by the process, an external entity that interacts with the system and the information flows in the system.
- DFD shows how the information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.
- DFD is also known as bubble chart. A DFD may be used to represent a system at any level of abstraction. DFD may be partitioned into levels that represent increasing information flow and functional detail.
NOTATION:
SOURCE OR DESTINATION OF DATA:
External sources or destinations, which may be people or organizations or other entities
DATA SOURCE:
Here the data referenced by a process is stored and retrieved.
PROCESS:
People, procedures, or devices that produce data. The physical component is not identified.
DATA FLOW:
Data moves in a specific direction from an origin to a destination. The data flow is a “packet” of data.
There are several common modeling rules when creating DFDs:
- All processes must have at least one data flow in and one data flow out.
- All processes should modify the incoming data, producing new forms of outgoing data.
- Each data store must be involved with at least one data flow.
- Each external entity must be involved with at least one data flow.
- A data flow must be attached to at least one process.
3.1 ARCHITECTURE DIAGRAM
3.2 DATAFLOW DIAGRAM
USER:
ADMIN:
UML DIAGRAMS:
3.2 USE CASE DIAGRAM:
3.3 CLASS DIAGRAM:
3.4 SEQUENCE DIAGRAM:
SENDER USER:
RECEIVER USER:
3.5 ACTIVITY DIAGRAM:
SENDER LOGIN:
RECEIVER LOGIN:
CHAPTER 4
4.0 IMPLEMENTATION:
MOBILE CLOUD COMPUTING:
Mobile Cloud Computing (MCC) is the combination of cloud computing, mobile computing, and wireless networks to bring rich computational resources to mobile users, network operators, as well as cloud computing providers. The ultimate goal of MCC is to enable execution of rich mobile applications on a plethora of mobile devices, with a rich user experience. MCC provides business opportunities for mobile network operators as well as cloud providers. It has been described as "a rich mobile computing technology that leverages unified elastic resources of varied clouds and network technologies toward unrestricted functionality, storage, and mobility to serve a multitude of mobile devices anywhere, anytime through the channel of Ethernet or Internet regardless of heterogeneous environments and platforms based on the pay-as-you-use principle."
ARCHITECTURE:
MCC uses computational augmentation approaches by which resource-constrained mobile devices can utilize the computational resources of varied cloud-based resources. In MCC, there are four types of cloud-based resources, namely distant immobile clouds, proximate immobile computing entities, proximate mobile computing entities, and hybrid (a combination of the other three models). Giant clouds such as Amazon EC2 are in the distant immobile group, whereas cloudlets or surrogates are members of the proximate immobile computing entities. Smartphones, tablets, handheld devices, and wearable computing devices are part of the third group of cloud-based resources, the proximate mobile computing entities.
DIAGRAM:
In the MCC landscape, an amalgam of
mobile computing, cloud computing, and communication networks (to augment
smartphones) creates several complex challenges such as Mobile Computation
Offloading, Seamless Connectivity, Long WAN Latency, Mobility Management,
Context-Processing, Energy Constraint, Vendor/data Lock-in, Security and
Privacy, Elasticity that hinder MCC success and adoption.
Although significant research and development in MCC is available in the literature, efforts in the following domains are still lacking:
- Architectural issues: Reference architecture for heterogeneous MCC environment is a crucial requirement for unleashing the power of mobile computing towards unrestricted ubiquitous computing.
- Energy-efficient transmission: MCC requires frequent transmissions between cloud platform and mobile devices, due to the stochastic nature of wireless networks, the transmission protocol should be carefully designed.
- Context-awareness issues: Context-aware and socially-aware computing are inseparable traits of contemporary handheld computers. To achieve the vision of mobile computing among heterogeneous converged networks and computing devices, designing resource-efficient environment-aware applications is an essential need.
- Live VM migration issues: Executing resource-intensive mobile application via Virtual Machine (VM) migration-based application offloading involves encapsulation of application in VM instance and migrating it to the cloud, which is a challenging task due to additional overhead of deploying and managing VM on mobile devices.
- Mobile communication congestion issues: Mobile data traffic is increasing tremendously due to ever-growing mobile user demand for cloud resources, which impacts mobile network operators and calls for future efforts to enable smooth communication between mobile and cloud endpoints.
- Trust, security, and privacy issues: Trust is an essential factor for the success of the burgeoning MCC paradigm.
PROOF OF OWNERSHIP:
An even more severe and direct security threat from using deduplicated cloud storage is that the adversary may gain ownership of files by only eavesdropping on file hashes. A closer look at client side deduplication shows that anyone in possession of the file hash can gain ownership of the file by uploading the file hash. More specifically, the cloud, on receiving a store request for a file already in storage, avoids the redundant file transmission and then adds the user as an additional owner of the file. An illustrative example is shown in Fig. 3d. Such a situation is apparently undesirable because, in theory, the adversary cannot infer the file content from the hash.
However, in this case, once the adversary knows the
hash, it is able to download the entire file content. On the other hand, in
practice, the user considers the hash unharmful and in some cases publishes the
hashes as timestamps. However, the publicly available hashes can be abused to
gain the file. This security weakness comes from using the static and short
piece of information (hash) as a way of claiming file ownership. Motivated by
this observation, Halevi et al. [10] introduce the notion of proof of ownership
(POW). A POW scheme is jointly executed by the cloud and user such that the
user can prove to the cloud that it is indeed in possession of the file.
4.1 ALGORITHM:
PUBLIC KEY INFRASTRUCTURE (PKI) AND PRIVATE KEY GENERATOR (PKG):
In cryptography, the ElGamal encryption system is an asymmetric key encryption algorithm for public-key cryptography which is based on the Diffie–Hellman key exchange. It was described by Taher Elgamal in 1985. ElGamal encryption is used in the free GNU Privacy Guard software, recent versions of PGP, and other cryptosystems. The DSA (Digital Signature Algorithm) is a variant of the ElGamal signature scheme, which should not be confused with ElGamal encryption. The security of the ElGamal scheme depends on the properties of the underlying group as well as any padding scheme used on the messages.
If the computational Diffie–Hellman assumption (CDH) holds in the underlying cyclic group, then the encryption function is one-way. If the decisional Diffie–Hellman assumption (DDH) holds in that group, then ElGamal achieves semantic security. Semantic security is not implied by the computational Diffie–Hellman assumption alone. See the decisional Diffie–Hellman assumption for a discussion of groups where the assumption is believed to hold.
To achieve chosen-ciphertext security, the scheme must be further modified, or an appropriate padding scheme must be used. Depending on the modification, the DDH assumption may or may not be necessary.
Other schemes related to ElGamal which achieve security against chosen-ciphertext attacks have also been proposed. The Cramer–Shoup cryptosystem is secure under chosen-ciphertext attack assuming DDH holds for the underlying group. Its proof does not use the random oracle model. Another proposed scheme is DHAES, whose proof requires an assumption that is weaker than the DDH assumption.
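A minimal textbook ElGamal round trip, written with java.math.BigInteger, is shown below. The prime size, the choice of generator, and the absence of padding are purely illustrative; real deployments use standardized large groups and proper padding.

// Sketch only: textbook ElGamal key generation, encryption, and decryption
// over a toy-size prime; parameters are illustrative, not production choices.
import java.math.BigInteger;
import java.security.SecureRandom;

public class ElGamalSketch {
    public static void main(String[] args) {
        SecureRandom rnd = new SecureRandom();
        BigInteger p = BigInteger.probablePrime(512, rnd);          // group modulus (toy size)
        BigInteger g = BigInteger.valueOf(2);                       // assumed generator
        BigInteger x = new BigInteger(p.bitLength() - 2, rnd);      // private key x
        BigInteger y = g.modPow(x, p);                              // public key y = g^x mod p

        BigInteger m = new BigInteger("42");                        // message (must be < p)
        BigInteger k = new BigInteger(p.bitLength() - 2, rnd);      // ephemeral randomness
        BigInteger c1 = g.modPow(k, p);                             // c1 = g^k mod p
        BigInteger c2 = m.multiply(y.modPow(k, p)).mod(p);          // c2 = m * y^k mod p

        // Decryption: m = c2 * (c1^x)^(-1) mod p
        BigInteger recovered = c2.multiply(c1.modPow(x, p).modInverse(p)).mod(p);
        System.out.println(recovered.equals(m));                    // prints true
    }
}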
4.2 MODULES:
SECURE USER MODULES:
DEDUPLICATED STORAGE:
CHECK DEDUPLICATES:
APPLY POW SCHEME:
SECURE SEND KEY:
4.3 MODULE DESCRIPTION:
SECURE USER MODULES:
In this module, users have authentication and security to access the details presented in the system. Before accessing or searching the details, a user should have an account; otherwise, the user should register first.
- Registration
- File View
- Encryption
- Download
- Upload Files
- Encrypt and save to cloud
DEDUPLICATED STORAGE:
Client side deduplication incurs its own security weaknesses. First, the privacy of a file's existence in the cloud may be compromised, because the adversary may try to upload candidate files to see whether deduplication takes place. If the deduplication takes place, this is an indicator of the file's existence; otherwise, the adversary may infer the file's nonexistence. The situation becomes even worse when we consider low-entropy files, because the adversary may exhaustively create different files and upload the hashes to check each file's existence. For example, a curious colleague may query his/her manager's salary by uploading different salary sheets, because the sheets are of a similar form, restricting the number of file contents to be tested.
CHECK DEDUPLICATES:
Data deduplication can be categorized into two types depending on where the deduplication takes place: server (cloud) side deduplication and client (user) side deduplication. Server side deduplication is simple: the server, after receiving the file, checks whether it already has a copy in storage. The server discards the received file if it does, or creates a new file in the storage if it does not. We can see that server side deduplication does not produce bandwidth saving, because the server performs the deduplication after the file has been received. On the other hand, client side deduplication adopts a more aggressive method: the user calculates and sends the hash of the file before uploading the file. Upon receiving the hash, the server checks whether the hash is already in storage. If so, the user is asked to send nothing, and the server associates the user with the existing file; otherwise, the user is asked to upload the file. An illustrative example is shown in Fig. 2, where user 1 first uploads files F1 and F2 in Fig. 2a. The cloud then knows from the hashes h(F1) and h(F2) sent by user 2 that there is already a copy of F1 in storage, and sends a positive acknowledgment and a negative acknowledgment to user 2. User 2, according to the acknowledgments, sends only F3, saving the transmission of F1. Public cloud storage services (e.g., Dropbox and Bitcasa) are more likely to adopt client side deduplication because of its storage and bandwidth savings. In particular, in addition to the reduced storage requirement, client side deduplication can also reduce the need for file transmission, reducing the waiting time for users and the energy consumption of the server. We particularly note that even with the increased bandwidth of the coming 5G communication standard, the data rate of wireless links is still not comparable to that of wired links. Thus, if we consider mobile devices accessing cloud storage services, client side deduplication becomes an inevitable technique for MCC applications.
APPLY POW SCHEME:
The Merkle-tree-based POW schemes perform very well on the server side, since only a small index (the tree root) needs to be stored in main memory. However, the proof of ownership is achieved by the user sending an authentication path of size O(log |f|) to the cloud, resulting in more communication overhead and computation load on the cloud. The I/O overhead on the user side is also increased, because the user needs to retrieve the entire file. At the other extreme, although the s-POW schemes have great computation and I/O efficiency on the user side, their I/O burden on the cloud is significantly increased, since the cloud is required to retrieve random bits from secondary storage.
In this article, we propose an alternative design that strikes a balance between server side efficiency and user side efficiency. Before introducing the scheme's details, we present two observations. First, the Merkle-tree-based POW schemes are I/O efficient at the server side because the Merkle tree root can be thought of as a compact summary of the file; therefore, there is no need for the cloud to access the disk to retrieve the file. Second, the user side is computationally efficient in the three s-POW schemes because the user is required only to return a few bits of the file. With the above two observations, our design strategy is to use a probabilistic data structure as the compact summary of the file, in contrast to the deterministic data structure, the Merkle hash tree, in the POW schemes. The query challenge is also modified to random blocks, in contrast to the random bits in the s-POW schemes. An overview of the proposed POW scheme goes as follows.
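One plausible instantiation of such a probabilistic compact summary is a Bloom filter over the file blocks; this is an assumption for illustration, since the text above does not fix the exact data structure. The cloud keeps only a bit array and can check a challenged block against it without reading the file from disk, at the cost of a small false-positive probability.

// Sketch only: a Bloom filter over file blocks as one possible probabilistic
// compact summary; filter size and hash count are illustrative assumptions.
import java.security.MessageDigest;
import java.util.BitSet;

public class BloomSummarySketch {
    static final int BITS = 1 << 20;     // assumed filter size
    static final int HASHES = 4;         // assumed number of hash functions

    final BitSet bits = new BitSet(BITS);

    void addBlock(byte[] block) throws Exception {
        for (int idx : indexes(block)) bits.set(idx);
    }

    boolean mightContain(byte[] block) throws Exception {
        for (int idx : indexes(block)) if (!bits.get(idx)) return false;
        return true;   // "probably present": false positives are possible
    }

    // Derive HASHES indexes from a SHA-256 digest of the block.
    static int[] indexes(byte[] block) throws Exception {
        byte[] d = MessageDigest.getInstance("SHA-256").digest(block);
        int[] idx = new int[HASHES];
        for (int i = 0; i < HASHES; i++) {
            int v = ((d[4 * i] & 0xFF) << 24) | ((d[4 * i + 1] & 0xFF) << 16)
                  | ((d[4 * i + 2] & 0xFF) << 8) | (d[4 * i + 3] & 0xFF);
            idx[i] = Math.abs(v % BITS);
        }
        return idx;
    }
}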
SECURE SEND KEY:
Once the key request is received, the sender can send the key or decline the request. With this key and the request ID that was generated at the time of sending the key request, the receiver can decrypt the message.
CHAPTER 5
5.0 SYSTEM STUDY:
5.1 FEASIBILITY STUDY:
The feasibility of the project is analyzed in this phase and business proposal is put forth with a very general plan for the project and some cost estimates. During system analysis the feasibility study of the proposed system is to be carried out. This is to ensure that the proposed system is not a burden to the company. For feasibility analysis, some understanding of the major requirements for the system is essential.
Three key considerations involved in the feasibility analysis are
- ECONOMICAL FEASIBILITY
- TECHNICAL FEASIBILITY
- SOCIAL FEASIBILITY
5.1.1 ECONOMICAL FEASIBILITY:
This study is carried out to check the economic impact that the system will have on the organization. The amount of funds that the company can pour into the research and development of the system is limited. The expenditures must be justified. The developed system is well within the budget, and this was achieved because most of the technologies used are freely available. Only the customized products had to be purchased.
5.1.2 TECHNICAL FEASIBILITY
This study is carried out to check the technical feasibility, that is, the technical requirements of the system. Any system developed must not place a high demand on the available technical resources, as this would lead to high demands being placed on the client. The developed system must have modest requirements, as only minimal or no changes are required for implementing this system.
5.1.3 SOCIAL FEASIBILITY:
The aspect of study is to check the level of acceptance of the system by the user. This includes the process of training the user to use the system efficiently. The user must not feel threatened by the system, instead must accept it as a necessity. The level of acceptance by the users solely depends on the methods that are employed to educate the user about the system and to make him familiar with it. His level of confidence must be raised so that he is also able to make some constructive criticism, which is welcomed, as he is the final user of the system.
5.2 SYSTEM TESTING:
Testing is a process of checking whether the developed system works according to the original objectives and requirements. It is a set of activities that can be planned in advance and conducted systematically. Testing is vital to the success of the system. System testing makes a logical assumption that if all the parts of the system are correct, the overall goal will be successfully achieved. Inadequate testing, or no testing at all, leads to errors that may not appear until many months later.
This creates two problems: the time lag between the cause and the appearance of the problem, and the effect of system errors on the files and records within the system. A small system error can conceivably explode into a much larger problem. Effective testing early in the process translates directly into long-term cost savings from a reduced number of errors. Another reason for system testing is its utility as a user-oriented vehicle before implementation. The best program is worthless if it does not produce the correct outputs.
5.2.1 UNIT TESTING:
Description | Expected result |
Test for application window properties. | All the properties of the windows are to be properly aligned and displayed. |
Test for mouse operations. | All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions. |
A program represents the
logical elements of a system. For a program to run satisfactorily, it must
compile and test data correctly and tie in properly with other programs.
Achieving an error free program is the responsibility of the programmer.
Program testing checks for two types of errors: syntax and logical. A syntax error is a program statement that violates one or more rules of the language in which it is written. An improperly defined field dimension or omitted keywords are common syntax errors. These errors are shown through error messages generated by the computer. For logic errors, the programmer must examine the output carefully.
5.1.2 FUNCTIONAL TESTING:
Functional testing of an application is used to prove that the application delivers correct results, using enough inputs to give an adequate level of confidence that it will work correctly for all sets of inputs. The functional testing will need to prove that the application works for each client type and that the personalization functions work correctly. When a program is tested, the actual output is compared with the expected output. When there is a discrepancy, the sequence of instructions must be traced to determine the problem. The process is facilitated by breaking the program into self-contained portions, each of which can be checked at certain key points. The idea is to compare program values against desk-calculated values to isolate the problems.
Description | Expected result |
Test for all modules. | All peers should communicate in the group. |
Test for various peer in a distributed network framework as it display all users available in the group. | The result after execution should give the accurate result. |
5.1. 3 NON-FUNCTIONAL TESTING:
The non-functional software testing encompasses a rich spectrum of testing strategies, describing the expected results for every test case. It uses symbolic analysis techniques. This testing is used to check that an application will work in the operational environment. Non-functional testing includes:
- Load testing
- Performance testing
- Usability testing
- Reliability testing
- Security testing
5.2.4 LOAD TESTING:
An important tool for implementing system tests is a load generator. A load generator is essential for testing quality requirements such as performance and stress. A load can be a real load, that is, the system can be put under real usage by having actual users connected to it; they will generate test input data for the system test.
Description | Expected result |
It is necessary to ascertain that the application behaves correctly under loads when ‘Server busy’ response is received. | Should designate another active node as a Server. |
5.2.5 PERFORMANCE TESTING:
Performance tests are utilized in order to determine the widely defined performance of the software system such as execution time associated with various parts of the code, response time and device utilization. The intent of this testing is to identify weak points of the software system and quantify its shortcomings.
Description | Expected result |
This is required to assure that an application performs adequately, having the capability to handle many peers, delivering its results in the expected time and using an acceptable level of resources; this is an aspect of operational management. | Should handle large input values, and produce accurate results in the expected time. |
5.2.6 RELIABILITY TESTING:
Software reliability is the ability of a system or component to perform its required functions under stated conditions for a specified period of time, and it is ensured in this testing. Reliability can be expressed as the ability of the software to reveal defects under testing conditions, according to the specified requirements. It is the probability that a software system will operate without failure under given conditions for a given time interval, and it focuses on the behavior of the software element. This forms a part of software quality control.
Description | Expected result |
This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in providing the application. | In case of failure of the server, an alternate server should take over the job. |
5.2.7 SECURITY TESTING:
Security testing evaluates system characteristics that relate to the availability, integrity and confidentiality of the system data and services. Users/Clients should be encouraged to make sure their security needs are very clearly known at requirements time, so that the security issues can be addressed by the designers and testers.
Description | Expected result |
Check that the user identification is authenticated. | In case of failure, it should not be connected in the framework. |
Check whether group keys in a tree are shared by all peers. | The peers should know the group key in the same group. |
5.2.8 WHITE BOX TESTING:
White box testing, sometimes called glass-box testing, is a test case design method that uses the control structure of the procedural design to derive test cases. Using the white box testing method, the software engineer can derive test cases. White box testing focuses on the inner structure of the software to be tested.
Description | Expected result |
Exercise all logical decisions on their true and false sides. | All the logical decisions must be valid. |
Execute all loops at their boundaries and within their operational bounds. | All the loops must be finite. |
Exercise internal data structures to ensure their validity. | All the data structures must be valid. |
5.2.9 BLACK BOX TESTING:
Black box testing, also called behavioral testing, focuses on the functional requirements of the software. That is, black box testing enables the software engineer to derive sets of input conditions that will fully exercise all functional requirements for a program. Black box testing is not an alternative to white box techniques; rather, it is a complementary approach that is likely to uncover a different class of errors than white box methods. Black box testing attempts to find errors by focusing on the inputs, outputs, and principal functions of a software module. The starting point of black box testing is either a specification or the code. The contents of the box are hidden, and the stimulated software should produce the desired results.
Description | Expected result |
To check for incorrect or missing functions. | All the functions must be valid. |
To check for interface errors. | The entire interface must function normally. |
To check for errors in data structures or external database access. | Database update and retrieval must be done properly. |
To check for initialization and termination errors. | All the functions and data structures must be initialized properly and terminated normally. |
All the above system testing strategies are carried out during development, as the documentation and institutionalization of the proposed goals and related policies are essential.
CHAPTER 6
6.0 SOFTWARE DESCRIPTION:
6.1 JAVA TECHNOLOGY:
Java technology is both a programming language and a platform.
The Java Programming Language
The Java programming language is a high-level language that can be characterized by all of the following buzzwords:
- Simple
- Architecture neutral
- Object oriented
- Portable
- Distributed
- High performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure
With most programming languages, you either compile or interpret a program so that you can run it on your computer. The Java programming language is unusual in that a program is both compiled and interpreted. With the compiler, first you translate a program into an intermediate language called Java byte codes —the platform-independent codes interpreted by the interpreter on the Java platform. The interpreter parses and runs each Java byte code instruction on the computer. Compilation happens just once; interpretation occurs each time the program is executed. The following figure illustrates how this works.
You can think of Java byte codes as the machine code instructions for the Java Virtual Machine (Java VM). Every Java interpreter, whether it’s a development tool or a Web browser that can run applets, is an implementation of the Java VM. Java byte codes help make “write once, run anywhere” possible. You can compile your program into byte codes on any platform that has a Java compiler. The byte codes can then be run on any implementation of the Java VM. That means that as long as a computer has a Java VM, the same program written in the Java programming language can run on Windows 2000, a Solaris workstation, or on an iMac.
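As an illustration of this compile-once, run-anywhere cycle, a minimal example follows; the class name is arbitrary, and the commands shown in the comments are the standard JDK tools.

// HelloWorld.java -- compiled once into platform-independent bytecode,
// then executed by any Java VM:
//   javac HelloWorld.java   (produces HelloWorld.class, the bytecode)
//   java HelloWorld         (the VM loads and runs the bytecode)
public class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello from the Java platform");
    }
}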
6.2 THE JAVA PLATFORM:
A platform is the hardware or software environment in which a program runs. We’ve already mentioned some of the most popular platforms like Windows 2000, Linux, Solaris, and MacOS. Most platforms can be described as a combination of the operating system and hardware. The Java platform differs from most other platforms in that it’s a software-only platform that runs on top of other hardware-based platforms.
The Java platform has two components:
- The Java Virtual Machine (Java VM)
- The Java Application Programming Interface (Java API)
You’ve already been introduced to the Java VM. It’s the base for the Java platform and is ported onto various hardware-based platforms.
The Java API is a large collection of ready-made software components that provide many useful capabilities, such as graphical user interface (GUI) widgets. The Java API is grouped into libraries of related classes and interfaces; these libraries are known as packages. The next section, What Can Java Technology Do?, highlights what functionality some of the packages in the Java API provide.
The following figure depicts a program that’s running on the Java platform. As the figure shows, the Java API and the virtual machine insulate the program from the hardware.
Native code is code that after you compile it, the compiled code runs on a specific hardware platform. As a platform-independent environment, the Java platform can be a bit slower than native code. However, smart compilers, well-tuned interpreters, and just-in-time byte code compilers can bring performance close to that of native code without threatening portability.
6.3 WHAT CAN JAVA TECHNOLOGY DO?
The most common types of programs written in the Java programming language are applets and applications. If you’ve surfed the Web, you’re probably already familiar with applets. An applet is a program that adheres to certain conventions that allow it to run within a Java-enabled browser.
However, the Java programming language is not just for writing cute, entertaining applets for the Web. The general-purpose, high-level Java programming language is also a powerful software platform. Using the generous API, you can write many types of programs.
An application is a standalone program that runs directly on the Java platform. A special kind of application known as a server serves and supports clients on a network. Examples of servers are Web servers, proxy servers, mail servers, and print servers. Another specialized program is a servlet.
A servlet can almost be thought of as an applet that runs on the server side. Java Servlets are a popular choice for building interactive web applications, replacing the use of CGI scripts. Servlets are similar to applets in that they are runtime extensions of applications. Instead of working in browsers, though, servlets run within Java Web servers, configuring or tailoring the server.
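A minimal servlet sketch is shown below, assuming a standard servlet container and the javax.servlet API on the classpath; the class name and the greeting it emits are illustrative only.

import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Minimal servlet: runs inside a Java web server/container and answers HTTP GETs.
public class HelloServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        response.setContentType("text/html");
        PrintWriter out = response.getWriter();
        out.println("<html><body><h1>Hello from a servlet</h1></body></html>");
    }
}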
How does the API support all these kinds of programs? It does so with packages of software components that provides a wide range of functionality. Every full implementation of the Java platform gives you the following features:
- The essentials: Objects, strings, threads, numbers, input and output, data structures, system properties, date and time, and so on.
- Applets: The set of conventions used by applets.
- Networking: URLs, TCP (Transmission Control Protocol), UDP (User Datagram Protocol) sockets, and IP (Internet Protocol) addresses.
- Internationalization: Help for writing programs that can be localized for users worldwide. Programs can automatically adapt to specific locales and be displayed in the appropriate language.
- Security: Both low level and high level, including electronic signatures, public and private key management, access control, and certificates.
- Software components: Known as JavaBeansTM, can plug into existing component architectures.
- Object serialization: Allows lightweight persistence and communication via Remote Method Invocation (RMI).
- Java Database Connectivity (JDBCTM): Provides uniform access to a wide range of relational databases.
The Java platform also has APIs for 2D and 3D graphics, accessibility, servers, collaboration, telephony, speech, animation, and more. The following figure depicts what is included in the Java 2 SDK.
6.4 HOW WILL JAVA TECHNOLOGY CHANGE MY LIFE?
We can’t promise you fame, fortune, or even a job if you learn the Java programming language. Still, it is likely to make your programs better and requires less effort than other languages. We believe that Java technology will help you do the following:
- Get started quickly: Although the Java programming language is a powerful object-oriented language, it’s easy to learn, especially for programmers already familiar with C or C++.
- Write less code: Comparisons of program metrics (class counts, method counts, and so on) suggest that a program written in the Java programming language can be four times smaller than the same program in C++.
- Write better code: The Java programming language encourages good coding practices, and its garbage collection helps you avoid memory leaks. Its object orientation, its JavaBeans component architecture, and its wide-ranging, easily extendible API let you reuse other people’s tested code and introduce fewer bugs.
- Develop programs more quickly: Your development time may be as much as twice as fast as writing the same program in C++. Why? You write fewer lines of code, and it is a simpler programming language than C++.
- Avoid platform dependencies with 100% Pure Java: You can keep your program portable by avoiding the use of libraries written in other languages. The 100% Pure JavaTM Product Certification Program has a repository of historical process manuals, white papers, brochures, and similar materials online.
- Write once, run anywhere: Because 100% Pure Java programs are compiled into machine-independent byte codes, they run consistently on any Java platform.
- Distribute software more easily: You can upgrade applets easily from a central server. Applets take advantage of the feature of allowing new classes to be loaded “on the fly,” without recompiling the entire program.
6.5 ODBC:
Microsoft Open Database Connectivity (ODBC) is a standard programming interface for application developers and database systems providers. Before ODBC became a de facto standard for Windows programs to interface with database systems, programmers had to use proprietary languages for each database they wanted to connect to. Now, ODBC has made the choice of the database system almost irrelevant from a coding perspective, which is as it should be. Application developers have much more important things to worry about than the syntax that is needed to port their program from one database to another when business needs suddenly change.
Through the ODBC Administrator in Control Panel, you can specify the particular database that is associated with a data source that an ODBC application program is written to use. Think of an ODBC data source as a door with a name on it. Each door will lead you to a particular database. For example, the data source named Sales Figures might be a SQL Server database, whereas the Accounts Payable data source could refer to an Access database. The physical database referred to by a data source can reside anywhere on the LAN.
The ODBC system files are not installed on your system by Windows 95. Rather, they are installed when you set up a separate database application, such as SQL Server Client or Visual Basic 4.0. When the ODBC icon is installed in Control Panel, it uses a file called ODBCINST.DLL. It is also possible to administer your ODBC data sources through a stand-alone program called ODBCADM.EXE. There is a 16-bit and a 32-bit version of this program and each maintains a separate list of ODBC data sources.
From a programming perspective, the beauty of ODBC is that the application can be written to use the same set of function calls to interface with any data source, regardless of the database vendor. The source code of the application doesn’t change whether it talks to Oracle or SQL Server. We only mention these two as an example. There are ODBC drivers available for several dozen popular database systems. Even Excel spreadsheets and plain text files can be turned into data sources. The operating system uses the Registry information written by ODBC Administrator to determine which low-level ODBC drivers are needed to talk to the data source (such as the interface to Oracle or SQL Server). The loading of the ODBC drivers is transparent to the ODBC application program. In a client/server environment, the ODBC API even handles many of the network issues for the application programmer.
The advantages of this scheme are so numerous that you are probably thinking there must be some catch. The only disadvantage of ODBC is that it isn’t as efficient as talking directly to the native database interface. ODBC has had many detractors make the charge that it is too slow. Microsoft has always claimed that the critical factor in performance is the quality of the driver software that is used. In our humble opinion, this is true. The availability of good ODBC drivers has improved a great deal recently. And anyway, the criticism about performance is somewhat analogous to those who said that compilers would never match the speed of pure assembly language. Maybe not, but the compiler (or ODBC) gives you the opportunity to write cleaner programs, which means you finish sooner. Meanwhile, computers get faster every year.
6.6 JDBC:
In an effort to set an independent database standard API for Java, Sun Microsystems developed Java Database Connectivity, or JDBC. JDBC offers a generic SQL database access mechanism that provides a consistent interface to a variety of RDBMSs. This consistent interface is achieved through the use of “plug-in” database connectivity modules, or drivers. If a database vendor wishes to have JDBC support, he or she must provide the driver for each platform that the database and Java run on.
To gain a wider acceptance of JDBC, Sun based JDBC’s framework on ODBC. As you discovered earlier in this chapter, ODBC has widespread support on a variety of platforms. Basing JDBC on ODBC will allow vendors to bring JDBC drivers to market much faster than developing a completely new connectivity solution.
JDBC was announced in March of 1996. It was released for a 90 day public review that ended June 8, 1996. Because of user input, the final JDBC v1.0 specification was released soon after.
The remainder of this section will cover enough information about JDBC for you to know what it is about and how to use it effectively. This is by no means a complete overview of JDBC. That would fill an entire book.
6.7 JDBC Goals:
Few software packages are designed without goals in mind. JDBC is no exception: its many goals drove the development of the API. These goals, in conjunction with early reviewer feedback, have finalized the JDBC class library into a solid framework for building database applications in Java.
The goals that were set for JDBC are important. They will give you some insight as to why certain classes and functionalities behave the way they do. The eight design goals for JDBC are as follows:
SQL Level API
The designers felt that their main goal was to define a SQL interface for Java. Although not the lowest database interface level possible, it is at a low enough level for higher-level tools and APIs to be created. Conversely, it is at a high enough level for application programmers to use it confidently. Attaining this goal allows for future tool vendors to “generate” JDBC code and to hide many of JDBC’s complexities from the end user.
SQL Conformance
SQL syntax varies as you move from database vendor to database vendor. In an effort to support a wide variety of vendors, JDBC will allow any query statement to be passed through it to the underlying database driver. This allows the connectivity module to handle non-standard functionality in a manner that is suitable for its users.
JDBC must be implementable on top of common database interfaces
The JDBC SQL API must “sit” on top of other common SQL level APIs. This goal allows JDBC to use existing ODBC level drivers by the use of a software interface. This interface would translate JDBC calls to ODBC and vice versa.
- Provide a Java interface that is consistent with the rest of the Java system
Because of Java’s acceptance in the user community thus far, the designers feel that they should not stray from the current design of the core Java system.
- Keep it simple
This goal probably appears in all software design goal listings. JDBC is no exception. Sun felt that the design of JDBC should be very simple, allowing for only one method of completing a task per mechanism. Allowing duplicate functionality only serves to confuse the users of the API.
- Use strong, static typing wherever possible
Strong typing allows for more error checking to be done at compile time; also, fewer errors appear at runtime.
- Keep the common cases simple
Because, more often than not, the usual SQL calls used by the programmer are simple SELECTs, INSERTs, DELETEs and UPDATEs, these queries should be simple to perform with JDBC. However, more complex SQL statements should also be possible.
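The following sketch shows these common cases through JDBC. The connection URL points at an ODBC data source through the JDBC-ODBC bridge only as an assumption for illustration (the bridge driver class exists only on older JDKs), and the data source, table, and column names are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Sketch of the "common cases" in JDBC: a simple INSERT and SELECT.
// The data source name "SalesFigures" and the "orders" table are placeholders.
public class JdbcExample {
    public static void main(String[] args) throws Exception {
        // JDBC-ODBC bridge driver (available on older JDKs only).
        Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
        try (Connection con = DriverManager.getConnection("jdbc:odbc:SalesFigures")) {
            // Simple INSERT through a prepared statement.
            try (PreparedStatement ins = con.prepareStatement(
                    "INSERT INTO orders (id, amount) VALUES (?, ?)")) {
                ins.setInt(1, 101);
                ins.setDouble(2, 250.0);
                ins.executeUpdate();
            }
            // Simple SELECT with a bound parameter.
            try (PreparedStatement sel = con.prepareStatement(
                    "SELECT id, amount FROM orders WHERE amount > ?")) {
                sel.setDouble(1, 100.0);
                try (ResultSet rs = sel.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getInt("id") + " -> " + rs.getDouble("amount"));
                    }
                }
            }
        }
    }
}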
Finally, we decided to proceed with the implementation using Java networking, and for dynamically updating the cache table we use an MS Access database.
Java has two things: a programming language and a platform. Java is a high-level programming language that is all of the following: simple, architecture-neutral, object-oriented, portable, distributed, high-performance, interpreted, multithreaded, robust, dynamic, and secure.
Java is also unusual in that each Java program is both compiled and interpreted. With a compiler, you translate a Java program into an intermediate language called Java byte codes; the platform-independent code instructions are then passed to and run on the computer. Compilation happens just once; interpretation occurs each time the program is executed. The figure illustrates how this works.
6.8 NETWORKING TCP/IP STACK:
The TCP/IP stack is shorter than the OSI one:
TCP is a connection-oriented protocol; UDP (User Datagram Protocol) is a connectionless protocol.
IP datagrams:
The IP layer provides a connectionless and unreliable delivery system. It considers each datagram independently of the others. Any association between datagrams must be supplied by the higher layers. The IP layer supplies a checksum that includes its own header. The header includes the source and destination addresses. The IP layer handles routing through an Internet. It is also responsible for breaking up large datagrams into smaller ones for transmission and reassembling them at the other end.
UDP:
UDP is also connectionless and unreliable. What it adds to IP is a checksum for the contents of the datagram and port numbers. These are used to give a client/server model – see later.
TCP:
TCP supplies logic to give a reliable connection-oriented protocol above IP. It provides a virtual circuit that two processes can use to communicate.
Internet addresses
In order to use a service, you must be able to find it. The Internet uses an address scheme for machines so that they can be located. The address is a 32-bit integer, the IP address.
Network address:
Class A uses 8 bits for the network address with 24 bits left over for other addressing. Class B uses 16 bit network addressing. Class C uses 24 bit network addressing and class D uses all 32.
Subnet address:
Internally, the UNIX network is divided into sub networks. Building 11 is currently on one sub network and uses 10-bit addressing, allowing 1024 different hosts.
Host address:
8 bits are finally used for host addresses within our subnet. This places a limit of 256 machines that can be on the subnet.
Total address:
The 32 bit address is usually written as 4 integers separated by dots.
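As a small worked example of the addressing rules above, the sketch below packs a dotted-quad address into its 32-bit integer form and reports the traditional class from the first octet; the sample address is arbitrary.

// Sketch: pack a dotted-quad IPv4 address into its 32-bit integer form and
// report the traditional class (A-E) from the value of the first octet.
public class Ipv4Address {
    static int toInt(String dotted) {
        String[] parts = dotted.split("\\.");
        int value = 0;
        for (String p : parts) {
            value = (value << 8) | Integer.parseInt(p);  // shift in each octet
        }
        return value;
    }

    static char addressClass(String dotted) {
        int first = Integer.parseInt(dotted.split("\\.")[0]);
        if (first < 128) return 'A';   // leading bit 0
        if (first < 192) return 'B';   // leading bits 10
        if (first < 224) return 'C';   // leading bits 110
        if (first < 240) return 'D';   // leading bits 1110 (multicast)
        return 'E';                    // reserved range
    }

    public static void main(String[] args) {
        String ip = "192.168.1.10";
        System.out.printf("%s -> 0x%08X, class %c%n", ip, toInt(ip), addressClass(ip));
    }
}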
Port addresses
A service exists on a host, and is identified by its port. This is a 16 bit number. To send a message to a server, you send it to the port for that service of the host that it is running on. This is not location transparency! Certain of these ports are “well known”.
Sockets:
A socket is a data structure maintained by the system
to handle network connections. A socket is created using the call socket
. It returns an integer that is like a file
descriptor. In fact, under Windows, this handle can be used with Read File
and Write File
functions.
#include <sys/types.h>
#include <sys/socket.h>
int socket(int family, int type, int protocol);
Here “family” will be AF_INET for IP communications, protocol will be zero, and type will depend on whether TCP or UDP is used. Two processes wishing to communicate over a network create a socket each. These are similar to two ends of a pipe – but the actual pipe does not yet exist.
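Since the implementation uses Java networking, the equivalent of the socket call above is sketched below in Java; the port number and message are arbitrary choices for illustration.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

// Sketch of the two ends of a TCP connection using java.net, mirroring the
// C socket() call shown above. Port 9090 is an arbitrary illustrative choice.
public class EchoPair {
    public static void main(String[] args) throws Exception {
        final int port = 9090;
        ServerSocket server = new ServerSocket(port);

        // Client side, in a separate thread: connect and send one line.
        Thread client = new Thread(new Runnable() {
            public void run() {
                try {
                    Socket s = new Socket("localhost", port);
                    PrintWriter out = new PrintWriter(s.getOutputStream(), true);
                    out.println("hello over TCP");
                    s.close();
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        });
        client.start();

        // Server side: accept one connection and print what was received.
        Socket peer = server.accept();
        BufferedReader in = new BufferedReader(new InputStreamReader(peer.getInputStream()));
        System.out.println("received: " + in.readLine());
        peer.close();
        server.close();
    }
}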
6.9 JFREECHART:
JFreeChart is a free 100% Java chart library that makes it easy for developers to display professional quality charts in their applications. JFreeChart’s extensive feature set includes:
- A consistent and well-documented API, supporting a wide range of chart types;
- A flexible design that is easy to extend, and targets both server-side and client-side applications;
- Support for many output types, including Swing components, image files (including PNG and JPEG), and vector graphics file formats (including PDF, EPS and SVG);
JFreeChart is “open source” or, more specifically, free software. It is distributed under the terms of the GNU Lesser General Public Licence (LGPL), which permits use in proprietary applications.
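A minimal usage sketch follows, assuming a JFreeChart 1.0.x jar is on the classpath; the dataset values and output file name are illustrative only.

import java.io.File;
import org.jfree.chart.ChartFactory;
import org.jfree.chart.ChartUtilities;
import org.jfree.chart.JFreeChart;
import org.jfree.data.general.DefaultPieDataset;

// Minimal JFreeChart sketch: build a small dataset, create a pie chart,
// and write it out as a PNG image file.
public class ChartDemo {
    public static void main(String[] args) throws Exception {
        DefaultPieDataset dataset = new DefaultPieDataset();
        dataset.setValue("Reads", 62);
        dataset.setValue("Writes", 28);
        dataset.setValue("Metadata ops", 10);

        JFreeChart chart = ChartFactory.createPieChart(
                "I/O request mix", dataset, true, true, false);
        ChartUtilities.saveChartAsPNG(new File("io-mix.png"), chart, 640, 480);
    }
}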
6.9.1. Map Visualizations:
Charts showing values that relate to geographical areas. Some examples include: (a) population density in each state of the United States, (b) income per capita for each country in Europe, (c) life expectancy in each country of the world. The tasks in this project include: Sourcing freely redistributable vector outlines for the countries of the world, states/provinces in particular countries (USA in particular, but also other areas);
Creating an appropriate dataset interface (plus default implementation), a renderer, and integrating this with the existing XYPlot class in JFreeChart; testing, documenting, testing some more, documenting some more.
6.9.2. Time Series Chart Interactivity
Implement a new (to JFreeChart) feature for interactive time series charts — to display a separate control that shows a small version of ALL the time series data, with a sliding “view” rectangle that allows you to select the subset of the time series data to display in the main chart.
6.9.3. Dashboards
There is currently a lot of interest in dashboard displays. Create a flexible dashboard mechanism that supports a subset of JFreeChart chart types (dials, pies, thermometers, bars, and lines/time series) that can be delivered easily via both Java Web Start and an applet.
6.9.4. Property Editors
The property editor mechanism in JFreeChart only handles a small subset of the properties that can be set for charts. Extend (or reimplement) this mechanism to provide greater end-user control over the appearance of the charts.
CHAPTER 7
7.0 APPENDIX
7.1 SAMPLE SCREEN SHOTS:
7.2 SAMPLE SOURCE CODE:
CHAPTER 8
8.1 CONCLUSION AND FUTURE WORK:
We propose an alternative POW design for the problem of unauthorized file downloading in deduplicated cloud storage. In our design, the use of a probabilistic data structure, the Bloom filter, primarily contributes to the overhead reduction. Since the Bloom filter has been used widely in various applications and is easy to implement, our proposed POW scheme is considered realistic and can be deployed in real-world cloud storage services. Despite the use of the Bloom filter in reducing the I/O needs, the size of the Bloom filter may grow with the number of files stored in the cloud. The Bloom filter may also become so large that it needs to be partitioned, with part of it stored on disk. Thus, one possible future research focus is to develop a more succinct data structure or devise a new index mechanism such that the index (the Bloom filter in this article) can fit into memory even with a huge number of files in the cloud.
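To make the role of the Bloom filter concrete, a minimal Java sketch is given below; it is an illustrative filter with salted hash probes, not the exact construction used in the proposed POW scheme.

import java.util.BitSet;

// Minimal Bloom filter sketch: k hash probes over an m-bit array. Membership
// tests may give false positives but never false negatives, which is what
// keeps the lookup cheap compared with going to disk for every query.
public class BloomFilter {
    private final BitSet bits;
    private final int m;   // number of bits
    private final int k;   // number of hash probes per key

    public BloomFilter(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    // Derive the i-th probe position by salting the key's hash code.
    private int probe(String key, int i) {
        int h = (key + "#" + i).hashCode();
        return (h & 0x7fffffff) % m;
    }

    public void add(String key) {
        for (int i = 0; i < k; i++) bits.set(probe(key, i));
    }

    public boolean mightContain(String key) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(probe(key, i))) return false;  // definitely absent
        }
        return true;  // possibly present (false positives possible)
    }

    public static void main(String[] args) {
        BloomFilter filter = new BloomFilter(1 << 20, 5);
        filter.add("file-digest-abc123");
        System.out.println(filter.mightContain("file-digest-abc123")); // true
        System.out.println(filter.mightContain("file-digest-zzz999")); // almost certainly false
    }
}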
Privacy-Preserving Detection of Sensitive Data Exposure
We present an initiative data prefetching scheme on the storage servers in distributed file systems for cloud computing. In this prefetching technique, the client machines are not substantially involved in the process of data prefetching; instead, the storage servers can directly prefetch the data after analyzing the history of disk I/O access events, and then send the prefetched data to the relevant client machines proactively. To put this technique to work, the information about client nodes is piggybacked onto the real client I/O requests and then forwarded to the relevant storage server. Next, two prediction algorithms have been proposed to forecast future block access operations, directing what data should be fetched on storage servers in advance.
Finally, the prefetched data can be pushed to the relevant client machine from the storage server. Through a series of evaluation experiments with a collection of application benchmarks, we have demonstrated that our presented initiative prefetching technique can benefit distributed file systems in cloud environments to achieve better I/O performance. In particular, configuration-limited client machines in the cloud are not responsible for predicting I/O access operations, which definitely contributes to preferable system performance on them.
1.2 INTRODUCTION
The assimilation of distributed computing for search engines, multimedia websites, and data-intensive applications has brought about the generation of data at unprecedented speed. For instance, the amount of data created, replicated, and consumed in the United States may double every three years through the end of this decade. In general, the file system deployed in a distributed computing environment is called a distributed file system, which is always used as a backend storage system to provide I/O services for various sorts of data-intensive applications in cloud computing environments. In fact, the distributed file system employs multiple distributed I/O devices by striping file data across the I/O nodes, and uses high aggregate bandwidth to meet the growing I/O requirements of distributed and parallel scientific applications.
However, because distributed file systems scale both numerically and geographically, the network delay is becoming the dominant factor in remote file system access [26], [34]. With regard to this issue, numerous data prefetching mechanisms have been proposed to hide the latency in distributed file systems caused by network communication and disk operations. In these conventional prefetching mechanisms, the client file system (which is a part of the file system and runs on the client machine) is supposed to predict future accesses by analyzing the history of occurred I/O accesses without any application intervention. After that, the client file system may send relevant I/O requests to storage servers for reading the relevant data in. Consequently, the applications that have intensive read workloads can automatically yield not only better use of available bandwidth, but also fewer file operations via batched I/O requests through prefetching.
On the other hand, mobile devices generally have limited processing power, battery life and storage, but cloud computing offers an illusion of infinite computing resources. For combining the mobile devices and cloud computing to create a new infrastructure, the mobile cloud computing research field emerged [45]. Namely, mobile cloud computing provides mobile applications with data storage and processing services in clouds, obviating the requirement to equip a powerful hardware configuration, because all resource-intensive computing can be completed in the cloud. Thus, conventional prefetching schemes are not the best-suited optimization strategies for distributed file systems to boost I/O performance in mobile clouds, since these schemes require the client file systems running on client machines to proactively issue prefetching requests after analyzing the occurred access events recorded by them, which places negative effects on the client nodes.
Furthermore, considering that disk I/O events can reveal the disk traces that offer critical information for performing I/O optimization tactics, certain prefetching techniques have been proposed in succession to read the data on the disk in advance after analyzing disk I/O traces. But this kind of prefetching only works for local file systems, and the prefetched data is cached on the local machine to fulfill the application's I/O requests passively. In brief, although block access history reveals the behavior of disk tracks, there are no prefetching schemes on storage servers in a distributed file system for yielding better system performance. The reason for this situation is the difficulty of modeling the block access history to generate block access patterns and of deciding the destination client machine to which the prefetched data should be pushed from the storage servers.
1.3 LITERATURE SURVEY
PARTIAL REPLICATION OF METADATA TO ACHIEVE HIGH METADATA AVAILABILITY IN PARALLEL FILE SYSTEMS
AUTHOR: J. Liao, Y. Ishikawa
PUBLISH: In the Proceedings of 41st International Conference on Parallel Processing (ICPP ’12), pp. 168–177, 2012.
EXPLANATION:
This paper presents PARTE, a prototype parallel file system with active/standby configured metadata servers (MDSs). PARTE replicates and distributes a part of files' metadata to the corresponding metadata stripes on the storage servers (OSTs) with a per-file granularity; meanwhile, the client file system (client) keeps certain sent metadata requests. If the active MDS has crashed for some reason, these client backup requests will be replayed by the standby MDS to restore the lost metadata. In case one or more backup requests are lost due to network problems or dead clients, the latest metadata saved in the associated metadata stripes will be used to construct consistent and up-to-date metadata on the standby MDS. Moreover, the clients and OSTs can work in both normal mode and recovery mode in the PARTE file system. This differs from conventional parallel file systems with active/standby configured MDSs, which hang all I/O requests and metadata requests during restoration of the lost metadata. In the PARTE file system, previously connected clients can continue to perform I/O operations and relevant metadata operations, because OSTs work as temporary MDSs during that period by using the replicated metadata in the relevant metadata stripes. Through examination of experimental results, we show the feasibility of the main ideas presented in this paper for providing a high-availability metadata service with only a slight overhead effect on I/O performance. Furthermore, since previously connected clients are never hung during metadata recovery, in contrast to conventional systems, a better overall I/O data throughput can be achieved with PARTE.
EVALUATING PERFORMANCE AND ENERGY IN FILE SYSTEM SERVER WORKLOADS
AUTHOR: P. Sehgal, V. Tarasov, E. Zadok
PUBLISH: the 8th USENIX Conference on File and Storage Technologies (FAST ’10), pp.253-266, 2010.
EXPLANATION:
Recently, power has emerged as a critical factor in designing components of storage systems, especially for power-hungry data centers. While there is some research into power-aware storage stack components, there are no systematic studies evaluating each component’s impact separately. This paper evaluates the file system’s impact on energy consumption and performance. We studied several popular Linux file systems, with various mount and format options, using the FileBench workload generator to emulate four server workloads: Web, database, mail, and file server. In case of a server node consisting of a single disk, CPU power generally exceeds disk-power consumption. However, file system design, implementation, and available features have a significant effect on CPU/disk utilization, and hence on performance and power. We discovered that default file system options are often suboptimal, and even poor. We show that a careful matching of expected workloads to file system types and options can improve power-performance efficiency by a factor ranging from 1.05 to 9.4 times.
FLEXIBLE, WIDEAREA STORAGE FOR DISTRIBUTED SYSTEMS WITH WHEELFS
AUTHOR: J. Stribling, Y. Sovran, I. Zhang and R. Morris et al
PUBLISH: In Proceedings of the 6th USENIX symposium on Networked systems design and implementation (NSDI’09), USENIX Association, pp. 43–58, 2009.
EXPLANATION:
WheelFS is a wide-area distributed storage system intended to help multi-site applications share data and gain fault tolerance. WheelFS takes the form of a distributed file system with a familiar POSIX interface. Its design allows applications to adjust the tradeoff between prompt visibility of updates from other sites and the ability for sites to operate independently despite failures and long delays. WheelFS allows these adjustments via semantic cues, which provide application control over consistency, failure handling, and file and replica placement. WheelFS is implemented as a user-level file system and is deployed on PlanetLab and Emulab. Three applications (a distributed Web cache, an email service and large file distribution) demonstrate that WheelFS’s file system interface simplifies construction of distributed applications by allowing reuse of existing software. These applications would perform poorly with the strict semantics implied by a traditional file system interface, but by providing cues to WheelFS they are able to achieve good performance. Measurements show that applications built on WheelFS deliver comparable performance to services such as CoralCDN and BitTorrent that use specialized wide-area storage systems.
CHAPTER 2
2.0 SYSTEM ANALYSIS
2.1 EXISTING SYSTEM:
The file system deployed in a distributed computing environment is called a distributed file system, which is always used as a backend storage system to provide I/O services for various sorts of data-intensive applications in cloud computing environments. In fact, the distributed file system employs multiple distributed I/O devices by striping file data across the I/O nodes, and uses high aggregate bandwidth to meet the growing I/O requirements of distributed and parallel scientific applications. In the existing evaluation, the Sysbench benchmark is used to create OLTP workloads, since it is able to create OLTP workloads similar to those that exist in real systems. All the configured client file systems executed the same script, and each of them ran several threads that issue OLTP requests. Because Sysbench requires MySQL installed as a backend for OLTP workloads, the mysqld process was configured on 16 cores of the storage servers. As a consequence, it is possible to measure the response time to the client request while handling the generated workloads.
2.1.1 DISADVANTAGES:
- Network delay becomes the dominant factor in remote file system access as the system scales numerically and geographically
- Mobile devices generally have limited processing power, battery life and storage
2.2 PROPOSED SYSTEM:
Prefetching techniques have been proposed in succession to read the data on the disk in advance after analyzing disk I/O traces, but this kind of prefetching only works for local file systems, and the prefetched data is cached on the local machine to fulfill the application's I/O requests passively. In brief, although block access history reveals the behavior of disk tracks, there are no prefetching schemes on storage servers in a distributed file system for yielding better system performance. The reason for this situation is the difficulty of modeling the block access history to generate block access patterns and of deciding the destination client machine to which the prefetched data should be pushed from the storage servers. To yield attractive I/O performance in the distributed file system deployed in a mobile cloud environment or a cloud environment that has many resource-limited client machines, this paper presents an initiative data prefetching mechanism. The proposed mechanism first analyzes disk I/O traces to predict future disk I/O accesses so that the storage servers can fetch data in advance, and then forwards the prefetched data to the relevant client file systems for future potential use.
This paper makes the following two contributions:
1) Chaotic time series prediction and linear regression prediction to forecast disk I/O access. We have modeled the disk I/O access operations and classified them into two kinds of access patterns, i.e. the random access pattern and the sequential access pattern. Therefore, in order to predict the future I/O accesses that belong to the different access patterns as accurately as possible (note that the future I/O access indicates what data will be requested in the near future), two prediction algorithms, the chaotic time series prediction algorithm and the linear regression prediction algorithm, have been proposed respectively. 2) Initiative data prefetching on storage servers, without any intervention from client file systems except for piggybacking their information onto relevant I/O requests to the storage servers. The storage servers are supposed to log disk I/O accesses and classify access patterns after modeling disk I/O events. Next, by properly using the two proposed prediction algorithms, the storage servers can predict future disk I/O accesses to guide data prefetching. Finally, the storage servers proactively forward the prefetched data to the relevant client file systems to satisfy future application requests.
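A hedged sketch of the linear-regression half of this idea is given below; it is not the paper's exact algorithm, only an ordinary least-squares fit over recently logged block offsets, with the history values invented for illustration.

// Sketch of a linear-regression predictor for sequential access patterns:
// fit offset = slope*t + intercept over the last n logged block offsets
// (n >= 2) and extrapolate the next offset as a prefetching hint.
public class LinearAccessPredictor {
    public static long predictNext(long[] recentOffsets) {
        int n = recentOffsets.length;
        double sumT = 0, sumY = 0, sumTT = 0, sumTY = 0;
        for (int t = 0; t < n; t++) {
            sumT += t;
            sumY += recentOffsets[t];
            sumTT += (double) t * t;
            sumTY += (double) t * recentOffsets[t];
        }
        double slope = (n * sumTY - sumT * sumY) / (n * sumTT - sumT * sumT);
        double intercept = (sumY - slope * sumT) / n;
        return Math.round(slope * n + intercept);   // extrapolate to time step n
    }

    public static void main(String[] args) {
        long[] history = {4096, 8192, 12288, 16384};   // a sequential pattern
        System.out.println("next predicted offset: " + predictNext(history));
    }
}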
2.2.1 ADVANTAGES:
- The applications that have intensive read workloads can automatically yield better use of the available bandwidth.
- Fewer file operations via batched I/O requests through prefetching.
- Cloud computing offers an illusion of infinite computing resources.
2.3 HARDWARE & SOFTWARE REQUIREMENTS:
2.3.1 HARDWARE REQUIREMENT:
- Processor – Pentium IV
- Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Key Board – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA
2.3.2 SOFTWARE REQUIREMENTS:
JAVA
- Operating System : Windows XP or Win7
- Front End : JAVA JDK 1.7
- Script : Java Script
- Document : MS-Office 2007
CHAPTER 3
3.0 SYSTEM DESIGN:
Data Flow Diagram / Use Case Diagram / Flow Diagram:
- The DFD is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on these data, and the output data generated by the system.
- The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components. These components are the system process, the data used by the process, an external entity that interacts with the system and the information flows in the system.
- DFD shows how the information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.
- DFD is also known as bubble chart. A DFD may be used to represent a system at any level of abstraction. DFD may be partitioned into levels that represent increasing information flow and functional detail.
NOTATION:
SOURCE OR DESTINATION OF DATA:
External sources or destinations, which may be people or organizations or other entities
DATA SOURCE:
Here the data referenced by a process is stored and retrieved.
PROCESS:
People, procedures or devices that produce data. The physical component is not identified.
DATA FLOW:
Data moves in a specific direction from an origin to a destination. The data flow is a “packet” of data.
There are several common modeling rules when creating DFDs:
- All processes must have at least one data flow in and one data flow out.
- All processes should modify the incoming data, producing new forms of outgoing data.
- Each data store must be involved with at least one data flow.
- Each external entity must be involved with at least one data flow.
- A data flow must be attached to at least one process.
3.1 ARCHITECTURE DIAGRAM
3.2 DATAFLOW DIAGRAM
UML DIAGRAMS:
3.3 USE CASE DIAGRAM:
3.4 CLASS DIAGRAM:
3.5 SEQUENCE DIAGRAM:
3.6 ACTIVITY DIAGRAM:
CHAPTER 4
4.0 IMPLEMENTATION:
I/O ACCESS PREDICTION
4.1 ALGORITHM
MARKOV MODEL PREDICTION ALGORITHM
LINEAR PREDICTION ALGORITHM
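The report names the algorithms above without listings; the following is an illustrative stand-in for a first-order Markov predictor over block accesses, assuming transition counts are kept per observed block, and is not the authors' exact formulation.

import java.util.HashMap;
import java.util.Map;

// Hedged sketch of a first-order Markov predictor: count transitions between
// observed block IDs and predict the most frequent successor of a block.
public class MarkovBlockPredictor {
    private final Map<Long, Map<Long, Integer>> transitions = new HashMap<Long, Map<Long, Integer>>();
    private long previous = -1;

    // Record one observed block access and the transition from the previous one.
    public void observe(long block) {
        if (previous >= 0) {
            Map<Long, Integer> row = transitions.get(previous);
            if (row == null) {
                row = new HashMap<Long, Integer>();
                transitions.put(previous, row);
            }
            Integer count = row.get(block);
            row.put(block, count == null ? 1 : count + 1);
        }
        previous = block;
    }

    // Return the most frequently observed successor, or -1 if none is known.
    public long predictAfter(long block) {
        Map<Long, Integer> row = transitions.get(block);
        if (row == null) return -1;
        long best = -1;
        int bestCount = 0;
        for (Map.Entry<Long, Integer> e : row.entrySet()) {
            if (e.getValue() > bestCount) {
                bestCount = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        MarkovBlockPredictor p = new MarkovBlockPredictor();
        long[] trace = {10, 11, 12, 10, 11, 12, 10, 11};   // illustrative trace
        for (long b : trace) p.observe(b);
        System.out.println("after block 11, prefetch block " + p.predictAfter(11));
    }
}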
4.2 MODULES:
SERVER CLIENT MODULE:
DISTRIBUTED FILE SYSTEMS:
INITIATIVE DATA PREFETCHING:
ANALYSIS OF PREDICTIONS:
4.3 MODULE DESCRIPTION:
SERVER CLIENT MODULE:
DISTRIBUTED FILE SYSTEMS:
INITIATIVE DATA PREFETCHING:
ANALYSIS OF PREDICTIONS:
CHAPTER 5
5.0 SYSTEM STUDY:
5.1 FEASIBILITY STUDY:
The feasibility of the project is analyzed in this phase and a business proposal is put forth with a very general plan for the project and some cost estimates. During system analysis the feasibility study of the proposed system is to be carried out. This is to ensure that the proposed system is not a burden to the company. For feasibility analysis, some understanding of the major requirements for the system is essential.
Three key considerations involved in the feasibility analysis are
- ECONOMICAL FEASIBILITY
- TECHNICAL FEASIBILITY
- SOCIAL FEASIBILITY
5.1.1 ECONOMICAL FEASIBILITY:
This study is carried out to check the economic impact that the system will have on the organization. The amount of funds that the company can pour into the research and development of the system is limited. The expenditures must be justified. Thus the developed system is well within the budget, and this was achieved because most of the technologies used are freely available. Only the customized products had to be purchased.
5.1.2 TECHNICAL FEASIBILITY
This study is carried out to check the technical feasibility, that is, the technical requirements of the system. Any system developed must not place a high demand on the available technical resources, as this would lead to high demands being placed on the client. The developed system must have modest requirements, as only minimal or no changes are required for implementing this system.
5.1.3 SOCIAL FEASIBILITY:
This aspect of the study checks the level of acceptance of the system by the user. This includes the process of training the user to use the system efficiently. The user must not feel threatened by the system; instead, the user must accept it as a necessity. The level of acceptance by the users solely depends on the methods that are employed to educate the user about the system and to make the user familiar with it. The user's level of confidence must be raised so that he is also able to offer constructive criticism, which is welcomed, as he is the final user of the system.
5.2 SYSTEM TESTING:
Testing is the process of checking whether the developed system works according to the original objectives and requirements. It is a set of activities that can be planned in advance and conducted systematically. Testing is vital to the success of the system. System testing makes the logical assumption that if all parts of the system are correct, the overall goal will be achieved. Inadequate testing, or no testing at all, leads to errors that may not surface for many months.
This creates two problems: the time lag between the cause and the appearance of the problem, and the effect of the system errors on the files and records within the system. A small system error can conceivably explode into a much larger problem. Effective testing early in the process translates directly into long-term cost savings from a reduced number of errors. Another reason for system testing is its utility as a user-oriented vehicle before implementation. Even the best program is worthless if it does not produce correct outputs.
5.2.1 UNIT TESTING:
Description | Expected result |
Test for application window properties. | All the properties of the windows are to be properly aligned and displayed. |
Test for mouse operations. | All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions. |
A program represents the logical elements of a system. For a program to run satisfactorily, it must compile, process test data correctly, and tie in properly with other programs. Achieving an error-free program is the responsibility of the programmer. Program testing checks for two types of errors: syntax and logical. A syntax error is a program statement that violates one or more rules of the language in which it is written. An improperly defined field dimension or omitted keywords are common syntax errors. These errors are shown through error messages generated by the computer. For logic errors, the programmer must examine the output carefully.
5.2.2 FUNCTIONAL TESTING:
Functional testing of an application is used to prove that the application delivers correct results, using enough inputs to give an adequate level of confidence that it will work correctly for all sets of inputs. The functional testing will need to prove that the application works for each client type and that the personalization functions work correctly. When a program is tested, the actual output is compared with the expected output. When there is a discrepancy, the sequence of instructions must be traced to determine the problem. The process is facilitated by breaking the program into self-contained portions, each of which can be checked at certain key points. The idea is to compare program values against desk-calculated values to isolate the problems.
Description | Expected result |
Test for all modules. | All peers should communicate in the group. |
Test for various peers in a distributed network framework as it displays all users available in the group. | The result after execution should be accurate. |
5.2.3 NON-FUNCTIONAL TESTING:
Non-functional software testing encompasses a rich spectrum of testing strategies, describing the expected results for every test case. It uses symbolic analysis techniques. This testing is used to check that an application will work in the operational environment. Non-functional testing includes:
- Load testing
- Performance testing
- Usability testing
- Reliability testing
- Security testing
5.2.4 LOAD TESTING:
An important tool for implementing system tests is a load generator. A load generator is essential for testing quality requirements such as performance and stress. A load can be a real load, that is, the system can be put under real usage by having actual users connected to it; they will generate test input data for the system test.
Description | Expected result |
It is necessary to ascertain that the application behaves correctly under loads when ‘Server busy’ response is received. | Should designate another active node as a Server. |
5.2.5 PERFORMANCE TESTING:
Performance tests are utilized in order to determine the widely defined performance of the software system such as execution time associated with various parts of the code, response time and device utilization. The intent of this testing is to identify weak points of the software system and quantify its shortcomings.
Description | Expected result |
This is required to assure that an application performs adequately, having the capability to handle many peers, delivering its results in the expected time and using an acceptable level of resources; this is an aspect of operational management. | Should handle large input values, and produce accurate results in the expected time. |
5.2.6 RELIABILITY TESTING:
Software reliability is the ability of a system or component to perform its required functions under stated conditions for a specified period of time, and it is ensured in this testing. Reliability can be expressed as the ability of the software to reveal defects under testing conditions, according to the specified requirements. It is the probability that a software system will operate without failure under given conditions for a given time interval, and it focuses on the behavior of the software element. This forms a part of software quality control.
Description | Expected result |
This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in providing the application. | In case of failure of the server, an alternate server should take over the job. |
5.2.7 SECURITY TESTING:
Security testing evaluates system characteristics that relate to the availability, integrity and confidentiality of the system data and services. Users/Clients should be encouraged to make sure their security needs are very clearly known at requirements time, so that the security issues can be addressed by the designers and testers.
Description | Expected result |
Check that the user identification is authenticated. | In case of failure, it should not be connected in the framework. |
Check whether group keys in a tree are shared by all peers. | The peers should know the group key in the same group. |
5.2.8 WHITE BOX TESTING:
White box testing, sometimes called glass-box testing, is a test case design method that uses the control structure of the procedural design to derive test cases. Using the white box testing method, the software engineer can derive test cases. White box testing focuses on the inner structure of the software to be tested.
Description | Expected result |
Exercise all logical decisions on their true and false sides. | All the logical decisions must be valid. |
Execute all loops at their boundaries and within their operational bounds. | All the loops must be finite. |
Exercise internal data structures to ensure their validity. | All the data structures must be valid. |
5.2.9 BLACK BOX TESTING:
Black box testing, also called behavioral testing, focuses on the functional requirements of the software. That is, black box testing enables the software engineer to derive sets of input conditions that will fully exercise all functional requirements for a program. Black box testing is not an alternative to white box techniques; rather, it is a complementary approach that is likely to uncover a different class of errors than white box methods. Black box testing attempts to find errors by focusing on the inputs, outputs, and principal functions of a software module. The starting point of black box testing is either a specification or the code. The contents of the box are hidden, and the stimulated software should produce the desired results.
Description | Expected result |
To check for incorrect or missing functions. | All the functions must be valid. |
To check for interface errors. | The entire interface must function normally. |
To check for errors in data structures or external database access. | Database update and retrieval must be done properly. |
To check for initialization and termination errors. | All the functions and data structures must be initialized properly and terminated normally. |
All the above system testing strategies are carried out during development, as the documentation and institutionalization of the proposed goals and related policies are essential.
CHAPTER 6
6.0 SOFTWARE DESCRIPTION:
6.1 JAVA TECHNOLOGY:
Java technology is both a programming language and a platform.
The Java Programming Language
The Java programming language is a high-level language that can be characterized by all of the following buzzwords:
- Simple
- Architecture neutral
- Object oriented
- Portable
- Distributed
- High performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure
With most programming languages, you either compile or interpret a program so that you can run it on your computer. The Java programming language is unusual in that a program is both compiled and interpreted. With the compiler, first you translate a program into an intermediate language called Java byte codes —the platform-independent codes interpreted by the interpreter on the Java platform. The interpreter parses and runs each Java byte code instruction on the computer. Compilation happens just once; interpretation occurs each time the program is executed. The following figure illustrates how this works.
You can think of Java byte codes as the machine code instructions for the Java Virtual Machine (Java VM). Every Java interpreter, whether it’s a development tool or a Web browser that can run applets, is an implementation of the Java VM. Java byte codes help make “write once, run anywhere” possible. You can compile your program into byte codes on any platform that has a Java compiler. The byte codes can then be run on any implementation of the Java VM. That means that as long as a computer has a Java VM, the same program written in the Java programming language can run on Windows 2000, a Solaris workstation, or on an iMac.
6.2 THE JAVA PLATFORM:
A platform is the hardware or software environment in which a program runs. We’ve already mentioned some of the most popular platforms like Windows 2000, Linux, Solaris, and MacOS. Most platforms can be described as a combination of the operating system and hardware. The Java platform differs from most other platforms in that it’s a software-only platform that runs on top of other hardware-based platforms.
The Java platform has two components:
- The Java Virtual Machine (Java VM)
- The Java Application Programming Interface (Java API)
You’ve already been introduced to the Java VM. It’s the base for the Java platform and is ported onto various hardware-based platforms.
The Java API is a large collection of ready-made software components that provide many useful capabilities, such as graphical user interface (GUI) widgets. The Java API is grouped into libraries of related classes and interfaces; these libraries are known as packages. The next section, What Can Java Technology Do?, highlights what functionality some of the packages in the Java API provide.
The following figure depicts a program that’s running on the Java platform. As the figure shows, the Java API and the virtual machine insulate the program from the hardware.
Native code is code that, once compiled, runs only on a specific hardware platform. As a platform-independent environment, the Java platform can be a bit slower than native code. However, smart compilers, well-tuned interpreters, and just-in-time byte code compilers can bring performance close to that of native code without threatening portability.
6.3 WHAT CAN JAVA TECHNOLOGY DO?
The most common types of programs written in the Java programming language are applets and applications. If you’ve surfed the Web, you’re probably already familiar with applets. An applet is a program that adheres to certain conventions that allow it to run within a Java-enabled browser.
However, the Java programming language is not just for writing cute, entertaining applets for the Web. The general-purpose, high-level Java programming language is also a powerful software platform. Using the generous API, you can write many types of programs.
An application is a standalone program that runs directly on the Java platform. A special kind of application known as a server serves and supports clients on a network. Examples of servers are Web servers, proxy servers, mail servers, and print servers. Another specialized program is a servlet.
A servlet can almost be thought of as an applet that runs on the server side. Java Servlets are a popular choice for building interactive web applications, replacing the use of CGI scripts. Servlets are similar to applets in that they are runtime extensions of applications. Instead of working in browsers, though, servlets run within Java Web servers, configuring or tailoring the server.
How does the API support all these kinds of programs? It does so with packages of software components that provide a wide range of functionality. Every full implementation of the Java platform gives you the following features:
- The essentials: Objects, strings, threads, numbers, input and output, data structures, system properties, date and time, and so on.
- Applets: The set of conventions used by applets.
- Networking: URLs, TCP (Transmission Control Protocol), UDP (User Datagram Protocol) sockets, and IP (Internet Protocol) addresses.
- Internationalization: Help for writing programs that can be localized for users worldwide. Programs can automatically adapt to specific locales and be displayed in the appropriate language.
- Security: Both low level and high level, including electronic signatures, public and private key management, access control, and certificates.
- Software components: Known as JavaBeansTM, can plug into existing component architectures.
- Object serialization: Allows lightweight persistence and communication via Remote Method Invocation (RMI).
- Java Database Connectivity (JDBCTM): Provides uniform access to a wide range of relational databases.
The Java platform also has APIs for 2D and 3D graphics, accessibility, servers, collaboration, telephony, speech, animation, and more. The following figure depicts what is included in the Java 2 SDK.
6.4 HOW WILL JAVA TECHNOLOGY CHANGE MY LIFE?
We can’t promise you fame, fortune, or even a job if you learn the Java programming language. Still, it is likely to make your programs better and require less effort than other languages. We believe that Java technology will help you do the following:
- Get started quickly: Although the Java programming language is a powerful object-oriented language, it’s easy to learn, especially for programmers already familiar with C or C++.
- Write less code: Comparisons of program metrics (class counts, method counts, and so on) suggest that a program written in the Java programming language can be four times smaller than the same program in C++.
- Write better code: The Java programming language encourages good coding practices, and its garbage collection helps you avoid memory leaks. Its object orientation, its JavaBeans component architecture, and its wide-ranging, easily extendible API let you reuse other people’s tested code and introduce fewer bugs.
- Develop programs more quickly: Your development time may be as much as twice as fast as when writing the same program in C++. Why? You write fewer lines of code, and Java is a simpler programming language than C++.
- Avoid platform dependencies with 100% Pure Java: You can keep your program portable by avoiding the use of libraries written in other languages. The 100% Pure JavaTM Product Certification Program has a repository of historical process manuals, white papers, brochures, and similar materials online.
- Write once, run anywhere: Because 100% Pure Java programs are compiled into machine-independent byte codes, they run consistently on any Java platform.
- Distribute software more easily: You can upgrade applets easily from a central server. Applets take advantage of the feature of allowing new classes to be loaded “on the fly,” without recompiling the entire program.
6.5 ODBC:
Microsoft Open Database Connectivity (ODBC) is a standard programming interface for application developers and database systems providers. Before ODBC became a de facto standard for Windows programs to interface with database systems, programmers had to use proprietary languages for each database they wanted to connect to. Now, ODBC has made the choice of the database system almost irrelevant from a coding perspective, which is as it should be. Application developers have much more important things to worry about than the syntax that is needed to port their program from one database to another when business needs suddenly change.
Through the ODBC Administrator in Control Panel, you can specify the particular database that is associated with a data source that an ODBC application program is written to use. Think of an ODBC data source as a door with a name on it. Each door will lead you to a particular database. For example, the data source named Sales Figures might be a SQL Server database, whereas the Accounts Payable data source could refer to an Access database. The physical database referred to by a data source can reside anywhere on the LAN.
The ODBC system files are not installed on your system by Windows 95. Rather, they are installed when you set up a separate database application, such as SQL Server Client or Visual Basic 4.0. When the ODBC icon is installed in Control Panel, it uses a file called ODBCINST.DLL. It is also possible to administer your ODBC data sources through a stand-alone program called ODBCADM.EXE. There is a 16-bit and a 32-bit version of this program and each maintains a separate list of ODBC data sources.
From a programming perspective, the beauty of ODBC is that the application can be written to use the same set of function calls to interface with any data source, regardless of the database vendor. The source code of the application doesn’t change whether it talks to Oracle or SQL Server. We only mention these two as an example. There are ODBC drivers available for several dozen popular database systems. Even Excel spreadsheets and plain text files can be turned into data sources. The operating system uses the Registry information written by ODBC Administrator to determine which low-level ODBC drivers are needed to talk to the data source (such as the interface to Oracle or SQL Server). The loading of the ODBC drivers is transparent to the ODBC application program. In a client/server environment, the ODBC API even handles many of the network issues for the application programmer.
The advantages
of this scheme are so numerous that you are probably thinking there must be
some catch. The only disadvantage of ODBC is that it isn’t as efficient as
talking directly to the native database interface. ODBC has had many detractors
make the charge that it is too slow. Microsoft has always claimed that the
critical factor in performance is the quality of the driver software that is
used. In our humble opinion, this is true. The availability of good ODBC
drivers has improved a great deal recently. And anyway, the criticism about
performance is somewhat analogous to those who said that compilers would never
match the speed of pure assembly language. Maybe not, but the compiler (or
ODBC) gives you the opportunity to write cleaner programs, which means you
finish sooner. Meanwhile, computers get faster every year.
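As a rough illustration of how a Java program can reach an ODBC data source, the sketch below uses the JDBC-ODBC bridge driver that shipped with the JDK up to version 7; the data source name SalesFigures is only an assumed example registered through the ODBC Administrator.
import java.sql.Connection;
import java.sql.DriverManager;

public class OdbcBridgeDemo {
    public static void main(String[] args) throws Exception {
        // Load the JDBC-ODBC bridge driver (shipped with the JDK up to Java 7).
        Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
        // "SalesFigures" is a hypothetical DSN registered in the ODBC Administrator.
        Connection con = DriverManager.getConnection("jdbc:odbc:SalesFigures");
        System.out.println("Connected via: " + con.getMetaData().getDriverName());
        con.close();
    }
}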
6.6 JDBC:
In an effort to set an independent database standard API for Java, Sun Microsystems developed Java Database Connectivity, or JDBC. JDBC offers a generic SQL database access mechanism that provides a consistent interface to a variety of RDBMSs. This consistent interface is achieved through the use of “plug-in” database connectivity modules, or drivers. If a database vendor wishes to have JDBC support, he or she must provide the driver for each platform that the database and Java run on.
To gain a wider acceptance of JDBC, Sun based JDBC’s framework on ODBC. As you discovered earlier in this chapter, ODBC has widespread support on a variety of platforms. Basing JDBC on ODBC will allow vendors to bring JDBC drivers to market much faster than developing a completely new connectivity solution.
JDBC was announced in March of 1996. It was released for a 90 day public review that ended June 8, 1996. Because of user input, the final JDBC v1.0 specification was released soon after.
The remainder of this section will cover enough information about JDBC for you to know what it is about and how to use it effectively. This is by no means a complete overview of JDBC. That would fill an entire book.
6.7 JDBC Goals:
Few software packages are designed without goals in mind. JDBC is no exception; its many goals drove the development of the API. These goals, in conjunction with early reviewer feedback, have finalized the JDBC class library into a solid framework for building database applications in Java.
The goals that were set for JDBC are important. They will give you some insight as to why certain classes and functionalities behave the way they do. The eight design goals for JDBC are as follows:
- SQL Level API
The designers felt that their main goal was to define a SQL interface for Java. Although not the lowest database interface level possible, it is at a low enough level for higher-level tools and APIs to be created. Conversely, it is at a high enough level for application programmers to use it confidently. Attaining this goal allows for future tool vendors to “generate” JDBC code and to hide many of JDBC’s complexities from the end user.
- SQL Conformance
SQL syntax varies as you move from database vendor to database vendor. In an effort to support a wide variety of vendors, JDBC will allow any query statement to be passed through it to the underlying database driver. This allows the connectivity module to handle non-standard functionality in a manner that is suitable for its users.
- JDBC must be implementable on top of common database interfaces
The JDBC SQL API must “sit” on top of other common SQL level APIs. This goal allows JDBC to use existing ODBC level drivers by the use of a software interface. This interface would translate JDBC calls to ODBC and vice versa.
- Provide a Java interface that is consistent with the rest of the Java system
Because of Java’s acceptance in the user community thus far, the designers feel that they should not stray from the current design of the core Java system.
- Keep it simple
This goal probably appears in all software design goal listings. JDBC is no exception. Sun felt that the design of JDBC should be very simple, allowing for only one method of completing a task per mechanism. Allowing duplicate functionality only serves to confuse the users of the API.
- Use strong, static typing wherever possible
Strong typing allows for more error checking to be done at compile time; also, fewer errors appear at runtime.
- Keep the common cases simple
Because more often than not, the usual SQL calls used by the programmer are simple SELECTs, INSERTs, DELETEs and UPDATEs, these queries should be simple to perform with JDBC. However, more complex SQL statements should also be possible.
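To make the common case concrete, here is a minimal sketch of a simple SELECT issued through JDBC; the connection URL, credentials and table are assumptions for illustration only.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class SimpleSelectDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical MySQL URL, credentials and table; adjust for the actual back end.
        Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/imagedb", "user", "password");
        PreparedStatement ps = con.prepareStatement(
                "SELECT id, name FROM users WHERE name LIKE ?");
        ps.setString(1, "A%");
        ResultSet rs = ps.executeQuery();
        while (rs.next()) {
            System.out.println(rs.getInt("id") + " " + rs.getString("name"));
        }
        rs.close();
        ps.close();
        con.close();
    }
}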
Finally, we decided to proceed with the implementation using Java Networking, and for dynamically updating the cache table we use an MS Access database.
Java has two things: a programming language and a platform.
Java is a high-level programming language that is all of the following:
- Simple
- Architecture-neutral
- Object-oriented
- Portable
- Distributed
- High-performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure
Java is also unusual in that each Java program is both compiled and interpreted. With a compiler, you translate a Java program into an intermediate language called Java byte codes; these platform-independent code instructions are then passed to and run on the computer.
Compilation happens just once; interpretation occurs each time the program is executed. The figure illustrates how this works.
6.8 NETWORKING TCP/IP STACK:
The TCP/IP stack is shorter than the OSI one:
TCP is a connection-oriented protocol; UDP (User Datagram Protocol) is a connectionless protocol.
IP datagrams:
The IP layer provides a connectionless and unreliable delivery system. It considers each datagram independently of the others. Any association between datagrams must be supplied by the higher layers. The IP layer supplies a checksum that includes its own header. The header includes the source and destination addresses. The IP layer handles routing through an Internet. It is also responsible for breaking up large datagrams into smaller ones for transmission and reassembling them at the other end.
UDP:
UDP is also connectionless and unreliable. What it adds to IP is a checksum for the contents of the datagram and port numbers. These are used to give a client/server model – see later.
TCP:
TCP supplies logic to give a reliable connection-oriented protocol above IP. It provides a virtual circuit that two processes can use to communicate.
Internet addresses
In order to use a service, you must be able to find it. The Internet uses an address scheme for machines so that they can be located. The address is a 32 bit integer which gives the IP address.
Network address:
Class A uses 8 bits for the network address with 24 bits left over for other addressing. Class B uses 16 bit network addressing. Class C uses 24 bit network addressing and class D uses all 32.
Subnet address:
Internally, the UNIX network is divided into sub networks. Building 11 is currently on one sub network and uses 10-bit addressing, allowing 1024 different hosts.
Host address:
8 bits are finally used for host addresses within our subnet. This places a limit of 256 machines that can be on the subnet.
Total address:
The 32 bit address is usually written as 4 integers separated by dots.
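As a small illustration, the following sketch converts a 32-bit address into the dotted notation; the sample value is arbitrary.
public class DottedQuad {
    public static void main(String[] args) {
        int address = 0xC0A80001;   // 192.168.0.1 packed into a 32-bit integer (example value)
        String dotted = ((address >>> 24) & 0xFF) + "." + ((address >>> 16) & 0xFF) + "."
                + ((address >>> 8) & 0xFF) + "." + (address & 0xFF);
        System.out.println(dotted);  // prints 192.168.0.1
    }
}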
Port addresses
A service exists on a host, and is identified by its port. This is a 16 bit number. To send a message to a server, you send it to the port for that service of the host that it is running on. This is not location transparency! Certain of these ports are “well known”.
Sockets:
A socket is a data structure maintained by the system to handle network connections. A socket is created using the call socket(). It returns an integer that is like a file descriptor. In fact, under Windows, this handle can be used with the ReadFile and WriteFile functions.
#include <sys/types.h>
#include <sys/socket.h>
int socket(int family, int type, int protocol);
Here “family” will be AF_INET for IP communications, “protocol” will be zero, and “type” will depend on whether TCP or UDP is used. Two processes wishing to communicate over a network create a socket each. These are similar to two ends of a pipe – but the actual pipe does not yet exist.
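The same two-endpoint idea carries over to Java, where the socket details are wrapped by the java.net classes. Below is a minimal single-machine sketch; the port number 5000 is an arbitrary choice.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

public class TcpDemo {
    public static void main(String[] args) throws Exception {
        ServerSocket server = new ServerSocket(5000);      // one end of the "pipe"
        Socket client = new Socket("localhost", 5000);     // the other end connects
        Socket accepted = server.accept();                 // the connection now exists

        PrintWriter out = new PrintWriter(client.getOutputStream(), true);
        out.println("hello over TCP");

        BufferedReader in = new BufferedReader(
                new InputStreamReader(accepted.getInputStream()));
        System.out.println("server received: " + in.readLine());

        client.close();
        accepted.close();
        server.close();
    }
}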
6.9 JFREE CHART:
JFreeChart is a free 100% Java chart library that makes it easy for developers to display professional quality charts in their applications. JFreeChart’s extensive feature set includes:
- A consistent and well-documented API, supporting a wide range of chart types;
- A flexible design that is easy to extend, and targets both server-side and client-side applications;
- Support for many output types, including Swing components, image files (including PNG and JPEG), and vector graphics file formats (including PDF, EPS and SVG);
JFreeChart is “open source” or, more specifically, free software. It is distributed under the terms of the GNU Lesser General Public Licence (LGPL), which permits use in proprietary applications.
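A minimal usage sketch, assuming JFreeChart 1.0.x is on the classpath; the category labels and values are invented for illustration.
import java.io.File;
import org.jfree.chart.ChartFactory;
import org.jfree.chart.ChartUtilities;
import org.jfree.chart.JFreeChart;
import org.jfree.data.general.DefaultPieDataset;

public class PieChartDemo {
    public static void main(String[] args) throws Exception {
        DefaultPieDataset dataset = new DefaultPieDataset();
        dataset.setValue("Public images", 62);   // invented values
        dataset.setValue("Private images", 38);

        JFreeChart chart = ChartFactory.createPieChart(
                "Sample Privacy Breakdown", dataset, true, true, false);
        // Write the chart to a PNG file instead of showing it in a Swing component.
        ChartUtilities.saveChartAsPNG(new File("privacy-pie.png"), chart, 500, 400);
    }
}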
6.9.1. Map Visualizations:
Charts showing values that relate to geographical areas. Some examples include: (a) population density in each state of the United States, (b) income per capita for each country in Europe, (c) life expectancy in each country of the world. The tasks in this project include: Sourcing freely redistributable vector outlines for the countries of the world, states/provinces in particular countries (USA in particular, but also other areas);
Creating an appropriate dataset interface (plus default implementation), a renderer, and integrating this with the existing XYPlot class in JFreeChart; Testing, documenting, testing some more, documenting some more.
6.9.2. Time Series Chart Interactivity
Implement a new (to JFreeChart) feature for interactive time series charts — to display a separate control that shows a small version of ALL the time series data, with a sliding “view” rectangle that allows you to select the subset of the time series data to display in the main chart.
6.9.3. Dashboards
There is currently a lot of interest in dashboard displays. Create a flexible dashboard mechanism that supports a subset of JFreeChart chart types (dials, pies, thermometers, bars, and lines/time series) that can be delivered easily via both Java Web Start and an applet.
6.9.4. Property Editors
The property editor mechanism in JFreeChart only
handles a small subset of the properties that can be set for charts. Extend (or
reimplement) this mechanism to provide greater end-user control over the
appearance of the charts.
CHAPTER 7
7.0 APPENDIX
7.1 SAMPLE SCREEN SHOTS:
7.2 SAMPLE SOURCE CODE:
CHAPTER 8
8.1 CONCLUSION AND FUTURE WORK:
We have proposed, implemented and evaluated an initiative data prefetching approach on the storage servers for distributed file systems, which can be employed as a backend storage system in a cloud environment that may have certain resource-limited client machines. To be specific, the storage servers are capable of predicting future disk I/O access to guide fetching data in advance after analyzing the existing logs, and then they proactively push the prefetched data to relevant client file systems for satisfying future applications’ requests.
For the purpose of effectively modeling disk I/O access patterns and accurately forwarding the prefetched data, the information about client file systems is piggybacked onto relevant I/O requests, and then transferred from client nodes to corresponding storage server nodes. Therefore, the client file systems running on the client nodes neither log I/O events nor conduct I/O access prediction; consequently, the thin client nodes can focus on performing necessary tasks with limited computing capacity and energy endurance.
The initiative prefetching scheme can be applied in the distributed file system for a mobile cloud computing environment, in which there are many tablet computers and smart terminals. The current implementation of our proposed initiative prefetching scheme can classify only two access patterns and support two corresponding prediction algorithms for predicting future disk I/O access. We are planning to work on classifying patterns for a wider range of application benchmarks in the future by utilizing the horizontal visibility graph technique. Applying network delay aware replica selection techniques for reducing network transfer time when prefetching data among several replicas is another task in our future work.
Privacy Policy Inference of User-Uploaded Images on Content Sharing Sites
With the increasing volume of images users share through social sites, maintaining privacy has become a major problem, as demonstrated by a recent wave of publicized incidents where users inadvertently shared personal information. In light of these incidents, the need for tools to help users control access to their shared content is apparent. Toward addressing this need, we propose an Adaptive Privacy Policy Prediction (A3P) system to help users compose privacy settings for their images. We examine the role of social context, image content, and metadata as possible indicators of users’ privacy preferences.
We propose a two-level
framework which according to the user’s available history on the site,
determines the best available privacy policy for the user’s images being
uploaded. Our solution relies on an image classification framework for image
categories which may be associated with similar policies, and on a policy
prediction algorithm to automatically generate a policy for each newly uploaded
image, also according to users’ social features. Over time, the generated
policies will follow the evolution of users’ privacy attitude. We provide the
results of our extensive evaluation over 5,000 policies, which demonstrate the
effectiveness of our system, with prediction accuracies over 90 percent.
1.2 INTRODUCTION
Images are now one of the key enablers of users’ connectivity. Sharing takes place both among previously established groups of known people or social circles (e.g., Google+, Flickr or Picasa), and also increasingly with people outside the users’ social circles, for purposes of social discovery, to help them identify new peers and learn about peers’ interests and social surroundings. However, semantically rich images may reveal content-sensitive information. Consider a photo of a student’s 2012 graduation ceremony, for example.
It could be shared within a Google+ circle or Flickr group, but may unnecessarily expose the student’s family members and other friends. Sharing images within online content sharing sites, therefore, may quickly lead to unwanted disclosure and privacy violations. Further, the persistent nature of online media makes it possible for other users to collect rich aggregated information about the owner of the published content and the subjects in the published content. The aggregated information can result in unexpected exposure of one’s social environment and lead to abuse of one’s personal information.
Most content sharing websites allow users to enter
their privacy preferences. Unfortunately, recent studies have shown that users
struggle to set up and maintain such privacy settings. One of the main reasons
provided is that given the amount of shared information this process can be
tedious and error-prone. Therefore, many have acknowledged the need for policy
recommendation systems which can assist users to easily and properly configure
privacy settings. However, existing proposals for automating privacy settings
appear to be inadequate to address the unique privacy needs of images due to the
amount of information implicitly carried within images, and their relationship
with the online environment wherein they are exposed.
1.3 LITERATURE SURVEY
TITLE NAME: SHEEPDOG: GROUP AND TAG RECOMMENDATION FOR FLICKR PHOTOS BY AUTOMATIC SEARCH-BASED LEARNING
AUTHOR: H.-M. Chen, M.-H. Chang, P.-C. Chang, M.-C. Tien, W. H. Hsu, and J.-L. Wu,
PUBLISH: Proc. 16th ACM Int. Conf. Multimedia, 2008, pp. 737–740.
EXPLANATION:
Online photo albums have been prevalent in recent
years and have resulted in more and more applications developed to provide
convenient functionalities for photo sharing. In this paper, we propose a
system named SheepDog to automatically add photos into appropriate groups and
recommend suitable tags for users on Flickr. We adopt concept detection to
predict relevant concepts of a photo and probe into the issue about training
data collection for concept classification. From the perspective of gathering
training data by web searching, we introduce two mechanisms and investigate
their performances of concept detection. Based on some existing information
from Flickr, a ranking-based method is applied not only to obtain reliable
training data, but also to provide reasonable group/tag recommendations for
input photos. We evaluate this system with a rich set of photos and the results
demonstrate the effectiveness of our work.
TITLE NAME: CONNECTING CONTENT TO COMMUNITY IN SOCIAL MEDIA VIA IMAGE CONTENT, USER TAGS AND USER COMMUNICATION
AUTHOR: M. D. Choudhury, H. Sundaram, Y.-R. Lin, A. John, and D. D. Seligmann
PUBLISH: Proc. IEEE Int. Conf. Multimedia Expo, 2009, pp.1238–1241.
EXPLANATION:
In this paper we develop a
recommendation framework to connect image content with communities in online
social media. The problem is important because users are looking for useful
feedback on their uploaded content, but finding the right community for
feedback is challenging for the end user. Social media are characterized by
both content and community. Hence, in our approach, we characterize images through
three types of features: visual features, user generated text tags, and social
interaction (user communication history in the form of comments). A
recommendation framework based on learning a latent space representation of the
groups is developed to recommend the most likely groups for a given image. The
model was tested on a large corpus of Flickr images comprising 15,689 images.
Our method outperforms the baseline method, with a mean precision 0.62 and mean
recall 0.69. Importantly, we show that fusing image content, text tags with
social interaction features outperforms the case of only using image content or
tags.
TITLE NAME: ANALYSING FACEBOOK FEATURES TO SUPPORT EVENT DETECTION FOR PHOTO-BASED FACEBOOK APPLICATIONS
AUTHOR: M. Rabbath, P. Sandhaus, and S. Boll,
PUBLISH: Proc. 2nd ACM Int. Conf. Multimedia Retrieval, 2012, pp. 11:1–11:8.
EXPLANATION:
Facebook witnesses an explosion of
the number of shared photos: With 100 million photo uploads a day it creates as
much as a whole Flickr every two months in terms of volume. Facebook also has
one of the healthiest platforms to support third party applications, many of
which deal with photos and related events. While it is essential for many
Facebook applications, until now there is no easy way to detect and link photos
that are related to the same events, which are usually distributed between
friends and albums. In this work, we introduce an approach that exploits
Facebook features to link photos related to the same event. In the current situation
where the EXIF header of photos is missing in Facebook, we extract
visual-based, tagged areas-based, friendship-based and structure-based
features. We evaluate each of these features and use the results in our
approach. We introduce and evaluate a semi-supervised probabilistic approach
that takes into account the evaluation of these features. In this approach we
create a lookup table of the initialization values of our model variables and
make it available for other Facebook applications or researchers to use. The
evaluation of our approach showed promising results and it outperformed the
baseline method of using the unsupervised EM algorithm in estimating
the parameters of a Gaussian mixture model. We also give two examples of the
applicability of this approach to help Facebook applications in better serving
the user.
CHAPTER 2
2.0 SYSTEM ANALYSIS
2.1 EXISTING SYSTEM:
Image content sharing environments such as Flickr or YouTube contain a large amount of private resources such as photos showing weddings, family holidays, and private parties. These resources can be of a highly sensitive nature, disclosing many details of the users’ private sphere. In order to support users in making privacy decisions in the context of image sharing and to provide them with a better overview of privacy-related visual content available on the Web, we propose techniques to automatically detect private images and to enable privacy-oriented image search.
To this end, we learn privacy classifiers trained on a large set of manually assessed Flickr photos, combining textual metadata of images with a variety of visual features. We employ the resulting classification models for specifically searching for private photos, and for diversifying query results to provide users with a better coverage of private and public content. Most content sharing websites allow users to enter their privacy preferences. Unfortunately, recent studies have shown that users struggle to set up and maintain such privacy settings.
- One of the main reasons provided is that, given the amount of shared information, this process can be tedious and error-prone. This motivates the need for policy recommendation systems which can assist users to easily and properly configure privacy settings.
2.1.1 DISADVANTAGES:
- Sharing images within online content sharing sites, therefore, may quickly lead to unwanted disclosure and privacy violations.
- Further, the persistent nature of online media makes it possible for other users to collect rich aggregated information about the owner of the published content and the subjects in the published content.
- The aggregated information can result in unexpected exposure of one’s social environment and lead to abuse of one’s personal information.
2.2 PROPOSED SYSTEM:
We propose an Adaptive Privacy Policy Prediction (A3P) system which aims to provide users a hassle free privacy settings experience by automatically generating personalized policies. The A3P system handles user uploaded images, and factors in the following criteria that influence one’s privacy settings of images:
The impact of social environment and personal characteristics: Social context of users, such as their profile information and relationships with others may provide useful information regarding users’ privacy preferences. For example, users interested in photography may like to share their photos with other amateur photographers. Users who have several family members among their social contacts may share with them pictures related to family events. However, using common policies across all users or across users with similar traits may be too simplistic and not satisfy individual preferences.
Users may have drastically different opinions even on the same type of images. For example, a privacy adverse person may be willing to share all his personal images while a more conservative person may just want to share personal images with his family members. In light of these considerations, it is important to find the balancing point between the impact of social environment and users’ individual characteristics in order to predict the policies that match each individual’s needs.
The role of image’s content and metadata: In general, similar images often incur similar privacy preferences, especially when people appear in the images. For example, one may upload several photos of his kids and specify that only his family members are allowed to see these photos. He may upload some other photos of landscapes which he took as a hobby and for these photos, he may set privacy preference allowing anyone to view and comment the photos. Analyzing the visual content may not be sufficient to capture users’ privacy preferences. Tags and other metadata are indicative of the social context of the image, including where it was taken and why, and also provide a synthetic description of images, complementing the information obtained from visual content analysis.
2.2.1 ADVANTAGES:
- The A3P-core focuses on analyzing each individual user’s own images and metadata, while the A3P-Social offers a community perspective of privacy setting recommendations for a user’s potential privacy improvement.
- Our algorithm in A3P-core (that is now parameterized based on user groups and also factors in possible outliers), and a new A3P-social module that develops the notion of social context to refine and extend the prediction power of our system.
- We design the interaction flows between the two building blocks to balance the benefits from meeting personal characteristics and obtaining community advice.
2.3 HARDWARE & SOFTWARE REQUIREMENTS:
2.3.1 HARDWARE REQUIREMENT:
- Processor – Pentium IV
- Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Key Board – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA
2.3.2 SOFTWARE REQUIREMENTS:
- Operating System : Windows XP or Win7
- Front End : JAVA JDK 1.7
- Back End : MYSQL Server
- Server : Apache Tomcat Server
- Script : JSP Script
- Document : MS-Office 2007
CHAPTER 3
3.0 SYSTEM DESIGN:
Data Flow Diagram / Use Case Diagram / Flow Diagram:
- The DFD is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on these data, and the output data generated by the system.
- The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components. These components are the system process, the data used by the process, an external entity that interacts with the system and the information flows in the system.
- DFD shows how the information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.
- DFD is also known as bubble chart. A DFD may be used to represent a system at any level of abstraction. DFD may be partitioned into levels that represent increasing information flow and functional detail.
NOTATION:
SOURCE OR DESTINATION OF DATA:
External sources or destinations, which may be people or organizations or other entities
DATA SOURCE:
Here the data referenced by a process is stored and retrieved.
PROCESS:
People, procedures or devices that produce data; the physical component is not identified.
DATA FLOW:
Data moves in a specific direction from an origin to a destination. The data flow is a “packet” of data.
There are several common modeling rules when creating DFDs:
- All processes must have at least one data flow in and one data flow out.
- All processes should modify the incoming data, producing new forms of outgoing data.
- Each data store must be involved with at least one data flow.
- Each external entity must be involved with at least one data flow.
- A data flow must be attached to at least one process.
3.1 ARCHITECTURE DIAGRAM
3.2 DATAFLOW DIAGRAM
ADMIN:
USER:
UML DIAGRAMS:
3.3 USE CASE DIAGRAM:
ADMIN:
USER:
3.4 CLASS DIAGRAM:
ADMIN:
USER:
3.5 SEQUENCE DIAGRAM:
ADMIN:
USER:
3.6 ACTIVITY DIAGRAM:
ADMIN:
USER:
CHAPTER 4
4.0 IMPLEMENTATION:
A3P-CORE
There are two major components in A3P-core: (i) Image classification and (ii) Adaptive policy prediction. For each user, his/her images are first classified based on content and metadata. Then, privacy policies of each category of images are analyzed for the policy prediction. Adopting a two-stage approach is more suitable for policy recommendation than applying the common one-stage data mining approaches to mine both image features and policies together. Recall that when a user uploads a new image, the user is waiting for a recommended policy.
The two-stage approach allows the system to employ
the first stage to classify the new image and find the candidate sets of images
for the subsequent policy recommendation. As for the one-stage mining approach,
it would not be able to locate the right class of the new image because its
classification criteria need both image features and policies whereas the
policies of the new image are not available yet. Moreover, combining both image
features and policies into a single classifier would lead to a system which is
very dependent on the specific syntax of the policy. If a change in the
supported policies were to be introduced, the whole learning model would need
to change.
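The control flow of the two stages can be sketched roughly as follows; the class, method names and return values are purely illustrative placeholders, not part of the actual A3P implementation.
import java.util.Arrays;
import java.util.List;

public class TwoStageSketch {
    // Stage 1: place the new image into a category using content and metadata only.
    static String classify(String imagePath) {
        // ... content and metadata classification would happen here ...
        return "scenery";   // hypothetical category label
    }

    // Stage 2: mine the policies already attached to images of that category.
    static String predictPolicy(String category, List<String> categoryPolicies) {
        // ... policy mining over categoryPolicies would happen here ...
        return categoryPolicies.isEmpty() ? "default-policy" : categoryPolicies.get(0);
    }

    public static void main(String[] args) {
        String category = classify("new-upload.jpg");
        String policy = predictPolicy(category, Arrays.asList("family-only: view"));
        System.out.println("Recommended policy: " + policy);
    }
}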
A3P-SOCIAL
The A3P-social employs a multi-criteria inference mechanism that generates representative policies by leveraging key information related to the user’s social context and his general attitude toward privacy. As mentioned earlier, A3P-social will be invoked by the A3P-core in two scenarios. One is when the user is a newbie of a site, and does not have enough images stored for the A3P-core to infer meaningful and customized policies. The other is when the system notices significant changes of privacy trend in the user’s social circle, which may be of interest for the user to possibly adjust his/her privacy settings accordingly. In what follows, we first present the types of social context considered by A3P-Social, and then present the policy recommendation process.
4.1 ALGORITHM
Our algorithm performs better for users with certain characteristics. Therefore, we study possible factors relevant to the performance of our algorithm. We used a least squares multiple regression analysis, regressing performance of the A3P-core to the following possible predictors:
4.2 MODULES:
WEB-BASED IMAGE SHARING SERVICES:
METADATA-BASED CLASSIFICATION:
CONTENT-BASED CLASSIFICATION:
ADAPTIVE POLICY PREDICTION:
4.3 MODULE DESCRIPTION:
WEB-BASED IMAGE SHARING SERVICES:
Sharing images within online content sharing sites, therefore, may quickly lead to unwanted disclosure and privacy violations. Further, the persistent nature of online media makes it possible for other users to collect rich aggregated information about the owner of the published content and the subjects in the published content. The aggregated information can result in unexpected exposure of one’s social environment and lead to abuse of one’s personal information. We expected that frequency of sharing pictures and frequency of changing privacy settings would be significantly related to performance, but the results indicate that the frequency of social network use, frequency of uploading images and frequency of changing settings are not related to the performance our algorithm obtains with privacy settings predictions. This is a particularly useful result as it indicates that our algorithm will perform equally well for users who frequently use and share images on social networks as well as for users who may have limited access or limited information to share.
METADATA-BASED CLASSIFICATION:
We propose a hierarchical image classification which classifies images first based on their contents and then refines each category into subcategories based on their metadata. Images that do not have metadata will be grouped only by content. Such a hierarchical classification gives a higher priority to image content and minimizes the influence of missing tags. Note that it is possible that some images are included in multiple categories as long as they contain the typical content features or metadata. The metadata-based classification groups images into subcategories under the aforementioned baseline categories.
The process consists of three main steps.
The third step is to find a subcategory that an image belongs to. This is an incremental procedure. At the beginning, the first image forms a subcategory as itself and the representative hypernyms of the image becomes the subcategory’s representative hypernyms. Then, we compute the distance between representative hypernyms of a new incoming image and each existing subcategory.
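A simplified sketch of this incremental step is given below; the Jaccard-style distance over hypernym sets and the threshold value are stand-ins chosen for illustration, not the exact measure used by the system.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SubcategoryAssignmentSketch {
    static final double THRESHOLD = 0.5;   // assumed cut-off, not from the paper

    // Jaccard distance between two hypernym sets (stand-in for the real measure).
    static double distance(Set<String> a, Set<String> b) {
        Set<String> union = new HashSet<String>(a);
        union.addAll(b);
        Set<String> inter = new HashSet<String>(a);
        inter.retainAll(b);
        return union.isEmpty() ? 0.0 : 1.0 - (double) inter.size() / union.size();
    }

    // Assign the new image's hypernyms to the closest subcategory, or start a new one.
    static int assign(List<Set<String>> subcategories, Set<String> newHypernyms) {
        int best = -1;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < subcategories.size(); i++) {
            double d = distance(subcategories.get(i), newHypernyms);
            if (d < bestDist) { bestDist = d; best = i; }
        }
        if (best >= 0 && bestDist <= THRESHOLD) {
            subcategories.get(best).addAll(newHypernyms);      // merge representatives
            return best;
        }
        subcategories.add(new HashSet<String>(newHypernyms));  // new subcategory
        return subcategories.size() - 1;
    }

    public static void main(String[] args) {
        List<Set<String>> subcats = new ArrayList<Set<String>>();
        Set<String> hypernyms = new HashSet<String>(Arrays.asList("dog", "animal"));
        System.out.println("assigned to subcategory " + assign(subcats, hypernyms));
    }
}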
CONTENT-BASED CLASSIFICATION:
Our approach to content-based classification is based on an efficient and yet accurate image similarity approach. Specifically, our classification algorithm compares image signatures defined based on quantified and sanitized version of Haar wavelet transformation. For each image, the wavelet transform encodes frequency and spatial information related to image color, size, invariant transform, shape, texture, symmetry, etc. Then, a small number of coefficients are selected to form the signature of the image. The content similarity among images is then determined by the distance among their image signatures.
Our selected similarity criteria include texture, symmetry, shape (radial symmetry and phase congruency) and SIFT. We also account for color and size. We set the system to start from five generic image classes: (a) explicit (e.g., nudity, violence, drinking etc), (b) adults, (c) kids, (d) scenery (e.g., beach, mountains), (e) animals. As a preprocessing step, we populate the five baseline classes by manually assigning to each class a number of images crawled from Google images, resulting in about 1,000 images per class. Having a large image data set beforehand reduces the chance of misclassification. Then, we generate signatures of all the images and store them in the database.
For our content classifier, we conducted some preliminary tests to evaluate its accuracy. Precisely, we tested our classifier against a ground-truth data set, Image-net.org. In Image-net, over 10 million images are collected and classified according to the wordnet structure. For each image class, we use the first half set of images as the training data set and classify the next 800 images. The classification result was recorded as correct if the synset’s main search term or the direct hypernym is returned as a class. The average accuracy of our classifier is above 94 percent.
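The nearest-class decision can be sketched as follows; the plain double arrays stand in for the selected wavelet coefficients, and the example values are invented.
import java.util.HashMap;
import java.util.Map;

public class NearestClassSketch {
    // Euclidean distance between two equal-length signatures.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Return the class whose representative signature is closest to the query.
    static String nearestClass(Map<String, double[]> classSignatures, double[] query) {
        String best = null;
        double bestDist = Double.MAX_VALUE;
        for (Map.Entry<String, double[]> e : classSignatures.entrySet()) {
            double d = distance(e.getValue(), query);
            if (d < bestDist) { bestDist = d; best = e.getKey(); }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, double[]> classes = new HashMap<String, double[]>();
        classes.put("scenery", new double[]{0.8, 0.1, 0.3});   // invented coefficients
        classes.put("kids",    new double[]{0.2, 0.9, 0.5});
        System.out.println(nearestClass(classes, new double[]{0.7, 0.2, 0.4}));
    }
}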
ADAPTIVE POLICY PREDICTION:
The policy prediction algorithm provides a predicted policy of a newly uploaded image to the user for his/her reference. More importantly, the predicted policy will reflect the possible changes of a user’s privacy concerns. The prediction process consists of three main phases: (i) policy normalization; (ii) policy mining; and (iii) policy prediction. The policy normalization is a simple decomposition process to convert a user policy into a set of atomic rules in which the data (D) component is a single-element set.
We propose a hierarchical mining approach for policy mining. Our approach leverages association rule mining techniques to discover popular patterns in policies. Policy mining is carried out within the same category of the new image because images in the same category are more likely under the similar level of privacy protection. The basic idea of the hierarchical mining is to follow a natural order in which a user defines a policy.
Given an image, a user usually first decides who can access the image, then thinks about what specific access rights (e.g., view only or download) should be given, and finally refines the access conditions such as setting the expiration date. Correspondingly, the hierarchical mining first looks for popular subjects defined by the user, then looks for popular actions in the policies containing the popular subjects, and finally for popular conditions in the policies containing both popular subjects and actions.
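A rough sketch of this subject-then-action-then-condition ordering is given below, using simple frequency counting in place of a full association rule miner; the policy fields and the minimum support value are assumptions for illustration.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HierarchicalMiningSketch {
    static class Policy {
        String subject, action, condition;   // e.g., "family", "view", "expires=30d"
        Policy(String s, String a, String c) { subject = s; action = a; condition = c; }
    }

    static final int MIN_SUPPORT = 2;   // assumed minimum support, not from the paper

    // Keep only the values that occur at least MIN_SUPPORT times.
    static List<String> frequent(List<String> values) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String v : values) {
            Integer c = counts.get(v);
            counts.put(v, c == null ? 1 : c + 1);
        }
        List<String> out = new ArrayList<String>();
        for (Map.Entry<String, Integer> e : counts.entrySet())
            if (e.getValue() >= MIN_SUPPORT) out.add(e.getKey());
        return out;
    }

    // Mine subjects first, then actions among matching policies, then conditions.
    static void mine(List<Policy> policies) {
        List<String> subjects = new ArrayList<String>();
        for (Policy p : policies) subjects.add(p.subject);
        for (String subject : frequent(subjects)) {
            List<Policy> withSubject = new ArrayList<Policy>();
            List<String> actions = new ArrayList<String>();
            for (Policy p : policies)
                if (p.subject.equals(subject)) { withSubject.add(p); actions.add(p.action); }
            for (String action : frequent(actions)) {
                List<String> conditions = new ArrayList<String>();
                for (Policy p : withSubject)
                    if (p.action.equals(action)) conditions.add(p.condition);
                System.out.println(subject + " / " + action + " / " + frequent(conditions));
            }
        }
    }

    public static void main(String[] args) {
        mine(Arrays.asList(
                new Policy("family", "view", "none"),
                new Policy("family", "view", "expires=30d"),
                new Policy("friends", "comment", "none")));
    }
}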
CHAPTER 5
5.0 SYSTEM STUDY:
5.1 FEASIBILITY STUDY:
The feasibility of the project is analyzed in this phase and a business proposal is put forth with a very general plan for the project and some cost estimates. During system analysis the feasibility study of the proposed system is to be carried out. This is to ensure that the proposed system is not a burden to the company. For feasibility analysis, some understanding of the major requirements for the system is essential.
Three key considerations involved in the feasibility analysis are
- ECONOMICAL FEASIBILITY
- TECHNICAL FEASIBILITY
- SOCIAL FEASIBILITY
5.1.1 ECONOMICAL FEASIBILITY:
This study is carried out to check the economic impact that the system will have on the organization. The amount of funds that the company can pour into the research and development of the system is limited. The expenditures must be justified. Thus the developed system is well within the budget, and this was achieved because most of the technologies used are freely available. Only the customized products had to be purchased.
5.1.2 TECHNICAL FEASIBILITY:
This study is carried out to check the technical feasibility, that is, the technical requirements of the system. Any system developed must not have a high demand on the available technical resources, as this would lead to high demands being placed on the client. The developed system must have modest requirements, as only minimal or null changes are required for implementing this system.
5.1.3 SOCIAL FEASIBILITY:
This aspect of the study is to check the level of acceptance of the system by the user. This includes the process of training the user to use the system efficiently. The user must not feel threatened by the system; instead, he must accept it as a necessity. The level of acceptance by the users solely depends on the methods that are employed to educate the user about the system and to make him familiar with it. His level of confidence must be raised so that he is also able to make some constructive criticism, which is welcomed, as he is the final user of the system.
5.2 SYSTEM TESTING:
Testing is a process of checking whether the developed system is working according to the original objectives and requirements. It is a set of activities that can be planned in advance and conducted systematically. Testing is vital to the success of the system. System testing makes a logical assumption that if all the parts of the system are correct, the goal will be successfully achieved. Inadequate testing or no testing leads to errors that may not appear until many months later.
This creates two problems: the time lag between the cause and the appearance of the problem, and the effect of the system errors on the files and records within the system. A small system error can conceivably explode into a much larger problem. Effective testing early in the process translates directly into long-term cost savings from a reduced number of errors. Another reason for system testing is its utility as a user-oriented vehicle before implementation. The best program is worthless if it does not produce the correct outputs.
5.2.1 UNIT TESTING:
Description | Expected result |
Test for application window properties. | All the properties of the windows are to be properly aligned and displayed. |
Test for mouse operations. | All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions. |
A program represents the
logical elements of a system. For a program to run satisfactorily, it must
compile and test data correctly and tie in properly with other programs.
Achieving an error free program is the responsibility of the programmer.
Program testing checks for two types of errors: syntax and logical. A syntax error is a program statement that violates one or more rules of the language in which it is written. An improperly defined field dimension or omitted keywords are common syntax errors. These errors are shown through error messages generated by the computer. For logic errors, the programmer must examine the output carefully.
5.2.2 FUNCTIONAL TESTING:
Functional testing of an application is used to prove the application delivers correct results, using enough inputs to give an adequate level of confidence that it will work correctly for all sets of inputs. The functional testing will need to prove that the application works for each client type and that personalization functions work correctly. When a program is tested, the actual output is compared with the expected output. When there is a discrepancy, the sequence of instructions must be traced to determine the problem. The process is facilitated by breaking the program into self-contained portions, each of which can be checked at certain key points. The idea is to compare program values against desk-calculated values to isolate the problems.
Description | Expected result |
Test for all modules. | All peers should communicate in the group. |
Test for various peers in a distributed network framework as it displays all users available in the group. | The result after execution should give the accurate result. |
5.2.3 NON-FUNCTIONAL TESTING:
The Non Functional software testing encompasses a rich spectrum of testing strategies, describing the expected results for every test case. It uses symbolic analysis techniques. This testing used to check that an application will work in the operational environment. Non-functional testing includes:
- Load testing
- Performance testing
- Usability testing
- Reliability testing
- Security testing
5.2.4 LOAD TESTING:
An important tool for implementing system tests is a Load generator. A Load generator is essential for testing quality requirements such as performance and stress. A load can be a real load, that is, the system can be put under test to real usage by having actual telephone users connected to it. They will generate test input data for system test.
Description | Expected result |
It is necessary to ascertain that the application behaves correctly under loads when ‘Server busy’ response is received. | Should designate another active node as a Server. |
5.2.5 PERFORMANCE TESTING:
Performance tests are utilized in order to determine the widely defined performance of the software system such as execution time associated with various parts of the code, response time and device utilization. The intent of this testing is to identify weak points of the software system and quantify its shortcomings.
Description | Expected result |
This is required to assure that an application performs adequately, having the capability to handle many peers, delivering its results in expected time and using an acceptable level of resources, and it is an aspect of operational management. | Should handle large input values, and produce accurate results in an expected time. |
5.2.6 RELIABILITY TESTING:
The software reliability is the ability of a system or component to perform its required functions under stated conditions for a specified period of time, and it is being ensured in this testing. Reliability can be expressed as the ability of the software to reveal defects under testing conditions, according to the specified requirements. It is the probability that a software system will operate without failure under given conditions for a given time interval, and it focuses on the behavior of the software element. It forms a part of the software quality control team.
Description | Expected result |
This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in providing the application. | In case of failure of the server an alternate server should take over the job. |
5.2.7 SECURITY TESTING:
Security testing evaluates system characteristics that relate to the availability, integrity and confidentiality of the system data and services. Users/Clients should be encouraged to make sure their security needs are very clearly known at requirements time, so that the security issues can be addressed by the designers and testers.
Description | Expected result |
Checking that the user identification is authenticated. | In case of failure it should not be connected in the framework. |
Check whether group keys in a tree are shared by all peers. | The peers should know group key in the same group. |
5.2.8 WHITE BOX TESTING:
White box testing, sometimes called glass-box testing is a test case design method that uses the control structure of the procedural design to derive test cases. Using white box testing method, the software engineer can derive test cases. The White box testing focuses on the inner structure of the software structure to be tested.
Description | Expected result |
Exercise all logical decisions on their true and false sides. | All the logical decisions must be valid. |
Execute all loops at their boundaries and within their operational bounds. | All the loops must be finite. |
Exercise internal data structures to ensure their validity. | All the data structures must be valid. |
5.2.9 BLACK BOX TESTING:
Black box testing, also called behavioral testing, focuses on the functional requirements of the software. That is, black box testing enables the software engineer to derive sets of input conditions that will fully exercise all functional requirements for a program. Black box testing is not an alternative to white box techniques. Rather, it is a complementary approach that is likely to uncover a different class of errors than white box methods. Black box testing attempts to find errors by focusing on the inputs, outputs, and principal functions of a software module. The starting point of the black box testing is either a specification or code. The contents of the box are hidden and the stimulated software should produce the desired results.
Description | Expected result |
To check for incorrect or missing functions. | All the functions must be valid. |
To check for interface errors. | The entire interface must function normally. |
To check for errors in a data structures or external data base access. | The database updation and retrieval must be done. |
To check for initialization and termination errors. | All the functions and data structures must be initialized properly and terminated normally. |
All the above system testing strategies are carried out during development, as the documentation and institutionalization of the proposed goals and related policies is essential.
CHAPTER 6
6.0 SOFTWARE DESCRIPTION:
6.1 JAVA TECHNOLOGY:
Java technology is both a programming language and a platform.
The Java Programming Language
The Java programming language is a high-level language that can be characterized by all of the following buzzwords:
- Simple
- Architecture neutral
- Object oriented
- Portable
- Distributed
- High performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure
With most programming languages, you either compile or interpret a program so that you can run it on your computer. The Java programming language is unusual in that a program is both compiled and interpreted. With the compiler, first you translate a program into an intermediate language called Java byte codes —the platform-independent codes interpreted by the interpreter on the Java platform. The interpreter parses and runs each Java byte code instruction on the computer. Compilation happens just once; interpretation occurs each time the program is executed. The following figure illustrates how this works.
You can think of Java byte codes as the machine code instructions for the Java Virtual Machine (Java VM). Every Java interpreter, whether it’s a development tool or a Web browser that can run applets, is an implementation of the Java VM. Java byte codes help make “write once, run anywhere” possible. You can compile your program into byte codes on any platform that has a Java compiler. The byte codes can then be run on any implementation of the Java VM. That means that as long as a computer has a Java VM, the same program written in the Java programming language can run on Windows 2000, a Solaris workstation, or on an iMac.
6.2 THE JAVA PLATFORM:
A platform is the hardware or software environment in which a program runs. We’ve already mentioned some of the most popular platforms like Windows 2000, Linux, Solaris, and MacOS. Most platforms can be described as a combination of the operating system and hardware. The Java platform differs from most other platforms in that it’s a software-only platform that runs on top of other hardware-based platforms.
The Java platform has two components:
- The Java Virtual Machine (Java VM)
- The Java Application Programming Interface (Java API)
You’ve already been introduced to the Java VM. It’s the base for the Java platform and is ported onto various hardware-based platforms.
The Java API is a large collection of ready-made software components that provide many useful capabilities, such as graphical user interface (GUI) widgets. The Java API is grouped into libraries of related classes and interfaces; these libraries are known as packages. The next section, What Can Java Technology Do?, highlights what functionality some of the packages in the Java API provide.
The following figure depicts a program that’s running on the Java platform. As the figure shows, the Java API and the virtual machine insulate the program from the hardware.
Native code is code that, once compiled, runs only on a specific hardware platform. As a platform-independent environment, the Java platform can be a bit slower than native code. However, smart compilers, well-tuned interpreters, and just-in-time byte code compilers can bring performance close to that of native code without threatening portability.
6.3 WHAT CAN JAVA TECHNOLOGY DO?
The most common types of programs written in the Java programming language are applets and applications. If you’ve surfed the Web, you’re probably already familiar with applets. An applet is a program that adheres to certain conventions that allow it to run within a Java-enabled browser.
However, the Java programming language is not just for writing cute, entertaining applets for the Web. The general-purpose, high-level Java programming language is also a powerful software platform. Using the generous API, you can write many types of programs.
An application is a standalone program that runs directly on the Java platform. A special kind of application known as a server serves and supports clients on a network. Examples of servers are Web servers, proxy servers, mail servers, and print servers. Another specialized program is a servlet.
A servlet can almost be thought of as an applet that runs on the server side. Java Servlets are a popular choice for building interactive web applications, replacing the use of CGI scripts. Servlets are similar to applets in that they are runtime extensions of applications. Instead of working in browsers, though, servlets run within Java Web servers, configuring or tailoring the server.
How does the API support all these kinds of programs? It does so with packages of software components that provide a wide range of functionality. Every full implementation of the Java platform gives you the following features:
- The essentials: Objects, strings, threads, numbers, input and output, data structures, system properties, date and time, and so on.
- Applets: The set of conventions used by applets.
- Networking: URLs, TCP (Transmission Control Protocol), UDP (User Datagram Protocol) sockets, and IP (Internet Protocol) addresses.
- Internationalization: Help for writing programs that can be localized for users worldwide. Programs can automatically adapt to specific locales and be displayed in the appropriate language.
- Security: Both low level and high level, including electronic signatures, public and private key management, access control, and certificates.
- Software components: Known as JavaBeans™, these can plug into existing component architectures.
- Object serialization: Allows lightweight persistence and communication via Remote Method Invocation (RMI).
- Java Database Connectivity (JDBC™): Provides uniform access to a wide range of relational databases.
The Java platform also has APIs for 2D and 3D graphics, accessibility, servers, collaboration, telephony, speech, animation, and more. The following figure depicts what is included in the Java 2 SDK.
6.4 HOW WILL JAVA TECHNOLOGY CHANGE MY LIFE?
We can’t promise you fame, fortune, or even a job if you learn the Java programming language. Still, it is likely to make your programs better and to require less effort than other languages. We believe that Java technology will help you do the following:
- Get started quickly: Although the Java programming language is a powerful object-oriented language, it’s easy to learn, especially for programmers already familiar with C or C++.
- Write less code: Comparisons of program metrics (class counts, method counts, and so on) suggest that a program written in the Java programming language can be four times smaller than the same program in C++.
- Write better code: The Java programming language encourages good coding practices, and its garbage collection helps you avoid memory leaks. Its object orientation, its JavaBeans component architecture, and its wide-ranging, easily extendible API let you reuse other people’s tested code and introduce fewer bugs.
- Develop programs more quickly: Your development time may be as much as twice as fast as writing the same program in C++. Why? You write fewer lines of code, and Java is a simpler programming language than C++.
- Avoid platform dependencies with 100% Pure Java: You can keep your program portable by avoiding the use of libraries written in other languages. The 100% Pure Java™ Product Certification Program has a repository of historical process manuals, white papers, brochures, and similar materials online.
- Write once, run anywhere: Because 100% Pure Java programs are compiled into machine-independent byte codes, they run consistently on any Java platform.
- Distribute software more easily: You can upgrade applets easily from a central server. Applets take advantage of the feature of allowing new classes to be loaded “on the fly,” without recompiling the entire program.
6.5 ODBC:
Microsoft Open Database Connectivity (ODBC) is a standard programming interface for application developers and database systems providers. Before ODBC became a de facto standard for Windows programs to interface with database systems, programmers had to use proprietary languages for each database they wanted to connect to. Now, ODBC has made the choice of the database system almost irrelevant from a coding perspective, which is as it should be. Application developers have much more important things to worry about than the syntax that is needed to port their program from one database to another when business needs suddenly change.
Through the ODBC Administrator in Control Panel, you can specify the particular database that is associated with a data source that an ODBC application program is written to use. Think of an ODBC data source as a door with a name on it. Each door will lead you to a particular database. For example, the data source named Sales Figures might be a SQL Server database, whereas the Accounts Payable data source could refer to an Access database. The physical database referred to by a data source can reside anywhere on the LAN.
The ODBC system files are not installed on your system by Windows 95. Rather, they are installed when you set up a separate database application, such as SQL Server Client or Visual Basic 4.0. When the ODBC icon is installed in Control Panel, it uses a file called ODBCINST.DLL. It is also possible to administer your ODBC data sources through a stand-alone program called ODBCADM.EXE. There is a 16-bit and a 32-bit version of this program, and each maintains a separate list of ODBC data sources.
From a programming perspective, the beauty of ODBC is that the application can be written to use the same set of function calls to interface with any data source, regardless of the database vendor. The source code of the application doesn’t change whether it talks to Oracle or SQL Server. We only mention these two as an example. There are ODBC drivers available for several dozen popular database systems. Even Excel spreadsheets and plain text files can be turned into data sources. The operating system uses the Registry information written by ODBC Administrator to determine which low-level ODBC drivers are needed to talk to the data source (such as the interface to Oracle or SQL Server). The loading of the ODBC drivers is transparent to the ODBC application program. In a client/server environment, the ODBC API even handles many of the network issues for the application programmer.
The advantages of this scheme are so numerous that you are probably thinking there must be some catch. The only disadvantage of ODBC is that it isn’t as efficient as talking directly to the native database interface. ODBC has had many detractors who charge that it is too slow. Microsoft has always claimed that the critical factor in performance is the quality of the driver software that is used. In our humble opinion, this is true. The availability of good ODBC drivers has improved a great deal recently. And anyway, the criticism about performance is somewhat analogous to those who said that compilers would never match the speed of pure assembly language. Maybe not, but the compiler (or ODBC) gives you the opportunity to write cleaner programs, which means you finish sooner. Meanwhile, computers get faster every year.
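As a rough sketch of how a Java program could reach such an ODBC data source through the JDBC-ODBC bridge that shipped with JDK 1.7 (the data source name, table, and column names below are hypothetical, not taken from this report):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class OdbcBridgeDemo {
    public static void main(String[] args) throws Exception {
        // Load the JDBC-ODBC bridge driver (shipped up to JDK 7, removed in JDK 8).
        Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
        // "SalesFigures" is a DSN configured in the ODBC Administrator (hypothetical name).
        try (Connection con = DriverManager.getConnection("jdbc:odbc:SalesFigures");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT region, total FROM sales")) {
            while (rs.next()) {
                System.out.println(rs.getString("region") + " = " + rs.getDouble("total"));
            }
        }
    }
}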
6.6 JDBC:
In an effort to set an independent database standard API for Java, Sun Microsystems developed Java Database Connectivity, or JDBC. JDBC offers a generic SQL database access mechanism that provides a consistent interface to a variety of RDBMSs. This consistent interface is achieved through the use of “plug-in” database connectivity modules, or drivers. If a database vendor wishes to have JDBC support, he or she must provide the driver for each platform that the database and Java run on.
To gain a wider acceptance of JDBC, Sun based JDBC’s framework on ODBC. As you discovered earlier in this chapter, ODBC has widespread support on a variety of platforms. Basing JDBC on ODBC will allow vendors to bring JDBC drivers to market much faster than developing a completely new connectivity solution.
JDBC was announced in March of 1996. It was released for a 90-day public review that ended June 8, 1996. Because of user input, the final JDBC v1.0 specification was released soon after.
The remainder of this section will cover enough information about JDBC for you to know what it is about and how to use it effectively. This is by no means a complete overview of JDBC. That would fill an entire book.
6.7 JDBC Goals:
Few software packages are designed without goals in mind, and JDBC is no exception; its many goals drove the development of the API. These goals, in conjunction with early reviewer feedback, have finalized the JDBC class library into a solid framework for building database applications in Java.
The goals that were set for JDBC are important. They will give you some insight as to why certain classes and functionalities behave the way they do. The eight design goals for JDBC are as follows:
SQL Level API
The designers felt that their main goal was to define a SQL interface for Java. Although not the lowest database interface level possible, it is at a low enough level for higher-level tools and APIs to be created. Conversely, it is at a high enough level for application programmers to use it confidently. Attaining this goal allows for future tool vendors to “generate” JDBC code and to hide many of JDBC’s complexities from the end user.
SQL Conformance
SQL syntax varies as you move from database vendor to database vendor. In an effort to support a wide variety of vendors, JDBC will allow any query statement to be passed through it to the underlying database driver. This allows the connectivity module to handle non-standard functionality in a manner that is suitable for its users.
JDBC must be implementable on top of common database interfaces
The JDBC SQL API must “sit” on top of other common SQL level APIs. This goal allows JDBC to use existing ODBC level drivers by the use of a software interface. This interface would translate JDBC calls to ODBC and vice versa.
Provide a Java interface that is consistent with the rest of the Java system
Because of Java’s acceptance in the user community thus far, the designers feel that they should not stray from the current design of the core Java system.
Keep it simple
This goal probably appears in all software design goal listings. JDBC is no exception. Sun felt that the design of JDBC should be very simple, allowing for only one method of completing a task per mechanism. Allowing duplicate functionality only serves to confuse the users of the API.
Use strong, static typing wherever possible
Strong typing allows more error checking to be done at compile time; also, fewer errors appear at runtime.
Keep the common cases simple
Because more often than not, the usual SQL calls used by the programmer are simple SELECTs, INSERTs, DELETEs and UPDATEs, these queries should be simple to perform with JDBC. However, more complex SQL statements should also be possible.
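A minimal sketch of such a common case with JDBC, assuming the MySQL back end named later in the software requirements; the URL, credentials, table, and column names are placeholders of our own, not taken from the report:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class SimpleJdbcQuery {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:mysql://localhost:3306/projectdb";   // hypothetical database
        try (Connection con = DriverManager.getConnection(url, "user", "password");
             PreparedStatement ps = con.prepareStatement(
                     "SELECT tweet_text FROM tweets WHERE keyword = ?")) {
            ps.setString(1, "asthma");                          // simple parameterized SELECT
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("tweet_text"));
                }
            }
        }
    }
}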
Finally, we decided to proceed with the implementation using Java networking. For dynamically updating the cache table, we use an MS Access database.
Java has two things: a programming language and a platform.
Java is a high-level programming language that is all of the following: simple, architecture-neutral, object-oriented, portable, distributed, high-performance, interpreted, multithreaded, robust, dynamic, and secure.
Java is also unusual in that each Java program is both compiled and interpreted. With a compiler, you translate a Java program into an intermediate language called Java byte codes; these platform-independent instructions are then interpreted and run on the computer.
Compilation happens just once; interpretation occurs each time the program is executed. The figure illustrates how this works.
6.8 NETWORKING TCP/IP STACK:
The TCP/IP stack is shorter than the OSI one:
TCP is a connection-oriented protocol; UDP (User Datagram Protocol) is a connectionless protocol.
IP datagrams:
The IP layer provides a connectionless and unreliable delivery system. It considers each datagram independently of the others. Any association between datagrams must be supplied by the higher layers. The IP layer supplies a checksum that includes its own header. The header includes the source and destination addresses. The IP layer handles routing through an Internet. It is also responsible for breaking up large datagrams into smaller ones for transmission and reassembling them at the other end.
UDP:
UDP is also connectionless and unreliable. What it adds to IP is a checksum for the contents of the datagram and port numbers. These are used to give a client/server model – see later.
TCP:
TCP supplies logic to give a reliable connection-oriented protocol above IP. It provides a virtual circuit that two processes can use to communicate.
Internet addresses
In order to use a service, you must be able to find it. The Internet uses an address scheme for machines so that they can be located. The address is a 32-bit integer, which gives the IP address.
Network address:
Class A uses 8 bits for the network address, with 24 bits left over for other addressing. Class B uses 16-bit network addressing. Class C uses 24-bit network addressing, and Class D uses all 32.
Subnet address:
Internally, the UNIX network is divided into sub networks. Building 11 is currently on one sub network and uses 10-bit addressing, allowing 1024 different hosts.
Host address:
8 bits are finally used for host addresses within our subnet. This places a limit of 256 machines that can be on the subnet.
Total address:
The 32-bit address is usually written as four integers separated by dots.
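For illustration (our own sketch, not from the report), the dotted form can be packed into the underlying 32-bit integer and split into network and host parts with simple bit operations:
public class AddressDemo {
    public static void main(String[] args) {
        int[] octets = {192, 168, 10, 5};              // dotted form 192.168.10.5
        int address = (octets[0] << 24) | (octets[1] << 16)
                    | (octets[2] << 8)  |  octets[3];  // packed 32-bit address
        int network = address >>> 8;                   // top 24 bits: the network part (class C style)
        int host    = address & 0xFF;                  // last 8 bits: the host part
        System.out.printf("address=0x%08X network=0x%06X host=%d%n", address, network, host);
    }
}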
Port addresses
A service exists on a host, and is identified by its port. This is a 16-bit number. To send a message to a server, you send it to the port for that service of the host that it is running on. This is not location transparency! Certain of these ports are “well known”.
Sockets:
A socket is a data structure maintained by the system to handle network connections. A socket is created using the call socket. It returns an integer that is like a file descriptor. In fact, under Windows, this handle can be used with the ReadFile and WriteFile functions.
#include <sys/types.h>
#include <sys/socket.h>
int socket(int family, int type, int protocol);
Here “family” will be AF_INET for IP communications, protocol will be zero, and type will depend on whether TCP or UDP is used. Two processes wishing to communicate over a network create a socket each. These are similar to two ends of a pipe – but the actual pipe does not yet exist.
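In Java, which this project uses, the same idea is wrapped by the java.net classes; a minimal TCP client sketch follows (host, port, and message are placeholders of our own):
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

public class SimpleTcpClient {
    public static void main(String[] args) throws Exception {
        // Constructing the Socket performs the underlying socket()/connect() calls.
        try (Socket s = new Socket("localhost", 5000);
             PrintWriter out = new PrintWriter(s.getOutputStream(), true);
             BufferedReader in = new BufferedReader(new InputStreamReader(s.getInputStream()))) {
            out.println("hello");               // send one line to the server
            System.out.println(in.readLine());  // read one line of reply
        }
    }
}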
6.9 JFREE CHART:
JFreeChart is a free 100% Java chart library that makes it easy for developers to display professional quality charts in their applications. JFreeChart’s extensive feature set includes:
A consistent and well-documented API, supporting a wide range of chart types;
A flexible design that is easy to extend, and targets both server-side and client-side applications;
Support for many output types, including Swing components, image files (including PNG and JPEG), and vector graphics file formats (including PDF, EPS and SVG);
JFreeChart is “open source” or, more specifically, free software. It is distributed under the terms of the GNU Lesser General Public Licence (LGPL), which permits use in proprietary applications.
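A minimal sketch of the library in use, assuming the JFreeChart 1.0.x API; the chart type and the data values are purely illustrative:
import java.io.File;
import org.jfree.chart.ChartFactory;
import org.jfree.chart.ChartUtilities;
import org.jfree.chart.JFreeChart;
import org.jfree.chart.plot.PlotOrientation;
import org.jfree.data.category.DefaultCategoryDataset;

public class ChartDemo {
    public static void main(String[] args) throws Exception {
        DefaultCategoryDataset data = new DefaultCategoryDataset();
        data.addValue(12, "ED visits", "Mon");   // invented values, for illustration only
        data.addValue(18, "ED visits", "Tue");
        data.addValue(9,  "ED visits", "Wed");
        JFreeChart chart = ChartFactory.createBarChart(
                "Asthma ED visits", "Day", "Visits", data,
                PlotOrientation.VERTICAL, true, false, false);
        ChartUtilities.saveChartAsPNG(new File("visits.png"), chart, 640, 480);  // PNG output
    }
}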
6.9.1. Map Visualizations:
Charts showing values that relate to geographical areas. Some examples include: (a) population density in each state of the United States, (b) income per capita for each country in Europe, (c) life expectancy in each country of the world. The tasks in this project include: Sourcing freely redistributable vector outlines for the countries of the world, states/provinces in particular countries (USA in particular, but also other areas);
Creating an appropriate dataset interface (plus default implementation), a renderer, and integrating this with the existing XYPlot class in JFreeChart; testing, documenting, testing some more, documenting some more.
6.9.2. Time Series Chart Interactivity
Implement a new (to JFreeChart) feature for interactive time series charts — to display a separate control that shows a small version of ALL the time series data, with a sliding “view” rectangle that allows you to select the subset of the time series data to display in the main chart.
6.9.3. Dashboards
There is currently a lot of interest in dashboard displays. Create a flexible dashboard mechanism that supports a subset of JFreeChart chart types (dials, pies, thermometers, bars, and lines/time series) that can be delivered easily via both Java Web Start and an applet.
6.9.4. Property Editors
The property editor mechanism in JFreeChart only handles a small subset of the properties that can be set for charts. Extend (or reimplement) this mechanism to provide greater end-user control over the appearance of the charts.
CHAPTER 7
7.0 APPENDIX
7.1 SAMPLE SCREEN SHOTS:
7.2 SAMPLE SOURCE CODE:
CHAPTER 8
8.1 CONCLUSION
With A3P-Social, we achieve a much higher accuracy, demonstrating that simply considering privacy inclination is not enough, and that “social context” truly matters. Precisely, the overall accuracy of A3P-Social is above 95 percent. For 88.6 percent of the users, all predicted policies are correct, and the number of missed policies is 33 (out of over 2,600 predictions). Also, we note that in this case there is no significant difference across image types.
We compared the performance of the A3P-Social with alternative, popular, recommendation methods: Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them. In our case, the vectors are the users’ attributes defining their social profile. The algorithm using Cosine similarity scans all users profiles, computes Cosine similarity of the social contexts between the new user and the existing users. Then, it finds the top two users with the highest similarity score with the candidate user and feeds the associated images to the remaining functions in the A3P-core.
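For illustration, a small sketch of the measure itself (our own helper; it assumes the social-profile attributes have already been encoded as numeric vectors):
public class CosineSimilarity {
    // cos(a, b) = (a . b) / (|a| * |b|)
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot   += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] newUser      = {1, 0, 3, 2};   // encoded social-profile attributes (invented)
        double[] existingUser = {1, 1, 2, 2};
        System.out.println(cosine(newUser, existingUser));
    }
}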
We have proposed an Adaptive Privacy Policy Prediction (A3P) system that helps users automate the privacy policy settings for their uploaded images. The A3P system provides a comprehensive framework to infer privacy preferences based on the information available for a given user. We also effectively tackled the issue of cold-start, leveraging social context information. Our experimental study proves that our A3P is a practical tool that offers significant improvements over current approaches to privacy.
Predicting Asthma-Related Emergency Department Visits Using Big Data
Asthma is one of the most prevalent and costly chronic conditions in the United States, and it cannot be cured. However, accurate and timely surveillance data could allow for timely and targeted interventions at the community or individual level. Current national asthma disease surveillance systems can have data availability lags of up to two weeks. Rapid progress has been made in gathering non-traditional, digital information to perform disease surveillance.
We introduce a novel method of using multiple data sources for predicting the number of asthma-related emergency department (ED) visits in a specific area. Twitter data, Google search interests, and environmental sensor data were collected for this purpose. Our preliminary findings show that our model can predict the number of asthma ED visits based on near-real-time environmental and social media data with approximately 70% precision. The results can be helpful for public health surveillance, emergency department preparedness, and targeted patient interventions.
1.2 INTRODUCTION:
Asthma is one of the most prevalent and costly chronic conditions in the United States, with 25 million people affected. Asthma accounts for about two million emergency department (ED) visits, half a million hospitalizations, and 3,500 deaths, and incurs more than 50 billion dollars in direct medical costs annually. Moreover, asthma is a leading cause of lost productivity, with nearly 11 million missed school days and more than 14 million missed work days every year due to asthma. Although asthma cannot be cured, many of its adverse events can be prevented by appropriate medication use and avoidance of environmental triggers. The prediction of population- and individual-level risk for asthma adverse events using accurate and timely surveillance data could guide timely and targeted interventions, to reduce the societal burden of asthma. At the population level, current national asthma disease surveillance programs rely on weekly reports to the Centers for Disease Control and Prevention (CDC) of data collected from various local resources by state health departments.
Notoriously, such data have a lag-time of weeks, therefore providing retrospective information that is not amenable to proactive and timely preventive interventions. At the individual level, known predictors of asthma ED visits and hospitalizations include past acute care utilization, medication use, and sociodemographic characteristics. Common data sources for these variables include electronic medical records (EMR), medical insurance claims data, and population surveys, all of which, also, are subject to significant time lag. In an ongoing quality improvement project for asthma care, Parkland Center for Clinical Innovation (PCCI) researchers have built an asthma predictive model relying on a combination of EMR and claim data to predict the risk for asthma-related ED visits within three months of data collection [Unpublished reports from PCCI]. Although the model performance (C-statistic 72%) and prediction timeframe (three months) are satisfying, a narrower prediction timeframe potentially could provide additional risk-stratification for more efficiency and timeliness in resource deployment. For instance, resources might be prioritized to first serve patients at high risk for an asthma ED visit within 2 weeks of data collection, while being safely deferred for patients with a later predicted high-risk period.
Novel sources of timely data on population- and individual-level asthma activities are needed to provide additional temporal and geographical granularity to asthma risk stratification. Short of collecting information directly from individual patients (a time- and resource-intensive endeavor), readily available public data will have to be repurposed intelligently to provide the required information. There has been increasing interest in gathering non-traditional, digital information to perform disease surveillance. These include diverse datasets such as those stemming from social media, internet search, and environmental data. Twitter is an online social media platform that enables users to post and read 140-character messages called “tweets”. It is a popular data source for disease surveillance using social media since it can provide nearly instant access to real-time social opinions. More importantly, tweets are often tagged by geographic location and time stamps potentially providing information for disease surveillance.
Another notable non-traditional disease surveillance system has been a data-aggregating tool called Google Flu Trends, which uses aggregated search data to estimate flu activity. Google Trends was quite successful in its estimation of influenza-like illness. It is based on Google’s search engine, which tracks how often a particular search term is entered relative to the total search volume across a particular area. This enables access to the latest data from web search interest trends on a variety of topics, including diseases like asthma. Air pollutants are known triggers for asthma symptoms and exacerbations. The United States Environmental Protection Agency (EPA) provides access to monitored air quality data collected at outdoor sensors across the country, which could be used as a data source for asthma prediction. Meanwhile, as health reform progresses, the quantity and variety of health records being made available electronically are increasing dramatically. In contrast to traditional disease surveillance systems, these new data sources have the potential to enable health organizations to respond to chronic conditions, like asthma, in real time. This in turn implies that health organizations can appropriately plan for staffing and equipment availability in a flexible manner. They can also provide early warning signals to the people at risk for asthma adverse events, and enable timely, proactive, and targeted preventive and therapeutic interventions.
1.3 LITERATURE SURVEY:
USE OF HANGEUL TWITTER TO TRACK AND PREDICT HUMAN INFLUENZA INFECTION
AUTHOR: Kim, Eui-Ki, et al.
PUBLISH: PloS one vol. 8, no.7, e69305, 2013.
EXPLANATION:
Influenza epidemics arise through the accumulation of viral genetic changes. The emergence of new virus strains coincides with a higher level of influenza-like illness (ILI), which is seen as a peak of a normal season. Monitoring the spread of an epidemic influenza in populations is a difficult and important task. Twitter is a free social networking service whose messages can improve the accuracy of forecasting models by providing early warnings of influenza outbreaks. In this study, we have examined the use of information embedded in the Hangeul Twitter stream to detect rapidly evolving public awareness or concern with respect to influenza transmission and developed regression models that can track levels of actual disease activity and predict influenza epidemics in the real world. Our prediction model using a delay mode provides not only a real-time assessment of the current influenza epidemic activity but also a significant improvement in prediction performance at the initial phase of the ILI peak, when prediction is of most importance.
A NEW AGE OF PUBLIC HEALTH: IDENTIFYING DISEASE OUTBREAKS BY ANALYZING TWEETS
AUTHOR: Krieck, Manuela, Johannes Dreesman, Lubomir Otrusina, and Kerstin Denecke.
PUBLISH: In Proceedings of Health Web-Science Workshop, ACM Web Science Conference. 2011.
EXPLANATION:
Traditional disease surveillance is a very time-consuming reporting process. Cases of notifiable diseases are reported to the different levels in the national health care system before actions can be taken. But early detection of disease activity followed by a rapid response is crucial to reduce the impact of epidemics. To address this challenge, alternative sources of information are investigated for disease surveillance. In this paper, the relevance of Twitter messages for outbreak detection is investigated from two directions. First, Twitter messages potentially related to disease outbreaks are retrospectively searched and analyzed. Second, incoming Twitter messages are assessed with respect to their relevance for outbreak detection. The studies show that Twitter messages can be – to a certain extent – highly relevant for early detection of hints to public health threats. According to the German Protection against Infection Act (Infektionsschutzgesetz, IfSG, 2001), the traditional disease surveillance relies on data from mandatory reporting of cases by physicians and laboratories. They inform local county health departments (Landkreis), which in turn report to state health departments (Land). At the end of the reporting pipeline, the national surveillance institute (Robert Koch Institute) is informed about the outbreak. It is clear that these different stages of reporting take time and delay a timely reaction.
TOWARDS DETECTING INFLUENZA EPIDEMICS BY ANALYZING TWITTER MESSAGES
AUTHOR: Culotta, Aron.
PUBLISH: In Proceedings of the first workshop on social media analytics, pp. 115-122. ACM, 2010.
EXPLANATION:
Rapid response to a health epidemic is critical to reduce loss of life. Existing methods mostly rely on expensive surveys of hospitals across the country, typically with lag times of one to two weeks for influenza reporting, and even longer for less common diseases. In response, there have been several recently proposed solutions to estimate a population’s health from Internet activity, most notably Google’s Flu Trends service, which correlates search term frequency with influenza statistics reported by the Centers for Disease Control and Prevention (CDC). In this paper, we analyze messages posted on the micro-blogging site Twitter.com to determine if a similar correlation can be uncovered. We propose several methods to identify influenza-related messages and compare a number of regression models to correlate these messages with CDC statistics. Using over 500,000 messages spanning 10 weeks, we find that our best model achieves a correlation of .78 with CDC statistics by leveraging a document classifier to identify relevant messages.
CHAPTER 2
2.0 SYSTEM ANALYSIS
2.1 EXISTING SYSTEM:
Existing methods build on the increased availability of information on the Web: in recent years a new research area has developed, namely infodemiology. It can be defined as the “science of distribution and determinants of information in an electronic medium, specifically the Internet, or in a population, with the ultimate aim to inform public health and public policy”. As part of this research area, several kinds of data have been studied for their applicability in the context of disease surveillance. Google Flu Trends exploits search behavior to monitor current flu-related disease activity. Carneiro and Mylonakis showed that Google Flu Trends can detect regional outbreaks of influenza 7–10 days before conventional Centers for Disease Control and Prevention surveillance systems.
The relevance of Twitter messages for disease outbreak detection has already been reported; tweets proved especially useful to predict outbreaks such as a Norovirus outbreak at a university. Chew et al. analysed Twitter messages during the 2009 influenza epidemic, comparing the use of the terms “H1N1” and “swine flu” over time. Furthermore, they analysed the content of the tweets (ten content concepts) and validated Twitter as a real-time content source. They analysed the data via Infovigil, an infosurveillance system, using automated coding. To find out whether there is a relationship between automated and manual coding, the tweets were evaluated with a Pearson correlation, and a significant correlation between both codings was found for seven content concepts. It still needs to be investigated whether this source might be of relevance for detecting disease outbreaks in Germany. Therefore, only German keywords are exploited to identify Twitter messages. Further, we are not only interested in influenza-like illnesses, as in the studies available so far, but also in other infectious diseases (e.g. Norovirus and Salmonella).
2.1.1 DISADVANTAGES:
Existing methods work with messages in a common format: [username] [text] [date time client]. The length is restricted to 140 characters. In terms of linguistics, each Twitter user can write as he or she likes, so the variety ranges from complete sentences to listings of keywords. Hashtags, i.e. terms that are combined with a hash sign (e.g. #flu), denote topics and are primarily utilized by experienced users. Categorized according to their contents in more detail, such messages can provide information, express opinions, or report personal issues. When information is provided, its authority cannot normally be determined, so it might be unverified. Opinions are often expressed with humor or sarcasm and may be highly contradictory in the emotions that are expressed.
2.2 PROPOSED SYSTEM:
We propose methods that leverage social media, internet search, and environmental air quality data to estimate ED visits for asthma in a relatively discrete geographic area (a metropolitan area) within a relatively short time period (days). To this end, we have gathered asthma-related ED visits data, social media data from Twitter, internet users’ search interests from Google, and pollution sensor data from the EPA, all from the same geographic area and time period, to create a model for predicting asthma-related ED visits. This work is different from extant studies that typically predict the spread of contagious diseases using social media such as Twitter. Unlike influenza or other viral diseases, asthma is a non-communicable health condition, and we demonstrate the utility and value of linking big data from diverse sources in developing predictive models for non-communicable diseases with a specific focus on asthma.
Research studies have explored the use of novel data sources to propose rapid, cost-effective health status surveillance methodologies. Some of the early studies rely on document classification, suggesting that Twitter data can be highly relevant for early detection of public health threats. Others employ more complex linguistic analysis, such as the Ailment Topic Aspect Model, which is useful for syndromic surveillance. This type of analysis is useful for demonstrating the significance of social media as a promising new data source for health surveillance. Other recent studies have linked social media data with real-world disease incidence to generate actionable knowledge useful for making health care decisions. These include studies that analyzed Twitter messages related to influenza, correlated them with reported CDC statistics, and validated Twitter as a real-time content, sentiment, and public attention trend-tracking tool. Collier employed supervised classifiers (SVM and Naive Bayes) to classify tweets into four self-reported protective behavior categories. This study adds to evidence supporting a high degree of correlation between pre-diagnostic social media signals and diagnostic influenza case data.
2.2.1 ADVANTAGES:
Our work uses a combination of data from multiple sources to predict the number of asthma-related ED visits in near real-time. In doing so, we exploit geographic information associated with each dataset. We describe the techniques to process multiple types of datasets, to extract signals from each, integrate, and feed into a prediction model using machine learning algorithms, and demonstrate the feasibility of such a prediction.
The main contributions of this work are:
• Analysis of tweets with respect to their relevance for disease surveillance,
• Content analysis and content classification of tweets,
• Linguistic analysis of disease-reporting twitter messages,
- Recommendations on search patterns for tweet search in the context of disease surveillance.
2.3 HARDWARE & SOFTWARE REQUIREMENTS:
2.3.1 HARDWARE REQUIREMENT:
- Processor – Pentium IV
- Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Key Board – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA
2.3.2 SOFTWARE REQUIREMENTS:
- Operating System : Windows XP or Win7
- Front End : JAVA JDK 1.7
- Back End : MYSQL Server
- Server : Apache Tomcat Server
- Script : JSP Script
- Document : MS-Office 2007
CHAPTER 3
3.0 SYSTEM DESIGN:
Data Flow Diagram / Use Case Diagram / Flow Diagram:
- The DFD is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on these data, and the output data generated by the system.
- The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components. These components are the system process, the data used by the process, an external entity that interacts with the system and the information flows in the system.
- DFD shows how the information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.
- A DFD may be used to represent a system at any level of abstraction. DFDs may be partitioned into levels that represent increasing information flow and functional detail.
NOTATION:
SOURCE OR DESTINATION OF DATA:
External sources or destinations, which may be people or organizations or other entities
DATA SOURCE:
Here the data referenced by a process is stored and retrieved.
PROCESS:
People, procedures, or devices that produce data; the physical component is not identified.
DATA FLOW:
Data moves in a specific direction from an origin to a destination. The data flow is a “packet” of data.
There are several common modeling rules when creating DFDs:
- All processes must have at least one data flow in and one data flow out.
- All processes should modify the incoming data, producing new forms of outgoing data.
- Each data store must be involved with at least one data flow.
- Each external entity must be involved with at least one data flow.
- A data flow must be attached to at least one process.
3.1 ARCHITECTURE DIAGRAM:
3.2 DATAFLOW DIAGRAM:
UML DIAGRAMS:
3.3 USE CASE DIAGRAM:
3.4 CLASS DIAGRAM:
3.5 SEQUENCE DIAGRAM:
3.6 ACTIVITY DIAGRAM:
[Activity diagram labels: Login, Friend Follow, Friends list, New Tweet, Tweet, Filter Tweet, Asthma Tweets, Accept / No, Alert Email]
CHAPTER 4
4.0 IMPLEMENTATION:
DISEASE CONTROL AND PREVENTION (CDC):
Current national asthma disease surveillance programs rely on weekly reports to the Centers for Disease Control and Prevention (CDC) of data collected from various local resources by state health departments [4]. Notoriously, such data have a lag-time of weeks, therefore providing retrospective information that is not amenable to proactive and timely preventive interventions. At the individual level, known predictors of asthma ED visits and hospitalizations include past acute care utilization, medication use, and sociodemographic characteristics. Common data sources for these variables include electronic medical records (EMR), medical insurance claims data, and population surveys, all of which, also, are subject to significant time lag. In an ongoing quality improvement project for asthma care, Parkland Center for Clinical Innovation (PCCI) researchers have built an asthma predictive model relying on a combination of EMR and claim data to predict the risk for asthma-related ED visits within three months of data collection.
Although the model performance (C-statistic 72%) and prediction timeframe (three months) are satisfying, a narrower prediction timeframe potentially could provide additional risk-stratification for more efficiency and timeliness in resource deployment. For instance, resources might be prioritized to first serve patients at high risk for an asthma ED visit within 2 weeks of data collection, while being safely deferred for patients with a later predicted high-risk period. Novel sources of timely data on population- and individual-level asthma activities are needed to provide additional temporal and geographical granularity to asthma risk stratification. Short of collecting information directly from individual patients (a time- and resource-intensive endeavor), readily available public data will have to be repurposed intelligently to provide the required information.
4.1 ALGORITHM:
MACHINE LEARNING ALGORITHMS:
Our research objective is to leverage social media, internet search, and environmental air quality data to estimate ED visits for asthma in a relatively discrete geographic area (a metropolitan area) within a relatively short time period (days). To this end, we have gathered asthma-related ED visits data, social media data from Twitter, internet users’ search interests from Google, and pollution sensor data from the EPA, all from the same geographic area and time period, to create a model for predicting asthma-related ED visits. This work is different from extant studies that typically predict the spread of contagious diseases using social media such as Twitter. Unlike influenza or other viral diseases, asthma is a non-communicable health condition, and we demonstrate the utility and value of linking big data from diverse sources in developing predictive models for non-communicable diseases with a specific focus on asthma.
4.2 MODULES:
EMERGENCY DEPARTMENT VISITS:
ENVIRONMENTAL SENSORS (EMR):
OUR PREDICTION SENSOR DATA:
ASTHMA PREDICTION RESULTS:
4.3 MODULE DESCRIPTION:
EMERGENCY DEPARTMENT VISITS:
We introduce a novel method of using multiple data sources for predicting the number of asthma-related emergency department (ED) visits in a specific area. Twitter data, Google search interests, and environmental sensor data were collected for this purpose. Moreover, asthma is a leading cause of lost productivity, with nearly 11 million missed school days and more than 14 million missed work days every year due to asthma. Although asthma cannot be cured, many of its adverse events can be prevented by appropriate medication use and avoidance of environmental triggers.
The prediction of population- and individual-level risk for asthma adverse events using accurate and timely surveillance data could guide timely and targeted interventions, to reduce the societal burden of asthma. At the population level, current national asthma disease surveillance programs rely on weekly reports to the Centers for Disease Control and Prevention (CDC) of data collected from various local resources by state health departments. Notoriously, such data have a lag-time of weeks, therefore providing retrospective information that is not amenable to proactive and timely preventive interventions. At the individual level, known predictors of asthma ED visits and hospitalizations include past acute care utilization, medication use, and sociodemographic characteristics.
ENVIRONMENTAL SENSORS (EMR):
Common data sources for these variables include electronic medical records (EMR), medical insurance claims data, and population surveys, all of which, also, are subject to significant time lag. In an ongoing quality improvement project for asthma care, Parkland Center for Clinical Innovation (PCCI) researchers have built an asthma predictive model relying on a combination of EMR and claim data to predict the risk for asthma-related ED visits within three months of data collection [Unpublished reports from PCCI]. Although the model performance (C-statistic 72%) and prediction timeframe (three months) are satisfying, a narrower prediction timeframe potentially could provide additional risk-stratification for more efficiency and timeliness in resource deployment.
For instance, resources might be prioritized to first serve patients at high risk for an asthma ED visit within 2 weeks of data collection, while being safely deferred for patients with a later predicted high-risk period. Novel sources of timely data on population- and individual-level asthma activities are needed to provide additional temporal and geographical granularity to asthma risk stratification. Short of collecting information directly from individual patients (a time- and resource-intensive endeavor), readily available public data will have to be repurposed intelligently to provide the required information. There has been increasing interest in gathering non-traditional, digital information to perform disease surveillance. These include diverse datasets such as those stemming from social media, internet search, and environmental data. Twitter is an online social media platform that enables users to post and read 140-character messages called “tweets”. It is a popular data source for disease surveillance using social media since it can provide nearly instant access to real-time social opinions. More importantly, tweets are often tagged by geographic location and time stamps, potentially providing information for disease surveillance.
OUR PREDICTION SENSOR DATA:
We first analyzed the association between the asthma-related ED visits and data from Twitter, Google trends, and Air Quality sensors, using the Pearson correlation coefficient. We also examined the association between asthma-related tweet counts and ED visit counts for abdominal pain/constipation patients, to control for non-asthma-specific variations in ED visit counts. Then, we designed and implemented a prediction model to estimate the incidence of asthma ED visits at CMC using a combination of independent variables from the above data sources.
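A small sketch of the Pearson coefficient as used here (our own helper method; the sample values are invented and only show the call):
public class PearsonCorrelation {
    // r = sum((x-mx)(y-my)) / sqrt(sum((x-mx)^2) * sum((y-my)^2))
    static double pearson(double[] x, double[] y) {
        double mx = 0, my = 0;
        for (int i = 0; i < x.length; i++) { mx += x[i]; my += y[i]; }
        mx /= x.length;
        my /= y.length;
        double cov = 0, vx = 0, vy = 0;
        for (int i = 0; i < x.length; i++) {
            cov += (x[i] - mx) * (y[i] - my);
            vx  += (x[i] - mx) * (x[i] - mx);
            vy  += (y[i] - my) * (y[i] - my);
        }
        return cov / Math.sqrt(vx * vy);
    }

    public static void main(String[] args) {
        double[] tweets = {4, 7, 2, 9, 5};     // daily asthma tweet counts (invented)
        double[] visits = {10, 14, 6, 18, 11}; // daily ED visit counts (invented)
        System.out.println(pearson(tweets, visits));
    }
}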
Twitter offers streaming APIs to give developers and researchers low latency access to its global stream of data. Public streams, which can provide access to the public data flowing through Twitter, were used in this study. Studies have estimated that using Twitter’s Streaming API, researchers can expect to receive 1% of the tweets in near real-time. Twitter4j, an unofficial Java library for the Twitter API, was used to access tweet information from the Twitter Streaming API.
Two different Twitter data sets were collected in this study:
(1) The general twitter stream: a simple collection of JSON grabbed from the general Twitter stream. The general tweet counts were used to estimate the Twitter population and further normalize asthma tweet counts.
(2) The asthma-related stream: to collect only tweets containing any of 19 related keywords that were suggested by our clinical collaborators from PCCI. The asthma stream is limited to 1% of full tweets as well.
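A hedged sketch of how the asthma-related stream could be opened with Twitter4j; the keyword list is an illustrative subset of the 19 clinician-suggested keywords, and OAuth credentials are assumed to be supplied via twitter4j.properties:
import twitter4j.FilterQuery;
import twitter4j.Status;
import twitter4j.StatusAdapter;
import twitter4j.TwitterStream;
import twitter4j.TwitterStreamFactory;

public class AsthmaStreamCollector {
    public static void main(String[] args) {
        TwitterStream stream = new TwitterStreamFactory().getInstance();
        stream.addListener(new StatusAdapter() {
            @Override
            public void onStatus(Status status) {
                // Keep text, timestamp, and (if present) geo-tag for later analysis.
                System.out.println(status.getCreatedAt() + " " + status.getText()
                        + " " + status.getGeoLocation());
            }
        });
        FilterQuery query = new FilterQuery();
        query.track(new String[] {"asthma", "inhaler", "wheezing"}); // illustrative subset
        stream.filter(query);  // opens the streaming connection and begins receiving tweets
    }
}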
ASTHMA PREDICTION RESULTS:
Based on the results of our correlation analysis, asthma tweets, CO, NO2 and PM2.5 were selected as inputs into our prediction model. We are only reporting results for the Decision Tree and Artificial Neural Networks (ANN) techniques, as the Naive Bayes and SVM techniques did not yield good prediction results. First, a backward feature selection algorithm was used to examine if the addition of Twitter data would improve the prediction. As shown in Table VI, combining air quality data with tweets resulted in higher prediction accuracy. Additionally, we evaluated prediction precision. Given that our prediction task is a three-way classification, each technique resulted in different prediction and/or precision in different classes (Table VII). Decision Tree performed well in predicting the “High” class, while ANN, after Adaptive Boosting, worked well for the “Low” class. Stacking the two techniques performed well for the “Medium” class.
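The report does not name a specific machine learning toolkit; as one possible realization, here is a sketch using the Weka library on a hypothetical ARFF file holding the integrated daily features (asthma tweet counts, CO, NO2, PM2.5, and a High/Medium/Low class label):
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.MultilayerPerceptron;
import weka.classifiers.meta.AdaBoostM1;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AsthmaRiskModel {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("asthma_daily.arff"); // hypothetical feature file
        data.setClassIndex(data.numAttributes() - 1);           // last attribute is the class

        Classifier tree = new J48();                 // decision tree
        AdaBoostM1 boostedAnn = new AdaBoostM1();    // ANN wrapped in adaptive boosting
        boostedAnn.setClassifier(new MultilayerPerceptron());

        for (Classifier c : new Classifier[] {tree, boostedAnn}) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(c, data, 10, new java.util.Random(1));
            System.out.println(c.getClass().getSimpleName()
                    + " accuracy = " + (1 - eval.errorRate()));
        }
    }
}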
The results of our analysis are promising because they perform with a fairly high level of accuracy overall. As noted in the introduction, traditional asthma ED visit models are useful for predicting events in a three month window and have an accuracy of approximately 70%. It is to be noted that “traditional models” estimate a risk score for asthma ED visit for each individual patient, whereas our “Twitter/Environmental data model” predicts the risk for a daily number of ED visits being High, Low, or Medium. The former is an individual-level risk model, while the latter is a population-level risk model. Our population-level asthma risk prediction model has the potential for complementing current individual-level models, and may lead to a shorter time window and better accuracy of prediction. This in turn has implications for better planning and proactive treatment in specific geo-locations at specific time periods.
CHAPTER 5
5.0 SYSTEM STUDY:
5.1 FEASIBILITY STUDY:
The feasibility of the project is analyzed in this phase, and a business proposal is put forth with a very general plan for the project and some cost estimates. During system analysis the feasibility study of the proposed system is to be carried out. This is to ensure that the proposed system is not a burden to the company. For feasibility analysis, some understanding of the major requirements for the system is essential.
Three key considerations involved in the feasibility analysis are
- ECONOMICAL FEASIBILITY
- TECHNICAL FEASIBILITY
- SOCIAL FEASIBILITY
5.1.1 ECONOMICAL FEASIBILITY:
This study is carried out to check the economic impact that the system will have on the organization. The amount of funds that the company can pour into the research and development of the system is limited. The expenditures must be justified. Thus the developed system is well within the budget, and this was achieved because most of the technologies used are freely available. Only the customized products had to be purchased.
5.1.2 TECHNICAL FEASIBILITY
This study is carried out to check the technical feasibility, that is, the technical requirements of the system. Any system developed must not place a high demand on the available technical resources, as this would lead to high demands being placed on the client. The developed system must have modest requirements, as only minimal or null changes are required for implementing this system.
5.1.3 SOCIAL FEASIBILITY:
The aspect of study is to check the level of acceptance of the system by the user. This includes the process of training the user to use the system efficiently. The user must not feel threatened by the system, instead must accept it as a necessity. The level of acceptance by the users solely depends on the methods that are employed to educate the user about the system and to make him familiar with it. His level of confidence must be raised so that he is also able to make some constructive criticism, which is welcomed, as he is the final user of the system.
5.2 SYSTEM TESTING:
Testing is a process of checking whether the developed system works according to the original objectives and requirements. It is a set of activities that can be planned in advance and conducted systematically. Testing is vital to the success of the system. System testing makes a logical assumption that if all the parts of the system are correct, the goal will be successfully achieved. Inadequate or absent testing leads to errors that may not appear until many months later.
This creates two problems: the time lag between the cause and the appearance of the problem, and the effect of system errors on the files and records within the system. A small system error can conceivably explode into a much larger problem. Effective testing early in the process translates directly into long-term cost savings from a reduced number of errors. Another reason for system testing is its utility as a user-oriented vehicle before implementation. The best programs are worthless if they do not produce the correct outputs.
5.2.1 UNIT TESTING:
Description | Expected result |
Test for application window properties. | All the properties of the windows are to be properly aligned and displayed. |
Test for mouse operations. | All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions. |
A program represents the logical elements of a system. For a program to run satisfactorily, it must compile and test data correctly and tie in properly with other programs. Achieving an error-free program is the responsibility of the programmer. Program testing checks for two types of errors: syntax and logical. A syntax error is a program statement that violates one or more rules of the language in which it is written. An improperly defined field dimension or omitted keywords are common syntax errors. These errors are shown through error messages generated by the computer. For logic errors the programmer must examine the output carefully.
5.2.2 FUNCTIONAL TESTING:
Functional testing of an application is used to prove that the application delivers correct results, using enough inputs to give an adequate level of confidence that it will work correctly for all sets of inputs. The functional testing will need to prove that the application works for each client type and that the personalization functions work correctly. When a program is tested, the actual output is compared with the expected output. When there is a discrepancy, the sequence of instructions must be traced to determine the problem. The process is facilitated by breaking the program into self-contained portions, each of which can be checked at certain key points. The idea is to compare program values against desk-calculated values to isolate the problems.
Description | Expected result |
Test for all modules. | All peers should communicate in the group. |
Test for various peers in a distributed network framework as it displays all users available in the group. | The result after execution should give the accurate result. |
5.2.3 NON-FUNCTIONAL TESTING:
Non-functional software testing encompasses a rich spectrum of testing strategies, describing the expected results for every test case. It uses symbolic analysis techniques. This testing is used to check that an application will work in the operational environment. Non-functional testing includes:
- Load testing
- Performance testing
- Usability testing
- Reliability testing
- Security testing
5.2.4 LOAD TESTING:
An important tool for implementing system tests is a Load generator. A Load generator is essential for testing quality requirements such as performance and stress. A load can be a real load, that is, the system can be put under test to real usage by having actual telephone users connected to it. They will generate test input data for system test.
Description | Expected result |
It is necessary to ascertain that the application behaves correctly under loads when ‘Server busy’ response is received. | Should designate another active node as a Server. |
5.2.5 PERFORMANCE TESTING:
Performance tests are utilized in order to determine the widely defined performance of the software system such as execution time associated with various parts of the code, response time and device utilization. The intent of this testing is to identify weak points of the software system and quantify its shortcomings.
Description | Expected result |
This is required to assure that an application performs adequately, having the capability to handle many peers, delivering its results in expected time and using an acceptable level of resources; it is an aspect of operational management. | Should handle large input values, and produce accurate results in an expected time. |
5.2.6 RELIABILITY TESTING:
Software reliability is the ability of a system or component to perform its required functions under stated conditions for a specified period of time, and it is being ensured in this testing. Reliability can be expressed as the ability of the software to reveal defects under testing conditions, according to the specified requirements. It is the probability that a software system will operate without failure under given conditions for a given time interval, and it focuses on the behavior of the software element. It forms a part of the software quality control effort.
Description | Expected result |
This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in providing the application. | In case of failure of the server an alternate server should take over the job. |
5.2.7 SECURITY TESTING:
Security testing evaluates system characteristics that relate to the availability, integrity and confidentiality of the system data and services. Users/Clients should be encouraged to make sure their security needs are very clearly known at requirements time, so that the security issues can be addressed by the designers and testers.
Description | Expected result |
Checking that the user identification is authenticated. | In case of failure it should not be connected to the framework. |
Check whether group keys in a tree are shared by all peers. | The peers should know group key in the same group. |
5.2.8 WHITE BOX TESTING:
White box testing, sometimes called glass-box testing, is a test case design method that uses the control structure of the procedural design to derive test cases. Using the white box testing method, the software engineer can derive test cases. White box testing focuses on the inner structure of the software to be tested.
Description | Expected result |
Exercise all logical decisions on their true and false sides. | All the logical decisions must be valid. |
Execute all loops at their boundaries and within their operational bounds. | All the loops must be finite. |
Exercise internal data structures to ensure their validity. | All the data structures must be valid. |
5.2.9 BLACK BOX TESTING:
Black box testing, also called behavioral testing, focuses on the functional requirements of the software. That is, black box testing enables the software engineer to derive sets of input conditions that will fully exercise all functional requirements for a program. Black box testing is not an alternative to white box techniques. Rather, it is a complementary approach that is likely to uncover a different class of errors than white box methods. Black box testing attempts to find errors, focusing on the inputs, outputs, and principal functions of a software module. The starting point of black box testing is either a specification or code. The contents of the box are hidden, and the stimulated software should produce the desired results.
Description | Expected result |
To check for incorrect or missing functions. | All the functions must be valid. |
To check for interface errors. | The entire interface must function normally. |
To check for errors in data structures or external database access. | Database updates and retrieval must be performed correctly. |
To check for initialization and termination errors. | All the functions and data structures must be initialized properly and terminated normally. |
All the above system testing strategies are carried out during development, as documentation and institutionalization of the proposed goals and related policies are essential.
CHAPTER 6
6.0 SOFTWARE DESCRIPTION:
6.1 JAVA TECHNOLOGY:
Java technology is both a programming language and a platform.
The Java Programming Language
The Java programming language is a high-level language that can be characterized by all of the following buzzwords:
- Simple
- Architecture neutral
- Object oriented
- Portable
- Distributed
- High performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure
With most programming languages, you either compile or interpret a program so that you can run it on your computer. The Java programming language is unusual in that a program is both compiled and interpreted. With the compiler, first you translate a program into an intermediate language called Java byte codes —the platform-independent codes interpreted by the interpreter on the Java platform. The interpreter parses and runs each Java byte code instruction on the computer. Compilation happens just once; interpretation occurs each time the program is executed. The following figure illustrates how this works.
You can think of Java byte codes as the machine code instructions for the Java Virtual Machine (Java VM). Every Java interpreter, whether it’s a development tool or a Web browser that can run applets, is an implementation of the Java VM. Java byte codes help make “write once, run anywhere” possible. You can compile your program into byte codes on any platform that has a Java compiler. The byte codes can then be run on any implementation of the Java VM. That means that as long as a computer has a Java VM, the same program written in the Java programming language can run on Windows 2000, a Solaris workstation, or on an iMac.
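As a minimal illustration of this compile-once, interpret-everywhere cycle (the class name and commands below are only an example, not part of this project), a file saved as Hello.java is compiled to byte codes with javac and then run by any Java VM with java:
// Saved as Hello.java; compiled with "javac Hello.java", run with "java Hello".
public class Hello {
    public static void main(String[] args) {
        // The same Hello.class byte codes run on any platform with a Java VM.
        System.out.println("Hello from the Java platform");
    }
}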
6.2 THE JAVA PLATFORM:
A platform is the hardware or software environment in which a program runs. We’ve already mentioned some of the most popular platforms like Windows 2000, Linux, Solaris, and MacOS. Most platforms can be described as a combination of the operating system and hardware. The Java platform differs from most other platforms in that it’s a software-only platform that runs on top of other hardware-based platforms.
The Java platform has two components:
- The Java Virtual Machine (Java VM)
- The Java Application Programming Interface (Java API)
You’ve already been introduced to the Java VM. It’s the base for the Java platform and is ported onto various hardware-based platforms.
The Java API is a large collection of ready-made software components that provide many useful capabilities, such as graphical user interface (GUI) widgets. The Java API is grouped into libraries of related classes and interfaces; these libraries are known as packages. The next section, What Can Java Technology Do?, highlights what functionality some of the packages in the Java API provide.
The following figure depicts a program that’s running on the Java platform. As the figure shows, the Java API and the virtual machine insulate the program from the hardware.
Native code is code that, once compiled, runs on a specific hardware platform. As a platform-independent environment, the Java platform can be a bit slower than native code. However, smart compilers, well-tuned interpreters, and just-in-time byte code compilers can bring performance close to that of native code without threatening portability.
6.3 WHAT CAN JAVA TECHNOLOGY DO?
The most common types of programs written in the Java programming language are applets and applications. If you’ve surfed the Web, you’re probably already familiar with applets. An applet is a program that adheres to certain conventions that allow it to run within a Java-enabled browser.
However, the Java programming language is not just for writing cute, entertaining applets for the Web. The general-purpose, high-level Java programming language is also a powerful software platform. Using the generous API, you can write many types of programs.
An application is a standalone program that runs directly on the Java platform. A special kind of application known as a server serves and supports clients on a network. Examples of servers are Web servers, proxy servers, mail servers, and print servers. Another specialized program is a servlet.
A servlet can almost be thought of as an applet that runs on the server side. Java Servlets are a popular choice for building interactive web applications, replacing the use of CGI scripts. Servlets are similar to applets in that they are runtime extensions of applications. Instead of working in browsers, though, servlets run within Java Web servers, configuring or tailoring the server.
How does the API support all these kinds of programs? It does so with packages of software components that provides a wide range of functionality. Every full implementation of the Java platform gives you the following features:
- The essentials: Objects, strings, threads, numbers, input and output, data structures, system properties, date and time, and so on.
- Applets: The set of conventions used by applets.
- Networking: URLs, TCP (Transmission Control Protocol), UDP (User Datagram Protocol) sockets, and IP (Internet Protocol) addresses.
- Internationalization: Help for writing programs that can be localized for users worldwide. Programs can automatically adapt to specific locales and be displayed in the appropriate language.
- Security: Both low level and high level, including electronic signatures, public and private key management, access control, and certificates.
- Software components: Known as JavaBeansTM, can plug into existing component architectures.
- Object serialization: Allows lightweight persistence and communication via Remote Method Invocation (RMI).
- Java Database Connectivity (JDBCTM): Provides uniform access to a wide range of relational databases.
The Java platform also has APIs for 2D and 3D graphics, accessibility, servers, collaboration, telephony, speech, animation, and more. The following figure depicts what is included in the Java 2 SDK.
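As a small sketch of the object serialization feature listed above (the class and file names are invented for this illustration), an object can be written to a stream and later restored:
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializationSketch {
    // A small serializable value type used only for this example.
    static class Note implements Serializable {
        String text;
        Note(String text) { this.text = text; }
    }

    public static void main(String[] args) throws Exception {
        // Write the object to a file, giving it lightweight persistence.
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream("note.ser"))) {
            out.writeObject(new Note("persisted lightweight state"));
        }
        // Read the object back and use it.
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream("note.ser"))) {
            Note restored = (Note) in.readObject();
            System.out.println(restored.text);
        }
    }
}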
6.4 HOW WILL JAVA TECHNOLOGY CHANGE MY LIFE?
We can’t promise you fame, fortune, or even a job if you learn the Java programming language. Still, it is likely to make your programs better and requires less effort than other languages. We believe that Java technology will help you do the following:
- Get started quickly: Although the Java programming language is a powerful object-oriented language, it’s easy to learn, especially for programmers already familiar with C or C++.
- Write less code: Comparisons of program metrics (class counts, method counts, and so on) suggest that a program written in the Java programming language can be four times smaller than the same program in C++.
- Write better code: The Java programming language encourages good coding practices, and its garbage collection helps you avoid memory leaks. Its object orientation, its JavaBeans component architecture, and its wide-ranging, easily extendible API let you reuse other people’s tested code and introduce fewer bugs.
- Develop programs more quickly: Your development time may be as much as twice as fast versus writing the same program in C++. Why? You write fewer lines of code and it is a simpler programming language than C++.
- Avoid platform dependencies with 100% Pure Java: You can keep your program portable by avoiding the use of libraries written in other languages. The 100% Pure JavaTM Product Certification Program has a repository of historical process manuals, white papers, brochures, and similar materials online.
- Write once, run anywhere: Because 100% Pure Java programs are compiled into machine-independent byte codes, they run consistently on any Java platform.
- Distribute software more easily: You can upgrade applets easily from a central server. Applets take advantage of the feature of allowing new classes to be loaded “on the fly,” without recompiling the entire program.
6.5 ODBC:
Microsoft Open Database Connectivity (ODBC) is a standard programming interface for application developers and database systems providers. Before ODBC became a de facto standard for Windows programs to interface with database systems, programmers had to use proprietary languages for each database they wanted to connect to. Now, ODBC has made the choice of the database system almost irrelevant from a coding perspective, which is as it should be. Application developers have much more important things to worry about than the syntax that is needed to port their program from one database to another when business needs suddenly change.
Through the ODBC Administrator in Control Panel, you can specify the particular database that is associated with a data source that an ODBC application program is written to use. Think of an ODBC data source as a door with a name on it. Each door will lead you to a particular database. For example, the data source named Sales Figures might be a SQL Server database, whereas the Accounts Payable data source could refer to an Access database. The physical database referred to by a data source can reside anywhere on the LAN.
The ODBC system files are not installed on your system by Windows 95. Rather, they are installed when you setup a separate database application, such as SQL Server Client or Visual Basic 4.0. When the ODBC icon is installed in Control Panel, it uses a file called ODBCINST.DLL. It is also possible to administer your ODBC data sources through a stand-alone program called ODBCADM.EXE. There is a 16-bit and a 32-bit version of this program and each maintains a separate list of ODBC data sources.
From a programming perspective, the beauty of ODBC is that the application can be written to use the same set of function calls to interface with any data source, regardless of the database vendor. The source code of the application doesn’t change whether it talks to Oracle or SQL Server. We only mention these two as an example. There are ODBC drivers available for several dozen popular database systems. Even Excel spreadsheets and plain text files can be turned into data sources. The operating system uses the Registry information written by ODBC Administrator to determine which low-level ODBC drivers are needed to talk to the data source (such as the interface to Oracle or SQL Server). The loading of the ODBC drivers is transparent to the ODBC application program. In a client/server environment, the ODBC API even handles many of the network issues for the application programmer.
The advantages
of this scheme are so numerous that you are probably thinking there must be
some catch. The only disadvantage of ODBC is that it isn’t as efficient as
talking directly to the native database interface. ODBC has had many detractors
make the charge that it is too slow. Microsoft has always claimed that the
critical factor in performance is the quality of the driver software that is
used. In our humble opinion, this is true. The availability of good ODBC
drivers has improved a great deal recently. And anyway, the criticism about
performance is somewhat analogous to those who said that compilers would never
match the speed of pure assembly language. Maybe not, but the compiler (or
ODBC) gives you the opportunity to write cleaner programs, which means you
finish sooner. Meanwhile, computers get faster every year.
6.6 JDBC:
In an effort to set an independent database standard API for Java; Sun Microsystems developed Java Database Connectivity, or JDBC. JDBC offers a generic SQL database access mechanism that provides a consistent interface to a variety of RDBMSs. This consistent interface is achieved through the use of “plug-in” database connectivity modules, or drivers. If a database vendor wishes to have JDBC support, he or she must provide the driver for each platform that the database and Java run on.
To gain a wider acceptance of JDBC, Sun based JDBC’s framework on ODBC. As you discovered earlier in this chapter, ODBC has widespread support on a variety of platforms. Basing JDBC on ODBC will allow vendors to bring JDBC drivers to market much faster than developing a completely new connectivity solution.
JDBC was announced in March of 1996. It was released for a 90 day public review that ended June 8, 1996. Because of user input, the final JDBC v1.0 specification was released soon after.
The remainder of this section will cover enough information about JDBC for you to know what it is about and how to use it effectively. This is by no means a complete overview of JDBC. That would fill an entire book.
6.7 JDBC Goals:
Few software packages are designed without goals in mind, and JDBC is no exception; its many goals drove the development of the API. These goals, in conjunction with early reviewer feedback, have finalized the JDBC class library into a solid framework for building database applications in Java.
The goals that were set for JDBC are important. They will give you some insight as to why certain classes and functionalities behave the way they do. The eight design goals for JDBC are as follows:
SQL Level API
The designers felt that their main goal was to define a SQL interface for Java. Although not the lowest database interface level possible, it is at a low enough level for higher-level tools and APIs to be created. Conversely, it is at a high enough level for application programmers to use it confidently. Attaining this goal allows for future tool vendors to “generate” JDBC code and to hide many of JDBC’s complexities from the end user.
SQL Conformance
SQL syntax varies as you move from database vendor to database vendor. In an effort to support a wide variety of vendors, JDBC will allow any query statement to be passed through it to the underlying database driver. This allows the connectivity module to handle non-standard functionality in a manner that is suitable for its users.
JDBC must be implementable on top of common database interfaces
The JDBC SQL API must “sit” on top of other common SQL level APIs. This goal allows JDBC to use existing ODBC level drivers by the use of a software interface. This interface would translate JDBC calls to ODBC and vice versa.
- Provide a Java interface that is consistent with the rest of the Java system
Because of Java’s acceptance in the user community thus far, the designers feel that they should not stray from the current design of the core Java system.
- Keep it simple
This goal probably appears in all software design goal listings. JDBC is no exception. Sun felt that the design of JDBC should be very simple, allowing for only one method of completing a task per mechanism. Allowing duplicate functionality only serves to confuse the users of the API.
- Use strong, static typing wherever possible
Strong typing allows for more error checking to be done at compile time; also, fewer errors appear at runtime.
- Keep the common cases simple
Because, more often than not, the usual SQL calls used by the programmer are simple SELECTs, INSERTs, DELETEs and UPDATEs, these queries should be simple to perform with JDBC. However, more complex SQL statements should also be possible.
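A minimal sketch of the resulting call pattern is shown below. The connection URL, table, and column names are placeholders rather than part of this project's schema, and the snippet assumes a suitable JDBC driver is available on the classpath.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class JdbcSketch {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:odbc:SampleDSN";  // placeholder URL; the form depends on the driver in use
        try (Connection con = DriverManager.getConnection(url, "user", "password");
             PreparedStatement ps = con.prepareStatement(
                     "SELECT name FROM employees WHERE id = ?")) {
            ps.setInt(1, 42);                // bind the parameter of a simple SELECT
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("name"));
                }
            }
        }
    }
}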
Finally, we decided to proceed with the implementation using Java networking.
For dynamically updating the cache table, we use an MS Access database.
Java has two aspects: a programming language and a platform.
Java is a high-level programming language that is all of the following:
- Simple
- Architecture-neutral
- Object-oriented
- Portable
- Distributed
- High-performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure
Java is also unusual in that each Java program is both compiled and interpreted. With a compiler, you translate a Java program into an intermediate language called Java byte codes; this platform-independent code is then passed to and run on the computer.
Compilation happens just once; interpretation occurs each time the program is executed. The figure illustrates how this works.
6.7 NETWORKING TCP/IP STACK:
The TCP/IP stack is shorter than the OSI one:
TCP is a connection-oriented protocol; UDP (User Datagram Protocol) is a connectionless protocol.
IP datagrams:
The IP layer provides a connectionless and unreliable delivery system. It considers each datagram independently of the others. Any association between datagrams must be supplied by the higher layers. The IP layer supplies a checksum that includes its own header. The header includes the source and destination addresses. The IP layer handles routing through an Internet. It is also responsible for breaking up large datagrams into smaller ones for transmission and reassembling them at the other end.
UDP:
UDP is also connectionless and unreliable. What it adds to IP is a checksum for the contents of the datagram and port numbers. These are used to give a client/server model – see later.
TCP:
TCP supplies logic to give a reliable connection-oriented protocol above IP. It provides a virtual circuit that two processes can use to communicate.
Internet addresses
In order to use a service, you must be able to find it. The Internet uses an address scheme for machines so that they can be located. The address is a 32 bit integer which gives the IP address.
Network address:
Class A uses 8 bits for the network address with 24 bits left over for other addressing. Class B uses 16 bit network addressing. Class C uses 24 bit network addressing and class D uses all 32.
Subnet address:
Internally, the UNIX network is divided into sub networks. Building 11 is currently on one sub network and uses 10-bit addressing, allowing 1024 different hosts.
Host address:
8 bits are finally used for host addresses within our subnet. This places a limit of 256 machines that can be on the subnet.
Total address:
The 32 bit address is usually written as 4 integers separated by dots.
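For illustration only (this helper is not part of the project code), the dotted notation can be produced by taking each byte of the 32-bit address in turn:
public class DottedQuad {
    // Format a 32-bit IPv4 address as "a.b.c.d".
    static String toDotted(int address) {
        return ((address >>> 24) & 0xFF) + "." + ((address >>> 16) & 0xFF) + "."
                + ((address >>> 8) & 0xFF) + "." + (address & 0xFF);
    }

    public static void main(String[] args) {
        // 0xC0A80001 corresponds to 192.168.0.1.
        System.out.println(toDotted(0xC0A80001));
    }
}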
Port addresses
A service exists on a host, and is identified by its port. This is a 16 bit number. To send a message to a server, you send it to the port for that service of the host that it is running on. This is not location transparency! Certain of these ports are “well known”.
Sockets:
A socket is a data structure maintained by the system
to handle network connections. A socket is created using the call socket
. It returns an integer that is like a file
descriptor. In fact, under Windows, this handle can be used with Read File
and Write File
functions.
#include <sys/types.h>
#include <sys/socket.h>
int socket(int family, int type, int protocol);
Here “family” will be AF_INET for IP communications, protocol will be zero, and type will depend on whether TCP or UDP is used. Two processes wishing to communicate over a network each create a socket. These are similar to two ends of a pipe – but the actual pipe does not yet exist.
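For comparison with the C call above, a small Java sketch of the client end of such a TCP connection is given below; the host name and port are placeholders, and it assumes some server is already listening on that port (this example is illustrative, not project code).
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

public class TcpClientSketch {
    public static void main(String[] args) throws Exception {
        // Connect one end of the "pipe" to a service at host:port.
        try (Socket socket = new Socket("localhost", 7777);
             PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream()))) {
            out.println("hello");               // send one request line
            System.out.println(in.readLine());  // read one reply line
        }
    }
}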
6.8 JFREE CHART:
JFreeChart is a free 100% Java chart library that makes it easy for developers to display professional quality charts in their applications. JFreeChart’s extensive feature set includes:
- A consistent and well-documented API, supporting a wide range of chart types;
- A flexible design that is easy to extend, and targets both server-side and client-side applications;
- Support for many output types, including Swing components, image files (including PNG and JPEG), and vector graphics file formats (including PDF, EPS and SVG);
JFreeChart is “open source” or, more specifically, free software. It is distributed under the terms of the GNU Lesser General Public Licence (LGPL), which permits use in proprietary applications.
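A minimal usage sketch of the typical JFreeChart workflow (build a dataset, create a chart, save it as a PNG) is given below; the series values are made-up sample data, and the exact factory signature may vary slightly between JFreeChart releases.
import java.io.File;
import org.jfree.chart.ChartFactory;
import org.jfree.chart.ChartUtilities;
import org.jfree.chart.JFreeChart;
import org.jfree.chart.plot.PlotOrientation;
import org.jfree.data.xy.XYSeries;
import org.jfree.data.xy.XYSeriesCollection;

public class ChartSketch {
    public static void main(String[] args) throws Exception {
        // Build a small dataset with sample values.
        XYSeries series = new XYSeries("Response time");
        series.add(1, 12.0);
        series.add(2, 9.5);
        series.add(3, 14.2);
        XYSeriesCollection dataset = new XYSeriesCollection(series);

        // Create an XY line chart and write it to a PNG file.
        JFreeChart chart = ChartFactory.createXYLineChart(
                "Sample chart", "Run", "Milliseconds", dataset,
                PlotOrientation.VERTICAL, true, false, false);
        ChartUtilities.saveChartAsPNG(new File("chart.png"), chart, 640, 480);
    }
}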
6.8.1. Map Visualizations:
Charts showing values that relate to geographical areas. Some examples include: (a) population density in each state of the United States, (b) income per capita for each country in Europe, (c) life expectancy in each country of the world. The tasks in this project include: Sourcing freely redistributable vector outlines for the countries of the world, states/provinces in particular countries (USA in particular, but also other areas);
Creating an appropriate dataset interface (plus a default implementation) and a renderer, and integrating these with the existing XYPlot class in JFreeChart; testing, documenting, testing some more, documenting some more.
6.8.2. Time Series Chart Interactivity
Implement a new (to JFreeChart) feature for interactive time series charts — to display a separate control that shows a small version of ALL the time series data, with a sliding “view” rectangle that allows you to select the subset of the time series data to display in the main chart.
6.8.3. Dashboards
There is currently a lot of interest in dashboard displays. Create a flexible dashboard mechanism that supports a subset of JFreeChart chart types (dials, pies, thermometers, bars, and lines/time series) that can be delivered easily via both Java Web Start and an applet.
6.8.4. Property Editors
The property editor mechanism in JFreeChart only
handles a small subset of the properties that can be set for charts. Extend (or
reimplement) this mechanism to provide greater end-user control over the
appearance of the charts.
CHAPTER 7
7.0 APPENDIX
7.1 SAMPLE SCREEN SHOTS:
7.2 SAMPLE SOURCE CODE:
CHAPTER 8
8.1 CONCLUSION AND FUTURE WORK:
In this study, we have provided preliminary evidence that social media and environmental data can be leveraged to accurately predict asthma ED visits at a population level. We are in the process of confirming these preliminary findings by collecting larger clinical datasets across different seasons and multiple hospitals. Our continued work is focused on extending this research to propose a temporal prediction model that analyzes the trends in tweets and air quality index changes, and estimates the time lag between these changes and the number of asthma ED visits.
We also are collecting air quality index
data over a longer time period to examine the effects of seasonal variations.
In addition, we would like to explore the effect of relevant data from other
types of social media interactions, e.g. blogs and discussion forums, on our
asthma visit prediction model. Additional studies are needed to examine how
combining real-time or near-real-time social media and environmental data with
more traditional data might affect the performance and timing of current
individual-level prediction models for asthma, and eventually, for other
chronic conditions. In future projects, we intend to extend our work to
diseases with geographical and temporal variability, e.g., COPD and diabetes.
Performing Initiative Data Prefetching in Distributed File Systems for Cloud Computing
This work presents an initiative data prefetching scheme on the storage servers in distributed file systems for cloud computing. In this prefetching technique, the client machines are not substantially involved in the process of data prefetching; instead, the storage servers can directly prefetch the data after analyzing the history of disk I/O access events, and then send the prefetched data to the relevant client machines proactively. To put this technique to work, the information about client nodes is piggybacked onto the real client I/O requests and then forwarded to the relevant storage server. Next, two prediction algorithms have been proposed to forecast future block access operations for directing what data should be fetched on storage servers in advance.
Finally, the prefetched data can be pushed to the relevant client machine from the storage server. Through a series of evaluation experiments with a collection of application benchmarks, we have demonstrated that our presented initiative prefetching technique can benefit distributed file systems for cloud environments to achieve better I/O performance. In particular, configuration-limited client machines in the cloud are not responsible for predicting I/O access operations, which can definitely contribute to preferable system performance on them.
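To make the piggybacking idea concrete, a hypothetical request structure is sketched below; the class, its fields, and the use of Java serialization are assumptions made for illustration only, not the message format of the actual system.
// Hypothetical read request carrying piggybacked client information, so the
// storage server can log the access history and later push prefetched blocks
// back to the issuing client.
public class ReadRequest implements java.io.Serializable {
    final String clientId;     // piggybacked identity of the client node
    final long fileId;         // file being read
    final long blockOffset;    // block requested now; the history of these drives prediction
    final int length;          // number of bytes requested

    ReadRequest(String clientId, long fileId, long blockOffset, int length) {
        this.clientId = clientId;
        this.fileId = fileId;
        this.blockOffset = blockOffset;
        this.length = length;
    }
}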
1.2 INTRODUCTION
The assimilation of distributed computing for search engines, multimedia websites, and data-intensive applications has brought about the generation of data at unprecedented speed. For instance, the amount of data created, replicated, and consumed in the United States may double every three years through the end of this decade. In general, the file system deployed in a distributed computing environment is called a distributed file system, which is always used as a backend storage system to provide I/O services for various sorts of data-intensive applications in cloud computing environments. In fact, the distributed file system employs multiple distributed I/O devices by striping file data across the I/O nodes, and uses high aggregate bandwidth to meet the growing I/O requirements of distributed and parallel scientific applications.
However, because distributed file systems scale both numerically and geographically, the network delay is becoming the dominant factor in remote file system access [26], [34]. With regard to this issue, numerous data prefetching mechanisms have been proposed to hide the latency in distributed file systems caused by network communication and disk operations. In these conventional prefetching mechanisms, the client file system (which is a part of the file system and runs on the client machine) is supposed to predict future access by analyzing the history of I/O accesses that have occurred, without any application intervention. After that, the client file system may send relevant I/O requests to storage servers for reading the relevant data in. Consequently, the applications that have intensive read workloads can automatically yield not only better use of available bandwidth, but also fewer file operations via batched I/O requests through prefetching.
On the other hand, mobile devices generally have limited processing power, battery life and storage, but cloud computing offers an illusion of infinite computing resources. To combine mobile devices and cloud computing into a new infrastructure, the mobile cloud computing research field emerged [45]. Namely, mobile cloud computing provides mobile applications with data storage and processing services in clouds, obviating the requirement for a powerful hardware configuration, because all resource-intensive computing can be completed in the cloud. Thus, conventional prefetching schemes are not the best-suited optimization strategies for distributed file systems to boost I/O performance in mobile clouds, since these schemes require the client file systems running on client machines to proactively issue prefetching requests after analyzing the access events they have recorded, which places a negative burden on the client nodes.
Furthermore, although considering only disk I/O events can reveal the disk tracks that offer critical information for performing I/O optimization tactics, and certain prefetching techniques have been proposed in succession to read the data on the disk in advance after analyzing disk I/O traces, this kind of prefetching only works for local file systems, and the prefetched data is cached on the local machine to fulfill the application's I/O requests passively. In brief, although block access history reveals the behavior of disk tracks, there are no prefetching schemes on storage servers in a distributed file system for yielding better system performance. The reason for this situation is the difficulty of modeling the block access history to generate block access patterns and of deciding the destination client machine to which the prefetched data should be driven from the storage servers.
1.3 LITERATURE SURVEY
PARTIAL REPLICATION OF METADATA TO ACHIEVE HIGH METADATA AVAILABILITY IN PARALLEL FILE SYSTEMS
AUTHOR: J. Liao, Y. Ishikawa
PUBLISH: In the Proceedings of 41st International Conference on Parallel Processing (ICPP ’12), pp. 168–177, 2012.
EXPLANATION:
This paper presents PARTE, a
prototype parallel file system with active/standby configured metadata servers
(MDSs). PARTE replicates and distributes a part of files’ metadata to the
corresponding metadata stripes on the storage servers (OSTs) with a per-file
granularity, meanwhile the client file system (client) keeps certain sent
metadata requests. If the active MDS has crashed for some reason, these client
backup requests will be replayed by the standby MDS to restore the lost
metadata. In case one or more backup requests are lost due to network problems
or dead clients, the latest metadata saved in the associated metadata stripes
will be used to construct consistent and up-to-date metadata on the standby
MDS. Moreover, the clients and OSTs can work in both normal mode and recovery
mode in the PARTE file system. This differs from conventional active/standby
configured MDSs parallel file systems, which hang all I/O requests and metadata
requests during restoration of the lost metadata. In the PARTE file system,
previously connected clients can continue to perform I/O operations and
relevant metadata operations, because OSTs work as temporary MDSs during that
period by using the replicated metadata in the relevant metadata stripes.
Through examination of experimental results, we show the feasibility of the
main ideas presented in this paper for providing high availability metadata
service with only a slight overhead effect on I/O performance. Furthermore,
since previously connected clients are never hung during metadata recovery,
in contrast to conventional systems, a better overall I/O data throughput can
be achieved with PARTE.
EVALUATING PERFORMANCE AND ENERGY IN FILE SYSTEM SERVER WORKLOADS
AUTHOR: P. Sehgal, V. Tarasov, E. Zadok
PUBLISH: the 8th USENIX Conference on File and Storage Technologies (FAST ’10), pp.253-266, 2010.
EXPLANATION:
Recently, power has emerged as a critical factor in designing components of storage systems, especially for power-hungry data centers. While there is some research into power-aware storage stack components, there are no systematic studies evaluating each component’s impact separately. This paper evaluates the file system’s impact on energy consumption and performance. We studied several popular Linux file systems, with various mount and format options, using the FileBench workload generator to emulate four server workloads: Web, database, mail, and file server. In case of a server node consisting of a single disk, CPU power generally exceeds disk-power consumption. However, file system design, implementation, and available features have a significant effect on CPU/disk utilization, and hence on performance and power. We discovered that default file system options are often suboptimal, and even poor. We show that a careful matching of expected workloads to file system types and options can improve power-performance efficiency by a factor ranging from 1.05 to 9.4 times.
FLEXIBLE, WIDEAREA STORAGE FOR DISTRIBUTED SYSTEMS WITH WHEELFS
AUTHOR: J. Stribling, Y. Sovran, I. Zhang and R. Morris et al
PUBLISH: In Proceedings of the 6th USENIX symposium on Networked systems design and implementation (NSDI’09), USENIX Association, pp. 43–58, 2009.
EXPLANATION:
WheelFS is a wide-area distributed storage system intended to help multi-site applications share data and gain fault tolerance. WheelFS takes the form of a distributed file system with a familiar POSIX interface. Its design allows applications to adjust the tradeoff between prompt visibility of updates from other sites and the ability for sites to operate independently despite failures and long delays. WheelFS allows these adjustments via semantic cues, which provide application control over consistency, failure handling, and file and replica placement. WheelFS is implemented as a user-level file system and is deployed on PlanetLab and Emulab. Three applications (a distributed Web cache, an email service and large file distribution) demonstrate that WheelFS’s file system interface simplifies construction of distributed applications by allowing reuse of existing software. These applications would perform poorly with the strict semantics implied by a traditional file system interface, but by providing cues to WheelFS they are able to achieve good performance. Measurements show that applications built on WheelFS deliver comparable performance to services such as CoralCDN and BitTorrent that use specialized wide-area storage systems.
CHAPTER 2
2.0 SYSTEM ANALYSIS
2.1 EXISTING SYSTEM:
The file system deployed in a distributed computing environment is called a distributed file system, which is always used as a backend storage system to provide I/O services for various sorts of data-intensive applications in cloud computing environments. In fact, the distributed file system employs multiple distributed I/O devices by striping file data across the I/O nodes, and uses high aggregate bandwidth to meet the growing I/O requirements of distributed and parallel scientific applications. The existing evaluation uses the Sysbench benchmark to create OLTP workloads, since it is able to create OLTP workloads similar to those that exist in real systems. All the configured client file systems executed the same script, and each of them ran several threads that issue OLTP requests. Because Sysbench requires MySQL installed as a backend for OLTP workloads, we configured the mysqld process to 16 cores of the storage servers. As a consequence, it is possible to measure the response time to the client request while handling the generated workloads.
2.1.1 DISADVANTAGES:
- Network delay becomes the dominant factor in remote file system access as the system scales numerically and geographically
- Mobile devices generally have limited processing power, battery life and storage
2.2 PROPOSED SYSTEM:
Prefetching techniques that read data on the disk in advance after analyzing disk I/O traces have been proposed in succession, but this kind of prefetching only works for local file systems, and the prefetched data is cached on the local machine to fulfill the application's I/O requests passively. In brief, although block access history reveals the behavior of disk tracks, there are no prefetching schemes on storage servers in a distributed file system for yielding better system performance, because of the difficulties in modeling the block access history to generate block access patterns and in deciding the destination client machine for the prefetched data driven from storage servers. To yield attractive I/O performance in the distributed file system deployed in a mobile cloud environment, or in a cloud environment that has many resource-limited client machines, this paper presents an initiative data prefetching mechanism. The proposed mechanism first analyzes disk I/O traces to predict future disk I/O accesses so that the storage servers can fetch data in advance, and then forwards the prefetched data to the relevant client file systems for potential future use.
This paper makes the following two contributions:
1) Chaotic time series prediction and linear regression prediction to forecast disk I/O access. We have modeled the disk I/O access operations and classified them into two kinds of access patterns, i.e., the random access pattern and the sequential access pattern. Therefore, in order to predict the future I/O accesses that belong to the different access patterns as accurately as possible (note that the future I/O access indicates what data will be requested in the near future), two prediction algorithms, the chaotic time series prediction algorithm and the linear regression prediction algorithm, have been proposed respectively.
2) Initiative data prefetching on storage servers, without any intervention from client file systems except for piggybacking their information onto relevant I/O requests to the storage servers. The storage servers are supposed to log disk I/O accesses and classify access patterns after modeling disk I/O events. Next, by properly using the two proposed prediction algorithms, the storage servers can predict future disk I/O accesses to guide data prefetching. Finally, the storage servers proactively forward the prefetched data to the relevant client file systems to satisfy future application requests.
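As an illustration of the second prediction algorithm named above, the following sketch fits a least-squares line to the most recent block offsets of a sequential stream and extrapolates one step ahead; it is a simplified stand-in written for this report's context, not the exact formulation used in the system.
public class LinearPredictionSketch {
    // Fit offset = slope * i + intercept over the last n observed offsets and
    // extrapolate to position n, i.e., the block expected to be requested next.
    static long predictNextOffset(long[] recentOffsets) {
        int n = recentOffsets.length;
        double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
        for (int i = 0; i < n; i++) {
            sumX += i;
            sumY += recentOffsets[i];
            sumXY += (double) i * recentOffsets[i];
            sumXX += (double) i * i;
        }
        double slope = (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
        double intercept = (sumY - slope * sumX) / n;
        return Math.round(slope * n + intercept);   // extrapolate one step ahead
    }

    public static void main(String[] args) {
        // A strictly sequential trace: blocks 100, 108, 116, 124 -> predicts 132.
        System.out.println(predictNextOffset(new long[] {100, 108, 116, 124}));
    }
}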
2.2.1 ADVANTAGES:
- The applications that have intensive read workloads can automatically yield better use of available bandwidth.
- Fewer file operations via batched I/O requests through prefetching.
- Cloud computing offers an illusion of infinite computing resources.
2.3 HARDWARE & SOFTWARE REQUIREMENTS:
2.3.1 HARDWARE REQUIREMENT:
- Processor – Pentium IV
- Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Key Board – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA
2.3.2 SOFTWARE REQUIREMENTS:
JAVA
- Operating System : Windows XP or Win7
- Front End : JAVA JDK 1.7
- Script : Java Script
- Document : MS-Office 2007
CHAPTER 3
3.0 SYSTEM DESIGN:
Data Flow Diagram / Use Case Diagram / Flow Diagram:
- The DFD is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on these data, and the output data generated by the system.
- The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components. These components are the system process, the data used by the process, an external entity that interacts with the system and the information flows in the system.
- DFD shows how the information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.
- A DFD may be used to represent a system at any level of abstraction. A DFD may be partitioned into levels that represent increasing information flow and functional detail.
NOTATION:
SOURCE OR DESTINATION OF DATA:
External sources or destinations, which may be people or organizations or other entities
DATA SOURCE:
Here the data referenced by a process is stored and retrieved.
PROCESS:
People, procedures, or devices that produce data; the physical component is not identified.
DATA FLOW:
Data moves in a specific direction from an origin to a destination. The data flow is a “packet” of data.
There are several common modeling rules when creating DFDs:
- All processes must have at least one data flow in and one data flow out.
- All processes should modify the incoming data, producing new forms of outgoing data.
- Each data store must be involved with at least one data flow.
- Each external entity must be involved with at least one data flow.
- A data flow must be attached to at least one process.
3.1 ARCHITECTURE DIAGRAM
3.2 DATAFLOW DIAGRAM
UML DIAGRAMS:
3.3 USE CASE DIAGRAM:
3.4 CLASS DIAGRAM:
3.5 SEQUENCE DIAGRAM:
3.6 ACTIVITY DIAGRAM:
CHAPTER 4
4.0 IMPLEMENTATION:
I/O ACCESS PREDICTION
4.1 ALGORITHM
MARKOV MODEL PREDICTION ALGORITHM
LINEAR PREDICTION ALGORITHM
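As a hedged illustration of the Markov model prediction algorithm named above (not the report's actual implementation), a first-order model can count observed block-to-block transitions and predict the most frequent successor of the current block:
import java.util.HashMap;
import java.util.Map;

public class MarkovPredictionSketch {
    // transitions.get(from).get(to) = number of times block "to" followed block "from".
    private final Map<Long, Map<Long, Integer>> transitions = new HashMap<>();

    void observe(long fromBlock, long toBlock) {
        transitions.computeIfAbsent(fromBlock, k -> new HashMap<>())
                   .merge(toBlock, 1, Integer::sum);
    }

    // Returns the most likely next block, or -1 if the current block is unseen.
    long predict(long currentBlock) {
        Map<Long, Integer> next = transitions.get(currentBlock);
        if (next == null) return -1;
        return next.entrySet().stream()
                   .max(Map.Entry.comparingByValue())
                   .get().getKey();
    }

    public static void main(String[] args) {
        MarkovPredictionSketch model = new MarkovPredictionSketch();
        long[] trace = {5, 9, 5, 9, 5, 2, 5, 9};
        for (int i = 1; i < trace.length; i++) model.observe(trace[i - 1], trace[i]);
        System.out.println(model.predict(5));   // prints 9, the most frequent successor of 5
    }
}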
4.2 MODULES:
SERVER CLIENT MODULE:
DISTRIBUTED FILE SYSTEMS:
INITIATIVE DATA PREFETCHING:
ANALYSIS OF PREDICTIONS:
4.3 MODULE DESCRIPTION:
SERVER CLIENT MODULE:
DISTRIBUTED FILE SYSTEMS:
INITIATIVE DATA PREFETCHING:
ANALYSIS OF PREDICTIONS:
CHAPTER 5
5.0 SYSTEM STUDY:
5.1 FEASIBILITY STUDY:
The feasibility of the project is analyzed in this phase, and a business proposal is put forth with a very general plan for the project and some cost estimates. During system analysis, the feasibility study of the proposed system is carried out. This is to ensure that the proposed system is not a burden to the company. For feasibility analysis, some understanding of the major requirements for the system is essential.
Three key considerations involved in the feasibility analysis are
- ECONOMICAL FEASIBILITY
- TECHNICAL FEASIBILITY
- SOCIAL FEASIBILITY
5.1.1 ECONOMICAL FEASIBILITY:
This study is carried out to check the economic impact that the system will have on the organization. The amount of funds that the company can pour into the research and development of the system is limited. The expenditures must be justified. Thus the developed system is well within the budget, and this was achieved because most of the technologies used are freely available. Only the customized products had to be purchased.
5.1.2 TECHNICAL FEASIBILITY
This study is carried out to check the technical feasibility, that is, the technical requirements of the system. Any system developed must not place a high demand on the available technical resources; otherwise, high demands will be placed on the client. The developed system must have modest requirements, as only minimal or no changes are required for implementing this system.
5.1.3 SOCIAL FEASIBILITY:
The aspect of study is to check the level of acceptance of the system by the user. This includes the process of training the user to use the system efficiently. The user must not feel threatened by the system, instead must accept it as a necessity. The level of acceptance by the users solely depends on the methods that are employed to educate the user about the system and to make him familiar with it. His level of confidence must be raised so that he is also able to make some constructive criticism, which is welcomed, as he is the final user of the system.
5.2 SYSTEM TESTING:
Testing is a process of checking whether the developed system works according to the original objectives and requirements. It is a set of activities that can be planned in advance and conducted systematically. Testing is vital to the success of the system. System testing makes a logical assumption that if all the parts of the system are correct, the overall goal will be successfully achieved. Inadequate or absent testing leads to errors that may not appear until many months later.
This creates two problems: the time lag between the cause and the appearance of the problem, and the effect of system errors on the files and records within the system. A small system error can conceivably explode into a much larger problem. Effective testing early in the process translates directly into long-term cost savings from a reduced number of errors. Another reason for system testing is its utility as a user-oriented vehicle before implementation. The best programs are worthless if they do not produce the correct outputs.
5.2.1 UNIT TESTING:
Description | Expected result |
Test for application window properties. | All the properties of the windows are to be properly aligned and displayed. |
Test for mouse operations. | All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions. |
A program represents the logical elements of a system. For a program to run satisfactorily, it must compile, process test data correctly, and tie in properly with other programs. Achieving an error-free program is the responsibility of the programmer. Program testing checks for two types of errors: syntax and logical. A syntax error is a program statement that violates one or more rules of the language in which it is written. An improperly defined field dimension or omitted keywords are common syntax errors. These errors are shown through error messages generated by the computer. For logic errors, the programmer must examine the output carefully.
5.2.2 FUNCTIONAL TESTING:
Functional testing of an application is used to prove that the application delivers correct results, using enough inputs to give an adequate level of confidence that it will work correctly for all sets of inputs. The functional testing will need to prove that the application works for each client type and that the personalization functions work correctly. When a program is tested, the actual output is compared with the expected output. When there is a discrepancy, the sequence of instructions must be traced to determine the problem. The process is facilitated by breaking the program into self-contained portions, each of which can be checked at certain key points. The idea is to compare program values against desk-calculated values to isolate the problems.
Description | Expected result |
Test for all modules. | All peers should communicate in the group. |
Test for various peers in a distributed network framework as it displays all users available in the group. | The result after execution should give the accurate result. |
5.2.3 NON-FUNCTIONAL TESTING:
The non-functional software testing encompasses a rich spectrum of testing strategies, describing the expected results for every test case. It uses symbolic analysis techniques. This testing is used to check that an application will work in the operational environment. Non-functional testing includes:
- Load testing
- Performance testing
- Usability testing
- Reliability testing
- Security testing
5.2.4 LOAD TESTING:
An important tool for implementing system tests is a load generator. A load generator is essential for testing quality requirements such as performance and stress. A load can be a real load; that is, the system can be put under test with real usage by having actual users connected to it. They will generate test input data for the system test.
Description | Expected result |
It is necessary to ascertain that the application behaves correctly under loads when ‘Server busy’ response is received. | Should designate another active node as a Server. |
5.2.5 PERFORMANCE TESTING:
Performance tests are utilized in order to determine the widely defined performance of the software system such as execution time associated with various parts of the code, response time and device utilization. The intent of this testing is to identify weak points of the software system and quantify its shortcomings.
Description | Expected result |
This is required to assure that the application performs adequately: it must handle many peers, deliver its results in the expected time, and use an acceptable level of resources; this is an aspect of operational management. | Should handle large input values and produce accurate results in the expected time. |
5.2.6 RELIABILITY TESTING:
Software reliability is the ability of a system or component to perform its required functions under stated conditions for a specified period of time, and this is what reliability testing ensures. Reliability can be expressed as the ability of the software to reveal defects under testing conditions, according to the specified requirements. It is the probability that a software system will operate without failure under given conditions for a given time interval, and it focuses on the behavior of the software element. This activity forms part of software quality control.
Description | Expected result |
This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in providing the application. | In case of failure of the server, an alternate server should take over the job. |
5.2.7 SECURITY TESTING:
Security testing evaluates system characteristics that relate to the availability, integrity and confidentiality of the system data and services. Users/Clients should be encouraged to make sure their security needs are very clearly known at requirements time, so that the security issues can be addressed by the designers and testers.
Description | Expected result |
Check that the user's identity is authenticated. | In case of failure, the user should not be connected to the framework. |
Check whether group keys in a tree are shared by all peers. | The peers should know group key in the same group. |
5.2.8 WHITE BOX TESTING:
White box testing, sometimes called glass-box testing, is a test-case design method that uses the control structure of the procedural design to derive test cases. Using the white box testing method, the software engineer can derive test cases that focus on the inner structure of the software to be tested.
Description | Expected result |
Exercise all logical decisions on their true and false sides. | All the logical decisions must be valid. |
Execute all loops at their boundaries and within their operational bounds. | All the loops must be finite. |
Exercise internal data structures to ensure their validity. | All the data structures must be valid. |
5.2.9 BLACK BOX TESTING:
Black box testing, also called behavioral testing, focuses on the functional requirements of the software. That is, black box testing enables the software engineer to derive sets of input conditions that will fully exercise all functional requirements for a program. Black box testing is not an alternative to white box techniques; rather, it is a complementary approach that is likely to uncover a different class of errors than white box methods. Black box testing attempts to find errors by focusing on the inputs, outputs, and principal functions of a software module. The starting point of black box testing is either a specification or the code. The contents of the box are hidden, and the stimulated software should produce the desired results.
Description | Expected result |
To check for incorrect or missing functions. | All the functions must be valid. |
To check for interface errors. | The entire interface must function normally. |
To check for errors in data structures or external database access. | Database updates and retrieval must be performed correctly. |
To check for initialization and termination errors. | All the functions and data structures must be initialized properly and terminated normally. |
All the above system testing strategies are carried out during development, as documentation and institutionalization of the proposed goals and related policies are essential.
CHAPTER 6
6.0 SOFTWARE DESCRIPTION:
6.1 JAVA TECHNOLOGY:
Java technology is both a programming language and a platform.
The Java Programming Language
The Java programming language is a high-level language that can be characterized by all of the following buzzwords:
- Simple
- Architecture neutral
- Object oriented
- Portable
- Distributed
- High performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure
With most programming languages, you either compile or interpret a program so that you can run it on your computer. The Java programming language is unusual in that a program is both compiled and interpreted. With the compiler, first you translate a program into an intermediate language called Java byte codes —the platform-independent codes interpreted by the interpreter on the Java platform. The interpreter parses and runs each Java byte code instruction on the computer. Compilation happens just once; interpretation occurs each time the program is executed. The following figure illustrates how this works.
You can think of Java byte codes as the machine code instructions for the Java Virtual Machine (Java VM). Every Java interpreter, whether it’s a development tool or a Web browser that can run applets, is an implementation of the Java VM. Java byte codes help make “write once, run anywhere” possible. You can compile your program into byte codes on any platform that has a Java compiler. The byte codes can then be run on any implementation of the Java VM. That means that as long as a computer has a Java VM, the same program written in the Java programming language can run on Windows 2000, a Solaris workstation, or on an iMac.
6.2 THE JAVA PLATFORM:
A platform is the hardware or software environment in which a program runs. We’ve already mentioned some of the most popular platforms like Windows 2000, Linux, Solaris, and MacOS. Most platforms can be described as a combination of the operating system and hardware. The Java platform differs from most other platforms in that it’s a software-only platform that runs on top of other hardware-based platforms.
The Java platform has two components:
- The Java Virtual Machine (Java VM)
- The Java Application Programming Interface (Java API)
You’ve already been introduced to the Java VM. It’s the base for the Java platform and is ported onto various hardware-based platforms.
The Java API is a large collection of ready-made software components that provide many useful capabilities, such as graphical user interface (GUI) widgets. The Java API is grouped into libraries of related classes and interfaces; these libraries are known as packages. The next section, What Can Java Technology Do?, highlights what functionality some of the packages in the Java API provide.
The following figure depicts a program that’s running on the Java platform. As the figure shows, the Java API and the virtual machine insulate the program from the hardware.
Native code is code that, once compiled, runs on a specific hardware platform. As a platform-independent environment, the Java platform can be a bit slower than native code. However, smart compilers, well-tuned interpreters, and just-in-time byte code compilers can bring performance close to that of native code without threatening portability.
6.3 WHAT CAN JAVA TECHNOLOGY DO?
The most common types of programs written in the Java programming language are applets and applications. If you’ve surfed the Web, you’re probably already familiar with applets. An applet is a program that adheres to certain conventions that allow it to run within a Java-enabled browser.
However, the Java programming language is not just for writing cute, entertaining applets for the Web. The general-purpose, high-level Java programming language is also a powerful software platform. Using the generous API, you can write many types of programs.
An application is a standalone program that runs directly on the Java platform. A special kind of application known as a server serves and supports clients on a network. Examples of servers are Web servers, proxy servers, mail servers, and print servers. Another specialized program is a servlet.
A servlet can almost be thought of as an applet that runs on the server side. Java Servlets are a popular choice for building interactive web applications, replacing the use of CGI scripts. Servlets are similar to applets in that they are runtime extensions of applications. Instead of working in browsers, though, servlets run within Java Web servers, configuring or tailoring the server.
How does the API support all these kinds of programs? It does so with packages of software components that provides a wide range of functionality. Every full implementation of the Java platform gives you the following features:
- The essentials: Objects, strings, threads, numbers, input and output, data structures, system properties, date and time, and so on.
- Applets: The set of conventions used by applets.
- Networking: URLs, TCP (Transmission Control Protocol), UDP (User Datagram Protocol) sockets, and IP (Internet Protocol) addresses.
- Internationalization: Help for writing programs that can be localized for users worldwide. Programs can automatically adapt to specific locales and be displayed in the appropriate language.
- Security: Both low level and high level, including electronic signatures, public and private key management, access control, and certificates.
- Software components: Known as JavaBeansTM, can plug into existing component architectures.
- Object serialization: Allows lightweight persistence and communication via Remote Method Invocation (RMI).
- Java Database Connectivity (JDBCTM): Provides uniform access to a wide range of relational databases.
The Java platform also has APIs for 2D and 3D graphics, accessibility, servers, collaboration, telephony, speech, animation, and more. The following figure depicts what is included in the Java 2 SDK.
6.4 HOW WILL JAVA TECHNOLOGY CHANGE MY LIFE?
We can’t promise you fame, fortune, or even a job if you learn the Java programming language. Still, it is likely to make your programs better and requires less effort than other languages. We believe that Java technology will help you do the following:
- Get started quickly: Although the Java programming language is a powerful object-oriented language, it’s easy to learn, especially for programmers already familiar with C or C++.
- Write less code: Comparisons of program metrics (class counts, method counts, and so on) suggest that a program written in the Java programming language can be four times smaller than the same program in C++.
- Write better code: The Java programming language encourages good coding practices, and its garbage collection helps you avoid memory leaks. Its object orientation, its JavaBeans component architecture, and its wide-ranging, easily extendible API let you reuse other people’s tested code and introduce fewer bugs.
- Develop programs more quickly: Your development time may be as much as twice as fast as when writing the same program in C++. Why? You write fewer lines of code, and Java is a simpler programming language than C++.
- Avoid platform dependencies with 100% Pure Java: You can keep your program portable by avoiding the use of libraries written in other languages. The 100% Pure JavaTM Product Certification Program has a repository of historical process manuals, white papers, brochures, and similar materials online.
- Write once, run anywhere: Because 100% Pure Java programs are compiled into machine-independent byte codes, they run consistently on any Java platform.
- Distribute software more easily: You can upgrade applets easily from a central server. Applets take advantage of the feature of allowing new classes to be loaded “on the fly,” without recompiling the entire program.
6.5 ODBC:
Microsoft Open Database Connectivity (ODBC) is a standard programming interface for application developers and database systems providers. Before ODBC became a de facto standard for Windows programs to interface with database systems, programmers had to use proprietary languages for each database they wanted to connect to. Now, ODBC has made the choice of the database system almost irrelevant from a coding perspective, which is as it should be. Application developers have much more important things to worry about than the syntax that is needed to port their program from one database to another when business needs suddenly change.
Through the ODBC Administrator in Control Panel, you can specify the particular database that is associated with a data source that an ODBC application program is written to use. Think of an ODBC data source as a door with a name on it. Each door will lead you to a particular database. For example, the data source named Sales Figures might be a SQL Server database, whereas the Accounts Payable data source could refer to an Access database. The physical database referred to by a data source can reside anywhere on the LAN.
The ODBC system files are not installed on your system by Windows 95. Rather, they are installed when you set up a separate database application, such as SQL Server Client or Visual Basic 4.0. When the ODBC icon is installed in Control Panel, it uses a file called ODBCINST.DLL. It is also possible to administer your ODBC data sources through a stand-alone program called ODBCADM.EXE. There is a 16-bit and a 32-bit version of this program, and each maintains a separate list of ODBC data sources.
From a programming perspective, the beauty of ODBC is that the application can be written to use the same set of function calls to interface with any data source, regardless of the database vendor. The source code of the application doesn’t change whether it talks to Oracle or SQL Server. We only mention these two as an example. There are ODBC drivers available for several dozen popular database systems. Even Excel spreadsheets and plain text files can be turned into data sources. The operating system uses the Registry information written by ODBC Administrator to determine which low-level ODBC drivers are needed to talk to the data source (such as the interface to Oracle or SQL Server). The loading of the ODBC drivers is transparent to the ODBC application program. In a client/server environment, the ODBC API even handles many of the network issues for the application programmer.
The advantages of this scheme are so numerous that you are probably thinking there must be some catch. The only disadvantage of ODBC is that it isn’t as efficient as talking directly to the native database interface. ODBC has had many detractors make the charge that it is too slow. Microsoft has always claimed that the critical factor in performance is the quality of the driver software that is used. In our humble opinion, this is true. The availability of good ODBC drivers has improved a great deal recently. And anyway, the criticism about performance is somewhat analogous to those who said that compilers would never match the speed of pure assembly language. Maybe not, but the compiler (or ODBC) gives you the opportunity to write cleaner programs, which means you finish sooner. Meanwhile, computers get faster every year.
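Although ODBC itself is a C-level API, a Java program such as the one developed in this project could reach an ODBC data source through the JDBC-ODBC bridge driver that ships with JDK 1.7 (JDBC is introduced in the next sections). The sketch below is illustrative only; the data source name "SalesFigures" and the queried table are hypothetical and would have to be configured in the ODBC Administrator first.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class OdbcBridgeDemo {
    public static void main(String[] args) throws Exception {
        // Load the JDBC-ODBC bridge driver bundled with JDK 1.7
        Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
        // "SalesFigures" is a hypothetical DSN configured in the ODBC Administrator
        Connection con = DriverManager.getConnection("jdbc:odbc:SalesFigures");
        Statement st = con.createStatement();
        // Placeholder table and column; substitute whatever the data source actually contains
        ResultSet rs = st.executeQuery("SELECT ip_address FROM nodes");
        while (rs.next()) {
            System.out.println(rs.getString(1));
        }
        rs.close();
        st.close();
        con.close();
    }
}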
6.6 JDBC:
In an effort to set an independent database standard API for Java; Sun Microsystems developed Java Database Connectivity, or JDBC. JDBC offers a generic SQL database access mechanism that provides a consistent interface to a variety of RDBMSs. This consistent interface is achieved through the use of “plug-in” database connectivity modules, or drivers. If a database vendor wishes to have JDBC support, he or she must provide the driver for each platform that the database and Java run on.
To gain a wider acceptance of JDBC, Sun based JDBC’s framework on ODBC. As you discovered earlier in this chapter, ODBC has widespread support on a variety of platforms. Basing JDBC on ODBC will allow vendors to bring JDBC drivers to market much faster than developing a completely new connectivity solution.
JDBC was announced in March of 1996. It was released for a 90 day public review that ended June 8, 1996. Because of user input, the final JDBC v1.0 specification was released soon after.
The remainder of this section will cover enough information about JDBC for you to know what it is about and how to use it effectively. This is by no means a complete overview of JDBC. That would fill an entire book.
6.7 JDBC Goals:
Few software packages are designed without goals in mind. JDBC is no exception; its many goals drove the development of the API. These goals, in conjunction with early reviewer feedback, have finalized the JDBC class library into a solid framework for building database applications in Java.
The goals that were set for JDBC are important. They will give you some insight as to why certain classes and functionalities behave the way they do. The eight design goals for JDBC are as follows:
- SQL Level API
The designers felt that their main goal was to define a SQL interface for Java. Although not the lowest database interface level possible, it is at a low enough level for higher-level tools and APIs to be created. Conversely, it is at a high enough level for application programmers to use it confidently. Attaining this goal allows for future tool vendors to “generate” JDBC code and to hide many of JDBC’s complexities from the end user.
- SQL Conformance
SQL syntax varies as you move from database vendor to database vendor. In an effort to support a wide variety of vendors, JDBC will allow any query statement to be passed through it to the underlying database driver. This allows the connectivity module to handle non-standard functionality in a manner that is suitable for its users.
- JDBC must be implementable on top of common database interfaces
The JDBC SQL API must “sit” on top of other common SQL level APIs. This goal allows JDBC to use existing ODBC level drivers by the use of a software interface. This interface would translate JDBC calls to ODBC and vice versa.
- Provide a Java interface that is consistent with the rest of the Java system
Because of Java’s acceptance in the user community thus far, the designers feel that they should not stray from the current design of the core Java system.
- Keep it simple
This goal probably appears in all software design goal listings. JDBC is no exception. Sun felt that the design of JDBC should be very simple, allowing for only one method of completing a task per mechanism. Allowing duplicate functionality only serves to confuse the users of the API.
- Use strong, static typing wherever possible
Strong typing allows for more error checking to be done at compile time; also, fewer errors appear at runtime.
- Keep the common cases simple
Because, more often than not, the usual SQL calls used by the programmer are simple SELECTs, INSERTs, DELETEs, and UPDATEs, these queries should be simple to perform with JDBC. However, more complex SQL statements should also be possible (see the sketch following this list).
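To make the common cases concrete, here is a minimal JDBC sketch covering a simple INSERT and SELECT. It is only an illustration: the connection URL, the nodes table, and its columns are hypothetical placeholders rather than the project’s actual schema or driver.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class JdbcCommonCases {
    public static void main(String[] args) throws Exception {
        // Placeholder URL; substitute the driver and data source actually used by the project
        Connection con = DriverManager.getConnection("jdbc:odbc:ProjectDB");

        // INSERT: parameters are bound with strong typing via PreparedStatement
        PreparedStatement ins = con.prepareStatement(
                "INSERT INTO nodes (ip_address, status) VALUES (?, ?)");
        ins.setString(1, "192.168.1.10");
        ins.setString(2, "ACTIVE");
        ins.executeUpdate();
        ins.close();

        // SELECT: the same call style works against any JDBC driver
        PreparedStatement sel = con.prepareStatement(
                "SELECT ip_address, status FROM nodes WHERE status = ?");
        sel.setString(1, "ACTIVE");
        ResultSet rs = sel.executeQuery();
        while (rs.next()) {
            System.out.println(rs.getString("ip_address") + " " + rs.getString("status"));
        }
        rs.close();
        sel.close();
        con.close();
    }
}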
Finally, we decided to proceed with the implementation using Java Networking. And for dynamically updating the cache table we use an MS Access database.
Java has two things: a programming language and a platform.
Java is a high-level programming language that is all of the following:
- Simple
- Architecture-neutral
- Object-oriented
- Portable
- Distributed
- High-performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure
Java is also unusual in that each Java program is both compiled and interpreted. With a compiler, you translate a Java program into an intermediate language called Java byte codes; this platform-independent code is then passed to and run on the computer.
Compilation happens just once; interpretation occurs each time the program is executed. The figure illustrates how this works.
6.7 NETWORKING TCP/IP STACK:
The TCP/IP stack is shorter than the OSI one:
TCP is a connection-oriented protocol; UDP (User Datagram Protocol) is a connectionless protocol.
IP datagrams:
The IP layer provides a connectionless and unreliable delivery system. It considers each datagram independently of the others. Any association between datagrams must be supplied by the higher layers. The IP layer supplies a checksum that includes its own header. The header includes the source and destination addresses. The IP layer handles routing through an Internet. It is also responsible for breaking up large datagrams into smaller ones for transmission and reassembling them at the other end.
UDP:
UDP is also connectionless and unreliable. What it adds to IP is a checksum for the contents of the datagram and port numbers. These are used to give a client/server model – see later.
TCP:
TCP supplies logic to give a reliable connection-oriented protocol above IP. It provides a virtual circuit that two processes can use to communicate.
Internet addresses
In order to use a service, you must be able to find it. The Internet uses an address scheme for machines so that they can be located. The address is a 32 bit integer which gives the IP address.
Network address:
Class A uses 8 bits for the network address with 24 bits left over for other addressing. Class B uses 16 bit network addressing. Class C uses 24 bit network addressing and class D uses all 32.
Subnet address:
Internally, the UNIX network is divided into sub networks. Building 11 is currently on one sub network and uses 10-bit addressing, allowing 1024 different hosts.
Host address:
8 bits are finally used for host addresses within our subnet. This places a limit of 256 machines that can be on the subnet.
Total address:
The 32 bit address is usually written as 4 integers separated by dots.
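As a small illustration (not part of the project code), the following Java snippet formats such a 32-bit address as four dot-separated integers; the sample address is invented:

public class DottedQuad {
    // Format a 32-bit IPv4 address as four dot-separated integers.
    static String toDotted(int address) {
        return ((address >>> 24) & 0xFF) + "." + ((address >>> 16) & 0xFF) + "."
                + ((address >>> 8) & 0xFF) + "." + (address & 0xFF);
    }
    public static void main(String[] args) {
        System.out.println(toDotted(0xC0A8010A));   // prints 192.168.1.10
    }
}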
Port addresses
A service exists on a host, and is identified by its port. This is a 16 bit number. To send a message to a server, you send it to the port for that service of the host that it is running on. This is not location transparency! Certain of these ports are “well known”.
Sockets:
A socket is a data structure maintained by the system to handle network connections. A socket is created using the call socket. It returns an integer that is like a file descriptor. In fact, under Windows, this handle can be used with the ReadFile and WriteFile functions.
#include <sys/types.h>
#include <sys/socket.h>
int socket(int family, int type, int protocol);
Here “family” will be AF_INET for IP communications, protocol will be zero, and type will depend on whether TCP or UDP is used. Two processes wishing to communicate over a network create a socket each. These are similar to two ends of a pipe – but the actual pipe does not yet exist.
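Since this project’s implementation uses Java networking rather than the C socket API shown above, here is a minimal Java sketch of the same idea: one socket at each end of a TCP connection. The port number and the messages exchanged are hypothetical placeholders.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

public class TcpEchoDemo {
    // Server end: accept one connection and echo a single line back.
    static void server(int port) throws Exception {
        ServerSocket listener = new ServerSocket(port);
        Socket s = listener.accept();
        BufferedReader in = new BufferedReader(new InputStreamReader(s.getInputStream()));
        PrintWriter out = new PrintWriter(s.getOutputStream(), true);
        out.println("echo: " + in.readLine());
        s.close();
        listener.close();
    }

    // Client end: connect, send one line, print the reply.
    static void client(String host, int port) throws Exception {
        Socket s = new Socket(host, port);
        PrintWriter out = new PrintWriter(s.getOutputStream(), true);
        BufferedReader in = new BufferedReader(new InputStreamReader(s.getInputStream()));
        out.println("hello");
        System.out.println(in.readLine());
        s.close();
    }

    public static void main(String[] args) throws Exception {
        final int port = 5000;                     // hypothetical port for the service
        new Thread(new Runnable() {
            public void run() {
                try { server(port); } catch (Exception e) { e.printStackTrace(); }
            }
        }).start();
        Thread.sleep(500);                         // give the server time to start listening
        client("localhost", port);
    }
}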
6.8 JFREE CHART:
JFreeChart is a free 100% Java chart library that makes it easy for developers to display professional quality charts in their applications. JFreeChart’s extensive feature set includes:
- A consistent and well-documented API, supporting a wide range of chart types;
- A flexible design that is easy to extend, and targets both server-side and client-side applications;
- Support for many output types, including Swing components, image files (including PNG and JPEG), and vector graphics file formats (including PDF, EPS and SVG).
JFreeChart is “open source” or, more specifically, free software. It is distributed under the terms of the GNU Lesser General Public Licence (LGPL), which permits use in proprietary applications.
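A minimal sketch of how a chart is produced with the JFreeChart 1.0.x API is shown below; the dataset values and output file name are invented for illustration only.

import java.io.File;
import org.jfree.chart.ChartFactory;
import org.jfree.chart.ChartUtilities;
import org.jfree.chart.JFreeChart;
import org.jfree.chart.plot.PlotOrientation;
import org.jfree.data.xy.XYSeries;
import org.jfree.data.xy.XYSeriesCollection;

public class ChartDemo {
    public static void main(String[] args) throws Exception {
        // Build a small sample dataset (values are placeholders)
        XYSeries series = new XYSeries("Packets per second");
        series.add(1.0, 120.0);
        series.add(2.0, 240.0);
        series.add(3.0, 180.0);
        XYSeriesCollection dataset = new XYSeriesCollection(series);

        JFreeChart chart = ChartFactory.createXYLineChart(
                "Traffic sample", "Time (s)", "Packets", dataset,
                PlotOrientation.VERTICAL, true, true, false);

        // Write the chart to a PNG file, one of the supported output types
        ChartUtilities.saveChartAsPNG(new File("traffic.png"), chart, 640, 480);
    }
}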
6.8.1. Map Visualizations:
Charts showing values that relate to geographical areas. Some examples include: (a) population density in each state of the United States, (b) income per capita for each country in Europe, (c) life expectancy in each country of the world. The tasks in this project include: Sourcing freely redistributable vector outlines for the countries of the world, states/provinces in particular countries (USA in particular, but also other areas);
Creating an appropriate dataset interface (plus default implementation), a renderer, and integrating this with the existing XYPlot class in JFreeChart; testing, documenting, testing some more, documenting some more.
6.8.2. Time Series Chart Interactivity
Implement a new (to JFreeChart) feature for interactive time series charts — to display a separate control that shows a small version of ALL the time series data, with a sliding “view” rectangle that allows you to select the subset of the time series data to display in the main chart.
6.8.3. Dashboards
There is currently a lot of interest in dashboard displays. Create a flexible dashboard mechanism that supports a subset of JFreeChart chart types (dials, pies, thermometers, bars, and lines/time series) that can be delivered easily via both Java Web Start and an applet.
6.8.4. Property Editors
The property editor mechanism in JFreeChart only handles a small subset of the properties that can be set for charts. Extend (or reimplement) this mechanism to provide greater end-user control over the appearance of the charts.
CHAPTER 7
7.0 APPENDIX
7.1 SAMPLE SCREEN SHOTS:
7.2 SAMPLE SOURCE CODE:
CHAPTER 8
8.1 CONCLUSION AND FUTURE WORK:
We have proposed, implemented and evaluated an initiative data prefetching approach on the storage servers for distributed file systems, which can be employed as a backend storage system in a cloud environment that may have certain resource-limited client machines. To be specific, the storage servers are capable of predicting future disk I/O access to guide fetching data in advance after analyzing the existing logs, and then they proactively push the prefetched data to relevant client file systems for satisfying future applications’ requests.
For the purpose of effectively modeling disk I/O access patterns and accurately forwarding the prefetched data, the information about client file systems is piggybacked onto relevant I/O requests and then transferred from client nodes to the corresponding storage server nodes. Therefore, the client file systems running on the client nodes neither log I/O events nor conduct I/O access prediction; consequently, the thin client nodes can focus on performing necessary tasks with limited computing capacity and energy endurance.
The initiative prefetching scheme can be applied in the distributed file system for a mobile cloud computing environment, in which there are many tablet computers and smart terminals. The current implementation of our proposed initiative prefetching scheme can classify only two access patterns and support two corresponding prediction algorithms for predicting future disk I/O access. We are planning to work on classifying patterns for a wider range of application benchmarks in the future by utilizing the horizontal visibility graph technique. Applying network-delay-aware replica selection techniques for reducing network transfer time when prefetching data among several replicas is another task in our future work.
Passive IP Traceback: Disclosing the Locations of IP Spoofers From Path Backscatter
It has long been known that attackers may use forged source IP addresses to conceal their real locations. To capture the spoofers, a number of IP traceback mechanisms have been proposed. However, due to the challenges of deployment, there has not been a widely adopted IP traceback solution, at least at the Internet level. As a result, the mist on the locations of spoofers has never been dissipated till now.
This paper proposes passive IP traceback (PIT), which bypasses the deployment difficulties of IP traceback techniques. PIT investigates Internet Control Message Protocol error messages (named path backscatter) triggered by spoofing traffic, and tracks the spoofers based on publicly available information (e.g., topology). In this way, PIT can find the spoofers without any deployment requirement.
This paper illustrates the causes, collection, and the statistical results on path backscatter, demonstrates the processes and effectiveness of PIT, and shows the captured locations of spoofers through applying PIT on the path backscatter data set.
These results can help further reveal IP spoofing, which has been studied for long but never well understood. Though PIT cannot work in all spoofing attacks, it may be the most useful mechanism to trace spoofers before an Internet-level traceback system has been deployed in practice.
1.2 INTRODUCTION
IP spoofing, which means attackers launching attacks with forged source IP addresses, has long been recognized as a serious security problem on the Internet. By using addresses that are assigned to others or not assigned at all, attackers can avoid exposing their real locations, enhance the effect of attacking, or launch reflection-based attacks. A number of notorious attacks rely on IP spoofing, including SYN flooding, SMURF, DNS amplification, etc. A DNS amplification attack which severely degraded the service of a Top Level Domain (TLD) name server has been reported. Though there has been a popular conventional wisdom that DoS attacks are launched from botnets and spoofing is no longer critical, the report of ARBOR at the NANOG 50th meeting shows spoofing is still significant in observed DoS attacks. Indeed, based on the captured backscatter messages from UCSD Network Telescopes, spoofing activities are still frequently observed.
Capturing the origins of IP spoofing traffic is of great importance. As long as the real locations of spoofers are not disclosed, they cannot be deterred from launching further attacks. Even just approaching the spoofers, for example by determining the ASes or networks they reside in, narrows the attackers down to a smaller area, and filters can be placed closer to the attacker before attacking traffic gets aggregated. Last but not least, identifying the origins of spoofing traffic can help build a reputation system for ASes, which would be helpful to push the corresponding ISPs to verify IP source addresses.
Instead of proposing another IP traceback mechanism with improved tracking capability, we propose a novel solution, named Passive IP Traceback (PIT), to bypass the challenges in deployment. Routers may fail to forward an IP spoofing packet due to various reasons, e.g., TTL exceeding. In such cases, the routers may generate an ICMP error message (named path backscatter) and send the message to the spoofed source address. Because the routers can be close to the spoofers, the path backscatter messages may potentially disclose the locations of the spoofers. PIT exploits these path backscatter messages to find the location of the spoofers. With the locations of the spoofers known, the victim can seek help from the corresponding ISP to filter out the attacking packets, or take other counterattacks. PIT is especially useful for the victims in reflection based spoofing attacks, e.g., DNS amplification attacks. The victims can find the locations of the spoofers directly from the attacking traffic.
In this article, at first we illustrate the generation, types, collection, and the security issues of path backscatter messages in section III. Then in section IV, we present PIT, which tracks the location of the spoofers based on path backscatter messages together with the topology and routing information. We discuss how to apply PIT when both topology and routing are known, or only topology is known, or neither are known respectively. We also present two effective algorithms to apply PIT in large scale networks. In the following section, at first we show the statistical results on path backscatter messages. Then we evaluate the two key mechanisms of PIT which work without routing information. At last, we give the tracking result when applying PIT on the path backscatter message dataset: a number of ASes in which spoofers are found.
Our work has the following contributions:
1) This is the first article known which deeply investigates path backscatter messages. These messages are valuable to help understand spoofing activities. Though Moore et al. [8] have exploited backscatter messages, which are generated by the targets of spoofing messages, to study Denial of Service (DoS), path backscatter messages, which are sent by intermediate devices rather than the targets, have not been used in traceback.
2) A practical and effective IP traceback solution based on path backscatter messages, i.e., PIT, is proposed. PIT bypasses the deployment difficulties of existing IP traceback mechanisms and actually is already in force. Given the limitation that path backscatter messages are not generated with stable probability, PIT cannot work in all attacks, but it does work in a number of spoofing activities. At least it may be the most useful traceback mechanism before an AS-level traceback system has been deployed in practice.
3) Through applying PIT on the path backscatter dataset, a number of locations of spoofers are captured and presented. Though this is not a complete list, it is the first known list disclosing the locations of spoofers.
1.3 LITERATURE SURVEY
DEFENSE AGAINST SPOOFED IP TRAFFIC USING HOP-COUNT FILTERING
PUBLICATION: IEEE/ACM Trans. Netw., vol. 15, no. 1, pp. 40–53, Feb. 2007.
AUTHORS: H. Wang, C. Jin, and K. G. Shin
EXPLANATION:
IP spoofing has often been exploited by Distributed Denial of Service (DDoS) attacks to: 1) conceal flooding sources and dilute localities in flooding traffic, and 2) coax legitimate hosts into becoming reflectors, redirecting and amplifying flooding traffic. Thus, the ability to filter spoofed IP packets near victim servers is essential to their own protection and prevention of becoming involuntary DoS reflectors. Although an attacker can forge any field in the IP header, he cannot falsify the number of hops an IP packet takes to reach its destination. More importantly, since the hop-count values are diverse, an attacker cannot randomly spoof IP addresses while maintaining consistent hop-counts. On the other hand, an Internet server can easily infer the hop-count information from the Time-to-Live (TTL) field of the IP header. Using a mapping between IP addresses and their hop-counts, the server can distinguish spoofed IP packets from legitimate ones. Based on this observation, we present a novel filtering technique, called Hop-Count Filtering (HCF), which builds an accurate IP-to-hop-count (IP2HC) mapping table, to detect and discard spoofed IP packets. HCF is easy to deploy, as it does not require any support from the underlying network. Through analysis using network measurement data, we show that HCF can identify close to 90% of spoofed IP packets, and then discard them with little collateral damage. We implement and evaluate HCF in the Linux kernel, demonstrating its effectiveness with experimental measurements.
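A rough Java sketch of the core check described above follows. It is not the paper's implementation: the set of assumed initial TTL values, the way the IP2HC table is populated, and the sample addresses are all illustrative assumptions.

import java.util.HashMap;
import java.util.Map;

public class HopCountFilter {
    // Commonly assumed initial TTL values of popular operating systems (illustrative)
    private static final int[] INITIAL_TTLS = {30, 32, 60, 64, 128, 255};

    // IP-to-hop-count (IP2HC) table; in practice built from measured legitimate traffic
    private final Map<String, Integer> ip2hc = new HashMap<String, Integer>();

    void learn(String sourceIp, int hopCount) {
        ip2hc.put(sourceIp, hopCount);
    }

    // Infer the hop count of a received packet from its final TTL:
    // pick the smallest assumed initial TTL not below the observed value.
    static int inferHopCount(int finalTtl) {
        for (int initial : INITIAL_TTLS) {
            if (finalTtl <= initial) {
                return initial - finalTtl;
            }
        }
        return 0;   // unreachable for valid TTL values (0..255)
    }

    // A packet is flagged as possibly spoofed when its inferred hop count disagrees with the table.
    boolean looksSpoofed(String sourceIp, int finalTtl) {
        Integer expected = ip2hc.get(sourceIp);
        return expected != null && expected != inferHopCount(finalTtl);
    }

    public static void main(String[] args) {
        HopCountFilter hcf = new HopCountFilter();
        hcf.learn("10.0.0.7", 3);                            // hop count learned from legitimate traffic
        System.out.println(hcf.looksSpoofed("10.0.0.7", 61));  // false: inferred hop count 3 matches
        System.out.println(hcf.looksSpoofed("10.0.0.7", 108)); // true: inferred hop count 20 differs
    }
}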
DYNAMIC PROBABILISTIC PACKET MARKING FOR EFFICIENT IP TRACEBACK
PUBLICATION: Comput. Netw., vol. 51, no. 3, pp. 866–882, 2007.
AUTHORS: J. Liu, Z.-J. Lee, and Y.-C. Chung
EXPLANATION:
Recently, denial-of-service (DoS) attack has become a pressing problem due to the lack of an efficient method to locate the real attackers and ease of launching an attack with readily available source codes on the Internet. Traceback is a subtle scheme to tackle DoS attacks. Probabilistic packet marking (PPM) is a new way for practical IP traceback. Although PPM enables a victim to pinpoint the attacker’s origin to within 2–5 equally possible sites, it has been shown that PPM suffers from uncertainty under spoofed marking attack. Furthermore, the uncertainty factor can be amplified significantly under distributed DoS attack, which may diminish the effectiveness of PPM. In this work, we present a new approach, called dynamic probabilistic packet marking (DPPM), to further improve the effectiveness of PPM. Instead of using a fixed marking probability, we propose to deduce the traveling distance of a packet and then choose a proper marking probability. DPPM may completely remove uncertainty and enable victims to precisely pinpoint the attacking origin even under spoofed marking DoS attacks. DPPM supports incremental deployment. Formal analysis indicates that DPPM outperforms PPM in most aspects.
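The key idea above is to replace PPM's fixed marking probability with one derived from the distance the packet has already traveled. The short Java sketch below illustrates that idea only; the assumed initial TTL, the choice of probability 1/d, and the sample values are assumptions for illustration, not the paper's exact scheme.

import java.util.Random;

public class DynamicMarkingSketch {
    private static final Random RNG = new Random();

    // Deduce how many hops the packet has already traveled, assuming a known
    // initial TTL (an illustrative assumption; a real router must infer this).
    static int distanceTraveled(int initialTtl, int currentTtl) {
        return initialTtl - currentTtl;
    }

    // PPM marks with a fixed probability; here the probability 1/d is used,
    // where d is the distance the packet has traveled so far.
    static boolean shouldMark(int distanceTraveled) {
        int d = Math.max(1, distanceTraveled);
        return RNG.nextDouble() < 1.0 / d;
    }

    public static void main(String[] args) {
        int initialTtl = 64;          // hypothetical initial TTL
        int currentTtl = 60;          // packet observed 4 hops from its source
        int d = distanceTraveled(initialTtl, currentTtl);
        System.out.println("Mark this packet? " + shouldMark(d));
    }
}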
FLEXIBLE DETERMINISTIC PACKET MARKING: AN IP TRACEBACK SYSTEM TO FIND THE REAL SOURCE OF ATTACKS
PUBLICATION: IEEE Trans. Parallel Distrib. Syst., vol. 20, no. 4, pp. 567–580, Apr. 2009.
AUTHORS: Y. Xiang, W. Zhou, and M. Guo
EXPLANATION:
IP traceback is the enabling technology to control Internet crime. In this paper we present a novel and practical IP traceback system called Flexible Deterministic Packet Marking (FDPM) which provides a defense system with the ability to find out the real sources of attacking packets that traverse the network. While a number of other traceback schemes exist, FDPM provides innovative features to trace the source of IP packets and can obtain better tracing capability than others. In particular, FDPM adopts a flexible mark length strategy to make it compatible with different network environments; it also adaptively changes its marking rate according to the load of the participating router by a flexible flow-based marking scheme. Evaluations on both simulation and real system implementation demonstrate that FDPM requires a moderately small number of packets to complete the traceback process, adds little additional load to routers, and can trace a large number of sources in one traceback process with low false positive rates. The built-in overload prevention mechanism makes this system capable of achieving a satisfactory traceback result even when the router is heavily loaded. It has been used not only to trace DDoS attacking packets but also to enhance the filtering of attacking traffic.
CHAPTER 2
2.0 SYSTEM ANALYSIS
2.1 EXISTING SYSTEM:
In existing IP marking approaches, routers probabilistically write some encoding of partial path information into the packets during forwarding. A basic technique, the edge sampling algorithm, is to write edge information into the packets. This scheme reserves two static fields of the size of an IP address, start and end, and a static distance field in each packet. Each router updates these fields as follows. Each router marks the packet with a probability. When the router decides to mark the packet, it writes its own IP address into the start field and writes zero into the distance field. Otherwise, if the distance field is already zero, which indicates that the previous router marked the packet, it writes its own IP address into the end field, thus representing the edge between itself and the previous router.
If the router does not mark the packet, then it always increments the distance field. Thus the distance field in the packet indicates the number of routers the packet has traversed from the router which marked the packet to the victim. The distance field should be updated using a saturating addition, meaning that the distance field is not allowed to wrap. The mandatory increment of the distance field is used to avoid spoofing by an attacker. Using such a scheme, any packet written by the attacker will have a distance field greater than or equal to the length of the real attack path. We call a router a false positive if it is in the reconstructed attack graph but not in the real attack graph. Similarly, we call a router a false negative if it is in the true attack graph but not in the reconstructed attack graph. We call a solution to the IP traceback problem robust if it has very low rates of false negatives and false positives.
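The per-router marking step of the edge sampling scheme described above can be sketched in Java as follows. The marking probability and the saturation limit are illustrative placeholders rather than values taken from the existing system.

import java.util.Random;

public class EdgeSamplingSketch {
    // The three marking fields carried in each packet, as described above
    static class Marking {
        String start;       // IP address of the router that began the edge
        String end;         // IP address of the next marking router
        int distance;       // hops traversed since the packet was last marked
    }

    private static final Random RNG = new Random();
    private static final double MARKING_PROBABILITY = 0.04;    // illustrative value
    private static final int MAX_DISTANCE = 255;                // saturating addition limit

    // Executed by a router for every forwarded packet
    static void mark(Marking m, String routerAddress) {
        if (RNG.nextDouble() < MARKING_PROBABILITY) {
            m.start = routerAddress;    // start a new edge at this router
            m.distance = 0;
        } else {
            if (m.distance == 0) {
                m.end = routerAddress;  // close the edge begun by the previous router
            }
            // Mandatory, saturating increment prevents an attacker from forging short distances
            if (m.distance < MAX_DISTANCE) {
                m.distance++;
            }
        }
    }
}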
2.1.1 DISADVANTAGES:
- Existing approach has a very high computation overhead for the victim to reconstruct the attack paths, and gives a large number of false positives when the denial-of-service attack originates from multiple attackers.
- Existing approach can require days of computation to reconstruct the attack paths and give thousands of false positives even when there are only 25 distributed attackers. This approach is also vulnerable to compromised routers.
- If a router is compromised, it can forge markings from other uncompromised routers and hence lead the reconstruction to wrong results. Even worse, the victim will not be able to tell that a router is compromised just from the information in the packets it receives.
2.2 PROPOSED SYSTEM:
We propose a novel solution, named Passive IP Traceback (PIT), to bypass the challenges in deployment. Routers may fail to forward an IP spoofing packet due to various reasons, e.g., TTL exceeding. In such cases, the routers may generate an ICMP error message (named path backscatter) and send the message to the spoofed source address. Because the routers can be close to the spoofers, the path backscatter messages may potentially disclose the locations of the spoofers. PIT exploits these path backscatter messages to find the location of the spoofers. With the locations of the spoofers known, the victim can seek help from the corresponding ISP to filter out the attacking packets, or take other counterattacks. PIT is especially useful for the victims in reflection based spoofing attacks, e.g., DNS amplification attacks. The victims can find the locations of the spoofers directly from the attacking traffic.
We present PIT, which tracks the location of the spoofers based on path backscatter messages together with the topology and routing information. We discuss how to apply PIT when both topology and routing are known, or only topology is known, or neither are known respectively. We also present two effective algorithms to apply PIT in large scale networks. In the following section, at first we show the statistical results on path backscatter messages. Then we evaluate the two key mechanisms of PIT which work without routing information. At last, we give the tracking result when applying PIT on the path backscatter message dataset: a number of ASes in which spoofers are found.
2.2.1 ADVANTAGES:
1) This is the first article known which deeply investigates path backscatter messages. These messages are valuable to help understand spoofing activities. Though prior work has exploited backscatter messages, which are generated by the targets of spoofing messages, to study Denial of Service (DoS), path backscatter messages, which are sent by intermediate devices rather than the targets, have not been used in traceback.
2) A practical and effective IP traceback solution based on path backscatter messages, i.e., PIT, is proposed. PIT bypasses the deployment difficulties of existing IP traceback mechanisms and actually is already in force. Given the limitation that path backscatter messages are not generated with stable probability, PIT cannot work in all attacks, but it does work in a number of spoofing activities. At least it may be the most useful traceback mechanism before an AS-level traceback system has been deployed in practice.
3) Through applying PIT on the path backscatter dataset, a number of locations of spoofers are captured and presented. Though this is not a complete list, it is the first known list disclosing the locations of spoofers.
2.3 HARDWARE & SOFTWARE REQUIREMENTS:
2.3.1 HARDWARE REQUIREMENT:
- Processor – Pentium IV
- Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Key Board – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA
2.3.2 SOFTWARE REQUIREMENTS:
- Operating System : Windows XP or Win7
- Front End : JAVA JDK 1.7
- Document : MS-Office 2007
CHAPTER 3
3.0 SYSTEM DESIGN:
Data Flow Diagram / Use Case Diagram / Flow Diagram:
- The DFD is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on these data, and the output data generated by the system.
- The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components. These components are the system process, the data used by the process, an external entity that interacts with the system and the information flows in the system.
- DFD shows how the information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.
- DFD is also known as bubble chart. A DFD may be used to represent a system at any level of abstraction. DFD may be partitioned into levels that represent increasing information flow and functional detail.
NOTATION:
SOURCE OR DESTINATION OF DATA:
External sources or destinations, which may be people or organizations or other entities
DATA SOURCE:
Here the data referenced by a process is stored and retrieved.
PROCESS:
People, procedures, or devices that produce data; the physical component is not identified.
DATA FLOW:
Data moves in a specific direction from an origin to a destination. The data flow is a “packet” of data.
There are several common modeling rules when creating DFDs:
- All processes must have at least one data flow in and one data flow out.
- All processes should modify the incoming data, producing new forms of outgoing data.
- Each data store must be involved with at least one data flow.
- Each external entity must be involved with at least one data flow.
- A data flow must be attached to at least one process.
3.1 ARCHITECTURE DIAGRAM
3.2 DATAFLOW DIAGRAM:
LEVEL 1
- Base station
- View request
- Router check the node
- Message send via router
LEVEL 2
- Node
- Exists
- Send request
- Receive message
- Check IP Address & check verification node
- Clear Spoofing Attacks
LEVEL 3
- Router
- IP Address
- Router check the each node
- Check verification same/diff node to each data
- Response to node
- Detect Spoofing Origin and send message to original node
3.2.1 UML DIAGRAMS:
3.2.2 USE CASE DIAGRAM:
- Base station
- Router
- Create message
- View request
- Message send via router
- Router check each node
- Check verification same/diff node to each data
- Response to client
- Detect spoofing origin
- Send message
- Node
- Send request
3.2.3 CLASS DIAGRAM:
Node
- IP Address
- Send request
- View message ()
Base station
- IP Address
- View request
- Send message via router
- Socket connection ()
- Send message ()
Router
- IP Address
- Router check the each node
- Detect spoofing ()
- Receive message ()
- Response to node
- Send message ()
3.2.4 SEQUENCE DIAGRAM:
- Connection established
- Send encoded data
- Check verification
- Form routing
- Routing Finished
- Detect Spoofing
- Connection terminate
- Source
- Base station
- Destination
- Establish communication
- Connection established
- Receiving Ack
- Data received
- Routing Success
3.2. ACTIVITY DIAGRAM:
- Node
- Router
- Check
- Check verification same/diff node to each data
- Router check the each node
- Clear jamming and send message to node
- Response to client
- Yes: Start msg receive
- No
- IP Address & View request
- IP Address
- Send request
- Message Received
- Base station
- Message send via router
- Router check the node
CHAPTER 4
4.0 IMPLEMENTATION:
4.1 ALGORITHM
We designed an algorithm, specified in Fig. 6. This algorithm first finds a shortest path from r to od. From the second vertex along the path, it checks whether the removal of that vertex can disconnect r and od. Whenever such a vertex c is found, the vertex is removed from G, and the set containing all the vertices which are still connected with r is just the suspect set.
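The Java sketch below reproduces the idea just outlined rather than the exact pseudocode of Fig. 6: a BFS shortest path from r to od, a walk from the path's second vertex, and a connectivity check after removing each candidate vertex. The graph representation and vertex labels are illustrative assumptions.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

public class SuspectSetSketch {
    // Undirected topology graph: vertex -> neighbours (vertices must be added via addEdge)
    private final Map<String, Set<String>> adj = new HashMap<String, Set<String>>();

    void addEdge(String a, String b) {
        if (!adj.containsKey(a)) adj.put(a, new HashSet<String>());
        if (!adj.containsKey(b)) adj.put(b, new HashSet<String>());
        adj.get(a).add(b);
        adj.get(b).add(a);
    }

    // Vertices reachable from start when 'removed' is excluded from the graph
    private Set<String> reachable(String start, String removed) {
        Set<String> seen = new HashSet<String>();
        Queue<String> queue = new ArrayDeque<String>();
        seen.add(start);
        queue.add(start);
        while (!queue.isEmpty()) {
            String v = queue.poll();
            for (String w : adj.get(v)) {
                if (!w.equals(removed) && seen.add(w)) {
                    queue.add(w);
                }
            }
        }
        return seen;
    }

    // BFS shortest path from r to od (null if no path exists)
    private List<String> shortestPath(String r, String od) {
        Map<String, String> parent = new HashMap<String, String>();
        parent.put(r, r);
        Queue<String> queue = new ArrayDeque<String>();
        queue.add(r);
        while (!queue.isEmpty()) {
            String v = queue.poll();
            if (v.equals(od)) break;
            for (String w : adj.get(v)) {
                if (!parent.containsKey(w)) {
                    parent.put(w, v);
                    queue.add(w);
                }
            }
        }
        if (!parent.containsKey(od)) return null;
        List<String> path = new ArrayList<String>();
        for (String v = od; !v.equals(r); v = parent.get(v)) path.add(v);
        path.add(r);
        Collections.reverse(path);
        return path;
    }

    // Suspect set: vertices still connected to r after removing the first vertex on
    // the shortest path whose removal disconnects r from od (empty if none is found)
    Set<String> suspectSet(String r, String od) {
        List<String> path = shortestPath(r, od);
        if (path == null) return Collections.<String>emptySet();
        for (int i = 1; i < path.size(); i++) {        // start from the second vertex
            String c = path.get(i);
            Set<String> stillReachable = reachable(r, c);
            if (!stillReachable.contains(od)) {
                return stillReachable;                  // removal of c breaks r and od
            }
        }
        return Collections.<String>emptySet();
    }
}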
4.2 MODULES:
NETWORK SECURITY:
DENIAL OF SERVICE (DOS):
PATH BACKSCATTER:
IP SPOOFING METHOD:
IP TRACEBACK METHOD:
4.3 MODULE DESCRIPTION:
NETWORK SECURITY:
Network-accessible decoy resources may be deployed in a network as surveillance and early-warning tools, as they are not normally accessed for legitimate purposes. Techniques used by attackers attempting to compromise these decoy resources are studied during and after an attack to keep an eye on new exploitation techniques. Such analysis may be used to further tighten the security of the actual network being protected by the decoys. Data forwarding can also direct an attacker’s attention away from legitimate servers. A user encourages attackers to spend their time and energy on the decoy server while distracting their attention from the data on the real server. Similar to a decoy server, a decoy user is set up with intentional vulnerabilities. Its purpose is also to invite attacks so that the attacker’s methods can be studied and that information can be used to increase network security.
DENIAL OF SERVICE (DOS):
In computing, a denial-of-service (DoS) attack is an attempt to make a machine or network resource unavailable to its intended users, such as to temporarily or indefinitely interrupt or suspend services of a host connected to the Internet. A distributed denial-of-service (DDoS) is where the attack source is more than one, often thousands of, unique IP addresses. It is analogous to a group of people crowding the entry door or gate to a shop or business, and not letting legitimate parties enter into the shop or business, disrupting normal operations.
Criminal perpetrators of DoS attacks often target sites or services hosted on high-profile web servers such as banks, credit card payment gateways; but motives of revenge, blackmail or activism can be behind other attacks. A denial-of-service attack is characterized by an explicit attempt by attackers to prevent legitimate users of a service from using that service. There are two general forms of DoS attacks: those that crash services and those that flood services.
The most serious attacks are distributed and in many or most cases involve forging of IP sender addresses (IP address spoofing) so that the location of the attacking machines cannot easily be identified, nor can filtering be done based on the source address.
PATH BACKSCATTER:
We presented a preliminary statistical result on path backscatter messages and discussed that it is possible to trace spoofers based on the messages. However, the generation and collection of path backscatter messages are not well investigated, and the traceback mechanisms are not designed. In this article, we make a thorough analysis of path backscatter messages, present the traceback mechanisms, and give the traceback results. Each message contains the source address of the reflecting device and the IP header of the original packet. Thus, from each path backscatter message, we can get 1) the IP address of the reflecting device, which is on the path from the attacker to the destination of the spoofing packet, and 2) the IP address of the original destination of the spoofing packet. The original IP header also contains other valuable information, e.g., the remaining TTL of the spoofing packet. Note that because some network devices may perform address rewriting (e.g., NAT), the original source address and the destination address may be different.
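As a rough illustration of what can be read out of one such message, the Java sketch below extracts the remaining TTL and the original destination address from the IPv4 header quoted inside an ICMP error message. It assumes the caller already has the raw ICMP message bytes and the source address of the ICMP packet itself; the field offsets follow the standard IPv4 and ICMP layouts, and everything else is a placeholder.

public class PathBackscatterParser {
    // Minimal view of the information recoverable from one path backscatter message
    static class Backscatter {
        String reflectingDevice;       // source address of the ICMP error packet itself
        String originalDestination;    // destination of the spoofed packet
        int remainingTtl;              // TTL left in the spoofed packet when the error was generated
    }

    // icmpMessage: the raw ICMP error message (8-byte ICMP header, then the quoted
    // original IPv4 header plus the first 8 bytes of its payload).
    // reflectingSource: source address taken from the outer IP header of the ICMP packet.
    static Backscatter parse(byte[] icmpMessage, String reflectingSource) {
        final int ICMP_HEADER_LEN = 8;                 // the quoted IPv4 header starts here
        int base = ICMP_HEADER_LEN;
        Backscatter b = new Backscatter();
        b.reflectingDevice = reflectingSource;
        b.remainingTtl = icmpMessage[base + 8] & 0xFF;                 // TTL is byte 8 of the IPv4 header
        b.originalDestination = ipToString(icmpMessage, base + 16);    // destination at bytes 16..19
        return b;
    }

    private static String ipToString(byte[] buf, int offset) {
        return (buf[offset] & 0xFF) + "." + (buf[offset + 1] & 0xFF) + "."
                + (buf[offset + 2] & 0xFF) + "." + (buf[offset + 3] & 0xFF);
    }
}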
IP SPOOFING METHOD:
Our tracking mechanisms actually have two limitations. The first is that the network topology and the mapping from addresses to r and od must be known. The second is that the tracking is actually performed based on loose assumptions on paths. Thus, only when path backscatter messages are from a very special vertex, i.e., a stub AS, can the spoofer be accurately located. In this section, we discuss how to break these limitations by using other information contained in path backscatter messages.
We found there are three special types of path backscatter messages which are more useful for tracing spoofers:
1) The path backscatter messages whose original hop count is 0 or 1. Such messages are generated 1 or 2 hops from the spoofers. Very possibly they are from the gateway of the spoofer.
2) The path backscatter messages whose type is ‘Redirect’. Such messages must be from a gateway of the spoofer.
3) The path backscatter messages whose original destination is a private address or unallocated address. Such messages are typically generated by the first DFZ router on the path from the spoofer to the original destination, e.g., the egress router of the AS in which the spoofer resides. Though such path backscatter messages are generated in very special cases, they are not rare. Especially, there are a large number of path backscatter messages whose original destination is a private address.
IP TRACEBACK METHOD:
PIT is very different from any existing traceback mechanism. The main difference is that the generation of a path backscatter message does not occur with a fixed probability. Thus, we separate the evaluation into three parts: the first is the statistical results on path backscatter messages; the second is the evaluation of the traceback mechanisms, considering the uncertainty of path backscatter generation, since the effectiveness of the mechanisms is actually determined by the structural features of the networks; the last is the result of performing the traceback mechanisms on the path backscatter message dataset.
In this article, we proposed Passive IP Traceback (PIT), which tracks spoofers based on path backscatter messages and publicly available information. We illustrate the causes, collection, and statistical results of path backscatter. We specified how to apply PIT when the topology and routing are both known, or the routing is unknown, or neither of them is known. We presented two effective algorithms to apply PIT in large scale networks and proved their correctness. We demonstrated the effectiveness of PIT based on deduction and simulation. We showed the captured locations of spoofers through applying PIT on the path backscatter dataset. These results can help further reveal IP spoofing, which has been studied for long but never well understood.
CHAPTER 5
5.0 SYSTEM STUDY:
5.1 FEASIBILITY STUDY:
The feasibility of the project is analyzed in this phase and business proposal is put forth with a very general plan for the project and some cost estimates. During system analysis the feasibility study of the proposed system is to be carried out. This is to ensure that the proposed system is not a burden to the company. For feasibility analysis, some understanding of the major requirements for the system is essential.
Three key considerations involved in the feasibility analysis are
- ECONOMICAL FEASIBILITY
- TECHNICAL FEASIBILITY
- SOCIAL FEASIBILITY
5.1.1 ECONOMICAL FEASIBILITY:
This study is carried out to check the economic impact that the system will have on the organization. The amount of funds that the company can pour into the research and development of the system is limited. The expenditures must be justified. Thus the developed system is well within the budget, and this was achieved because most of the technologies used are freely available. Only the customized products had to be purchased.
5.1.2 TECHNICAL FEASIBILITY
This study is carried out to check the technical feasibility, that is, the technical requirements of the system. Any system developed must not place a high demand on the available technical resources, as this would lead to high demands being placed on the client. The developed system must have modest requirements, as only minimal or null changes are required for implementing this system.
5.1.3 SOCIAL FEASIBILITY:
The aspect of study is to check the level of acceptance of the system by the user. This includes the process of training the user to use the system efficiently. The user must not feel threatened by the system, instead must accept it as a necessity. The level of acceptance by the users solely depends on the methods that are employed to educate the user about the system and to make him familiar with it. His level of confidence must be raised so that he is also able to make some constructive criticism, which is welcomed, as he is the final user of the system.
5.2 SYSTEM TESTING:
Testing is a process of checking whether the developed system is working according to the original objectives and requirements. It is a set of activities that can be planned in advance and conducted systematically. Testing is vital to the success of the system. System testing makes a logical assumption that if all the parts of the system are correct, the goal will be successfully achieved. Inadequate or absent testing leads to errors that may not appear until many months later.
This creates two problems: the time lag between the cause and the appearance of the problem, and the effect of system errors on the files and records within the system. A small system error can conceivably explode into a much larger problem. Effective testing early in the process translates directly into long-term cost savings from a reduced number of errors. Another reason for system testing is its utility as a user-oriented vehicle before implementation. The best program is worthless if it does not produce the correct outputs.
5.2.1 UNIT TESTING:
Description | Expected result |
Test for application window properties. | All the properties of the windows are to be properly aligned and displayed. |
Test for mouse operations. | All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions. |
A program represents the logical elements of a system. For a program to run satisfactorily, it must compile and test data correctly and tie in properly with other programs. Achieving an error-free program is the responsibility of the programmer. Program testing checks for two types of errors: syntax and logical. A syntax error is a program statement that violates one or more rules of the language in which it is written. An improperly defined field dimension or omitted keywords are common syntax errors. These errors are shown through error messages generated by the computer. For logic errors, the programmer must examine the output carefully.
5.1.2 FUNCTIONAL TESTING:
Functional testing of an application is used to prove that the application delivers correct results, using enough inputs to give an adequate level of confidence that it will work correctly for all sets of inputs. The functional testing will need to prove that the application works for each client type and that the personalization functions work correctly. When a program is tested, the actual output is compared with the expected output. When there is a discrepancy, the sequence of instructions must be traced to determine the problem. The process is facilitated by breaking the program into self-contained portions, each of which can be checked at certain key points. The idea is to compare program values against desk-calculated values to isolate the problems.
Description | Expected result |
Test for all modules. | All peers should communicate in the group. |
Test for various peer in a distributed network framework as it display all users available in the group. | The result after execution should give the accurate result. |
5.1.3 NON-FUNCTIONAL TESTING:
The Non Functional software testing encompasses a rich spectrum of testing strategies, describing the expected results for every test case. It uses symbolic analysis techniques. This testing used to check that an application will work in the operational environment. Non-functional testing includes:
- Load testing
- Performance testing
- Usability testing
- Reliability testing
- Security testing
5.1.4 LOAD TESTING:
An important tool for implementing system tests is a Load generator. A Load generator is essential for testing quality requirements such as performance and stress. A load can be a real load, that is, the system can be put under test to real usage by having actual telephone users connected to it. They will generate test input data for system test.
Description | Expected result |
It is necessary to ascertain that the application behaves correctly under loads when ‘Server busy’ response is received. | Should designate another active node as a Server. |
5.1.5 PERFORMANCE TESTING:
Performance tests are utilized in order to determine the widely defined performance of the software system such as execution time associated with various parts of the code, response time and device utilization. The intent of this testing is to identify weak points of the software system and quantify its shortcomings.
Description | Expected result |
This is required to assure that the application performs adequately, having the capability to handle many peers, delivering its results in the expected time and using an acceptable level of resources; this is an aspect of operational management. | Should handle large input values and produce an accurate result in the expected time. |
5.1.6 RELIABILITY TESTING:
Software reliability is the ability of a system or component to perform its required functions under stated conditions for a specified period of time, and it is ensured in this testing. Reliability can be expressed as the ability of the software to reveal defects under testing conditions, according to the specified requirements. It is the probability that a software system will operate without failure under given conditions for a given time interval, and it focuses on the behavior of the software element. It forms a part of the software quality control activities.
Description | Expected result |
This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in providing the application. | In case of failure of the server, an alternate server should take over the job. |
5.1.7 SECURITY TESTING:
Security testing evaluates system characteristics that relate to the availability, integrity and confidentiality of the system data and services. Users/Clients should be encouraged to make sure their security needs are very clearly known at requirements time, so that the security issues can be addressed by the designers and testers.
Description | Expected result |
Checking that the user identification is authenticated. | In case failure it should not be connected in the framework. |
Check whether group keys in a tree are shared by all peers. | The peers should know group key in the same group. |
5.1.8 WHITE BOX TESTING:
White box testing, sometimes called glass-box testing is a test case design method that uses the control structure of the procedural design to derive test cases. Using white box testing method, the software engineer can derive test cases. The White box testing focuses on the inner structure of the software structure to be tested.
Description | Expected result |
Exercise all logical decisions on their true and false sides. | All the logical decisions must be valid. |
Execute all loops at their boundaries and within their operational bounds. | All the loops must be finite. |
Exercise internal data structures to ensure their validity. | All the data structures must be valid. |
5.1.9 BLACK BOX TESTING:
Black box testing, also called behavioral testing, focuses on the functional requirements of the software. That is, black box testing enables the software engineer to derive sets of input conditions that will fully exercise all functional requirements for a program. Black box testing is not an alternative to white box techniques. Rather, it is a complementary approach that is likely to uncover a different class of errors than white box methods. Black box testing attempts to find errors by focusing on the inputs, outputs, and principal functions of a software module. The starting point of black box testing is either a specification or code. The contents of the box are hidden, and the stimulated software should produce the desired results.
Description | Expected result |
To check for incorrect or missing functions. | All the functions must be valid. |
To check for interface errors. | The entire interface must function normally. |
To check for errors in data structures or external database access. | The database update and retrieval must be done. |
To check for initialization and termination errors. | All the functions and data structures must be initialized properly and terminated normally. |
All the above system testing strategies are carried out, as the development, documentation, and institutionalization of the proposed goals and related policies are essential.
CHAPTER 6
6.0 SOFTWARE DESCRIPTION:
6.1 JAVA TECHNOLOGY:
Java technology is both a programming language and a platform.
The Java Programming Language
The Java programming language is a high-level language that can be characterized by all of the following buzzwords:
- Simple
- Architecture neutral
- Object oriented
- Portable
- Distributed
- High performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure
With most programming languages, you either compile or interpret a program so that you can run it on your computer. The Java programming language is unusual in that a program is both compiled and interpreted. With the compiler, first you translate a program into an intermediate language called Java byte codes —the platform-independent codes interpreted by the interpreter on the Java platform. The interpreter parses and runs each Java byte code instruction on the computer. Compilation happens just once; interpretation occurs each time the program is executed. The following figure illustrates how this works.
You can think of Java byte codes as the machine code instructions for the Java Virtual Machine (Java VM). Every Java interpreter, whether it’s a development tool or a Web browser that can run applets, is an implementation of the Java VM. Java byte codes help make “write once, run anywhere” possible. You can compile your program into byte codes on any platform that has a Java compiler. The byte codes can then be run on any implementation of the Java VM. That means that as long as a computer has a Java VM, the same program written in the Java programming language can run on Windows 2000, a Solaris workstation, or on an iMac.
6.2 THE JAVA PLATFORM:
A platform is the hardware or software environment in which a program runs. We’ve already mentioned some of the most popular platforms like Windows 2000, Linux, Solaris, and MacOS. Most platforms can be described as a combination of the operating system and hardware. The Java platform differs from most other platforms in that it’s a software-only platform that runs on top of other hardware-based platforms.
The Java platform has two components:
- The Java Virtual Machine (Java VM)
- The Java Application Programming Interface (Java API)
You’ve already been introduced to the Java VM. It’s the base for the Java platform and is ported onto various hardware-based platforms.
The Java API is a large collection of ready-made software components that provide many useful capabilities, such as graphical user interface (GUI) widgets. The Java API is grouped into libraries of related classes and interfaces; these libraries are known as packages. The next section, What Can Java Technology Do?, highlights what functionality some of the packages in the Java API provide.
The following figure depicts a program that’s running on the Java platform. As the figure shows, the Java API and the virtual machine insulate the program from the hardware.
Native code is code that, after you compile it, runs on a specific hardware platform. As a platform-independent environment, the Java platform can be a bit slower than native code. However, smart compilers, well-tuned interpreters, and just-in-time byte code compilers can bring performance close to that of native code without threatening portability.
6.3 WHAT CAN JAVA TECHNOLOGY DO?
The most common types of programs written in the Java programming language are applets and applications. If you’ve surfed the Web, you’re probably already familiar with applets. An applet is a program that adheres to certain conventions that allow it to run within a Java-enabled browser.
However, the Java programming language is not just for writing cute, entertaining applets for the Web. The general-purpose, high-level Java programming language is also a powerful software platform. Using the generous API, you can write many types of programs.
An application is a standalone program that runs directly on the Java platform. A special kind of application known as a server serves and supports clients on a network. Examples of servers are Web servers, proxy servers, mail servers, and print servers. Another specialized program is a servlet.
A servlet can almost be thought of as an applet that runs on the server side. Java Servlets are a popular choice for building interactive web applications, replacing the use of CGI scripts. Servlets are similar to applets in that they are runtime extensions of applications. Instead of working in browsers, though, servlets run within Java Web servers, configuring or tailoring the server.
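As a hedged illustration of the idea (the class name and response text are invented, and the sketch assumes the standard javax.servlet API of that era is available on the server), a minimal servlet looks roughly like this:

import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// A minimal servlet: the Java Web server calls doGet() once per client request.
public class HelloServlet extends HttpServlet {
    public void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        response.setContentType("text/html");
        PrintWriter out = response.getWriter();
        out.println("<html><body>Hello from a servlet</body></html>");
    }
}

Unlike a CGI script, the same servlet instance typically stays loaded in the Web server and handles many requests.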
How does the API support all these kinds of programs? It does so with packages of software components that provide a wide range of functionality. Every full implementation of the Java platform gives you the following features:
- The essentials: Objects, strings, threads, numbers, input and output, data structures, system properties, date and time, and so on.
- Applets: The set of conventions used by applets.
- Networking: URLs, TCP (Transmission Control Protocol) and UDP (User Datagram Protocol) sockets, and IP (Internet Protocol) addresses.
- Internationalization: Help for writing programs that can be localized for users worldwide. Programs can automatically adapt to specific locales and be displayed in the appropriate language.
- Security: Both low level and high level, including electronic signatures, public and private key management, access control, and certificates.
- Software components: Known as JavaBeans™, these can plug into existing component architectures.
- Object serialization: Allows lightweight persistence and communication via Remote Method Invocation (RMI).
- Java Database Connectivity (JDBC™): Provides uniform access to a wide range of relational databases.
The Java platform also has APIs for 2D and 3D graphics, accessibility, servers, collaboration, telephony, speech, animation, and more. The following figure depicts what is included in the Java 2 SDK.
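To make one of these features concrete, the following sketch uses object serialization for the lightweight persistence mentioned above; the file name and the serialized value are made up for illustration:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.Date;

public class PersistenceDemo {
    public static void main(String[] args) throws Exception {
        // Write an object (and everything it references) to a file.
        ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream("stamp.ser"));
        out.writeObject(new Date());
        out.close();

        // Read it back later; RMI uses the same mechanism to pass objects over the network.
        ObjectInputStream in = new ObjectInputStream(new FileInputStream("stamp.ser"));
        Date restored = (Date) in.readObject();
        in.close();
        System.out.println("Restored: " + restored);
    }
}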
6.4 HOW WILL JAVA TECHNOLOGY CHANGE MY LIFE?
We can’t promise you fame, fortune, or even a job if you learn the Java programming language. Still, it is likely to make your programs better and to require less effort than other languages. We believe that Java technology will help you do the following:
- Get started quickly: Although the Java programming language is a powerful object-oriented language, it’s easy to learn, especially for programmers already familiar with C or C++.
- Write less code: Comparisons of program metrics (class counts, method counts, and so on) suggest that a program written in the Java programming language can be four times smaller than the same program in C++.
- Write better code: The Java programming language encourages good coding practices, and its garbage collection helps you avoid memory leaks. Its object orientation, its JavaBeans component architecture, and its wide-ranging, easily extendible API let you reuse other people’s tested code and introduce fewer bugs.
- Develop programs more quickly: Your development time may be as little as half of what it would take to write the same program in C++. Why? You write fewer lines of code, and Java is a simpler programming language than C++.
- Avoid platform dependencies with 100% Pure Java: You can keep your program portable by avoiding the use of libraries written in other languages. The 100% Pure Java™ Product Certification Program has a repository of historical process manuals, white papers, brochures, and similar materials online.
- Write once, run anywhere: Because 100% Pure Java programs are compiled into machine-independent byte codes, they run consistently on any Java platform.
- Distribute software more easily: You can upgrade applets easily from a central server. Applets take advantage of the feature of allowing new classes to be loaded “on the fly,” without recompiling the entire program.
6.5 ODBC:
Microsoft Open Database Connectivity (ODBC) is a standard programming interface for application developers and database systems providers. Before ODBC became a de facto standard for Windows programs to interface with database systems, programmers had to use proprietary languages for each database they wanted to connect to. Now, ODBC has made the choice of the database system almost irrelevant from a coding perspective, which is as it should be. Application developers have much more important things to worry about than the syntax that is needed to port their program from one database to another when business needs suddenly change.
Through the ODBC Administrator in Control Panel, you can specify the particular database that is associated with a data source that an ODBC application program is written to use. Think of an ODBC data source as a door with a name on it. Each door will lead you to a particular database. For example, the data source named Sales Figures might be a SQL Server database, whereas the Accounts Payable data source could refer to an Access database. The physical database referred to by a data source can reside anywhere on the LAN.
The ODBC system files are not installed on your system by Windows 95. Rather, they are installed when you set up a separate database application, such as SQL Server Client or Visual Basic 4.0. When the ODBC icon is installed in Control Panel, it uses a file called ODBCINST.DLL. It is also possible to administer your ODBC data sources through a stand-alone program called ODBCADM.EXE. There is a 16-bit and a 32-bit version of this program, and each maintains a separate list of ODBC data sources.
From a programming perspective, the beauty of ODBC is that the application can be written to use the same set of function calls to interface with any data source, regardless of the database vendor. The source code of the application doesn’t change whether it talks to Oracle or SQL Server. We only mention these two as an example. There are ODBC drivers available for several dozen popular database systems. Even Excel spreadsheets and plain text files can be turned into data sources. The operating system uses the Registry information written by ODBC Administrator to determine which low-level ODBC drivers are needed to talk to the data source (such as the interface to Oracle or SQL Server). The loading of the ODBC drivers is transparent to the ODBC application program. In a client/server environment, the ODBC API even handles many of the network issues for the application programmer.
The advantages of this scheme are so numerous that you are probably thinking there must be some catch. The only disadvantage of ODBC is that it isn’t as efficient as talking directly to the native database interface. ODBC has had many detractors make the charge that it is too slow. Microsoft has always claimed that the critical factor in performance is the quality of the driver software that is used. In our humble opinion, this is true. The availability of good ODBC drivers has improved a great deal recently. And anyway, the criticism about performance is somewhat analogous to those who said that compilers would never match the speed of pure assembly language. Maybe not, but the compiler (or ODBC) gives you the opportunity to write cleaner programs, which means you finish sooner. Meanwhile, computers get faster every year.
6.6 JDBC:
In an effort to establish an independent database standard API for Java, Sun Microsystems developed Java Database Connectivity, or JDBC. JDBC offers a generic SQL database access mechanism that provides a consistent interface to a variety of RDBMSs. This consistent interface is achieved through the use of “plug-in” database connectivity modules, or drivers. If a database vendor wishes to have JDBC support, he or she must provide the driver for each platform that the database and Java run on.
To gain a wider acceptance of JDBC, Sun based JDBC’s framework on ODBC. As you discovered earlier in this chapter, ODBC has widespread support on a variety of platforms. Basing JDBC on ODBC will allow vendors to bring JDBC drivers to market much faster than developing a completely new connectivity solution.
JDBC was announced in March of 1996. It was released for a 90-day public review that ended June 8, 1996. Because of user input, the final JDBC v1.0 specification was released soon after.
The remainder of this section will cover enough information about JDBC for you to know what it is about and how to use it effectively. This is by no means a complete overview of JDBC. That would fill an entire book.
6.7 JDBC Goals:
Few software packages are designed without goals in mind. JDBC is no exception; its many goals drove the development of the API. These goals, in conjunction with early reviewer feedback, have finalized the JDBC class library into a solid framework for building database applications in Java.
The goals that were set for JDBC are important. They will give you some insight as to why certain classes and functionalities behave the way they do. The design goals for JDBC are as follows:
- SQL Level API
The designers felt that their main goal was to define a SQL interface for Java. Although not the lowest database interface level possible, it is at a low enough level for higher-level tools and APIs to be created. Conversely, it is at a high enough level for application programmers to use it confidently. Attaining this goal allows for future tool vendors to “generate” JDBC code and to hide many of JDBC’s complexities from the end user.
- SQL Conformance
SQL syntax varies as you move from database vendor to database vendor. In an effort to support a wide variety of vendors, JDBC will allow any query statement to be passed through it to the underlying database driver. This allows the connectivity module to handle non-standard functionality in a manner that is suitable for its users.
- JDBC must be implementable on top of common database interfaces
The JDBC SQL API must “sit” on top of other common SQL level APIs. This goal allows JDBC to use existing ODBC level drivers by the use of a software interface. This interface would translate JDBC calls to ODBC and vice versa.
- Provide a Java interface that is consistent with the rest of the Java system
Because of Java’s acceptance in the user community thus far, the designers feel that they should not stray from the current design of the core Java system.
- Keep it simple
This goal probably appears in all software design goal listings. JDBC is no exception. Sun felt that the design of JDBC should be very simple, allowing for only one method of completing a task per mechanism. Allowing duplicate functionality only serves to confuse the users of the API.
- Use strong, static typing wherever possible
Strong typing allows for more error checking to be done at compile time; also, fewer errors appear at runtime.
- Keep the common cases simple
Because more often than not, the usual SQL calls used by the programmer are simple SELECTs, INSERTs, DELETEs, and UPDATEs, these queries should be simple to perform with JDBC. However, more complex SQL statements should also be possible.
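The “keep the common cases simple” goal is visible in a sketch like the one below. The connection URL, data source name, and table and column names are hypothetical (this is only an outline, not the project’s actual database code), and the legacy sun.jdbc.odbc bridge shown here was removed from the JDK after Java 7:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SimpleQuery {
    public static void main(String[] args) throws Exception {
        // Load the legacy JDBC-ODBC bridge driver (hypothetical data source "CacheDSN").
        Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
        Connection con = DriverManager.getConnection("jdbc:odbc:CacheDSN");
        Statement stmt = con.createStatement();

        // The common case: a plain SELECT returning a ResultSet.
        ResultSet rs = stmt.executeQuery("SELECT host, status FROM cache_table");
        while (rs.next()) {
            System.out.println(rs.getString("host") + " -> " + rs.getString("status"));
        }

        rs.close();
        stmt.close();
        con.close();
    }
}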
Finally, we decided to proceed with the implementation using Java Networking. And for dynamically updating the cache table, we use an MS Access database.
Java technology has two components: a programming language and a platform.
Java is a high-level programming language that is all of the following: simple, architecture-neutral, object-oriented, portable, distributed, high-performance, interpreted, multithreaded, robust, dynamic, and secure.
Java is also unusual in that each Java program is both compiled and interpreted. With the compiler, a Java program is translated into an intermediate language called Java byte codes; this platform-independent code is then parsed and run by the interpreter on the computer. Compilation happens just once; interpretation occurs each time the program is executed. The figure illustrates how this works.
[Figure: Java Program → Compiler → Interpreter → My Program]
6.8 NETWORKING TCP/IP STACK:
The TCP/IP stack is shorter than the OSI one:
TCP is a connection-oriented protocol; UDP (User Datagram Protocol) is a connectionless protocol.
IP datagrams:
The IP layer provides a connectionless and unreliable delivery system. It considers each datagram independently of the others. Any association between datagrams must be supplied by the higher layers. The IP layer supplies a checksum that includes its own header. The header includes the source and destination addresses. The IP layer handles routing through an Internet. It is also responsible for breaking up large datagrams into smaller ones for transmission and reassembling them at the other end.
UDP:
UDP is also connectionless and unreliable. What it adds to IP is a checksum for the contents of the datagram and port numbers. These are used to provide a client/server model, as described later.
TCP:
TCP supplies logic to give a reliable connection-oriented protocol above IP. It provides a virtual circuit that two processes can use to communicate.
Internet addresses:
In order to use a service, you must be able to find it. The Internet uses an address scheme for machines so that they can be located. The address is a 32-bit integer which gives the IP address.
Network address:
Class A uses 8 bits for the network address, with 24 bits left over for other addressing. Class B uses 16-bit network addressing. Class C uses 24-bit network addressing, and Class D addresses are reserved for multicast.
Subnet address:
Internally, the UNIX network is divided into subnetworks. Building 11 is currently on one subnetwork and uses 10-bit addressing, allowing 1024 different hosts.
Host address:
8 bits are finally used for host addresses within our subnet. This places a limit of 256 machines that can be on the subnet.
Total address:
The 32-bit address is usually written as four integers separated by dots (for example, 192.0.2.10).
Port addresses:
A service exists on a host and is identified by its port. This is a 16-bit number. To send a message to a server, you send it to the port for that service on the host that it is running on. This is not location transparency! Certain of these ports are “well known”.
Sockets:
A socket is a data structure maintained by the system to handle network connections. A socket is created using the socket call. It returns an integer that is like a file descriptor. In fact, under Windows, this handle can be used with the ReadFile and WriteFile functions.
#include <sys/types.h>
#include <sys/socket.h>
int socket(int family, int type, int protocol);
Here “family” will be AF_INET for IP communications, protocol will be zero, and type will depend on whether TCP or UDP is used. Two processes wishing to communicate over a network create a socket each. These are similar to two ends of a pipe – but the actual pipe does not yet exist.
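Because the project itself uses Java Networking rather than the C interface shown above, the following hedged sketch shows the same two-ends-of-a-pipe idea with the java.net classes; the port number and message are arbitrary examples:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

public class EchoPair {
    public static void main(String[] args) throws Exception {
        // One end of the "pipe": a TCP server listening on an arbitrary port.
        ServerSocket server = new ServerSocket(5000);

        // The other end: a client connecting to that port on the local host.
        Socket client = new Socket("localhost", 5000);
        Socket accepted = server.accept();

        // Send a line from the client and read it on the server side.
        PrintWriter out = new PrintWriter(client.getOutputStream(), true);
        out.println("hello over TCP");
        BufferedReader in = new BufferedReader(new InputStreamReader(accepted.getInputStream()));
        System.out.println("server received: " + in.readLine());

        client.close();
        accepted.close();
        server.close();
    }
}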
6.9 JFREE CHART:
JFreeChart is a free 100% Java chart library that makes it easy for developers to display professional quality charts in their applications. JFreeChart’s extensive feature set includes:
A consistent and well-documented API, supporting a wide range of chart types;
A flexible design that is easy to extend, and targets both server-side and client-side applications;
Support for many output types, including Swing components, image files (including PNG and JPEG), and vector graphics file formats (including PDF, EPS and SVG);
JFreeChart is “open source” or, more specifically, free software. It is distributed under the terms of the GNU Lesser General Public Licence (LGPL), which permits use in proprietary applications.
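A minimal usage sketch is shown below; the data values and output file name are invented, and it assumes the JFreeChart 1.0.x API, in which ChartFactory and ChartUtilities live in the org.jfree.chart package:

import java.io.File;
import org.jfree.chart.ChartFactory;
import org.jfree.chart.ChartUtilities;
import org.jfree.chart.JFreeChart;
import org.jfree.data.general.DefaultPieDataset;

public class PieDemo {
    public static void main(String[] args) throws Exception {
        // Fill a simple dataset with made-up values.
        DefaultPieDataset dataset = new DefaultPieDataset();
        dataset.setValue("Hardened hosts", 60);
        dataset.setValue("Vulnerable hosts", 40);

        // Create the chart and write it out as a PNG image file.
        JFreeChart chart = ChartFactory.createPieChart("Host status", dataset, true, true, false);
        ChartUtilities.saveChartAsPNG(new File("hosts.png"), chart, 500, 300);
    }
}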
6.9.1. Map Visualizations:
Charts showing values that relate to geographical areas. Some examples include: (a) population density in each state of the United States, (b) income per capita for each country in Europe, (c) life expectancy in each country of the world. The tasks in this project include: Sourcing freely redistributable vector outlines for the countries of the world, states/provinces in particular countries (USA in particular, but also other areas);
Creating an appropriate dataset interface (plus default implementation), a renderer, and integrating this with the existing XYPlot class in JFreeChart; testing, documenting, testing some more, documenting some more.
6.9.2. Time Series Chart Interactivity
Implement a new (to JFreeChart) feature for interactive time series charts — to display a separate control that shows a small version of ALL the time series data, with a sliding “view” rectangle that allows you to select the subset of the time series data to display in the main chart.
6.9.3. Dashboards
There is currently a lot of interest in dashboard displays. Create a flexible dashboard mechanism that supports a subset of JFreeChart chart types (dials, pies, thermometers, bars, and lines/time series) that can be delivered easily via both Java Web Start and an applet.
6.9.4. Property Editors
The property editor mechanism in JFreeChart only handles a small subset of the properties that can be set for charts. Extend (or reimplement) this mechanism to provide greater end-user control over the appearance of the charts.
CHAPTER 7
7.0 APPENDIX
7.1 SAMPLE SCREEN SHOTS:
7.2 SAMPLE SOURCE CODE:
CHAPTER 8
8.1 CONCLUSION:
In this article, we attempted to shed light on the locations of spoofers by investigating path backscatter messages. We proposed Passive IP Traceback (PIT), which tracks spoofers based on path backscatter messages and publicly available information. We illustrated the causes, collection, and statistical results of path backscatter. We specified how to apply PIT when the topology and routing are both known, when only the routing is unknown, or when neither of them is known.
We presented two effective algorithms to apply PIT in large-scale networks and proved their correctness. We demonstrated the effectiveness of PIT based on deduction and simulation. We showed the captured locations of spoofers by applying PIT to the path backscatter dataset. These results can help further reveal IP spoofing, which has been studied for a long time but never well understood.