Secure and Distributed Data Discovery and Dissemination in Wireless Sensor Networks

05/08/201902/07/2019 by admin

—A data discovery and dissemination protocol for wireless sensor networks (WSNs) is responsible for updatingconfiguration parameters of, and distributing management commands to, the sensor nodes. All existing data discovery anddissemination protocols suffer from two drawbacks. First, they are based on the centralized approach; only the base station candistribute data items. Such an approach is not suitable for emergent multi-owner-multi-user WSNs. Second, those protocols werenot designed with security in mind and hence adversaries can easily launch attacks to harm the network. This paper proposes thefirst secure and distributed data discovery and dissemination protocol named DiDrip. It allows the network owners to authorizemultiple network users with different privileges to simultaneously and directly disseminate data items to the sensor nodes.Moreover, as demonstrated by our theoretical analysis, it addresses a number of possible security vulnerabilities that we haveidentified. Extensive security analysis show DiDrip is provably secure. We also implement DiDrip in an experimental network ofresource-limited sensor nodes to show its high efficiency in practice.Index Terms—Distributed data discovery and dissemination, security, wireless sensor networks, efficiencyÇ1 INTRODUCTIONAFTER a wireless sensor network (WSN) is deployed,there is usually a need to update buggy/old small programsor parameters stored in the sensor nodes. This can beachieved by the so-called data discovery and dissemination protocol,which facilitates a source to inject small programs,commands, queries, and configuration parameters to sensornodes. Note that it is different from the code disseminationprotocols (also referred to as data dissemination or reprogrammingprotocols) [1], [2], which distribute large binariesto reprogram the whole network of sensors. For example,efficiently disseminating a binary file of tens of kilobytesrequires a code dissemination protocol while disseminatingseveral 2-byte configuration parameters requires data discoveryand dissemination protocol. Considering the sensornodes could be distributed in a harsh environment,remotely disseminating such small data to the sensor nodesthrough the wireless channel is a more preferred and practicalapproach than manual intervention.In the literature, several data discovery and disseminationprotocols [3], [4], [5], [6] have been proposed for WSNs.Among them, DHV [3], DIP [5] and Drip [4] are regarded asthe state-of-the-art protocols and have been included in theTinyOS distributions. All proposed protocols assume thatthe operating environment of the WSN is trustworthy andhas no adversary. However, in reality, adversaries exist andimpose threats to the normal operation of WSNs [7], [8].This issue has only been addressed recently by [7] whichidentifies the security vulnerabilities of Drip and proposesan effective solution.More importantly, all existing data discovery and disseminationprotocols [3], [4], [5], [6], [7] employ the centralizedapproach in which, as shown in the top sub-figurein Fig. 1, data items can only be disseminated by the basestation. Unfortunately, this approach suffers from the singlepoint of failure as dissemination is impossible whenthe base station is not functioning or when the connectionbetween the base station and a node is broken. In addition,the centralized approach is inefficient, non-scalable, andvulnerable to security attacks that can be launched anywherealong the communication path [2]. Even worse,some WSNs do not have any base station at all. For example,for a WSN monitoring human trafficking in acountry’s border or a WSN deployed in a remote area tomonitor illicit crop cultivation, a base station becomes anattractive target to be attacked. For such networks, datadissemination is better to be carried out by authorized networkusers in a distributed manner.Additionally, distributed data discovery and disseminationis an increasingly relevant matter in WSNs, especiallyin the emergent context of shared sensor networks, wheresensing/communication infrastructures from multiple ownerswill be shared by applications from multiple users. Forexample, large scale sensor networks are built in recent_ D. He is with the School of Computer Science and Engineering, South ChinaUniversity of Technology, Guangzhou 510006, P.R. China, and also withthe College of Computer Science and Technology, Zhejiang University,Hangzhou 310027, P.R. China. E-mail: hedaojinghit@gmail.com._ S. Chan is with the Department of Electronic Engineering, City Universityof Hong Kong, Hong Kong SAR, P. R. China.E-mail: eeschan@cityu.edu.hk._ M. Guizani is with Qatar University, Qatar. E-mail: mguizani@ieee.org._ H. Yang is with the School of Computer Science and Engineering, Universityof Electronic Science and Technology of China, P. R. China.E-mail: haomyang@uestc.edu.cn._ B. Zhou is with the College of Computer Science, Zhejiang University,P. R. China. E-mail: zby@zju.edu.cn.Manuscript received 31 Dec. 2013; revised 29 Mar. 2014; accepted 31 Mar.2014. Date of publication 10 Apr. 2014; date of current version 6 Mar. 2015.recommended for acceptance by V. B. Misic.For information on obtaining reprints of this article, please send e-mail to:reprints@ieee.org, and reference the Digital Object Identifier below.Digital Object Identifier no. 10.1109/TPDS.2014.2316830IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 26, NO. 4, APRIL 2015 11291045-9219 _ 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.projects such as Geoss [9], NOPP [10] and ORION [11].These networks are owned by multiple owners and used byvarious authorized third-party users. Moreover, it isexpected that network owners and different users may havedifferent privileges of dissemination. In this context, distributedoperation by networks owners and users with differentprivileges will be a crucial issue, for which efficient solutionsare still missing.Motivated by the above observations, this paper has thefollowing main contributions:1) The need of distributed data discovery and disseminationprotocols is not completely new, but previouswork did not address this need. We study the functionalrequirements of such protocols, and set theirdesign objectives. Also, we identify the security vulnerabilitiesin previously proposed protocols.2) Based on the design objectives, we propose DiDrip.It is the first distributed data discovery and disseminationprotocol, which allows network owners andauthorized users to disseminate data items intoWSNs without relying on the base station. Moreover,our extensive analysis demonstrates thatDiDrip satisfies the security requirements of theprotocols of its kind. In particular, we apply theprovable security technique to formally provethe authenticity and integrity of the disseminateddata items in DiDrip.3) We demonstrate the efficiency of DiDrip in practiceby implementing it in an experimental WSN withresource-limited sensor nodes. This is also the firstimplementation of a secure and distributed data discoveryand dissemination protocol.The rest of this paper is structured as follows. In Section2, we first survey the existing data discovery anddissemination protocols, and then discuss their securityweaknesses. Section 3 describes the requirements for asecure and distributed extension of such protocols. Section4 presents the network, trust and adversary models.Section 5 describes DiDrip in details. Section 6 providestheoretical analysis of the security properties of DiDrip.Section 7 describes the implementation and experimentalresults of DiDrip via real sensor platforms. Finally,Section 8 concludes this paper.2 SECURITY VULNERABILITIES IN DATADISCOVERY AND DISSEMINATION2.1 Review of Existing ProtocolsThe underlying algorithm of both DIP and Drip is Trickle[12]. Initially, Trickle requires each node to periodicallybroadcast a summary of its stored data. When a node hasreceived an older summary, it sends an update to thatsource. Once all nodes have consistent data, the broadcastinterval is increased exponentially to save energy. However,if a node receives a new summary, it will broadcast thismore quickly. In other words, Trickle can disseminatenewly injected data very quickly. Among the existing protocols,Drip is the simplest one and it runs an independentinstance of Trickle for each data item.In practice, each data item is identified by a unique keyand its freshness is indicated by a version number. Forexample, for Drip, DIP and DHV, each data item is representedby a 3-tuple <key; version; data> , where key is usedto uniquely identify a data item, version indicates the freshnessof the data item (the larger the version, the fresher thedata), and data is the actual disseminated data (e.g., command,query or parameter).2.2 Security VulnerabilitiesAn adversary can first place some intruder nodes in the networkand then use them to alter the data being disseminatedor forge a data item. This may result in some importantparameters being erased or the entire network beingrebooted with wrong data. For example, consider a new dataitem (key, version, data) being disseminated. When anintruder node receives this new data item, it can broadcast amalicious data item (key, version_, data_), where version_ >version. If data_ is set to 0, the parameter identified by key willbe erased from all sensor nodes. Alternatively, if data_ is differentfrom data, all sensor nodes will update the parameteraccording to this forged data item. Note that the aboveattacks can also be launched if an adversary compromisessome nodes and has access to their key materials.In addition, since nodes executing Trickle are required toforward all new data items that they receive, an adversarycan launch denial-of-service (DoS) attacks to sensor nodesby injecting a large amount of bogus data items. As a result,the processing and energy resources of nodes are expendedFig. 1. System overview of centralized and distributed data discovery and dissemination approaches.1130 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 26, NO. 4, APRIL 2015to process and forward these bogus data items, rather thanon the intended functions. Any data discovery and disseminationprotocol based on Trickle or its variants is vulnerableto such a DoS attack.3 REQUIREMENTS AND DESIGN CONSIDERATIONA secure and distributed data discovery and disseminationprotocol should satisfy the following requirements:1) Distributed. Multiple authorized users should beallowed to simultaneously disseminate data itemsinto the WSN without relying on the base station.2) Supporting different user privileges. To provide flexibility,each user may be assigned a certain privilegelevel by the network owner. For example, a user canonly disseminate data items to a set of sensor nodeswith specific identities and/or in a specific localizedarea. Another example is that a user just has the privilegeto disseminate data items identified by somespecific keys.3) Authenticity and integrity of data items. A sensor nodeonly accepts data items disseminated by authorizedusers. Also, a sensor should be able to ensure thatreceived data items have not been modified duringthe dissemination process.4) User accountability. User accountability must be providedsince bad user behaviors and insider attacksshould be audited and pinpointed. That is, a sendershould not be able to deny the distribution of a dataitem. At the same time, an adversary cannot impersonateany legitimate user even if it has compromisedthe network owner or the other legitimateusers. In many applications, accountability is desirableas it enables collection of users’ activities. Forexample, from the dissemination record in sensornodes, the network owner can find out who disseminatesmost data. This requires the sensor nodes to beable to associate each disseminated data with thecorresponding user’s identity.5) Node compromise tolerance. The protocol should beresilient to node compromise attack no matter howmany nodes have been compromised, as long as thesubset of non-compromised nodes can still form aconnected graph with the trusted source.6) User collusion tolerance. Even if an adversary has compromisedsome users, a benign node should notgrant the adversary any privilege level beyond thatof the compromised users.7) DoS attacks resistance. The functions of the WSNshould not be disrupted by DoS attacks.8) Freshness. A node should be able to differentiatewhether an incoming data item is the newestversion.9) Low energy overhead. Most sensor nodes have limitedresources. Thus, it is very important that the securityfunctions incur low energy overhead, which can bedecomposed to communication and computationoverhead.10) Scalability. The protocol should be efficient even forlarge-scale WSNs with thousands of sensors andlarge user population.11) Dynamic participation. New sensor nodes and userscan be dynamically added to the network.In order to ensure security, each step of the existing datadiscovery and dissemination protocol runs should be identifiedand then protected. In other words, although code disseminationprotocols may share the same securityrequirements as listed above, their security solutions needto be designed in accordance with their characteristics. Consideringthe well known open-source code disseminationprotocol Deluge [1] as an example. Deluge uses an epidemicprotocol based on a page-by-page dissemination strategyfor efficient advertisement of metadata. A code image isdivided into fixed-size pages, and each page is further splitinto same-size packets. Due to such a way of decomposingcode images into packets, our proposed protocol is notapplicable for securing Deluge.The primary challenge of providing security functions inWSNs is the limited capabilities of sensor nodes in terms ofcomputation, energy and storage. For example, to provideauthentication function to disseminated data, a commonlyused solution is digital signature. That is, users digitallysign each packet individually and nodes need to verify thesignature before processing it. However, such an asymmetricmechanism incurs significant computational and communicationoverhead and is not applicable to sensor nodes.To address this problem, TESLA and its various extensionshave been proposed [13], [14], which are based on thedelayed disclosure of authentication keys, i.e., the key usedto authenticate a message is disclosed in the next message.Unfortunately, due to the authentication delay, these mechanismsare vulnerable to a flooding attack which causeseach sensor node to buffer all forged data items until thedisclosed key is received.Another possible approach to authentication is by symmetrickey cryptography. However, this approach is vulnerableto node compromise attack because once a node iscompromised, the globally shared secret keys are revealed.Here we choose digital signatures over other forms forupdate packet authentication. That is, the network ownerassigns to each network user a public/private key pair thatallows the user to digitally sign data items and thus authenticateshimself/herself to the sensor nodes. We propose twohybrid approaches to reduce the computation and communicationcost. These methods combine digital signature withefficient data Merkle hash tree and data hash chain, respectively.The main idea is that signature generation and verificationare carried out over multiple packets instead ofindividual packet. In this way, the computation cost perpacket is significantly reduced. Since elliptic curve cryptography(ECC) is computational and communication efficientcompared with the traditional public key cryptography,DiDrip is based on ECC.To prevent the network owner from impersonatingusers, user certificates are issued by a certificate authority ofa public key infrastructure (PKI), e.g., local police office.4 NETWORK, TRUST AND THREAT MODELS4.1 Network ModelAs shown in the bottom subfigure in Fig. 1, a generalWSN comprises a large number of sensor nodes. It isHE ET AL.: SECURE AND DISTRIBUTED DATA DISCOVERY AND DISSEMINATION IN WIRELESS SENSOR NETWORKS 1131administrated by the owner and accessible by many users.The sensor nodes are usually resource constrained withrespect to memory space, computation capability, bandwidth,and power supply. Thus, a sensor node can onlyperform a limited number of public key cryptographicoperations during the lifetime of its battery. The networkusers use some mobile devices to disseminate data itemsinto the network. The network owner is responsible for generatingkeying materials. It can be offline and is assumed tobe uncompromisable.4.2 Trust ModelNetworks users are assigned dissemination privileges bythe trusted authority in a PKI on behalf of the networkowner. However, the network owner may, for various reasons,impersonate network users to disseminate data items.4.3 Threat ModelThe adversary considered in this paper is assumed to becomputationally resourceful and can launch a wide rangeof attacks, which can be classified as external or insiderattacks. In external attacks, the adversary has no control ofany sensor node in the network. Instead, it would eavesdropfor sensitive information, inject forged messages,launch replay attack, wormhole attacks, DoS attacks andimpersonate valid sensor nodes. The communication channelmay also be jammed by the adversary, but this can onlylast for a certain period of time after which the adversarywill be detected and removed.By compromising either network users or sensor nodes,the adversary can launch insider attacks to the network. Thecompromised entities are regarded as insiders because theyare members of the network until they are identified. Theadversary controls these entities to attack the network in arbitraryways. For instance, they could be instructed to disseminatefalse or harmful data, launch attacks such as Sybil attacksor DoS attacks, and be non-cooperative with other nodes.5 DIDRIPReferring to the lower sub-figure in Fig. 1, DiDrip consists offour phases, system initialization, user joining, packet preprocessingand packet verification. For our basic protocol,in system initialization phase, the network owner creates itspublic and private keys, and then loads the public parameterson each node before the network deployment. In theuser joining phase, a user gets the dissemination privilegethrough registering to the network owner. In packet preprocessingphase, if a user enters the network and wants todisseminate some data items, he/she will need to constructthe data dissemination packets and then send them to thenodes. In the packet verification phase, a node verifies eachreceived packet. If the result is positive, it updates the dataaccording to the received packet. In the following, eachphase is described in detail. The notations used in thedescription are listed in Table 1. The information processingflow of DiDrip is illustrated in Fig. 2.5.1 System Initialization PhaseIn this phase, an ECC is set up. The network owner carriesout the following steps to derive a private key x and somepublic parameters fy; Q; p; q; hð:Þg. It selects an elliptic curveE over GFðpÞ, where p is a big prime number. Here Qdenotes the base point ofE while q is also a big prime numberand represents the order of Q. It then selects the private keyx 2 GFðqÞ and computes the public key y ¼ xQ. After that,the public parameters are preloaded in each node of the network.We consider 160-bit ECC as an example. In this case, yandQ are both 320 bits long while p and q are 160 bits long.5.2 User Joining PhaseThis phase is invoked when a user with the identity UIDj,say Uj, hopes to obtain privilege level. User Uj chooses theprivate key SKj 2 GFðqÞ and computes the public keyPKj ¼ SKj_Q. Here the length of UIDj is set to 2 bytes, inthis case, it can support 65,536 users. Similarly, assume that160-bit ECC is used, PKj and SKj are 320 bits and 160 bitslong, respectively. Then user Uj sends a 3-tuple <UIDj;Prij; PKj > to the network owner, where Prij denotes thedissemination privilege of user Uj. Upon receiving this message,the network owner generates the certificate Certj. Aform of the certificate consists of the following contents:Certj ¼ fUIDj; PKj; Prij; SIGxfhðUIDjkPKjkPrijÞg, wherethe length of Prij is set to 6 bytes, thus the length of Certj is88 bytes.TABLE 1NotationsFig. 2. Information processing flow in DiDrip.1132 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 26, NO. 4, APRIL 20155.3 Packet Pre-Processing PhaseAssume that a user, say Uj, enters the WSN and wants todisseminate n data items: di ¼ fkeyi; versioni; dataig, i ¼ 1;2; . . .; n. For the construction of the packets of the respectivedata, we have two methods, i.e., data hash chain and theMerkle hash tree [15].For data hash chain approach, a packet, say Pi is composedof packet header, di, and the hash value of packet Piþ1(i.e., Hiþ1 ¼ hðPiþ1Þ) which is used to verify the next packet,where i ¼ 1; . . .; n _ 1. Here each cryptographic hash Hi iscalculated over the full packet Pi, not just the data portion di,thereby establishing a chain of hashes. After that, user Ujuses his/her private key SKj to run an ECDSA sign operationto sign the hash value of the first data packet hðP1Þ and thencreates an advertisement packet P0, which consists of packetheader, user certificate Certj, hðP1Þ and the signatureSIGSKjfhðP1Þg. Similarly, the network owner assigns a predefinedkey to identify this advertisement packet.With the method of Merkle hash tree, user Uj builds aMerkle hash tree from the n data items in the following way.All the data items are treated as the leaves of the tree. A newset of internal nodes at the upper level is formed; each internalnode is computed as the hash value of the concatenationof two child nodes. This process is continued until the rootnode Hroot is formed, resulting in a Merkle hash tree withdepth D ¼ log2ðnÞ. Before disseminating the n data items,user Uj signs the root node with his/her private key SKj andthen transmits the advertisement packet P0 comprising usercertificate Certj,Hroot and SIGSKjfHrootg. Subsequently, userUj disseminates each data item along with the appropriateinternal nodes for verification purpose. Note that asdescribed above, user certificate Certj contains user identityinformation UIDj and dissemination privilege Prij. Beforethe network deployment, the network owner assigns a predefinedkey to identify this advertisement packet.5.4 Packet Verification PhaseWhen a sensor node, say Sj, receives a packet either from anauthorized user or from its one-hop neighbours, it firstchecks the packet’s key field1) If this is an advertisement packet (P0 fCertj; hðP1Þ;SIGSKjfhðP1Þgg for the data hash chain method whileP0 ¼ fCertj; root; SIGSKjfrootgg for the Merkle hashtree method), node Sj first pays attention to the legalityof the dissemination privilege Prij. For example,node Sj needs to check whether the identity of itself isincluded in the node identity set of Prij. If the resultis positive, node Sj uses the public key y of the networkowner to run an ECDSA verify operation toauthenticate the certificate. If the certificate Certj isvalid, node Sj authenticates the signature. If yes, forthe data hash chain method (respectively, the Merklehash tree method), node Sj stores <UIDj;H1 >(respectively, <UIDj; root>) included in the advertisementpacket; otherwise, node Sj simply discardsthe packet.2) Otherwise, it is a data packet Pi, where i ¼ 1; 2; . . . ;n. Node Sj executes the following procedure:For the data hash chain method, node Sj checks theauthenticity and integrity of Pi by comparing the hashvalue of Pi with Hi which has been received in thesame round and verified. If the result is positive andthe version number is new, node Sj then updates thedata identified by the key stored in Pi and replaces itsstored <round, Hi > by <round, Hiþ1 > (hereHiþ1 isincluded in packet Pi); otherwise, Pi is discarded.For Merkle hash tree method, node Sj checks theauthenticity and integrity of Pi through the alreadyverified root node received in the same round. If theresult is positive and the version number is new, nodeSj then updates the data identified by the key storedin Pi; otherwise, Pi is discarded.Remark: To prevent the network owner from impersonatingusers, system initialization and issue of user certificatescan be carried out by the certificate authority of aPKI rather than the network owner.Comparing the two methods, the data hash chainmethod incurs less communication overhead than the Merklehash tree method. In the data hash chain method, onlyone hash value of a packet is included in each packet. Onthe contrary, in the Merkle hash tree method, D (the treedepth) hash values are included in each packet. However, alimitation of the data hash tree method is that it just workswell in networks with in-sequence packet delivery. Such alimitation does not exist in the Merkle hash tree methodsince it allows each packet to be immediately authenticatedupon its arrival at a node. Therefore, the choice of eachmethod depends on this characteristic of the WSNs.5.5 EnhancementsWe can enhance the efficiency and security of DiDrip byadding additional mechanisms. Readers are referred to theAppendix of this paper for details.6 SECURITY ANALYSIS OF DIDRIPIn the following, we will analyze the security of DiDrip toverify that the security requirements mentioned in Section 3are satisfied.Distributed. As described in Section 5.2, in order to passthe signature verification of sensor nodes, each user has tosubmit his/her private key and dissemination privilege tothe network owner for registration. In addition, as describedabove, authorized users are able to carry out disseminationin a distributed manner.Supporting different user privileges. Activities of networkusers can be restricted by setting user privilege Prij, whichis contained in the user certificate. Since each user certificateis generated based on Prij, it will not pass the signature verificationat sensor nodes if Prij is modified. Thus, only thenetwork owner can modify Prij and then updates the certificateaccordingly.Authenticity and integrity of data items. With the Merklehash tree method (rsp. the data hash chain method), anauthorized user signs the root of the Merkle hash tree (rsp.the hash value of the first data packet hðP1Þ) with his/her privatekey. Using the network owner’s public key, each sensornode can authenticate the user certificate and obtains theuser’s public key. Then, using the user’s public key, eachnode can authenticate the root of the Merkle hash tree (rsp.HE ET AL.: SECURE AND DISTRIBUTED DATA DISCOVERY AND DISSEMINATION IN WIRELESS SENSOR NETWORKS 1133the hash value of the first data packet hðP1Þ). Subsequently,each node can authenticate other data packets based on theMerkle hash tree (rsp. the data hash chain). With the assumptionthat the network owner cannot be compromised, it isguaranteed that any forged or modified data items can beeasily detected by the authentication process.User accountability. Users’ identities and their disseminationactivities are exposed to sensor nodes. Thus, sensornodes can report such records to the network owner periodically.Since each user certificate is generated according tothe user identity, except the network owner, no one canmodify the user identity contained in the user certificatewhich passes the authentication. Therefore, users cannotrepudiate their activities.Node compromise and user collusion tolerance. As describedabove, for basic protocol, only the public parameters arepreloaded in each node. Even for the improved protocol,the public-key/dissemination-privilege pair of each networkuser is loaded into the nodes. Therefore, no matterhow many sensor nodes are compromised, the adversaryjust obtains the public parameters and the public-key/dissemination-privilege pair of each user. Clearly, the adversarycannot launch any attack by compromising sensornodes. As described in Section 5, even if some users collude,a benign node will not grant any dissemination privilegethat is beyond those of colluding users.Resistance to DoS attacks. There are DoS attacks against basicDiDrip by exploiting: (1) authentication delays, (2) the expensivesignature verifications, and (3) the Trickle algorithm.First, with the use of Merkle hash tree or data hash chain,each node can efficiently authenticate a data packet by afew hash operations. Second, using the message specificpuzzle approach, each node can efficiently verify a puzzlesolution to filter a fake signature message and to forward adata packet using Trickle without waiting for signature verification.Therefore, all the above DoS attacks are defended.DiDrip can successfully defeat all three types of DoSattacks even if there are compromised network users andsensor nodes. Indeed, without the private key and the unreleasedpuzzle keys of the network users, even an insideattacker cannot forge any signature/data packets.Ensurance of freshness. If the privilege of a user allows him/her to disseminate data items to his/her own set of nodes, theversion number in each item can ensure the freshness ofDiDrip. On the other hand, if a node receives data items frommultiple users, the version number can be replaced by a timestampto indicate the freshness of a data item. More specifically,a timestamp is attached into the root of the Merkle hashtree (or the hash value of the first data packet).Scalability. Different from centralized approaches, anauthorized user can enter the network and then disseminatesdata items into the targeted sensor nodes. Moreover,as to be demonstrated by our experiments in a testbed with24 TelosB motes in the next section, the security functions inour protocol have low impact on propagation delay. Notethat the increase in propagation delay is dominated by thesignature verification time incurred at the one-hop neighboringnodes of the authorized user. Thus, the proposedprotocol is efficient even in a large-scale WSN with thousandsof sensor nodes. Also, as shown in Section 5.2, ourprotocol can support a large number of users.Also, as described above, DiDrip can achieve dynamic participation.Moreover, in the next section, our implementationresults will demonstrate DiDrip has lowenergy overhead.In the following, we give the formal proof of the authenticityand integrity of the disseminated data items in DiDripbased on the three assumptions below:Assumption 1: There exist pseudo-random functionswhich are polynomially indistinguishable from truly randomfunctions.Assumption 2: There exist target collision-resistance(TCR) hash functions [16], where if for all probabilistic-polynomial-time (PPT) adversaries, say A, A have negligibleprobability in winning the following game: A first choose amessage m, and then A are given a random function hð:Þ. Towin, A must output m0 6¼m such that hðm0Þ ¼ hðmÞ. Notethat in our scheme, the TCR hash function can be implementedby the common hash functions, such as SHA-1.Assumption 3: ECC signature is existentially unforgeableunder adaptive chosen-message attacks. Note thatin our scheme, ECC signature can use the standardECDSA of 160 bits.Theorem 1. DiDrip achieves the authenticity and integrity of dataitems, assuming the indistinguishability between pseudo-randomnessand true randomness, and assuming that hð:Þ is aTCR hash function and ECC signature is existentially unforgeableunder adaptive chosen-message attacks.Proof. Our theorem follows from Theorem A.1 in [16], andthus here we only give a proof sketch briefly. To beginwith, we assume ECC signature generation and verificationguarantee authenticity and integrity of signed messages,and every receiver has obtained an authentic copyof the legitimate sender’s public key. And we also assumethat hð:Þ is a TCR hash function. Therefore, here the securityof DiDrip is proven based on the indistinguishabilitybetween pseudo-randomness and true randomness.First, there exists a PPT adversary A, which can defeatauthenticity of data items in DiDrip. This means that Acontrols the communication links and manages, withnon-negligible probability, to deliver a message m to areceiver R, such that the sender S has not sent m but Raccepts m as authentic and coming from S. Then there isa PPT adversary B that uses A to break the indistinguishabilitybetween pseudo-randomness and true randomnesswith non-negligible advantage. That is, B getsaccess to h (as an oracle) and can tell with non-negligibleprobability if h is a pseudorandom function (PRFð:Þ) orif h is a totally random function.To this end, B can query on inputs x of its choice and beanswered with hðxÞ. Hence first B simulates for A a networkwith a sender S and a receiver R. Then B works byrunning A in the way similar to that in [17]. Namely, Bchooses a number l 2 f1; . . .; ng at random, where n is thetotal number of packets to be sent in the data dissemination.Note that B hopes thatAwill forge the lth packet Pl.B grants access to the oracle h, which is either PRFð:Þor an ideal random function. B can adaptively query anarbitrarily chosen x to the oracle and get the outputwhich is either PRFðxÞ or a random value uniformlyselected from f0; 1g_. After performing polynomiallymany queries, B finally makes the decision of whether or1134 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 26, NO. 4, APRIL 2015not the oracle is PRFð:Þ or the ideal random function. Asa result, B wins the game if the decision is correct.We argue that B succeeds with non-negligible probabilitysketchily. If h is a truly random function then A hasonly negligible probability to successfully forge thepacket Pl in the data items. Therefore, if h is random thenB makes the wrong decision only with negligible probability.On the other hand, we have assumed that if theauthentication is done using PRFð:Þ then A forges somepacket with non-negligible probability _. It follows that ifh is PRFð:Þ then B makes the right decision with probabilityat least _=l (which is also non-negligible).In addition, if the adversary A is able to cause areceiver R to accept a forged packet Pl, it implies theadversary A is able to find a collision on hðP0lÞ ¼ hðPlÞ inthe packet Pl_1. However, according to Assumption 2,hð:Þ is a TCR hash function. Moreover, due to the unforgeabilityof ECC signature (Assumption 3), it is impossiblethat A hands R a forged initial packet from S (Fordata hash chain approach, A forges the signatureSIGSKj ðP1Þ; For Merkle hash tree method, A forges thesignature SIGSKj ðrootÞ. Therefore, the above contradictionsmeans also that the authenticity and integrity ofdata items in DiDrip. tu7 IMPLEMENTATION AND PERFORMANCEEVALUATIONWe evaluate DiDrip by implementing all components on anexperimental test-bed. Also, we choose Drip for performancecomparison.7.1 Implementation and Experimental SetupWe have written programs that execute the functions of thenetwork owner, user and sensor node. The network ownerand user side programs are C programs using OpenSSL [18]and running on laptop PCs (with 2 GB RAM) under Ubuntu11.04 environment with different computational power.Also, the sensor node side programs are written in nesCand run on resource-limited motes (MicaZ and TelosB). TheMicaZ mote features an 8-bit 8-MHz Atmel microcontrollerwith 4-kB RAM, 128-kB ROM, and 512 kB of flash memory.Also, the TelosB mote has an 8-MHz CPU, 10-kB RAM, 48-kB ROM, 1MB of flash memory, and an 802.15.4/ZigBeeradio. Our motes run TinyOS 2.x. Additionally, SHA-1 isused, and the key sizes of ECC are set to 128 bits, 160 and192 bits, respectively. Throughout this paper, unlessotherwise stated, all experiments on PCs (respectively, sensornodes) were repeated 100,000 times (respectively,1,000 times) for each measurement in order to obtain accurateaverage results.To implement DiDrip with the data hash chain method(rsp. the Merkle hash tree method), the following functionalitiesare added to the user side program of Drip: constructionof data hash chain (rsp. Merkle hash tree) of around of dissemination data, generation of the signaturepacket and all data packets. For obtaining version numberof each data item, the DisseminatorC and DisseminatorPmodules in the Drip nesC library has been modified toprovide an interface called DisseminatorVersion. Moreover,the proposed hash tree method is implemented withoutand with using the message specific puzzle approachpresented in Appendix, resulting in two implementationsof DiDrip; DiDrip1 and DiDrip2. In DiDrip1, when a nodereceives a signature/data packet with a new version number,it authenticates the packet before broadcasting it to itsnext-hop neighbours. On the other hand, in DiDrip2, anode only checks the puzzle solution in the packet beforebroadcasting the packets. We summarize the pros andcons of all related protocols in Table 2.Based on the design of DiDrip, we implement the verificationfunction for signature and data packets based onthe ECDSA verify function and SHA-1 hash function ofTinyECC 2.0 library [19] and add them to the Drip nesClibrary. Also, in our experiment, when a network user(i.e., a laptop computer) disseminates data items, it firstsends them to the serial port of a specific sensor node inthe network which is referred to as repeater. Then, therepeater carries out the dissemination on behalf of theuser using DiDrip.Similar to [7], we use a circuit to accurately measure thepower consumption of various cryptographic operationsexecuted in a mote. The Tektronix TDS 3034C digital oscilloscopeaccurately measures the voltage Vr across the resistor.Denoting the battery voltage as Vb (which is 3 volts in ourexperiments), the voltage across the mote Vm is then Vb _ Vr.Once Vr is measured, the current through the circuit I canbe obtained by using Ohm’s law. The power consumed bythe mote is then VmI. By also measuring the execution timeof the cryptographic operation, we can obtain the energyconsumption of the operation by multiplying the powerand execution time.7.2 Evaluation ResultsThe following metrics are used to evaluate DiDrip; memoryoverhead, execution time of cryptographic operations andpropagation delay, and energy overhead. The memoryoverhead measures the required data space in the implementation.The propagation delay is defined as the timefrom construction of a data hash chain until the parameterson all sensor nodes corresponding to a round of disseminateddata items are updated.TABLE 2The Pros and Cons of All Related ProtocolsHE ET AL.: SECURE AND DISTRIBUTED DATA DISCOVERY AND DISSEMINATION IN WIRELESS SENSOR NETWORKS 1135Table 3 shows the execution times of some importantoperations in DiDrip. For example, the execution times forthe system initialization phase and signing a random20-byte message (i.e., the output of SHA-1 function) are1.608 and 0.6348 ms on a 1.8-GHz Laptop PC, respectively.Thus, if SHA-1 is used, generating a user certificate or signinga message takes 0.6348 ms on a 1.8-GHz Laptop PC.Fig. 3 shows the execution times of SHA-1 hash function(extracted from TinyECC 2.0 [19]) on MicaZ and TelosBmotes. The inputs to the hash function are randomly generatednumbers with length varying from 24 to 156 bytes inincrements of 6 bytes. Note that, in our protocol, the hashfunction is applied to an entire packet. There are several reasonsthat, possibly, a packet contains a few tens of bytes.First, the advertisement packet has additional informationsuch as certificate and signature. Second, with the Merklehash tree method, each packet contains the disseminateddata item along with the related internal nodes of the treefor verification purpose. Third, although several bytes is atypical size of a data item, sometimes a disseminated dataitem may be a bit larger. Moreover, for sensors with IEEE802.15.4 compliant radios, the maximum payload size is 102bytes for each packet. Therefore, we have chosen a widerrange of input size to SHA-1 to provide readers a more completepicture of the performance. We perform the sameexperiment 10,000 times and take an average over them. Forexample, the execution times on a MicaZ mote for inputs of54 bytes, 114 bytes, and 156 bytes are 9.6788, 18.947, and28.0515 ms, respectively. Also, the execution times on aTelosB mote for inputs of 54, 114, and 156 bytes are 5.7263,10.7529, and 15.629 ms, respectively.To measure the execution time of public key cryptography,as shown in Table 41, we have implemented the ECCverification operation (with a random 20-byte number asthe output) of TinyECC 2.0 library [19] on MicaZ and TelosBmotes. For example, it is measured that the signature verificationtimes are 2.436 and 3.955 seconds, which are 252 and691 times longer than SHA-1 hash operation with a 54-byterandom number as input on MicaZ and TelosB motes,respectively. It can be seen packet authentication based onthe Merkle hash tree (or data hash chain) is much more efficient.Therefore, it is confirmed that DiDrip is suitable forsensor nodes with limited resources.Next, we compare the energy consumption of SHA-1 hashfunction and ECC verification under the condition that theradio of the mote is turned off. When a MicaZ mote is used inthe circuit, Vr ¼ 138 mV, I ¼ 6.7779 mA, Vm ¼ 2.8620 V,P ¼ 19.3983 mW. When a TelosB mote is used, Vr ¼ 38 mV,I ¼ 1.8664 mA, Vm ¼ 2.9620 V, P ¼ 5.5283 mW. With theexecution time obtained from Fig. 3, the energy consumptionon the motes due to the SHA-1 operation can be determined.For example, the energy consumption of SHA-1 operationwith a random 54-byte number as input on MicaZ andTelosB motes are 0.18775 and 0.03166 mJ, respectively. Also,the energy consumption of ECC signature verificationoperation on MicaZ and TelosB motes are 2835.2555 and1316.1777 mJ, respectively.Next, the impact of security functions on the propagationdelay is investigated in an experimental network as shownin Fig. 4. The network has 24 TelosB nodes arranged in a4 _ 6 grid. The distance between each node is about 35 cm,TABLE 3Running Time for Each Phase of the Basic Protocol of DiDrip (Except the Sensor Node Verification Phase)Fig. 3. The execution times of SHA-1 hash function on MicaZ and TelosBmotes.TABLE 4Running Time for ECC Signature VerificationFig. 4. The 4 _ 6 grid network of TelosB motes for measuring propagationdelay.1. Note that ECC-160 is faster than ECC-128, because the columnwidth of ECC-160 is set to 5 for hybrid multiplication optimizationwhile that of ECC-128 is set to 4.1136 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 26, NO. 4, APRIL 2015and the transmission power is configured to be the lowestlevel so that only one-hop neighbours are covered in thetransmission range. The repeater is acted by the node locatingat the vertex of the grid.In the experiments, the packet delivery rate from the networkuser is 5 packets/s. The lengths of round and datafields in a data item are set to 4 bits and 2 bytes, respectively.A hash function with 8-byte truncated output is usedto construct data hash chains. An ECC-160 signature is 40bytes long. Each experiment is repeated 20 times to obtainan average measurement. Figs. 5 and 6 plot the averagepropagation delays of Drip, DiDrip1, and DiDrip2 when thedata hash chain and Merkle hash tree methods areemployed, respectively. It can be seen that the propagationdelay almost increases linearly with the number of dataitems per round for all three protocols. Moreover, the securityfunctions in DiDrip2 have low impact on propagationdelay. For these five experiments of the data hash chainmethod, DiDrip2 is just 3.448, 4.158, 3.222, 3.919 and 2.855 smore than that of Drip, respectively. Note that the increasein propagation delay is dominated by the signature verificationtime incurred at the one-hop neighboring nodes of thebase station. This is because each node carries out signatureverification only after forwarding data packets (with validpuzzle solutions).Table 5 shows the memory (ROM and RAM) usage ofDiDrip2 (with the data hash chain method) on MicaZ andTelosB motes for the case of four data items per round. Thecode size of Drip and a set of verification functions fromTinyECC (secp128r1, secp160r1 and secp192r1, which areimplementations based on various elliptic curves accordingto the Standards for Efficient Cryptography Group) areincluded for comparison. For example, the size of DiDripimplementation corresponds to 26.18 and 56.82 percent of theRAMand ROMcapacities of TelosB, respectively. Clearly, theROM and RAM consumption of DiDrip is more than that ofDrip because of the extra security functions. Moreover, it canbe seen thatmajority of the increasedROMis due to TinyECC.8 CONCLUSION AND FUTURE WORKIn this paper, we have identified the security vulnerabilitiesin data discovery and dissemination when used inWSNs, which have not been addressed in previousresearch. Also, none of those approaches support distributedoperation. Therefore, in this paper, a secure and distributeddata discovery and dissemination protocolnamed DiDrip has been proposed. Besides analyzing thesecurity of DiDrip, this paper has also reported the evaluationresults of DiDrip in an experimental network ofresource-limited sensor nodes, which shows that DiDripis feasible in practice. We have also given a formal proofof the authenticity and integrity of the disseminated dataitems in DiDrip. Also, due to the open nature of wirelesschannels, messages can be easily intercepted. Thus, in thefuture work, we will consider how to ensure data confidentialityin the design of secure and distributed datadiscovery and dissemination protocols.APPENDIXFURTHER IMPROVEMENT OF DIDRIP SECURITY ANDEFFICIENCYBy the basic DiDrip protocol, we can achieve secure and distributeddata discovery and dissemination. To furtherenhance the protocol, here we propose two modifications toFig. 5. Propagation delay comparison of three protocols when the datahash chain method is employed.Fig. 6. Propagation delay comparison of three protocols when the Merklehash tree method is employed.TABLE 5Code Sizes (Bytes) on MicaZ and TelosB MotesHE ET AL.: SECURE AND DISTRIBUTED DATA DISCOVERY AND DISSEMINATION IN WIRELESS SENSOR NETWORKS 1137improve the efficiency and security of DiDrip. For brevity,only those parts of the basic protocol that require changeswill be presented.Avoiding the Generation, Transmission andVerification of CertificatesThere are some efficiency problems caused by the generation,transmission, and verification of certificates. First, it isnot efficient in communication, as the certificate has to betransmitted along with the advertisement packet acrossevery hop as the message propagates in the WSN. A largeper-message overhead will result in more energy consumptionon each sensor node. Second, to authenticate each advertisementpacket, it always takes two expensive signatureverification operations because the certificate should alwaysbe authenticated first. To address these challenges, a feasibleapproach is that before the network deployment, the publickey/dissemination-privilege pair of each network user isloaded into the sensor nodes by the network owner. Once anew user joins the network after the network deployment,the network owner can notify the sensor nodes of the user’spublic key/dissemination privilege through using the privatekey of itself. The detailed description is as follows.User Joining PhaseAccording to the basic protocol of DiDrip, user Uj generatesits public and private keys and sends a 3-tuple<UIDj; Prij; PKj > to the network owner. When the networkowner receives the 3-tuple, it no longer generates thecertificate Certj. Instead, it signs the 3-tuple with its privatekey and sends it to the sensor nodes. Finally, each nodestores the 3-tuple.Packet Pre-Processing PhaseThe user certificate Certj stored in packet P0 is replaced byUIDj.Packet Verification PhaseIf this is an advertisement packet, according to the receivedidentity UIDj, node Sj first picks up the dissemination privilegePrij from its storage and then pays attention to thelegality of Prij. If the result is positive, node Sj uses the publickey PKj from its storage to run an ECDSA verify operationto authenticate the signature; otherwise, node Sjsimply discards the packet. Note that node Sj does not needto authenticate the certificate.As described above, the public-key/dissemination-privilegepair <UIDj; Prij; PKj > of each network user is just2 þ 6 þ 40 ¼ 48 bytes. Therefore, assuming the protocolsupports 500 network users, the code size is about 23 KB.We consider the resource-limited sensor nodes such asTelosB motes as examples. The 1-MB Flash memory isenough for storing these public parameters.Message Specific Puzzle Approach for Resistance toDoS attacksDiDrip uses a digital signature to bootstrap the authenticationof a round of data discovery and dissemination. Thisauthentication is vulnerable to DoS attacks. That is, anadversary may flood a lot of illegal signature message (i.e.,advertisement messages in this paper) to the sensor nodes toexhaust their resources and render them less capable of processingthe legitimate signature messages. Such an attack canbe defended by applying the message specific puzzleapproach [2]. This approach requires each signature messageto contain a puzzle solution. When a node receives a signaturemessage, it first checks that the puzzle solution is correctbefore verifying the signature. There are two characteristicsof the puzzles. First, the puzzles are difficult to be solved buttheir solutions are easy to be verified. Second, there is a tighttime limit to solve a puzzle. This discourages adversaries tolaunch the DoS attack even if they are computationally powerful.More details about this approach can be found in [2].Another advantage of applying the message specific puzzleis to reduce the dissemination delay, which is the time fora disseminated packet to reach all nodes in a WSN. Recallthat in step 1.a) of the packet verification phase, when a nodereceives the signature packet, it first carries out the signatureverification before using the Trickle algorithm to broadcastthe signature packet. This means that the disseminationdelay depends on the signature verification time tsv. On theother hand, when the message specific puzzle approach isapplied, a node can just verify the validity of puzzle solutionbefore broadcasting the signature packet. Then, the disseminationdelay only depends on the puzzle solution verificationtime tpv. Since tpv _ tsv, the dissemination delay issignificantly reduced. Moreover, the reduction in disseminationdelay is proportional to the network size. This is demonstratedby the experiments presented in Section 7.2.ACKNOWLEDGMENTSThis research is supported by a strategic research grant fromCity University of Hong Kong [Project No. 7004054], theFundamental Research Funds for the Central Universities,and the Specialized Research Fund for the Doctoral Programof Higher Education. D. He is the correspondingauthor of this article.and MEng degrees from the Harbin Institute ofTechnology, China, and the PhD degree fromZhejiang University, China, all in computerscience in 2007, 2009, and 2012, respectively.He is with the School of Computer Science andEngineering, South China University of Technology,P.R. China, and also with the College ofComputer Science and Technology, ZhejiangUniversity, P.R. China. His research interestsinclude network and systems security. He is anassociate editor or on the editorial board of some international journalssuch as IEEE Communications Magazine, Springer Journal of WirelessNetworks, Wiley’s Wireless Communications and Mobile ComputingJournal, Journal of Communications and Networks, Wiley’s Security andCommunication Networks Journal, and KSII Transactions on Internetand Information Systems. He is a member of the IEEE.Sammy Chan (S’87-M’89) received the BE andMEngSc degrees in electrical engineering fromthe University of Melbourne, Australia, in 1988 and1990, respectively, and the PhD degree in communicationengineering from the Royal MelbourneInstitute of Technology, Australia, in 1995. From1989 to 1994, he was with Telecom AustraliaResearch Laboratories, first as a research engineer,and between 1992 and 1994 as a seniorresearch engineer and project leader. SinceDecember 1994, he has been with the Departmentof Electronic Engineering, City University of Hong Kong, where he is currentlyan associate professor. He is a member of the IEEE.Mohsen Guizani (S’85-M’89-SM’99-F’09)received the BS (with distinction) and MSdegrees in electrical engineering, the MS andPhD degrees in computer engineering in 1984,1986, 1987, and 1990, respectively, from SyracuseUniversity, Syracuse, New York. He is currentlya professor and the associate vicepresident for Graduate Studies at Qatar University,Qatar. His research interests include computernetworks, wireless communications andmobile computing, and optical networking. Hecurrently serves on the editorial boards of six technical journals andthe founder and EIC of “Wireless Communications and MobileComputing” Journal published by John Wiley (http://www.interscience.wiley.com/jpages/1530-8669/). He is a fellow of the IEEE and a seniormember of ACM.Haomiao Yang (M’12) received the MS and PhDdegrees in computer applied technology from theUniversity of Electronic Science and Technologyof China (UESTC) in 2004 and 2008, respectively.From 2012 to 2013, he worked as a postdoctoralfellow at Kyungil University, Republic ofKorea. Currently, he is an associate professor atthe School of Computer Science and Engineering,UESTC, China. His research interestsinclude cryptography, cloud security, and bigdata security. He is a member of the IEEE.Boyang Zhou is currently working toward thePhD degree from the College of Computer Scienceat Zhejiang University. His research areasinclude software-defined networking, futureinternet architecture and flexible reconfigurablenetworks.” For more information on this or any other computing topic,please visit our Digital Library at www.computer.org/publications/dlib.HE ET AL.: SECURE AND DISTRIBUTED DATA DISCOVERY AND DISSEMINATION IN WIRELESS SENSOR NETWORKS 1139

Receiver Cooperation in Topology Control for Wireless Ad-Hoc Networks

05/08/201902/07/2019 by admin

Abstract—We propose employing receiver cooperation in centralizedtopology control to improve energy efficiency as well asnetwork connectivity. The idea of transmitter cooperation hasbeen widely considered in topology control to improve networkconnectivity or energy efficiency. However, receiver cooperationhas not previously been considered in topology control. In particular,we show that we can improve both connectivity and energy efficiencyif we employ receiver cooperation in addition to transmittercooperation. Consequently, we conclude that a system based bothon transmitter and receiver cooperation is generally superior toone based only on transmitter cooperation. We also show that theincrease in network connectivity caused by employing transmittercooperation in addition to receiver cooperation is at the expense ofsignificantly increased energy consumption. Consequently, systemdesigners may opt for receiver-only cooperation in cases for whichenergy efficiency is of the highest priority or when connectivityincrease is no longer a serious concern.Index Terms—Ad-hoc network, energy efficiency, multi-hopcommunications, network connectivity, receiver cooperation,topology control, transmitter cooperation.I. INTRODUCTIONTHE wireless ad-hoc network has been receiving growingattention during the last decade for its various advantagessuch as instant deployment and reconfiguration capability. Ingeneral, a node in a wireless ad-hoc network suffers fromconnectivity instability because of channel quality variation andlimited battery lifespan. Therefore, an efficient algorithm forcontrolling the communication links among nodes is essentialfor the construction of a wireless ad-hoc network. In a topologycontrol scheme, communication links among nodes are definedto achieve certain desired properties for connectivity, energyconsumption, mobility, network capacity, security, and so on.In this paper, we propose topology control schemes that aimManuscript received February 23, 2014; revised July 24, 2014 and November12, 2014; accepted November 12, 2014. Date of publication December 4, 2014;date of current version April 7, 2015. Part of this work was presented at IEEEWCNC, Shanghai, China, April 2013. This work was supported in part byBasic Science Research Program through the National Research Foundationof Korea (NRF) funded by the Ministry of Education (NRF-2010-0025062 andNRF-2013R1A1A2011098). The associate editor coordinating the review ofthis paper and approving it for publication was M. Elkashlan. (Correspondingauthors: Do-Sik Yoo and Seong-Jun Oh).K. Moon, W. Lee, and S.-J. Oh are with Korea University, Seoul 136-701,Korea (e-mail: keith@korea.ac.kr; wlee@korea.ac.kr; seongjun@korea.ac.kr).D.-S. Yoo is with the Department of Electronic and Electrical Engineering,Hongik University, Seoul 121-791, Korea (e-mail: yoodosik@hongik.ac.kr).Color versions of one or more of the figures in this paper are available onlineat http://ieeexplore.ieee.org.Digital Object Identifier 10.1109/TWC.2014.2374617to increase the energy efficiency and the network connectivitysimultaneously.In a wireless ad-hoc network, two nodes that are not directlyconnected may possibly communicate with each other throughso-called multi-hop communications [1], [2]. By employingmulti-hop communication, a node in a wireless ad-hoc networkcan extend its communication range through cascaded multihoplinks and eliminate some dispensable links to reduce the totalrequired power. Various efforts have been made to study howthe links must be maintained and how much power must be associatedwith each of those links for optimal network operationsdepending on the situation at hand. For example, Kirousis et al.[3] and Clementi et al. [4] studied the problem of minimizingthe sum power consumption of the nodes in an ad-hoc networkand showed that this problem is nondeterministic polynomialtime(NP) hard. Because the sum power minimization problemis NP hard, the authors in [4] proposed a heuristic solution forpractical ad-hoc networks. Ramanathan and Rosales-Hain, in[5], proposed two topology control schemes that minimize themaximum transmission power of each node with bi-directionaland directional strong connectivities, respectively. When thenumber of participating nodes is very large, it is crucial toreduce the transmission delay due to multi-hop transmissions.To maintain the total transmission delay within a tolerable limit,Zhang et al. studied delay-constrained ad-hoc networks in [6]and Huang et al. proposed a novel topology control scheme in[7] by predicting node movement.In [3]–[7], it was assumed that there exists a centralizedsystem controlling nodes so that global information such asnode positions and synchronization timing is known by eachnode in advance. However, such an assumption can be toostrong, especially in the case of ad-hoc networks. For thisreason, a distributed approach has been widely considered [8]–[11], where each node has to make its decision based on the informationit has collected from nearby neighbor nodes. Li et al.proposed a distributed topology control scheme in [8] andproved that the distributed topology control scheme preservesthe network connectivity compared with a centralized one.Because the topology control schemes in [3]–[8] guarantee onlyone connected neighbor for each node, the network connectivitycan be broken even when only a single link is disconnected.Accordingly, a reliable distributed topology control scheme thatguarantees at least k-neighbors was proposed in [9]. The resultin [9] was extended to a low computational complexity schemein [10], to a mobility guaranteeing scheme in [11], and to anenergy saving scheme in [12].1536-1276 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.MOON et al.: RECEIVER COOPERATION IN TOPOLOGY CONTROL FOR WIRELESS AD-HOC NETWORKS 1859In [13], the concept of cooperative communications was firstemployed in centralized topology control, where it was shownthat cooperative communications can dramatically reduce thesum power consumption in broadcast network. Cardei et al.applied the idea of [13] to wireless ad-hoc networks in [14].Yu et al. further showed that cooperative communications canextend the communication range of each node with only amarginal increment in power consumption so that networkconnectivity is increased in an energy efficient manner [15],[16]. Because of these various advantages, the idea of cooperativecommunications has been widely considered in recentstudies on topology control to maximize capacity [17], improverouting efficiency [18], and mitigate interference from nearbynodes [19], [20]. The idea of cooperative communications inthese previous works [13]–[20] is realized in the followingway. First, a transmitting node sends a message to its neighbornodes (called helper nodes). After the helper nodes decode themessage, they (as well as the transmitting node in some cases)retransmit the message to a receiving node, and the receivingnode decodes the message by combining the signals frommultiple nodes. Therefore, strictly speaking, only the conceptof transmitter cooperation has been employed, and receivercooperation has not been considered.In this paper, we propose to employ the idea of receivercooperation in centralized topology control schemes, possiblyin combination with transmitter cooperation, to increase thenetwork connectivity in an energy efficient way. Consequently,we propose two centralized topology control schemes, onebased solely on receiver cooperation, and the other basedboth on transmitter and receiver cooperation. For comparisonwith proposed schemes, we consider a cooperative topologycontrol scheme in [16] that is based solely on transmittercooperation. We show, through extensive simulations, that wecan improve both network connectivity and energy efficiencyif we employ receiver cooperation in addition to transmittercooperation. We conclude that the system based both ontransmitter and receiver cooperation is generally superior tothat based only on transmitter cooperation. We also showthat the system based solely on receiver cooperation is asenergy efficient as one based both on transmitter and receivercooperation despite a slight decrease in network connectivity.Although the system based both on transmitter and receivercooperation achieves higher network connectivity than onebased only on receiver cooperation, we show that the additionalconnectivity increase requires significantly increased energyconsumption. For this reason, system designers may opt forreceiver-only cooperation, if energy efficiency is of the highestpriority or connectivity increase is no longer of seriousconcern.The remainder of this paper is organized as follows.In Section II, we describe the channel model consideredthroughout this paper. In Section III, we explain the topologycontrol scheme without cooperation that underlies the twocooperative topology control schemes considered in this paper.The two cooperative topology control schemes are then describedin Section IV. Furthermore, the performance of the twocooperative topology control schemes are numerically analyzedin Section V. Finally, we draw conclusions in Section VI.II. SYSTEM MODELIn this section, we describe the system model consideredthroughout this paper. We consider a network V ≡{v1, v2, . . . , vn} consisting of n nodes that are assumed to beuniformly distributed over a certain region in R2. The nodesare assumed to communicate with one another by transmittingsignals over a wireless channel with given bandwidth W. Weassume that the physical location of each node does not changewith time.To model a practical wireless channel, we assume that thepath loss PL(di j) between nodes vi and vj is given byPL(di j)[dB] = PLd0 +10k log_di jd0_+2loghi j +Xó+c. (1)Here, PLd0 is the reference path loss at unit distance d0 obtainedfrom the free space path loss model [21], and k denotes the pathloss exponent that represents how quickly the transmit powerattenuates as a function of the distance. The variables di j andhi j respectively denote the distance and the randomly varyingfast fading coefficient between vi and vj . In addition, Xó is arandom variable introduced to account for the shadowing effect.We assume that hi j and Xó vary independently from packetto packet, but remain constant during each packet duration.We assume further that h2i j follows a ÷2-distribution with twodegrees of freedom and Xó follows a normal distribution withzero mean and standard deviation ó. Finally, the variable c is theoffset correction factor between the mathematical model andfield measurement. We note that the values of PLd0 , d0, k, ó,and c vary depending on channel scenario, urban or suburban[22]. For given PLd0 , d0, k, ó, and c, when node vi transmitsa signal to node vj with power Pi, the received signal to noiseratio (SNR) ãi j(Pi) is given asãi j(Pi) =PiN0, jW×100.1×PL(di j), (2)where N0, j denotes the one-sided noise power spectral densityat vj . Throughout this paper, we assume that the maximumtransmit power of each node is given by Pmax.As the final issue in the system model, we briefly discuss networksynchronization. Communication in a completely asynchronousmanner is impossible, or at least be very difficult toachieve. In fact, synchronization can be a particularly importantissue in ad-hoc networks [23]–[25]. In this paper, we assumethat symbol level synchronization is maintained among participatingnodes. Although detailed synchronization techniquesare not the main focus of this paper, we briefly describe howthe issue of synchronization can be resolved with existingmethods. Synchronization techniques have been reported thatit can achieve time errors around 3 ∼ 7 μs. At such a level ofsynchronization, it will become desirable to maintain symbolduration longer than 50 μs, which corresponds to symbol rateof up to 20 kilo-symbols per second. A symbol rate of 20 kilosymbolswith rudimentary binary phase shift keying (BPSK)modulation results in a data-rate of only 20 kbps, which is notvery high. However, we can employ multi-carrier techniquessuch as orthogonal frequency division multiplexing (OFDM)to increase the data rate while maintaining or reducing thesymbol rate. For example, if we employ an OFDM system1860 IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS, VOL. 14, NO. 4, APRIL 2015Fig. 1. A pictorial representation of G = (V,E) with V = {v1, . . . , v8} and E = {(v1;v2)NN, (v1;v3)NN, (v4;v5)NN, (v4;v6)NN, (v4;v7)NN}.with 512 subcarriers, the data rate can be increased to about10 Mbps using a simple BPSK sub-carrier modulation scheme.Consequently, even with existing techniques such as the OFDMscheme and synchronization algorithms proposed in [25], it ispossible to maintain the symbol-level synchronization requiredto implement the algorithms proposed in this paper.III. NODE-TO-NODE TOPOLOGY CONTROLIn this section, we explain a topology control scheme, whichwe refer to as the node-to-node topology control (NNTC)scheme, that is based solely on node-to-node communicationlinks. To describe the NNTC scheme, we first consider theconcept of a wireless communication link between two nodesand its related definitions. In this paper, a wireless link betweentwo nodes is said to exist if the received SNR exceeds a certainthreshold, meaning that the packet error probability is below acertain level (corresponding to the threshold). More formally,we say that there exists a node-to-node (N-N) link from node vito node vj if and only iff (ãi j(Pi)) ≤ áô, (3)for a certain transmit power Pi ≤ Pmax from vi. Here, f : R + →[0,1] denotes the packet error probability function associatedwith the given coding and modulation scheme and áô is thegiven threshold on the packet error probability, which we callthe error threshold hereafter. We assume that f is a monotonicallydecreasing continuous function and that all the nodesshare the same packet error probability function f .1When there exists a uni-directional N-N link from vi to vj ,the power Pi that satisfies (3) with equality, which we denoteby PNN(vi → vj), is called the minimum N-N routable powerof N-N link from vi to vj . We note that PNN(vi → vj) directlyfollows from the definition thatPNN(vi →vj) =N0, jW f−1(áô)100.1×PL(di j). (4)If both the uni-directional N-N links from vi to vj and from vjto vi exist, we say that there exists an N-N bi-directional link,or simply an N-N link between the two nodes vi and vj that1In many previous works on topology controls [14]–[17], (3) is equivalentlywritten as ãi j(Pi) ≥ SNRô ≡ f−1(áô). However, to consider the receivercooperation scheme in a unified framework, we directly consider the packeterror probability function f .we denote by (vi; vj)NN. The minimum N-N round-trip powerPNN(vi, vj) of the bi-directional N-N link (vi; vj)NN is definedas the sum of the two uni-directional minimum N-N routablepowers, namely, asPNN(vi, vj) = PNN(vi →vj)+PNN(vj →vi). (5)We note that there are some situations in which two nodes viand vj can communicate with each other even if there is no NNlink between vi and vj . For example, we consider the case inwhich there are two N-N links (v1; v2)NN and (v1; v3)NN. In thiscase, v2 and v3 can exchange a message through v1 even if thereis no N-N link between v2 and v3. To route a message throughmultiple N-N links, all available N-N links should be knownto the nodes. To reduce the routing complexity, only some ofthe existing N-N links are used for communications in practice.By eliminating redundant links, we can simplify the messagerouting protocol and save power consumed for exchangingreference signals such as pilot and channel information [26],[27].We denote the set of N-N links to be used for routing by E.Consequently, (vi; vj)NN ∈ E means that there exists N-N link(vi; vj)NN and this N-N link is to be used for routing. Here, wenote that (vi; vj)NN /∈E does not necessarily mean that there isno N-N link between vi and vj . In graph theory, the combinationG = (V,E) of V and E is called a graph with vertex set V andedge set E. In the remainder of this paper, nodes and links shallalso be referred to as vertexes and edges, respectively.For a given E, if (vi; vj)NN ∈ E, vi is said to be a neighborof vj and vice versa. We denote by N(vi|E) the setof neighbors of vi. For illustration, we consider the graphG = (V,E) with V = {v1, v2, . . . , v8} and E = {(v1; v2)NN,(v1; v3)NN, (v4; v5)NN, (v4; v6)NN, (v4; v7)NN}, which compactlydescribes the situation in Fig. 1. In this example, v5, v6 andv7 are neighbors of v4, therefore, N(v4|E) = {v5, v6, v7}. Here,we note that v5 is not a neighbor of v7, however, it is possiblefor v5 to send a message to v7 if (v4; v5)NN and (v4; v7)NNare cascaded. Likewise, if vi and vj can send a message bidirectionallyusing a single or cascaded multiple N-N edges, wesay that vi and vj are connected by N-N edges. The maximal setof nodes connected by N-N edges in E is referred to as a cluster.For notational convenience, a given cluster {vi1 , vi2 , . . . , vim} isdenoted by Ùmax{i1,i2,…,im}. For instance, in Fig. 1, there arethree clusters {v1, v2, v3}, {v4, v5, v6, v7}, and {v8}, which aredenoted by Ù3, Ù7, and Ù8, respectively. As shown in thisMOON et al.: RECEIVER COOPERATION IN TOPOLOGY CONTROL FOR WIRELESS AD-HOC NETWORKS 1861Fig. 2. Steps to construct the edge set E for a given node distribution V. (a) Identification of all N-N links. (b) A typical example of a spanning forest ofGL = (V,L).example, several clusters can exist for a given graph.We denotethe set of all clusters by V . We note that V = {Ù3,Ù7,Ù8} inthe above example.We now describe precisely how the set E of N-N edges tobe used for routing in the NNTC scheme is constructed. For agiven node set V, the set L of all existing N-N links and the setV of clusters defined by the graph GL = (V,L) are identified.Next, the edge set E is defined as a subset of L such that thegraph G = (V,E) also leads to the same cluster set V as graphGL = (V,L). Several candidate algorithms exist that can buildE such as breath-first search (BFS) [28] and depth-first search(DFS) [29]. In this paper, we use the minimum-weight spanningforest (MSF) algorithm that aims to build a sparse edge setusing the optimal average power required for network structureconstruction [1], [8], [15], [16]. In the MSF algorithm, first a setTÙ called a minimum spanning tree (MST), is defined for eachcluster Ù ∈ V . After obtaining all the MSTs, the set FV , calledthe minimum spanning forest of V, is defined as the union of allthe MSTs, namely, asFV = _Ù∈VTÙ, (6)which is defined to be edge set E in the NNTC scheme.It now remains to describe how the MST TÙ is obtained foreach cluster Ù ∈ V . If Ù is a singleton, then TÙ is defined to bethe empty set /0. If Ù contains more than one node, to obtain TÙ,it is necessary to consider the set L|Ù of all edges that connectnodes in Ù. For instance, we consider the example depictedin Fig. 2(a) in which the network consists of three singletonclusters and nine non-singleton clusters. For a non-singletoncluster Ù encircled by a red colored line, the edge set L|Ù isdefined as the set of all edges inside the red circle. We call asubset T of L|Ù a spanning tree of Ù if and only if there are nocycles (loops) in T and if any two nodes in Ù are connected byedges in T. For example, the edge set of each cluster depictedin Fig. 2(b) is a spanning tree of that cluster. Among all theexisting spanning trees of Ù, the one that leads to the minimumedge-weight sum is referred to as the MST TÙ of Ù. Here, theminimum N-N round-trip power PNN(vi, vj) of the N-N link isused for the weight of each edge (vi, vj)NN ∈ L.We note that transmission through the link in FV is not completelyerror-free, but has a packet error probability of áô. However,in the following, we assume that the communication linkin FV is error-free, possibly with the help of an automatic repeatand request (ARQ) scheme. Clearly, the repeated transmissionwill consume additional energy. However, even with the simplestARQ scheme, the average required energy to complete asuccessful transmission is increased from a single transmission(with packet error rate áô) by a factor of 1/(1−áô) [30]. Wenote that the factor 1/(1−áô) is reasonably close to 1 if áô ischosen to be small, say, less than 0.1. Therefore, if áô is sufficientlysmall, the additional cost for error-free communicationis only a small fraction of the total cost and hence is negligible.IV. COOPERATIVE TOPOLOGY CONTROLWe note that inter-cluster communication, namely, communicationbetween nodes belonging to different clusters is notpossible solely through cascaded N-N links. To make interclustercommunications possible, [16] employed the idea oftransmitter cooperation in which multiple nodes in one clustersimultaneously transmit the same message to a single node inanother cluster. In [16], to keep the additional complexity due tothe employment of cooperative transmission manageable, it wasassumed that a pair of nodes belonging to two communicatingclusters were pre-assigned so that communications between thetwo clusters could only happen between these two nodes withthe help of nodes in their neighborhoods. We note that notonly the neighboring nodes around the transmitting node butalso the nodes around the receiving node can help to establishinter-cluster communications. Consequently, in this paper, wepropose to employ receiver cooperation in which the interclustercommunication is regarded as successful if the receivingnode or any of the neighboring nodes succeeds in receiving the1862 IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS, VOL. 14, NO. 4, APRIL 2015message correctly. If the neighboring nodes only around thereceiving node participate in the cooperation, the establishedlink between two clusters is referred to as the node-to-cluster(N-C) link. Furthermore, if neighboring nodes around boththe transmitting and receiving nodes participate in the linkestablishment, the inter-cluster communication link is calledcluster-to-cluster (C-C) link.In this section, we describe two centralized cooperativetopology control schemes based on N-C and C-C links thatare referred to as node-to-cluster topology control (NCTC) andcluster-to-cluster topology control (CCTC) schemes, respectively.In each of these cooperative topology control schemes,cooperative links are employed to connect the clusters obtainedfrom the graph G = (V,E) described in Section III. Consequently,the network configuration defined in a cooperativetopology control scheme is described by four sets, namely, theset V of nodes, the set E of edges used for routing in the NNTCscheme, the set V of clusters defined by the graph G = (V,E),and the set E of cooperative edges. For this reason, the networkconfigurations defined in the NCTC and CCTC schemes areidentified by GNC = (V,E,V ,ENC) and GCC = (V,E,V ,ECC),respectively. Here, ENC and ECC consist only of N-C and C-Cedges, respectively.A. NCTCIn this subsection, we describe how the network configurationGNC = (V,E,V ,ENC) corresponding to the NCTC schemeis defined. Given graph G = (V,E) and corresponding clusterset V , the edge set ENC is obtained in three steps. First, theset LNC of all N-C links connecting clusters in V is identified.Next, for each N-C link in LNC, the weight of the link isdefined as the minimum power required to establish it. Finally,the desired edge set ENC is defined as the MSF of the graphGLNC = (V ,LNC).To describe the NCTC scheme, we first define the nodeto-cluster (N-C) link. For more concrete understanding ofN-C link, we consider a simple example of receiver cooperationbetween two clusters Ù3 ={v1, v2, v3} and Ù7 ={v4, v5, v6, v7}.For illustration, we assume that the inter-cluster communicationlink between two clusters is established if the error probabilityis less than or equal to 0.1. We assume that the decoding errorprobabilities at nodes v4, v5, v6 and v7 are, respectively, given as0.3, 0.4, 0.8, and 0.9 when v1 sends a message with maximumpower. Consequently, node v1 and a node in Ù7 cannot establishinter-cluster communications between Ù3 and Ù7 throughN-N links. However, if any of the nodes in Ù7 succeed incorrectly decoding the message, the message can be routed toany of the desired nodes in Ù7. If such receiver cooperation isemployed, communication fails only when all four nodes v4, v5,v6, and v7 fail to decode the message at the same time. We notethat such a probability is 0.3×0.4×0.8×0.9 = 0.0864 < 0.1.For this reason, we say that cooperative communication linkbetween Ù3 and Ù7 is established.In the above example, all nodes in the receiving cluster tryto decode the transmitted message. However, if the size of thereceiving cluster is large, the routing protocol and maintenancecost can become very burdensome. For this reason, we assumethat a certain receiving node and its one hop neighbors participatein the receiver cooperation. To be more precise, for a givenpair of clusters, a certain node is selected from each clusterand the signal is assumed to be transmitted from either of thesetwo nodes and then received by the other node and its one-hopneighbors.We note that there exists a more aggressive method of receivercooperation than the one described above. For example,the bridge node can achieve a huge combining gain if thehelper nodes transmit observed soft information rather thandecoded bits. However, the transmission of the observed datagenerally consumes large amount of energy and bandwidth.Consequently, a sufficiently fine quantization must be consideredto employ soft combining. Because this problem ishighly complex, we assume in this paper that the helper nodesdecode the message and deliver it to the bridge node. However,considering the importance of this problem, serious researchemploying soft combining schemes should be pursued.For a more formal description, we consider two non-emptyclusters Ùl and Ùm from the given graph G = (V,E) defined inthe NNTC scheme. We formally define the concept of an N-Clink as follows.Definition 1: Let vbl∈ Ùl and vbm∈ Ùm. Then, we saythat there exists a bi-directional N-C link, or simply, a N-Clink denoted by (vbl ,N(vbl|L); vbm,N(vbm|L))NC between Ùland Ùm, if and only ifÐvr∈{vbm}∪N(vbm|L)f_ãbl r(Pbl )_≤ áô (7)andÐvr∈{vbl}∪N(vbl|L)f (ãbmr(Pbm)) ≤ áô (8)for some Pbl≤ Pmax and Pbm≤ Pmax.Here, L denotes the set of all N-N links described inSection III. In other words, all one-hop neighbors of the receivingnode are assumed to participate in receiver cooperationregardless whether they belong to E. We note that the errorprobability between helper and bridge node is assumed tobe zero, as mentioned in Section III. For a given N-C link(vbl ,N(vbl|L); vbm,N(vbm|L))NC, nodes vbl and vbm and setsN(vbl|L) and N(vbm|L) are called the bridge nodes and helpersets, respectively.In Definition 1, we note that the sum of the Pbl and Pbm valuesthat satisfy (7) and (8) with equality is the minimum total transmissionpower required to make round-trip communicationbetween Ùl and Ùm through (vbl ,N(vbl|L); vbm,N(vbm|L))NC.Because the sum Pbl +Pbm depends on the choice of the N-Clink, it is natural to choose the N-C link that minimizes the sumpower Pbl +Pbm. The minimized sum power shall be referred toas the minimum N-C round-trip power and the correspondingN-C link as the minimum power N-C link between Ùl and Ùm.We denote by PNC(Ùl ,Ùm) the minimum N-C round-trip powerbetween Ùl and Ùm.We now describe how we establish communications betweenÙl and Ùm. First, let vbl∈ Ùl and vvm∈ Ùm be the bridgenodes of the minimum power N-C link between Ùl and Ùmand let Hl and Hm be the helper sets of the link. We nowMOON et al.: RECEIVER COOPERATION IN TOPOLOGY CONTROL FOR WIRELESS AD-HOC NETWORKS 1863Fig. 3. Steps to construct the edge set ENC for the given graph G = (V,E). (a) Identification of all N-C links. (b) A typical example of a spanning forest ofGLNC = (V ,LNC).assume that a source node vs in Ùl −{vbl} attempts to senda message to destination node vd in Ùm −{vbm}. In this case,vs sends the message to bridge node vbl through cascaded N-Nedges, and then bridge node vbl transmits the message to Ùm.The message sent from vbl is then decoded at bridge nodevbm and all the nodes in the helper set Hm. Because of thedefinition of the N-C link, the message must be decoded, withnegligible failure rate, at least at one node in {vbm} ∪ Hm.Because Hm consists only of the one hop neighbors of vbm, thenodes that successfully decode the message can be determinedby vbm with little overhead. After determining the nodes thatsuccessfully decoded the message, vbm delivers the message totarget destination node vd through the cascaded N-N edges.Finally, we describe how the edge set ENC is constructedin the NCTC scheme. First, the minimum power N-C link isidentified for each pair of clusters between which N-C linksexist. Let LNC denote the set of the minimum power N-C linksobtained as the result. For each (vbl ,Hl ; vbm,Hm)NC ∈ LNC,the weight is then defined as the corresponding minimum N-Cround-trip power. After computing all the weights of LNC, thesparse edge set ENC is defined as theMSF of GLNC =(V ,LNC).Note that the MSF construction procedure described inSection III can be directly applied here by substituting V andL with V and LNC, respectively. In Fig. 3, the procedure isillustrated. For instance, Fig. 3(a) indicates all the minimumpower N-C links between clusters by solid red lines andFig. 3(b) illustrates the shape of a typical spanning forest thatdoes not include any loops. Likewise, after finding all thespanning forests of GLNC = (V ,LNC), the one that minimizesthe sum weight is defined as the MSF ENC. After obtainingthe ENC, the desired final graph GNC = (V,E,V ,ENC) for theNCTC scheme is constructed.B. CCTCIn this subsection, we describe the CCTC scheme and explainhow the network configuration GCC = (V,E,V ,ECC) correspondingto the CCTC scheme is defined. We first explainthe concept of a cluster-to-cluster (C-C) link and the relatedrouting protocol with a simple example. We assume that sourcenode vs ∈ Ùl attempts to send a message to destination nodevd ∈ Ùm. In this case, vs sends a message through cascadedN-N edges to a pre-defined bridge node vbl . After receivingthe message, vbl disseminates the message to the nodes in apre-defined helper set Hl . After decoding the message, vbl andvhl∈ Hl simultaneously transmit the message to Ùm in thenext time frame. In Ùm, a pre-defined bridge node vbm andthe nodes in a pre-defined helper set Hm attempt to decode themessage with the multiple signal replicas from the transmitters.If the maximum ratio combiner (MRC) [31] is employed at thereceiving node vr ∈ {vbm}∪Hm, the combined average receivedSNR ¯ãr at vr can be written as¯ãr = ãbl r(Pbl)+ Óvhl∈Hlãhl r(Phl ), (9)and the decoding error probability at vr is given as f (¯ãr). Toestablish the symbol combining in (9), the same signals fromthe multiple transmitters should be received at the same timeas assumed in [13]. We note that problems related to timesynchronization were discussed in Section II. Similarly to thecase for N-C links, we say that the message is decodable, withnegligible failure rate, at least at one node in {vbm}∪Hm ifÐvr∈{vbm}∪Hmf (¯ãr) ≤ áô (10)with small enough áô, where f (·) denotes the common packeterror probability function for given received SNR, as defined inSection II. If the inequality (10) holds, we say that there exists aC-C link from Ùl to Ùm. Once the message is decoded at nodesin {vbm}∪Hm, the message is delivered to destination node vdthrough cascaded N-N edges to complete the routing procedure.To maintain the C-C link power efficiently, it is necessaryto choose appropriately the node pair (vbl , vbm), the helper set1864 IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS, VOL. 14, NO. 4, APRIL 2015(Hl ,Hm), and the transmission power from each transmittingnode to minimize the power consumption. However, the computationalcomplexity makes such an optimization algorithmhardly feasible not only in practical systems but also in simulationenvironments [14]. For this reason, it is widely assumedthat nodes participating in transmitter cooperation use the samepower [15], [16]. Consequently, we adopt the same assumptionwhen designing the CCTC scheme.For a more formal description, we consider two non-emptyclusters Ùl and Ùm from a given graph G = (E,V). We definethe concept of a C-C link in the following definition.Definition 2: Let vbl∈ Ùl , vbm∈ Ùm, Hl ⊂ N(vbl|L), andHm ⊂ N(vbm|L). Then, we say that there exists a bi-directionalC-C link, or simply, a C-C link denoted by (vbl ,Hl ; vbm,Hm)CCbetween Ùl and Ùm if and only ifÐvr∈{vbm}∪Hmf⎛⎝ãbl r(Pcl)+ Óvhl∈Hlãhl r(Pcl )⎞⎠ ≤ áô, (11)andÐvr∈{vbl}∪Hlfãbmr(Pcm)+ Óvhm∈Hmãhmr(Pcm)_≤ áô (12)for some Pcl≤ Pmax and Pcm≤ Pmax.Here, Pcl and Pcm denote the common transmission powersof transmitting nodes in Ùl and Ùm, respectively. For a givenC-C link (vbl ,Hl ; vbm,Hm)CC, the nodes vbl and vbm are calledthe bridge nodes and the sets Hl and Hm are called the helpersets between Ùl and Ùm. Such terminology is the same for ofN-C links. However, in the case of C-C links, the nodes in thehelper set participate not only in receiver cooperation but alsoin transmitter cooperation.In Definition 2, we note that the total transmission powerminimally required to make round-trip communication betweenÙl and Ùm is given by (|Hl |+1)Pcl +(|Hm|+1)Pcm using thevalues for Pcl and Pcm that satisfy (11) and (12) with equality.Here, |X| denotes the cardinality of set X. We also note that therequired total transmission power (|Hl |+1)Pcl +(|Hm|+1)Pcmvaries depending on the choice of the C-C link. Consequently, itis natural to choose the C-C link that leads to the smallest totalrequired transmission power. The smallest total required transmissionpower and the corresponding C-C link are referred to asthe minimum C-C round-trip power and minimum power C-Clink between Ùl and Ùm, respectively. We denote the minimumC-C round-trip power between Ùl and Ùm by PCC(Ùl ,Ùm).We now describe how the edge set ECC is constructed inthe CCTC scheme. We note that the procedure for obtainingECC is essentially the same as that for obtaining ENC. Therefore,we describe it with brevity. First, the set LCC of all theminimum power C-C links between clusters is identified. Foreach (vbl ,Hl ; vbm,Hm)CC ∈ LCC, the weight is defined as thecorresponding minimum C-C round-trip power. After computingall the weights of LCC, the sparse edge set ECC is definedas the MSF of GLCC = (V ,LCC). After obtaining ECC, thedesired final graph GCC = (V,E,V ,ECC) for CCTC scheme isconstructed.Next, we briefly remark on the additional receiver processingcosts required for the NCTC and CCTC schemes. Comparedto the transmitter cooperative topology control scheme in [16],additional decoding power is required in the NCTC and CCTCschemes because of multiple-node decoding. This additionaldecoding increases not only the power consumption, but alsothe overall system complexity. Furthermore, each receivinghelper node should report the received message decodabilityto the bridge node, which increases system overhead. Thereare some analytical studies on receiving power consumption[32], [33] and overhead [34] because it could be a critical issuein the case of ad-hoc networks. However, we note that thedecoding power consumption and related overhead are heavilydependent on the receiving strategy. For example, one can chosea receiving strategy in which the receiving helper nodes decodethe message in the order of channel conditions until a successfuldecoding node appears. In this case, the average decodingpower consumption and system complexity can be reduced.In addition, the serach for the optimal receiving strategy ishighly non-trivial and requires serious and independent study.However, despite its importance, in this primary effort ontopology control, we do not consider such issues any furtherto keep the problem tractable.Finally, we briefly consider the impact of mobility on the proposedtopology control schemes. Unfortunately, the proposedschemes are basically inapplicable except when the mobilityis very low. When a node moves, three situations can happen.First, in some situations in which only minor movement isinvolved, there may be no changes in the network topologyexcept for the configurations inside the cluster to which themoved node belongs. Second, in other situations, the clusterto which the moved node originally belonged, must be dividedinto more than one cluster. Finally, in still other situations,some clusters could be unified into one cluster by the N-Nlinks newly defined by the node movement. In the first case,the mobility problem is relatively simple. If the moved node isnot a bridge or helper node, the moved node could be simplyattached to the nearby cluster. On the other hand, if the movednode is a bridge or helper node, the bridge and/or helper nodesof the corresponding cooperative link are changed to one of thealternatives among the pre-stored alternative bridge and helpernodes. However, if there is no alternative bridge and/or helpernode or if the second or the third situation occurs, clusters andcooperative edges should be redefined. In addition, if severalnodes move at the same time, the second and third situationsmay happen more frequently and this is why the proposedschemes are applicable only when the mobility is very low.V. PERFORMANCE EVALUATIONAND NUMERICAL RESULTSIn this section, we analyze through simulations the performanceof the two proposed centralized topology controlschemes, namely, the NCTC and CNTC schemes, and comparethem to the NNTC scheme and cooperative topology controlscheme in [16] that is based solely on transmitter cooperation.For convenience, we call the topology control scheme in[16] the cluster-to-node topology control scheme (CNTC). ToMOON et al.: RECEIVER COOPERATION IN TOPOLOGY CONTROL FOR WIRELESS AD-HOC NETWORKS 1865TABLE ISIMULATION CONFIGURATION PARAMETERSour best knowledge, the CNTC scheme achieves the highestconnectivity with a power requirement that is onl marginallygreater than other existing topology control schemes. In thissection, we show that the proposed NCTC scheme providesbetter energy efficiency with marginal connectivity loss and theCCTC scheme allows both better energy efficiency and higherconnectivity than the CNTC scheme.A. Simulation ConfigurationThe system performance is evaluated through simulations inthis paper. Although analytic evaluation is generally more desirable,the performance of topology control schemes is very hardto analyze. To the best of our knowledge, only some analyticalresults have been obtained for the case of non-cooperativecommunications among an infinite number of nodes [35], [36]and previous studies [13]–[16] on cooperative topology controlschemes have only been evaluated through numerical simulations.For this reason, we study the performance throughsimulations. However, we provide partial analytical reasoningwhenever possible. Furthermore, to improve the value of theresults, we reflect practical situations as much as possiblein simulation configuration by employing channel parametersbased on actual field measurement [22] and the design parametersin the 3GPP standard [37].To describe the system configuration used for performanceevaluation, we need to specify the values of various parameters,which we divide into two categories: channel parameters andsystem design parameters. The channel parameters includethe reference path loss PLd0 , path loss exponent k, shadowingrandom variable Xó, offset correction factor c, and noisepower spectral density N0,i. First, we assumed that N0,i, i =1, . . . ,n, were identically given as −174 dBm/Hz, the noisepower spectral density at the room temperature. For the otherchannel parameters PLd0 , k, Xó, and c, we consider two setsof values, given in Table I, that represent suburban and urbanscenarios [22].The system design parameters considered in this section arethe number of nodes n, simulation area A, error threshold áô,packet error function f , and maximum transmit power Pmax.Parameters n and A are closely related to the node density,which determines the number of nodes participating in thecooperation. Therefore, we varied n and A to observe how theperformance is influenced by the node density. The choice oferror function f depends on the error correction coding schemeemployed. In this study, we assume that a convolutional codewith a constraint length of two is used as the error correctioncoding scheme with a packet length of 1,024 [38]. Hence, weused the actual packet error rate obtained through extensivesimulations with the aforementioned convolutional code for thepacket error function f . For the choice of áô, we used 10−2,a value often adopted as the target packet error rate in manysituations. Finally, we assumed that the node power Pi is limitedby Pmax = 250 mW, and Pi is uniformly distributed over a10 MHz bandwidth. Detailed values of the above channel andsystem parameters are summarized in Table I.B. ConnectivityTo compare the level of performance achievable with theproposed topology control schemes, we first consider a metriccalled connectivity to measure the average proportion of nodesconnected to a node. Before proceeding with the formal definitionof metric connectivity, we observe that the performance ofa given topology control scheme depends not only on the valuesof n and A but also on the distribution of these n nodes over areaA. For this reason, we assume that n(≥ 2) nodes are randomlyand uniformly distributed over a given area A in the followingdiscussion.To formally define connectivity, we first denote the set ofall nodes connected to node vi by R(vi). We note that the setR(vi) depends on the choice of topology control schemes. Forinstance, in the NNTC scheme, R(vi) is the set of all nodesconnected to vi by an N-N edge. On the other hand, in acooperative topology control scheme, R(vi) consists of all thenodes that are connected through cascaded N-N and cascadedcooperative edges. Therefore, the connectivity Ã (of a giventopology control scheme) is defined asÃ =1nE_nÓi=1|R(vi)|n−1, (13)where |R(vi)| denotes the cardinality of R(vi). Here, the expectationE[·] has been taken because the cardinality |R(vi)|depends on how the nodes are distributed over a given area.We note that R(vi)/(n − 1) is the proportion of nodes thatare connected to vi and hence Ã is the expected value of itsarithmetic mean. For notational convenience, the connectivitiesof CCTC, NCTC, NNTC, and CNTC schemes are denoted byÃCC, ÃNC, ÃNN, and ÃCN, respectively.In Fig. 4, the connectivity for various topology controlschemes is shown as a function of the number of nodes nfor three different areas and two different environments. Mostimportantly, we observe that ÃCC ≥ ÃCN ≥ ÃNC ≥ ÃNN forall values of n and A and for any environment considered.We clearly see that either transmitter or receiver cooperationimproves connectivity. The fact that the CCTC scheme achievesthe highest connectivity is hardly surprising, hence what weactually need to observe is how the NCTC and CNTC schemes1866 IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS, VOL. 14, NO. 4, APRIL 2015Fig. 4. Connectivity as a function of the number of nodes for various topology control schemes in various communication environments. (a) Urban. (b) Suburban.perform in comparison to it. In particular, since ÃCN ≤ ÃNC,we conclude that transmitter cooperation is more effective thanreceiver cooperation at achieving connectivity.C. Power ConsumptionSo far, we have observed that the CCTC scheme achievesthe highest connectivity and that the connectivity gap betweenthe CNTC and CCTC schemes is not large. In fact, it is notmore than 8% in most cases. Consequently, it is possible tosay that the CNTC scheme is a good alternative to the CCTCscheme if we consider connectivity only. However, the CNTCscheme is not as efficient as the CCTC scheme in terms ofpower consumption. Before proceeding with the analysis ofpower consumption, we define ˆ ECC to be the set of cluster pairscorresponding to the edges in ECC. In other words, (Ùl ,Ùm) ∈ˆ ECC, if and only if the edge set ECC contains the C-C edgebetween Ùl and Ùm. In a similar way, we denote the sets of thecluster pairs corresponding to edges in ENC and ECN by ˆ ENCand ˆ ECN, respectively.To quantitatively compare the power consumption of theCCTC and CNTC schemes, we now consider the following twoquantities¯PCC =1nE⎡⎣ Óð∈ˆECC∩ˆECNPCC(ð)⎤⎦ (14)and¯PCN =1nE⎡⎣ Óð∈ˆECC∩ˆECNPCN(ð)⎤⎦, (15)where PCN(ð) denotes the minimum C-N round-trip powerbetween the pair ð of clusters, similarly to PCC(ð) and PNC(ð)as defined in Section IV.We note that these quantities representthe average power required per each node to establish cooperativeedges between clusters in ˆ ECC ∩ ˆ ECN. Consequently, bycomparing ¯PCC and ¯PCN, we intend to compare the power requiredfor the CCTC and CNTC schemes to establish commoncooperative edges.Before proceeding with the evaluation of ¯PCC and ¯PCN, wefirst note that the two sets ˆ ECC− ˆ ECN and ˆ ECN − ˆ ECC of clusterpairs are not necessarily empty. Because the CCTC schemeemploys receiver cooperation in addition to transmitter cooperation,it appears reasonable to expect ˆ ECC − ˆ ECN to containsome sizable number of cooperative edges and ˆ ENC − ˆ ECC tobe empty. In fact, the average number of elements in ˆ ECC −ˆ ECN reaches as much as 25% of that of ˆ ECC ∩ˆ ECN in manysituations. However, interestingly, ˆ ECN − ˆ ECC is not necessarilyempty. This is because of the employment of MSF algorithm,that removes some redundant links. In other words, in CCTCschemes, some links used in the CNTC scheme are eliminatedby applying the MSF algorithms in some rare situations. Fromour numerical analysis, we found that the average cardinalityof ˆ ECN − ˆ ECC sometimes reaches as much as 8% of that ofˆ ECC∩ ˆ ECN. However, in most cases, the set ˆ ECN− ˆ ECC is emptyand hence ˆ ECC ∩ ˆ ECN is the same as ˆ ECN.Fig. 5(a) illustrates how the values of ¯PCC and ¯PCN changeas a function of the number of nodes n. We note that ¯PCCfirst increases as n increases and then decreases after n reachesa certain value. A similar tendency can be found in ¯PCN. Toexplain this non-monotonic performance of ¯PCC and ¯PCN, wedefine two quantitiesFCC =E_Óð∈ˆECC∩ˆECNPCC(ð)_E_|ˆ ECC ∩ ˆ ECN|_ (16)andFCN =E_Óð∈ˆECC∩ˆECNPCN(ð)_E_|ˆ ECC ∩ ˆ ECN|_ , (17)MOON et al.: RECEIVER COOPERATION IN TOPOLOGY CONTROL FOR WIRELESS AD-HOC NETWORKS 1867Fig. 5. The average additional power required per each node to establish cooperative edges in CCTC and CNTC schemes. (a) ¯PCN and ¯PCC. (b) ¯PCN over ¯PCC.to describe the average power consumed to establish a C-C linkand a C-N link, respectively. As a result, ¯PCC and ¯PCN can berewritten as¯PCC =1n·FCC ·N (18)and¯PCN =1n·FCN ·N , (19)where N = E[|ˆ ECC ∩ ˆ ECN|].While we cannot provide fully analytical behaviors of thequantities ¯PCC and ¯PCN, which is very difficult, it will be meaningfulto consider their qualitative behaviors. First, we note thatthe quantities FCC and FCN are mainly affected by the distancebetween clusters. It is natural to expect that the average clusterto-cluster distance will decrease with an increased number ofnodes n. However, the average cluster-to-cluster distance decreasesas a very slowly varying function of n, particularly aftern reaches a certain critical value. This is because two clustersare merged into one if the distance between them becomes tooclose. As a consequence, FCC and FCN decrease very slowly asn increases. For example, the minimum observed value of FCCwas only about 25% lower than the maximum observed value inthe simulation performed for an urban 2 × 2 km situation wheren ranged from 10 to 100. Because the quantities FCC and FCNare relatively unaffected by the variation of n, the behaviors of¯PCC and ¯PCN can possibly be accounted for by the behaviors ofthe average number of elementsN in ˆ ECC∩ ˆ ECN, which, in fact,varies very significantly as n varies. Let us observe, when thenode density is sufficiently low, that N increases as n increases,since increased n results in an increased number of clustersand then in an increased number of edges. However, when thenode density is high enough, adding nodes no longer makes thenumber of clusters larger because the addition of nodes nowresults in cluster unification. For this reason, N first increasesup to a certain critical value of n and then decreases againas n grows further. However, it is very difficult to predict thebehavior of N in a fully analytical manner, since N depends ontoo many factors such as node distribution, channel and fadingmodels, error probability function, and so on. As far as weknow, only a few analytical results [35], [36] have been derivedfor non-cooperative communications with an infinite number ofnodes and none for general cases or cooperative environments.We now discuss the simulation results of comparing ¯PCC and¯PCN. Because FCC and FCN vary slowly as functions of n, thevariations of ¯PCC and ¯PCN are dominantly determined by 1/nand N . When n = 10, N is almost zero since a very smallnumber of clusters exist and they are located too far away.As n increases up to a certain value, the number of clustersincreases so that the chance of cooperative communication alsoincreases. In this region, N grows faster than n, therefore, ¯PCCbecomes larger. On the other hand, if n exceeds a certain value,the number of clusters decreases, and eventually, it goes to one.Therefore, N quickly converges to zero with growing n, andthis is why ¯PCC decreases. In Fig. 5(a), we next observe that ¯PCCis always smaller than ¯PCN. To quantify the difference betweenthe two values, we illustrate the values of ¯PCN/¯PCC in Fig. 5(b),where we clearly see that ¯PCN is about 10–100% larger than¯PCC. From this figure, we clearly see that the CCTC schemerequires significantly less power than the CNTC scheme toestablish the same cooperative edges.Here, the question arises as to how the NCTC schemecompares to the CCTC scheme in terms of power consumption.First, we can compare the amount of power required for theCCTC and NCTC schemes to establish common cooperativeedges. In a similar comparison in Fig. 5, we noted that ¯PCCis significantly smaller than ¯PCN. However, in the case of theCCTC and NCTC schemes, there is virtually no differencebetween the powers required to establish common cooperativeedges. This is related to the assumption that the nodesparticipating in the cooperative transmission use the same1868 IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS, VOL. 14, NO. 4, APRIL 2015Fig. 6. The relative amount of power required to establish one more additional cooperative edge with the CCTC scheme in comparison with the NCTC andCNTC schemes. (a) Urban. (b) Suburban.transmission power as in CCTC scheme. Because of this constrainton the transmission power, only one node is selected,even in the CCTC scheme, to transmit signals almost alwayswhenever the cooperative edge is contained in both ˆ ECC andˆ ENC. Therefore, it can be said that the NCTC scheme is almostas efficient as the CCTC scheme in terms of power consumption.Consequently, if the connectivity is of less priority thanthe power consumption or if the situation is such that theconnectivities of CCTC and NCTC are almost the same valuesbecause of a very high node density, the NCTC scheme can beconsidered to be a good alternative to the CCTC scheme. This isparticularly so because the average power required to establisha cooperative edge in ˆ ECC− ˆ ENC is significantly larger, in manycases, than the power required to establish cooperative edgein ˆ ENC.To illustrate this, we consider the metric ñCCNC defined asñCCNC =DCCNCKNC(20)in whichDCCNC =E_Óð∈ˆECC−ˆENCPCC(ð)_E_|ˆ ECC − ˆ ENC|_ (21)andKNC =E_Óð∈ˆENCPNC(ð)_E_|ˆ ENC|_ . (22)We note that DCCNC denotes the power required to establish oneC-C link that can not be established in NCTC scheme andthat KCCNC is the power consumption required for one N-C link.Consequently, the metric ñCCNC measures the relative amount ofpower required to establish one more additional cooperativeedge using the CCTC scheme in comparison to the NCTCscheme. In a similar manner, we define the metric ñCCCN byñCCCN =E_Óð∈ˆECC−ˆECNPCC(ð)_E_|ˆ ECC − ˆ ECN|_ ÷E_Óð∈ˆECNPCN(ð)_E_|ˆ ECN|_ (23)=DCCCNKCN(24)to quantify the relative amount of power required to establishone more additional cooperative edge using the CCTC schemein comparison to the CNTC scheme.In Fig. 6, we plot ñCCNC and ñCCCN as functions of n. Here,we first observe that the numerical values of ñCCNC and ñCCCNare around 3 and 1.2, respectively, for all cases considered.We note that, as mentioned in the explanation of Fig. 5, thepower consumed to establish a single cooperative link decreaseswith growing n so that DCCNC, DCCCN, KNC, and KCN are alldecreasing functions of n. In addition, we note that the powerrequired to establish a cooperative link is mainly affected bythe number of transmitting nodes and the transmitting powerof each node. We also note that the cooperative link betweentwo clusters is established by only a small number of nodeslocated near the boundary of each cluster, even when the clustersize is very large. This means that the number of transmittingnodes is almost constant, regardless of n. Therefore, the rateof decreasing power consumption is primarily affected by thetransmitting power of each node, which is closely related to thedistance between clusters. Because the configuration of clustersis identically given by the NNTC scheme, as n increases, thedecreasing rate of the power required to establish cooperativelinks is relatively similar for all three cooperative schemes,namely, the NCTC, CNTC, and CCTC schemes. For thisMOON et al.: RECEIVER COOPERATION IN TOPOLOGY CONTROL FOR WIRELESS AD-HOC NETWORKS 1869reason, the ratios DCCNC/KNC and DCCCN/KCN remain roughly thesame regardless of the value of n.We next observe that the values of ñCCNC, plotted by solid purplelines, are always around three. This means that to establishan edge that cannot be established in the NCTC scheme, theCCTC scheme requires about three times the power requiredto establish an edge in the NCTC scheme, regardless of thescenario and node density considered. Combining this resultwith the connectivity result in Fig. 4, we gain an importantinsight into the system design. When n = 50, the connectivityof the CCTC scheme is almost twice that of the NCTC scheme.Therefore, a three-fold increase in power consumption could bea reasonable choice if connectivity is of the highest priority.However, when n = 100, by employing the CCTC scheme,one would achieve 0.13% increase in connectivity, but threetimes more power would still be required. Therefore, somesystem designers may prefer the NCTC scheme to the CCTCscheme, for instance, where power efficiency is of the highestpriority or connectivity increase is not an issue. In contrast,ñCCCN, plotted by dotted by the green line, is about 1.2 in allcases. This means that only 20% more power is required to adda new cooperative edge using the CCTC scheme that cannotbe established in the CNTC scheme. Consequently, one canreplace the CNTC scheme with the CCTC scheme without aserious power consumption burden, regardless of node density.VI. CONCLUSIONIn this paper, we proposed to employ receiver cooperationin topology control to improve energy efficiency as well asnetwork connectivity. In particular, we proposed two centralizedtopology control schemes, one based solely on receivercooperation, and the other based both on transmitter and receivercooperations. For comparison, we also considered atopology control scheme that is based solely on transmittercooperation. By extensive simulation, we showed that we canimprove both connectivity and energy efficiency if we employreceiver cooperation in addition to transmitter cooperation.Consequently, it is generally more desirable to employ bothreceiver and transmitter cooperation than to employ transmittercooperation only. We also showed that the increase in networkconnectivity by employing transmitter cooperation in additionto receiver cooperation is at the expense of significantly increasedenergy consumption. For this reason, we conclude thatthe system based only on receiver cooperation could prove to bea good alternative to one based both on receiver and transmittercooperation, if energy efficiency is of the highest priority or theincrease in connectivity is no longer of serious concern.

Real-Time Path Planning Based on Hybrid-VANET-Enhanced Transportation System

05/08/201902/07/2019 by admin

Abstract—Social networks have been recently employed as asource of information for event detection, with particular referenceto road traffic congestion and car accidents. In this paper, wepresent a real-time monitoring system for traffic event detectionfrom Twitter stream analysis. The system fetches tweets fromTwitter according to several search criteria; processes tweets, byapplying text mining techniques; and finally performs the classificationof tweets. The aim is to assign the appropriate class label toeach tweet, as related to a traffic event or not. The traffic detectionsystem was employed for real-time monitoring of several areas ofthe Italian road network, allowing for detection of traffic eventsalmost in real time, often before online traffic news web sites. Weemployed the support vector machine as a classification model,and we achieved an accuracy value of 95.75% by solving a binaryclassification problem (traffic versus nontraffic tweets). We werealso able to discriminate if traffic is caused by an external event ornot, by solving a multiclass classification problem and obtainingan accuracy value of 88.89%.Index Terms—Traffic event detection, tweet classification, textmining, social sensing.I. INTRODUCTIONSOCIAL network sites, also called micro-blogging services(e.g., Twitter, Facebook, Google+), have spread in recentyears, becoming a new kind of real-time information channel.Their popularity stems from the characteristics of portabilitythanks to several social networks applications for smartphonesand tablets, easiness of use, and real-time nature [1], [2]. Peopleintensely use social networks to report (personal or public) reallifeevents happening around them or simply to express theiropinion on a given topic, through a public message. Socialnetworks allow people to create an identity and let them shareit in order to build a community. The resulting social networkis then a basis for maintaining social relationships, findingManuscript received July 2, 2014; revised October 7, 2014 and December 16,2014; accepted February 10, 2015. Date of publication March 10, 2015; date ofcurrent version July 31, 2015. This work was carried out in the frameworkof and was supported by the SMARTY project, funded by “ProgrammaOperativo Regionale (POR) 2007–2013”—objective “Competitività regionalee occupazione” of the Tuscany Region. The Associate Editor for this paper wasQ. Zhang.E. D’Andrea is with the Research Center “E. Piaggio,” University of Pisa,56122 Pisa, Italy (e-mail: eleonora.dandrea@for.unipi.it).P. Ducange is with the Faculty of Engineering, eCampus University, 22060Novedrate, Italy (e-mail: pietro.ducange@uniecampus.it).B. Lazzerini and F. Marcelloni are with the Dipartimento di Ingegneriadell’Informazione, University of Pisa, 56122 Pisa, Italy (e-mail: b.lazzerini@iet.unipi.it; f.marcelloni@iet.unipi.it).Color versions of one or more of the figures in this paper are available onlineat http://ieeexplore.ieee.org.Digital Object Identifier 10.1109/TITS.2015.2404431users with similar interests, and locating content and knowledgeentered by other users [3].The user message shared in social networks is called StatusUpdate Message (SUM), and it may contain, apart from thetext, meta-information such as timestamp, geographic coordinates(latitude and longitude), name of the user, links to otherresources, hashtags, and mentions. Several SUMs referring toa certain topic or related to a limited geographic area may provide,if correctly analyzed, great deal of valuable informationabout an event or a topic. In fact, we may regard social networkusers as social sensors [4], [5], and SUMs as sensor information[6], as it happens with traditional sensors.Recently, social networks and media platforms have beenwidely used as a source of information for the detection ofevents, such as traffic congestion, incidents, natural disasters(earthquakes, storms, fires, etc.), or other events. An eventcan be defined as a real-world occurrence that happens in aspecific time and space [1], [7]. In particular, regarding trafficrelatedevents, people often share by means of an SUM informationabout the current traffic situation around them whiledriving. For this reason, event detection from social networksis also often employed with Intelligent Transportation Systems(ITSs). An ITS is an infrastructure which, by integrating ICTs(Information and Communication Technologies) with transportnetworks, vehicles and users, allows improving safety and managementof transport networks. ITSs provide, e.g., real-timeinformation about weather, traffic congestion or regulation, orplan efficient (e.g., shortest, fast driving, least polluting) routes[4], [6], [8]–[14].However, event detection from social networks analysis isa more challenging problem than event detection from traditionalmedia like blogs, emails, etc., where texts are wellformatted[2]. In fact, SUMs are unstructured and irregulartexts, they contain informal or abbreviated words, misspellingsor grammatical errors [1]. Due to their nature, they are usuallyvery brief, thus becoming an incomplete source of information[2]. Furthermore, SUMs contain a huge amount of not usefulor meaningless information [15], which has to be filtered.According to Pear Analytics,1 it has been estimated that over40% of all Twitter2 SUMs (i.e., tweets) is pointless with nouseful information for the audience, as they refer to the personalsphere [16]. For all of these reasons, in order to analyze theinformation coming from social networks, we exploit text miningtechniques [17], which employ methods from the fields of1http://www.pearanalytics.com/, 2009.2https://twitter.com.1524-9050 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.2270 IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 16, NO. 4, AUGUST 2015data mining, machine learning, statistics, and Natural LanguageProcessing (NLP) to extract meaningful information [18].More in detail, text mining refers to the process of automaticextraction of meaningful information and knowledge from unstructuredtext. The main difficulty encountered in dealing withproblems of text mining is caused by the vagueness of naturallanguage. In fact, people, unlike computers, are perfectly able tounderstand idioms, grammatical variations, slang expressions,or to contextualize a given word. On the contrary, computershave the ability, lacking in humans, to quickly process largeamounts of information [19], [20].The text mining process is summarized in the following.First, the information content of the document is convertedinto a structured form (vector space representation). In fact,most of text mining techniques are based on the idea that adocument can be faithfully represented by the set of wordscontained in it (bag-of-words representation [21]). Accordingto this representation, each document j of a collection ofdocuments is represented as an M-dimensional vector Vj ={w(tj1), . . . , w(tji), . . . , w(tjM)}, where M is the number ofwords defined in the document collection, and w(tji) specifiesthe weight of the word ti in document j. The simplest weightingmethod assigns a binary value to w(tji), thus indicating theabsence or the presence of the word ti, while other methodsassign a real value to w(tji). During the text mining process,several operations can be performed [21], depending on the specificgoal, such as: i) linguistic analysis through the applicationof NLP techniques, indexing and statistical techniques, ii) textfiltering by means of specific keywords, iii) feature extraction,i.e., conversion of textual features (e.g., words) in numericfeatures (e.g., weights), that a machine learning algorithm isable to process, and iv) feature selection, i.e., reduction of thenumber of features in order to take into account only the mostrelevant ones. The feature selection is particularly important,since one of the main problems in text mining is the highdimensionality of the feature space _M. Then, data miningand machine learning algorithms (i.e., support vector machines(SVMs), decision trees, neural networks, etc.) are applied tothe documents in the vector space representation, to build classification,clustering or regression models. Finally, the resultsobtained by the model are interpreted by means of measuresof effectiveness (e.g., statistical-based measures) to verify theaccuracy achieved. Additionally, the obtained results may beimproved, e.g., by modifying the values of the parameters usedand repeating the whole process.Among social networks platforms, we took into accountTwitter, as the majority of works in the literature regardingevent detection focus on it. Twitter is nowadays the mostpopular micro-blogging service; it counts more than 600 millionactive users,3 sharing more than 400 million SUMs perday [1]. Regarding the aim of this paper, Twitter has severaladvantages over the similar micro-blogging services. First,tweets are up to 140 characters, enhancing the real-time andnews-oriented nature of the platform. In fact, the life-time oftweets is usually very short, thus Twitter is the social network3http://www.statisticbrain.com/twitter-statisticsplatform that is best suited to study SUMs related to real-timeevents [22]. Second, each tweet can be directly associated withmeta-information that constitutes additional information. Third,Twitter messages are public, i.e., they are directly available withno privacy limitations. For all of these reasons, Twitter is a goodsource of information for real-time event detection and analysis.In this paper, we propose an intelligent system, based on textmining and machine learning algorithms, for real-time detectionof traffic events from Twitter stream analysis. The system,after a feasibility study, has been designed and developed fromthe ground as an event-driven infrastructure, built on a ServiceOriented Architecture (SOA) [23]. The system exploits availabletechnologies based on state-of-the-art techniques for textanalysis and pattern classification. These technologies and techniqueshave been analyzed, tuned, adapted, and integrated inorder to build the intelligent system. In particular, we present anexperimental study, which has been performed for determiningthe most effective among different state-of-the-art approachesfor text classification. The chosen approach was integrated intothe final system and used for the on-the-field real-time detectionof traffic events.The paper has the following structure. Section II summarizesrelated work about event detection from social Twitter streamanalysis. Section III outlines the architecture of the proposedsystem for traffic detection, by describing the methodologyused to collect, elaborate, and classify SUMs, with particularreference to SUMs extracted from the Twitter stream.Section IV describes the setup of the system. Section V presentsthe results achieved with different classification models andprovides a comparison with similar works in the literature.Section VI presents the real-world monitoring application forreal-time detection of traffic events. Finally, Section VII providesconcluding remarks.II. RELATED WORKWith reference to current approaches for using social mediato extract useful information for event detection, we need todistinguish between small-scale events and large-scale events.Small-scale events (e.g., traffic, car crashes, fires, or localmanifestations) usually have a small number of SUMs relatedto them, belong to a precise geographic location, and areconcentrated in a small time interval. On the other hand, largescaleevents (e.g., earthquakes, tornados, or the election of apresident) are characterized by a huge number of SUMs, and bya wider temporal and geographic coverage [24]. Consequently,due to the smaller number of SUMs related to small-scaleevents, small-scale event detection is a non-trivial task. Severalworks in the literature deal with event detection from socialnetworks. Many works deal with large-scale event detection [6],[25]–[28] and only a few works focus on small-scale events [9],[12], [24], [29]–[31].Regarding large-scale event detection, Sakaki et al. [6] useTwitter streams to detect earthquakes and typhoons, by monitoringspecial trigger-keywords, and by applying an SVM as abinary classifier of positive events (earthquakes and typhoons)and negative events (non-events or other events). In [25],the authors present a method for detecting real-world events,D’ANDREA et al.: REAL-TIME DETECTION OF TRAFFIC FROM TWITTER STREAM ANALYSIS 2271such as natural disasters, by analyzing Twitter streams andby employing both NLP and term-frequency-based techniques.Chew et al. [26] analyze the content of tweets shared during theH1N1 (i.e., swine flu) outbreak, containing keywords and hashtagsrelated to the H1N1 event to determine the kind of informationexchanged by social media users. De Longueville et al.[27] analyze geo-tagged tweets to detect forest fire events andoutline the affected area.Regarding small-scale event detection, Agarwal et al. [29]focus on the detection of fires in a factory from Twitter streamanalysis, by using standard NLP techniques and a Naive Bayes(NB) classifier. In [30], information extracted from Twitterstreams is merged with information from emergency networksto detect and analyze small-scale incidents, such as fires.Wanichayapong et al. [12] extract, using NLP techniques andsyntactic analysis, traffic information from microblogs to detectand classify tweets containing place mentions and trafficinformation. Li et al. [31] propose a system, called TEDAS, toretrieve incident-related tweets. The system focuses on Crimeand Disaster-related Events (CDE) such as shootings, thunderstorms,and car accidents, and aims to classify tweets asCDE events by exploiting a filtering based on keywords, spatialand temporal information, number of followers of the user,number of retweets, hashtags, links, and mentions. Sakaki et al.[9] extract, based on keywords, real-time driving informationby analyzing Twitter’s SUMs, and use an SVM classifierto filter “noisy” tweets not related to road traffic events.Schulz et al. [24] detect small-scale car incidents from Twitterstream analysis, by employing semantic web technologies,along with NLP and machine learning techniques. They performthe experiments using SVM, NB, and RIPPER classifiers.In this paper, we focus on a particular small-scale event, i.e.,road traffic, and we aim to detect and analyze traffic eventsby processing users’ SUMs belonging to a certain area andwritten in the Italian language. To this aim, we propose a systemable to fetch, elaborate, and classify SUMs as related to a roadtraffic event or not. To the best of our knowledge, few papershave been proposed for traffic detection using Twitter streamanalysis. However, with respect to our work, all of them focuson languages different from Italian, employ different inputfeatures and/or feature selection algorithms, and consider onlybinary classifications. In addition, a few works employ machinelearning algorithms [9], [24], while the others rely on NLPtechniques only. The proposed system may approach both binaryand multi-class classification problems. As regards binaryclassification, we consider traffic-related tweets, and tweets notrelated with traffic. As regards multi-class classification, wesplit the traffic-related class into two classes, namely trafficcongestion or crash, and traffic due to external event. In thispaper, with external event we refer to a scheduled event (e.g.,a football match, a concert), or to an unexpected event (e.g.,a flash-mob, a political demonstration, a fire). In this way weaim to support traffic and city administrations for managingscheduled or unexpected events in the city.Moreover, the proposed system could work together withother traffic sensors (e.g., loop detectors, cameras, infraredcameras) and ITS monitoring systems for the detection of trafficdifficulties, providing a low-cost wide coverage of the roadFig. 1. System architecture for traffic detection from Twitter stream analysis.network, especially in those areas (e.g., urban and suburban)where traditional traffic sensors are missing.Concluding, the proposed ITS is characterized by the followingstrengths with respect to the current research aimed atdetecting traffic events from social networks: i) it performs amulti-class classification, which recognizes non-traffic, trafficdue to congestion or crash, and traffic due to external events;ii) it detects the traffic events in real-time; and iii) it is developedas an event-driven infrastructure, built on an SOA architecture.As regards the first strength, the proposed ITS could be a valuabletool for traffic and city administrations to regulate trafficand vehicular mobility, and to improve the management ofscheduled or unexpected events. For what concerns the secondstrength, the real-time detection capability allows obtaining reliableinformation about traffic events in a very short time, oftenbefore online news web sites and local newspapers. As far as thethird strength is concerned, with the chosen architecture, we areable to directly notify the traffic event occurrence to the driversregistered to the system, without the need for them to access officialnews websites or radio traffic news channels, to get trafficinformation. In addition, the SOA architecture permits to exploittwo important peculiarities, i.e., scalability of the service(e.g., by using a dedicated server for each geographic area), andeasy integration with other services (e.g., other ITS services).III. ARCHITECTURE OF THE TRAFFIC DETECTION SYSTEMIn this section, our traffic detection system based onTwitter streams analysis is presented. The system architectureis service-oriented and event-driven, and is composed of threemain modules, namely: i) “Fetch of SUMs and Pre-processing”,ii) “Elaboration of SUMs”, iii) “Classification of SUMs”. Thepurpose of the proposed system is to fetch SUMs from Twitter,to process SUMs by applying a few text mining steps, andto assign the appropriate class label to each SUM. Finally, asshown in Fig. 1, by analyzing the classified SUMs, the systemis able to notify the presence of a traffic event.The main tools we have exploited for developing the systemare: 1) Twitter’s API,4 which provides direct access to the4http://dev.twitter.com2272 IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 16, NO. 4, AUGUST 2015public stream of tweets; 2) Twitter4J,5 a Java library that weused as a wrapper for Twitter’s API; 3) the Java API providedbyWeka (Waikato Environment for Knowledge Analysis) [32],which we mainly employed for data pre-processing and textmining elaboration.We recall that both the “Elaboration of SUMs” and the“Classification of SUMs” modules require setting the optimalvalues of a few specific parameters, by means of a supervisedlearning stage. To this aim, we exploited a training setcomposed by a set of SUMs previously collected, elaborated,and manually labeled. Section IV describes in greater detailhow the specific parameters of each module are set during thesupervised learning stage.In the following, we discuss in depth the elaboration madeon the SUMs by each module of the traffic detection system.A. Fetch of SUMs and Pre-ProcessingThe first module, “Fetch of SUMs and Pre-processing”,extracts raw tweets from the Twitter stream, based on one ormore search criteria (e.g., geographic coordinates, keywordsappearing in the text of the tweet). Each fetched raw tweet contains:the user id, the timestamp, the geographic coordinates,a retweet flag, and the text of the tweet. The text may containadditional information, such as hashtags, links, mentions, andspecial characters. In this paper, we took only Italian languagetweets into account. However, the system can be easily adaptedto cope with different languages.After the SUMs have been fetched according to the specificsearch criteria, SUMs are pre-processed. In order to extract onlythe text of each raw tweet and remove all meta-informationassociated with it, a Regular Expression filter [33] is applied.More in detail, the meta-information discarded are: user id,timestamp, geographic coordinates, hashtags, links, mentions,and special characters. Finally, a case-folding operation isapplied to the texts, in order to convert all characters to lowercase. At the end of this elaboration, each fetched SUM appearsas a string, i.e., a sequence of characters. We denote the jthSUM pre-processed by the first module as SUMj , with j =1, . . . , N, where N is the total number of fetched SUMs.B. Elaboration of SUMsThe second processing module, “Elaboration of SUMs”, isdevoted to transforming the set of pre-processed SUMs, i.e., aset of strings, in a set of numeric vectors to be elaborated bythe “Classification of SUMs” module. To this aim, some textmining techniques are applied in sequence to the pre-processedSUMs. In the following, the text mining steps performed in thismodule are described in detail:a) tokenization is typically the first step of the text miningprocess, and consists in transforming a stream of charactersinto a stream of processing units called tokens (e.g.,syllables, words, or phrases). During this step, other operationsare usually performed, such as removal of punctua-5http://twitter4j.orgtion and other non-text characters [18], and normalizationof symbols (e.g., accents, apostrophes, hyphens, tabs andspaces). In the proposed system, the tokenizer removesall punctuation marks and splits each SUM into tokenscorresponding to words (bag-of-words representation). Atthe end of this step, each SUMj is represented as thesequence of words contained in it. We denote the jthtokenized SUM as SUMTj =_tTj1, . . . , tTjh, . . . , tTjHj_,where tTjh is the hth token and Hj is the total numberof tokens in SUMTj ;b) stop-word filtering consists in eliminating stop-words,i.e., words which provide little or no information to thetext analysis. Common stop-words are articles, conjunctions,prepositions, pronouns, etc. Other stop-words arethose having no statistical significance, that is, those thattypically appear very often in sentences of the consideredlanguage (language-specific stop-words), or in the set oftexts being analyzed (domain-specific stop-words), andcan therefore be considered as noise [34]. The authorsin [35] have shown that the 10 most frequent wordsin texts and documents of the English language areabout the 20–30% of the tokens in a given document.In the proposed system, the stop-word list for the Italianlanguage was freely downloaded from the SnowballTartarus website6 and extended with other ad hoc definedstop-words. At the end of this step, each SUMis thus reduced to a sequence of relevant tokens. Wedenote the jth stop-word filtered SUM as SUMSW_ j =tSWj1 , . . . , tSWjk , . . . , tSWjKj_, where tSWjk is the kth relevanttoken and Kj , with Kj ≤ Hj , is the total numberof relevant tokens in SUMSWj . We recall that a relevanttoken is a token that does not belong to the set of stopwords;c) stemming is the process of reducing each word (i.e.,token) to its stem or root form, by removing its suffix. Thepurpose of this step is to group words with the same themehaving closely related semantics. In the proposed system,the stemmer exploits the Snowball Tartarus Stemmer7 forthe Italian language, based on the Porter’s algorithm [36].Hence, at the end of this step each SUM is represented asa sequence of stems extracted from the tokens containedin it. We denote the jth stemmed SUM as SUMS_ j =tSj1, . . . , tSjl, . . . , tSjLj_, where tSjl is the lth stem and Lj ,with Lj ≤ Kj , is the total number of stems in SUMSj ;d) stem filtering consists in reducing the number of stems ofeach SUM. In particular, each SUM is filtered by removingfrom the set of stems the ones not belonging to theset of relevant stems. The set of F relevant stems RS ={ˆs1, . . . , ˆsf , . . . , ˆsF } is identified during the supervisedlearning stage that will be discussed in Section IV.At the end of this step, each SUM is represented asa sequence of relevant stems. We denote the jth filteredSUM as SUMSFj =_tSFj1 , . . . , tSFjp , . . . , tSFjPj_, where6http://snowball.tartarus.org/algorithms/italian/stop.txt7http://snowball.tartarus.org/algorithms/italian/stemmer.htmlD’ANDREA et al.: REAL-TIME DETECTION OF TRAFFIC FROM TWITTER STREAM ANALYSIS 2273Fig. 2. Steps of the text mining elaboration applied to a sample tweet.tSFjp∈ RS is the pth relevant stem and Pj , with Pj ≤ Ljand Pj ≤ F, is the total number of relevant stems inSUMSFj ;e) feature representation consists in building, for eachSUM, the corresponding vector of numeric features. Indeed,in order to classify the SUMs, we have to representthem in the same feature space. In particular,we consider the F-dimensional set of features X ={X1, . . . , Xf, . . . , XF } corresponding to the set of relevantstems. For each SUMSFj we define the vectorxj = {xj1, . . . , xjf , . . . , xjF } where each element is setaccording to the following formula:xjf =_wf if stem ˆsf ∈ SUMSFj0 otherwise.(1)In (1), wf is the numeric weight associated to therelevant stem ˆsf : we will discuss how this weight iscomputed in Section IV.In Fig. 2, we summarize all the steps applied to a sampletweet by the “Elaboration of SUMs” module.C. Classification of SUMsThe third module, “Classification of SUMs”, assigns to eachelaborated SUM a class label related to traffic events. Thus, theoutput of this module is a collection of N labeled SUMs. To theaim of labeling each SUM, a classification model is employed.The parameters of the classification model have been identifiedduring the supervised learning stage. Actually, as it will bediscussed in Section V, different classification models havebeen considered and compared. The classifier that achievedthe most accurate results was finally employed for the realtimemonitoring with the proposed traffic detection system. Thesystem continuously monitors a specific region and notifies thepresence of a traffic event on the basis of a set of rules that canbe defined by the system administrator. For example, when thefirst tweet is recognized as a traffic-related tweet, the systemmay send a warning signal. Then, the actual notification of thetraffic event may be sent after the identification of a certainnumber of tweets with the same label.IV. SETUP OF THE SYSTEMAs stated previously, a supervised learning stage is requiredto perform the setup of the system. In particular, we need toidentify the set of relevant stems, the weights associated witheach of them, and the parameters that describe the classificationmodels. We employ a collection of Ntr labeled SUMs astraining set. During the learning stage, each SUM is elaboratedby applying the tokenization, stop-word filtering, and stemmingsteps. Then, the complete set of stems is built as follows:CS =⎛⎝N_trj=1SUMSj⎞⎠ = {s1, . . . , sq, . . . , sQ}. (2)CS is the union of all the stems extracted from the Ntr SUMsof the training set. We recall that SUMSj is the set of stemsthat describes the jth SUM after the stemming step in thetraining set.Then, we compute the weight of each stem in CS, whichallows us to establish the importance of each stem sq in thecollection of SUMs of the training set, by using the InverseDocument Frequency (IDF) index aswq = IDFq = ln(Ntr/Nq), (3)where Nq is the number of SUMs of the training set in whichthe stem sq occurs [37]. The IDF index is a simplified version ofthe TF-IDF (Term Frequency-IDF) index [38]–[40], where theTF part considers the frequency of a specific stem within eachSUM. In fact, we heuristically found that the same stem seldomappears more than once in an SUM. On the other hand, we performedseveral experiments also with the TF-IDF index and we2274 IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 16, NO. 4, AUGUST 2015verified that the performance in terms of classification accuracyis similar to the one obtained by using only the IDF index. Thus,we decided to adopt the simpler IDF index as weight.In order to select the set of relevant stems, a feature selectionalgorithm is applied. SUMs are described by a set{S1, . . . , Sq, . . . , SQ} of Q features, where each feature Sqcorresponds to the stem sq. The possible values of feature Sqare wq and 0.Then, as suggested in [41], to evaluate the quality of eachstem sq, we employ a method based on the computation ofthe Information Gain (IG) value between feature Sq and outputC = {c1, . . . , cr, . . . , cR}, where cr is one of the R possibleclass labels (two or three in our case). The IG value between Sqand C is calculated as IG(C, Sq) = H(C) − H(C|Sq), whereH(C) represents the entropy of C, and H(C|Sq) represents theentropy of C after the observation of feature Sq.Finally, we identified the set of relevant stems RS by selectingall the stems which have a positive IG value. We recall thatthe stem selection process based on IG values is a standard andeffective method widely used in the literature [40], [42].The last part of the supervised learning stage regards theidentification of the most suited classification models and thesetting of their structural parameters. We took into accountseveral classification algorithms widely used in the literaturefor text classification tasks [43], namely, i) SVM [44], ii) NB[45], iii) C4.5 decision tree [46], iv) k-nearest neighbor (kNN)[47], and v) PART [48]. The learning algorithms used to buildthe aforementioned classifiers will be briefly discussed in thefollowing section.V. EVALUATION OF THE TRAFFIC DETECTION SYSTEMIn this section, we discuss the evaluation of the proposedsystem. We performed several experiments using two differentdatasets. For each dataset, we built and compared seven differentclassification models: SVM, NB, C4.5, kNN (with k equalto 1, 2, and 5), and PART. In the following, we describe howwe generated the datasets to complete the setup of the system,and we recall the employed classification models. Then, wepresent the achieved results, and the statistical metrics used toevaluate the performance of the classifiers. Finally, we providea comparison with some results extracted from other works inthe literature.A. Description of the DatasetsWe built two different datasets, i.e., a 2-class dataset, and a3-class dataset. For each dataset, tweets in the Italian languagewere collected using the “Fetch of SUMs and Pre-processing”module by setting some search criteria (e.g., presence of keywords,geographic coordinates, date and time of posting). Then,the SUMs were manually labeled, by assigning the correct classlabel.1) 2-Class Dataset: The first dataset consists of tweetsbelonging to two possible classes, namely i) road traffic-relatedtweets (traffic class), and ii) tweets not related with road traffic(non-traffic class). The tweets were fetched in a time span ofabout four hours from the same geographic area. First, wefetched candidate tweets for traffic class by using the followingsearch criteria:— geographic area of origin of the tweet: Italy. We setthe center of the area in Rome (latitude and longitudeequal to 41◦ 53’ 35” and 12◦ 28’ 58”, respectively)and we set a radius of about 600 km to cover approximatelythe whole country;— time and date of posting: tweets belong to a timespan of four evening hours of two weekend days ofMay 2013;— keywords contained in the text of the tweet: we applythe or-operator on the set of keywords S1, composedby the three most frequently used traffic-relatedkeywords, S1 = {“traffico”(traffic), “coda”(queue),“incidente”(crash)} , with the aim of selecting tweetscontaining at least one of the above keywords. Theresulting condition can be expressed by:CondA: “traffico” or “coda” or “incidente”.Then, we fetched the candidate tweets for non-traffic classusing the same search criteria for geographic area, and timeand date, but without setting any keyword. Obviously, this time,tweets containing traffic-related keywords from set S1, alreadyfound in the previous fetch, were discarded.Finally, the tweets were manually labeled with two possibleclass labels, i.e., as related to road traffic event (traffic), e.g.,accidents, jams, queues, or not (non-traffic). More in detail,first we read, interpreted, and correctly assigned a traffic classlabel to each candidate traffic class tweet. Among all candidatetraffic class tweets, we actually labeled 665 tweets with thetraffic class. About 4% of candidate traffic class tweets werenot labeled with the traffic class label.With the aim of correctlytraining the system, we added these tweets to the non-trafficclass. Indeed, we collected also a number of tweets containingthe traffic-related keywords defined in S1, but actually notconcerning road traffic events. Such tweets are related to, e.g.,illegal drug trade, network traffic, or organ trafficking. It isworth noting that, as it happens in the English language, severalwords in the Italian language, e.g., “traffic” or “incident”, aresuitable in several contexts. So, for instance, the events “trafficodi droga” (drug trade), “traffico di organi” (organ trafficking),“incidente diplomatico” (diplomatic scandal), “traffico dati”(network traffic) could be easily mistaken for road trafficrelatedevents.Then, in order to obtain a balanced dataset, we randomlyselected tweets from the candidate tweets of non-traffic classuntil reaching 665 non-traffic class tweets, and we manuallyverified that the selected tweets did not belong to the trafficclass. Thus, the final 2-class dataset consists of 1330 tweets andis balanced, i.e., it contains 665 tweets per class.Table I shows the textual part of a selection of tweets fetchedby the system with the corresponding, manually added, classlabel. In Table I, tweets #1, #2 and #3 are examples of trafficclass tweets, tweet #4 is an example of a non-traffic class tweet,tweets #5 and #6 are examples of tweets containing trafficrelatedkeywords, but belonging to the non-traffic class. In thetable, for an easier understanding, the keywords appearing inD’ANDREA et al.: REAL-TIME DETECTION OF TRAFFIC FROM TWITTER STREAM ANALYSIS 2275TABLE ISOME EXAMPLES OF TWEETS AND CORRESPONDING CLASSES FOR THE 2-CLASS DATASETTABLE IISIGNIFICANT FEATURES RELATED TO THE TRAFFIC CLASSthe text of each tweet are underlined. Table II shows some of themost important textual features (i.e., stems) and their meaning,related to the traffic class tweets, identified by the system forthis dataset.2) 3-Class Dataset: The second dataset consists of tweetsbelonging to three possible classes. In this case we want todiscriminate if traffic is caused by an external event (e.g., a footballmatch, a concert, a flash-mob, a political demonstration,a fire) or not. Even though the current release of the systemwas not designed to identify the specific event, knowing thatthe traffic difficulty is caused by an external event could beuseful to traffic and city administrations, for regulating trafficand vehicular mobility, or managing scheduled events in thecity. More in detail, we took into account four possible externalevents, namely, i) matches, ii) processions, iii) music concerts,and iv) demonstrations. Thus, in this dataset the three possibleclasses are: i) traffic due to external event, ii) traffic congestionor crash, and iii) non-traffic. The tweets were fetched in asimilar way as described before. More in detail, first, we fetchedcandidate road traffic-related tweets due to an external event(traffic due to external event class) according to the followingsearch criteria:— geographic area of origin of the tweet: Italy, parametersset as in the case of the 2-class dataset;— time and date of posting: parameters set as in the caseof the 2-class dataset, but different hours of the sameweekend days are used;— keywords contained in the text of the tweet: foreach external event aforementioned, we took into accountonly one keyword, thus obtaining the set S2 ={“partita”(match), “processione” (procession), “concerto”(concert), “manifestazione” (demonstration)}.Next we combined each keyword representing theexternal event with one of the traffic-related keywordsfrom set S3 = {“traffico”(traffic), “coda”(queue)}.Finally, we applied the and-operator between eachkeyword from set S2 and the conditionCondB expressed as:CondB: “traffico” or “coda”,thus obtaining the following conditions:CondC: CondB and “partita”,CondD: CondB and “processione”CondE: CondB and “concerto”,CondF : CondB and “manifestazione”.Then, we fetched candidate tweets related to traffic congestion,crashes, and jams (traffic congestion or crash class) byusing the following search criteria:— geographic area of origin of the tweet: Italy, parametersset as as in the case of the 2-class dataset;— time and date of posting: parameters set as in the caseof the 2-class dataset, but different hours of the sameweekend days are used;— keywords contained in the text of the tweet: we combinedthe mentioned above keywords from set S1 inthree possible sets: S4={“traffico”(traffic), “incidente”(crash)}, S5 = {“incidente”(crash), coda(queue)},and the already defined set S3. Then we used theand-operator to define the exploited conditions asfollows:CondG: “traffico” and “incidente”,CondH: “traffico” and “coda”,CondI : “incidente” and “coda”.Obviously, as done before, tweets containing external eventrelatedkeywords, already found in the previous fetch, werediscarded. Further, we fetched the candidate tweets of nontrafficclass using the same search criteria for geographic area,and time and date, but without setting any keyword. Again,tweets already found in previous fetches were discarded.Finally, the tweets were manually labeled with three possibleclass labels. We first labeled the candidate tweets of trafficdue to external event class (this set of tweets was the smallerone), and we identified 333 tweets actually associated with thisclass. Then, we randomly selected 333 tweets for each of the2276 IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 16, NO. 4, AUGUST 2015TABLE IIISOME EXAMPLES OF TWEETS AND CORRESPONDING CLASSES FOR THE 3-CLASS DATASETtwo remaining classes. Also, in this case, we manually verifiedthe correctness of the labels associated to the selected tweets.Finally, as done before, we added to the non-traffic class alsotweets containing keywords related to traffic congestion and toexternal events but not concerning road-traffic events. The final3-class dataset consists of 999 tweets and it is balanced, i.e., ithas 333 tweets per class.Table III shows a selection of tweets fetched by the systemfor the 3-class dataset, with the corresponding, manually added,class label. In Table III, tweets #1, #2, #3 and #4 are examplesof tweets belonging to the class traffic due to external event: inmore detail, #1 is related to a procession event, #2 is relatedto a match event, #3 is related to a concert event, and #4is related to a demonstration event. Tweet #5 is an exampleof a tweet belonging to the class traffic congestion or crash,while tweets #6 and #7 are examples of non-traffic class tweets.Words underlined in the text of each tweet represent involvedkeywords.B. Employed Classification ModelsIn the following we briefly describe the main properties ofthe employed and experimented classification models.SVMs, introduced for the first time in [49], are discriminativeclassification algorithms based on a separating hyper-planeaccording to which new samples can be classified. The besthyper-plane is the one with the maximum margin, i.e., thelargest minimum distance, from the training samples and iscomputed based on the support vectors (i.e., samples of thetraining set). The SVM classifier employed in this work is theimplementation described in [44].The NB classifier [45] is a probabilistic classification algorithmbased on the application of the Bayes’s theorem, andis characterized by a probability model which assumes independenceamong the input features. In other words, the modelassumes that the presence of a particular feature is unrelated tothe presence of any other feature.The C4.5 decision tree algorithm [46] generates a classificationdecision tree by recursively dividing up the training dataaccording to the values of the features. Non-terminal nodesof the decision tree represent tests on one or more features,while terminal nodes represent the predicted output, namely theclass. In the resulting decision tree each path (from the rootto a leaf) identifies a combination of feature values associatedwith a particular classification. At each level of the tree, thealgorithm chooses the feature that most effectively splits thedata, according to the highest information gain.The kNN algorithm [50] belongs to the family of “lazy”classification algorithms. The basic functioning principle is thefollowing: each unseen sample is compared with a number ofpre-classified training samples, and its similarity is evaluatedaccording to a simple distance measure (e.g., we employed thenormalized Euclidean distance), in order to find the associatedoutput class. The parameter k allows specifying the number ofneighbors, i.e., training samples to take into account for theclassification. We focus on three kNN models with k equal to1, 2, and 5. The kNN classifier employed in this work followsthe implementation described in [47].The PART algorithm [48] combines two rule generationmethods, i.e., RIPPER [51] and C4.5 [46]. It infers classificationrules by repeatedly building partial, i.e., incomplete,C4.5 decision trees and by using the separate-and-conquer rulelearning technique [52].C. Experimental ResultsIn this section, we present the classification results achievedby applying the classifiers mentioned in Section V-B to thetwo datasets described in Section V-A. For each classifier theexperiments were performed using an n-fold stratified crossvalidationmethodology. In n-fold stratified cross-validation,the dataset is randomly partitioned into n folds and the classesin each fold are represented with the same proportion as inthe original data. Then, the classification model is trained onn − 1 folds, and the remaining fold is used for testing themodel. The procedure is repeated n times, using as test dataeach of the n folds exactly once. The n test results are finallyaveraged to produce an overall estimation. We repeated ann-fold stratified cross-validation, with n = 10, for two times,using two different seed values to randomly partition the datainto folds.D’ANDREA et al.: REAL-TIME DETECTION OF TRAFFIC FROM TWITTER STREAM ANALYSIS 2277TABLE IVSTATISTICAL METRICSWe recall that, for each fold, we consider a specific trainingset which is used in the supervised learning stage to learnboth the pre-processing (i.e., the set of relevant stems and theirweights) and the classification model parameters.To evaluate the achieved results, we employed the mostfrequently used statistical metrics, i.e., precision, accuracy,recall, and F-score. To explain the meaning of the metrics,we will refer, for the sake of simplicity, to the case of abinary classification, i.e., positive class versus negative class.In fact, in the case of a multi-class classification, the metricsare computed per class and the overall statistical measure issimply the average of the per-class measures. The correctness ofa classification can be evaluated according to four values: i) truepositives (TP): the number of real positive samples correctlyclassified as positive; ii) true negatives (TN): the number ofreal negative samples correctly classified as negative; iii) falsepositives (FP): the number of real negative samples incorrectlyclassified as positive; iv) false negatives (FN): the number ofreal positive samples incorrectly classified as negative.Based on the previous definitions, we can now formallydefine the employed statistical metrics and provide, in Table IV,the corresponding equations. Accuracy represents the overalleffectiveness of the classifier and corresponds to the number ofcorrectly classified samples over the total number of samples.Precision is the number of correctly classified samples of aclass, i.e., positive class, over the number of samples classifiedas belonging to that class. Recall is the number of correctlyclassified samples of a class, i.e., positive class, over the numberof samples of that class; it represents the effectiveness of theclassifier to identify positive samples. The F-score (typicallyused with β = 1 for class-balanced datasets) is the weightedharmonic mean of precision and recall and it is used to comparedifferent classifiers.In the first experiment, we performed a classification oftweets using the 2-class dataset (R = 2) consisting of 1330tweets, described in Section V-A. The aim is to assign a classlabel (traffic or non-traffic) to each tweet.Table V shows the average classification results obtained bythe classifiers on the 2-class dataset. More in detail, the tableshows for each classifier, the accuracy, and the per-class valueof recall, precision, and F-score. All the values are averagedover the 20 values obtained by repeating two times the 10-foldcross validation. The best classifier resulted to be the SVM withan average accuracy of 95.75%.As Table VI clearly shows, the results achieved by our SVMclassifier appreciably outperform those obtained in similarworks in the literature [9], [12], [24], [31] despite they refer todifferent datasets. More precisely, Wanichayapong et al. [12]obtained an accuracy of 91.75% by using an approach thatconsiders the presence of place mentions and special keywordsin the tweet. Li et al. [31] achieved an accuracy of 80% fordetecting incident-related tweets using Twitter specific features,such as hashtags, mentions, URLs, and spatial and temporalinformation. Sakaki et al. [9] employed an SVM to identifyheavy-traffic tweets and obtained an accuracy of 87%. Finally,Schulz et al. [24], by using SVM, RIPPER, and NB classifiers,obtained accuracies of 89.06%, 85.93%, and 86.25%, respectively.In the case of SVM, they used the following features:word n-grams, TF-IDF score, syntactic and semantic features.In the case of NB and RIPPER they employed the same set offeatures except semantic features.In the second experiment, we performed a classificationof tweets over three classes (R = 3), namely, traffic due toexternal event, traffic congestion or crash, and non-traffic, withthe aim of discriminating the cause of traffic. Thus, we employedthe 3-class dataset consisting of 999 tweets, describedin Section V-A. We employed again the classifiers previouslyintroduced and the obtained results are shown in Table VII.The best classifier resulted to be again SVM with an averageaccuracy of 88.89%.In order to verify if there exist statistical differences amongthe values of accuracy achieved by the seven classificationmodels, we performed a statistical analysis of the results. Wetook into account the model which obtains the best averageaccuracy, i.e., the SVM model. As suggested in [53], we appliednon-parametric statistical tests: for each classifier we generateda distribution consisting of the 20 values of the accuracieson the test set obtained by repeating two times the 10-foldcross validation. We statistically compared the results achievedby the SVM model with the ones achieved by the remainingmodels. We applied the Wilcoxon signed-rank test [54], whichdetects significant differences between two distributions. In allthe tests, we used α = 0.05 as level of significance. Tables VIIIand IX show the results of the Wilcoxon signed-rank test, relatedto the 2-class and the 3-class problems, respectively. In thetables R+ and R− denote, respectively, the sum of ranks for thefolds in which the first model outperformed the second, andthe sum of ranks for the opposite condition. Since the p-valuesare always lower than the level of significance we can alwaysreject the statistical hypothesis of equivalence. For this reason,we can state that the SVM model statistically outperforms allthe other approaches on both the problems.VI. REAL-TIME DETECTION OF TRAFFIC EVENTSThe developed system was installed and tested for the realtimemonitoring of several areas of the Italian road network,by means of the analysis of the Twitter stream coming fromthose areas. The aim is to perform a continuous monitoring offrequently busy roads and highways in order to detect possibletraffic events in real-time or even in advance with respect to thetraditional news media [55], [56]. The system is implemented as2278 IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 16, NO. 4, AUGUST 2015TABLE VCLASSIFICATION RESULTS ON THE 2-CLASS DATASET (BEST VALUES IN BOLD)TABLE VIRESULTS OF THE CLASSIFICATION OF TWEETS IN OTHER WORKS IN THE LITERATURETABLE VIICLASSIFICATION RESULTS ON THE 3-CLASS DATASET (BEST VALUES IN BOLD)TABLE VIIIRESULTS OF THE WILCOXON SIGNED-RANK TEST ON THE ACCURACIESOBTAINED ON THE TEST SET FOR THE 2-CLASS DATASETa service of a wider service-oriented platform to be developedin the context of the SMARTY project [23]. The service canbe called by each user of the platform, who desires to knowTABLE IXRESULTS OF THE WILCOXON SIGNED-RANK TEST ON THE ACCURACIESOBTAINED ON THE TEST SET FOR THE 3-CLASS DATASETthe traffic conditions in a certain area. In this section, weaim to show the effectiveness of our system in determiningtraffic events in short time. We just present some results for theD’ANDREA et al.: REAL-TIME DETECTION OF TRAFFIC FROM TWITTER STREAM ANALYSIS 2279TABLE XREAL-TIME DETECTION OF TRAFFIC EVENTS2-class problem. For the setup of the system, we have employedas training set the overall dataset described in Section V-A.We adopt only the best performing classifier, i.e., the SVMclassifier. During the learning stage, we identified Q = 3227features, which were reduced to F = 582 features after thefeature selection step.2280 IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 16, NO. 4, AUGUST 2015TABLE X(CONTINUED.) REAL-TIME DETECTION OF TRAFFIC EVENTSThe system continuously performs the following operations:it i) fetches, with a time frequency of z minutes, tweets originatedfrom a given area, containing the keywords resulting fromCondA, ii) performs a real-time classification of the fetchedtweets, iii) detects a possible traffic-related event, by analyzingthe traffic class tweets from the considered area, and, if needed,sends one or more traffic warning signals with increasingintensity for that area. More in detail, a first low-intensitywarning signal is sent when m traffic class tweets are foundin the considered area in the same or in subsequent temporalwindows. Then, as the number of traffic class tweets grows,the warning signal becomes more reliable, thus more intense.The value of m was set based on heuristic considerations,depending, e.g., on the traffic density of the monitored area.In the experiments we set m = 1. As regards the fetching frequencyz, we heuristically found that z = 10 minutes representsa good compromise between fast event detection and systemscalability. In fact, z should be set depending on the number ofmonitored areas and on the volume of tweets fetched.With the aim of evaluating the effectiveness of our system,we need that each detected traffic-related event is appropriatelyvalidated. Validation can be performed in different wayswhich include: i) direct communication by a person, who waspresent at the moment of the event, ii) reports drawn up by thepolice and/or local administrations (available only in case ofincidents), iii) radio traffic news; iv) official real-time trafficnews web sites; v) local newspapers (often the day after theevent and only when the event is very significant).Direct communication is possible only if a person is presentat the event and can communicate this event to us. Although wehave tried to sensitize a number of users, we did not obtain anadequate feedback. Official reports are confidential: police andlocal administrations barely allow accessing to these reports,and, when this permission is granted, reports can be consultedonly after several days. Radio traffic news are in general quiteprecise in communicating traffic-related events in real time. Unfortunately,to monitor and store the events, we should dedicatea person or adopt some tool for audio analysis. We realizedhowever that the traffic-related events communicated on theradio are always mentioned also in the official real-time trafficnews web sites. Actually, on the radio, the speaker typicallyreads the news reported on the web sites. Local newspapersfocus on local traffic-related events and often provide eventswhich are not published on official traffic news web sites.Concluding, official real-time traffic news web sites and localnewspapers are the most reliable and effective sources of informationfor traffic-related events. Thus, we decided to analyzetwo of the most popular real-time traffic news web sites for theD’ANDREA et al.: REAL-TIME DETECTION OF TRAFFIC FROM TWITTER STREAM ANALYSIS 2281Italian road network, namely “CCISS Viaggiare informati”,8managed by the Italian government Ministry for infrastructuresand transports, and “Autostrade per l’Italia”,9 the official website of Italian highway road network. Further, we examinedlocal newspapers published in the zones where our system wasable to detect traffic-related events.Actually, it was really difficult to find realistic data to test theproposed system, basically for two reasons: on the one hand, wehave realized that real traffic events are not always notified inofficial news channels; on the other hand, situations of trafficslowdown may be detected by traditional traffic sensors but,at the same time, may not give rise to tweets. In particular,in relation to this latter reason, it is well known that driversusually share a tweet about a traffic event only when theevent is unexpected and really serious, i.e., it forces to stopthe car. So, for instance, they do not share a tweet in caseof road works, minor traffic difficulties, or usual traffic jams(same place and same time). In fact, in correspondence tominor traffic jams we rarely find tweets coming from the affectedarea.We have tried to build a meaningful set of traffic events,related to some major Italian cities, of which we have found anofficial confirmation. The selected set includes events correctlyidentified by the proposed system and confirmed via officialtraffic news web sites or local newspapers. The set of trafficevents, whose information is summarized in Table X, consistsof 70 events detected by our system. The events are relatedboth to highways and to urban roads, and were detected duringSeptember and early October 2014.Table X shows the information about the event, the time ofdetection from Twitter’s stream fetched by our system, the timeof detection from official news websites or local newspapers,and the difference between these two times. In the table, positivedifferences indicate a late detection with respect to officialnews web sites, while negative differences indicate an earlydetection. The symbol “-” indicates that we found the officialconfirmation of the event by reading local newspapers severalhours late. More precisely, the system detects in advance 20events out of 59 confirmed by news web sites, and 11 eventsconfirmed the day after by local newspapers. Regarding the39 events not detected in advance we can observe that 25 ofsuch events are detected within 15 minutes from their officialnotification, while the detection of the remaining 14 eventsoccurs beyond 15 minutes but within 50 minutes. We wish topoint out, however, that, even in the cases of late detection, oursystem directly and explicitly notifies the event occurrence tothe drivers or passengers registered to the SMARTY platform,on which our system runs. On the contrary, in order to get trafficinformation, the drivers or passengers usually need to searchand access the official news websites, which may take sometime and effort, or to wait for getting the information from theradio traffic news.As future work, we are planning to integrate our systemwith an application for analyzing the official traffic news websites, so as to capture traffic condition notifications in real-time.8http://www.cciss.it/9http://www.autostrade.it/autostrade-gis/gis.doThus, our system will be able to signal traffic-related eventsin the worst case at the same time of the notifications on theweb sites. Further, we are investigating the integration of oursystem into a more complex traffic detection infrastructure.This infrastructure may include both advanced physical sensorsand social sensors such as streams of tweets. In particular, socialsensors may provide a low-cost wide coverage of the roadnetwork, especially in those areas (e.g., urban and suburban)where traditional traffic sensors are missing.VII. CONCLUSIONIn this paper, we have proposed a system for real-timedetection of traffic-related events from Twitter stream analysis.The system, built on a SOA, is able to fetch and classify streamsof tweets and to notify the users of the presence of trafficevents. Furthermore, the system is also able to discriminate if atraffic event is due to an external cause, such as football match,procession and manifestation, or not.We have exploited available software packages and state-ofthe-art techniques for text analysis and pattern classification.These technologies and techniques have been analyzed, tuned,adapted and integrated in order to build the overall systemfor traffic event detection. Among the analyzed classifiers, wehave shown the superiority of the SVMs, which have achievedaccuracy of 95.75%, for the 2-class problem, and of 88.89%for the 3-class problem, in which we have also considered thetraffic due to external event class.The best classification model has been employed for realtimemonitoring of several areas of the Italian road network.Wehave shown the results of a monitoring campaign, performed inSeptember and early October 2014. We have discussed the capabilityof the system of detecting traffic events almost in realtime,often before online news web sites and local newspapers.ACKNOWLEDGMENTWe would like to thank Fabio Cempini for the implementationof some parts of the system presented in this paper.REFERENCES[1] F. Atefeh and W. Khreich, “A survey of techniques for event detection inTwitter,” Comput. Intell., vol. 31, no. 1, pp. 132–164, 2015.[2] P. Ruchi and K. Kamalakar, “ET: Events from tweets,” in Proc. 22ndInt. Conf. World Wide Web Comput., Rio de Janeiro, Brazil, 2013,pp. 613–620.[3] A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, andB. Bhattacharjee, “Measurement and analysis of online social networks,”in Proc. 7th ACM SIGCOMM Conf. Internet Meas., San Diego, CA,USA, 2007, pp. 29–42.[4] G. Anastasi et al., “Urban and social sensing for sustainable mobilityin smart cities,” in Proc. IFIP/IEEE Int. Conf. Sustainable Internet ICTSustainability, Palermo, Italy, 2013, pp. 1–4.[5] A. Rosi et al., “Social sensors and pervasive services: Approaches andperspectives,” in Proc. IEEE Int. Conf. PERCOM Workshops, Seattle,WA, USA, 2011, pp. 525–530.[6] T. Sakaki, M. Okazaki, and Y.Matsuo, “Tweet analysis for real-time eventdetection and earthquake reporting system development,” IEEE Trans.Knowl. Data Eng., vol. 25, no. 4, pp. 919–931, Apr. 2013.[7] J. Allan, Topic Detection and Tracking: Event-Based InformationOrganization. Norwell, MA, USA: Kluwer, 2002.[8] K. Perera and D. Dias, “An intelligent driver guidance tool using locationbased services,” in Proc. IEEE ICSDM, Fuzhou, China, 2011,pp. 246–251.2282 IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 16, NO. 4, AUGUST 2015[9] T. Sakaki, Y. Matsuo, T. Yanagihara, N. P. Chandrasiri, and K. Nawa,“Real-time event extraction for driving information from social sensors,”in Proc. IEEE Int. Conf. CYBER, Bangkok, Thailand, 2012,pp. 221–226.[10] B. Chen and H. H. Cheng, “A review of the applications of agent technologyin traffic and transportation systems,” IEEE Trans. Intell. Transp.Syst., vol. 11, no. 2, pp. 485–497, Jun. 2010.[11] A. Gonzalez, L. M. Bergasa, and J. J. Yebes, “Text detection and recognitionon traffic panels from street-level imagery using visual appearance,”IEEE Trans. Intell. Transp. Syst., vol. 15, no. 1, pp. 228–238,Feb. 2014.[12] N. Wanichayapong, W. Pruthipunyaskul, W. Pattara-Atikom, andP. Chaovalit, “Social-based traffic information extraction and classification,”in Proc. 11th Int. Conf. ITST, St. Petersburg, Russia, 2011,pp. 107–112.[13] P. M. d’Orey and M. Ferreira, “ITS for sustainable mobility: A surveyon applications and impact assessment tools,” IEEE Trans. Intell. Transp.Syst., vol. 15, no. 2, pp. 477–493, Apr. 2014.[14] K. Boriboonsomsin, M. Barth, W. Zhu, and A. Vu, “Eco-routing navigationsystem based on multisource historical and real-time trafficinformation,” IEEE Trans. Intell. Transp. Syst., vol. 13, no. 4,pp. 1694–1704, Dec. 2012.[15] J. Hurlock and M. L. Wilson, “Searching twitter: Separating the tweetfrom the chaff,” in Proc. 5th AAAI ICWSM, Barcelona, Spain, 2011,pp. 161–168.[16] J. Weng and B.-S. Lee, “Event detection in Twitter,” in Proc. 5th AAAIICWSM, Barcelona, Spain, 2011, pp. 401–408.[17] S. Weiss, N. Indurkhya, T. Zhang, and F. Damerau, Text Mining: PredictiveMethods for Analyzing Unstructured Information. Berlin, Germany:Springer-Verlag, 2004.[18] A. Hotho, A. Nürnberger, and G. Paaß, “A brief survey of text mining,”LDV Forum-GLDV J. Comput. Linguistics Lang. Technol., vol. 20, no. 1,pp. 19–62, May 2005.[19] V. Gupta, S. Gurpreet, and S. Lehal, “A survey of text mining techniquesand applications,” J. Emerging Technol. Web Intell., vol. 1, no. 1,pp. 60–76, Aug. 2009.[20] V. Ramanathan and T. Meyyappan, “Survey of text mining,” in Proc. Int.Conf. Technol. Bus. Manage., Dubai, UAE, 2013, pp. 508–514.[21] M.W. Berry and M. Castellanos, Survey of Text Mining. NewYork,NY,USA: Springer-Verlag, 2004.[22] H. Takemura and K. Tajima, “Tweet classification based on their lifetimeduration,” in Proc. 21st ACM Int. CIKM, Shanghai, China, 2012,pp. 2367–2370.[23] The Smarty project. [Online]. Available: http://www.smarty.toscana.it/[24] A. Schulz, P. Ristoski, and H. Paulheim, “I see a car crash: Real-timedetection of small scale incidents in microblogs,” in The Semantic Web:ESWC 2013 Satellite Events, vol. 7955. Berlin, Germany: Springer-Verlag, 2013, pp. 22–33.[25] M. Krstajic, C. Rohrdantz, M. Hund, and A. Weiler, “Getting there first:Real-time detection of real-world incidents on Twitter” in Proc. 2nd IEEEWork Interactive Vis. Text Anal.—Task-Driven Anal. Soc. Media IEEEVisWeek,” Seattle, WA, USA, 2012.[26] C. Chew and G. Eysenbach, “Pandemics in the age of Twitter: Contentanalysis of tweets during the 2009 H1N1 outbreak,” PLoS ONE, vol. 5,no. 11, pp. 1–13, Nov. 2010.[27] B. De Longueville, R. S. Smith, and G. Luraschi, “OMG, from here, I cansee the flames!: A use case of mining location based social networks toacquire spatio-temporal data on forest fires,” in Proc. Int. Work. LBSN,2009 Seattle, WA, USA, pp. 73–80.[28] J. Yin, A. Lampert, M. Cameron, B. Robinson, and R. Power, “Usingsocial media to enhance emergency situation awareness,” IEEE Intell.Syst., vol. 27, no. 6, pp. 52–59, Nov./Dec. 2012.[29] P. Agarwal, R. Vaithiyanathan, S. Sharma, and G. Shro, “Catching thelong-tail: Extracting local news events from Twitter,” in Proc. 6th AAAIICWSM, Dublin, Ireland, Jun. 2012, pp. 379–382.[30] F. Abel, C. Hauff, G.-J. Houben, R. Stronkman, and K. Tao,“Twitcident: fighting fire with information from social web streams,”in Proc. ACM 21st Int. Conf. Comp. WWW, Lyon, France, 2012,pp. 305–308.[31] R. Li, K. H. Lei, R. Khadiwala, and K. C.-C. Chang, “TEDAS: A Twitterbasedevent detection and analysis system,” in Proc. 28th IEEE ICDE,Washington, DC, USA, 2012, pp. 1273–1276.[32] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, andI. H. Witten, “The WEKA data mining software: An update,” SIGKDDExplor. Newsl., vol. 11, no. 1, pp. 10–18, Jun. 2009.[33] M. Habibi, Real World Regular Expressions with Java 1.4. Berlin,Germany: Springer-Verlag, 2004.[34] Y. Zhou and Z.-W. Cao, “Research on the construction and filter methodof stop-word list in text preprocessing,” in Proc. 4th ICICTA, Shenzhen,China, 2011, vol. 1, pp. 217–221.[35] W. Francis and H. Kucera, “Frequency analysis of English usage:Lexicon and grammar,” J. English Linguistics, vol. 18, no. 1, pp. 64–70,Apr. 1982.[36] M. F. Porter, “An algorithm for suffix stripping,” Program: Electron.Library Inf. Syst., vol. 14, no. 3, pp 130–137, 1980.[37] G. Salton and C. Buckley, “Term-weighting approaches in automatic textretrieval,” Inf. Process. Manage., vol. 24, no. 5, pp. 513–523, Aug. 1988.[38] L. M. Aiello et al., “Sensing trending topics in Twitter,” IEEE Trans.Multimedia, vol. 15, no. 6, pp. 1268–1282, Oct. 2013.[39] C. Shang, M. Li, S. Feng, Q. Jiang, and J. Fan, “Feature selection viamaximizing global information gain for text classification,” Knowl.-BasedSyst., vol. 54, pp. 298–309, Dec. 2013.[40] L. H. Patil and M. Atique, “A novel feature selection based on informationgain using WordNet,” in Proc. SAI Conf., London, U.K., 2013,pp. 625–629.[41] M. A. Hall and G. Holmes. “Benchmarking attribute selection techniquesfor discrete class data mining,” IEEE Trans. Knowl. Data Eng., vol. 15,no. 6, pp. 1437–1447, Nov./Dec. 2003.[42] H. U˘guz, “A two-stage feature selection method for text categorization byusing information gain, principal component analysis and genetic algorithm,”Knowl.-Based Syst., vol. 24, no. 7, pp. 1024–1032, Oct. 2011.[43] Y. Aphinyanaphongs et al., “A comprehensive empirical comparisonof modern supervised classification and feature selection methods fortext categorization,” J. Assoc. Inf. Sci. Technol., vol. 65, no. 10,pp. 1964–1987, Oct. 2014.[44] J. Platt, “Fast training of support vector machines using sequentialminimal optimization,” in Advances in Kernel Methods: Support VectorLearning, B. Schoelkopf, C. J. C. Burges and A. J. Smola, Eds.Cambridge, MA, USA, MIT Press, 1999, pp. 185–208.[45] G. H. John and P. Langley, “Estimating continuous distributions inBayesian classifiers,” in Proc. 11th Conf. Uncertainty Artif. Intell.,San Mateo, CA, 1995, pp. 338–345.[46] J. R. Quinlan, C4.5: Programs for Machine Learning. San Mateo, CA,USA: Morgan Kaufmann, 1993.[47] D. W. Aha, D. Kibler, and M. K. Albert, “Instance-based learningalgorithms,” Mach. Learn., vol. 6, no. 1, pp. 37–66, Jan. 1991.[48] E. Frank and I. H. Witten, “Generating accurate rule sets withoutglobal optimization,” in Proc. 15th ICML, Madison, WI, USA, 1998,pp. 144–151.[49] C. Cortes and V. Vapnik, “Support-vector networks,” Mach. Learn.,vol. 20, no. 3, pp. 273–297, Sep. 1995.[50] T. T. Cover and P. E. Hart, “Nearest neighbour pattern classification,”IEEE Trans. Inf. Theory, vol. IT-13, no. 1, pp. 21–27, Jan. 1967.[51] W. W. Cohen, “Fast effective rule induction,” in Proc. 12th ICML, TahoeCity, CA, USA, 1995, pp. 115–123.[52] G. Pagallo and D. Haussler, “Boolean feature discovery in empiricallearning,” Mach. Learn., vol. 5, no. 1, pp. 71–99, Mar. 1990.[53] J. Derrac, S. Garcia, D. Molina, and F. Herrera, “A practical tutorial onthe use of nonparametric statistical tests as a methodology for comparingevolutionary and swarm intelligence algorithms,” Swarm Evol. Comput.,vol. 1, no. 1, pp. 3–18, Mar. 2011.[54] F. Wilcoxon, “Individual comparisons by ranking methods,” BiometricsBull. , vol. 1, no. 6, pp. 80–83, Dec. 1945.[55] H. Becker, M. Naaman, and L. Gravano, “Beyond trending topics:Real-world event identification on Twitter,” in Proc. 5th AAAI ICWSM,Barcelona, Spain, 2011, pp. 438–441.[56] H. Kwak, C. Lee, H. Park, and S. Moon, “What is Twitter, a social networkor a news media?” in Proc. ACM 19th Int. Conf. WWW, Raleigh, NY,USA, 2010, pp. 591–600.Eleonora D’Andrea received the M.S. degree incomputer engineering for enterprise managementand the Ph.D. degree in information engineeringfrom University of Pisa, Pisa, Italy, in 2009 and 2013,respectively.She is a Research Fellow with the Research Center“E. Piaggio,” University of Pisa. She has coauthoredseveral papers in international journals and conferenceproceedings. Her main research interests includecomputational intelligence techniques for simulationand prediction, applied to various fields, suchas energy consumption in buildings or energy production in solar photovoltaicinstallations.D’ANDREA et al.: REAL-TIME DETECTION OF TRAFFIC FROM TWITTER STREAM ANALYSIS 2283Pietro Ducange received the M.Sc. degree in computerengineering and the Ph.D. degree in informationengineering from University of Pisa, Pisa, Italy,in 2005 and 2009, respectively.He is an Associate Professor with the Faculty ofEngineering, eCampus University, Novedrate, Italy.He has coauthored more than 30 papers in internationaljournals and conference proceedings. Hismain research interests focus on designing fuzzyrule-based systems with different tradeoffs betweenaccuracy and interpretability by using multiobjectiveevolutionary algorithms. He currently serves the following international journalsas a member of the Editorial Board: Soft Computing and InternationalJournal of Swarm Intelligence and Evolutionary Computation.Beatrice Lazzerini (M’98) is a Full Professor withthe Department of Information Engineering, Universityof Pisa, Pisa, Italy. She has cofounded theComputational Intelligence Group in the Departmentof Information Engineering, University of Pisa. Shehas coauthored seven books and has published over200 papers in international journals and conferences.She is a coeditor of two books. Her research interestsare in the field of computational intelligence and itsapplications to pattern classification, pattern recognition,risk analysis, risk management, diagnosis,forecasting, and multicriteria decision making. She was involved and hadroles of responsibility in several national and international research projects,conferences, and scientific events.Francesco Marcelloni (M’06) received the Laureadegree in electronics engineering and the Ph.D. degreein computer engineering from University ofPisa, Pisa, Italy, in 1991 and 1996, respectively.He is an Associate Professor with University ofPisa. He has cofounded the Computational IntelligenceGroup in the Department of Information Engineering,University of Pisa, in 2002. He is alsothe Founder and Head of the Competence Centreon MObile Value Added Services (MOVAS). Hehas coedited three volumes and four journal SpecialIssues and is the (co)author of a book and of more than 190 papers ininternational journals, books, and conference proceedings. His main researchinterests include multiobjective evolutionary fuzzy systems, situation-awareservice recommenders, energy-efficient data compression and aggregation inwireless sensor nodes, and monitoring systems for energy efficiency in buildings.Currently, he is an Associate Editor for Information Sciences (Elsevier)and Soft Computing (Springer) and is on the Editorial Board of four otherinternational journals.

Real-Time Detection of Traffic From Twitter Stream Analysis

05/08/201902/07/2019 by admin

Real-Time Detection of Traffic FromTwitter Stream AnalysisAbstract—Social networks have been recently employed as asource of information for event detection, with particular referenceto road traffic congestion and car accidents. In this paper, wepresent a real-time monitoring system for traffic event detectionfrom Twitter stream analysis. The system fetches tweets fromTwitter according to several search criteria; processes tweets, byapplying text mining techniques; and finally performs the classificationof tweets. The aim is to assign the appropriate class label toeach tweet, as related to a traffic event or not. The traffic detectionsystem was employed for real-time monitoring of several areas ofthe Italian road network, allowing for detection of traffic eventsalmost in real time, often before online traffic news web sites. Weemployed the support vector machine as a classification model,and we achieved an accuracy value of 95.75% by solving a binaryclassification problem (traffic versus nontraffic tweets). We werealso able to discriminate if traffic is caused by an external event ornot, by solving a multiclass classification problem and obtainingan accuracy value of 88.89%.Index Terms—Traffic event detection, tweet classification, textmining, social sensing.I. INTRODUCTIONSOCIAL network sites, also called micro-blogging services(e.g., Twitter, Facebook, Google+), have spread in recentyears, becoming a new kind of real-time information channel.Their popularity stems from the characteristics of portabilitythanks to several social networks applications for smartphonesand tablets, easiness of use, and real-time nature [1], [2]. Peopleintensely use social networks to report (personal or public) reallifeevents happening around them or simply to express theiropinion on a given topic, through a public message. Socialnetworks allow people to create an identity and let them shareit in order to build a community. The resulting social networkis then a basis for maintaining social relationships, findingManuscript received July 2, 2014; revised October 7, 2014 and December 16,2014; accepted February 10, 2015. Date of publication March 10, 2015; date ofcurrent version July 31, 2015. This work was carried out in the frameworkof and was supported by the SMARTY project, funded by “ProgrammaOperativo Regionale (POR) 2007–2013”—objective “Competitività regionalee occupazione” of the Tuscany Region. The Associate Editor for this paper wasQ. Zhang.E. D’Andrea is with the Research Center “E. Piaggio,” University of Pisa,56122 Pisa, Italy (e-mail: eleonora.dandrea@for.unipi.it).P. Ducange is with the Faculty of Engineering, eCampus University, 22060Novedrate, Italy (e-mail: pietro.ducange@uniecampus.it).B. Lazzerini and F. Marcelloni are with the Dipartimento di Ingegneriadell’Informazione, University of Pisa, 56122 Pisa, Italy (e-mail: b.lazzerini@iet.unipi.it; f.marcelloni@iet.unipi.it).Color versions of one or more of the figures in this paper are available onlineat http://ieeexplore.ieee.org.Digital Object Identifier 10.1109/TITS.2015.2404431users with similar interests, and locating content and knowledgeentered by other users [3].The user message shared in social networks is called StatusUpdate Message (SUM), and it may contain, apart from thetext, meta-information such as timestamp, geographic coordinates(latitude and longitude), name of the user, links to otherresources, hashtags, and mentions. Several SUMs referring toa certain topic or related to a limited geographic area may provide,if correctly analyzed, great deal of valuable informationabout an event or a topic. In fact, we may regard social networkusers as social sensors [4], [5], and SUMs as sensor information[6], as it happens with traditional sensors.Recently, social networks and media platforms have beenwidely used as a source of information for the detection ofevents, such as traffic congestion, incidents, natural disasters(earthquakes, storms, fires, etc.), or other events. An eventcan be defined as a real-world occurrence that happens in aspecific time and space [1], [7]. In particular, regarding trafficrelatedevents, people often share by means of an SUM informationabout the current traffic situation around them whiledriving. For this reason, event detection from social networksis also often employed with Intelligent Transportation Systems(ITSs). An ITS is an infrastructure which, by integrating ICTs(Information and Communication Technologies) with transportnetworks, vehicles and users, allows improving safety and managementof transport networks. ITSs provide, e.g., real-timeinformation about weather, traffic congestion or regulation, orplan efficient (e.g., shortest, fast driving, least polluting) routes[4], [6], [8]–[14].However, event detection from social networks analysis isa more challenging problem than event detection from traditionalmedia like blogs, emails, etc., where texts are wellformatted[2]. In fact, SUMs are unstructured and irregulartexts, they contain informal or abbreviated words, misspellingsor grammatical errors [1]. Due to their nature, they are usuallyvery brief, thus becoming an incomplete source of information[2]. Furthermore, SUMs contain a huge amount of not usefulor meaningless information [15], which has to be filtered.According to Pear Analytics,1 it has been estimated that over40% of all Twitter2 SUMs (i.e., tweets) is pointless with nouseful information for the audience, as they refer to the personalsphere [16]. For all of these reasons, in order to analyze theinformation coming from social networks, we exploit text miningtechniques [17], which employ methods from the fields of1http://www.pearanalytics.com/, 2009.2https://twitter.com.1524-9050 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.2270 IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 16, NO. 4, AUGUST 2015data mining, machine learning, statistics, and Natural LanguageProcessing (NLP) to extract meaningful information [18].More in detail, text mining refers to the process of automaticextraction of meaningful information and knowledge from unstructuredtext. The main difficulty encountered in dealing withproblems of text mining is caused by the vagueness of naturallanguage. In fact, people, unlike computers, are perfectly able tounderstand idioms, grammatical variations, slang expressions,or to contextualize a given word. On the contrary, computershave the ability, lacking in humans, to quickly process largeamounts of information [19], [20].The text mining process is summarized in the following.First, the information content of the document is convertedinto a structured form (vector space representation). In fact,most of text mining techniques are based on the idea that adocument can be faithfully represented by the set of wordscontained in it (bag-of-words representation [21]). Accordingto this representation, each document j of a collection ofdocuments is represented as an M-dimensional vector Vj ={w(tj1), . . . , w(tji), . . . , w(tjM)}, where M is the number ofwords defined in the document collection, and w(tji) specifiesthe weight of the word ti in document j. The simplest weightingmethod assigns a binary value to w(tji), thus indicating theabsence or the presence of the word ti, while other methodsassign a real value to w(tji). During the text mining process,several operations can be performed [21], depending on the specificgoal, such as: i) linguistic analysis through the applicationof NLP techniques, indexing and statistical techniques, ii) textfiltering by means of specific keywords, iii) feature extraction,i.e., conversion of textual features (e.g., words) in numericfeatures (e.g., weights), that a machine learning algorithm isable to process, and iv) feature selection, i.e., reduction of thenumber of features in order to take into account only the mostrelevant ones. The feature selection is particularly important,since one of the main problems in text mining is the highdimensionality of the feature space _M. Then, data miningand machine learning algorithms (i.e., support vector machines(SVMs), decision trees, neural networks, etc.) are applied tothe documents in the vector space representation, to build classification,clustering or regression models. Finally, the resultsobtained by the model are interpreted by means of measuresof effectiveness (e.g., statistical-based measures) to verify theaccuracy achieved. Additionally, the obtained results may beimproved, e.g., by modifying the values of the parameters usedand repeating the whole process.Among social networks platforms, we took into accountTwitter, as the majority of works in the literature regardingevent detection focus on it. Twitter is nowadays the mostpopular micro-blogging service; it counts more than 600 millionactive users,3 sharing more than 400 million SUMs perday [1]. Regarding the aim of this paper, Twitter has severaladvantages over the similar micro-blogging services. First,tweets are up to 140 characters, enhancing the real-time andnews-oriented nature of the platform. In fact, the life-time oftweets is usually very short, thus Twitter is the social network3http://www.statisticbrain.com/twitter-statisticsplatform that is best suited to study SUMs related to real-timeevents [22]. Second, each tweet can be directly associated withmeta-information that constitutes additional information. Third,Twitter messages are public, i.e., they are directly available withno privacy limitations. For all of these reasons, Twitter is a goodsource of information for real-time event detection and analysis.In this paper, we propose an intelligent system, based on textmining and machine learning algorithms, for real-time detectionof traffic events from Twitter stream analysis. The system,after a feasibility study, has been designed and developed fromthe ground as an event-driven infrastructure, built on a ServiceOriented Architecture (SOA) [23]. The system exploits availabletechnologies based on state-of-the-art techniques for textanalysis and pattern classification. These technologies and techniqueshave been analyzed, tuned, adapted, and integrated inorder to build the intelligent system. In particular, we present anexperimental study, which has been performed for determiningthe most effective among different state-of-the-art approachesfor text classification. The chosen approach was integrated intothe final system and used for the on-the-field real-time detectionof traffic events.The paper has the following structure. Section II summarizesrelated work about event detection from social Twitter streamanalysis. Section III outlines the architecture of the proposedsystem for traffic detection, by describing the methodologyused to collect, elaborate, and classify SUMs, with particularreference to SUMs extracted from the Twitter stream.Section IV describes the setup of the system. Section V presentsthe results achieved with different classification models andprovides a comparison with similar works in the literature.Section VI presents the real-world monitoring application forreal-time detection of traffic events. Finally, Section VII providesconcluding remarks.II. RELATED WORKWith reference to current approaches for using social mediato extract useful information for event detection, we need todistinguish between small-scale events and large-scale events.Small-scale events (e.g., traffic, car crashes, fires, or localmanifestations) usually have a small number of SUMs relatedto them, belong to a precise geographic location, and areconcentrated in a small time interval. On the other hand, largescaleevents (e.g., earthquakes, tornados, or the election of apresident) are characterized by a huge number of SUMs, and bya wider temporal and geographic coverage [24]. Consequently,due to the smaller number of SUMs related to small-scaleevents, small-scale event detection is a non-trivial task. Severalworks in the literature deal with event detection from socialnetworks. Many works deal with large-scale event detection [6],[25]–[28] and only a few works focus on small-scale events [9],[12], [24], [29]–[31].Regarding large-scale event detection, Sakaki et al. [6] useTwitter streams to detect earthquakes and typhoons, by monitoringspecial trigger-keywords, and by applying an SVM as abinary classifier of positive events (earthquakes and typhoons)and negative events (non-events or other events). In [25],the authors present a method for detecting real-world events,D’ANDREA et al.: REAL-TIME DETECTION OF TRAFFIC FROM TWITTER STREAM ANALYSIS 2271such as natural disasters, by analyzing Twitter streams andby employing both NLP and term-frequency-based techniques.Chew et al. [26] analyze the content of tweets shared during theH1N1 (i.e., swine flu) outbreak, containing keywords and hashtagsrelated to the H1N1 event to determine the kind of informationexchanged by social media users. De Longueville et al.[27] analyze geo-tagged tweets to detect forest fire events andoutline the affected area.Regarding small-scale event detection, Agarwal et al. [29]focus on the detection of fires in a factory from Twitter streamanalysis, by using standard NLP techniques and a Naive Bayes(NB) classifier. In [30], information extracted from Twitterstreams is merged with information from emergency networksto detect and analyze small-scale incidents, such as fires.Wanichayapong et al. [12] extract, using NLP techniques andsyntactic analysis, traffic information from microblogs to detectand classify tweets containing place mentions and trafficinformation. Li et al. [31] propose a system, called TEDAS, toretrieve incident-related tweets. The system focuses on Crimeand Disaster-related Events (CDE) such as shootings, thunderstorms,and car accidents, and aims to classify tweets asCDE events by exploiting a filtering based on keywords, spatialand temporal information, number of followers of the user,number of retweets, hashtags, links, and mentions. Sakaki et al.[9] extract, based on keywords, real-time driving informationby analyzing Twitter’s SUMs, and use an SVM classifierto filter “noisy” tweets not related to road traffic events.Schulz et al. [24] detect small-scale car incidents from Twitterstream analysis, by employing semantic web technologies,along with NLP and machine learning techniques. They performthe experiments using SVM, NB, and RIPPER classifiers.In this paper, we focus on a particular small-scale event, i.e.,road traffic, and we aim to detect and analyze traffic eventsby processing users’ SUMs belonging to a certain area andwritten in the Italian language. To this aim, we propose a systemable to fetch, elaborate, and classify SUMs as related to a roadtraffic event or not. To the best of our knowledge, few papershave been proposed for traffic detection using Twitter streamanalysis. However, with respect to our work, all of them focuson languages different from Italian, employ different inputfeatures and/or feature selection algorithms, and consider onlybinary classifications. In addition, a few works employ machinelearning algorithms [9], [24], while the others rely on NLPtechniques only. The proposed system may approach both binaryand multi-class classification problems. As regards binaryclassification, we consider traffic-related tweets, and tweets notrelated with traffic. As regards multi-class classification, wesplit the traffic-related class into two classes, namely trafficcongestion or crash, and traffic due to external event. In thispaper, with external event we refer to a scheduled event (e.g.,a football match, a concert), or to an unexpected event (e.g.,a flash-mob, a political demonstration, a fire). In this way weaim to support traffic and city administrations for managingscheduled or unexpected events in the city.Moreover, the proposed system could work together withother traffic sensors (e.g., loop detectors, cameras, infraredcameras) and ITS monitoring systems for the detection of trafficdifficulties, providing a low-cost wide coverage of the roadFig. 1. System architecture for traffic detection from Twitter stream analysis.network, especially in those areas (e.g., urban and suburban)where traditional traffic sensors are missing.Concluding, the proposed ITS is characterized by the followingstrengths with respect to the current research aimed atdetecting traffic events from social networks: i) it performs amulti-class classification, which recognizes non-traffic, trafficdue to congestion or crash, and traffic due to external events;ii) it detects the traffic events in real-time; and iii) it is developedas an event-driven infrastructure, built on an SOA architecture.As regards the first strength, the proposed ITS could be a valuabletool for traffic and city administrations to regulate trafficand vehicular mobility, and to improve the management ofscheduled or unexpected events. For what concerns the secondstrength, the real-time detection capability allows obtaining reliableinformation about traffic events in a very short time, oftenbefore online news web sites and local newspapers. As far as thethird strength is concerned, with the chosen architecture, we areable to directly notify the traffic event occurrence to the driversregistered to the system, without the need for them to access officialnews websites or radio traffic news channels, to get trafficinformation. In addition, the SOA architecture permits to exploittwo important peculiarities, i.e., scalability of the service(e.g., by using a dedicated server for each geographic area), andeasy integration with other services (e.g., other ITS services).III. ARCHITECTURE OF THE TRAFFIC DETECTION SYSTEMIn this section, our traffic detection system based onTwitter streams analysis is presented. The system architectureis service-oriented and event-driven, and is composed of threemain modules, namely: i) “Fetch of SUMs and Pre-processing”,ii) “Elaboration of SUMs”, iii) “Classification of SUMs”. Thepurpose of the proposed system is to fetch SUMs from Twitter,to process SUMs by applying a few text mining steps, andto assign the appropriate class label to each SUM. Finally, asshown in Fig. 1, by analyzing the classified SUMs, the systemis able to notify the presence of a traffic event.The main tools we have exploited for developing the systemare: 1) Twitter’s API,4 which provides direct access to the4http://dev.twitter.com2272 IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 16, NO. 4, AUGUST 2015public stream of tweets; 2) Twitter4J,5 a Java library that weused as a wrapper for Twitter’s API; 3) the Java API providedbyWeka (Waikato Environment for Knowledge Analysis) [32],which we mainly employed for data pre-processing and textmining elaboration.We recall that both the “Elaboration of SUMs” and the“Classification of SUMs” modules require setting the optimalvalues of a few specific parameters, by means of a supervisedlearning stage. To this aim, we exploited a training setcomposed by a set of SUMs previously collected, elaborated,and manually labeled. Section IV describes in greater detailhow the specific parameters of each module are set during thesupervised learning stage.In the following, we discuss in depth the elaboration madeon the SUMs by each module of the traffic detection system.A. Fetch of SUMs and Pre-ProcessingThe first module, “Fetch of SUMs and Pre-processing”,extracts raw tweets from the Twitter stream, based on one ormore search criteria (e.g., geographic coordinates, keywordsappearing in the text of the tweet). Each fetched raw tweet contains:the user id, the timestamp, the geographic coordinates,a retweet flag, and the text of the tweet. The text may containadditional information, such as hashtags, links, mentions, andspecial characters. In this paper, we took only Italian languagetweets into account. However, the system can be easily adaptedto cope with different languages.After the SUMs have been fetched according to the specificsearch criteria, SUMs are pre-processed. In order to extract onlythe text of each raw tweet and remove all meta-informationassociated with it, a Regular Expression filter [33] is applied.More in detail, the meta-information discarded are: user id,timestamp, geographic coordinates, hashtags, links, mentions,and special characters. Finally, a case-folding operation isapplied to the texts, in order to convert all characters to lowercase. At the end of this elaboration, each fetched SUM appearsas a string, i.e., a sequence of characters. We denote the jthSUM pre-processed by the first module as SUMj , with j =1, . . . , N, where N is the total number of fetched SUMs.B. Elaboration of SUMsThe second processing module, “Elaboration of SUMs”, isdevoted to transforming the set of pre-processed SUMs, i.e., aset of strings, in a set of numeric vectors to be elaborated bythe “Classification of SUMs” module. To this aim, some textmining techniques are applied in sequence to the pre-processedSUMs. In the following, the text mining steps performed in thismodule are described in detail:a) tokenization is typically the first step of the text miningprocess, and consists in transforming a stream of charactersinto a stream of processing units called tokens (e.g.,syllables, words, or phrases). During this step, other operationsare usually performed, such as removal of punctua-5http://twitter4j.orgtion and other non-text characters [18], and normalizationof symbols (e.g., accents, apostrophes, hyphens, tabs andspaces). In the proposed system, the tokenizer removesall punctuation marks and splits each SUM into tokenscorresponding to words (bag-of-words representation). Atthe end of this step, each SUMj is represented as thesequence of words contained in it. We denote the jthtokenized SUM as SUMTj =_tTj1, . . . , tTjh, . . . , tTjHj_,where tTjh is the hth token and Hj is the total numberof tokens in SUMTj ;b) stop-word filtering consists in eliminating stop-words,i.e., words which provide little or no information to thetext analysis. Common stop-words are articles, conjunctions,prepositions, pronouns, etc. Other stop-words arethose having no statistical significance, that is, those thattypically appear very often in sentences of the consideredlanguage (language-specific stop-words), or in the set oftexts being analyzed (domain-specific stop-words), andcan therefore be considered as noise [34]. The authorsin [35] have shown that the 10 most frequent wordsin texts and documents of the English language areabout the 20–30% of the tokens in a given document.In the proposed system, the stop-word list for the Italianlanguage was freely downloaded from the SnowballTartarus website6 and extended with other ad hoc definedstop-words. At the end of this step, each SUMis thus reduced to a sequence of relevant tokens. Wedenote the jth stop-word filtered SUM as SUMSW_ j =tSWj1 , . . . , tSWjk , . . . , tSWjKj_, where tSWjk is the kth relevanttoken and Kj , with Kj ≤ Hj , is the total numberof relevant tokens in SUMSWj . We recall that a relevanttoken is a token that does not belong to the set of stopwords;c) stemming is the process of reducing each word (i.e.,token) to its stem or root form, by removing its suffix. Thepurpose of this step is to group words with the same themehaving closely related semantics. In the proposed system,the stemmer exploits the Snowball Tartarus Stemmer7 forthe Italian language, based on the Porter’s algorithm [36].Hence, at the end of this step each SUM is represented asa sequence of stems extracted from the tokens containedin it. We denote the jth stemmed SUM as SUMS_ j =tSj1, . . . , tSjl, . . . , tSjLj_, where tSjl is the lth stem and Lj ,with Lj ≤ Kj , is the total number of stems in SUMSj ;d) stem filtering consists in reducing the number of stems ofeach SUM. In particular, each SUM is filtered by removingfrom the set of stems the ones not belonging to theset of relevant stems. The set of F relevant stems RS ={ˆs1, . . . , ˆsf , . . . , ˆsF } is identified during the supervisedlearning stage that will be discussed in Section IV.At the end of this step, each SUM is represented asa sequence of relevant stems. We denote the jth filteredSUM as SUMSFj =_tSFj1 , . . . , tSFjp , . . . , tSFjPj_, where6http://snowball.tartarus.org/algorithms/italian/stop.txt7http://snowball.tartarus.org/algorithms/italian/stemmer.htmlD’ANDREA et al.: REAL-TIME DETECTION OF TRAFFIC FROM TWITTER STREAM ANALYSIS 2273Fig. 2. Steps of the text mining elaboration applied to a sample tweet.tSFjp∈ RS is the pth relevant stem and Pj , with Pj ≤ Ljand Pj ≤ F, is the total number of relevant stems inSUMSFj ;e) feature representation consists in building, for eachSUM, the corresponding vector of numeric features. Indeed,in order to classify the SUMs, we have to representthem in the same feature space. In particular,we consider the F-dimensional set of features X ={X1, . . . , Xf, . . . , XF } corresponding to the set of relevantstems. For each SUMSFj we define the vectorxj = {xj1, . . . , xjf , . . . , xjF } where each element is setaccording to the following formula:xjf =_wf if stem ˆsf ∈ SUMSFj0 otherwise.(1)In (1), wf is the numeric weight associated to therelevant stem ˆsf : we will discuss how this weight iscomputed in Section IV.In Fig. 2, we summarize all the steps applied to a sampletweet by the “Elaboration of SUMs” module.C. Classification of SUMsThe third module, “Classification of SUMs”, assigns to eachelaborated SUM a class label related to traffic events. Thus, theoutput of this module is a collection of N labeled SUMs. To theaim of labeling each SUM, a classification model is employed.The parameters of the classification model have been identifiedduring the supervised learning stage. Actually, as it will bediscussed in Section V, different classification models havebeen considered and compared. The classifier that achievedthe most accurate results was finally employed for the realtimemonitoring with the proposed traffic detection system. Thesystem continuously monitors a specific region and notifies thepresence of a traffic event on the basis of a set of rules that canbe defined by the system administrator. For example, when thefirst tweet is recognized as a traffic-related tweet, the systemmay send a warning signal. Then, the actual notification of thetraffic event may be sent after the identification of a certainnumber of tweets with the same label.IV. SETUP OF THE SYSTEMAs stated previously, a supervised learning stage is requiredto perform the setup of the system. In particular, we need toidentify the set of relevant stems, the weights associated witheach of them, and the parameters that describe the classificationmodels. We employ a collection of Ntr labeled SUMs astraining set. During the learning stage, each SUM is elaboratedby applying the tokenization, stop-word filtering, and stemmingsteps. Then, the complete set of stems is built as follows:CS =⎛⎝N_trj=1SUMSj⎞⎠ = {s1, . . . , sq, . . . , sQ}. (2)CS is the union of all the stems extracted from the Ntr SUMsof the training set. We recall that SUMSj is the set of stemsthat describes the jth SUM after the stemming step in thetraining set.Then, we compute the weight of each stem in CS, whichallows us to establish the importance of each stem sq in thecollection of SUMs of the training set, by using the InverseDocument Frequency (IDF) index aswq = IDFq = ln(Ntr/Nq), (3)where Nq is the number of SUMs of the training set in whichthe stem sq occurs [37]. The IDF index is a simplified version ofthe TF-IDF (Term Frequency-IDF) index [38]–[40], where theTF part considers the frequency of a specific stem within eachSUM. In fact, we heuristically found that the same stem seldomappears more than once in an SUM. On the other hand, we performedseveral experiments also with the TF-IDF index and we2274 IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 16, NO. 4, AUGUST 2015verified that the performance in terms of classification accuracyis similar to the one obtained by using only the IDF index. Thus,we decided to adopt the simpler IDF index as weight.In order to select the set of relevant stems, a feature selectionalgorithm is applied. SUMs are described by a set{S1, . . . , Sq, . . . , SQ} of Q features, where each feature Sqcorresponds to the stem sq. The possible values of feature Sqare wq and 0.Then, as suggested in [41], to evaluate the quality of eachstem sq, we employ a method based on the computation ofthe Information Gain (IG) value between feature Sq and outputC = {c1, . . . , cr, . . . , cR}, where cr is one of the R possibleclass labels (two or three in our case). The IG value between Sqand C is calculated as IG(C, Sq) = H(C) − H(C|Sq), whereH(C) represents the entropy of C, and H(C|Sq) represents theentropy of C after the observation of feature Sq.Finally, we identified the set of relevant stems RS by selectingall the stems which have a positive IG value. We recall thatthe stem selection process based on IG values is a standard andeffective method widely used in the literature [40], [42].The last part of the supervised learning stage regards theidentification of the most suited classification models and thesetting of their structural parameters. We took into accountseveral classification algorithms widely used in the literaturefor text classification tasks [43], namely, i) SVM [44], ii) NB[45], iii) C4.5 decision tree [46], iv) k-nearest neighbor (kNN)[47], and v) PART [48]. The learning algorithms used to buildthe aforementioned classifiers will be briefly discussed in thefollowing section.V. EVALUATION OF THE TRAFFIC DETECTION SYSTEMIn this section, we discuss the evaluation of the proposedsystem. We performed several experiments using two differentdatasets. For each dataset, we built and compared seven differentclassification models: SVM, NB, C4.5, kNN (with k equalto 1, 2, and 5), and PART. In the following, we describe howwe generated the datasets to complete the setup of the system,and we recall the employed classification models. Then, wepresent the achieved results, and the statistical metrics used toevaluate the performance of the classifiers. Finally, we providea comparison with some results extracted from other works inthe literature.A. Description of the DatasetsWe built two different datasets, i.e., a 2-class dataset, and a3-class dataset. For each dataset, tweets in the Italian languagewere collected using the “Fetch of SUMs and Pre-processing”module by setting some search criteria (e.g., presence of keywords,geographic coordinates, date and time of posting). Then,the SUMs were manually labeled, by assigning the correct classlabel.1) 2-Class Dataset: The first dataset consists of tweetsbelonging to two possible classes, namely i) road traffic-relatedtweets (traffic class), and ii) tweets not related with road traffic(non-traffic class). The tweets were fetched in a time span ofabout four hours from the same geographic area. First, wefetched candidate tweets for traffic class by using the followingsearch criteria:— geographic area of origin of the tweet: Italy. We setthe center of the area in Rome (latitude and longitudeequal to 41◦ 53’ 35” and 12◦ 28’ 58”, respectively)and we set a radius of about 600 km to cover approximatelythe whole country;— time and date of posting: tweets belong to a timespan of four evening hours of two weekend days ofMay 2013;— keywords contained in the text of the tweet: we applythe or-operator on the set of keywords S1, composedby the three most frequently used traffic-relatedkeywords, S1 = {“traffico”(traffic), “coda”(queue),“incidente”(crash)} , with the aim of selecting tweetscontaining at least one of the above keywords. Theresulting condition can be expressed by:CondA: “traffico” or “coda” or “incidente”.Then, we fetched the candidate tweets for non-traffic classusing the same search criteria for geographic area, and timeand date, but without setting any keyword. Obviously, this time,tweets containing traffic-related keywords from set S1, alreadyfound in the previous fetch, were discarded.Finally, the tweets were manually labeled with two possibleclass labels, i.e., as related to road traffic event (traffic), e.g.,accidents, jams, queues, or not (non-traffic). More in detail,first we read, interpreted, and correctly assigned a traffic classlabel to each candidate traffic class tweet. Among all candidatetraffic class tweets, we actually labeled 665 tweets with thetraffic class. About 4% of candidate traffic class tweets werenot labeled with the traffic class label.With the aim of correctlytraining the system, we added these tweets to the non-trafficclass. Indeed, we collected also a number of tweets containingthe traffic-related keywords defined in S1, but actually notconcerning road traffic events. Such tweets are related to, e.g.,illegal drug trade, network traffic, or organ trafficking. It isworth noting that, as it happens in the English language, severalwords in the Italian language, e.g., “traffic” or “incident”, aresuitable in several contexts. So, for instance, the events “trafficodi droga” (drug trade), “traffico di organi” (organ trafficking),“incidente diplomatico” (diplomatic scandal), “traffico dati”(network traffic) could be easily mistaken for road trafficrelatedevents.Then, in order to obtain a balanced dataset, we randomlyselected tweets from the candidate tweets of non-traffic classuntil reaching 665 non-traffic class tweets, and we manuallyverified that the selected tweets did not belong to the trafficclass. Thus, the final 2-class dataset consists of 1330 tweets andis balanced, i.e., it contains 665 tweets per class.Table I shows the textual part of a selection of tweets fetchedby the system with the corresponding, manually added, classlabel. In Table I, tweets #1, #2 and #3 are examples of trafficclass tweets, tweet #4 is an example of a non-traffic class tweet,tweets #5 and #6 are examples of tweets containing trafficrelatedkeywords, but belonging to the non-traffic class. In thetable, for an easier understanding, the keywords appearing inD’ANDREA et al.: REAL-TIME DETECTION OF TRAFFIC FROM TWITTER STREAM ANALYSIS 2275TABLE ISOME EXAMPLES OF TWEETS AND CORRESPONDING CLASSES FOR THE 2-CLASS DATASETTABLE IISIGNIFICANT FEATURES RELATED TO THE TRAFFIC CLASSthe text of each tweet are underlined. Table II shows some of themost important textual features (i.e., stems) and their meaning,related to the traffic class tweets, identified by the system forthis dataset.2) 3-Class Dataset: The second dataset consists of tweetsbelonging to three possible classes. In this case we want todiscriminate if traffic is caused by an external event (e.g., a footballmatch, a concert, a flash-mob, a political demonstration,a fire) or not. Even though the current release of the systemwas not designed to identify the specific event, knowing thatthe traffic difficulty is caused by an external event could beuseful to traffic and city administrations, for regulating trafficand vehicular mobility, or managing scheduled events in thecity. More in detail, we took into account four possible externalevents, namely, i) matches, ii) processions, iii) music concerts,and iv) demonstrations. Thus, in this dataset the three possibleclasses are: i) traffic due to external event, ii) traffic congestionor crash, and iii) non-traffic. The tweets were fetched in asimilar way as described before. More in detail, first, we fetchedcandidate road traffic-related tweets due to an external event(traffic due to external event class) according to the followingsearch criteria:— geographic area of origin of the tweet: Italy, parametersset as in the case of the 2-class dataset;— time and date of posting: parameters set as in the caseof the 2-class dataset, but different hours of the sameweekend days are used;— keywords contained in the text of the tweet: foreach external event aforementioned, we took into accountonly one keyword, thus obtaining the set S2 ={“partita”(match), “processione” (procession), “concerto”(concert), “manifestazione” (demonstration)}.Next we combined each keyword representing theexternal event with one of the traffic-related keywordsfrom set S3 = {“traffico”(traffic), “coda”(queue)}.Finally, we applied the and-operator between eachkeyword from set S2 and the conditionCondB expressed as:CondB: “traffico” or “coda”,thus obtaining the following conditions:CondC: CondB and “partita”,CondD: CondB and “processione”CondE: CondB and “concerto”,CondF : CondB and “manifestazione”.Then, we fetched candidate tweets related to traffic congestion,crashes, and jams (traffic congestion or crash class) byusing the following search criteria:— geographic area of origin of the tweet: Italy, parametersset as as in the case of the 2-class dataset;— time and date of posting: parameters set as in the caseof the 2-class dataset, but different hours of the sameweekend days are used;— keywords contained in the text of the tweet: we combinedthe mentioned above keywords from set S1 inthree possible sets: S4={“traffico”(traffic), “incidente”(crash)}, S5 = {“incidente”(crash), coda(queue)},and the already defined set S3. Then we used theand-operator to define the exploited conditions asfollows:CondG: “traffico” and “incidente”,CondH: “traffico” and “coda”,CondI : “incidente” and “coda”.Obviously, as done before, tweets containing external eventrelatedkeywords, already found in the previous fetch, werediscarded. Further, we fetched the candidate tweets of nontrafficclass using the same search criteria for geographic area,and time and date, but without setting any keyword. Again,tweets already found in previous fetches were discarded.Finally, the tweets were manually labeled with three possibleclass labels. We first labeled the candidate tweets of trafficdue to external event class (this set of tweets was the smallerone), and we identified 333 tweets actually associated with thisclass. Then, we randomly selected 333 tweets for each of the2276 IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 16, NO. 4, AUGUST 2015TABLE IIISOME EXAMPLES OF TWEETS AND CORRESPONDING CLASSES FOR THE 3-CLASS DATASETtwo remaining classes. Also, in this case, we manually verifiedthe correctness of the labels associated to the selected tweets.Finally, as done before, we added to the non-traffic class alsotweets containing keywords related to traffic congestion and toexternal events but not concerning road-traffic events. The final3-class dataset consists of 999 tweets and it is balanced, i.e., ithas 333 tweets per class.Table III shows a selection of tweets fetched by the systemfor the 3-class dataset, with the corresponding, manually added,class label. In Table III, tweets #1, #2, #3 and #4 are examplesof tweets belonging to the class traffic due to external event: inmore detail, #1 is related to a procession event, #2 is relatedto a match event, #3 is related to a concert event, and #4is related to a demonstration event. Tweet #5 is an exampleof a tweet belonging to the class traffic congestion or crash,while tweets #6 and #7 are examples of non-traffic class tweets.Words underlined in the text of each tweet represent involvedkeywords.B. Employed Classification ModelsIn the following we briefly describe the main properties ofthe employed and experimented classification models.SVMs, introduced for the first time in [49], are discriminativeclassification algorithms based on a separating hyper-planeaccording to which new samples can be classified. The besthyper-plane is the one with the maximum margin, i.e., thelargest minimum distance, from the training samples and iscomputed based on the support vectors (i.e., samples of thetraining set). The SVM classifier employed in this work is theimplementation described in [44].The NB classifier [45] is a probabilistic classification algorithmbased on the application of the Bayes’s theorem, andis characterized by a probability model which assumes independenceamong the input features. In other words, the modelassumes that the presence of a particular feature is unrelated tothe presence of any other feature.The C4.5 decision tree algorithm [46] generates a classificationdecision tree by recursively dividing up the training dataaccording to the values of the features. Non-terminal nodesof the decision tree represent tests on one or more features,while terminal nodes represent the predicted output, namely theclass. In the resulting decision tree each path (from the rootto a leaf) identifies a combination of feature values associatedwith a particular classification. At each level of the tree, thealgorithm chooses the feature that most effectively splits thedata, according to the highest information gain.The kNN algorithm [50] belongs to the family of “lazy”classification algorithms. The basic functioning principle is thefollowing: each unseen sample is compared with a number ofpre-classified training samples, and its similarity is evaluatedaccording to a simple distance measure (e.g., we employed thenormalized Euclidean distance), in order to find the associatedoutput class. The parameter k allows specifying the number ofneighbors, i.e., training samples to take into account for theclassification. We focus on three kNN models with k equal to1, 2, and 5. The kNN classifier employed in this work followsthe implementation described in [47].The PART algorithm [48] combines two rule generationmethods, i.e., RIPPER [51] and C4.5 [46]. It infers classificationrules by repeatedly building partial, i.e., incomplete,C4.5 decision trees and by using the separate-and-conquer rulelearning technique [52].C. Experimental ResultsIn this section, we present the classification results achievedby applying the classifiers mentioned in Section V-B to thetwo datasets described in Section V-A. For each classifier theexperiments were performed using an n-fold stratified crossvalidationmethodology. In n-fold stratified cross-validation,the dataset is randomly partitioned into n folds and the classesin each fold are represented with the same proportion as inthe original data. Then, the classification model is trained onn − 1 folds, and the remaining fold is used for testing themodel. The procedure is repeated n times, using as test dataeach of the n folds exactly once. The n test results are finallyaveraged to produce an overall estimation. We repeated ann-fold stratified cross-validation, with n = 10, for two times,using two different seed values to randomly partition the datainto folds.D’ANDREA et al.: REAL-TIME DETECTION OF TRAFFIC FROM TWITTER STREAM ANALYSIS 2277TABLE IVSTATISTICAL METRICSWe recall that, for each fold, we consider a specific trainingset which is used in the supervised learning stage to learnboth the pre-processing (i.e., the set of relevant stems and theirweights) and the classification model parameters.To evaluate the achieved results, we employed the mostfrequently used statistical metrics, i.e., precision, accuracy,recall, and F-score. To explain the meaning of the metrics,we will refer, for the sake of simplicity, to the case of abinary classification, i.e., positive class versus negative class.In fact, in the case of a multi-class classification, the metricsare computed per class and the overall statistical measure issimply the average of the per-class measures. The correctness ofa classification can be evaluated according to four values: i) truepositives (TP): the number of real positive samples correctlyclassified as positive; ii) true negatives (TN): the number ofreal negative samples correctly classified as negative; iii) falsepositives (FP): the number of real negative samples incorrectlyclassified as positive; iv) false negatives (FN): the number ofreal positive samples incorrectly classified as negative.Based on the previous definitions, we can now formallydefine the employed statistical metrics and provide, in Table IV,the corresponding equations. Accuracy represents the overalleffectiveness of the classifier and corresponds to the number ofcorrectly classified samples over the total number of samples.Precision is the number of correctly classified samples of aclass, i.e., positive class, over the number of samples classifiedas belonging to that class. Recall is the number of correctlyclassified samples of a class, i.e., positive class, over the numberof samples of that class; it represents the effectiveness of theclassifier to identify positive samples. The F-score (typicallyused with β = 1 for class-balanced datasets) is the weightedharmonic mean of precision and recall and it is used to comparedifferent classifiers.In the first experiment, we performed a classification oftweets using the 2-class dataset (R = 2) consisting of 1330tweets, described in Section V-A. The aim is to assign a classlabel (traffic or non-traffic) to each tweet.Table V shows the average classification results obtained bythe classifiers on the 2-class dataset. More in detail, the tableshows for each classifier, the accuracy, and the per-class valueof recall, precision, and F-score. All the values are averagedover the 20 values obtained by repeating two times the 10-foldcross validation. The best classifier resulted to be the SVM withan average accuracy of 95.75%.As Table VI clearly shows, the results achieved by our SVMclassifier appreciably outperform those obtained in similarworks in the literature [9], [12], [24], [31] despite they refer todifferent datasets. More precisely, Wanichayapong et al. [12]obtained an accuracy of 91.75% by using an approach thatconsiders the presence of place mentions and special keywordsin the tweet. Li et al. [31] achieved an accuracy of 80% fordetecting incident-related tweets using Twitter specific features,such as hashtags, mentions, URLs, and spatial and temporalinformation. Sakaki et al. [9] employed an SVM to identifyheavy-traffic tweets and obtained an accuracy of 87%. Finally,Schulz et al. [24], by using SVM, RIPPER, and NB classifiers,obtained accuracies of 89.06%, 85.93%, and 86.25%, respectively.In the case of SVM, they used the following features:word n-grams, TF-IDF score, syntactic and semantic features.In the case of NB and RIPPER they employed the same set offeatures except semantic features.In the second experiment, we performed a classificationof tweets over three classes (R = 3), namely, traffic due toexternal event, traffic congestion or crash, and non-traffic, withthe aim of discriminating the cause of traffic. Thus, we employedthe 3-class dataset consisting of 999 tweets, describedin Section V-A. We employed again the classifiers previouslyintroduced and the obtained results are shown in Table VII.The best classifier resulted to be again SVM with an averageaccuracy of 88.89%.In order to verify if there exist statistical differences amongthe values of accuracy achieved by the seven classificationmodels, we performed a statistical analysis of the results. Wetook into account the model which obtains the best averageaccuracy, i.e., the SVM model. As suggested in [53], we appliednon-parametric statistical tests: for each classifier we generateda distribution consisting of the 20 values of the accuracieson the test set obtained by repeating two times the 10-foldcross validation. We statistically compared the results achievedby the SVM model with the ones achieved by the remainingmodels. We applied the Wilcoxon signed-rank test [54], whichdetects significant differences between two distributions. In allthe tests, we used α = 0.05 as level of significance. Tables VIIIand IX show the results of the Wilcoxon signed-rank test, relatedto the 2-class and the 3-class problems, respectively. In thetables R+ and R− denote, respectively, the sum of ranks for thefolds in which the first model outperformed the second, andthe sum of ranks for the opposite condition. Since the p-valuesare always lower than the level of significance we can alwaysreject the statistical hypothesis of equivalence. For this reason,we can state that the SVM model statistically outperforms allthe other approaches on both the problems.VI. REAL-TIME DETECTION OF TRAFFIC EVENTSThe developed system was installed and tested for the realtimemonitoring of several areas of the Italian road network,by means of the analysis of the Twitter stream coming fromthose areas. The aim is to perform a continuous monitoring offrequently busy roads and highways in order to detect possibletraffic events in real-time or even in advance with respect to thetraditional news media [55], [56]. The system is implemented as2278 IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 16, NO. 4, AUGUST 2015TABLE VCLASSIFICATION RESULTS ON THE 2-CLASS DATASET (BEST VALUES IN BOLD)TABLE VIRESULTS OF THE CLASSIFICATION OF TWEETS IN OTHER WORKS IN THE LITERATURETABLE VIICLASSIFICATION RESULTS ON THE 3-CLASS DATASET (BEST VALUES IN BOLD)TABLE VIIIRESULTS OF THE WILCOXON SIGNED-RANK TEST ON THE ACCURACIESOBTAINED ON THE TEST SET FOR THE 2-CLASS DATASETa service of a wider service-oriented platform to be developedin the context of the SMARTY project [23]. The service canbe called by each user of the platform, who desires to knowTABLE IXRESULTS OF THE WILCOXON SIGNED-RANK TEST ON THE ACCURACIESOBTAINED ON THE TEST SET FOR THE 3-CLASS DATASETthe traffic conditions in a certain area. In this section, weaim to show the effectiveness of our system in determiningtraffic events in short time. We just present some results for theD’ANDREA et al.: REAL-TIME DETECTION OF TRAFFIC FROM TWITTER STREAM ANALYSIS 2279TABLE XREAL-TIME DETECTION OF TRAFFIC EVENTS2-class problem. For the setup of the system, we have employedas training set the overall dataset described in Section V-A.We adopt only the best performing classifier, i.e., the SVMclassifier. During the learning stage, we identified Q = 3227features, which were reduced to F = 582 features after thefeature selection step.2280 IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 16, NO. 4, AUGUST 2015TABLE X(CONTINUED.) REAL-TIME DETECTION OF TRAFFIC EVENTSThe system continuously performs the following operations:it i) fetches, with a time frequency of z minutes, tweets originatedfrom a given area, containing the keywords resulting fromCondA, ii) performs a real-time classification of the fetchedtweets, iii) detects a possible traffic-related event, by analyzingthe traffic class tweets from the considered area, and, if needed,sends one or more traffic warning signals with increasingintensity for that area. More in detail, a first low-intensitywarning signal is sent when m traffic class tweets are foundin the considered area in the same or in subsequent temporalwindows. Then, as the number of traffic class tweets grows,the warning signal becomes more reliable, thus more intense.The value of m was set based on heuristic considerations,depending, e.g., on the traffic density of the monitored area.In the experiments we set m = 1. As regards the fetching frequencyz, we heuristically found that z = 10 minutes representsa good compromise between fast event detection and systemscalability. In fact, z should be set depending on the number ofmonitored areas and on the volume of tweets fetched.With the aim of evaluating the effectiveness of our system,we need that each detected traffic-related event is appropriatelyvalidated. Validation can be performed in different wayswhich include: i) direct communication by a person, who waspresent at the moment of the event, ii) reports drawn up by thepolice and/or local administrations (available only in case ofincidents), iii) radio traffic news; iv) official real-time trafficnews web sites; v) local newspapers (often the day after theevent and only when the event is very significant).Direct communication is possible only if a person is presentat the event and can communicate this event to us. Although wehave tried to sensitize a number of users, we did not obtain anadequate feedback. Official reports are confidential: police andlocal administrations barely allow accessing to these reports,and, when this permission is granted, reports can be consultedonly after several days. Radio traffic news are in general quiteprecise in communicating traffic-related events in real time. Unfortunately,to monitor and store the events, we should dedicatea person or adopt some tool for audio analysis. We realizedhowever that the traffic-related events communicated on theradio are always mentioned also in the official real-time trafficnews web sites. Actually, on the radio, the speaker typicallyreads the news reported on the web sites. Local newspapersfocus on local traffic-related events and often provide eventswhich are not published on official traffic news web sites.Concluding, official real-time traffic news web sites and localnewspapers are the most reliable and effective sources of informationfor traffic-related events. Thus, we decided to analyzetwo of the most popular real-time traffic news web sites for theD’ANDREA et al.: REAL-TIME DETECTION OF TRAFFIC FROM TWITTER STREAM ANALYSIS 2281Italian road network, namely “CCISS Viaggiare informati”,8managed by the Italian government Ministry for infrastructuresand transports, and “Autostrade per l’Italia”,9 the official website of Italian highway road network. Further, we examinedlocal newspapers published in the zones where our system wasable to detect traffic-related events.Actually, it was really difficult to find realistic data to test theproposed system, basically for two reasons: on the one hand, wehave realized that real traffic events are not always notified inofficial news channels; on the other hand, situations of trafficslowdown may be detected by traditional traffic sensors but,at the same time, may not give rise to tweets. In particular,in relation to this latter reason, it is well known that driversusually share a tweet about a traffic event only when theevent is unexpected and really serious, i.e., it forces to stopthe car. So, for instance, they do not share a tweet in caseof road works, minor traffic difficulties, or usual traffic jams(same place and same time). In fact, in correspondence tominor traffic jams we rarely find tweets coming from the affectedarea.We have tried to build a meaningful set of traffic events,related to some major Italian cities, of which we have found anofficial confirmation. The selected set includes events correctlyidentified by the proposed system and confirmed via officialtraffic news web sites or local newspapers. The set of trafficevents, whose information is summarized in Table X, consistsof 70 events detected by our system. The events are relatedboth to highways and to urban roads, and were detected duringSeptember and early October 2014.Table X shows the information about the event, the time ofdetection from Twitter’s stream fetched by our system, the timeof detection from official news websites or local newspapers,and the difference between these two times. In the table, positivedifferences indicate a late detection with respect to officialnews web sites, while negative differences indicate an earlydetection. The symbol “-” indicates that we found the officialconfirmation of the event by reading local newspapers severalhours late. More precisely, the system detects in advance 20events out of 59 confirmed by news web sites, and 11 eventsconfirmed the day after by local newspapers. Regarding the39 events not detected in advance we can observe that 25 ofsuch events are detected within 15 minutes from their officialnotification, while the detection of the remaining 14 eventsoccurs beyond 15 minutes but within 50 minutes. We wish topoint out, however, that, even in the cases of late detection, oursystem directly and explicitly notifies the event occurrence tothe drivers or passengers registered to the SMARTY platform,on which our system runs. On the contrary, in order to get trafficinformation, the drivers or passengers usually need to searchand access the official news websites, which may take sometime and effort, or to wait for getting the information from theradio traffic news.As future work, we are planning to integrate our systemwith an application for analyzing the official traffic news websites, so as to capture traffic condition notifications in real-time.8http://www.cciss.it/9http://www.autostrade.it/autostrade-gis/gis.doThus, our system will be able to signal traffic-related eventsin the worst case at the same time of the notifications on theweb sites. Further, we are investigating the integration of oursystem into a more complex traffic detection infrastructure.This infrastructure may include both advanced physical sensorsand social sensors such as streams of tweets. In particular, socialsensors may provide a low-cost wide coverage of the roadnetwork, especially in those areas (e.g., urban and suburban)where traditional traffic sensors are missing.VII. CONCLUSIONIn this paper, we have proposed a system for real-timedetection of traffic-related events from Twitter stream analysis.The system, built on a SOA, is able to fetch and classify streamsof tweets and to notify the users of the presence of trafficevents. Furthermore, the system is also able to discriminate if atraffic event is due to an external cause, such as football match,procession and manifestation, or not.We have exploited available software packages and state-ofthe-art techniques for text analysis and pattern classification.These technologies and techniques have been analyzed, tuned,adapted and integrated in order to build the overall systemfor traffic event detection. Among the analyzed classifiers, wehave shown the superiority of the SVMs, which have achievedaccuracy of 95.75%, for the 2-class problem, and of 88.89%for the 3-class problem, in which we have also considered thetraffic due to external event class.The best classification model has been employed for realtimemonitoring of several areas of the Italian road network.Wehave shown the results of a monitoring campaign, performed inSeptember and early October 2014. We have discussed the capabilityof the system of detecting traffic events almost in realtime,often before online news web sites and local newspapers.ACKNOWLEDGMENTWe would like to thank Fabio Cempini for the implementationof some parts of the system presented in this paper.REFERENCES[1] F. Atefeh and W. Khreich, “A survey of techniques for event detection inTwitter,” Comput. Intell., vol. 31, no. 1, pp. 132–164, 2015.[2] P. Ruchi and K. Kamalakar, “ET: Events from tweets,” in Proc. 22ndInt. Conf. World Wide Web Comput., Rio de Janeiro, Brazil, 2013,pp. 613–620.[3] A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, andB. Bhattacharjee, “Measurement and analysis of online social networks,”in Proc. 7th ACM SIGCOMM Conf. Internet Meas., San Diego, CA,USA, 2007, pp. 29–42.[4] G. Anastasi et al., “Urban and social sensing for sustainable mobilityin smart cities,” in Proc. IFIP/IEEE Int. Conf. Sustainable Internet ICTSustainability, Palermo, Italy, 2013, pp. 1–4.[5] A. Rosi et al., “Social sensors and pervasive services: Approaches andperspectives,” in Proc. IEEE Int. Conf. PERCOM Workshops, Seattle,WA, USA, 2011, pp. 525–530.[6] T. Sakaki, M. Okazaki, and Y.Matsuo, “Tweet analysis for real-time eventdetection and earthquake reporting system development,” IEEE Trans.Knowl. Data Eng., vol. 25, no. 4, pp. 919–931, Apr. 2013.[7] J. Allan, Topic Detection and Tracking: Event-Based InformationOrganization. Norwell, MA, USA: Kluwer, 2002.[8] K. Perera and D. Dias, “An intelligent driver guidance tool using locationbased services,” in Proc. IEEE ICSDM, Fuzhou, China, 2011,pp. 246–251.2282 IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 16, NO. 4, AUGUST 2015[9] T. Sakaki, Y. Matsuo, T. Yanagihara, N. P. Chandrasiri, and K. Nawa,“Real-time event extraction for driving information from social sensors,”in Proc. IEEE Int. Conf. CYBER, Bangkok, Thailand, 2012,pp. 221–226.[10] B. Chen and H. H. Cheng, “A review of the applications of agent technologyin traffic and transportation systems,” IEEE Trans. Intell. Transp.Syst., vol. 11, no. 2, pp. 485–497, Jun. 2010.[11] A. Gonzalez, L. M. Bergasa, and J. J. Yebes, “Text detection and recognitionon traffic panels from street-level imagery using visual appearance,”IEEE Trans. Intell. Transp. Syst., vol. 15, no. 1, pp. 228–238,Feb. 2014.[12] N. Wanichayapong, W. Pruthipunyaskul, W. Pattara-Atikom, andP. Chaovalit, “Social-based traffic information extraction and classification,”in Proc. 11th Int. Conf. ITST, St. Petersburg, Russia, 2011,pp. 107–112.[13] P. M. d’Orey and M. Ferreira, “ITS for sustainable mobility: A surveyon applications and impact assessment tools,” IEEE Trans. Intell. Transp.Syst., vol. 15, no. 2, pp. 477–493, Apr. 2014.[14] K. Boriboonsomsin, M. Barth, W. Zhu, and A. Vu, “Eco-routing navigationsystem based on multisource historical and real-time trafficinformation,” IEEE Trans. Intell. Transp. Syst., vol. 13, no. 4,pp. 1694–1704, Dec. 2012.[15] J. Hurlock and M. L. Wilson, “Searching twitter: Separating the tweetfrom the chaff,” in Proc. 5th AAAI ICWSM, Barcelona, Spain, 2011,pp. 161–168.[16] J. Weng and B.-S. Lee, “Event detection in Twitter,” in Proc. 5th AAAIICWSM, Barcelona, Spain, 2011, pp. 401–408.[17] S. Weiss, N. Indurkhya, T. Zhang, and F. Damerau, Text Mining: PredictiveMethods for Analyzing Unstructured Information. Berlin, Germany:Springer-Verlag, 2004.[18] A. Hotho, A. Nürnberger, and G. Paaß, “A brief survey of text mining,”LDV Forum-GLDV J. Comput. Linguistics Lang. Technol., vol. 20, no. 1,pp. 19–62, May 2005.[19] V. Gupta, S. Gurpreet, and S. Lehal, “A survey of text mining techniquesand applications,” J. Emerging Technol. Web Intell., vol. 1, no. 1,pp. 60–76, Aug. 2009.[20] V. Ramanathan and T. Meyyappan, “Survey of text mining,” in Proc. Int.Conf. Technol. Bus. Manage., Dubai, UAE, 2013, pp. 508–514.[21] M.W. Berry and M. Castellanos, Survey of Text Mining. NewYork,NY,USA: Springer-Verlag, 2004.[22] H. Takemura and K. Tajima, “Tweet classification based on their lifetimeduration,” in Proc. 21st ACM Int. CIKM, Shanghai, China, 2012,pp. 2367–2370.[23] The Smarty project. [Online]. Available: http://www.smarty.toscana.it/[24] A. Schulz, P. Ristoski, and H. Paulheim, “I see a car crash: Real-timedetection of small scale incidents in microblogs,” in The Semantic Web:ESWC 2013 Satellite Events, vol. 7955. Berlin, Germany: Springer-Verlag, 2013, pp. 22–33.[25] M. Krstajic, C. Rohrdantz, M. Hund, and A. Weiler, “Getting there first:Real-time detection of real-world incidents on Twitter” in Proc. 2nd IEEEWork Interactive Vis. Text Anal.—Task-Driven Anal. Soc. Media IEEEVisWeek,” Seattle, WA, USA, 2012.[26] C. Chew and G. Eysenbach, “Pandemics in the age of Twitter: Contentanalysis of tweets during the 2009 H1N1 outbreak,” PLoS ONE, vol. 5,no. 11, pp. 1–13, Nov. 2010.[27] B. De Longueville, R. S. Smith, and G. Luraschi, “OMG, from here, I cansee the flames!: A use case of mining location based social networks toacquire spatio-temporal data on forest fires,” in Proc. Int. Work. LBSN,2009 Seattle, WA, USA, pp. 73–80.[28] J. Yin, A. Lampert, M. Cameron, B. Robinson, and R. Power, “Usingsocial media to enhance emergency situation awareness,” IEEE Intell.Syst., vol. 27, no. 6, pp. 52–59, Nov./Dec. 2012.[29] P. Agarwal, R. Vaithiyanathan, S. Sharma, and G. Shro, “Catching thelong-tail: Extracting local news events from Twitter,” in Proc. 6th AAAIICWSM, Dublin, Ireland, Jun. 2012, pp. 379–382.[30] F. Abel, C. Hauff, G.-J. Houben, R. Stronkman, and K. Tao,“Twitcident: fighting fire with information from social web streams,”in Proc. ACM 21st Int. Conf. Comp. WWW, Lyon, France, 2012,pp. 305–308.[31] R. Li, K. H. Lei, R. Khadiwala, and K. C.-C. Chang, “TEDAS: A Twitterbasedevent detection and analysis system,” in Proc. 28th IEEE ICDE,Washington, DC, USA, 2012, pp. 1273–1276.[32] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, andI. H. Witten, “The WEKA data mining software: An update,” SIGKDDExplor. Newsl., vol. 11, no. 1, pp. 10–18, Jun. 2009.[33] M. Habibi, Real World Regular Expressions with Java 1.4. Berlin,Germany: Springer-Verlag, 2004.[34] Y. Zhou and Z.-W. Cao, “Research on the construction and filter methodof stop-word list in text preprocessing,” in Proc. 4th ICICTA, Shenzhen,China, 2011, vol. 1, pp. 217–221.[35] W. Francis and H. Kucera, “Frequency analysis of English usage:Lexicon and grammar,” J. English Linguistics, vol. 18, no. 1, pp. 64–70,Apr. 1982.[36] M. F. Porter, “An algorithm for suffix stripping,” Program: Electron.Library Inf. Syst., vol. 14, no. 3, pp 130–137, 1980.[37] G. Salton and C. Buckley, “Term-weighting approaches in automatic textretrieval,” Inf. Process. Manage., vol. 24, no. 5, pp. 513–523, Aug. 1988.[38] L. M. Aiello et al., “Sensing trending topics in Twitter,” IEEE Trans.Multimedia, vol. 15, no. 6, pp. 1268–1282, Oct. 2013.[39] C. Shang, M. Li, S. Feng, Q. Jiang, and J. Fan, “Feature selection viamaximizing global information gain for text classification,” Knowl.-BasedSyst., vol. 54, pp. 298–309, Dec. 2013.[40] L. H. Patil and M. Atique, “A novel feature selection based on informationgain using WordNet,” in Proc. SAI Conf., London, U.K., 2013,pp. 625–629.[41] M. A. Hall and G. Holmes. “Benchmarking attribute selection techniquesfor discrete class data mining,” IEEE Trans. Knowl. Data Eng., vol. 15,no. 6, pp. 1437–1447, Nov./Dec. 2003.[42] H. U˘guz, “A two-stage feature selection method for text categorization byusing information gain, principal component analysis and genetic algorithm,”Knowl.-Based Syst., vol. 24, no. 7, pp. 1024–1032, Oct. 2011.[43] Y. Aphinyanaphongs et al., “A comprehensive empirical comparisonof modern supervised classification and feature selection methods fortext categorization,” J. Assoc. Inf. Sci. Technol., vol. 65, no. 10,pp. 1964–1987, Oct. 2014.[44] J. Platt, “Fast training of support vector machines using sequentialminimal optimization,” in Advances in Kernel Methods: Support VectorLearning, B. Schoelkopf, C. J. C. Burges and A. J. Smola, Eds.Cambridge, MA, USA, MIT Press, 1999, pp. 185–208.[45] G. H. John and P. Langley, “Estimating continuous distributions inBayesian classifiers,” in Proc. 11th Conf. Uncertainty Artif. Intell.,San Mateo, CA, 1995, pp. 338–345.[46] J. R. Quinlan, C4.5: Programs for Machine Learning. San Mateo, CA,USA: Morgan Kaufmann, 1993.[47] D. W. Aha, D. Kibler, and M. K. Albert, “Instance-based learningalgorithms,” Mach. Learn., vol. 6, no. 1, pp. 37–66, Jan. 1991.[48] E. Frank and I. H. Witten, “Generating accurate rule sets withoutglobal optimization,” in Proc. 15th ICML, Madison, WI, USA, 1998,pp. 144–151.[49] C. Cortes and V. Vapnik, “Support-vector networks,” Mach. Learn.,vol. 20, no. 3, pp. 273–297, Sep. 1995.[50] T. T. Cover and P. E. Hart, “Nearest neighbour pattern classification,”IEEE Trans. Inf. Theory, vol. IT-13, no. 1, pp. 21–27, Jan. 1967.[51] W. W. Cohen, “Fast effective rule induction,” in Proc. 12th ICML, TahoeCity, CA, USA, 1995, pp. 115–123.[52] G. Pagallo and D. Haussler, “Boolean feature discovery in empiricallearning,” Mach. Learn., vol. 5, no. 1, pp. 71–99, Mar. 1990.[53] J. Derrac, S. Garcia, D. Molina, and F. Herrera, “A practical tutorial onthe use of nonparametric statistical tests as a methodology for comparingevolutionary and swarm intelligence algorithms,” Swarm Evol. Comput.,vol. 1, no. 1, pp. 3–18, Mar. 2011.[54] F. Wilcoxon, “Individual comparisons by ranking methods,” BiometricsBull. , vol. 1, no. 6, pp. 80–83, Dec. 1945.[55] H. Becker, M. Naaman, and L. Gravano, “Beyond trending topics:Real-world event identification on Twitter,” in Proc. 5th AAAI ICWSM,Barcelona, Spain, 2011, pp. 438–441.[56] H. Kwak, C. Lee, H. Park, and S. Moon, “What is Twitter, a social networkor a news media?” in Proc. ACM 19th Int. Conf. WWW, Raleigh, NY,USA, 2010, pp. 591–600.Eleonora D’Andrea received the M.S. degree incomputer engineering for enterprise managementand the Ph.D. degree in information engineeringfrom University of Pisa, Pisa, Italy, in 2009 and 2013,respectively.She is a Research Fellow with the Research Center“E. Piaggio,” University of Pisa. She has coauthoredseveral papers in international journals and conferenceproceedings. Her main research interests includecomputational intelligence techniques for simulationand prediction, applied to various fields, suchas energy consumption in buildings or energy production in solar photovoltaicinstallations.D’ANDREA et al.: REAL-TIME DETECTION OF TRAFFIC FROM TWITTER STREAM ANALYSIS 2283Pietro Ducange received the M.Sc. degree in computerengineering and the Ph.D. degree in informationengineering from University of Pisa, Pisa, Italy,in 2005 and 2009, respectively.He is an Associate Professor with the Faculty ofEngineering, eCampus University, Novedrate, Italy.He has coauthored more than 30 papers in internationaljournals and conference proceedings. Hismain research interests focus on designing fuzzyrule-based systems with different tradeoffs betweenaccuracy and interpretability by using multiobjectiveevolutionary algorithms. He currently serves the following international journalsas a member of the Editorial Board: Soft Computing and InternationalJournal of Swarm Intelligence and Evolutionary Computation.Beatrice Lazzerini (M’98) is a Full Professor withthe Department of Information Engineering, Universityof Pisa, Pisa, Italy. She has cofounded theComputational Intelligence Group in the Departmentof Information Engineering, University of Pisa. Shehas coauthored seven books and has published over200 papers in international journals and conferences.She is a coeditor of two books. Her research interestsare in the field of computational intelligence and itsapplications to pattern classification, pattern recognition,risk analysis, risk management, diagnosis,forecasting, and multicriteria decision making. She was involved and hadroles of responsibility in several national and international research projects,conferences, and scientific events.Francesco Marcelloni (M’06) received the Laureadegree in electronics engineering and the Ph.D. degreein computer engineering from University ofPisa, Pisa, Italy, in 1991 and 1996, respectively.He is an Associate Professor with University ofPisa. He has cofounded the Computational IntelligenceGroup in the Department of Information Engineering,University of Pisa, in 2002. He is alsothe Founder and Head of the Competence Centreon MObile Value Added Services (MOVAS). Hehas coedited three volumes and four journal SpecialIssues and is the (co)author of a book and of more than 190 papers ininternational journals, books, and conference proceedings. His main researchinterests include multiobjective evolutionary fuzzy systems, situation-awareservice recommenders, energy-efficient data compression and aggregation inwireless sensor nodes, and monitoring systems for energy efficiency in buildings.Currently, he is an Associate Editor for Information Sciences (Elsevier)and Soft Computing (Springer) and is on the Editorial Board of four otherinternational journals.

Public Integrity Auditing for Shared Dynamic Cloud Data with Group User Revocation

05/08/201902/07/2019 by admin

REAL-TIME BIG DATA ANALYTICAL ARCHITECTURE FOR REMOTE

SENSING APPLICATION

ABSTRACT:

In today’s era, there is a great deal added to real-time remote sensing Big Data than it seems at first, and extracting the useful information in an efficient manner leads a system toward a major computational challenges, such as to analyze, aggregate, and store, where data are remotely collected. Keeping in view the above mentioned factors, there is a need for designing a system architecture that welcomes both realtime, as well as offline data processing. In this paper, we propose real-time Big Data analytical architecture for remote sensing satellite application.

The proposed architecture comprises three main units:

1) Remote sensing Big Data acquisition unit (RSDU);

2) Data processing unit (DPU); and

3) Data analysis decision unit (DADU).

First, RSDU acquires data from the satellite and sends this data to the Base Station, where initial processing takes place. Second, DPU plays a vital role in architecture for efficient processing of real-time Big Data by providing filtration, load balancing, and parallel processing. Third, DADU is the upper layer unit of the proposed architecture, which is responsible for compilation, storage of the results, and generation of decision based on the results received from DPU.

INTRODUCTION:

EXISTING SYSTEM:

Existing methods inapplicable on standard computers it is not desirable or possible to load the entire image into memory before doing any processing. In this situation, it is necessary to load only part of the image and process it before saving the result to the disk and proceeding to the next part. This corresponds to the concept of on-the-flow processing. Remote sensing processing can be seen as a chain of events or steps is generally independent from the following ones and generally focuses on a particular domain. For example, the image can be radio metrically corrected to compensate for the atmospheric effects, indices computed, before an object extraction based on these indexes takes place.

The typical processing chain will process the whole image for each step, returning the final result after everything is done. For some processing chains, iterations between the different steps are required to find the correct set of parameters. Due to the variability of satellite images and the variety of the tasks that need to be performed, fully automated tasks are rare. Humans are still an important part of the loop. These concepts are linked in the sense that both rely on the ability to process only one part of the data.

In the case of simple algorithms, this is quite easy: the input is just split into different non-overlapping pieces that are processed one by one. But most algorithms do consider the neighborhood of each pixel. As a consequence, in most cases, the data will have to be split into partially overlapping pieces. The objective is to obtain the same result as the original algorithm as if the processing was done in one go. Depending on the algorithm, this is unfortunately not always possible.

DISADVANTAGES:

A reader that loads the image, or part of the image in memory from the file on disk;

A filter which carries out a local processing that does not require access to neighboring pixels (a simple threshold for example), the processing can happen on CPU or GPU;

A filter that requires the value of neighboring pixels to compute the value of a given pixel (a convolution filter is a typical example), the processing can happen on CPU or GPU;

A writer to output the resulting image in memory into a file on disk, note that the file could be written in several steps. We will illustrate in this example how it is possible to compute part of the image in the whole pipeline, incurring only minimal computation overhead.

PROPOSED SYSTEM:

We present a remote sensing Big Data analytical architecture, which is used to analyze real time, as well as offline data. At first, the data are remotely preprocessed, which is then readable by the machines. Afterward, this useful information is transmitted to the Earth Base Station for further data processing. Earth Base Station performs two types of processing, such as processing of real-time and offline data. In case of the offline data, the data are transmitted to offline data-storage device. The incorporation of offline data-storage device helps in later usage of the data, whereas the real-time data is directly transmitted to the filtration and load balancer server, where filtration algorithm is employed, which extracts the useful information from the Big Data.

On the other hand, the load balancer balances the processing power by equal distribution of the real-time data to the servers. The filtration and load-balancing server not only filters and balances the load, but it is also used to enhance the system efficiency. Furthermore, the filtered data are then processed by the parallel servers and are sent to data aggregation unit (if required, they can store the processed data in the result storage device) for comparison purposes by the decision and analyzing server. The proposed architecture welcomes remote access sensory data as well as direct access network data (e.g., GPRS, 3G, xDSL, or WAN). The proposed architecture and the algorithms are implemented in applying remote sensing earth observatory data.

We proposed architecture has the capability of dividing, load balancing, and parallel processing of only useful data. Thus, it results in efficiently analyzing real-time remote sensing Big Data using earth observatory system. Furthermore, the proposed architecture has the capability of storing incoming raw data to perform offline analysis on largely stored dumps, when required. Finally, a detailed analysis of remotely sensed earth observatory Big Data for land and sea area are provided using .NET. In addition, various algorithms are proposed for each level of RSDU, DPU, and DADU to detect land as well as sea area to elaborate the working of architecture.

ADVANTAGES:

Big Data process high-speed, large amount of real-time remote sensory image data using our proposed architecture. It works on both DPU and DADU by taking data from medical application.

Our architecture for offline as well online traffic, we perform a simple analysis on remote sensing earth observatory data. We assume that the data are big in nature and difficult to handle for a single server.

The data are continuously coming from a satellite with high speed. Hence, special algorithms are needed to process, analyze, and make a decision from that Big Data. Here, in this section, we analyze remote sensing data for finding land, sea, or ice area.

We have used the proposed architecture to perform analysis and proposed an algorithm for handling, processing, analyzing, and decision-making for remote sensing Big Data images using our proposed architecture.

HARDWARE & SOFTWARE REQUIREMENTS:

HARDWARE REQUIREMENT:

v Processor – Pentium –IV

Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Key Board – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA

SOFTWARE REQUIREMENTS:

Operating System : Windows XP or Win7
Front End : JAVA JDK 1.7
Back End : MYSQL Server
Server : Apache Tomact Server
Script : JSP Script
Document : MS-Office 2007

ARCHITECTURE DIAGRAM

MODULES:

DATA ANALYSIS DECISION UNIT (DADU):

DATA PROCESSING UNIT (DPU):

REMOTE SENSING APPLICATION RSDU:

FINDINGS AND DISCUSSION:

ALGORITHM DESIGN AND TESTING:

MODULES DESCRIPTION:

DATA PROCESSING UNIT (DPU):

In data processing unit (DPU), the filtration and load balancer server have two basic responsibilities, such as filtration of data and load balancing of processing power. Filtration identifies the useful data for analysis since it only allows useful information, whereas the rest of the data are blocked and are discarded. Hence, it results in enhancing the performance of the whole proposed system. Apparently, the load-balancing part of the server provides the facility of dividing the whole filtered data into parts and assign them to various processing servers. The filtration and load-balancing algorithm varies from analysis to analysis; e.g., if there is only a need for analysis of sea wave and temperature data, the measurement of these described data is filtered out, and is segmented into parts.

Each processing server has its algorithm implementation for processing incoming segment of data from FLBS. Each processing server makes statistical calculations, any measurements, and performs other mathematical or logical tasks to generate intermediate results against each segment of data. Since these servers perform tasks independently and in parallel, the performance proposed system is dramatically enhanced, and the results against each segment are generated in real time. The results generated by each server are then sent to the aggregation server for compilation, organization, and storing for further processing.

DATA ANALYSIS DECISION UNIT (DADU):

DADU contains three major portions, such as aggregation and compilation server, results storage server(s), and decision making server. When the results are ready for compilation, the processing servers in DPU send the partial results to the aggregation and compilation server, since the aggregated results are not in organized and compiled form. Therefore, there is a need to aggregate the related results and organized them into a proper form for further processing and to store them. In the proposed architecture, aggregation and compilation server is supported by various algorithms that compile, organize, store, and transmit the results. Again, the algorithm varies from requirement to requirement and depends on the analysis needs. Aggregation server stores the compiled and organized results into the result’s storage with the intention that any server can use it as it can process at any time.

The aggregation server also sends the same copy of that result to the decision-making server to process that result for making decision. The decision-making server is supported by the decision algorithms, which inquire different things from the result, and then make various decisions (e.g., in our analysis, we analyze land, sea, and ice, whereas other finding such as fire, storms, Tsunami, earthquake can also be found). The decision algorithm must be strong and correct enough that efficiently produce results to discover hidden things and make decisions. The decision part of the architecture is significant since any small error in decision-making can degrade the efficiency of the whole analysis. DADU finally displays or broadcasts the decisions, so that any application can utilize those decisions at real time to make their development. The applications can be any business software, general purpose community software, or other social networks that need those findings (i.e., decision-making).

REMOTE SENSING APPLICATION RSDU:

Remote sensing promotes the expansion of earth observatory system as cost-effective parallel data acquisition system to satisfy specific computational requirements. The Earth and Space Science Society originally approved this solution as the standard for parallel processing in this particular qualifications for improved Big Data acquisition, soon it was recognized that traditional data processing technologies could not provide sufficient power for processing such kind of data. Therefore, the need for parallel processing of the massive volume of data was required, which could efficiently analyze the Big Data. For that reason, the proposed RSDU is introduced in the remote sensing Big Data architecture that gathers the data from various satellites around the globe as possible that the received raw data are distorted by scattering and absorption by various atmospheric gasses and dust particles. We assume that the satellite can correct the erroneous data.

However, to make the raw data into image format, the remote sensing satellite uses effective data analysis, remote sensing satellite preprocesses data under many situations to integrate the data from different sources, which not only decreases storage cost, but also improves analysis accuracy. The data must be corrected in different methods to remove distortions caused due to the motion of the platform relative to the earth, platform attitude, earth curvature, nonuniformity of illumination, variations in sensor characteristics, etc. The data is then transmitted to Earth Base Station for further processing using direct communication link. We divided the data processing procedure into two steps, such as real-time Big Data processing and offline Big Data processing. In the case of offline data processing, the Earth Base Station transmits the data to the data center for storage. This data is then used for future analyses. However, in real-time data processing, the data are directly transmitted to the filtration and load balancer server (FLBS), since storing of incoming real-time data degrades the performance of real-time processing.

FINDINGS AND DISCUSSION:

Preprocessed and formatted data from satellite contains all or some of the following parts depending on the product.

1) Main product header (MPH): It includes the products basis information, i.e., id, measurement and sensing time, orbit, information, etc.

2) Special products head (SPH): It contains information specific to each product or product group, i.e., number of data sets descriptors (DSD), directory of remaining data sets in the file, etc.

3) Annotation data sets (ADS): It contains information of quality, time tagged processing parameters, geo location tie points, solar, angles, etc.

4) Global annotation data sets (GADs): It contains calling factors, offsets, calibration information, etc.

5) Measurement data set (MDS): It contains measurements or graphical parameters calculated from the measurement including quality flag and the time tag measurement as well. The image data are also stored in this part and are the main element of our analysis.

The MPH and SPH data are in ASCII format, whereas all the other data sets are in binary format. MDS, ADS, and GADs consist of the sequence of records and one or more fields of the data for each record. In our case, the MDS contains number of records, and each record contains a number of fields. Each record of the MDS corresponds to one row of the satellite image, which is our main focus during analysis.

ALGORITHM DESIGN AND TESTING:

Our algorithms are proposed to process high-speed, large amount of real-time remote sensory image data using our proposed architecture. It works on both DPU and DADU by taking data from satellite as input to identify land and sea area from the data set. The set of algorithms contains four simple algorithms, i.e., algorithm I, algorithm II, algorithm III, and algorithm IV that work on filtrations and load balancer, processing servers, aggregation server, and on decision-making server, respectively. Algorithm I, i.e., filtration and load balancer algorithm (FLBA) works on filtration and load balancer to filter only the require data by discarding all other information. It also provides load balancing by dividing the data into fixed size blocks and sending them to the processing server, i.e., one or more distinct blocks to each server. This filtration, dividing, and load-balancing task speeds up our performance by neglecting unnecessary data and by providing parallel processing. Algorithm II, i.e., processing and calculation algorithm (PCA) processes filtered data and is implemented on each processing server. It provides various parameter calculations that are used in the decision-making process. The parameters calculations results are then sent to aggregation server for further processing. Algorithm III, i.e., aggregation and compilations algorithm (ACA) stores, compiles, and organizes the results, which can be used by decision-making server for land and sea area detection. Algorithm IV, i.e., decision-making algorithm (DMA) identifies land area and sea area by comparing the parameters results, i.e., from aggregation servers, with threshold values.

IMPLEMENTATION:

Big Data covers diverse technologies same as cloud computing. The input of Big Data comes from social networks (Facebook, Twitter, LinkedIn, etc.), Web servers, satellite imagery, sensory data, banking transactions, etc. Regardless of very recent emergence of Big Data architecture in scientific applications, numerous efforts toward Big Data analytics architecture can already be found in the literature. Among numerous others, we propose remote sensing Big Data architecture to analyze the Big Data in an efficient manner as shown in Fig. 1. Fig. 1 delineates n number of satellites that obtain the earth observatory Big Data images with sensors or conventional cameras through which sceneries are recorded using radiations. Special techniques are applied to process and interpret remote sensing imagery for the purpose of producing conventional maps, thematic maps, resource surveys, etc. We have divided remote sensing Big Data architecture.

Healthcare scenarios, medical practitioners gather massive volume of data about patients, medical history, medications, and other details. The above-mentioned data are accumulated in drug-manufacturing companies. The nature of these data is very complex, and sometimes the practitioners are unable to show a relationship with other information, which results in missing of important information. With a view in employing advance analytic techniques for organizing and extracting useful information from Big Data results in personalized medication, the advance Big Data analytic techniques give insight into hereditarily causes of the disease.

ALGORITHMS:

This algorithm takes satellite data or product and then filters and divides them into segments and performs load-balancing algorithm.

The processing algorithm calculates results for different parameters against each incoming block and sends them to the next level. In step 1, the calculation of mean, SD, absolute difference, and the number of values, which are greater than the maximum threshold, are performed. Furthermore, in the next step, the results are transmitted to the aggregation server.

ACA collects the results from each processing servers against each Bi and then combines, organizes, and stores these results in RDBMS database.

CONCLUSION AND FUTURE:

In this paper, we proposed architecture for real-time Big Data analysis for remote sensing applications in the architecture efficiently processed and analyzed real-time and offline remote sensing Big Data for decision-making. The proposed architecture is composed of three major units, such as 1) RSDU; 2) DPU; and 3) DADU. These units implement algorithms for each level of the architecture depending on the required analysis. The architecture of real-time Big is generic (application independent) that is used for any type of remote sensing Big Data analysis. Furthermore, the capabilities of filtering, dividing, and parallel processing of only useful information are performed by discarding all other extra data. These processes make a better choice for real-time remote sensing Big Data analysis.

The algorithms proposed in this paper for each unit and subunits are used to analyze remote sensing data sets, which helps in better understanding of land and sea area. The proposed architecture welcomes researchers and organizations for any type of remote sensory Big Data analysis by developing algorithms for each level of the architecture depending on their analysis requirement. For future work, we are planning to extend the proposed architecture to make it compatible for Big Data analysis for all applications, e.g., sensors and social networking. We are also planning to use the proposed architecture to perform complex analysis on earth observatory data for decision making at realtime, such as earthquake prediction, Tsunami prediction, fire detection, etc.

REFERENCES:

[1] D. Agrawal, S. Das, and A. E. Abbadi, “Big Data and cloud computing: Current state and future opportunities,” in Proc. Int. Conf. Extending Database Technol. (EDBT), 2011, pp. 530–533.

[2] J. Cohen, B. Dolan, M. Dunlap, J. M. Hellerstein, and C. Welton, “Mad skills: New analysis practices for Big Data,” PVLDB, vol. 2, no. 2, pp. 1481–1492, 2009.

[3] J. Dean and S. Ghemawat, “Mapreduce: Simplified data processing on large clusters,” Commun. ACM, vol. 51, no. 1, pp. 107–113, 2008.

[4] H. Herodotou et al., “Starfish: A self-tuning system for Big Data analytics,” in Proc. 5th Int. Conf. Innovative Data Syst. Res. (CIDR), 2011, pp. 261–272.

[5] K. Michael and K. W. Miller, “Big Data: New opportunities and new challenges [guest editors’ introduction],” IEEE Comput., vol. 46, no. 6, pp. 22–24, Jun. 2013.

[6] C. Eaton, D. Deroos, T. Deutsch, G. Lapis, and P. C. Zikopoulos, Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data. New York, NY, USA: Mc Graw-Hill, 2012.

[7] R. D. Schneider, Hadoop for Dummies Special Edition. Hoboken, NJ, USA: Wiley, 2012.

Proof of Ownership In Deduplicated Storage With Mobile Device Efficiency

05/08/201902/07/2019 by admin

Cloud storage such as Dropbox and Bitcasa is one of the most popular cloud services. Currently, with the prevalence of mobile cloud computing, users can even collaboratively edit the newest version of documents and synchronize the newest files on their smart mobile devices. A remarkable feature of current cloud storage is its virtually infinite storage. To support unlimited storage, the cloud storage provider uses data deduplication techniques to reduce the data to be stored and therefore reduce the storage expense. Moreover, the use of data deduplication also helps significantly reduce the need for bandwidth and therefore improve the user experience. Nevertheless, in spite of the above benefits, data deduplication has its inherent security weaknesses. Among them, the most severe is that the adversary may have an unauthorized file downloading via the file hash only. In this article we first review the previous solutions and identify their performance weaknesses. Then we propose an alternative design that achieves cloud server efficiency and especially mobile device efficiency.

1.2 INTRODUCTION

Mobile devices have become prevalent in recent years, and mobile computing has been a growing trend. Meanwhile, cloud computing is definitely the biggest revolution in recent decades. Many tasks, such as document editing and file backup, have been shifted from end devices to the cloud. Therefore, with the convergence of mobile computing and cloud computing, along with the recent development of the 5G communication standard that establishes more reliable and faster communication channels, mobile cloud computing (MCC) could be a rapidly growing field that deserves to be investigated and explored.

Deduplicated Storage in Mobile Cloud Computing Among cloud services, cloud storage with the capability of file backup and synchronization could be the most popular service that enables mobile users to access their files everywhere. Dropbox (https://www.dropbox.com/) and Bitcasa (https://www.bitcasa.com/) are two examples that offer easy-to-use file backup and synchronization services. Several remarkable features of such cloud storage can be identified. It has high availability, which means that the user’s data will be replicated over cloud servers worldwide and is guaranteed to be accessible whenever the user has the need. It has the flexibility in a pay-as-you-go model, which means that the user can gain additional storage immediately whenever the user is willing to make an extra payment. The most important feature is that it has virtually infinite storage space, which means that the user can backup whatever he/she wants to be uploaded to the cloud. A renowned example is Bitcasa, which offers “unlimited storage” that enables the user to upload virtually everything. Offering infinite storage space might cause a severe economic burden on the cloud storage provider.

However, a technique called data deduplication helps significantly reduce the cost of storage. Data deduplication has been widely implemented by cloud storage providers including Dropbox and Bitcasa. According to the report in [8] (http://www.snia.org), the use of data deduplication in business applications may reduce the data to be stored and thus achieve disk and bandwidth savings of more than 90 percent. The power of data deduplication is achieved by avoiding storing the same file multiple times. The storage saving is more obvious especially when the popular multimedia contents such as music and movies are considered. The replicated contents create an additional storage need the first time they are uploaded, but create no extra storage need for subsequent uploads. In addition to storage saving, if the data content has been in the storage, then the replicated content has no need to be transmitted, achieving bandwidth saving. Data deduplication can be categorized as two types depending on where the deduplication take places: server (cloud) side deduplication and client (user) side deduplication. Server side deduplication is simple: the server, after receiving the file, checks whether it already has a copy in storage. The server discards the received file if it does, or creates a new file in the storage if it does not.

We can see that server side deduplication does not produce bandwidth saving because the server performs the deduplication after the file has been received. On the other hand, client side deduplication adopts a more aggressive method: the user calculates and sends the hash of the file before uploading the file. Upon receiving the hash, the server checks whether the hash is already in storage. The user is asked to send nothing and the server associates the user with the existing file if so. The user is asked to upload the file otherwise. An illustrative example is shown in Fig. 2, where user 1 first uploads files F1 and F2 in Fig. 2a. Then the cloud knows from the hashes h(F1) and h(F2) sent by user 2 that there has been a copy of F1 in storage and sends a positive Acknowledgment and negative Acknowledgment to user 2. User 2, according to Acknowledgments, sends only F3, saving the transmission of F1. Public cloud storage services (e.g. Dropbox and Bitcasa) are more likely to adopt client side deduplication because of its storage and bandwidth savings. In particular, in addition to the reduced storage requirement, the client side deduplication can also reduce the need for file transmission, allowing the reduction of waiting time for users and energy consumption for the server. We particularly mention that even with the increased bandwidth of the coming 5G communication standard, the data rate of wireless links is still not compatible to that of wired links. Thus, if we consider the mobile devices accessing cloud storage services, client side deduplication becomes an inevitable technique for MCC applications.

1.3 LITRATURE SURVEY

DUPLESS: SERVERAIDED ENCRYPTION FOR DEDUPLICATED STORAGE

AUTHOR: M. Bellare, S. Keelveedhi, and T. Ristenpart

PUBLISH: Proc. 22nd USENIX Conf. Sec. Symp., 2013, pp. 179–194.

EXPLANATION:

Cloud storage service providers such as Dropbox, Mozy, and others perform deduplication to save space by only storing one copy of each file uploaded. Should clients conventionally encrypt their files, however, savings are lost. Message-locked encryption (the most prominent manifestation of which is convergent encryption) resolves this tension. However it is inherently subject to brute-force attacks that can recover files falling into a known set. We propose an architecture that provides secure deduplicated storage resisting brute-force attacks, and realize it in a system called DupLESS. In DupLESS, clients encrypt under message-based keys obtained from a key-server via an oblivious PRF protocol. It enables clients to store encrypted data with an existing service, have the service perform deduplication on their behalf, and yet achieves strong confidentiality guarantees. We show that encryption for deduplicated storage can achieve performance and space savings close to that of using the storage service with plaintext data.

FAST AND SECURE LAPTOP BACKUPS WITH ENCRYPTED DE-DUPLICATION

AUTHOR: P. Anderson and L. Zhang

PUBLISH: Proc. 24th Int. Conf. Large Installation Syst. Admin., 2010, pp. 29–40.

EXPLANATION:

Many people now store large quantities of personal and corporate data on laptops or home computers. These often have poor or intermittent connectivity, and are vulnerable to theft or hardware failure. Conventional backup solutions are not well suited to this environment, and backup regimes are frequently inadequate. This paper describes an algorithm which takes advantage of the data which is common between users to increase the speed of backups, and reduce the storage requirements. This algorithm supports client-end per-user encryption which is necessary for confidential personal data. It also supports a unique feature which allows immediate detection of common subtrees, avoiding the need to query the backup system for every file. We describe a prototype implementation of this algorithm for Apple OS X, and present an analysis of the potential effectiveness, using real data obtained from a set of typical users. Finally, we discuss the use of this prototype in conjunction with remote cloud storage, and present an analysis of the typical cost savings.

SECURE DEDUPLICATION WITH EFFICIENT AND RELIABLE CONVERGENT KEY MANAGEMENT

AUTHOR: J. Li, X. Chen, M. Li, J. Li, P. Lee, and W. Lou

PUBLISH: IEEE Trans. Parallel Distrib. Syst., http://oi.ieeecomputersociety.org/10.1109/TPDS.2013.284, 2013

EXPLANATION:

Data deduplication is a technique for eliminating duplicate copies of data, and has been widely used in cloud storage to reduce storage space and upload bandwidth. Promising as it is, an arising challenge is to perform secure deduplication in cloud storage. Although convergent encryption has been extensively adopted for secure deduplication, a critical issue of making convergent encryption practical is to efficiently and reliably manage a huge number of convergent keys. This paper makes the first attempt to formally address the problem of achieving efficient and reliable key management in secure deduplication. We first introduce a baseline approach in which each user holds an independent master key for encrypting the convergent keys and outsourcing them to the cloud. However, such a baseline key management scheme generates an enormous number of keys with the increasing number of users and requires users to dedicatedly protect the master keys. To this end, we propose Dekey , a new construction in which users do not need to manage any keys on their own but instead securely distribute the convergent key shares across multiple servers. Security analysis demonstrates that Dekey is secure in terms of the definitions specified in the proposed security model. As a proof of concept, we implement Dekey using the Ramp secret sharing scheme and demonstrate that Dekey incurs limited overhead in realistic environments.

CHAPTER 2

2.0 SYSTEM ANALYSIS

2.1 EXISTING SYSTEM:

Data de duplication is one of important data compression techniques for eliminating duplicate copies of repeating data, and has been widely used in cloud storage to reduce the amount of storage space and save bandwidth. Previous de duplication systems cannot support differential authorization duplicate check, which is important in many applications. In such an authorized de duplication system, each user is issued a set of privileges during system initialization Each file uploaded to the cloud is also bounded by a set of privileges to specify which kind of users is allowed to perform the duplicate check and access the files.

Before submitting his duplicate check request for a file, the user needs to take this file and his own privileges as inputs. The user is able to find a duplicate f or this file if and only if there is a copy of this file and a matched privilege stored in cloud. Traditional de duplication systems based on convergent encryption, although providing confidentiality to some extent; do not support the duplicate check with differential privileges. In other words, no differential privileges have been considered in the de duplication based on convergent encryption technique. It seems to be contradicted if we want to realize both de duplication and differential authorization duplicate check at the same time.

2.1.1 DISADVANTAGES:

De duplication systems cannot support differential authorization duplicate check.
One critical challenge of cloud storage services is the management of the ever increasing volume of data.
Users’ sensitive data are susceptible to both insider and outsider attacks.
Sometimes de duplication impossible.

2.2 PROPOSED SYSTEM:

We propose an alternative design that strikes a balance between server side efficiency and user side efficiency. Before introducing the scheme’s details, we present two observations. First, the POW schemes in are I/O efficient at the server side because the Merkle tree root can be thought of as a compact summary of the file. Therefore, there is no need for the cloud to access the disk to retrieve the file. Second, the user side is computationally efficient in three s-POW schemes because the user is simply required only to answer a few bits of the file. With the above two observations, our design strategy is to have a probabilistic data structure for the compact summary of the file, in contrast to the deterministic data structure, Merkle hash tree, in the POW schemes. The query challenge is also modified as random blocks, in contrast to the random bits in s-POW schemes. An overview of the proposed POW scheme goes as follows.

2.2.1 ADVANTAGES:

POW scheme such as the bandwidth requirement, I/O overhead at both user and server sides, and the computation overhead at both sides concern the performance, the second is less known in the POW design. More specifically, cloud storage usually has a storage hierarchy: the memory (primary storage) and disk (secondary storage).

The execution of a POW scheme might require the user and cloud to access the file stored in the disk multiple times. The server might also need to keep the verification object in either the memory or the disk to verify the user’s claim.

The above all might result in a huge amount of I/O delay because of the access time gap between the memory and disk. In this article we focus only on the abuse of a file hash to gain the ownership of the file and aim to design a POW scheme with minimum performance overhead.

To prevent unauthorized access, a secure proof of ownership (POW) protocol is also needed to provide the proof that the user indeed owns the same file when a duplicate is found.
It makes overhead to minimal compared to the normal convergent encryption and file upload operations.
Data confidentiality is maintained.
Secure compared to proposed techniques

2.3 HARDWARE & SOFTWARE REQUIREMENTS:

2.3.1 HARDWARE REQUIREMENT:

v Processor – Pentium –IV

Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Key Board – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA

2.3.2 SOFTWARE REQUIREMENTS:

JAVA

Operating System : Windows XP or Win7
Front End : JAVA JDK 1.7
Back End : MYSQL Server
Server : Apache Tomact Server
Script : JSP Script
Document : MS-Office 2007

CHAPTER 3

3.0 SYSTEM DESIGN:

Data Flow Diagram / Use Case Diagram / Flow Diagram:

The DFD is also called as bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, various processing carried out on these data, and the output data is generated by the system

The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components. These components are the system process, the data used by the process, an external entity that interacts with the system and the information flows in the system.

DFD shows how the information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.

DFD is also known as bubble chart. A DFD may be used to represent a system at any level of abstraction. DFD may be partitioned into levels that represent increasing information flow and functional detail.

NOTATION:

SOURCE OR DESTINATION OF DATA:

External sources or destinations, which may be people or organizations or other entities

DATA SOURCE:

Here the data referenced by a process is stored and retrieved.

PROCESS:

People, procedures or devices that produce data’s in the physical component is not identified.

DATA FLOW:

Data moves in a specific direction from an origin to a destination. The data flow is a “packet” of data.

MODELING RULES:

There are several common modeling rules when creating DFDs:

All processes must have at least one data flow in and one data flow out.
All processes should modify the incoming data, producing new forms of outgoing data.
Each data store must be involved with at least one data flow.
Each external entity must be involved with at least one data flow.
A data flow must be attached to at least one process.

3.1 ARCHITECTURE DIAGRAM

3.2 DATAFLOW DIAGRAM

USER:

ADMIN:

UML DIAGRAMS:

3.2 USE CASE DIAGRAM:

3.3 CLASS DIAGRAM:

3.4 SEQUENCE DIAGRAM:

SENDER USER:

RECEIVER USER:

3.5 ACTIVITY DIAGRAM:

SENDER LOGIN:

RECEIVER LOGIN:

CHAPTER 4

4.0 IMPLEMENTATION:

MOBILE CLOUD COMPUTING:

Mobile Cloud Computing (MCC) is the combination of cloud computing, mobile computing and wireless networks to bring rich computational resources to mobile users, network operators, as well as cloud computing providers. The ultimate goal of MCC is to enable execution of rich mobile applications on a plethora of mobile devices, with a rich user experience. MCC provides business opportunities for mobile network operators as well as cloud providers. “A rich mobile computing technology that leverages uniﬁed elastic resources of varied clouds and network technologies toward unrestricted functionality, storage, and mobility to serve a multitude of mobile devices anywhere, anytime through the channel of Ethernet or Internet regardless of heterogeneous environments and platforms based on the pay-as-you-use principle.

ARCHITECTURE:

MCC uses computational augmentation approachesby which resource-constraint mobile devices can utilize computational resources of varied cloud-based resources. In MCC, there are four types of cloud-based resources, namely distant immobile clouds, proximate immobile computing entities, proximate mobile computing entities, and hybrid (combination of the other three models). Giant clouds such as Amazon EC2 are in the distant immobile groups whereas cloudlet or surrogates are member of proximate immobile computing entities. Smartphones, tablets, handheld devices, and wearable computing devices are part of the third group of cloud-based resources which is proximate mobile computing entities.

DIAGRAM:

In the MCC landscape, an amalgam of mobile computing, cloud computing, and communication networks (to augment smartphones) creates several complex challenges such as Mobile Computation Offloading, Seamless Connectivity, Long WAN Latency, Mobility Management, Context-Processing, Energy Constraint, Vendor/data Lock-in, Security and Privacy, Elasticity that hinder MCC success and adoption.

Although significant research and development in MCC is available in the literature, efforts in the following domains are still lacking:

Architectural issues: Reference architecture for heterogeneous MCC environment is a crucial requirement for unleashing the power of mobile computing towards unrestricted ubiquitous computing.
Energy-efficient transmission: MCC requires frequent transmissions between cloud platform and mobile devices, due to the stochastic nature of wireless networks, the transmission protocol should be carefully designed.
Context-awareness issues: Context-aware and socially-aware computing are inseparable traits of contemporary handheld computers. To achieve the vision of mobile computing among heterogeneous converged networks and computing devices, designing resource-efﬁcient environment-aware applications is an essential need.
Live VM migration issues: Executing resource-intensive mobile application via Virtual Machine (VM) migration-based application ofﬂoading involves encapsulation of application in VM instance and migrating it to the cloud, which is a challenging task due to additional overhead of deploying and managing VM on mobile devices.
Mobile communication congestion issues: Mobile data trafﬁc is tremendously hiking by ever increasing mobile user demands for exploiting cloud resources which impact on mobile network operators and demand future efforts to enable smooth communication between mobile and cloud endpoints.
Trust, security, and privacy issues: Trust is an essential factor for the success of the burgeoning MCC paradigm.

PROOF OF OWNERSHIP:

An even more severe and direct security threat from using deduplicated cloud storage is that the adversary may gain the ownership of files by only eavesdropping on file hashes. A closer look at client side deduplication can find that anyone in possession of the file hash can gain ownership of the file by uploading the file hash. More specifically, the cloud considers receiving a store request for a file already in the storage, avoids the redundant file transmission, and then adds the user as an additional owner of the file. An illustrative example is shown in Fig. 3d. Such a situation is apparently undesirable because in theory the adversary cannot infer the file content via the hash.

However, in this case, once the adversary knows the hash, it is able to download the entire file content. On the other hand, in practice, the user considers the hash unharmful and in some cases publishes the hashes as timestamps. However, the publicly available hashes can be abused to gain the file. This security weakness comes from using the static and short piece of information (hash) as a way of claiming file ownership. Motivated by this observation, Halevi et al. [10] introduce the notion of proof of ownership (POW). A POW scheme is jointly executed by the cloud and user such that the user can prove to the cloud that it is indeed in possession of the file.

4.1 ALGORITHM:

PUBLIC KEY INFRASTRUCTURE (PKI) AND PRIVATE KEY GENERATOR (PKG):

In cryptography, the ElGamal encryption system is an asymmetric key encryption algorithm for public-key cryptography which is based on the Diffie–Hellman key exchange. It was described by Taher Elgamal in 1985. ElGamal encryption is used in the free GNU Privacy Guard software, recent versions of PGP, and other cryptosystems. The DSA (Digital Signature Algorithm) is a variant of the ElGamal signature scheme, which should not be confused with ElGamal encryption. The security of the ElGamal scheme depends on the properties of the underlying group as well as any padding scheme used on the messages.

If the computational Diffie–Hellman assumption (CDH) holds in the underlying cyclic group , then the encryption function is one-way. If the decisional Diffie–Hellman assumption (DDH) holds in , then ElGamal achieves semantic security. Semantic security is not implied by the computational Diffie–Hellman assumption alone. See decisional Diffie–Hellman assumption for a discussion of groups where the assumption is believed to hold.

To achieve chosen-ciphertext security, the scheme must be further modified, or an appropriate padding scheme must be used. Depending on the modification, the DDH assumption may or may not be necessary.

Other schemes related to ElGamal which achieve security against chosen ciphertext attacks have also been proposed. The Cramer–Shoup cryptosystem is secure under chosen ciphertext attack assuming DDH holds for. Its proof does not use the random oracle model. Another proposed scheme is DHAES whose proof requires an assumption that is weaker than the DDH assumption.

4.2 MODULES:

SECURE USER MODULES:

DEDUPLICATED STORAGE:

CHECK DEDUPLICATES:

APPLY POW SCHEME:

SECURE SEND KEY:

4.3 MODULE DESCRIPTION:

SECURE USER MODULES:

In this module, Users are having authentication and security to access the detail which is presented in the ontology system. Before accessing or searching the details user should have the account in that otherwise they should register first.

Registration
File View
Encryption
Download
Upload Files
Encrypt and save to cloud

DEDUPLICATED STORAGE:

Client side deduplication incurs its own security weaknesses. First, the privacy of the file existence in the cloud may be compromised because the adversary may try to upload the candidate files to see whether the deduplication takes place. If the deduplication takes place, this will be an indica tor of the file’s existence. Otherwise, the adversary may infer the file’s nonexistence. The situation becomes even worse when we consider the low-entropy files because the adversary may exhaustively create different files and upload the hashes to check the file’s existence. For example, a curious colleague may query his/her manager’s salary by uploading different salary sheets because the sheets are of a similar form, restricting the number of file contents to be tested.

CHECK DEDUPLICATES:

Data deduplication can be categorized as two types depending on where the deduplication take places: server (cloud) side deduplication and client (user) side deduplication. Server side deduplication is simple: the server, after receiving the file, checks whether it already has a copy in storage. The server discards the received file if it does, or creates a new file in the storage if it does not. We can see that server side deduplication does not produce bandwidth saving because the server performs the deduplication after the file has been received. On the other hand, client side deduplication adopts a more aggressive method: the user calculates and sends the hash of the file before uploading the file. Upon receiving the hash, the server checks whether the hash is already in storage. The user is asked to send nothing and the server associates the user with the existing file if so. The user is asked to upload the file otherwise. An illustrative example is shown in Fig. 2, where user 1 first uploads files F1 and F2 in Fig. 2a.

Then the cloud knows from the hashes h(F1) and h(F2) sent by user 2 that there has been a copy of F1 in storage and sends a positive Acknowledgment and negative Acknowledgment to user 2. User 2, according to Acknowledgments, sends only F3, saving the transmission of F1. Public cloud storage services (e.g. Dropbox and Bitcasa) are more likely to adopt client side deduplication because of its storage and bandwidth savings. In particular, in addition to the reduced storage requirement, the client side deduplication can also reduce the need for file transmission, allowing the reduction of waiting time for users and energy consumption for the server. We particularly mention that even with the increased bandwidth of the coming 5G communication standard, the data rate of wireless links is still not compatible to that of wired links. Thus, if we consider the mobile devices accessing cloud storage services, client side deduplication becomes an inevitable technique for MCC applications.

APPLY POW SCHEME:

The POW schemes in performance very well on the server side since only a small size index (tree root) needs to be stored in the main memory. However, the proof of ownership is achieved by the user sending an authentication path of size O(log |f|) to the cloud, resulting in more communication overhead and computation load on the cloud. The I/O overhead of the user side is also increased, compared to the POW schemes in, because the user needs to retrieve the entire file. On the other extreme, although the s-POW schemes in have great computation and I/O efficiency in the user side, its I/O burden on the cloud is significantly increased since the cloud is required to retrieve random bits from the secondary storage.

In this article we propose an alternative design that strikes a balance between server side efficiency and user side efficiency. Before introducing the scheme’s details, we present two observations. First, the POW schemes in are I/O efficient at the server side because the Merkle tree root can be thought of as a compact summary of the file. Therefore, there is no need for the cloud to access the disk to retrieve the file. Second, the user side is computationally efficient in three s-POW schemes because the user is simply required only to answer a few bits of the file. With the above two observations, our design strategy is to have a probabilistic data structure for the compact summary of the file, in contrast to the deterministic data structure, Merkle hash tree, in the POW schemes. The query challenge is also modified as random blocks, in contrast to the random bits in s-POW schemes. An overview of the proposed POW scheme goes as follows.

SECURE SEND KEY:

Once the key request was received, the sender can send the key or he can decline it. With this key and request id which was generated at the time of sending key request the receiver can decrypt the message.

CHAPTER 5

5.0 SYSTEM STUDY:

5.1 FEASIBILITY STUDY:

The feasibility of the project is analyzed in this phase and business proposal is put forth with a very general plan for the project and some cost estimates. During system analysis the feasibility study of the proposed system is to be carried out. This is to ensure that the proposed system is not a burden to the company. For feasibility analysis, some understanding of the major requirements for the system is essential.

Three key considerations involved in the feasibility analysis are

ECONOMICAL FEASIBILITY
TECHNICAL FEASIBILITY
SOCIAL FEASIBILITY

5.1.1 ECONOMICAL FEASIBILITY:

This study is carried out to check the economic impact that the system will have on the organization. The amount of fund that the company can pour into the research and development of the system is limited. The expenditures must be justified. Thus the developed system as well within the budget and this was achieved because most of the technologies used are freely available. Only the customized products had to be purchased.

5.1.2 TECHNICAL FEASIBILITY

This study is carried out to check the technical feasibility, that is, the technical requirements of the system. Any system developed must not have a high demand on the available technical resources. This will lead to high demands on the available technical resources. This will lead to high demands being placed on the client. The developed system must have a modest requirement, as only minimal or null changes are required for implementing this system.

5.1.3 SOCIAL FEASIBILITY:

The aspect of study is to check the level of acceptance of the system by the user. This includes the process of training the user to use the system efficiently. The user must not feel threatened by the system, instead must accept it as a necessity. The level of acceptance by the users solely depends on the methods that are employed to educate the user about the system and to make him familiar with it. His level of confidence must be raised so that he is also able to make some constructive criticism, which is welcomed, as he is the final user of the system.

5.2 SYSTEM TESTING:

This creates two problems, the time lag between the cause and the appearance of the problem and the effect of the system errors on the files and records within the system. A small system error can conceivably explode into a much larger Problem. Effective testing early in the purpose translates directly into long term cost savings from a reduced number of errors. Another reason for system testing is its utility, as a user-oriented vehicle before implementation. The best programs are worthless if it produces the correct outputs.

5.2.1 UNIT TESTING:

Description	Expected result
Test for application window properties.	All the properties of the windows are to be properly aligned and displayed.
Test for mouse operations.	All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions.

A program represents the logical elements of a system. For a program to run satisfactorily, it must compile and test data correctly and tie in properly with other programs. Achieving an error free program is the responsibility of the programmer. Program testing checks for two types of errors: syntax and logical. Syntax error is a program statement that violates one or more rules of the language in which it is written. An improperly defined field dimension or omitted keywords are common syntax errors. These errors are shown through error message generated by the computer. For Logic errors the programmer must examine the output carefully.

5.1.2 FUNCTIONAL TESTING:

Functional testing of an application is used to prove the application delivers correct results, using enough inputs to give an adequate level of confidence that will work correctly for all sets of inputs. The functional testing will need to prove that the application works for each client type and that personalization function work correctly.When a program is tested, the actual output is compared with the expected output. When there is a discrepancy the sequence of instructions must be traced to determine the problem. The process is facilitated by breaking the program into self-contained portions, each of which can be checked at certain key points. The idea is to compare program values against desk-calculated values to isolate the problems.

Description	Expected result
Test for all modules.	All peers should communicate in the group.
Test for various peer in a distributed network framework as it display all users available in the group.	The result after execution should give the accurate result.

5.1. 3 NON-FUNCTIONAL TESTING:

The Non Functional software testing encompasses a rich spectrum of testing strategies, describing the expected results for every test case. It uses symbolic analysis techniques. This testing used to check that an application will work in the operational environment. Non-functional testing includes:

Load testing
Performance testing
Usability testing
Reliability testing
Security testing

5.1.4 LOAD TESTING:

An important tool for implementing system tests is a Load generator. A Load generator is essential for testing quality requirements such as performance and stress. A load can be a real load, that is, the system can be put under test to real usage by having actual telephone users connected to it. They will generate test input data for system test.

Description	Expected result
It is necessary to ascertain that the application behaves correctly under loads when ‘Server busy’ response is received.	Should designate another active node as a Server.

5.1.5 PERFORMANCE TESTING:

Performance tests are utilized in order to determine the widely defined performance of the software system such as execution time associated with various parts of the code, response time and device utilization. The intent of this testing is to identify weak points of the software system and quantify its shortcomings.

Description	Expected result
This is required to assure that an application perforce adequately, having the capability to handle many peers, delivering its results in expected time and using an acceptable level of resource and it is an aspect of operational management.	Should handle large input values, and produce accurate result in a expected time.

5.1.6 RELIABILITY TESTING:

The software reliability is the ability of a system or component to perform its required functions under stated conditions for a specified period of time and it is being ensured in this testing. Reliability can be expressed as the ability of the software to reveal defects under testing conditions, according to the specified requirements. It the portability that a software system will operate without failure under given conditions for a given time interval and it focuses on the behavior of the software element. It forms a part of the software quality control team.

Description	Expected result
This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in provide the application.	In case of failure of the server an alternate server should take over the job.

5.1.7 SECURITY TESTING:

Security testing evaluates system characteristics that relate to the availability, integrity and confidentiality of the system data and services. Users/Clients should be encouraged to make sure their security needs are very clearly known at requirements time, so that the security issues can be addressed by the designers and testers.

Description	Expected result
Checking that the user identification is authenticated.	In case failure it should not be connected in the framework.
Check whether group keys in a tree are shared by all peers.	The peers should know group key in the same group.

5.1.8 WHITE BOX TESTING:

White box testing, sometimes called glass-box testing is a test case design method that uses the control structure of the procedural design to derive test cases. Using white box testing method, the software engineer can derive test cases. The White box testing focuses on the inner structure of the software structure to be tested.

Description	Expected result
Exercise all logical decisions on their true and false sides.	All the logical decisions must be valid.
Execute all loops at their boundaries and within their operational bounds.	All the loops must be finite.
Exercise internal data structures to ensure their validity.	All the data structures must be valid.

5.1.9 BLACK BOX TESTING:

Black box testing, also called behavioral testing, focuses on the functional requirements of the software. That is, black testing enables the software engineer to derive sets of input conditions that will fully exercise all functional requirements for a program. Black box testing is not alternative to white box techniques. Rather it is a complementary approach that is likely to uncover a different class of errors than white box methods. Black box testing attempts to find errors which focuses on inputs, outputs, and principle function of a software module. The starting point of the black box testing is either a specification or code. The contents of the box are hidden and the stimulated software should produce the desired results.

Description	Expected result
To check for incorrect or missing functions.	All the functions must be valid.
To check for interface errors.	The entire interface must function normally.
To check for errors in a data structures or external data base access.	The database updation and retrieval must be done.
To check for initialization and termination errors.	All the functions and data structures must be initialized properly and terminated normally.

All the above system testing strategies are carried out in as the development, documentation and institutionalization of the proposed goals and related policies is essential.

CHAPTER 6

6.0 SOFTWARE DESCRIPTION:

6.1 JAVA TECHNOLOGY:

Java technology is both a programming language and a platform.

The Java Programming Language

The Java programming language is a high-level language that can be characterized by all of the following buzzwords:

Simple
- Architecture neutral
- Object oriented
- Portable
- Distributed
- High performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure

With most programming languages, you either compile or interpret a program so that you can run it on your computer. The Java programming language is unusual in that a program is both compiled and interpreted. With the compiler, first you translate a program into an intermediate language called Java byte codes —the platform-independent codes interpreted by the interpreter on the Java platform. The interpreter parses and runs each Java byte code instruction on the computer. Compilation happens just once; interpretation occurs each time the program is executed. The following figure illustrates how this works.

You can think of Java byte codes as the machine code instructions for the Java Virtual Machine (Java VM). Every Java interpreter, whether it’s a development tool or a Web browser that can run applets, is an implementation of the Java VM. Java byte codes help make “write once, run anywhere” possible. You can compile your program into byte codes on any platform that has a Java compiler. The byte codes can then be run on any implementation of the Java VM. That means that as long as a computer has a Java VM, the same program written in the Java programming language can run on Windows 2000, a Solaris workstation, or on an iMac.

6.2 THE JAVA PLATFORM:

A platform is the hardware or software environment in which a program runs. We’ve already mentioned some of the most popular platforms like Windows 2000, Linux, Solaris, and MacOS. Most platforms can be described as a combination of the operating system and hardware. The Java platform differs from most other platforms in that it’s a software-only platform that runs on top of other hardware-based platforms.

The Java platform has two components:

The Java Virtual Machine (Java VM)
The Java Application Programming Interface (Java API)

You’ve already been introduced to the Java VM. It’s the base for the Java platform and is ported onto various hardware-based platforms.

The Java API is a large collection of ready-made software components that provide many useful capabilities, such as graphical user interface (GUI) widgets. The Java API is grouped into libraries of related classes and interfaces; these libraries are known as packages. The next section, What Can Java Technology Do? Highlights what functionality some of the packages in the Java API provide.

The following figure depicts a program that’s running on the Java platform. As the figure shows, the Java API and the virtual machine insulate the program from the hardware.

Native code is code that after you compile it, the compiled code runs on a specific hardware platform. As a platform-independent environment, the Java platform can be a bit slower than native code. However, smart compilers, well-tuned interpreters, and just-in-time byte code compilers can bring performance close to that of native code without threatening portability.

6.3 WHAT CAN JAVA TECHNOLOGY DO?

The most common types of programs written in the Java programming language are applets and applications. If you’ve surfed the Web, you’re probably already familiar with applets. An applet is a program that adheres to certain conventions that allow it to run within a Java-enabled browser.

However, the Java programming language is not just for writing cute, entertaining applets for the Web. The general-purpose, high-level Java programming language is also a powerful software platform. Using the generous API, you can write many types of programs.

An application is a standalone program that runs directly on the Java platform. A special kind of application known as a server serves and supports clients on a network. Examples of servers are Web servers, proxy servers, mail servers, and print servers. Another specialized program is a servlet.

A servlet can almost be thought of as an applet that runs on the server side. Java Servlets are a popular choice for building interactive web applications, replacing the use of CGI scripts. Servlets are similar to applets in that they are runtime extensions of applications. Instead of working in browsers, though, servlets run within Java Web servers, configuring or tailoring the server.

How does the API support all these kinds of programs? It does so with packages of software components that provides a wide range of functionality. Every full implementation of the Java platform gives you the following features:

The essentials: Objects, strings, threads, numbers, input and output, data structures, system properties, date and time, and so on.
Applets: The set of conventions used by applets.
Networking: URLs, TCP (Transmission Control Protocol), UDP (User Data gram Protocol) sockets, and IP (Internet Protocol) addresses.
Internationalization: Help for writing programs that can be localized for users worldwide. Programs can automatically adapt to specific locales and be displayed in the appropriate language.
Security: Both low level and high level, including electronic signatures, public and private key management, access control, and certificates.
Software components: Known as JavaBeans^TM, can plug into existing component architectures.
Object serialization: Allows lightweight persistence and communication via Remote Method Invocation (RMI).
Java Database Connectivity (JDBC^TM): Provides uniform access to a wide range of relational databases.

The Java platform also has APIs for 2D and 3D graphics, accessibility, servers, collaboration, telephony, speech, animation, and more. The following figure depicts what is included in the Java 2 SDK.

6.4 HOW WILL JAVA TECHNOLOGY CHANGE MY LIFE?

We can’t promise you fame, fortune, or even a job if you learn the Java programming language. Still, it is likely to make your programs better and requires less effort than other languages. We believe that Java technology will help you do the following:

Get started quickly: Although the Java programming language is a powerful object-oriented language, it’s easy to learn, especially for programmers already familiar with C or C++.
Write less code: Comparisons of program metrics (class counts, method counts, and so on) suggest that a program written in the Java programming language can be four times smaller than the same program in C++.
Write better code: The Java programming language encourages good coding practices, and its garbage collection helps you avoid memory leaks. Its object orientation, its JavaBeans component architecture, and its wide-ranging, easily extendible API let you reuse other people’s tested code and introduce fewer bugs.
Develop programs more quickly: Your development time may be as much as twice as fast versus writing the same program in C++. Why? You write fewer lines of code and it is a simpler programming language than C++.
Avoid platform dependencies with 100% Pure Java: You can keep your program portable by avoiding the use of libraries written in other languages. The 100% Pure Java^TMProduct Certification Program has a repository of historical process manuals, white papers, brochures, and similar materials online.
Write once, run anywhere: Because 100% Pure Java programs are compiled into machine-independent byte codes, they run consistently on any Java platform.
Distribute software more easily: You can upgrade applets easily from a central server. Applets take advantage of the feature of allowing new classes to be loaded “on the fly,” without recompiling the entire program.

6.5 ODBC:

Microsoft Open Database Connectivity (ODBC) is a standard programming interface for application developers and database systems providers. Before ODBC became a de facto standard for Windows programs to interface with database systems, programmers had to use proprietary languages for each database they wanted to connect to. Now, ODBC has made the choice of the database system almost irrelevant from a coding perspective, which is as it should be. Application developers have much more important things to worry about than the syntax that is needed to port their program from one database to another when business needs suddenly change.

Through the ODBC Administrator in Control Panel, you can specify the particular database that is associated with a data source that an ODBC application program is written to use. Think of an ODBC data source as a door with a name on it. Each door will lead you to a particular database. For example, the data source named Sales Figures might be a SQL Server database, whereas the Accounts Payable data source could refer to an Access database. The physical database referred to by a data source can reside anywhere on the LAN.

The ODBC system files are not installed on your system by Windows 95. Rather, they are installed when you setup a separate database application, such as SQL Server Client or Visual Basic 4.0. When the ODBC icon is installed in Control Panel, it uses a file called ODBCINST.DLL. It is also possible to administer your ODBC data sources through a stand-alone program called ODBCADM.EXE. There is a 16-bit and a 32-bit version of this program and each maintains a separate list of ODBC data sources.

From a programming perspective, the beauty of ODBC is that the application can be written to use the same set of function calls to interface with any data source, regardless of the database vendor. The source code of the application doesn’t change whether it talks to Oracle or SQL Server. We only mention these two as an example. There are ODBC drivers available for several dozen popular database systems. Even Excel spreadsheets and plain text files can be turned into data sources. The operating system uses the Registry information written by ODBC Administrator to determine which low-level ODBC drivers are needed to talk to the data source (such as the interface to Oracle or SQL Server). The loading of the ODBC drivers is transparent to the ODBC application program. In a client/server environment, the ODBC API even handles many of the network issues for the application programmer.

The advantages of this scheme are so numerous that you are probably thinking there must be some catch. The only disadvantage of ODBC is that it isn’t as efficient as talking directly to the native database interface. ODBC has had many detractors make the charge that it is too slow. Microsoft has always claimed that the critical factor in performance is the quality of the driver software that is used. In our humble opinion, this is true. The availability of good ODBC drivers has improved a great deal recently. And anyway, the criticism about performance is somewhat analogous to those who said that compilers would never match the speed of pure assembly language. Maybe not, but the compiler (or ODBC) gives you the opportunity to write cleaner programs, which means you finish sooner. Meanwhile, computers get faster every year.

6.6 JDBC:

In an effort to set an independent database standard API for Java; Sun Microsystems developed Java Database Connectivity, or JDBC. JDBC offers a generic SQL database access mechanism that provides a consistent interface to a variety of RDBMSs. This consistent interface is achieved through the use of “plug-in” database connectivity modules, or drivers. If a database vendor wishes to have JDBC support, he or she must provide the driver for each platform that the database and Java run on.

To gain a wider acceptance of JDBC, Sun based JDBC’s framework on ODBC. As you discovered earlier in this chapter, ODBC has widespread support on a variety of platforms. Basing JDBC on ODBC will allow vendors to bring JDBC drivers to market much faster than developing a completely new connectivity solution.

JDBC was announced in March of 1996. It was released for a 90 day public review that ended June 8, 1996. Because of user input, the final JDBC v1.0 specification was released soon after.

The remainder of this section will cover enough information about JDBC for you to know what it is about and how to use it effectively. This is by no means a complete overview of JDBC. That would fill an entire book.

6.7 JDBC Goals:

Few software packages are designed without goals in mind. JDBC is one that, because of its many goals, drove the development of the API. These goals, in conjunction with early reviewer feedback, have finalized the JDBC class library into a solid framework for building database applications in Java.

The goals that were set for JDBC are important. They will give you some insight as to why certain classes and functionalities behave the way they do. The eight design goals for JDBC are as follows:

SQL Level API

The designers felt that their main goal was to define a SQL interface for Java. Although not the lowest database interface level possible, it is at a low enough level for higher-level tools and APIs to be created. Conversely, it is at a high enough level for application programmers to use it confidently. Attaining this goal allows for future tool vendors to “generate” JDBC code and to hide many of JDBC’s complexities from the end user.

SQL Conformance

SQL syntax varies as you move from database vendor to database vendor. In an effort to support a wide variety of vendors, JDBC will allow any query statement to be passed through it to the underlying database driver. This allows the connectivity module to handle non-standard functionality in a manner that is suitable for its users.

JDBC must be implemental on top of common database interfaces

The JDBC SQL API must “sit” on top of other common SQL level APIs. This goal allows JDBC to use existing ODBC level drivers by the use of a software interface. This interface would translate JDBC calls to ODBC and vice versa.

Provide a Java interface that is consistent with the rest of the Java system

Because of Java’s acceptance in the user community thus far, the designers feel that they should not stray from the current design of the core Java system.

Keep it simple

This goal probably appears in all software design goal listings. JDBC is no exception. Sun felt that the design of JDBC should be very simple, allowing for only one method of completing a task per mechanism. Allowing duplicate functionality only serves to confuse the users of the API.

Use strong, static typing wherever possible

Strong typing allows for more error checking to be done at compile time; also, less error appear at runtime.

Keep the common cases simple

Because more often than not, the usual SQL calls used by the programmer are simple SELECT’s, INSERT’s, DELETE’s and UPDATE’s, these queries should be simple to perform with JDBC. However, more complex SQL statements should also be possible.

Finally we decided to precede the implementation using Java Networking.

And for dynamically updating the cache table we go for MS Access database.

Java ha two things: a programming language and a platform.

Java is a high-level programming language that is all of the following

Simple Architecture-neutral

Object-oriented Portable

Distributed High-performance

Interpreted Multithreaded

Robust Dynamic Secure

Java is also unusual in that each Java program is both compiled and interpreted. With a compile you translate a Java program into an intermediate language called Java byte codes the platform-independent code instruction is passed and run on the computer.

Compilation happens just once; interpretation occurs each time the program is executed. The figure illustrates how this works.

6.7 NETWORKING TCP/IP STACK:

The TCP/IP stack is shorter than the OSI one:

TCP is a connection-oriented protocol; UDP (User Datagram Protocol) is a connectionless protocol.

IP datagram’s:

The IP layer provides a connectionless and unreliable delivery system. It considers each datagram independently of the others. Any association between datagram must be supplied by the higher layers. The IP layer supplies a checksum that includes its own header. The header includes the source and destination addresses. The IP layer handles routing through an Internet. It is also responsible for breaking up large datagram into smaller ones for transmission and reassembling them at the other end.

UDP:

UDP is also connectionless and unreliable. What it adds to IP is a checksum for the contents of the datagram and port numbers. These are used to give a client/server model – see later.

TCP:

TCP supplies logic to give a reliable connection-oriented protocol above IP. It provides a virtual circuit that two processes can use to communicate.

Internet addresses

In order to use a service, you must be able to find it. The Internet uses an address scheme for machines so that they can be located. The address is a 32 bit integer which gives the IP address.

Network address:

Class A uses 8 bits for the network address with 24 bits left over for other addressing. Class B uses 16 bit network addressing. Class C uses 24 bit network addressing and class D uses all 32.

Subnet address:

Internally, the UNIX network is divided into sub networks. Building 11 is currently on one sub network and uses 10-bit addressing, allowing 1024 different hosts.

Host address:

8 bits are finally used for host addresses within our subnet. This places a limit of 256 machines that can be on the subnet.

Total address:

The 32 bit address is usually written as 4 integers separated by dots.

Port addresses

A service exists on a host, and is identified by its port. This is a 16 bit number. To send a message to a server, you send it to the port for that service of the host that it is running on. This is not location transparency! Certain of these ports are “well known”.

Sockets:

A socket is a data structure maintained by the system to handle network connections. A socket is created using the call socket. It returns an integer that is like a file descriptor. In fact, under Windows, this handle can be used with Read File and Write File functions.

#include <sys/types.h>

#include <sys/socket.h>

int socket(int family, int type, int protocol);

Here “family” will be AF_INET for IP communications, protocol will be zero, and type will depend on whether TCP or UDP is used. Two processes wishing to communicate over a network create a socket each. These are similar to two ends of a pipe – but the actual pipe does not yet exist.

6.8 JFREE CHART:

JFreeChart is a free 100% Java chart library that makes it easy for developers to display professional quality charts in their applications. JFreeChart’s extensive feature set includes:

A consistent and well-documented API, supporting a wide range of chart types;

A flexible design that is easy to extend, and targets both server-side and client-side applications;

Support for many output types, including Swing components, image files (including PNG and JPEG), and vector graphics file formats (including PDF, EPS and SVG);

JFreeChart is “open source” or, more specifically, free software. It is distributed under the terms of the GNU Lesser General Public Licence (LGPL), which permits use in proprietary applications.

6.8.1. Map Visualizations:

Charts showing values that relate to geographical areas. Some examples include: (a) population density in each state of the United States, (b) income per capita for each country in Europe, (c) life expectancy in each country of the world. The tasks in this project include: Sourcing freely redistributable vector outlines for the countries of the world, states/provinces in particular countries (USA in particular, but also other areas);

Creating an appropriate dataset interface (plus default implementation), a rendered, and integrating this with the existing XYPlot class in JFreeChart; Testing, documenting, testing some more, documenting some more.

6.8.2. Time Series Chart Interactivity

Implement a new (to JFreeChart) feature for interactive time series charts — to display a separate control that shows a small version of ALL the time series data, with a sliding “view” rectangle that allows you to select the subset of the time series data to display in the main chart.

6.8.3. Dashboards

There is currently a lot of interest in dashboard displays. Create a flexible dashboard mechanism that supports a subset of JFreeChart chart types (dials, pies, thermometers, bars, and lines/time series) that can be delivered easily via both Java Web Start and an applet.

6.8.4. Property Editors

The property editor mechanism in JFreeChart only handles a small subset of the properties that can be set for charts. Extend (or reimplement) this mechanism to provide greater end-user control over the appearance of the charts.

CHAPTER 7

7.0 APPENDIX

7.1 SAMPLE SCREEN SHOTS:

7.2 SAMPLE SOURCE CODE:

CHAPTER 8

8.1 CONCLUSION AND FUTURE:

We propose an alternative POW design on the problem of unauthorized file downloading in deduplicated cloud storage. In our design, the use of probabilistic data structure, the Bloom filter, primarily contributes to the overhead reduction. Since the Bloom filter has been used widely in various applications and is easy to be implemented, our proposed POW scheme is considered realistic and can be deployed in real-world cloud storage services. Despite the use of the Bloom filter in reducing the I/O needs, the size of the Bloom filter may grow with the number of files stored in the cloud. The Bloom filter may also be of a huge size so that it needs to be partitioned and part of it needs to be stored in the disk. Thus, one possible future research focus is to develop a more succinct data structure or devise a new index mechanism such that the index (the Bloom filter in this article) can be fit into the memory even in the case of a huge number of files in the cloud.

Privacy-Preserving Detection of Sensitive Data Exposure

05/08/201902/07/2019 by admin

An initiative data prefetching scheme on the storage servers in distributed file systems for cloud computing. In this prefetching technique, the client machines are not substantially involved in the process of data prefetching, but the storage servers can directly prefetch the data after analyzing the history of disk I/O access events, and then send the prefetched data to the relevant client machines proactively. To put this technique to work, the information about client nodes is piggybacked onto the real client I/O requests, and then forwarded to the relevant storage server. Next, two prediction algorithms have been proposed to forecast future block access operations for directing what data should be fetched on storage servers in advance.

Finally, the prefetched data can be pushed to the relevant client machine from the storage server. Through a series of evaluation experiments with a collection of application benchmarks, we have demonstrated that our presented initiative prefetching technique can benefit distributed file systems for cloud environments to achieve better I/O performance. In particular, configurationlimited client machines in the cloud are not responsible for predicting I/O access operations, which can definitely contribute to preferable system performance on them.

1.2 INTRODUCTION

The assimilation of distributed computing for search engines, multimedia websites, and data-intensive applications has brought about the generation of data at unprecedented speed. For instance, the amount of data created, replicated, and consumed in United States may double every three years through the end of this decade, according to the general, the file system deployed in a distributed computing environment is called a distributed file system, which is always used to be a backend storage system to provide I/O services for various sorts of dataintensive applications in cloud computing environments. In fact, the distributed file system employs multiple distributed I/O devices by striping file data across the I/O nodes, and uses high aggregate bandwidth to meet the growing I/O requirements of distributed and parallel scientific applications.

However, because distributed file systems scale both numerically and geographically, the network delay is becoming the dominant factor in remote file system access [26], [34]. With regard to this issue, numerous data prefetching mechanisms have been proposed to hide the latency in distributed file systems caused by network communication and disk operations. In these conventional prefetching mechanisms, the client file system (which is a part of the file system and runs on theclient machine) is supposed to predict future access by analyzing the history of occurred I/O access without any application intervention. After that, the client file system may send relevant I/O requests to storage servers for reading the relevant data in. Consequently, the applications that have intensive read workloads can automatically yield not only better use of available bandwidth, but also less file operations via batched I/O requests through prefetching.

On the other hand, mobile devices generally have limited processing power, battery life and storage, but cloud computing offers an illusion of infinite computing resources. For combining the mobile devices and cloud computing to create a new infrastructure, the mobile cloud computing research field emerged [45]. Namely, mobile cloud computing provides mobile applications with data storage and processing services in clouds, obviating the requirement to equip a powerful hardware configuration, because all resource-intensive computing can be completed in the cloud. Thus, conventional prefetching schemes are not the best-suited optimization strategies for distributed file systems to boost I/O performance in mobile clouds, since these schemes require the client file systems running on client machines to proactively issue prefetching requests after analyzing the occurred access events recorded by them, which must place negative effects to the client nodes.

Furthermore, considering only disk I/O events can reveal the disk tracks that can offer critical information to perform I/O optimization tactics certain prefetching techniques have been proposed in succession to read the data on the disk in advance after analyzing disk I/O traces. But, this kind of prefetching only works for local file systems, and the prefetched data iscached on the local machine to fulfill the application’s I/O requests passively in brief, although block access history reveals the behavior of disk tracks, there are no prefetching schemes on storage servers in a distributed file system for yielding better system performance. And the reason for this situation is because of the difficulties in modeling the block access history to generate block access patterns and deciding the destination client machine for driving the prefetched data from storage servers.

1.3 LITRATURE SURVEY

PARTIAL REPLICATION OF METADATA TO ACHIEVE HIGH METADATA AVAILABILITY IN PARALLEL FILE SYSTEMS

AUTHOR: J. Liao, Y. Ishikawa

PUBLISH: In the Proceedings of 41st International Conference on Parallel Processing (ICPP ’12), pp. 168–177, 2012.

EXPLANATION:

This paper presents PARTE, a prototype parallel file system with active/standby configured metadata servers (MDSs). PARTE replicates and distributes a part of files’ metadata to the corresponding metadata stripes on the storage servers (OSTs) with a per-file granularity, meanwhile the client file system (client) keeps certain sent metadata requests. If the active MDS has crashed for some reason, these client backup requests will be replayed by the standby MDS to restore the lost metadata. In case one or more backup requests are lost due to network problems or dead clients, the latest metadata saved in the associated metadata stripes will be used to construct consistent and up-to-date metadata on the standby MDS. Moreover, the clients and OSTs can work in both normal mode and recovery mode in the PARTE file system. This differs from conventional active/standby configured MDSs parallel file systems, which hang all I/O requests and metadata requests during restoration of the lost metadata. In the PARTE file system, previously connected clients can continue to perform I/O operations and relevant metadata operations, because OSTs work as temporary MDSs during that period by using the replicated metadata in the relevant metadata stripes. Through examination of experimental results, we show the feasibility of the main ideas presented in this paper for providing high availability metadata service with only a slight overhead effect on I/O performance. Furthermore, since previously connected clients are never hanged during metadata recovery, in contrast to conventional systems, a better overall I/O data throughput can be achieved with PARTE.

EVALUATING PERFORMANCE AND ENERGY IN FILE SYSTEM SERVER WORKLOADS

AUTHOR: P. Sehgal, V. Tarasov, E. Zadok

PUBLISH: the 8th USENIX Conference on File and Storage Technologies (FAST ’10), pp.253-266, 2010.

EXPLANATION:

Recently, power has emerged as a critical factor in designing components of storage systems, especially for power-hungry data centers. While there is some research into power-aware storage stack components, there are no systematic studies evaluating each component’s impact separately. This paper evaluates the file system’s impact on energy consumption and performance. We studied several popular Linux file systems, with various mount and format options, using the FileBench workload generator to emulate four server workloads: Web, database, mail, and file server. In case of a server node consisting of a single disk, CPU power generally exceeds disk-power consumption. However, file system design, implementation, and available features have a signifi- cant effect on CPU/disk utilization, and hence on performance and power. We discovered that default file system options are often suboptimal, and even poor. We show that a careful matching of expected workloads to file system types and options can improve power-performance efficiency by a factor ranging from 1.05 to 9.4 times.

FLEXIBLE, WIDEAREA STORAGE FOR DISTRIBUTED SYSTEMS WITH WHEELFS

AUTHOR: J. Stribling, Y. Sovran, I. Zhang and R. Morris et al

PUBLISH: In Proceedings of the 6th USENIX symposium on Networked systems design and implementation (NSDI’09), USENIX Association, pp. 43–58, 2009.

EXPLANATION:

WheelFS is a wide-area distributed storage system intended to help multi-site applications share data and gain fault tolerance. WheelFS takes the form of a distributed file system with a familiar POSIX interface. Its design allows applications to adjust the tradeoff between prompt visibility of updates from other sites and the ability for sites to operate independently despite failures and long delays. WheelFS allows these adjustments via semantic cues, which provide application control over consistency, failure handling, and file and replica placement. WheelFS is implemented as a user-level file system and is deployed on PlanetLab and Emulab. Three applications (a distributed Web cache, an email service and large file distribution) demonstrate that WheelFS’s file system interface simplifies construction of distributed applications by allowing reuse of existing software. These applications would perform poorly with the strict semantics implied by a traditional file system interface, but by providing cues to WheelFS they are able to achieve good performance. Measurements show that applications built on WheelFS deliver comparable performance to services such as CoralCDN and BitTorrent that use specialized wide-area storage systems.

CHAPTER 2

2.0 SYSTEM ANALYSIS

2.1 EXISTING SYSTEM:

The file system deployed in a distributed computing environment is called a distributed file system, which is always used to be a backend storage system to provide I/O services for various sorts of data intensive applications in cloud computing environments. In fact, the distributed file system employs multiple distributed I/O devices by striping file data across the I/O nodes, and uses high aggregate bandwidth to meet the growing I/O requirements of distributed and parallel scientific applications benchmark to create OLTP workloads, since it is able to create similar OLTP workloads that exist in real systems. All the configured client file systems executed the same script, and each of them run several threads that issue OLTP requests. Because Sysbench requires MySQL installed as a backend for OLTP workloads, we configured mysqld process to 16 cores of storage servers. As a consequence, it is possible to measure the response time to the client request while handling the generated workloads.

2.1.1 DISADVANTAGES:

Network delay in numerically and geographically remote file system access

Mobile devices generally have limited processing power, battery life and storage

2.2 PROPOSED SYSTEM:

Proposed in succession to read the data on the disk in advance after analyzing disk I/O traces of prefetching only works for local file systems, and the prefetched data is cached on the local machine to fulfill the application’s I/O requests passively. In brief, although block access history reveals the behavior of disk tracks, there are no prefetching schemes on storage servers in a distributed file system for yielding better system performance. And the reason for this situation is because of the difficulties in modeling the block access history to generate block access patterns and deciding the destination client machine for driving the prefetched data from storage servers. To yield attractive I/O performance in the distributed file system deployed in a mobile cloud environment or a cloud environment that has many resource-limited client machines, this paper presents an initiative data prefetching mechanism. The proposed mechanism first analyzes disk I/O tracks to predict the future disk I/O access so that the storage servers can fetch data in advance, and then forward the prefetched data to relevant client file systems for future potential usages.

This paper makes the following two contributions:

1) Chaotic time series prediction and linear regression prediction to forecast disk I/O access. We have modeled the disk I/O access operations, and classified them into two kinds of access patterns, i.e. the random access pattern and the sequential access pattern. Therefore, in order to predict the future I/O access that belongs to the different access patterns as accurately as possible (note that the future I/O access indicates what data will be requested in the near future), two prediction algorithms including the chaotic time series prediction algorithm and the linear regression prediction algorithm have been proposed respectively. 2) Initiative data prefetching on storage servers. Without any intervention from client file systems except for piggybacking their information onto relevant I/O requests to the storage servers. The storage servers are supposed to log disk I/O access and classify access patterns after modeling disk I/O events. Next, by properly using two proposed prediction algorithms, the storage servers can predict the future disk I/O access to guide prefetching data. Finally, the storage servers proactively forward the prefetched data to the relevant client file systems for satisfying future application’s requests.

2.2.1 ADVANTAGES:

The applications that have intensive read workloads can automatically yield not only better use of available bandwidth.

Less file operations via batched I/O requests through prefetching

Cloud computing offers an illusion of infinite computing resources

2.3 HARDWARE & SOFTWARE REQUIREMENTS:

2.3.1 HARDWARE REQUIREMENT:

v Processor – Pentium –IV

Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Key Board – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA

2.3.2 SOFTWARE REQUIREMENTS:

JAVA

Operating System : Windows XP or Win7
Front End : JAVA JDK 1.7
Script : Java Script
Document : MS-Office 2007

CHAPTER 3

3.0 SYSTEM DESIGN:

Data Flow Diagram / Use Case Diagram / Flow Diagram:

The DFD is also called as bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, various processing carried out on these data, and the output data is generated by the system

The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components. These components are the system process, the data used by the process, an external entity that interacts with the system and the information flows in the system.

DFD shows how the information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.

DFD is also known as bubble chart. A DFD may be used to represent a system at any level of abstraction. DFD may be partitioned into levels that represent increasing information flow and functional detail.

NOTATION:

SOURCE OR DESTINATION OF DATA:

External sources or destinations, which may be people or organizations or other entities

DATA SOURCE:

Here the data referenced by a process is stored and retrieved.

PROCESS:

People, procedures or devices that produce data’s in the physical component is not identified.

DATA FLOW:

Data moves in a specific direction from an origin to a destination. The data flow is a “packet” of data.

MODELING RULES:

There are several common modeling rules when creating DFDs:

All processes must have at least one data flow in and one data flow out.
All processes should modify the incoming data, producing new forms of outgoing data.
Each data store must be involved with at least one data flow.
Each external entity must be involved with at least one data flow.
A data flow must be attached to at least one process.

3.1 ARCHITECTURE DIAGRAM

3.2 DATAFLOW DIAGRAM

UML DIAGRAMS:

3.2 USE CASE DIAGRAM:

3.3 CLASS DIAGRAM:

3.4 SEQUENCE DIAGRAM:

3.5 ACTIVITY DIAGRAM:

CHAPTER 4

4.0 IMPLEMENTATION:

I/O ACCESS PREDICTION

4.1 ALGORITHM

MARKOV MODEL PREDICTION ALGORITHM

LINEAR PREDICTION ALGORITHM

4.2 MODULES:

SERVER CLIENT MODULE:

DISTRIBUTED FILE SYSTEMS:

INITIATIVE DATA PREFETCHING:

ANALYSIS OF PREDICTIONS:

4.3 MODULE DESCRIPTION:

SERVER CLIENT MODULE:

DISTRIBUTED FILE SYSTEMS:

INITIATIVE DATA PREFETCHING:

ANALYSIS OF PREDICTIONS:

CHAPTER 5

5.0 SYSTEM STUDY:

5.1 FEASIBILITY STUDY:

Three key considerations involved in the feasibility analysis are

ECONOMICAL FEASIBILITY
TECHNICAL FEASIBILITY
SOCIAL FEASIBILITY

5.1.1 ECONOMICAL FEASIBILITY:

5.1.2 TECHNICAL FEASIBILITY

5.1.3 SOCIAL FEASIBILITY:

5.2 SYSTEM TESTING:

5.2.1 UNIT TESTING:

Description	Expected result
Test for application window properties.	All the properties of the windows are to be properly aligned and displayed.
Test for mouse operations.	All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions.

5.1.2 FUNCTIONAL TESTING:

Description	Expected result
Test for all modules.	All peers should communicate in the group.
Test for various peer in a distributed network framework as it display all users available in the group.	The result after execution should give the accurate result.

5.1. 3 NON-FUNCTIONAL TESTING:

Load testing
Performance testing
Usability testing
Reliability testing
Security testing

5.1.4 LOAD TESTING:

Description	Expected result
It is necessary to ascertain that the application behaves correctly under loads when ‘Server busy’ response is received.	Should designate another active node as a Server.

5.1.5 PERFORMANCE TESTING:

Description	Expected result
This is required to assure that an application perforce adequately, having the capability to handle many peers, delivering its results in expected time and using an acceptable level of resource and it is an aspect of operational management.	Should handle large input values, and produce accurate result in a expected time.

5.1.6 RELIABILITY TESTING:

Description	Expected result
This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in provide the application.	In case of failure of the server an alternate server should take over the job.

5.1.7 SECURITY TESTING:

Description	Expected result
Checking that the user identification is authenticated.	In case failure it should not be connected in the framework.
Check whether group keys in a tree are shared by all peers.	The peers should know group key in the same group.

5.1.8 WHITE BOX TESTING:

Description	Expected result
Exercise all logical decisions on their true and false sides.	All the logical decisions must be valid.
Execute all loops at their boundaries and within their operational bounds.	All the loops must be finite.
Exercise internal data structures to ensure their validity.	All the data structures must be valid.

5.1.9 BLACK BOX TESTING:

Description	Expected result
To check for incorrect or missing functions.	All the functions must be valid.
To check for interface errors.	The entire interface must function normally.
To check for errors in a data structures or external data base access.	The database updation and retrieval must be done.
To check for initialization and termination errors.	All the functions and data structures must be initialized properly and terminated normally.

All the above system testing strategies are carried out in as the development, documentation and institutionalization of the proposed goals and related policies is essential.

CHAPTER 6

6.0 SOFTWARE DESCRIPTION:

6.1 JAVA TECHNOLOGY:

Java technology is both a programming language and a platform.

The Java Programming Language

The Java programming language is a high-level language that can be characterized by all of the following buzzwords:

Simple
- Architecture neutral
- Object oriented
- Portable
- Distributed
- High performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure

6.2 THE JAVA PLATFORM:

The Java platform has two components:

The Java Virtual Machine (Java VM)
The Java Application Programming Interface (Java API)

You’ve already been introduced to the Java VM. It’s the base for the Java platform and is ported onto various hardware-based platforms.

The following figure depicts a program that’s running on the Java platform. As the figure shows, the Java API and the virtual machine insulate the program from the hardware.

6.3 WHAT CAN JAVA TECHNOLOGY DO?

The essentials: Objects, strings, threads, numbers, input and output, data structures, system properties, date and time, and so on.
Applets: The set of conventions used by applets.
Networking: URLs, TCP (Transmission Control Protocol), UDP (User Data gram Protocol) sockets, and IP (Internet Protocol) addresses.
Internationalization: Help for writing programs that can be localized for users worldwide. Programs can automatically adapt to specific locales and be displayed in the appropriate language.
Security: Both low level and high level, including electronic signatures, public and private key management, access control, and certificates.
Software components: Known as JavaBeans^TM, can plug into existing component architectures.
Object serialization: Allows lightweight persistence and communication via Remote Method Invocation (RMI).
Java Database Connectivity (JDBC^TM): Provides uniform access to a wide range of relational databases.

The Java platform also has APIs for 2D and 3D graphics, accessibility, servers, collaboration, telephony, speech, animation, and more. The following figure depicts what is included in the Java 2 SDK.

6.4 HOW WILL JAVA TECHNOLOGY CHANGE MY LIFE?

Get started quickly: Although the Java programming language is a powerful object-oriented language, it’s easy to learn, especially for programmers already familiar with C or C++.
Write less code: Comparisons of program metrics (class counts, method counts, and so on) suggest that a program written in the Java programming language can be four times smaller than the same program in C++.
Write better code: The Java programming language encourages good coding practices, and its garbage collection helps you avoid memory leaks. Its object orientation, its JavaBeans component architecture, and its wide-ranging, easily extendible API let you reuse other people’s tested code and introduce fewer bugs.
Develop programs more quickly: Your development time may be as much as twice as fast versus writing the same program in C++. Why? You write fewer lines of code and it is a simpler programming language than C++.
Avoid platform dependencies with 100% Pure Java: You can keep your program portable by avoiding the use of libraries written in other languages. The 100% Pure Java^TMProduct Certification Program has a repository of historical process manuals, white papers, brochures, and similar materials online.
Write once, run anywhere: Because 100% Pure Java programs are compiled into machine-independent byte codes, they run consistently on any Java platform.
Distribute software more easily: You can upgrade applets easily from a central server. Applets take advantage of the feature of allowing new classes to be loaded “on the fly,” without recompiling the entire program.

6.5 ODBC:

6.6 JDBC:

JDBC was announced in March of 1996. It was released for a 90 day public review that ended June 8, 1996. Because of user input, the final JDBC v1.0 specification was released soon after.

6.7 JDBC Goals:

The goals that were set for JDBC are important. They will give you some insight as to why certain classes and functionalities behave the way they do. The eight design goals for JDBC are as follows:

SQL Level API

SQL Conformance

JDBC must be implemental on top of common database interfaces

Provide a Java interface that is consistent with the rest of the Java system

Because of Java’s acceptance in the user community thus far, the designers feel that they should not stray from the current design of the core Java system.

Keep it simple

Use strong, static typing wherever possible

Strong typing allows for more error checking to be done at compile time; also, less error appear at runtime.

Keep the common cases simple

Finally we decided to precede the implementation using Java Networking.

And for dynamically updating the cache table we go for MS Access database.

Java ha two things: a programming language and a platform.

Java is a high-level programming language that is all of the following

Simple Architecture-neutral

Object-oriented Portable

Distributed High-performance

Interpreted Multithreaded

Robust Dynamic Secure

Compilation happens just once; interpretation occurs each time the program is executed. The figure illustrates how this works.

6.7 NETWORKING TCP/IP STACK:

The TCP/IP stack is shorter than the OSI one:

TCP is a connection-oriented protocol; UDP (User Datagram Protocol) is a connectionless protocol.

IP datagram’s:

UDP:

UDP is also connectionless and unreliable. What it adds to IP is a checksum for the contents of the datagram and port numbers. These are used to give a client/server model – see later.

TCP:

TCP supplies logic to give a reliable connection-oriented protocol above IP. It provides a virtual circuit that two processes can use to communicate.

Internet addresses

In order to use a service, you must be able to find it. The Internet uses an address scheme for machines so that they can be located. The address is a 32 bit integer which gives the IP address.

Network address:

Class A uses 8 bits for the network address with 24 bits left over for other addressing. Class B uses 16 bit network addressing. Class C uses 24 bit network addressing and class D uses all 32.

Subnet address:

Internally, the UNIX network is divided into sub networks. Building 11 is currently on one sub network and uses 10-bit addressing, allowing 1024 different hosts.

Host address:

8 bits are finally used for host addresses within our subnet. This places a limit of 256 machines that can be on the subnet.

Total address:

The 32 bit address is usually written as 4 integers separated by dots.

Port addresses

Sockets:

#include <sys/types.h>

#include <sys/socket.h>

int socket(int family, int type, int protocol);

6.8 JFREE CHART:

JFreeChart is a free 100% Java chart library that makes it easy for developers to display professional quality charts in their applications. JFreeChart’s extensive feature set includes:

A consistent and well-documented API, supporting a wide range of chart types;

A flexible design that is easy to extend, and targets both server-side and client-side applications;

Support for many output types, including Swing components, image files (including PNG and JPEG), and vector graphics file formats (including PDF, EPS and SVG);

JFreeChart is “open source” or, more specifically, free software. It is distributed under the terms of the GNU Lesser General Public Licence (LGPL), which permits use in proprietary applications.

6.8.1. Map Visualizations:

6.8.2. Time Series Chart Interactivity

6.8.3. Dashboards

6.8.4. Property Editors

CHAPTER 7

7.0 APPENDIX

7.1 SAMPLE SCREEN SHOTS:

7.2 SAMPLE SOURCE CODE:

CHAPTER 8

8.1 CONCLUSION AND FUTURE WORK:

We have proposed, implemented and evaluated an initiative data prefetching approach on the storage servers for distributed file systems, which can be employed as a backend storage system in a cloud environment that may have certain resource-limited client machines. To be specific, the storage servers are capable of predicting future disk I/O access to guide fetching data in advance after analyzing the existing logs, and then they proactively push the prefetched data to relevant client file systems for satisfying future applications’ requests.

Purpose of effectively modeling disk I/O access patterns and accurately forwarding the prefetched data, the information about client file systems is piggybacked onto relevant I/O requests, and then transferred from client nodes to corresponding storage server nodes. Therefore, the client file systems running on the client nodes neither log I/O events nor conduct I/O access prediction; consequently, the thin client nodes can focus on performing necessary tasks with limited computing capacity and energy endurance.

Initiative prefetching scheme can be applied in the distributed file system for a mobile cloud computing environment, in which there are many tablet computers and smart terminals. The current implementation of our proposed initiative prefetching scheme can classify only two access patterns and support two corresponding prediction algorithms for predicting future disk I/O access. We are planning to work on classifying patterns for a wider range of application benchmarks in the future by utilizing the horizontal visibility graph technique applying network delay aware replica selection techniques for reducing network transfer time when prefetching data among several replicas is another task in our future work.

Privacy Policy Inference of User-Uploaded Images on Content Sharing Sites

05/08/201902/07/2019 by admin

With the increasing volume of images users share through social sites, maintaining privacy has become a major problem, as demonstrated by a recent wave of publicized incidents where users inadvertently shared personal information. In light of these incidents, the need of tools to help users control access to their shared content is apparent. Toward addressing this need, we propose an Adaptive Privacy Policy Prediction (A3P) system to help users compose privacy settings for their images. We examine the role of social context, image content, and metadata as possible indicators of users’ privacy preferences.

We propose a two-level framework which according to the user’s available history on the site, determines the best available privacy policy for the user’s images being uploaded. Our solution relies on an image classification framework for image categories which may be associated with similar policies, and on a policy prediction algorithm to automatically generate a policy for each newly uploaded image, also according to users’ social features. Over time, the generated policies will follow the evolution of users’ privacy attitude. We provide the results of our extensive evaluation over 5,000 policies, which demonstrate the effectiveness of our system, with prediction accuracies over 90 percent.

1.2 INTRODUCTION

Images are now one of the key enablers of users’ connectivity. Sharing takes place both among previously established groups of known people or social circles (e. g., Google+, Flickr or Picasa), and also increasingly with people outside the users social circles, for purposes of social discovery-to help them identify new peers and learn about peers interests and social surroundings. However, semantically rich images may reveal contentsensitive information. Consider a photo of a students 2012 graduationceremony, for example.

It could be shared within a Google+ circle or Flickr group, but may unnecessarily expose the studentsBApos familymembers and other friends. Sharing images within online content sharing sites,therefore,may quickly leadto unwanted disclosure and privacy violations. Further, the persistent nature of online media makes it possible for other users to collect rich aggregated information about the owner of the published content and the subjects in the published content. The aggregated information can result in unexpected exposure of one’s social environment and lead to abuse of one’s personal information.

Most content sharing websites allow users to enter their privacy preferences. Unfortunately, recent studies have shown that users struggle to set up and maintain such privacy settings. One of the main reasons provided is that given the amount of shared information this process can be tedious and error-prone. Therefore, many have acknowledged the need of policy recommendation systems which can assist users to easily and properly configure privacy settings. However, existing proposals for automating privacy settings appear to be inadequate to address the unique privacy needs of images due to the amount of information implicitly carried within images, and their relationship with the online environment wherein they are exposed.

1.3 LITRATURE SURVEY

TITLE NAME: SHEEPDOG: GROUP AND TAG RECOMMENDATION FOR FLICKR PHOTOS BY AUTOMATIC SEARCH-BASED LEARNING

AUTHOR: H.-M. Chen, M.-H. Chang, P.-C. Chang, M.-C. Tien, W. H. Hsu, and J.-L. Wu,

PUBLISH: Proc. 16th ACM Int. Conf. Multimedia, 2008, pp. 737–740.

EXPLANATION:

Online photo albums have been prevalent in recent years and have resulted in more and more applications developed to provide convenient functionalities for photo sharing. In this paper, we propose a system named SheepDog to automatically add photos into appropriate groups and recommend suitable tags for users on Flickr. We adopt concept detection to predict relevant concepts of a photo and probe into the issue about training data collection for concept classification. From the perspective of gathering training data by web searching, we introduce two mechanisms and investigate their performances of concept detection. Based on some existing information from Flickr, a ranking-based method is applied not only to obtain reliable training data, but also to provide reasonable group/tag recommendations for input photos. We evaluate this system with a rich set of photos and the results demonstrate the effectiveness of our work.

TITLE NAME: CONNECTING CONTENT TO COMMUNITY IN SOCIAL MEDIA VIA IMAGE CONTENT, USER TAGS AND USER COMMUNICATION

AUTHOR: M. D. Choudhury, H. Sundaram, Y.-R. Lin, A. John, and D. D. Seligmann

PUBLISH: Proc. IEEE Int. Conf. Multimedia Expo, 2009, pp.1238–1241.

EXPLANATION:

In this paper we develop a recommendation framework to connect image content with communities in online social media. The problem is important because users are looking for useful feedback on their uploaded content, but finding the right community for feedback is challenging for the end user. Social media are characterized by both content and community. Hence, in our approach, we characterize images through three types of features: visual features, user generated text tags, and social interaction (user communication history in the form of comments). A recommendation framework based on learning a latent space representation of the groups is developed to recommend the most likely groups for a given image. The model was tested on a large corpus of Flickr images comprising 15,689 images. Our method outperforms the baseline method, with a mean precision 0.62 and mean recall 0.69. Importantly, we show that fusing image content, text tags with social interaction features outperforms the case of only using image content or tags.

TITLE NAME: ANALYSING FACEBOOK FEATURES TO SUPPORT EVENT DETECTION FOR PHOTO-BASED FACEBOOK APPLICATIONS

AUTHOR: M. Rabbath, P. Sandhaus, and S. Boll,

PUBLISH: Proc. 2nd ACM Int. Conf. Multimedia Retrieval, 2012, pp. 11:1–11:8.

EXPLANATION:

Facebook witnesses an explosion of the number of shared photos: With 100 million photo uploads a day it creates as much as a whole Flickr each two months in terms of volume. Facebook has also one of the healthiest platforms to support third party applications, many of which deal with photos and related events. While it is essential for many Facebook applications, until now there is no easy way to detect and link photos that are related to the same events, which are usually distributed between friends and albums. In this work, we introduce an approach that exploits Facebook features to link photos related to the same event. In the current situation where the EXIF header of photos is missing in Facebook, we extract visual-based, tagged areas-based, friendship-based and structure-based features. We evaluate each of these features and use the results in our approach. We introduce and evaluate a semi-supervised probabilistic approach that takes into account the evaluation of these features. In this approach we create a lookup table of the initialization values of our model variables and make it available for other Facebook applications or researchers to use. The evaluation of our approach showed promising results and it outperformed the other the baseline method of using the unsupervised EM algorithm in estimating the parameters of a Gaussian mixture model. We also give two examples of the applicability of this approach to help Facebook applications in better serving the user.

CHAPTER 2

2.0 SYSTEM ANALYSIS

2.1 EXISTING SYSTEM:

Image content sharing environments such as Flickr or YouTube contain a large amount of private resources such as photos showing weddings, family holidays, and private parties. These resources can be of a highly sensitive nature, disclosing many details of the users’ private sphere. In order to support users in making privacy decisions in the context of image sharing and to provide them with a better overview on privacy related visual content available on the Web techniques to automatically detect private images, and to enable privacy-oriented image search.

To this end, we learn privacy classifiers trained on a large set of manually assessed Flickr photos, combining textual metadata of images with a variety of visual features. We employ the resulting classification models for specifically searching for private photos, and for diversifying query results to provide users with a better coverage of private and public content. Most content sharing websites allow users to enter their privacy preferences. Unfortunately, recent studies have shown that users struggle to set up and maintain such privacy settings.

One of the main reasons provided is that given the amount of shared information this process can be tedious and error-prone of policy recommendation systems which can assist users too easily and properly configure privacy settings.

2.1.1 DISADVANTAGES:

Sharing images within online content sharing sites, therefore, may quickly lead to unwanted disclosure and privacy violations.
Further, the persistent nature of online media makes it possible for other users to collect rich aggregated information about the owner of the published content and the subjects in the published content.
The aggregated information can result in unexpected exposure of one’s social environment and lead to abuse of one’s personal information.

2.2 PROPOSED SYSTEM:

We propose an Adaptive Privacy Policy Prediction (A3P) system which aims to provide users a hassle free privacy settings experience by automatically generating personalized policies. The A3P system handles user uploaded images, and factors in the following criteria that influence one’s privacy settings of images:

The impact of social environment and personal characteristics: Social context of users, such as their profile information and relationships with others may provide useful information regarding users’ privacy preferences. For example, users interested in photography may like to share their photos with other amateur photographers. Users who have several family members among their social contacts may share with them pictures related to family events. However, using common policies across all users or across users with similar traits may be too simplistic and not satisfy individual preferences.

Users may have drastically different opinions even on the same type of images. For example, a privacy adverse person may be willing to share all his personal images while a more conservative person may just want to share personal images with his family members. In light of these considerations, it is important to find the balancing point between the impact of social environment and users’ individual characteristics in order to predict the policies that match each individual’s needs.

The role of image’s content and metadata: In general, similar images often incur similar privacy preferences, especially when people appear in the images. For example, one may upload several photos of his kids and specify that only his family members are allowed to see these photos. He may upload some other photos of landscapes which he took as a hobby and for these photos, he may set privacy preference allowing anyone to view and comment the photos. Analyzing the visual content may not be sufficient to capture users’ privacy preferences. Tags and other metadata are indicative of the social context of the image, including where it was taken and why, and also provide a synthetic description of images, complementing the information obtained from visual content analysis.

2.2.1 ADVANTAGES:

The A3P-core focuses on analyzing each individual user’s own images and metadata, while the A3P-Social offers a community perspective of privacy setting recommendations for a user’s potential privacy improvement.

Our algorithm in A3P-core (that is now parameterized based on user groups and also factors in possible outliers), and a new A3P-social module that develops the notion of social context to refine and extend the prediction power of our system.

We design the interaction flows between the two building blocks to balance the benefits from meeting personal characteristics and obtaining community advice.

2.3 HARDWARE & SOFTWARE REQUIREMENTS:

2.3.1 HARDWARE REQUIREMENT:

v Processor – Pentium –IV

Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Key Board – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA

2.3.2 SOFTWARE REQUIREMENTS:

Operating System : Windows XP or Win7
Front End : JAVA JDK 1.7
Back End : MYSQL Server
Server : Apache Tomact Server
Script : JSP Script
Document : MS-Office 2007

CHAPTER 3

3.0 SYSTEM DESIGN:

Data Flow Diagram / Use Case Diagram / Flow Diagram:

The DFD is also called as bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, various processing carried out on these data, and the output data is generated by the system

The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components. These components are the system process, the data used by the process, an external entity that interacts with the system and the information flows in the system.

DFD shows how the information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.

DFD is also known as bubble chart. A DFD may be used to represent a system at any level of abstraction. DFD may be partitioned into levels that represent increasing information flow and functional detail.

NOTATION:

SOURCE OR DESTINATION OF DATA:

External sources or destinations, which may be people or organizations or other entities

DATA SOURCE:

Here the data referenced by a process is stored and retrieved.

PROCESS:

People, procedures or devices that produce data’s in the physical component is not identified.

DATA FLOW:

Data moves in a specific direction from an origin to a destination. The data flow is a “packet” of data.

MODELING RULES:

There are several common modeling rules when creating DFDs:

All processes must have at least one data flow in and one data flow out.
All processes should modify the incoming data, producing new forms of outgoing data.
Each data store must be involved with at least one data flow.
Each external entity must be involved with at least one data flow.
A data flow must be attached to at least one process.

3.1 ARCHITECTURE DIAGRAM

3.2 DATAFLOW DIAGRAM

ADMIN:

USER:

UML DIAGRAMS:

3.2 USE CASE DIAGRAM:

ADMIN:

USER:

3.3 CLASS DIAGRAM:

ADMIN:

USER:

3.4 SEQUENCE DIAGRAM:

ADMIN:

USER:

3.5 ACTIVITY DIAGRAM:

ADMIN:

USER:

CHAPTER 4

4.0 IMPLEMENTATION:

A3P-CORE

There are two major components in A3P-core: (i) Image classification and (ii) Adaptive policy prediction. For each user, his/her images are first classified based on content and metadata. Then, privacy policies of each category of images are analyzed for the policy prediction. Adopting a two-stage approach is more suitable for policy recommendation than applying the common one-stage data mining approaches to mine both image features and policies together. Recall that when a user uploads a new image, the user is waiting for a recommended policy.

The two-stage approach allows the system to employ the first stage to classify the new image and find the candidate sets of images for the subsequent policy recommendation. As for the one-stage mining approach, it would not be able to locate the right class of the new image because its classification criteria need both image features and policies whereas the policies of the new image are not available yet. Moreover, combining both image features and policies into a single classifier would lead to a system which is very dependent to the specific syntax of the policy. If a change in the supported policies were to be introduced, the whole learning model would need to change.

A3P-SOCIAL

The A3P-social employs a multi-criteria inference mechanism that generates representative policies by leveraging key information related to the user’s social context and his general attitude toward privacy. As mentioned earlier, A3Psocial will be invoked by the A3P-core in two scenarios. One is when the user is a newbie of a site, and does not have enough images stored for the A3P-core to infer meaningful and customized policies. The other is when the system notices significant changes of privacy trend in theuser’s social circle, which may be of interest for the user to possibly adjust his/her privacy settings accordingly. In what follows, we first present the types of social context considered by A3P-Social, and then present the policy recommendation process.

4.1 ALGORITHM

Our algorithm performs better for users with certain characteristics. Therefore, we study possible factors relevant to the performance of our algorithm. We used a least squares multiple regression analysis, regressing performance of the A3P-core to the following possible predictors:

4.2 MODULES:

WEB-BASED IMAGE SHARING SERVICES:

METADATA-BASED CLASSIFICATION:

CONTENT-BASED CLASSIFICATION:

ADAPTIVE POLICY PREDICTION:

4.3 MODULE DESCRIPTION:

WEB-BASED IMAGE SHARING SERVICES:

Sharing images within online content sharing sites, therefore, may quickly lead to unwanted disclosure and privacy violations. Further, the persistent nature of online media makes it possible for other users to collect rich aggregated information about the owner of the published content and the subjects in the published content. The aggregated information can result in unexpected exposure of one’s social environment and lead to abuse of one’s personal information. We expected that frequency of sharing pictures and frequency of changing privacy settings would be significantly related to performance, but the results indicate that the frequency of social network use, frequency of uploading images and frequency of changing settings are not related to the performance our algorithm obtains with privacy settings predictions. This is a particularly useful result as it indicates that our algorithm will perform equally well for users who frequently use and share images on social networks as well as for users who may have limited access or limited information to share.

METADATA-BASED CLASSIFICATION:

We propose a hierarchical image classification which classifies images first based on their contents and then refine each category into subcategories based on their metadata. Images that do not have metadata will be grouped only by content. Such a hierarchical classification gives a higher priority to image content and minimizes the influence of missing tags. Note that it is possible that some images are included in multiple categories as long as they contain the typical content features or metadata based classification groups’ images into subcategories under aforementioned baseline categories.

The process consists of three main steps.

The third step is to find a subcategory that an image belongs to. This is an incremental procedure. At the beginning, the first image forms a subcategory as itself and the representative hypernyms of the image becomes the subcategory’s representative hypernyms. Then, we compute the distance between representative hypernyms of a new incoming image and each existing subcategory.

CONTENT-BASED CLASSIFICATION:

Our approach to content-based classification is based on an efficient and yet accurate image similarity approach. Specifically, our classification algorithm compares image signatures defined based on quantified and sanitized version of Haar wavelet transformation. For each image, the wavelet transform encodes frequency and spatial information related to image color, size, invariant transform, shape, texture, symmetry, etc. Then, a small number of coefficients are selected to form the signature of the image. The content similarity among images is then determined by the distance among their image signatures.

Our selected similarity criteria include texture, symmetry, shape (radial symmetry and phase congruency and SIFT. We also account for color and size. We set the system to start from five generic image classes: (a) explicit (e.g., nudity, violence, drinking etc), (b) adults, (c) kids, (d) scenery (e.g., beach, mountains), (e) animals. As a preprocessing step, we populate the five baseline classes by manually assigning to each class a number of images crawled from Google images, resulting in about 1,000 images per class. Having a large image data set beforehand reduces the chance of misclassification. Then, we generate signatures of all the images and store them in the database.

Our content classifier, we conducted some preliminary test to evaluate its accuracy. Precisely, we tested our classifier it against a ground-truth data set, Image-net.org. In Image-net, over 10 million images are collected and classified according to the wordnet structure. For each image class, we use the first half set of images as the training data set and classify the next 800 images. The classification result was recorded as correct if the synset’s main search term or the direct hypernym is returned as a class. The average accuracy of our classifier is above 94 percent.

ADAPTIVE POLICY PREDICTION:

The policy prediction algorithm provides a predicted policy of a newly uploaded image to the user for his/her reference. More importantly, the predicted policy will reflect the possible changes of a user’s privacy concerns. The prediction process consists of three main phases: (i) policy normalization; (ii) policy mining; and (iii) policy prediction. The policy normalization is a simple decomposition process to convert a user policy into a set of atomic rules in which the data (D) component is a single-element set.

We propose a hierarchical mining approach for policy mining. Our approach leverages association rule mining techniques to discover popular patterns in policies. Policy mining is carried out within the same category of the new image because images in the same category are more likely under the similar level of privacy protection. The basic idea of the hierarchical mining is to follow a natural order in which a user defines a policy.

Given an image, a user usually first decides who can access the image, then thinks about what specific access rights (e.g., view only or download) should be given, and finally refine the access conditions such as setting the expiration date. Correspondingly, the hierarchical mining first look for popular subjects defined by the user, then look for popular actions in the policies containing the popular subjects, and finally for popular conditions in the policies containing both popular subjects and conditions.

CHAPTER 5

5.0 SYSTEM STUDY:

5.1 FEASIBILITY STUDY:

Three key considerations involved in the feasibility analysis are

ECONOMICAL FEASIBILITY
TECHNICAL FEASIBILITY
SOCIAL FEASIBILITY

5.1.1 ECONOMICAL FEASIBILITY:

5.1.2 TECHNICAL FEASIBILITY

5.1.3 SOCIAL FEASIBILITY:

5.2 SYSTEM TESTING:

5.2.1 UNIT TESTING:

Description	Expected result
Test for application window properties.	All the properties of the windows are to be properly aligned and displayed.
Test for mouse operations.	All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions.

5.1.2 FUNCTIONAL TESTING:

Description	Expected result
Test for all modules.	All peers should communicate in the group.
Test for various peer in a distributed network framework as it display all users available in the group.	The result after execution should give the accurate result.

5.1. 3 NON-FUNCTIONAL TESTING:

Load testing
Performance testing
Usability testing
Reliability testing
Security testing

5.1.4 LOAD TESTING:

Description	Expected result
It is necessary to ascertain that the application behaves correctly under loads when ‘Server busy’ response is received.	Should designate another active node as a Server.

5.1.5 PERFORMANCE TESTING:

Description	Expected result
This is required to assure that an application perforce adequately, having the capability to handle many peers, delivering its results in expected time and using an acceptable level of resource and it is an aspect of operational management.	Should handle large input values, and produce accurate result in a expected time.

5.1.6 RELIABILITY TESTING:

Description	Expected result
This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in provide the application.	In case of failure of the server an alternate server should take over the job.

5.1.7 SECURITY TESTING:

Description	Expected result
Checking that the user identification is authenticated.	In case failure it should not be connected in the framework.
Check whether group keys in a tree are shared by all peers.	The peers should know group key in the same group.

5.1.8 WHITE BOX TESTING:

Description	Expected result
Exercise all logical decisions on their true and false sides.	All the logical decisions must be valid.
Execute all loops at their boundaries and within their operational bounds.	All the loops must be finite.
Exercise internal data structures to ensure their validity.	All the data structures must be valid.

5.1.9 BLACK BOX TESTING:

Description	Expected result
To check for incorrect or missing functions.	All the functions must be valid.
To check for interface errors.	The entire interface must function normally.
To check for errors in a data structures or external data base access.	The database updation and retrieval must be done.
To check for initialization and termination errors.	All the functions and data structures must be initialized properly and terminated normally.

All the above system testing strategies are carried out in as the development, documentation and institutionalization of the proposed goals and related policies is essential.

CHAPTER 6

6.0 SOFTWARE DESCRIPTION:

6.1 JAVA TECHNOLOGY:

Java technology is both a programming language and a platform.

The Java Programming Language

The Java programming language is a high-level language that can be characterized by all of the following buzzwords:

Simple
- Architecture neutral
- Object oriented
- Portable
- Distributed
- High performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure

6.2 THE JAVA PLATFORM:

The Java platform has two components:

The Java Virtual Machine (Java VM)
The Java Application Programming Interface (Java API)

You’ve already been introduced to the Java VM. It’s the base for the Java platform and is ported onto various hardware-based platforms.

The following figure depicts a program that’s running on the Java platform. As the figure shows, the Java API and the virtual machine insulate the program from the hardware.

6.3 WHAT CAN JAVA TECHNOLOGY DO?

The essentials: Objects, strings, threads, numbers, input and output, data structures, system properties, date and time, and so on.
Applets: The set of conventions used by applets.
Networking: URLs, TCP (Transmission Control Protocol), UDP (User Data gram Protocol) sockets, and IP (Internet Protocol) addresses.
Internationalization: Help for writing programs that can be localized for users worldwide. Programs can automatically adapt to specific locales and be displayed in the appropriate language.
Security: Both low level and high level, including electronic signatures, public and private key management, access control, and certificates.
Software components: Known as JavaBeans^TM, can plug into existing component architectures.
Object serialization: Allows lightweight persistence and communication via Remote Method Invocation (RMI).
Java Database Connectivity (JDBC^TM): Provides uniform access to a wide range of relational databases.

The Java platform also has APIs for 2D and 3D graphics, accessibility, servers, collaboration, telephony, speech, animation, and more. The following figure depicts what is included in the Java 2 SDK.

6.4 HOW WILL JAVA TECHNOLOGY CHANGE MY LIFE?

Get started quickly: Although the Java programming language is a powerful object-oriented language, it’s easy to learn, especially for programmers already familiar with C or C++.
Write less code: Comparisons of program metrics (class counts, method counts, and so on) suggest that a program written in the Java programming language can be four times smaller than the same program in C++.
Write better code: The Java programming language encourages good coding practices, and its garbage collection helps you avoid memory leaks. Its object orientation, its JavaBeans component architecture, and its wide-ranging, easily extendible API let you reuse other people’s tested code and introduce fewer bugs.
Develop programs more quickly: Your development time may be as much as twice as fast versus writing the same program in C++. Why? You write fewer lines of code and it is a simpler programming language than C++.
Avoid platform dependencies with 100% Pure Java: You can keep your program portable by avoiding the use of libraries written in other languages. The 100% Pure Java^TMProduct Certification Program has a repository of historical process manuals, white papers, brochures, and similar materials online.
Write once, run anywhere: Because 100% Pure Java programs are compiled into machine-independent byte codes, they run consistently on any Java platform.
Distribute software more easily: You can upgrade applets easily from a central server. Applets take advantage of the feature of allowing new classes to be loaded “on the fly,” without recompiling the entire program.

6.5 ODBC:

6.6 JDBC:

JDBC was announced in March of 1996. It was released for a 90 day public review that ended June 8, 1996. Because of user input, the final JDBC v1.0 specification was released soon after.

6.7 JDBC Goals:

The goals that were set for JDBC are important. They will give you some insight as to why certain classes and functionalities behave the way they do. The eight design goals for JDBC are as follows:

SQL Level API

SQL Conformance

JDBC must be implemental on top of common database interfaces

Provide a Java interface that is consistent with the rest of the Java system

Because of Java’s acceptance in the user community thus far, the designers feel that they should not stray from the current design of the core Java system.

Keep it simple

Use strong, static typing wherever possible

Strong typing allows for more error checking to be done at compile time; also, less error appear at runtime.

Keep the common cases simple

Finally we decided to precede the implementation using Java Networking.

And for dynamically updating the cache table we go for MS Access database.

Java ha two things: a programming language and a platform.

Java is a high-level programming language that is all of the following

Simple Architecture-neutral

Object-oriented Portable

Distributed High-performance

Interpreted Multithreaded

Robust Dynamic Secure

Compilation happens just once; interpretation occurs each time the program is executed. The figure illustrates how this works.

6.7 NETWORKING TCP/IP STACK:

The TCP/IP stack is shorter than the OSI one:

TCP is a connection-oriented protocol; UDP (User Datagram Protocol) is a connectionless protocol.

IP datagram’s:

UDP:

UDP is also connectionless and unreliable. What it adds to IP is a checksum for the contents of the datagram and port numbers. These are used to give a client/server model – see later.

TCP:

TCP supplies logic to give a reliable connection-oriented protocol above IP. It provides a virtual circuit that two processes can use to communicate.

Internet addresses

In order to use a service, you must be able to find it. The Internet uses an address scheme for machines so that they can be located. The address is a 32 bit integer which gives the IP address.

Network address:

Class A uses 8 bits for the network address with 24 bits left over for other addressing. Class B uses 16 bit network addressing. Class C uses 24 bit network addressing and class D uses all 32.

Subnet address:

Internally, the UNIX network is divided into sub networks. Building 11 is currently on one sub network and uses 10-bit addressing, allowing 1024 different hosts.

Host address:

8 bits are finally used for host addresses within our subnet. This places a limit of 256 machines that can be on the subnet.

Total address:

The 32 bit address is usually written as 4 integers separated by dots.

Port addresses

Sockets:

#include <sys/types.h>

#include <sys/socket.h>

int socket(int family, int type, int protocol);

6.8 JFREE CHART:

JFreeChart is a free 100% Java chart library that makes it easy for developers to display professional quality charts in their applications. JFreeChart’s extensive feature set includes:

A consistent and well-documented API, supporting a wide range of chart types;

A flexible design that is easy to extend, and targets both server-side and client-side applications;

Support for many output types, including Swing components, image files (including PNG and JPEG), and vector graphics file formats (including PDF, EPS and SVG);

JFreeChart is “open source” or, more specifically, free software. It is distributed under the terms of the GNU Lesser General Public Licence (LGPL), which permits use in proprietary applications.

6.8.1. Map Visualizations:

6.8.2. Time Series Chart Interactivity

6.8.3. Dashboards

6.8.4. Property Editors

CHAPTER 7

7.0 APPENDIX

7.1 SAMPLE SCREEN SHOTS:

7.2 SAMPLE SOURCE CODE:

CHAPTER 8

8.1 CONCLUSION

A3P-Social, we achieve a much higher accuracy, demonstrating that just simply considering privacy inclination is not enough, and that ”social-context” truly matters. Precisely the overall accuracy of A3P-social is above 95 percent. For 88.6 percent of the users, all predicted policies are correct, and the number of missed policies is 33 (for over 2,600 predictions). Also, we note that in this case, there is no significant difference across image types.

We compared the performance of the A3P-Social with alternative, popular, recommendation methods: Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them. In our case, the vectors are the users’ attributes defining their social profile. The algorithm using Cosine similarity scans all users profiles, computes Cosine similarity of the social contexts between the new user and the existing users. Then, it finds the top two users with the highest similarity score with the candidate user and feeds the associated images to the remaining functions in the A3P-core.

We have proposed an Adaptive Privacy Policy Prediction (A3P) system that helps users automate the privacy policy settings for their uploaded images. The A3P system provides a comprehensive framework to infer privacy preferences based on the information available for a given user. We also effectively tackled the issue of cold-start, leveraging social context information. Our experimental study proves that our A3P is a practical tool that offers significant improvements over current approaches to privacy.

Predicting Asthma-Related Emergency Department Visits Using Big Data

05/08/201902/07/2019 by admin

Asthma is one of the most prevalent and costly chronic conditions in the United States which cannot be cured. However accurate and timely surveillance data could allow for timely and targeted interventions at the community or individual level. Current national asthma disease surveillance systems can have data availability lags of up to two weeks. Rapid progress has been made in gathering non-traditional, digital information to perform disease surveillance.

We introduce a novel method of using multiple data sources for predicting the number of asthma related emergency department (ED) visits in a specific area. Twitter data, Google search interests and environmental sensor data were collected for this purpose. Our preliminary findings show that our model can predict the number of asthma ED visits based on near-real-time environmental and social media data with approximately 70% precision. The results can be helpful for public health surveillance, emergency department preparedness, and, targeted patient interventions.

1.2 INTRODUCTION:

Asthma is one of the most prevalent and costly chronic conditions in the United States, with 25 million people affected. Asthma accounts for about two million emergency department (ED) visits, half a million hospitalizations, and 3,500 deaths, and incurs more than 50 billion dollars in direct medical costs annually. Moreover, asthma is a leading cause of loss productivity with nearly 11 million missed school days and more than 14 million missed work days every year due to asthma. Although asthma cannot be cured, many of its adverse events can be prevented by appropriate medication use and avoidance of environmental triggers. The prediction of population- and individual-level risk for asthma adverse events using accurate and timely surveillance data could guide timely and targeted interventions, to reduce the societal burden of asthma. At the population level, current national asthma disease surveillance programs rely on weekly reports to the Centers for Disease Control and Prevention (CDC) of data collected from various local resources by state health departments.

Notoriously, such data have a lag-time of weeks, therefore providing retrospective information that is not amenable to proactive and timely preventive interventions. At the individual level, known predictors of asthma ED visits and hospitalizations include past acute care utilization, medication use, and sociodemographic characteristics. Common data sources for these variables include electronic medical records (EMR), medical insurance claims data, and population surveys, all of which, also, are subject to significant time lag. In an ongoing quality improvement project for asthma care, Parkland Center for Clinical Innovation (PCCI) researchers have built an asthma predictive model relying on a combination of EMR and claim data to predict the risk for asthma-related ED visits within three months of data collection [Unpublished reports from PCCI]. Although the model performance (C-statistic 72%) and prediction timeframe (three months) are satisfying, a narrower prediction timeframe potentially could provide additional risk-stratification for more efficiency and timeliness in resource deployment. For instance, resources might be prioritized to first serve patients at high risk for an asthma ED visit within 2 weeks of data collection, while being safely deferred for patients with a later predicted high-risk period.

Novel sources of timely data on population- and individual-level asthma activities are needed to provide additional temporal and geographical granularity to asthma risk stratification. Short of collecting information directly from individual patients (a time- and resource-intensive endeavor), readily available public data will have to be repurposed intelligently to provide the required information. There has been increasing interest in gathering non-traditional, digital information to perform disease surveillance. These include diverse datasets such as those stemming from social media, internet search, and environmental data. Twitter is an online social media platform that enables users to post and read 140-character messages called “tweets”. It is a popular data source for disease surveillance using social media since it can provide nearly instant access to real-time social opinions. More importantly, tweets are often tagged by geographic location and time stamps potentially providing information for disease surveillance.

Another notable non-traditional disease surveillance systemhas been a data-aggregating tool called Google Flu Trends which uses aggregated search data to estimate flu activity. Google Trends was quite successful in its estimation of influenza-like illness. It is based on Google’s search engine which tracks how often a particular search-term is entered relative to the total search-volume across a particular area. This enables access to the latest data from web search interest trends on a variety of topics, including diseases like asthma. Air pollutants are known triggers for asthma symptoms and exacerbations. The United States Environmental Protection Agency (EPA) provides access to monitored air quality data collected at outdoor sensors across the country which could be used as a data source for asthma prediction. Meanwhile, as health reform progresses, the quantity and variety of health records being made available electronically are increasing dramatically. In contrast to traditional disease surveillance systems, these new data sources have the potential to enable health organizations to respond to chronic conditions, like asthma, in real time. This in turn implies that health organizations can appropriately plan for staffing and equipment availability in a flexible manner. They can also provide early warning signals to the people at risk for asthma adverse events, and enable timely, proactive, and targeted preventive and therapeutic interventions.

1.3 LITRATURE SURVEY:

USE OF HANGEUL TWITTER TO TRACK AND PREDICT HUMAN INFLUENZA INFECTION

AUTHOR: Kim, Eui-Ki, et al.

PUBLISH: PloS one vol. 8, no.7, e69305, 2013.

EXPLANATION:

Influenza epidemics arise through the accumulation of viral genetic changes. The emergence of new virus strains coincides with a higher level of influenza-like illness (ILI), which is seen as a peak of a normal season. Monitoring the spread of an epidemic influenza in populations is a difficult and important task. Twitter is a free social networking service whose messages can improve the accuracy of forecasting models by providing early warnings of influenza outbreaks. In this study, we have examined the use of information embedded in the Hangeul Twitter stream to detect rapidly evolving public awareness or concern with respect to influenza transmission and developed regression models that can track levels of actual disease activity and predict influenza epidemics in the real world. Our prediction model using a delay mode provides not only a real-time assessment of the current influenza epidemic activity but also a significant improvement in prediction performance at the initial phase of ILI peak when prediction is of most importance.

A NEW AGE OF PUBLIC HEALTH: IDENTIFYING DISEASE OUTBREAKS BY ANALYZING TWEETS

AUTHOR: Krieck, Manuela, Johannes Dreesman, Lubomir Otrusina, and Kerstin Denecke.

PUBLISH: In Proceedings of Health Web-Science Workshop, ACM Web Science Conference. 2011.

EXPLANATION:

Traditional disease surveillance is a very time consuming reporting process. Cases of notifiable diseases are reported to the different levels in the national health care system before actions can be taken. But, early detection of disease activity followed by a rapid response is crucial to reduce the impact of epidemics. To address this challenge, alternative sources of information are investigated for disease surveillance. In this paper, the relevance of twitter messages outbreak detection is investigated from two directions. First, Twitter messages potentially related to disease outbreaks are retrospectively searched and analyzed. Second, incoming twitter messages are assessed with respect to their relevance for outbreak detection. The studies show that twitter messages can be – to a certain extent – highly relevant for early detecting hints to public health threats. According to the law on German Protection against Infection Act (Infektionsschutzgesetz (IfSG), 2001) the traditional disease surveillance relies on data from mandatory reporting of cases by physicians and laboratories. They inform local county health departments (Landkreis) which in turn report to state health departments (Land). At the end of the reporting pipeline, the national surveillance institute (Robert Koch Institute) is informed about the outbreak. It is clear that these different stages of reporting take time and delay a timely reaction.

TOWARDS DETECTING INFLUENZA EPIDEMICS BY ANALYZING TWITTER MESSAGES

AUTHOR: Culotta, Aron.

PUBLISH: In Proceedings of the first workshop on social media analytics, pp. 115-122. ACM, 2010.

EXPLANATION:

Rapid response to a health epidemic is critical to reduce loss of life. Existing methods mostly rely on expensive surveys of hospitals across the country, typically with lag times of one to two weeks for influenza reporting, and even longer for less common diseases. In response, there have been several recently proposed solutions to estimate a population’s health from Internet activity, most notably Google’s Flu Trends service, which correlates search term frequency with influenza statistics reported by the Centers for Disease Control and Prevention (CDC). In this paper, we analyze messages posted on the micro-blogging site Twitter.com to determine if a similar correlation can be uncovered. We propose several methods to identify influenza-related messages and compare a number of regression models to correlate these messages with CDC statistics. Using over 500,000 messages spanning 10 weeks, we find that our best model achieves a correlation of .78 with CDC statistics by leveraging a document classifier to identify relevant messages.

CHAPTER 2

2.0 SYSTEM ANALYSIS

2.1 EXISTING SYSTEM:

Existing methods in the increased availability of information in the Web, in the last years, a new research area has been developed, namely Infodemiology. It can be defined as the “science of distribution and determinants of information in an electronic medium, specifically the Internet, or in a population, with the ultimate aim to inform public health and public policy”. As part of this research area, several kinds of data have been studied for their applicability in the context of disease surveillance. Google flu trends exploit the search behavior to monitor the current flurelated disease activity. It could be shown by Carneiro and Mylonakis that Google Flu Trends can detect regional outbreaks of influenza 7–10 days before conventional Centers for Disease Control and Prevention surveillance systems.

Google messages and their relevance for disease outbreak detection has been reported already that especially tweets are useful to predict outbreaks such as a Norovirus outbreak at a university analysed twitter news during the influenza epidemic 2009. They compared the use of the term “H1N1” and “swine flu” over the time. Furthermore, they analysed the content of the tweets (ten content concepts) and validated twitter as a the real time content. They analysed the data via Infovigil an infosurveillance system by using an automated coding. To find out if there is a relationship between automated and manual coding, the tweets were evaluated by a Pearson´s correlation. Chew et al. found a significant correlation between both coding in seven content concept it needs to be investigated whether this source might be of relevance for detecting disease outbreaks in Germany. Therefore, only German keywords are exploited to identify Twitter messages. Further, we are not only interested in influenza-like illnesses as the studies available so far, but also in other infectious diseases (e.g. Norovirus and Salmonella).

2.1.1 DISADVANTAGES:

Existing methods have a common format:

[username]

[text] [date time client]. The length is restricted to 140 characters. In terms of linguistics, each twitter user can write as he or she likes. Thus, the variety reaches from complete sentences to listing of keywords. Hashtags, i.e. terms that are combined with a hash (e.g. #flu) denote topics and are primarily utilized by experienced users categories google according to their contents in more details, google messages can • Provide information, • Express opinions or • Report personal issues is provided, the authority of that information cannot normally not be determined, so it might be unverified information. Opinions are often expressed with humor or sarcasm and may be highly contradictive in the emotions that are expressed.

2.2 PROPOSED SYSTEM:

Our proposed methods to leverage social media, internet search, and environmental air quality data to estimate ED visits for asthma in a relatively discrete geographic area (a metropolitan area) within a relatively short time period (days) to this end, we have gathered asthma related ED visits data, social media data from Twitter, internet users’ search interests from Google and pollution sensor data from the EPA, all from the same geographic area and time period, to create a model for predicting asthma related ED visits. This work is different from extant studies that typically predict the spread of contagious diseases using social media such as Twitter. Unlike influenza or other viral diseases, asthma is a non-communicable health condition and we demonstrate the utility and value of linking big data from diverse sources in developing predictive models for non-communicable diseases with a specific focus on asthma.

Research studies have explored the use of novel data sources to propose rapid, cost-effective health status surveillance methodologies. Some of the early studies rely on document classification suggesting that Twitter data can be highly relevant for early detection of public health threats. Others employ more complex linguistic analysis, such as the Ailment Topic Aspect Model which is useful for syndrome surveillance. This type of analysis is useful for demonstrating the significance of social media as a promising new data source for health surveillance. Other recent studies have linked social media data with real world disease incidence to generate actionable knowledge useful for making health care decisions. These include which analyzed Twitter messages related to influenza and correlated them with reported CDC statistics validated Twitter as a real-time content, sentiment, and public attention trend-tracking tool. Collier employed supervised classifiers (SVM and Naive Bayes) to classify tweets into four self-reported protective behavior categories. This study adds to evidence supporting a high degree of correlation between pre-diagnostic social media signals and diagnostic influenza case data.

2.2.1 ADVANTAGES:

Our work uses a combination of data from multiple sources to predict the number of asthma-related ED visits in near real-time. In doing so, we exploit geographic information associated with each dataset. We describe the techniques to process multiple types of datasets, to extract signals from each, integrate, and feed into a prediction model using machine learning algorithms, and demonstrate the feasibility of such a prediction.

The main contributions of this work are:

• Analysis of tweets with respect to their relevance for disease surveillance,

• Content analysis and content classification of tweets,

• Linguistic analysis of disease-reporting twitter messages,

• Recommendations on search patterns for tweet search in the context of disease surveillance.

2.3 HARDWARE & SOFTWARE REQUIREMENTS:

2.3.1 HARDWARE REQUIREMENT:

v Processor – Pentium –IV

Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Key Board – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA

2.3.2 SOFTWARE REQUIREMENTS:

Operating System : Windows XP or Win7
Front End : JAVA JDK 1.7
Back End : MYSQL Server
Server : Apache Tomact Server
Script : JSP Script
Document : MS-Office 2007

CHAPTER 3

3.0 SYSTEM DESIGN:

Data Flow Diagram / Use Case Diagram / Flow Diagram:

The DFD is also called as bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, various processing carried out on these data, and the output data is generated by the system

The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components. These components are the system process, the data used by the process, an external entity that interacts with the system and the information flows in the system.

DFD shows how the information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.

DFD is also known as bubble chart. A DFD may be used to represent a system at any level of abstraction. DFD may be partitioned into levels that represent increasing information flow and functional detail.

NOTATION:

SOURCE OR DESTINATION OF DATA:

External sources or destinations, which may be people or organizations or other entities

DATA SOURCE:

Here the data referenced by a process is stored and retrieved.

PROCESS:

People, procedures or devices that produce data’s in the physical component is not identified.

DATA FLOW:

Data moves in a specific direction from an origin to a destination. The data flow is a “packet” of data.

MODELING RULES:

There are several common modeling rules when creating DFDs:

All processes must have at least one data flow in and one data flow out.
All processes should modify the incoming data, producing new forms of outgoing data.
Each data store must be involved with at least one data flow.
Each external entity must be involved with at least one data flow.
A data flow must be attached to at least one process.

3.1 ARCHITECTURE DIAGRAM:

3.2 DATAFLOW DIAGRAM:

UML DIAGRAMS:

3.2 USE CASE DIAGRAM:

3.3 CLASS DIAGRAM:

3.4 SEQUENCE DIAGRAM:

3.5 ACTIVITY DIAGRAM:

Alert Email

Filter Tweet

Asthma Tweets

New Tweet

Friend Follow

Friends list

CHAPTER 4

4.0 IMPLEMENTATION:

DISEASE CONTROL AND PREVENTION (CDC):

Current national asthma disease surveillance programs rely on weekly reports to the Centers for Disease Control and Prevention (CDC) of data collected from various local resources by state health departments [4]. Notoriously, such data have a lag-time of weeks, therefore providing retrospective information that are not amenable to proactive and timely preventive interventions. At the individual level, known predictors of asthma ED visits and hospitalizations include past acute care utilization, medication use, and sociodemographic characteristics. Common data sources for these variables include electronic medical records (EMR), medical insurance claims data, and population surveys, all of which, also, are subject to significant time lag. In an ongoing quality improvement project for asthma care, Parkland Center for Clinical Innovation (PCCI) researchers have built an asthma predictive model relying on a combination of EMR and claim data to predict the risk for asthma-related ED visits within three months of data collection.

Although the model performance (C-statistic 72%) and prediction timeframe (three months) are satisfying, a narrower prediction timeframe potentially could provide additional risk-stratification for more efficiency and timeliness in resource deployment. For instance, resources might be prioritized to first serve patients at high risk for an asthma ED visit within 2 weeks of data collection, while being safely deferred for patients with a later predicted high-risk period. Novel sources of timely data on population- and individual-level asthma activities are needed to provide additional temporal and geographical granularity to asthma risk stratification. Short of collecting information directly from individual patients (a time- and resource-intensive endeavor), readily available public data will have to be repurposed intelligently to provide the required information.

4.1 ALGORITHM:

MACHINE LEARNING ALGORITHMS:

Our research objective is to leverage social media, internet search, and environmental air quality data to estimate ED visits for asthma in a relatively discrete geographic area (a metropolitan area) within a relatively short time period (days). To this end, we have gathered asthma related ED visits data, social media data from Twitter, internet users’ search interests from Google and pollution sensor data from the EPA, all from the same geographic area and time period, to create a model for predicting asthma related ED visits. This work is different from extant studies that typically predict the spread of contagious diseases using social media such as Twitter. Unlike influenza or other viral diseases, asthma is a non-communicable health condition and we demonstrate the utility and value of linking big data from diverse sources in developing predictive models for non-communicable diseases with a specific focus on asthma.

4.2 MODULES:

EMERGENCY DEPARTMENT VISITS:

ENVIRONMENTAL SENSORS (EMR):

OUR PREDICTION SENSOR DATA:

ASTHMA PREDICTION RESULTS:

4.3 MODULE DESCRIPTION:

EMERGENCY DEPARTMENT VISITS:

We introduce a novel method of using multiple data sources for predicting the number of asthma related emergency department (ED) visits in a specific area. Twitter data, Google search interests and environmental sensor data were collected for this purpose. Moreover, asthma is a leading cause of loss productivity with nearly 11 million missed school days and more than 14 million missed work days every year due to asthma. Although asthma cannot be cured, many of its adverse events can be prevented by appropriate medication use and avoidance of environmental triggers.

The prediction of population- and individual-level risk for asthma adverse events using accurate and timely surveillance data could guide timely and targeted interventions, to reduce the societal burden of asthma. At the population level, current national asthma disease surveillance programs rely on weekly reports to the Centers for Disease Control and Prevention (CDC) of data collected from various local resources by state health departments. Notoriously, such data have a lag-time of weeks, therefore providing retrospective information that is not amenable to proactive and timely preventive interventions. At the individual level, known predictors of asthma ED visits and hospitalizations include past acute care utilization, medication use, and sociodemographic characteristics.

ENVIRONMENTAL SENSORS (EMR):

Common data sources for these variables include electronic medical records (EMR), medical insurance claims data, and population surveys, all of which, also, are subject to significant time lag. In an ongoing quality improvement project for asthma care, Parkland Center for Clinical Innovation (PCCI) researchers have built an asthma predictive model relying on a combination of EMR and claim data to predict the risk for asthma-related ED visits within three months of data collection [Unpublished reports from PCCI]. Although the model performance (C-statistic 72%) and prediction timeframe (three months) are satisfying, a narrower prediction timeframe potentially could provide additional risk-stratification for more efficiency and timeliness in resource deployment.

For instance, resources might be prioritized to first serve patients at high risk for an asthma ED visit within 2 weeks of data collection, while being safely deferred for patients with a later predicted high-risk period. Novel sources of timely data on population- and individual-level asthma activities are needed to provide additional temporal and geographical granularity to asthma risk stratification. Short of collecting information directly from individual patients (a time- and resource-intensive endeavor), readily available public data will have to be repurposed intelligently to provide the required information. There has been increasing interest in gathering non-traditional, digital information to perform disease surveillance. These include diverse datasets such as those stemming from social media, internet search, and environmental data. Twitter is an online social media platform that enables users to post and read 140-character messages called “tweets”. It is a popular data source for disease surveillance using social media since it can provide nearly instant access to real-time social opinions. More importantly, tweets are often tagged by geographic location and time stamps potentially providing information for disease surveillance.

OUR PREDICTION SENSOR DATA:

We first analyzed the association between the asthma-related ED visits and data from Twitter, Google trends, and Air Quality sensors, using the Pearson correlation coefficient. We also examined the association between asthma-related tweet counts and ED visit counts for abdominal pain/constipation patients, to control for non-asthma-specific variations in ED visit counts. Then, we designed and implemented a prediction model to estimate the incidence of asthma ED visits at CMC using a combination of independent variables from the above data sources.

Twitter offers streaming APIs to give developers and researchers low latency access to its global stream of data. Public streams, which can provide access to the public data flowing through Twitter, were used in this study. Studies have estimated that using Twitter’s Streaming API, researchers can expect to receive 1% of the tweets in near real-time. Twitter4j, an unofficial Java library for the Twitter API, was used to access tweet information from the Twitter Streaming API.

Two different Twitter data sets were collected in this study:

(1) The general twitter stream: a simple collection of JSON grabbed from the general Twitter stream. The general tweet counts were used to estimate the Twitter population and further normalize asthma tweet counts.

(2) The asthma-related stream: to collect only tweets containing any of 19 related keywords that were suggested by our clinical collaborators from PCCI. The asthma stream is limited to 1% of full tweets as well.

ASTHMA PREDICTION RESULTS:

Our results from the correlation analysis, asthma tweets, CO, NO2 and PM2.5 were selected as inputs into our prediction model. We are only reporting results for the Decision Tree and Artificial Neural Networks (ANN) techniques, as the Naive Bayes and SVM techniques did not yield good prediction results. First, backward feature selection algorithm was used to examine if the addition of Twitter data would improve the prediction. As shown in Table VI, combining air quality data with tweets resulted in higher prediction accuracy. Additionally, we evaluated prediction precision. Given that our prediction task is for a three-way classification, each technique resulted in different prediction and/or precision in different classes (Table VII). Decision Tree performed well in predicting the “High” class, while ANN, after Adaptive Boosting, worked well for the “Low” class. Stacking the two techniques performed well for the “Medium” class.

The results of our analysis are promising because they perform with a fairly high level of accuracy overall. As noted in the introduction, traditional asthma ED visit models are useful for predicting events in a three month window and have an accuracy of approximately 70%. It is to be noted that “traditional models” estimate a risk score for asthma ED visit for each individual patient, whereas our “Twitter/ Environmental data model” predicts the risk for a daily number of ED visits being High, Low, or Medium. The former is an individual-level risk model, while the latter is a population-level risk model. Our population-level asthma risk prediction model has the potential for complementing current individual-level models, and may lead to a shorter time window and better accuracy of prediction. This in turn has implications for better planning and proactive treatment in specific geo-locations at specific time periods.

CHAPTER 5

5.0 SYSTEM STUDY:

5.1 FEASIBILITY STUDY:

Three key considerations involved in the feasibility analysis are

ECONOMICAL FEASIBILITY
TECHNICAL FEASIBILITY
SOCIAL FEASIBILITY

5.1.1 ECONOMICAL FEASIBILITY:

5.1.2 TECHNICAL FEASIBILITY

5.1.3 SOCIAL FEASIBILITY:

5.2 SYSTEM TESTING:

5.2.1 UNIT TESTING:

Description	Expected result
Test for application window properties.	All the properties of the windows are to be properly aligned and displayed.
Test for mouse operations.	All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions.

5.1.2 FUNCTIONAL TESTING:

Description	Expected result
Test for all modules.	All peers should communicate in the group.
Test for various peer in a distributed network framework as it display all users available in the group.	The result after execution should give the accurate result.

5.1. 3 NON-FUNCTIONAL TESTING:

Load testing
Performance testing
Usability testing
Reliability testing
Security testing

5.1.4 LOAD TESTING:

Description	Expected result
It is necessary to ascertain that the application behaves correctly under loads when ‘Server busy’ response is received.	Should designate another active node as a Server.

5.1.5 PERFORMANCE TESTING:

Description	Expected result
This is required to assure that an application perforce adequately, having the capability to handle many peers, delivering its results in expected time and using an acceptable level of resource and it is an aspect of operational management.	Should handle large input values, and produce accurate result in a expected time.

5.1.6 RELIABILITY TESTING:

Description	Expected result
This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in provide the application.	In case of failure of the server an alternate server should take over the job.

5.1.7 SECURITY TESTING:

Description	Expected result
Checking that the user identification is authenticated.	In case failure it should not be connected in the framework.
Check whether group keys in a tree are shared by all peers.	The peers should know group key in the same group.

5.1.8 WHITE BOX TESTING:

Description	Expected result
Exercise all logical decisions on their true and false sides.	All the logical decisions must be valid.
Execute all loops at their boundaries and within their operational bounds.	All the loops must be finite.
Exercise internal data structures to ensure their validity.	All the data structures must be valid.

5.1.9 BLACK BOX TESTING:

Description	Expected result
To check for incorrect or missing functions.	All the functions must be valid.
To check for interface errors.	The entire interface must function normally.
To check for errors in a data structures or external data base access.	The database updation and retrieval must be done.
To check for initialization and termination errors.	All the functions and data structures must be initialized properly and terminated normally.

All the above system testing strategies are carried out in as the development, documentation and institutionalization of the proposed goals and related policies is essential.

CHAPTER 6

6.0 SOFTWARE DESCRIPTION:

6.1 JAVA TECHNOLOGY:

Java technology is both a programming language and a platform.

The Java Programming Language

The Java programming language is a high-level language that can be characterized by all of the following buzzwords:

Simple
- Architecture neutral
- Object oriented
- Portable
- Distributed
- High performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure

6.2 THE JAVA PLATFORM:

The Java platform has two components:

The Java Virtual Machine (Java VM)
The Java Application Programming Interface (Java API)

You’ve already been introduced to the Java VM. It’s the base for the Java platform and is ported onto various hardware-based platforms.

The following figure depicts a program that’s running on the Java platform. As the figure shows, the Java API and the virtual machine insulate the program from the hardware.

6.3 WHAT CAN JAVA TECHNOLOGY DO?

The essentials: Objects, strings, threads, numbers, input and output, data structures, system properties, date and time, and so on.
Applets: The set of conventions used by applets.
Networking: URLs, TCP (Transmission Control Protocol), UDP (User Data gram Protocol) sockets, and IP (Internet Protocol) addresses.
Internationalization: Help for writing programs that can be localized for users worldwide. Programs can automatically adapt to specific locales and be displayed in the appropriate language.
Security: Both low level and high level, including electronic signatures, public and private key management, access control, and certificates.
Software components: Known as JavaBeans^TM, can plug into existing component architectures.
Object serialization: Allows lightweight persistence and communication via Remote Method Invocation (RMI).
Java Database Connectivity (JDBC^TM): Provides uniform access to a wide range of relational databases.

The Java platform also has APIs for 2D and 3D graphics, accessibility, servers, collaboration, telephony, speech, animation, and more. The following figure depicts what is included in the Java 2 SDK.

6.4 HOW WILL JAVA TECHNOLOGY CHANGE MY LIFE?

Get started quickly: Although the Java programming language is a powerful object-oriented language, it’s easy to learn, especially for programmers already familiar with C or C++.
Write less code: Comparisons of program metrics (class counts, method counts, and so on) suggest that a program written in the Java programming language can be four times smaller than the same program in C++.
Write better code: The Java programming language encourages good coding practices, and its garbage collection helps you avoid memory leaks. Its object orientation, its JavaBeans component architecture, and its wide-ranging, easily extendible API let you reuse other people’s tested code and introduce fewer bugs.
Develop programs more quickly: Your development time may be as much as twice as fast versus writing the same program in C++. Why? You write fewer lines of code and it is a simpler programming language than C++.
Avoid platform dependencies with 100% Pure Java: You can keep your program portable by avoiding the use of libraries written in other languages. The 100% Pure Java^TMProduct Certification Program has a repository of historical process manuals, white papers, brochures, and similar materials online.
Write once, run anywhere: Because 100% Pure Java programs are compiled into machine-independent byte codes, they run consistently on any Java platform.
Distribute software more easily: You can upgrade applets easily from a central server. Applets take advantage of the feature of allowing new classes to be loaded “on the fly,” without recompiling the entire program.

6.5 ODBC:

6.6 JDBC:

JDBC was announced in March of 1996. It was released for a 90 day public review that ended June 8, 1996. Because of user input, the final JDBC v1.0 specification was released soon after.

6.7 JDBC Goals:

The goals that were set for JDBC are important. They will give you some insight as to why certain classes and functionalities behave the way they do. The eight design goals for JDBC are as follows:

SQL Level API

SQL Conformance

JDBC must be implemental on top of common database interfaces

Provide a Java interface that is consistent with the rest of the Java system

Because of Java’s acceptance in the user community thus far, the designers feel that they should not stray from the current design of the core Java system.

Keep it simple

Use strong, static typing wherever possible

Strong typing allows for more error checking to be done at compile time; also, less error appear at runtime.

Keep the common cases simple

Finally we decided to precede the implementation using Java Networking.

And for dynamically updating the cache table we go for MS Access database.

Java ha two things: a programming language and a platform.

Java is a high-level programming language that is all of the following

Simple Architecture-neutral

Object-oriented Portable

Distributed High-performance

Interpreted Multithreaded

Robust Dynamic Secure

Compilation happens just once; interpretation occurs each time the program is executed. The figure illustrates how this works.

6.7 NETWORKING TCP/IP STACK:

The TCP/IP stack is shorter than the OSI one:

TCP is a connection-oriented protocol; UDP (User Datagram Protocol) is a connectionless protocol.

IP datagram’s:

UDP:

UDP is also connectionless and unreliable. What it adds to IP is a checksum for the contents of the datagram and port numbers. These are used to give a client/server model – see later.

TCP:

TCP supplies logic to give a reliable connection-oriented protocol above IP. It provides a virtual circuit that two processes can use to communicate.

Internet addresses

In order to use a service, you must be able to find it. The Internet uses an address scheme for machines so that they can be located. The address is a 32 bit integer which gives the IP address.

Network address:

Class A uses 8 bits for the network address with 24 bits left over for other addressing. Class B uses 16 bit network addressing. Class C uses 24 bit network addressing and class D uses all 32.

Subnet address:

Internally, the UNIX network is divided into sub networks. Building 11 is currently on one sub network and uses 10-bit addressing, allowing 1024 different hosts.

Host address:

8 bits are finally used for host addresses within our subnet. This places a limit of 256 machines that can be on the subnet.

Total address:

The 32 bit address is usually written as 4 integers separated by dots.

Port addresses

Sockets:

#include <sys/types.h>

#include <sys/socket.h>

int socket(int family, int type, int protocol);

6.8 JFREE CHART:

JFreeChart is a free 100% Java chart library that makes it easy for developers to display professional quality charts in their applications. JFreeChart’s extensive feature set includes:

A consistent and well-documented API, supporting a wide range of chart types;

A flexible design that is easy to extend, and targets both server-side and client-side applications;

Support for many output types, including Swing components, image files (including PNG and JPEG), and vector graphics file formats (including PDF, EPS and SVG);

JFreeChart is “open source” or, more specifically, free software. It is distributed under the terms of the GNU Lesser General Public Licence (LGPL), which permits use in proprietary applications.

6.8.1. Map Visualizations:

6.8.2. Time Series Chart Interactivity

6.8.3. Dashboards

6.8.4. Property Editors

CHAPTER 7

7.0 APPENDIX

7.1 SAMPLE SCREEN SHOTS:

7.2 SAMPLE SOURCE CODE:

CHAPTER 8

8.1 CONCLUSION AND FUTURE:

In this study, we have provided preliminary evidence that social media and environmental data can be leveraged to accurately predict asthma ED visits at a population level. We are in the process of confirming these preliminary findings by collecting larger clinical datasets across different seasons and multiple hospitals. Our continued work is focused on extending this research to propose a temporal prediction model that analyzes the trends in tweets and air quality index changes, and estimates the time lag between these changes and the number of asthma ED visits.

We also are collecting air quality index data over a longer time period to examine the effects of seasonal variations. In addition, we would like to explore the effect of relevant data from other types of social media interactions, e.g. blogs and discussion forums, on our asthma visit prediction model. Additional studies are needed to examine how combining real-time or near-real-time social media and environmental data with more traditional data might affect the performance and timing of current individual-level prediction models for asthma, and eventually, for other chronic conditions. In future projects, we intend to extend our work to diseases with geographical and temporal variability, e.g., COPD and diabetes.

Performing Initiative Data Prefetching in Distributed File Systems for Cloud Computing

05/08/201902/07/2019 by admin

1.2 INTRODUCTION

1.3 LITRATURE SURVEY

PARTIAL REPLICATION OF METADATA TO ACHIEVE HIGH METADATA AVAILABILITY IN PARALLEL FILE SYSTEMS

AUTHOR: J. Liao, Y. Ishikawa

PUBLISH: In the Proceedings of 41st International Conference on Parallel Processing (ICPP ’12), pp. 168–177, 2012.

EXPLANATION:

EVALUATING PERFORMANCE AND ENERGY IN FILE SYSTEM SERVER WORKLOADS

AUTHOR: P. Sehgal, V. Tarasov, E. Zadok

PUBLISH: the 8th USENIX Conference on File and Storage Technologies (FAST ’10), pp.253-266, 2010.

EXPLANATION:

FLEXIBLE, WIDEAREA STORAGE FOR DISTRIBUTED SYSTEMS WITH WHEELFS

AUTHOR: J. Stribling, Y. Sovran, I. Zhang and R. Morris et al

PUBLISH: In Proceedings of the 6th USENIX symposium on Networked systems design and implementation (NSDI’09), USENIX Association, pp. 43–58, 2009.

EXPLANATION:

CHAPTER 2

2.0 SYSTEM ANALYSIS

2.1 EXISTING SYSTEM:

2.1.1 DISADVANTAGES:

Network delay in numerically and geographically remote file system access

Mobile devices generally have limited processing power, battery life and storage

2.2 PROPOSED SYSTEM:

This paper makes the following two contributions:

2.2.1 ADVANTAGES:

The applications that have intensive read workloads can automatically yield not only better use of available bandwidth.

Less file operations via batched I/O requests through prefetching

Cloud computing offers an illusion of infinite computing resources

2.3 HARDWARE & SOFTWARE REQUIREMENTS:

2.3.1 HARDWARE REQUIREMENT:

v Processor – Pentium –IV

Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Key Board – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA

2.3.2 SOFTWARE REQUIREMENTS:

JAVA

Operating System : Windows XP or Win7
Front End : JAVA JDK 1.7
Script : Java Script
Document : MS-Office 2007

CHAPTER 3

3.0 SYSTEM DESIGN:

Data Flow Diagram / Use Case Diagram / Flow Diagram:

The DFD is also called as bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, various processing carried out on these data, and the output data is generated by the system

The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components. These components are the system process, the data used by the process, an external entity that interacts with the system and the information flows in the system.

DFD shows how the information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.

DFD is also known as bubble chart. A DFD may be used to represent a system at any level of abstraction. DFD may be partitioned into levels that represent increasing information flow and functional detail.

NOTATION:

SOURCE OR DESTINATION OF DATA:

External sources or destinations, which may be people or organizations or other entities

DATA SOURCE:

Here the data referenced by a process is stored and retrieved.

PROCESS:

People, procedures or devices that produce data’s in the physical component is not identified.

DATA FLOW:

Data moves in a specific direction from an origin to a destination. The data flow is a “packet” of data.

MODELING RULES:

There are several common modeling rules when creating DFDs:

All processes must have at least one data flow in and one data flow out.
All processes should modify the incoming data, producing new forms of outgoing data.
Each data store must be involved with at least one data flow.
Each external entity must be involved with at least one data flow.
A data flow must be attached to at least one process.

3.1 ARCHITECTURE DIAGRAM

3.2 DATAFLOW DIAGRAM

UML DIAGRAMS:

3.2 USE CASE DIAGRAM:

3.3 CLASS DIAGRAM:

3.4 SEQUENCE DIAGRAM:

3.5 ACTIVITY DIAGRAM:

CHAPTER 4

4.0 IMPLEMENTATION:

I/O ACCESS PREDICTION

4.1 ALGORITHM

MARKOV MODEL PREDICTION ALGORITHM

LINEAR PREDICTION ALGORITHM

4.2 MODULES:

SERVER CLIENT MODULE:

DISTRIBUTED FILE SYSTEMS:

INITIATIVE DATA PREFETCHING:

ANALYSIS OF PREDICTIONS:

4.3 MODULE DESCRIPTION:

SERVER CLIENT MODULE:

DISTRIBUTED FILE SYSTEMS:

INITIATIVE DATA PREFETCHING:

ANALYSIS OF PREDICTIONS:

CHAPTER 5

5.0 SYSTEM STUDY:

5.1 FEASIBILITY STUDY:

Three key considerations involved in the feasibility analysis are

ECONOMICAL FEASIBILITY
TECHNICAL FEASIBILITY
SOCIAL FEASIBILITY

5.1.1 ECONOMICAL FEASIBILITY:

5.1.2 TECHNICAL FEASIBILITY

5.1.3 SOCIAL FEASIBILITY:

5.2 SYSTEM TESTING:

5.2.1 UNIT TESTING:

Description	Expected result
Test for application window properties.	All the properties of the windows are to be properly aligned and displayed.
Test for mouse operations.	All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions.

5.1.2 FUNCTIONAL TESTING:

Description	Expected result
Test for all modules.	All peers should communicate in the group.
Test for various peer in a distributed network framework as it display all users available in the group.	The result after execution should give the accurate result.

5.1. 3 NON-FUNCTIONAL TESTING:

Load testing
Performance testing
Usability testing
Reliability testing
Security testing

5.1.4 LOAD TESTING:

Description	Expected result
It is necessary to ascertain that the application behaves correctly under loads when ‘Server busy’ response is received.	Should designate another active node as a Server.

5.1.5 PERFORMANCE TESTING:

Description	Expected result
This is required to assure that an application perforce adequately, having the capability to handle many peers, delivering its results in expected time and using an acceptable level of resource and it is an aspect of operational management.	Should handle large input values, and produce accurate result in a expected time.

5.1.6 RELIABILITY TESTING:

Description	Expected result
This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in provide the application.	In case of failure of the server an alternate server should take over the job.

5.1.7 SECURITY TESTING:

Description	Expected result
Checking that the user identification is authenticated.	In case failure it should not be connected in the framework.
Check whether group keys in a tree are shared by all peers.	The peers should know group key in the same group.

5.1.8 WHITE BOX TESTING:

Description	Expected result
Exercise all logical decisions on their true and false sides.	All the logical decisions must be valid.
Execute all loops at their boundaries and within their operational bounds.	All the loops must be finite.
Exercise internal data structures to ensure their validity.	All the data structures must be valid.

5.1.9 BLACK BOX TESTING:

Description	Expected result
To check for incorrect or missing functions.	All the functions must be valid.
To check for interface errors.	The entire interface must function normally.
To check for errors in a data structures or external data base access.	The database updation and retrieval must be done.
To check for initialization and termination errors.	All the functions and data structures must be initialized properly and terminated normally.

All the above system testing strategies are carried out in as the development, documentation and institutionalization of the proposed goals and related policies is essential.

CHAPTER 6

6.0 SOFTWARE DESCRIPTION:

6.1 JAVA TECHNOLOGY:

Java technology is both a programming language and a platform.

The Java Programming Language

The Java programming language is a high-level language that can be characterized by all of the following buzzwords:

Simple
- Architecture neutral
- Object oriented
- Portable
- Distributed
- High performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure

6.2 THE JAVA PLATFORM:

The Java platform has two components:

The Java Virtual Machine (Java VM)
The Java Application Programming Interface (Java API)

You’ve already been introduced to the Java VM. It’s the base for the Java platform and is ported onto various hardware-based platforms.

The following figure depicts a program that’s running on the Java platform. As the figure shows, the Java API and the virtual machine insulate the program from the hardware.

6.3 WHAT CAN JAVA TECHNOLOGY DO?

The essentials: Objects, strings, threads, numbers, input and output, data structures, system properties, date and time, and so on.
Applets: The set of conventions used by applets.
Networking: URLs, TCP (Transmission Control Protocol), UDP (User Data gram Protocol) sockets, and IP (Internet Protocol) addresses.
Internationalization: Help for writing programs that can be localized for users worldwide. Programs can automatically adapt to specific locales and be displayed in the appropriate language.
Security: Both low level and high level, including electronic signatures, public and private key management, access control, and certificates.
Software components: Known as JavaBeans^TM, can plug into existing component architectures.
Object serialization: Allows lightweight persistence and communication via Remote Method Invocation (RMI).
Java Database Connectivity (JDBC^TM): Provides uniform access to a wide range of relational databases.

The Java platform also has APIs for 2D and 3D graphics, accessibility, servers, collaboration, telephony, speech, animation, and more. The following figure depicts what is included in the Java 2 SDK.

6.4 HOW WILL JAVA TECHNOLOGY CHANGE MY LIFE?

Get started quickly: Although the Java programming language is a powerful object-oriented language, it’s easy to learn, especially for programmers already familiar with C or C++.
Write less code: Comparisons of program metrics (class counts, method counts, and so on) suggest that a program written in the Java programming language can be four times smaller than the same program in C++.
Write better code: The Java programming language encourages good coding practices, and its garbage collection helps you avoid memory leaks. Its object orientation, its JavaBeans component architecture, and its wide-ranging, easily extendible API let you reuse other people’s tested code and introduce fewer bugs.
Develop programs more quickly: Your development time may be as much as twice as fast versus writing the same program in C++. Why? You write fewer lines of code and it is a simpler programming language than C++.
Avoid platform dependencies with 100% Pure Java: You can keep your program portable by avoiding the use of libraries written in other languages. The 100% Pure Java^TMProduct Certification Program has a repository of historical process manuals, white papers, brochures, and similar materials online.
Write once, run anywhere: Because 100% Pure Java programs are compiled into machine-independent byte codes, they run consistently on any Java platform.
Distribute software more easily: You can upgrade applets easily from a central server. Applets take advantage of the feature of allowing new classes to be loaded “on the fly,” without recompiling the entire program.

6.5 ODBC:

6.6 JDBC:

JDBC was announced in March of 1996. It was released for a 90 day public review that ended June 8, 1996. Because of user input, the final JDBC v1.0 specification was released soon after.

6.7 JDBC Goals:

The goals that were set for JDBC are important. They will give you some insight as to why certain classes and functionalities behave the way they do. The eight design goals for JDBC are as follows:

SQL Level API

SQL Conformance

JDBC must be implemental on top of common database interfaces

Provide a Java interface that is consistent with the rest of the Java system

Because of Java’s acceptance in the user community thus far, the designers feel that they should not stray from the current design of the core Java system.

Keep it simple

Use strong, static typing wherever possible

Strong typing allows for more error checking to be done at compile time; also, less error appear at runtime.

Keep the common cases simple

Finally we decided to precede the implementation using Java Networking.

And for dynamically updating the cache table we go for MS Access database.

Java ha two things: a programming language and a platform.

Java is a high-level programming language that is all of the following

Simple Architecture-neutral

Object-oriented Portable

Distributed High-performance

Interpreted Multithreaded

Robust Dynamic Secure

Compilation happens just once; interpretation occurs each time the program is executed. The figure illustrates how this works.

6.7 NETWORKING TCP/IP STACK:

The TCP/IP stack is shorter than the OSI one:

TCP is a connection-oriented protocol; UDP (User Datagram Protocol) is a connectionless protocol.

IP datagram’s:

UDP:

UDP is also connectionless and unreliable. What it adds to IP is a checksum for the contents of the datagram and port numbers. These are used to give a client/server model – see later.

TCP:

TCP supplies logic to give a reliable connection-oriented protocol above IP. It provides a virtual circuit that two processes can use to communicate.

Internet addresses

In order to use a service, you must be able to find it. The Internet uses an address scheme for machines so that they can be located. The address is a 32 bit integer which gives the IP address.

Network address:

Class A uses 8 bits for the network address with 24 bits left over for other addressing. Class B uses 16 bit network addressing. Class C uses 24 bit network addressing and class D uses all 32.

Subnet address:

Internally, the UNIX network is divided into sub networks. Building 11 is currently on one sub network and uses 10-bit addressing, allowing 1024 different hosts.

Host address:

8 bits are finally used for host addresses within our subnet. This places a limit of 256 machines that can be on the subnet.

Total address:

The 32 bit address is usually written as 4 integers separated by dots.

Port addresses

Sockets:

#include <sys/types.h>

#include <sys/socket.h>

int socket(int family, int type, int protocol);

6.8 JFREE CHART:

JFreeChart is a free 100% Java chart library that makes it easy for developers to display professional quality charts in their applications. JFreeChart’s extensive feature set includes:

A consistent and well-documented API, supporting a wide range of chart types;

A flexible design that is easy to extend, and targets both server-side and client-side applications;

Support for many output types, including Swing components, image files (including PNG and JPEG), and vector graphics file formats (including PDF, EPS and SVG);

JFreeChart is “open source” or, more specifically, free software. It is distributed under the terms of the GNU Lesser General Public Licence (LGPL), which permits use in proprietary applications.

6.8.1. Map Visualizations:

6.8.2. Time Series Chart Interactivity

6.8.3. Dashboards

6.8.4. Property Editors

CHAPTER 7

7.0 APPENDIX

7.1 SAMPLE SCREEN SHOTS:

7.2 SAMPLE SOURCE CODE:

CHAPTER 8

8.1 CONCLUSION AND FUTURE WORK:

Passive IP Traceback Disclosing the Locations of IP Spoofers From Path Backscatter

05/08/201902/07/2019 by admin

It is long known attackers may use forged source IP address to conceal their real locations. To capture the spoofers, a number of IP traceback mechanisms have been proposed. However, due to the challenges of deployment, there has been not a widely adopted IP traceback solution, at least at the Internet level. As a result, the mist on the locations of spoofers has never been dissipated till now.

This paper proposes passive IP traceback (PIT) that bypasses the deployment difficulties of IP traceback techniques. PIT investigates Internet Control Message Protocol error messages (named path backscatter) triggered by spoofing traffic, and tracks the spoofers based on public available information (e.g., topology). In this way, PIT can find the spoofers without any deployment requirement.

This paper illustrates the causes, collection, and the statistical results on path backscatter, demonstrates the processes and effectiveness of PIT, and shows the captured locations of spoofers through applying PIT on the path backscatter data set.

These results can help further reveal IP spoofing, which has been studied for long but never well understood. Though PIT cannot work in all the spoofing attacks, it may be the most useful mechanism to trace spoofers before an Internet-level traceback system has been deployed in real.

1.2 INTRODUCTION

IP spoofing, which means attackers launching attacks with forged source IP addresses, has been recognized as a serious security problem on the Internet for long. By using addresses that are assigned to others or not assigned at all, attackers can avoid exposing their real locations, or enhance the effect of attacking, or launch reflection based attacks. A number of notorious attacks rely on IP spoofing, including SYN flooding, SMURF, DNS amplification etc. A DNS amplification attack which severely degraded the service of a Top Level Domain (TLD) name server is reported in though there has been a popular conventional wisdom that DoS attacks are launched from botnets and spoofing is no longer critical, the report of ARBOR on NANOG 50th meeting shows spoofing is still significant in observed DoS attacks. Indeed, based on the captured backscatter messages from UCSD Network Telescopes, spoofing activities are still frequently observed.

To capture the origins of IP spoofing traffic is of great importance. As long as the real locations of spoofers are not disclosed, they cannot be deterred from launching further attacks. Even just approaching the spoofers, for example, determining the ASes or networks they reside in, attackers can be located in a smaller area, and filters can be placed closer to the attacker before attacking traffic get aggregated. The last but not the least, identifying the origins of spoofing traffic can help build a reputation system for ASes, which would be helpful to push the corresponding ISPs to verify IP source address.

Instead of proposing another IP traceback mechanism with improved tracking capability, we propose a novel solution, named Passive IP Traceback (PIT), to bypass the challenges in deployment. Routers may fail to forward an IP spoofing packet due to various reasons, e.g., TTL exceeding. In such cases, the routers may generate an ICMP error message (named path backscatter) and send the message to the spoofed source address. Because the routers can be close to the spoofers, the path backscatter messages may potentially disclose the locations of the spoofers. PIT exploits these path backscatter messages to find the location of the spoofers. With the locations of the spoofers known, the victim can seek help from the corresponding ISP to filter out the attacking packets, or take other counterattacks. PIT is especially useful for the victims in reflection based spoofing attacks, e.g., DNS amplification attacks. The victims can find the locations of the spoofers directly from the attacking traffic.

In this article, at first we illustrate the generation, types, collection, and the security issues of path backscatter messages in section III. Then in section IV, we present PIT, which tracks the location of the spoofers based on path backscatter messages together with the topology and routing information. We discuss how to apply PIT when both topology and routing are known, or only topology is known, or neither are known respectively. We also present two effective algorithms to apply PIT in large scale networks. In the following section, at first we show the statistical results on path backscatter messages. Then we evaluate the two key mechanisms of PIT which work without routing information. At last, we give the tracking result when applying PIT on the path backscatter message dataset: a number of ASes in which spoofers are found.

Our work has the following contributions:

1) This is the first article known which deeply investigates path backscatter messages. These messages are valuable to help understand spoofing activities. Though Moore et al. [8] has exploited backscatter messages, which are generated by the targets of spoofing messages, to study Denial of Services (DoS), path backscatter messages, which are sent by intermediate devices rather than the targets, have not been used in traceback. 2) A practical and effective IP traceback solution based on path backscatter messages, i.e., PIT, is proposed. PIT bypasses the deployment difficulties of existing IP traceback mechanisms and actually is already in force. Though given the limitation that path backscatter messages are not generated with stable possibility, PIT cannot work in all the attacks, but it does work in a number of spoofing activities. At least it may be the most useful traceback mechanism before an AS-level traceback system has been deployed in real. 3) Through applying PIT on the path backscatter dataset, a number of locations of spoofers are captured and presented. Though this is not a complete list, it is the first known list disclosing the locations of spoofers.

1.3 LITRATURE SURVEY

DEFENSE AGAINST SPOOFED IP TRAFFIC USING HOP-COUNT FILTERING

PUBLICATION: IEEE/ACM Trans. Netw., vol. 15, no. 1, pp. 40–53, Feb. 2007.

AUTHORS: H. Wang, C. Jin, and K. G. Shin

EXPLANATION:

IP spoofing has often been exploited by Distributed Denial of Service (DDoS) attacks to: 1)conceal flooding sources and dilute localities in flooding traffic, and 2)coax legitimate hosts into becoming reflectors, redirecting and amplifying flooding traffic. Thus, the ability to filter spoofed IP packets near victim servers is essential to their own protection and prevention of becoming involuntary DoS reflectors. Although an attacker can forge any field in the IP header, he cannot falsify the number of hops an IP packet takes to reach its destination. More importantly, since the hop-count values are diverse, an attacker cannot randomly spoof IP addresses while maintaining consistent hop-counts. On the other hand, an Internet server can easily infer the hop-count information from the Time-to-Live (TTL) field of the IP header. Using a mapping between IP addresses and their hop-counts, the server can distinguish spoofed IP packets from legitimate ones. Based on this observation, we present a novel filtering technique, called Hop-Count Filtering (HCF)-which builds an accurate IP-to-hop-count (IP2HC) mapping table-to detect and discard spoofed IP packets. HCF is easy to deploy, as it does not require any support from the underlying network. Through analysis using network measurement data, we show that HCF can identify close to 90% of spoofed IP packets, and then discard them with little collateral damage. We implement and evaluate HCF in the Linux kernel, demonstrating its effectiveness with experimental measurements

DYNAMIC PROBABILISTIC PACKET MARKING FOR EFFICIENT IP TRACEBACK

PUBLICATION: Comput. Netw., vol. 51, no. 3, pp. 866–882, 2007.

AUTHORS: J. Liu, Z.-J. Lee, and Y.-C. Chung

EXPLANATION:

Recently, denial-of-service (DoS) attack has become a pressing problem due to the lack of an efficient method to locate the real attackers and ease of launching an attack with readily available source codes on the Internet. Traceback is a subtle scheme to tackle DoS attacks. Probabilistic packet marking (PPM) is a new way for practical IP traceback. Although PPM enables a victim to pinpoint the attacker’s origin to within 2–5 equally possible sites, it has been shown that PPM suffers from uncertainty under spoofed marking attack. Furthermore, the uncertainty factor can be amplified significantly under distributed DoS attack, which may diminish the effectiveness of PPM. In this work, we present a new approach, called dynamic probabilistic packet marking (DPPM), to further improve the effectiveness of PPM. Instead of using a fixed marking probability, we propose to deduce the traveling distance of a packet and then choose a proper marking probability. DPPM may completely remove uncertainty and enable victims to precisely pinpoint the attacking origin even under spoofed marking DoS attacks. DPPM supports incremental deployment. Formal analysis indicates that DPPM outperforms PPM in most aspects.

FLEXIBLE DETERMINISTIC PACKET MARKING: AN IP TRACEBACK SYSTEM TO FIND THE REAL SOURCE OF ATTACKS

PUBLICATION: EEE Trans. Parallel Distrib. Syst., vol. 20, no. 4, pp. 567–580, Apr. 2009.

AUTHORS: Y. Xiang, W. Zhou, and M. Guo

EXPLANATION:

IP traceback is the enabling technology to control Internet crime. In this paper we present a novel and practical IP traceback system called Flexible Deterministic Packet Marking (FDPM) which provides a defense system with the ability to find out the real sources of attacking packets that traverse through the network. While a number of other traceback schemes exist, FDPM provides innovative features to trace the source of IP packets and can obtain better tracing capability than others. In particular, FDPM adopts a flexible mark length strategy to make it compatible to different network environments; it also adaptively changes its marking rate according to the load of the participating router by a flexible flow-based marking scheme. Evaluations on both simulation and real system implementation demonstrate that FDPM requires a moderately small number of packets to complete the traceback process; add little additional load to routers and can trace a large number of sources in one traceback process with low false positive rates. The built-in overload prevention mechanism makes this system capable of achieving a satisfactory traceback result even when the router is heavily loaded. It has been used to not only trace DDoS attacking packets but also enhance filtering attacking traffic.

CHAPTER 2

2.0 SYSTEM ANALYSIS

2.1 EXISTING SYSTEM:

Existing methods of the IP marking approach is that routers probabilistically write some encoding of partial path information into the packets during forwarding. A basic technique, the edge sampling algorithm, is to write edge information into the packets. This scheme reserves two static fields of the size of IP address, start and end, and a static distance field in each packet. Each router updates these fields as follows. Each router marks the packet with a probability. When the router decides to mark the packet, it writes its own IP address into the start field and writes zero into the distance field. Otherwise, if the distance field is already zero which indicates its previous router marked the packet, it writes its own IP address into the end field, thus represents the edge between itself and the previous routers.

Previous router doesn’t mark the packet, then it always increments the distance field. Thus the distance field in the packet indicates the number of routers the packet has traversed from the router which marked the packet to the victim. The distance field should be updated using a saturating addition, meaning that the distance field is not allowed to wrap. The mandatory increment of the distance field is used to avoid spoofing by an attacker. Using such a scheme, any packet written by the attacker will have distance field greater than or equal to the length of the real attack path a router false positive if it is in the reconstructed attack graph but not in the real attack graph. Similarly we call a router false negative if it is in the true attack graph but not in the reconstructed attack graph. We call a solution to the IP traceback problem robust if it has very low rate of false negatives and false positives.

2.1.1 DISADVANTAGES:

Existing approach has a very high computation overhead for the victim to reconstruct the attack paths, and gives a large number of false positives when the denial-of-service attack originates from multiple attackers.

Existing approach can require days of computation to reconstruct the attack paths and give thousands of false positives even when there are only 25 distributed attackers. This approach is also vulnerable to compromised routers.

If a router is compromised, it can forge markings from other uncompromised routers and hence lead the reconstruction to wrong results. Even worse, the victim will not be able to tell a router is compromised just from the information in the packets it receives problem.

2.2 PROPOSED SYSTEM:

We propose a novel solution, named Passive IP Traceback (PIT), to bypass the challenges in deployment. Routers may fail to forward an IP spoofing packet due to various reasons, e.g., TTL exceeding. In such cases, the routers may generate an ICMP error message (named path backscatter) and send the message to the spoofed source address. Because the routers can be close to the spoofers, the path backscatter messages may potentially disclose the locations of the spoofers. PIT exploits these path backscatter messages to find the location of the spoofers. With the locations of the spoofers known, the victim can seek help from the corresponding ISP to filter out the attacking packets, or take other counterattacks. PIT is especially useful for the victims in reflection based spoofing attacks, e.g., DNS amplification attacks. The victims can find the locations of the spoofers directly from the attacking traffic.

We present PIT, which tracks the location of the spoofers based on path backscatter messages together with the topology and routing information. We discuss how to apply PIT when both topology and routing are known, or only topology is known, or neither are known respectively. We also present two effective algorithms to apply PIT in large scale networks. In the following section, at first we show the statistical results on path backscatter messages. Then we evaluate the two key mechanisms of PIT which work without routing information. At last, we give the tracking result when applying PIT on the path backscatter message dataset: a number of ASes in which spoofers are found.

2.2.1 ADVANTAGES:

1) This is the first article known which deeply investigates path backscatter messages. These messages are valuable to help understand spoofing activities has exploited backscatter messages, which are generated by the targets of spoofing messages, to study Denial of Services (DoS), path backscatter messages, which are sent by intermediate devices rather than the targets, have not been used in traceback.

2) A practical and effective IP traceback solution based on path backscatter messages, i.e., PIT, is proposed. PIT bypasses the deployment difficulties of existing IP traceback mechanisms and actually is already in force. Though given the limitation that path backscatter messages are not generated with stable possibility, PIT cannot work in all the attacks, but it does work in a number of spoofing activities. At least it may be the most useful traceback mechanism before an AS-level traceback system has been deployed in real.

3) Through applying PIT on the path backscatter dataset, a number of locations of spoofers are captured and presented. Though this is not a complete list, it is the first known list disclosing the locations of spoofers.

2.3 HARDWARE & SOFTWARE REQUIREMENTS:

2.3.1 HARDWARE REQUIREMENT:

v Processor – Pentium –IV

Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Key Board – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA

2.3.2 SOFTWARE REQUIREMENTS:

Operating System : Windows XP or Win7
Front End : JAVA JDK 1.7
Document : MS-Office 2007

CHAPTER 3

3.0 SYSTEM DESIGN:

Data Flow Diagram / Use Case Diagram / Flow Diagram:

The DFD is also called as bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, various processing carried out on these data, and the output data is generated by the system

The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components. These components are the system process, the data used by the process, an external entity that interacts with the system and the information flows in the system.

DFD shows how the information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.

DFD is also known as bubble chart. A DFD may be used to represent a system at any level of abstraction. DFD may be partitioned into levels that represent increasing information flow and functional detail.

NOTATION:

SOURCE OR DESTINATION OF DATA:

External sources or destinations, which may be people or organizations or other entities

DATA SOURCE:

Here the data referenced by a process is stored and retrieved.

PROCESS:

People, procedures or devices that produce data’s in the physical component is not identified.

DATA FLOW:

Data moves in a specific direction from an origin to a destination. The data flow is a “packet” of data.

MODELING RULES:

There are several common modeling rules when creating DFDs:

All processes must have at least one data flow in and one data flow out.
All processes should modify the incoming data, producing new forms of outgoing data.
Each data store must be involved with at least one data flow.
Each external entity must be involved with at least one data flow.
A data flow must be attached to at least one process.

3.1 ARCHITECTURE DIAGRAM

3.2 DATAFLOW DIAGRAM:

LEVEL 1

Base station

View request

Router check the node

Message send via router

LEVEL 2

Node

Exists

Send request

Receive message

Check IP Address & check verification node

Clear Spoofing Attacks

LEVEL 3

Router

IP Address

Router check the each node

Check verification same/diff node to each data

Response to node

Detect Spoofing Origin and send message to original node

3.2.1 UML DIAGRAMS:

3.2.2 USE CASE DIAGRAM:

Base station

Router

Create message

View request

Message send via router

Router check each node

Check verification same/diff node to each data

Response to client

Detect spoofing origin

Send message

Node

Send request

3.2.3 CLASS DIAGRAM:

Node

IP Adress

Send request

View message ()

Base station

IP Address

View request

Send message via router

Socket connection () ()

Send message () ()

Router

IP Address

Router check the each node

Detectsppofing() ()

Receive message ()

Response to nde

Send message() ()

3.2.4 SEQUENCE DIAGRAM:

Connection established

Send encoded data

Check verification

Form routing

Routing Finished

Detect Spoofing

Connection terminate

Source

Base station

Destination

Establish communication

Connection established

Receiving Ack

Data received

Routing Success

3.2. ACTIVITY DIAGRAM:

Node

Router

Check

Check verification same/diff node to each data

Router check the each node

Clear jamming and send message to node

Response to client

Yes Start msg receive

IP Address & View request

IP Address

Send request

Message Received

Base station

Message send via router

Router check the node

CHAPTER 4

4.0 IMPLEMENTATION:

4.1 ALGORITHM

We designed an algorithm specified in Fig. 6. This algorithm first finds a shortest path from r to od. From the second vertex along the path, it checks if the removal of the vertex can break r and od. Whenever such a vertex c is found, removing the vertex from G, and the set containing all the verticals which are still connected with r is just the suspect set.

4.2 MODULES:

NETWORK SECURITY:

DENIAL OF SERVICE (DOS):

PATH BACKSCATTER:

IP SPOOFING METHOD:

IP TRACEBACK METHOD:

4.3 MODULE DESCRIPTION:

NETWORK SECURITY:

Network-accessible resources may be deployed in a network as surveillance and early-warning tools, as the detection of attackers are not normally accessed for legitimate purposes. Techniques used by the attackers that attempt to compromise these decoy resources are studied during and after an attack to keep an eye on new exploitation techniques. Such analysis may be used to further tighten security of the actual network being protected by the data’s. Data forwarding can also direct an attacker’s attention away from legitimate servers. A user encourages attackers to spend their time and energy on the decoy server while distracting their attention from the data on the real server. Similar to a server, a user is a network set up with intentional vulnerabilities. Its purpose is also to invite attacks so that the attacker’s methods can be studied and that information can be used to increase network security.

DENIAL OF SERVICE (DOS):

In computing, a denial-of-service (DoS) attack is an attempt to make a machine or network resource unavailable to its intended users, such as to temporarily or indefinitely interrupt or suspend services of a host connected to the Internet. A distributed denial-of-service (DDoS) is where the attack source is more than one, often thousands of, unique IP addresses. It is analogous to a group of people crowding the entry door or gate to a shop or business, and not letting legitimate parties enter into the shop or business, disrupting normal operations.

Criminal perpetrators of DoS attacks often target sites or services hosted on high-profile web servers such as banks, credit card payment gateways; but motives of revenge, blackmail or activism can be behind other attacks. A denial-of-service attack is characterized by an explicit attempt by attackers to prevent legitimate users of a service from using that service. There are two general forms of DoS attacks: those that crash services and those that flood services.

The most serious attacks are distributed and in many or most cases involve forging of IP sender addresses (IP address spoofing) so that the location of the attacking machines cannot easily be identified, nor can filtering be done based on the source address.

PATH BACKSCATTER:

We presented a preliminary statistical result on path backscatter messages and discussed it is possible to trace spoofers based on the messages. However, the generation and collection of path backscatter messages are not well investigated, and the traceback mechanisms are not designed. In this article, we make a thorough analysis on path backscatter messages, present the traceback mechanisms and give the traceback results. 2. Each message contains the source address of the reflecting device, and the IP header of the original packet. Thus, from each path backscatter, we can get 1) the IP address of the reflecting device which is on the path from the attacker to the destination of the spoofing packet; 2) the IP address of the original destination of the spoofing packet. The original IP header also contains other valuable information, e.g., the remaining TTL of the spoofing packet. Note that due to some network devices may perform address rewrite (e.g., NAT), the original source address and the destination address may be different.

IP SPOOFING METHOD:

Our tracking mechanisms actually have two limitations. The first is the network topology and mapping from addresses of r and od must be known. The second is the tracking is actually performed based on loose assumptions on paths. Thus, only when path backscatter messages are from very special vertex, i.e., stub AS, the spoofer can be accurately located. In this section, we discuss how to break these limitations through using other information contained in path backscatter messages.

We found there are three special types of path backscatter messages which are more useful for tracing spoofers:

1) The path backscatter messages whose original hop count is 0 or 1. Such messages are generated 1 or 2 hops from the spoofers. Very possibly they are from the gateway of the spoofer.

2) The path backscatter messages whose type is ‘Redirect’. Such messages must be from a gateway of the spoofer.

3) The path backscatter messages whose original destination is a private address or unallocated address. Such messages are typically generated by the first DFZ router on the path from the spoofer to the original destination, e.g., the egress router of the AS in which the spoofer resides. Though such path backscatter messages are generated in very special cases, they are not rare. Especially, there are a large number of path backscatter messages whose original destination is a private address.

IP TRACEBACK METHOD:

PIT is very different from any existing traceback mechanism. The main difference is the generation of path backscatter message is not of a certain probability. Thus, we separate the evaluation into 3 parts: the first is the statistical results on path backscatter messages; the second is the evaluation on the traceback mechanisms presented in considering uncertainness of path backscatter generation, since effectiveness of the mechanisms is actually determined by the structure features of the networks; the last is the result of performing the traceback mechanisms on the path backscatter message dataset.

In this article, we proposed Passive IP Traceback (PIT) which tracks spoofers based on path backscatter messages and public available information. We illustrate causes, collection, and statistical results on path backscatter. We specified how to apply PIT when the topology and routing are both known, or the routing is unknown, or neither of them are known. We presented two effective algorithms to apply PIT in large scale networks and proofed their correctness. We demonstrated the effectiveness of PIT based on deduction and simulation. We showed the captured locations of spoofers through applying PIT on the path backscatter dataset. These results can help further reveal IP spoofing, which has been studied for long but never well understood.

CHAPTER 5

5.0 SYSTEM STUDY:

5.1 FEASIBILITY STUDY:

Three key considerations involved in the feasibility analysis are

ECONOMICAL FEASIBILITY
TECHNICAL FEASIBILITY
SOCIAL FEASIBILITY

5.1.1 ECONOMICAL FEASIBILITY:

5.1.2 TECHNICAL FEASIBILITY

5.1.3 SOCIAL FEASIBILITY:

5.2 SYSTEM TESTING:

5.2.1 UNIT TESTING:

Description	Expected result
Test for application window properties.	All the properties of the windows are to be properly aligned and displayed.
Test for mouse operations.	All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions.

5.1.2 FUNCTIONAL TESTING:

Description	Expected result
Test for all modules.	All peers should communicate in the group.
Test for various peer in a distributed network framework as it display all users available in the group.	The result after execution should give the accurate result.

5.1. 3 NON-FUNCTIONAL TESTING:

Load testing
Performance testing
Usability testing
Reliability testing
Security testing

5.1.4 LOAD TESTING:

Description	Expected result
It is necessary to ascertain that the application behaves correctly under loads when ‘Server busy’ response is received.	Should designate another active node as a Server.

5.1.5 PERFORMANCE TESTING:

Description	Expected result
This is required to assure that an application perforce adequately, having the capability to handle many peers, delivering its results in expected time and using an acceptable level of resource and it is an aspect of operational management.	Should handle large input values, and produce accurate result in a expected time.

5.1.6 RELIABILITY TESTING:

Description	Expected result
This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in provide the application.	In case of failure of the server an alternate server should take over the job.

5.1.7 SECURITY TESTING:

Description	Expected result
Checking that the user identification is authenticated.	In case failure it should not be connected in the framework.
Check whether group keys in a tree are shared by all peers.	The peers should know group key in the same group.

5.1.8 WHITE BOX TESTING:

Description	Expected result
Exercise all logical decisions on their true and false sides.	All the logical decisions must be valid.
Execute all loops at their boundaries and within their operational bounds.	All the loops must be finite.
Exercise internal data structures to ensure their validity.	All the data structures must be valid.

5.1.9 BLACK BOX TESTING:

Description	Expected result
To check for incorrect or missing functions.	All the functions must be valid.
To check for interface errors.	The entire interface must function normally.
To check for errors in a data structures or external data base access.	The database updation and retrieval must be done.
To check for initialization and termination errors.	All the functions and data structures must be initialized properly and terminated normally.

All the above system testing strategies are carried out in as the development, documentation and institutionalization of the proposed goals and related policies is essential.

CHAPTER 6

6.0 SOFTWARE DESCRIPTION:

6.1 JAVA TECHNOLOGY:

Java technology is both a programming language and a platform.

The Java Programming Language

The Java programming language is a high-level language that can be characterized by all of the following buzzwords:

Simple
- Architecture neutral
- Object oriented
- Portable
- Distributed
- High performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure

6.2 THE JAVA PLATFORM:

The Java platform has two components:

The Java Virtual Machine (Java VM)
The Java Application Programming Interface (Java API)

You’ve already been introduced to the Java VM. It’s the base for the Java platform and is ported onto various hardware-based platforms.

The following figure depicts a program that’s running on the Java platform. As the figure shows, the Java API and the virtual machine insulate the program from the hardware.

6.3 WHAT CAN JAVA TECHNOLOGY DO?

The essentials: Objects, strings, threads, numbers, input and output, data structures, system properties, date and time, and so on.
Applets: The set of conventions used by applets.
Networking: URLs, TCP (Transmission Control Protocol), UDP (User Data gram Protocol) sockets, and IP (Internet Protocol) addresses.
Internationalization: Help for writing programs that can be localized for users worldwide. Programs can automatically adapt to specific locales and be displayed in the appropriate language.
Security: Both low level and high level, including electronic signatures, public and private key management, access control, and certificates.
Software components: Known as JavaBeans^TM, can plug into existing component architectures.
Object serialization: Allows lightweight persistence and communication via Remote Method Invocation (RMI).
Java Database Connectivity (JDBC^TM): Provides uniform access to a wide range of relational databases.

The Java platform also has APIs for 2D and 3D graphics, accessibility, servers, collaboration, telephony, speech, animation, and more. The following figure depicts what is included in the Java 2 SDK.

6.4 HOW WILL JAVA TECHNOLOGY CHANGE MY LIFE?

Get started quickly: Although the Java programming language is a powerful object-oriented language, it’s easy to learn, especially for programmers already familiar with C or C++.
Write less code: Comparisons of program metrics (class counts, method counts, and so on) suggest that a program written in the Java programming language can be four times smaller than the same program in C++.
Write better code: The Java programming language encourages good coding practices, and its garbage collection helps you avoid memory leaks. Its object orientation, its JavaBeans component architecture, and its wide-ranging, easily extendible API let you reuse other people’s tested code and introduce fewer bugs.
Develop programs more quickly: Your development time may be as much as twice as fast versus writing the same program in C++. Why? You write fewer lines of code and it is a simpler programming language than C++.
Avoid platform dependencies with 100% Pure Java: You can keep your program portable by avoiding the use of libraries written in other languages. The 100% Pure Java^TMProduct Certification Program has a repository of historical process manuals, white papers, brochures, and similar materials online.
Write once, run anywhere: Because 100% Pure Java programs are compiled into machine-independent byte codes, they run consistently on any Java platform.
Distribute software more easily: You can upgrade applets easily from a central server. Applets take advantage of the feature of allowing new classes to be loaded “on the fly,” without recompiling the entire program.

6.5 ODBC:

6.6 JDBC:

JDBC was announced in March of 1996. It was released for a 90 day public review that ended June 8, 1996. Because of user input, the final JDBC v1.0 specification was released soon after.

6.7 JDBC Goals:

The goals that were set for JDBC are important. They will give you some insight as to why certain classes and functionalities behave the way they do. The eight design goals for JDBC are as follows:

SQL Level API

SQL Conformance

JDBC must be implemental on top of common database interfaces

Provide a Java interface that is consistent with the rest of the Java system

Because of Java’s acceptance in the user community thus far, the designers feel that they should not stray from the current design of the core Java system.

Keep it simple

Use strong, static typing wherever possible

Strong typing allows for more error checking to be done at compile time; also, less error appear at runtime.

Keep the common cases simple

Finally we decided to precede the implementation using Java Networking.

And for dynamically updating the cache table we go for MS Access database.

Java ha two things: a programming language and a platform.

Java is a high-level programming language that is all of the following

Simple Architecture-neutral

Object-oriented Portable

Distributed High-performance

Interpreted Multithreaded

Robust Dynamic Secure

Compilation happens just once; interpretation occurs each time the program is executed. The figure illustrates how this works.

Java Program

Compilers

Interpreter

My Program

6.7 NETWORKING TCP/IP STACK:

The TCP/IP stack is shorter than the OSI one:

TCP is a connection-oriented protocol; UDP (User Datagram Protocol) is a connectionless protocol.

IP datagram’s:

UDP:

UDP is also connectionless and unreliable. What it adds to IP is a checksum for the contents of the datagram and port numbers. These are used to give a client/server model – see later.

TCP:

TCP supplies logic to give a reliable connection-oriented protocol above IP. It provides a virtual circuit that two processes can use to communicate.

Internet addresses

In order to use a service, you must be able to find it. The Internet uses an address scheme for machines so that they can be located. The address is a 32 bit integer which gives the IP address.

Network address:

Class A uses 8 bits for the network address with 24 bits left over for other addressing. Class B uses 16 bit network addressing. Class C uses 24 bit network addressing and class D uses all 32.

Subnet address:

Internally, the UNIX network is divided into sub networks. Building 11 is currently on one sub network and uses 10-bit addressing, allowing 1024 different hosts.

Host address:

8 bits are finally used for host addresses within our subnet. This places a limit of 256 machines that can be on the subnet.

Total address:

The 32 bit address is usually written as 4 integers separated by dots.

Port addresses

Sockets:

#include <sys/types.h>

#include <sys/socket.h>

int socket(int family, int type, int protocol);

6.8 JFREE CHART:

JFreeChart is a free 100% Java chart library that makes it easy for developers to display professional quality charts in their applications. JFreeChart’s extensive feature set includes:

A consistent and well-documented API, supporting a wide range of chart types;

A flexible design that is easy to extend, and targets both server-side and client-side applications;

Support for many output types, including Swing components, image files (including PNG and JPEG), and vector graphics file formats (including PDF, EPS and SVG);

JFreeChart is “open source” or, more specifically, free software. It is distributed under the terms of the GNU Lesser General Public Licence (LGPL), which permits use in proprietary applications.

6.8.1. Map Visualizations:

6.8.2. Time Series Chart Interactivity

6.8.3. Dashboards

6.8.4. Property Editors

CHAPTER 7

7.0 APPENDIX

7.1 SAMPLE SCREEN SHOTS:

7.2 SAMPLE SOURCE CODE:

CHAPTER 8

8.1 CONCLUSION:

We try to dissipate the mist on the the locations of spoofers based on investigating the path backscatter messages. In this article, we proposed Passive IP Traceback (PIT) which tracks spoofers based on path backscatter messages and public available information. We illustrate causes, collection, and statistical results on path backscatter. We specified how to apply PIT when the topology and routing are both known, or the routing is unknown, or neither of them are known.

We presented two effective algorithms to apply PIT in large scale networks and proofed their correctness. We demonstrated the effectiveness of PIT based on deduction and simulation. We showed the captured locations of spoofers through applying PIT on the path backscatter dataset. These results can help further reveal IP spoofing, which has been studied for long but never well understood.

Optimal Configuration of Network Coding in Ad Hoc Networks

05/08/201902/07/2019 by admin

Abstract:

Analyze the impact of network coding (NC) configuration on the performance of ad hoc networks with the consideration of two significant factors, namely, the throughput loss and the decoding loss, which are jointly treated as the overhead of NC. In particular, physical-layer NC and random linear NC are adopted in static and mobile ad hoc networks (MANETs), respectively. Furthermore, we characterize the good put and delay/good put tradeoff in static networks, which are also analyzed in MANETs for different mobility models (i.e., the random independent and identically distributed mobility model and the random walk model) and transmission schemes.

Introduction:

Network coding was initially designed as a kind of Source coding. Further studies showed that the Capacity of wired networks can be improved by network coding (NC), which can fully utilize the network resources.

Due to This advantage, how to employ NC in wireless ad hoc networks has been intensively studied in recent years with the Purpose of improving the throughput and delay performance. The main difference between wired networks and wireless Networks is that there is non ignorable interference between Nodes in wireless networks.

Therefore, it is important to design the NC in wireless ad hoc networks with interference to achieve the improvement on system performance such as good put and delay/good put tradeoff.

Existing System:

The probability that the random linear NC was valid for a multicast connection problem on an arbitrary network with independent sources was at least (1 − d/q)η, where η was the number of links with associated random coefficients, d was the number of receivers, and q was the size of Galois field Fq.

It was obvious that a large q was required to guarantee that the system with RLNC was valid. When considering the given two factors, the traditional definition of throughput in ad hoc networks is no longer appropriate since it does not consider the bits of NC coefficients and the linearly correlated packets that do not carry any valuable data. Instead, the good put and the delay/good put tradeoff are investigated in this paper, which only take into account the successfully decoded data.

Moreover, if we treat the data size of each packet, the generation size (the number of packets that are combined by NC as a group), and the NC coefficient Galois field as the configuration of NC, it is necessary to find the scaling laws of the optimal configuration for a given network model and transmission scheme.

Disadvantages:

Throughput loss.
The decoding loss.
Time delay.

Proposed System:

Proposed system with the basic idea of NC and the scaling laws of throughput loss and decoding loss. Furthermore, some useful concepts and parameters are listed. Finally, we give the definitions of some network performance metrics.

Physical layer Network Coding designed based on the channel state information (CSI) and network topology. The PNC is appropriate for the static networks since the CSI and network topology are preknown in the static case.

There are G nodes in one cell, and node i (i = 1, 2, . . . , G) holds packet xi. All of the G packets are independent, and they belong to the same unicast session. The packets are transmitted to a node i’ in the next cell simultaneously. gii’ is a complex number that represents the CSI between i and i’ in the frequency domain.

Advantages:

System minimizes data loss.
System reduces time delay.

Modules:

Network Topology:

The networks that consist of n randomly and evenly distributed static nodes in a unit square area. These nodes are randomly grouped into S–D pairs.

Transmission Model:

The protocol model, which is a simplified version of the physical model since it ignores the long-distance interference and transmission. Moreover, it is indicated in that the physical model can be treated as the protocol model on scaling laws when the transmission is allowed if the signal-to-interference-plus-noise ratio is larger than a given threshold.

Transmission Schemes for Mobile Networks:

When the relay receives a new packet, it combines the packet it has with that it receives by randomly selected coefficients and then generates a new packet. Simultaneous transmission in one cell is not allowed since it is hard for the receiver to obtain multiple CSI from different transmitters at the same time. Hence, we employ the random linear NC for mobile models.

Conclusion:

Analyzed the NC configuration in both static and mobile ad hoc networks to optimize the delay good put tradeoff and the good put with the consideration of the

Through put loss and decoding loss of NC. These results reveal the impact of network scale on the NC system, which has not been studied in previous works. Moreover, we also compared the performance with the corresponding networks without NC.

On Traffic-Aware Partition and Aggregation in MapReduce for Big Data Applications

05/08/201902/07/2019 by admin

MapReduce job, we consider to aggregate data with the same keys before sending them to remote reduce tasks. Although a similar function, called combine, has been already adopted by Hadoop, it operates immediately after a map task solely for its generated data, failing to exploit the data aggregation opportunities among multiple tasks on different machines. We jointly consider data partition and aggregation for a MapReduce job with an objective that is to minimize the total network traffic. In particular, we propose a distributed algorithm for big data applications by decomposing the original large-scale problem into several subproblems that can be solved in parallel. Moreover, an online algorithm is designed to deal with the data partition and aggregation in a dynamic manner. Finally, extensive simulation results demonstrate that our proposals can significantly reduce network traffic cost in both offline and online cases.

1.2 INTRODUCTION

MapReduce has emerged as the most popular computing framework for big data processing due to its simple programming model and automatic management of parallel execution. MapReduce and its open source implementation Hadoop have been adopted by leading companies, such as Yahoo!, Google and Facebook, for various big data applications, such as machine learning bioinformatics and cybersecurity. MapReduce divides a computation into two main phases, namely map and reduce which in turn are carried out by several map tasks and reduce tasks, respectively. In the map phase, map tasks are launched in parallel to convert the original input splits into intermediate data in a form of key/value pairs. These key/value pairs are stored on local machine and organized into multiple data partitions, one per reduce task. In the reduce phase, each reduce task fetches its own share of data partitions from all map tasks to generate the final result.

There is a shuffle step between map and reduce phase.

In this step, the data produced by the map phase are ordered, partitioned and transferred to the appropriate machines executing the reduce phase. The resulting network traffic pattern from all map tasks to all reduce tasks can cause a great volume of network traffic, imposing a serious constraint on the efficiency of data analytic applications. For example, with tens of thousands of machines, data shuffling accounts for 58.6% of the cross-pod traffic and amounts to over 200 petabytes in total in the analysis of SCOPE jobs. For shuffle-heavy MapReduce tasks, the high traffic could incur considerable performance overhead up to 30-40 % as shown in default, intermediate data are shuffled according to a hash function in Hadoop, which would lead to large network traffic because it ignores network topology and data size associated with each key.

We consider a toy example with two map tasks and two reduce tasks, where intermediate data of three keys K1, K2, and K3 are denoted by rectangle bars under each machine. If the hash function assigns data of K1 and K3 to reducer 1, and K2 to reducer 2, a large amount of traffic will go through the top switch. To tackle this problem incurred by the traffic-oblivious partition scheme, we take into account of both task locations and data size associated with each key in this paper. By assigning keys with larger data size to reduce tasks closer to map tasks, network traffic can be significantly reduced. In the same example above, if we assign K1 and K3 to reducer 2, and K2 to reducer 1, as shown in Fig. 1(b), the data transferred through the top switch will be significantly reduced.

To further reduce network traffic within a MapReduce job, we consider to aggregate data with the same keys before sending them to remote reduce tasks. Although a similar function, called combine, has been already adopted by Hadoop, it operates immediately after a map task solely for its generated data, failing to exploit the data aggregation opportunities among multiple tasks on different machines. As an example shown in Fig. 2(a), in the traditional scheme, two map tasks individually send data of key K1 to the reduce task. If we aggregate the data of the same keys before sending them over the top switch, as shown in Fig. 2(b), the network traffic will be reduced.

In this paper, we jointly consider data partition and aggregation for a MapReduce job with an objective that is to minimize the total network traffic. In particular, we propose a distributed algorithm for big data applications by decomposing the original large-scale problem into several subproblems that can be solved in parallel. Moreover, an online algorithm is designed to deal with the data partition and aggregation in a dynamic manner. Finally, extensive simulation results demonstrate that our proposals can significantly reduce network traffic cost in both offline and online cases.

1.3 LITRATURE SURVEY

MAPREDUCE: SIMPLIFIED DATA PROCESSING ON LARGE CLUSTERS

AUTHOR: Dean and S. Ghemawat

PUBLISH: Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.

EXPLANATION:

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google’s clusters every day.

CLOUDBLAST: COMBINING MAPREDUCE AND VIRTUALIZATION ON DISTRIBUTED RESOURCES FOR BIOINFORMATICS APPLICATIONS

AUTHOR: A. Matsunaga, M. Tsugawa, and J. Fortes,

PUBLISH: IEEE Fourth International Conference on. IEEE, 2008, pp. 222–229.

EXPLANATION:

This paper proposes and evaluates an approach to the parallelization, deployment and management of bioinformatics applications that integrates several emerging technologies for distributed computing. The proposed approach uses the MapReduce paradigm to parallelize tools and manage their execution, machine virtualization to encapsulate their execution environments and commonly used data sets into flexibly deployable virtual machines, and network virtualization to connect resources behind firewalls/NATs while preserving the necessary performance and the communication environment. An implementation of this approach is described and used to demonstrate and evaluate the proposed approach. The implementation integrates Hadoop, Virtual Workspaces, and ViNe as the MapReduce, virtual machine and virtual network technologies, respectively, to deploy the commonly used bioinformatics tool NCBI BLAST on a WAN-based test bed consisting of clusters at two distinct locations, the University of Florida and the University of Chicago. This WAN-based implementation, called CloudBLAST, was evaluated against both non-virtualized and LAN-based implementations in order to assess the overheads of machine and network virtualization, which were shown to be insignificant. To compare the proposed approach against an MPI-based solution, CloudBLAST performance was experimentally contrasted against the publicly available mpiBLAST on the same WAN-based test bed. Both versions demonstrated performance gains as the number of available processors increased, with CloudBLAST delivering speedups of 57 against 52.4 of MPI version, when 64 processors on 2 sites were used. The results encourage the use of the proposed approach for the execution of large-scale bioinformatics applications on emerging distributed environments that provide access to computing resources as a service.

MAP TASK SCHEDULING IN MAPREDUCE WITH DATA LOCALITY: THROUGHPUT AND HEAVY-TRAFFIC OPTIMALITY

AUTHOR: W. Wang, K. Zhu, L. Ying, J. Tan, and L. Zhang

PUBLISH: INFOCOM, 2013 Proceedings IEEE. IEEE, 2013, pp. 1609–1617.

EXPLANATION:

Scheduling map tasks to improve data locality is crucial to the performance of MapReduce. Many works have been devoted to increasing data locality for better efficiency. However, to the best of our knowledge, fundamental limits of MapReduce computing clusters with data locality, including the capacity region and theoretical bounds on the delay performance, have not been studied. In this paper, we address these problems from a stochastic network perspective. Our focus is to strike the right balance between data-locality and load-balancing to simultaneously maximize throughput and minimize delay.

We present a new queueing architecture and propose a map task scheduling algorithm constituted by the Join the Shortest Queue policy together with the MaxWeight policy. We identify an outer bound on the capacity region, and then prove that the proposed algorithm stabilizes any arrival rate vector strictly within this outer bound. It shows that the algorithm is throughput optimal and the outer bound coincides with the actual capacity region. Further, we study the number of backlogged tasks under the proposed algorithm, which is directly related to the delay performance based on Little’s law. We prove that the proposed algorithm is heavy-traffic optimal, i.e., it asymptotically minimizes the number of backlogged tasks as the arrival rate vector approaches the boundary of the capacity region. Therefore, the proposed algorithm is also delay optimal in the heavy-traffic regime.

CHAPTER 2

2.0 SYSTEM ANALYSIS

2.1 EXISTING SYSTEM:

Existing problem of optimizing network usage in MapReduce scheduling in the reason that we are interested in network usage is twofold. Firstly, network utilization is a quantity of independent interest, as it is directly related to the throughput of the system. Note that the total amount of data processed in unit time is simply (CPU utilization)·(CPU capacity)+ (network utilization)·(network capacity). CPU utilization will always be 1 as long as there are enough jobs in the map queue, but network utilization can be very sensitive to scheduling network utilization has been identified as a key component in optimization of MapReduce systems in several previous works.

Network usage could lead us to algorithms with smaller mean response time. We find the main motivation for this direction of our work in the results of the aforementioned overlap between map and shuffle phases, are shown to yield significantly better mean response time than Hadoop’s fair scheduler. However, we observed that neither of these two algorithms explicitly attempted to optimize network usage, which suggested room for improvement. MapReduce has become one of the most popular frameworks for large-scale distributed computing, there exists a huge body of work regarding performance optimization of MapReduce.

For instance, researchers have tried to optimize MapReduce systems by efficiently detecting and eliminating the so-called “stragglers” providing better locality of data preventing starvation caused by large jobs analyzing the problem from a purely theoretical viewpoint of shuffle workload available at any given time is closely related to the output rate of the map phase, due to the inherent dependency between the map and shuffle phases. In particular, when the job that is being processed is ‘map-heavy,’ the available workload of the same job in the shuffle phase is upper-bounded by the output rate of the map phase. Therefore, poor scheduling of map tasks can have adverse effects on the throughput of the shuffle phase, causing the network to be idle and the efficiency of the entire system to decrease.

2.1.1 DISADVANTAGES:

Existing model, called the overlapping tandem queue model, is a job-level model for MapReduce where the map and shuffle phases of the MapReduce framework are modeled as two queues that are put in tandem. Since it is a job-level model, each job is represented by only the map size and the shuffle size simplification is justified by the introduction of two main assumptions. The first assumption states that each job consists of a large number of small-sized tasks, which allows us to represent the progress of each phase by real numbers.

The job-level model offers two big disadvantages over the more complicated task-level models.

Firstly, it gives rise to algorithms that are much simpler than those of task-level models, which enhances chances of being deployed in an actual system.

Secondly, the number of jobs in a system is often smaller than the number of tasks by several orders of magnitude, making the problem computationally much less strenuous note that there are still some questions to be studied regarding the general applicability of the additional assumptions of the job-level model, which are interesting research questions in their own light

2.2 PROPOSED SYSTEM:

MapReduce resource allocation system, to enhance the performance of MapReduce jobs in the cloud by locating intermediate data to the local machines or close-by physical machines in this locality-awareness reduces network traffic in the shuffle phase generated in the cloud data center. However, little work has studied to optimize network performance of the shuffle process that generates large amounts of data traffic in MapReduce jobs. A critical factor to the network performance in the shuffle phase is the intermediate data partition. The default scheme adopted by Hadoop is hash-based partition that would yield unbalanced loads among reduce tasks due to its unawareness of the data size associated with each key.

We have developed a fairness-aware key partition approach that keeps track of the distribution of intermediate keys’ frequencies, and guarantees a fair distribution among reduce tasks. have introduced a combiner function that reduces the amount of data to be shuffled and merged to reduce tasks an in-mapper combining scheme by exploiting the fact that mappers can preserve state across the processing of multiple input key/value pairs and defer emission of intermediate data until all input records have been processed. Both proposals are constrained to a single map task, ignoring the data aggregation opportunities from multiple map tasks a MapReduce-like system to decrease the traffic by pushing aggregation from the edge into the network.

2.2.1 ADVANTAGES:

Our proposed distributed algorithm and the optimal solution obtained by solving the MILP formulation. Due to the high computational complexity of the MILP formulation, we consider small-scale problem instances with 10 keys in this set of simulations.

Our distributed algorithm is very close to the optimal solution. Although network traffic cost increases as the number of keys grows for all algorithms, the performance enhancement of our proposed algorithms to the other two schemes becomes larger.

Our distributed algorithm with the other two schemes a default simulation setting with a number of parameters, and then study the performance by changing one parameter while fixing others. We consider a MapReduce job with 100 keys and other parameters are the same above. the network traffic cost shows as an increasing function of number of keys from 1 to 100 under all algorithms.

2.3 HARDWARE & SOFTWARE REQUIREMENTS:

2.3.1 HARDWARE REQUIREMENT:

v Processor – Pentium –IV

Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Key Board – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA

2.3.2 SOFTWARE REQUIREMENTS:

Operating System : Windows XP or Win7
Front End : JAVA JDK 1.7
Script : Java Script
Tool : Netbean 7
Document : MS-Office 2007

CHAPTER 3

3.0 SYSTEM DESIGN:

Data Flow Diagram / Use Case Diagram / Flow Diagram:

The DFD is also called as bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, various processing carried out on these data, and the output data is generated by the system

The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components. These components are the system process, the data used by the process, an external entity that interacts with the system and the information flows in the system.

DFD shows how the information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.

DFD is also known as bubble chart. A DFD may be used to represent a system at any level of abstraction. DFD may be partitioned into levels that represent increasing information flow and functional detail.

NOTATION:

SOURCE OR DESTINATION OF DATA:

External sources or destinations, which may be people or organizations or other entities

DATA SOURCE:

Here the data referenced by a process is stored and retrieved.

PROCESS:

People, procedures or devices that produce data’s in the physical component is not identified.

DATA FLOW:

Data moves in a specific direction from an origin to a destination. The data flow is a “packet” of data.

MODELING RULES:

There are several common modeling rules when creating DFDs:

All processes must have at least one data flow in and one data flow out.
All processes should modify the incoming data, producing new forms of outgoing data.
Each data store must be involved with at least one data flow.
Each external entity must be involved with at least one data flow.
A data flow must be attached to at least one process.

3.1 ARCHITECTURE DIAGRAM

3.2 DATAFLOW DIAGRAM

SERVER

Access Layer

Cross Layer

Use Hash Partition

Traffic Aware Partition

Send data through head node

Mapper

RECEIVER

Aggregation Layer

Map Reducer

OHRA

OHNA

UML DIAGRAMS:

3.2 USE CASE DIAGRAM:

Source

Destination

Establish connection

Send the data

Data send into destination

Data Aggregation Layer

Receive data

Neighbor Nodes

View data

Base station

Form the cluster

3.3 CLASS DIAGRAM:

Source

Base station

System Address

Data Send ()

Data send

Data info

Destination address

Data Send

Transmitting ()

Destination

System Address ()

Maintain Details

Verify ()

Receive data ()

View data ()

Connection ()

Move Nodes

Node info

Data length

Hop routing ()

3.4 SEQUENCE DIAGRAM:

Connection established

Send data Data Aggregation Layer Form routing Routing Finished Traffic Aware Partition Connection terminate Source Base station Destination Establish communication Connection established Receiving Ack Data received Map Reducer

3.5 ACTIVITY DIAGRAM:

Source

Destination

False

Receive data

View data

True

False

Connection establish

Send data

Aggregation Node

Receive Ack

True

Using Mapper

Data transfer

Map Reducer

Base station

CHAPTER 4

4.0 IMPLEMENTATION:

ONLINE EXTENSION OF HRA AND HNA

In this section, we conduct extensive simulations to evaluate the performance of our proposed distributed algorithm DA. We compare DA with HNA, which is the default method in Hadoop. To our best knowledge, we are the first to propose the aggregator placement algorithm, and compared with the HRA that focuses on a random aggregator placement. All simulation results are averaged over 30 random instances.

• HNA: Hash-based partition with No Aggregation. It exploits the traditional hash partitioning for the intermediate data, which are transferred to reducers without going through aggregators. It is the default method in Hadoop.

• HRA: Hash-based partition with Random Aggregation. It adds a random aggregator placement algorithm based on the traditional Hadoop. Through randomly placing aggregators in the shuffle phase, it aims to reducing the network traffic cost in the comparison of traditional method in Hadoop.

Our proposed distributed algorithm and the optimal solution obtained by solving the MILP formulation. Due to the high computational complexity of the MILP formulation, we consider small-scale problem instances with 10 keys in this set of simulations. Each key associated with random data size within [1-50]. There are 20 mappers, and 2 reducers on a cluster of 20 machines. The parameter α is set to 0.5. The distance between any two machines is randomly chosen within [1-60]. As shown in Fig. 7, the performance of our distributed algorithm is very close to the optimal solution. Although network traffic cost increases as the number of keys grows for all algorithms, the performance enhancement of our proposed algorithms to the other two schemes becomes larger. When the number of keys is set to 10, the default algorithm HNA has a cost of 5.0 × 104 while optimal solution is only 2.7×104 , with 46% traffic reduction.

4.1 ALGORITHM

DISTRIBUTED ALGORITHM

The problem above can be solved by highly efficient approximation algorithms, e.g., branch-and-bound, and fast off-the-shelf solvers, e.g., CPLEX, for moderate-sized input. An additional challenge arises in dealing with the MapReduce job for big data. In such a job, there are hundreds or even thousands of keys, each of which is associated with a set of variables (e.g., x p ij and y p k ) and constraints in our formulation, leading to a large-scale optimization problem that is hardly handled by existing algorithms and solvers in practice.

ONLINE ALGORITHM

We take the data size m p i and data aggregation ratio αj as input of our algorithms. In order to get their values, we need to wait all mappers to finish before starting reduce tasks, or conduct estimation via profiling on a small set of data. In practice, map and reduce tasks may partially overlap in execution to increase system throughput, and it is difficult to estimate system parameters at a high accuracy for big data applications. These motivate us to design an online algorithm to dynamically adjust data partition and aggregation during the execution of map and reduce tasks.

4.2 MODULES:

SERVER CLIENTS:

DITRIBUTED DATA:

SHEDULING TASK:

NETWORK TRAFFIC TRACES:

MAPREDUCE TASK:

4.3 MODULE DESCRIPTION:

SERVER CLIENTS:

Client-server computing or networking is a distributed application architecture that partitions tasks or workloads between service providers (servers) and service requesters, called clients. Often clients and servers operate over a computer network on separate hardware. A server machine is a high-performance host that is running one or more server programs which share its resources with clients. A client also shares any of its resources; Clients therefore initiate communication sessions with servers which await (listen to) incoming requests.

DITRIBUTED DATA:

We develop a distributed algorithm to solve the problem on multiple machines in a parallel manner. Our basic idea is to decompose the original large-scale problem into several distributively solvable subproblems that are coordinated by a high-level master problem. We jointly consider data partition and aggregation for a MapReduce job with an objective that is to minimize the total network traffic. In particular, we propose a distributed algorithm for big data applications by decomposing the original large-scale problem into several subproblems that can be solved in parallel. Moreover, an online algorithm is designed to deal with the data partition and aggregation in a dynamic manner. Finally, extensive simulation results demonstrate that our proposals can significantly reduce network traffic cost in both offline and online cases.

SHEDULING TASK:

MapReduce divides a computation into two main phases, namely map and reduce which in turn are carried out by several map tasks and reduce tasks, respectively. In the map phase, map tasks are launched in parallel to convert the original input splits into intermediate data in a form of key/value pairs. These key/value pairs are stored on local machine and organized into multiple data partitions, one per reduce task. In the reduce phase, each reduce task fetches its own share of data partitions from all map tasks to generate the final result. There is a shuffle step between map and reduce phase. In this step, the data produced by the map phase are ordered, partitioned and transferred to the appropriate machines executing the reduce phase. The resulting network traffic pattern from all map tasks to all reduce tasks can cause a great volume of network traffic, imposing a serious constraint on the efficiency of data analytic applications.

NETWORK TRAFFIC TRACES:

Network traffic within a MapReduce job, we consider to aggregate data with the same keys before sending them to remote reduce tasks. Although a similar function, called combiner has been already adopted by Hadoop, it operates immediately after a map task solely for its generated data, failing to exploit the data aggregation opportunities among multiple tasks on different machines. As an example shown in Fig. 2(a), in the traditional scheme, two map tasks individually send data of key K1 to the reduce task. If we aggregate the data of the same keys before sending them over the top switch, as shown in Fig. 2(b), the network traffic will be reduced. We tested the real network traffic cost in Hadoop using the real data source from latest dumps files in Wikimedia (http://dumps.wikimedia.org/enwiki/latest/). In the meantime, we executed our distributed algorithm using the same data source for comparison. Since our distributed algorithm is based on a known aggregation ratio _, we have done some experiments to evaluate it in Hadoop environment.

MAPREDUCE TASK:

We focus on MapReduce performance improvement by optimizing its data transmission optimizing network usage can lead to better system performance and found that high network utilization and low network congestion should be achieved simultaneously for a job with good performance. MapReduce resource allocation system, to enhance the performance of MapReduce jobs in the cloud by locating intermediate data to the local machines or close-by physical machines locality-awareness reduces network traffic in the shuffle phase generated in the cloud data center. However, little work has studied to optimize network performance of the shuffle process that generates large amounts of data traffic in MapReduce jobs. A critical factor to the network performance in the shuffle phase is the intermediate data partition. The default scheme adopted by Hadoop is hash-based partition that would yield unbalanced loads among reduce tasks due to its unawareness of the data size associated with each key.

To overcome this shortcoming, we have developed a fairness-aware key partition approach that keeps track of the distribution of intermediate keys’ frequencies, and guarantees a fair distribution among reduce tasks. In addition to data partition, many efforts have been made on local aggregation, in-mapper combining and in-network aggregation to reduce network traffic within MapReduce jobs. have introduced a combiner function that reduces the amount of data to be shuffled and merged to reduce tasks an in-mapper combining scheme by exploiting the fact that mappers can preserve state across the processing of multiple input key/value pairs and defer emission of intermediate data until all input records have been processed. Both proposals are constrained to a single map task, ignoring the data aggregation opportunities from multiple map tasks have proposed a MapReduce-like system to decrease the traffic by pushing aggregation from the edge into the network.

CHAPTER 5

5.0 SYSTEM STUDY:

5.1 FEASIBILITY STUDY:

Three key considerations involved in the feasibility analysis are

ECONOMICAL FEASIBILITY
TECHNICAL FEASIBILITY
SOCIAL FEASIBILITY

5.1.1 ECONOMICAL FEASIBILITY:

5.1.2 TECHNICAL FEASIBILITY

5.1.3 SOCIAL FEASIBILITY:

5.2 SYSTEM TESTING:

5.2.1 UNIT TESTING:

Description	Expected result
Test for application window properties.	All the properties of the windows are to be properly aligned and displayed.
Test for mouse operations.	All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions.

5.1.2 FUNCTIONAL TESTING:

Description	Expected result
Test for all modules.	All peers should communicate in the group.
Test for various peer in a distributed network framework as it display all users available in the group.	The result after execution should give the accurate result.

5.1. 3 NON-FUNCTIONAL TESTING:

Load testing
Performance testing
Usability testing
Reliability testing
Security testing

5.1.4 LOAD TESTING:

Description	Expected result
It is necessary to ascertain that the application behaves correctly under loads when ‘Server busy’ response is received.	Should designate another active node as a Server.

5.1.5 PERFORMANCE TESTING:

Description	Expected result
This is required to assure that an application perforce adequately, having the capability to handle many peers, delivering its results in expected time and using an acceptable level of resource and it is an aspect of operational management.	Should handle large input values, and produce accurate result in a expected time.

5.1.6 RELIABILITY TESTING:

Description	Expected result
This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in provide the application.	In case of failure of the server an alternate server should take over the job.

5.1.7 SECURITY TESTING:

Description	Expected result
Checking that the user identification is authenticated.	In case failure it should not be connected in the framework.
Check whether group keys in a tree are shared by all peers.	The peers should know group key in the same group.

5.1.8 WHITE BOX TESTING:

Description	Expected result
Exercise all logical decisions on their true and false sides.	All the logical decisions must be valid.
Execute all loops at their boundaries and within their operational bounds.	All the loops must be finite.
Exercise internal data structures to ensure their validity.	All the data structures must be valid.

5.1.9 BLACK BOX TESTING:

Description	Expected result
To check for incorrect or missing functions.	All the functions must be valid.
To check for interface errors.	The entire interface must function normally.
To check for errors in a data structures or external data base access.	The database updation and retrieval must be done.
To check for initialization and termination errors.	All the functions and data structures must be initialized properly and terminated normally.

All the above system testing strategies are carried out in as the development, documentation and institutionalization of the proposed goals and related policies is essential.

CHAPTER 6

6.0 SOFTWARE DESCRIPTION:

6.1 JAVA TECHNOLOGY:

Java technology is both a programming language and a platform.

The Java Programming Language

The Java programming language is a high-level language that can be characterized by all of the following buzzwords:

Simple
- Architecture neutral
- Object oriented
- Portable
- Distributed
- High performance
- Interpreted
- Multithreaded
- Robust
- Dynamic
- Secure

6.2 THE JAVA PLATFORM:

The Java platform has two components:

The Java Virtual Machine (Java VM)
The Java Application Programming Interface (Java API)

You’ve already been introduced to the Java VM. It’s the base for the Java platform and is ported onto various hardware-based platforms.

The following figure depicts a program that’s running on the Java platform. As the figure shows, the Java API and the virtual machine insulate the program from the hardware.

6.3 WHAT CAN JAVA TECHNOLOGY DO?

The essentials: Objects, strings, threads, numbers, input and output, data structures, system properties, date and time, and so on.
Applets: The set of conventions used by applets.
Networking: URLs, TCP (Transmission Control Protocol), UDP (User Data gram Protocol) sockets, and IP (Internet Protocol) addresses.
Internationalization: Help for writing programs that can be localized for users worldwide. Programs can automatically adapt to specific locales and be displayed in the appropriate language.
Security: Both low level and high level, including electronic signatures, public and private key management, access control, and certificates.
Software components: Known as JavaBeans^TM, can plug into existing component architectures.
Object serialization: Allows lightweight persistence and communication via Remote Method Invocation (RMI).
Java Database Connectivity (JDBC^TM): Provides uniform access to a wide range of relational databases.

The Java platform also has APIs for 2D and 3D graphics, accessibility, servers, collaboration, telephony, speech, animation, and more. The following figure depicts what is included in the Java 2 SDK.

6.4 HOW WILL JAVA TECHNOLOGY CHANGE MY LIFE?

Get started quickly: Although the Java programming language is a powerful object-oriented language, it’s easy to learn, especially for programmers already familiar with C or C++.
Write less code: Comparisons of program metrics (class counts, method counts, and so on) suggest that a program written in the Java programming language can be four times smaller than the same program in C++.
Write better code: The Java programming language encourages good coding practices, and its garbage collection helps you avoid memory leaks. Its object orientation, its JavaBeans component architecture, and its wide-ranging, easily extendible API let you reuse other people’s tested code and introduce fewer bugs.
Develop programs more quickly: Your development time may be as much as twice as fast versus writing the same program in C++. Why? You write fewer lines of code and it is a simpler programming language than C++.
Avoid platform dependencies with 100% Pure Java: You can keep your program portable by avoiding the use of libraries written in other languages. The 100% Pure Java^TMProduct Certification Program has a repository of historical process manuals, white papers, brochures, and similar materials online.
Write once, run anywhere: Because 100% Pure Java programs are compiled into machine-independent byte codes, they run consistently on any Java platform.
Distribute software more easily: You can upgrade applets easily from a central server. Applets take advantage of the feature of allowing new classes to be loaded “on the fly,” without recompiling the entire program.

6.5 ODBC:

6.6 JDBC:

JDBC was announced in March of 1996. It was released for a 90 day public review that ended June 8, 1996. Because of user input, the final JDBC v1.0 specification was released soon after.

6.7 JDBC Goals:

The goals that were set for JDBC are important. They will give you some insight as to why certain classes and functionalities behave the way they do. The eight design goals for JDBC are as follows:

SQL Level API

SQL Conformance

JDBC must be implemental on top of common database interfaces

Provide a Java interface that is consistent with the rest of the Java system

Because of Java’s acceptance in the user community thus far, the designers feel that they should not stray from the current design of the core Java system.

Keep it simple

Use strong, static typing wherever possible

Strong typing allows for more error checking to be done at compile time; also, less error appear at runtime.

Keep the common cases simple

Finally we decided to precede the implementation using Java Networking.

And for dynamically updating the cache table we go for MS Access database.

Java ha two things: a programming language and a platform.

Java is a high-level programming language that is all of the following

Simple Architecture-neutral

Object-oriented Portable

Distributed High-performance

Interpreted Multithreaded

Robust Dynamic Secure

Compilation happens just once; interpretation occurs each time the program is executed. The figure illustrates how this works.

Java Program

Compilers

Interpreter

My Program

6.7 NETWORKING TCP/IP STACK:

The TCP/IP stack is shorter than the OSI one:

TCP is a connection-oriented protocol; UDP (User Datagram Protocol) is a connectionless protocol.

IP datagram’s:

UDP:

UDP is also connectionless and unreliable. What it adds to IP is a checksum for the contents of the datagram and port numbers. These are used to give a client/server model – see later.

TCP:

TCP supplies logic to give a reliable connection-oriented protocol above IP. It provides a virtual circuit that two processes can use to communicate.

Internet addresses

In order to use a service, you must be able to find it. The Internet uses an address scheme for machines so that they can be located. The address is a 32 bit integer which gives the IP address.

Network address:

Class A uses 8 bits for the network address with 24 bits left over for other addressing. Class B uses 16 bit network addressing. Class C uses 24 bit network addressing and class D uses all 32.

Subnet address:

Internally, the UNIX network is divided into sub networks. Building 11 is currently on one sub network and uses 10-bit addressing, allowing 1024 different hosts.

Host address:

8 bits are finally used for host addresses within our subnet. This places a limit of 256 machines that can be on the subnet.

Total address:

The 32 bit address is usually written as 4 integers separated by dots.

Port addresses

Sockets:

#include <sys/types.h>

#include <sys/socket.h>

int socket(int family, int type, int protocol);

6.8 JFREE CHART:

JFreeChart is a free 100% Java chart library that makes it easy for developers to display professional quality charts in their applications. JFreeChart’s extensive feature set includes:

A consistent and well-documented API, supporting a wide range of chart types;

A flexible design that is easy to extend, and targets both server-side and client-side applications;

Support for many output types, including Swing components, image files (including PNG and JPEG), and vector graphics file formats (including PDF, EPS and SVG);

JFreeChart is “open source” or, more specifically, free software. It is distributed under the terms of the GNU Lesser General Public Licence (LGPL), which permits use in proprietary applications.

6.8.1. Map Visualizations:

6.8.2. Time Series Chart Interactivity

6.8.3. Dashboards

6.8.4. Property Editors

CHAPTER 7

7.0 APPENDIX

7.1 SAMPLE SCREEN SHOTS:

7.2 SAMPLE SOURCE CODE:

CHAPTER 8

8.1 CONCLUSION

In this paper, we study the joint optimization of intermediate data partition and aggregation in MapReduce to minimize network traffic cost for big data applications. We propose a three-layer model for this problem and formulate it as a mixed-integer nonlinear problem, which is then transferred into a linear form that can be solved by mathematical tools. To deal with the large-scale formulation due to big data, we design a distributed algorithm to solve the problem on multiple machines. Furthermore, we extend our algorithm to handle the MapReduce job in an online manner when some system parameters are not given. Finally, we conduct extensive simulations to evaluate our proposed algorithm under both offline cases and online cases. The simulation results demonstrate that our proposals can effectively reduce network traffic cost under various network settings.
CHAPTER 9

Network-Based Modeling and Intelligent Data Mining of Social Media for Improving Care

05/08/201902/07/2019 by admin

Abstract—Intelligently extracting knowledge from social mediahas recently attracted great interest from the Biomedical andHealth Informatics community to simultaneously improve healthcareoutcomes and reduce costs using consumer-generated opinion.We propose a two-step analysis framework that focuses on positiveand negative sentiment, as well as the side effects of treatment, inusers’ forum posts, and identifies user communities (modules) andinfluential users for the purpose of ascertaining user opinion ofcancer treatment. We used a self-organizing map to analyze wordfrequency data derived from users’ forum posts. We then introduceda novel network-based approach for modeling users’ foruminteractions and employed a network partitioning method based onoptimizing a stability qualitymeasure.This allowed us to determineconsumer opinion and identify influential users within the retrievedmodules using information derived frombothword-frequency dataand network-based properties. Our approach can expand researchinto intelligently mining social media data for consumer opinionof various treatments to provide rapid, up-to-date information forthe pharmaceutical industry, hospitals, and medical staff, on theeffectiveness (or ineffectiveness) of future treatments.Index Terms—Datamining, complex networks, neural networks,semantic web, social computing.I. INTRODUCTIONSOCIAL media is providing limitless opportunities for patientsto discuss their experiences with drugs and devices,and for companies to receive feedback on their products andservices [1]–[3]. Pharmaceutical companies are prioritizing socialnetwork monitoring within their IT departments, creatingan opportunity for rapid dissemination and feedback of productsand services to optimize and enhance delivery, increase turnoverand profit, and reduce costs [4]. Social media data harvestingfor bio-surveillance have also been reported [5].Social media enables a virtual networking environment.Modelingsocial media using available network modeling and computationaltools is one way of extracting knowledge and trendsfrom the information ‘cloud:’ a social network is a structuremade of nodes and edges that connect nodes in various relationships.Graphical representation is the most common methodto visually represent the information. Network modeling couldManuscript received January 24, 2014; revised May 4, 2014 and June 19,2014; accepted June 30, 2014. Date of publication July 10, 2014; date of currentversion December 30, 2014.A. Akay and B.-E. Erlandsson are with the School of Technology andHealth, Royal Institute of Technology, Stockholm SE-141 52, Sweden (e-mail:altu@kth.se; bjorn-erik.erlandsson@sth.kth.se).A. Dragomir, is with the Department of Biomedical Engineering, Universityof Houston, Houston, TX 77204–5060 USA (e-mail: adragomir@uh.edu).Color versions of one or more of the figures in this paper are available onlineat http://ieeexplore.ieee.org.Digital Object Identifier 10.1109/JBHI.2014.2336251also be used for studying the simulation of network propertiesand its internal dynamics.A sociomatrix can be used to construct representations ofa social network structure. Node degree, network density, andother large-scale parameters can derive information about theimportance of certain entities within the network. Such communitiesare clusters or modules. Specific algorithms can performnetwork-clustering, one of the fundamental tasks in networkanalysis. Detecting particular user communities requires identifyingspecific, networked nodes that will allow informationextraction. Healthcare providers could use patient opinion toimprove their services. Physicians could collect feedback fromother doctors and patients to improve their treatment recommendationsand results. Patients could use other consumers’knowledge in making better-informed healthcare decisions.The nature of social networks makes data collection difficult.Several methods have been employed, such as link mining [6],classification through links [7], predictions based on objects[8], links [9], existence [10], estimation [11], object [12], group[13], and subgroup detection [14], and mining the data [15],[16]. Link prediction, viral marketing, online discussion groups(and rankings) allow for the development of solutions based onuser feedback.Traditional social sciences use surveys and involve subjectsin the data collection process, resulting in small sample sizes perstudy.With social media, more content is readily available, particularlywhen combined with web-crawling and scraping softwarethat would allow real-time monitoring of changes withinthe network.Previous studies used technical solutions to extract user sentimenton influenza [17], technology stocks [18], context andsentence structure [19], online shopping [20], multiple classifications[21], government health monitoring [22], specific termsrelating to consumer satisfaction [23], polarity of newspaper articles[24], and assessment of user satisfaction from companies[25], [26]. Despite the extensive literature, none have identifiedinfluential users, and how forum relationships affect networkdynamics.In the first stage of our current study, we employ exploratoryanalysis using the self-organizing maps (SOMs) to assess correlationsbetween user posts and positive or negative opinionon the drug. In the second stage, we model the users and theirposts using a network-based approach. We build on our previousstudy [27] and use an enhanced method for identifying usercommunities (modules) and influential users therein. The currentapproach effectively searches for potential levels of organization(scales) within the networks and uncovers dense modules2168-2194 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See http://www.ieee.org/publications standards/publications/rights/index.html for more information.AKAY et al.: NETWORK-BASED MODELING AND INTELLIGENT DATA MINING OF SOCIAL MEDIA FOR IMPROVING CARE 211Fig. 1. Processing tree in Rapidminer to ascertain the TF-IDF scores of wordsin the datausing a partition stability quality measure [28]. The approach enablesus to find the optimal network partition. We subsequentlyenrich the retrieved modules with word frequency informationfrom module-contained users posts to derive local and globalmeasures of users opinion and raise flag on potential side effectsof Erlotinib, a drug used in the treatment of one of the mostprevalent cancers: lung cancer [29].II. METHODSA. Initial Data Search and CollectionWe first searched for the most popular cancer message boards.We initially focused on the number of posts on lung cancer. Thechart below gives the number of posts of lung cancer per forum:Forums Posts on Lung CancerCancer-forums.net 36 051cancerforums.net 34 328forums.stupidcancer.org 17csn.cancer.org/forum 7959We chose lung cancer because, according to the most recentstatistics, it is the most commonly diagnosed cancer in theworld for both sexes [30], and the second most prevalent cancerin the US between both sexes [31], [32]. We then compiled alist of drugs used by lung cancer patients to ascertain whichdrug was the most discussed in the forums. The drug Erlotinib(trade name Tarceva) was the most frequently discussed drugin the message boards. A further search revealed that Cancerforums.net, despite having slightly fewer posts on lung cancer, hadmore posts dedicated to Erlotinib than the other three messageboards mentioned above.Next, we performed a search of the drug, using both thetrade name (Tarceva) and drug name (Erlotinib). The trade namegarnered more results (498) compared to the drug name (66).The search using the trade name returned 920 posts, from 2009to the present date.B. Initial Text Mining and PreprocessingA Rapidminer (www.rapidminer.com) [33] data collectionand processing tree was developed to look for the most commonpositive and negative words, and their term-frequency-inversedocument frequency (TF-IDF) scores within each post. Fig. 1shows the data collection and processing tree. We initially uploadedthe data into the first component (‘Read Excel’). Theuploaded data was then processed in the second component(‘Process Documents to Data’) using several subcomponents(‘Extract Content’, ‘Tokenize’, ‘Transform Cases’, ‘Filter Stopwords’,‘Filter Tokens,’ respectively) that filtered excess noise(misspelled words, common stop words, etc.) to ensure a uniformset of variables that can be measured. The final component(‘Processed Data’) contained the final word list, with each wordcontaining a specific TF-IDF score.We then assigned weights for each of the words found in theuser posts using with the following formula:weighti,d_log tfi,d + 1) log nxt0if tft,d ≥ 1otherwisein which tfi,d represents the word frequency (t) in the document(d), n represents the number of documents within the entirecollection, and xt represents the number of documents where toccurs [30].C. Cataloging and Tagging Text DataText data containing the highest TF-IDF scores were taggedwith a modified NLTK toolkit (http://www.nltk.org/) [34] usingMATLAB to ensure that they reflected the negativity of a negativeword and the positivity of a positive word in context. Thisapproach was used before using negative tags on positive words[35]. We added a positive tag on negative words. We used theNLTK toolkit for the analysis, and classification, of words tomatch their exact meanings within the contextual settings. Forexample, the context should be considered in phrases such as ‘Ido not feel great’ so that the term ‘great’ would be adequatelytagged as a negative one (in our case it is tagged as ‘great_n’before it is returned to its specific position). Das and Chen useda similar approach in classifying words [18]. We went one stepfurther and considered positive tag on negative words. A sentencethat states ‘No side effects so I am happy!’ resulted in theword ‘No’ being tagged as ‘No_p’ (reflecting its positive context)before it is returned to its specific position. These taggedwords were thus reclassified based on the context of the post.We then reduced the number of similar words, both manually(checking the words using online dictionaries such asMerriam-Webster (http://www.merriam-webster.com/), and automatically(synonym database software such as the ThesaurusSynonym Database (http://www.language-databases.com/) andGoogle’s synonym search finder.Our finalwordlistwas pruned using the aforementioned methods,with the results displayed in Table I, with the division ofboth positive and negative words.We eliminated each word that appeared less than ten times.This allowed us to achieve a uniform set of measurements whileeliminating statistically insignificant outliers. The end resultwas a modified wordlist of 110 words (55 positive words and55 negative words) shown in Table I.In a parallel procedure, we automatically browsed the userposts to look for side effects of Erlotinib. To this goal, weused the National Library of Medicine’s Medical SubjectHeading (MeSH), which is a controlled vocabulary212 IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, VOL. 19, NO. 1, JANUARY 2015TABLE IFINAL POSTANALYSIS WORDLISTPositive NegativeAgree BadAppreciate CannotBeneficial ConcernBenefit ConcernsComfort DamageComfortable DangerousEase DepressionEasier DidnEffective DiedEnjoy DifficultFavorable DiscomfortFavorably Don’tFeasible DoubtGood ErrorGrateful FailureGreat FearGreater HardGreatest HasnGreatly HateHelp HurtHelped ImpossibleHelpful IsnHelping LackHelps LimitedHonest LoseHonestly LossHope LostHoped MissHopeful NastyHopefully NauseaHopes NegativeHoping NoImportance NotImportant PainImportantly PainfulImpresses PoorImprove ProblemImproved ProblemsImprovement SadImproves SacredImproving ScaryInspiration SevereLike SorryLove SucksLoved SufferPositive SufferingRight TerribleSuccess UnableSuccessful UnfortunatelySupport WasnThank WeakThanks WorriedUseful WorseWell WorstWonderful Wrong(http://www.nlm.nih.gov/mesh/) that consists of a hierarchy ofdescriptors and qualifiers that are used to annotate medicalterms. A custom designed program was used to map wordsin the forum to the MeSH database. A list of words present inforum posts that were associated to treatment side effects wasthus compiled. This was done by selecting the words simultaneouslyannotated with a specific list of qualifiers in MeSH (CI– chemically induced; CO – complications; DI – diagnosis; PA– pathology, and PP – physiopathology).We then compared theTABLE IIFINAL SIDE EFFECTS WORDLISTAcneCachexiaHeadachesItchingLesionPneumoniaRashTremorWeaknessVomitingFig 2. Thread model where nodes represent users/posts and the edges representinformation transferred among users.full list of side effect words with the results that were fed into theRapidminer processing tree: we kept the side effect words withthe highest TF-IDF scores (ensuring that each word appeared atleast ten times in the forum posts).Table II shows the final wordlist of the side effects. We subjectedthe initial side effect wordlist with the same methods thatwere used in Table I.After these preprocessing steps, our forum data was representedas two sets of vectors containing the TF-IDF scores ofthe words in the two wordlist. Namely, each user post in theforum was thus transformed into a vector of 110 variables representingthe TF-IDF scores of positive and negative words, anda 10 variable vector containing the TF-IDF scores correspondingto the side effect terms (see Fig. 2, steps A1-A3).D. Consumer Sentiment Using a SOMFor this part of the analysis, all posts were manually labeledaccording to the general user opinion observed within the postas positive and negative before feeding the collected data forexploratory analysis via SOMs. The manual labeling allowed usto use this as a method of results validation.SOMs are neural networks that produce low-dimensional representationof high-dimensional data [33]. Within this network,a layer represents output space with each neuron assigned a specificweight. The weight values reflect on the cluster content.The SOM displays the data to the network, bringing togethersimilar data weights to similar neurons.AKAY et al.: NETWORK-BASED MODELING AND INTELLIGENT DATA MINING OF SOCIAL MEDIA FOR IMPROVING CARE 213The benefits and capabilities have been demonstrated wheredespite the reduction of the space size, the information, andidentification schema of the clusters remained the same [36].When new data is fed into the network, the closest weightsmatching the data change to reflect the new data. The neuronsfarther from the new data rarely change. This process continuesuntil data is no longer fed, resulting in a two-dimensional map.The SOM toolbox (www.cis.hut.fi/projects/somtoolbox) [37]was used and the SOM was fed with our first wordlist (seeTable I) TF-IDF vectors. The purposewas to assess the existenceof clusters in the data and howtheSOMweights of these clusterswould correlate to positive and negative opinion. The SOMwas trained using various map sizes, using quantization andtopographic errors as validation measures. The former is theresult of the average distance between every input vector andits best matching neuron (BMN), in addition to measuring howthe trained map fits into the input data [33]. The latter uses thestructure of the map to preserve its topology by representingits accuracy: it is calculated using the proportion of the weightsfor the first and second BMNs are farther than required formeasuring the topology.The best map size was based on the minimum values of thequantization (0.24) and topographic (10−5) errors. The wordlistdata was mapped and the emerging weights were analyzed forpositive and negative variable correlations of thewordlist.Wordsof no interest, and groups containing three or fewer words, wereeliminated.Subgroups were visually identified and analyzed for furtherinformation on the consumer opinion of Erlotinib.E. Modeling Forum Postings Using Network AnalysisDiscovering influential users was the next step in our analysis.To this goal, we built networks from forum posts andtheir replies, while accounting for content-based grouping ofposts resulting from the existing forum threads. Networks arecomposed of nodes and their connections: they are either nondirectional(a connection between two nodes without a direction)or directional (a connection with an origin and an end). Thenodal degree of the latter measures the number of connectionsfrom the origin to the destination. Four node types have beenidentified [38] within a network: Isolated, transmitter, receptor,and carrier. The network’s density measures the current numberof connections.The network-based analysis is widely used in social networkanalysis based on its ability to both model and analyze intersocialdynamics. We devised a directional network model due tothe nature of the forum under scrutiny (multiple threads withmultiple thread initiators) and its internal dynamics among themembers (members reply to thread initiators as well as to otherusers). Fig. 2 describes the approach we chose to build ournetwork, which shows how each posting-reply pair is modeled.Based on the nature of the forum, all of the posters within eachthread are context posters for the thread initiator (e.g., Node 1 isthe thread initiator in Fig. 2 and Nodes 2, 3, 4, and 5 in representcontext posters). Thus, all of the posters receive an incomingedge from the thread initiator. Some context posters respondFig. 3. Diagram describing the framework of our network-based analysis.First, the posts collected from the forum via Rapidminer are preprocessed usingthe NTLK Toolbox (Step A1) and transformed into two wordlists (Step A2). Forthis step, direct mapping to the MeSH vocabulary is used to identify words representingside-effects Based on the two wordlists, forum posts are transformedinto numerical vectors containing word-frequency based TF-IDF scores (StepA3). In parallel, forum posts and replies aremodeled as a directed network (StepB1). Obtained network is further refined to identify communities/modules ofhighly interacting users, based on the MCSD method [28] (Step B2). Finally,the two wordlist vectors datasets (their info reflecting the forum informationcontent) are overlaid onto the network modules to identify influential users andhighlight side-effects intensively discussed within the modules, respectively(Step B3).directly to another poster, using the forum option ‘Reply.’ Weused bidirectional edges to reflect the ensuing information transferfrom the poster to the replier and vice versa (in Fig. 2, Node5 is a direct replier to Node 4, as is Node 3 to Node 2). Thisuser-interactionmodel allowed us to build a network that reflectsfaithfully the information content of the forum.F. Identifying SubgraphsOur modeling framework has consequently converted the forumposts into several large directional networks containing anumber of densely connected units (or modules) (see Fig. 3,step A1). These modules have the characteristic that they aremore densely connected internally (within the unit) than externally(outside the unit). We chose a multiscale method thatuses local and global criteria for identifying the modules, whilemaximizing a partition quality measure called stability [28].The stability measure considers the network as a Markovchain, with nodes representing states and edges being possibletransitions among these states. In [28], the authors proposed anapproach in which transition probabilities for a random walk oflength t (t being the Markov time) enable multiscale analysis.With increasing scale t, larger and larger modules are found.The stability of a walk of length t can be expressed asQMt =12m_i,j_Ati , j− didj2m_∗ δ (i, j) (1)where At is the adjacency matrix, t is the length of the network,m is the number of edges, i and j are nodes, di is node i’s (and j’s)strength, and δ (i,j) function becomes one if one of the nodesbelong to the same network and zero if it does not belong to214 IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, VOL. 19, NO. 1, JANUARY 2015any network. At is computed as follows (in order to accountfor the random walk): At = D ·Mt , where M = D−1 · A (Dbeing the diagonal matrix containing the degree vector givingfor each node its degree) [28].The method for identifying the optimal modules is based onalternating local and global criteria that expand modules byadding neighbor nodes, reassigning nodes to different modules,and significantly overlapping modules until no further optimizationis feasible, according to (1). The approach follows similarmethods presented in [28], [39], and [40].Several partitioning schemes were obtained pending on therange of scales employed by the method, with the optimal partitioninghaving the largest stability. We named the modules thusretrieved information modules (see step A2 in Fig. 3).G. Module Average Opinion and User Average OpinionWe then proceeded to refine the information modules throughfeeding them with the information obtained from the forumposts (using the wordlist vectors). In a first step, we aimedat identifying influential users within our networks. Influentialusers are users which broker most of the information transferwithin network modules and whose opinion in terms of positiveor negative sentiment towards the treatment is ‘spread’ tothe other users within their containing modules. To this goal,we enriched the information modules obtained as described inSection II-F with the TF-IDF scores of the user posts correspondingto the users found in each module. The TF-IDF scoresfrom the wordlist of positive and negative words (see Table I)were used to build two forms of measurement. The global measure(pertaining to the whole informationmodule) is representedby the module average opinion (MAO). It examined the TF-IDFscores of postings matching the nodes in a specific moduleMAO =Sum+ − Sum−Sumall.Sum+ =__xij is the total sum of the TF-IDF scoresmatching the positive words in the wordlist vectors within themodule. The units i represent post index. The unit j representsthe wordlist index (matching the positive words in the list).Sum− =__xij is the total sum of the TF-IDF scoresmatching the negative words in the wordlist vectors within themodule. The units i represent post index. The unit j representsthe wordlist index (matching the negative words in the list).Sumall =_Ni=1_Mk=1 xik is the sum of both of the aforementionedsums. The unit k is the index running across variablesthroughout the entire wordlist.The local measure that illustrates specific user opinion toeach node in the module (the user average opinion, or UAO)that examines the TF-IDF scores to the average of the collectedposts of the user is the following:UAOi =Sumi+ − Sumi−Sumiall.Sumi+ =_j∈P xij is the TF-IDF score’s sum matching topositive words for the ith user’s wordlist vector. P is the indexset denoting the wordlist’s positive variables.Fig. 4. U-Matrix of the posts from the Cancerforums.net forum.Sumi− =_j∈N xij is the TF-IDF score’s sum matching tonegative words for the ith user’s wordlist vector. N is the indexset denoting the wordlist’s negative variables.Sumall =_Mj=1 xij is the total of both sums, and j is theindex of the whole wordlist.H. Information Brokers Within the Information ModulesWe first ranked individual nodes in terms of their total numberof connecting edges (in and out-degree) to identify influentialusers within the modules.We then looked nodes in each module based on the followingcriteria:1) The nodes have densest degrees within the module (highestnumber of edges).2) The UAO scores equate the signs of the MAO of thecontaining module.The nodes that qualified were dubbed information brokers,based on the aforementioned criteria. Their large nodal degreesensure increased information transfer compared to other nodeswhile their matching UAO and MAO scores reflect consistencyof positive or negative opinion within the containing module.I. Network-Based Identification of Side EffectsIn the second step of our network-based analysis, we devised astrategy for identifying potential side effects occurring duringthe treatment and which user posts on the forum highlight. Tothis goal, we overlay the TF-IDF scores of the second wordlist(see Table II) onto modules obtained in Section II-F. The TFIDFscores within each module will thus directly reflect howfrequent a certain side-effect is mentioned in module posts.Subsequently, a statistical test (such as the t-test for example)can be used to compare the values of the TF-IDF scores withinthe module to those of the overall forum population and identifyvariables (side-effects) that have significantly higher scores.Fig. 3 presents a diagram that visually describes the steps inour network-based analysis.III. RESULTSFig. 4 shows the unified matrix resulting from the SOM analysisfor the wordlist vectors corresponding to the positive andnegative terms from the message board Cancerforums.net. Asubset consisting of 30% of the data was used for training theSOM. We used a 12 × 12 map size with 110 variables correspondingto the positive and negative terms to ascertain theAKAY et al.: NETWORK-BASED MODELING AND INTELLIGENT DATA MINING OF SOCIAL MEDIA FOR IMPROVING CARE 215TABLE IIIUSER OPINION OF ERLOTINIBSatisfaction Dissatisfaction70 percent 30 percentBREAKDOWN OF USER OPINIONFully Satisfied (23) Full Dissatisfaction (4)Satisfied Despite Side Effects (37) Dissatisfaction because of Side Effects (20)Satisfied Despite Costs (10) Dissatisfaction because of Costs (6)weight of the words corresponded to the opinion of the drugErlotinib. As mentioned in Section II, each word from the listappeared more than ten times. This achieved a uniform measurementset while eliminating statistically insignificant outliers.Much of the user’s posts converged on three areas of the map.We checked the respective nodes’ correlation with their weightvectors’ values corresponding to positive or negative words todefine the positive and negative areas of the map.The user opinion of Erlotinib was overall satisfactory, withTable III summarizing the satisfaction/dissatisfaction below:According to chart, and from our readings of both the userposts and the SOM, the most pressing concern from both campswas the side effects, which are extensively documented in themedical literature [41]–[46]. The costs of the drug were alsoanother matter of concern (albeit limited).We then proceeded to identify influential users. Our modelingapproach yielded initially a single loosely connected network,linking all users within the forum. Subsequent module identificationusing the methods described in Section II-F yielded anoptimal partitioning containing five densely connected module.We varied our scale parameter within the interval t _ [0,2] in0.1 increments, as suggested by [28]. Varying the scale parameterresulted in a set of partitions ranging from modules basedon single individual users (for scale parameter t = 0), to largemodules (for values of t close to the upper limit of the interval).The optimal partition (maximizing the quality measure in (1)was obtained for t = 1.On the Cancerforums.net message board, ten users out of the920 posts were identified as information brokers as shown inFig. 5(a)–(e) below.Densities of the retrieved modules range from 0.2 to 0.6.These density values were within the observed density valuesinterval (towards the upper limit), when compared to those generallynoted in social networks, thus confirming our networkmodeling approach [47], [48]. Information brokers were identifiedfollowing the procedure described in Sections II-G–H.Further scrutinizing these users and their containing moduleswe confirmed their connections were the densest. A thoroughreading of these ten users’ posts throughout the threads theystarted and participated in revealed that they were informativeand actively interacting with users across many threads. Othermembers sought out these ten posters for their wisdom andexperience. Their forum ‘behavior’ has confirmed to us thatthese users were the premier information brokers of the drugsErlotinib on the Cancerforums.net forum.Fig. 5. Ten users were identified as information brokers on the Cancerforums.net Forum. Modules in parts a)–e) show where these ten users reside inthe forum.216 IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, VOL. 19, NO. 1, JANUARY 2015TABLE IVSIDE-EFFECT FREQUENCY AND LOCATION IN SELECTED MODULESModule 1 (A) ‘rash’ (p − value < 0.01)‘itch’ (p − value < 0.05)Module 2 (B) ‘rash’ (p − value < 0.05)Module 5 (E) ‘rash’ (p − value < 0.01)In the last part of our analysis,we investigated whichmoduleswere significantly involved in discussing specific side effects.As described in Section II-I, retrieved modules were enrichedwith the TF-IDF scores corresponding to the side-effectwordlistvectors. For each module and each side-effect scores sample, ttestswereperformed to assess the significant difference betweenthe in-module sample and the overall forum population scores.Rash and itching were identified as the side effect terms withsignificantly higher scores in Modules 1, 2, and 5 when comparedto the overall scores population in the forum, as describedin Table IV. This reflects the fact that users grouped within thesemodules repeatedly discussed these side effects in their posts.This was confirmed by subsequent scrutiny of the respectiveposts. A literature search confirmed that rash and itching areindeed two of the most common side-effects of Erlotinib withas much as 70% of the patients affected, as indicated by clinicalstudies. [44]–[46]IV. DISCUSSIONWe converted a forum focused on oncology into weightedvectors to measure consumer thoughts on the drug Erlotinibusing positive and negative terms alongside another list containingthe side effects. Our methods were able to investigatepositive and negative sentiment on lung cancer treatment usingthe drug by mapping the large dimensional data onto a lowerdimensional space using the SOM. Most of the user data wasclustered to the area of themap linked to positive sentiment, thusreflecting the general positive view of the users. Subsequent networkbased modeling of the forum yielded interesting insightson the underlying information exchange among users. Modulesof strongly interacting users were identified using a multiscalecommunity detection method described in [28]. By overlayingthese modules with content-based information in the form ofword-frequency scores retrieved from user posts, we were ableto identify information brokers which seem to play importantroles in the shaping the information content of the forum. Additionally,we were able to identify potential side effects consistentlydiscussed by groups of users. Such an approach could beused to raise red flags in future clinical surveillance operations,as well as highlighting various other treatment related issues.The results have opened new possibilities into developing advancedsolutions, as well as revealing challenges in developingsuch solutions.The consensus on Erlotinib depends on individual patientexperience. Social media, by its nature, will bring different individualswith different experiences and viewpoints. We siftedthrough the data to find positive and negative sentiment, whichwas later confirmed by research that emerged regarding Erlotinib’seffectiveness and side effects. Future studies will requiremore up-to-date information for a clearer picture of userfeedback on drugs and services.Future solutions will require more advanced detection of intersocialdynamics and its effects on the members: such interestsof study may include rankings, ‘likes’ of posts, and friendships.Further emphasis on context posting will require formal languagedictionaries that include medical terms for specific diseases,and informal language terms (‘slang’) to clarify posts.Finally, different platforms will allow up-to-date informationon the status of the drug in case one social platform ceases todiscuss the drug. Another solution can look at multiple wordliststhat can include multiple treatments that, when combined withcontextual posting and medical lexical dictionaries, can pinpointthe source (or multiple sources) of user satisfaction (ordissatisfaction), which can open the door towards mapping consumersentiment of multidrug therapies for advanced diseases.The combined solutions can open newavenues of postmarketingsurveillance research as companies seek real-time, ‘intelligent’data of their products and services to remain competitive.This solution can be envisioned on future medical devicesthat can serve as postmarketing feedback loop that consumerscan use to express their satisfaction (or dissatisfaction) directlyto the company. The company benefits from real-time feedbackthat can then be used to assess if there are any problems andrapidly address such problems.Social media can open the door for the health care sector inaddress cost reduction, product and service optimization, andpatient care.

Neighbor Similarity Trust against Sybil Attack in P2P E-Commerce

05/08/201902/07/2019 by admin

In this paper, we present a distributed structured approach to Sybil attack. This is derived from the fact that our approach is based on the neighbor similarity trust relationship among the neighbor peers. Given a P2P e-commerce trust relationship based on interest, the transactions among peers are flexible as each peer can decide to trade with another peer any time. A peer doesn’t have to consult others in a group unless a recommendation is needed. This approach shows the advantage in exploiting the similarity trust relationship among peers in which the peers are able to monitor each other.

Our contribution in this paper is threefold:

1) We propose SybilTrust that can identify and protect honest peers from Sybil attack. The Sybil peers can have their trust canceled and dismissed from a group.

2) Based on the group infrastructure in P2P e-commerce, each neighbor is connected to the peers by the success of the transactions it makes or the trust evaluation level. A peer can only be recognized as a neighbor depending on whether or not trust level is sustained over a threshold value.

3) SybilTrust enables neighbor peers to carry recommendation identifiers among the peers in a group. This ensures that the group detection algorithms to identify Sybil attack peers to be efficient and scalable in large P2P e-commerce networks.

GOAL OF THE PROJECT:

The goal of trust systems is to ensure that honest peers are accurately identified as trustworthy and Sybil peers as untrustworthy. To unify terminology, we call all identities created by malicious users as Sybil peers. In a P2P e-commerce application scenario, most of the trust considerations depend on the historical factors of the peers. The influence of Sybil identities can be reduced based on the historical behavior and recommendations from other peers. For example, a peer can give positive a recommendation to a peer which is discovered is a Sybil or malicious peer. This can diminish the influence of Sybil identities hence reduce Sybil attack. A peer which has been giving dishonest recommendations will have its trust level reduced. In case it reaches a certain threshold level, the peer can be expelled from the group. Each peer has an identity, which is either honest or Sybil.

A Sybil identity can be an identity owned by a malicious user, or it can be a bribed/stolen identity, or it can be a fake identity obtained through a Sybil attack. These Sybil attack peers are employed to target honest peers and hence subvert the system. In Sybil attack, a single malicious user creates a large number of peer identities called sybils. These sybils are used to launch security attacks, both at the application level and at the overlay level, application level, sybils can target other honest peers while transacting with them, whereas at the overlay level, sybils can disrupt the services offered by the overlay layer like routing, data storage, lookup, etc. In trust systems, colluding Sybil peers may artificially increase a (malicious) peer’s rating (e.g., eBay).

1.2 INTRODUCTION:

P2P networks range from communication systems like email and instant messaging to collaborative content rating, recommendation, and delivery systems such as YouTube, Gnutela, Facebook, Digg, and BitTorrent. They allow any user to join the system easily at the expense of trust, with very little validation control. P2P overlay networks are known for their many desired attributes like openness, anonymity, decentralized nature, self-organization, scalability, and fault tolerance. Each peer plays the dual role of client as well as server, meaning that each has its own control. All the resources utilized in the P2P infrastructure are contributed by the peers themselves unlike traditional methods where a central authority control is used. Peers can collude and do all sorts of malicious activities in the open-access distributed systems. These malicious behaviors lead to service quality degradation and monetary loss among business partners. Peers are vulnerable to exploitation, due to the open and near-zero cost of creating new identities. The peer identities are then utilized to influence the behavior of the system.

However, if a single defective entity can present multiple identities, it can control a substantial fraction of the system, thereby undermining the redundancy. The number of identities that an attacker can generate depends on the attacker’s resources such as bandwidth, memory, and computational power. The goal of trust systems is to ensure that honest peers are accurately identified as trustworthy and Sybil peers as untrustworthy. To unify terminology, we call all identities created by malicious users as Sybil peers. In a P2P e-commerce application scenario, most of the trust considerations depend on the historical factors of the peers. The influence of Sybil identities can be reduced based on the historical behavior and recommendations from other peers. For example, a peer can give positive a recommendation to a peer which is discovered is a Sybil or malicious peer. This can diminish the influence of Sybil identities hence reduce Sybil attack. A peer which has been giving dishonest recommendations will have its trust level reduced. In case it reaches a certain threshold level, the peer can be expelled from the group.

Each peer has an identity, which is either honest or Sybil. A Sybil identity can be an identity owned by a malicious user, or it can be a bribed/stolen identity, or it can be a fake identity obtained through a Sybil attack. These Sybil attack peers are employed to target honest peers and hence subvert the system. In Sybil attack, a single malicious user creates a large number of peer identities called sybils. These sybils are used to launch security attacks, both at the application level and at the overlay level at the application level, sybils can target other honest peers while transacting with them, whereas at the overlay level, sybils can disrupt the services offered by the overlay layer like routing, data storage, lookup, etc. In trust systems, colluding Sybil peers may artificially increase a (malicious) peer’s rating (e.g., eBay). Systems like Credence rely on a trusted central authority to prevent maliciousness.

Defending against Sybil attack is quite a challenging task. A peer can pretend to be trusted with a hidden motive. The peer can pollute the system with bogus information, which interferes with genuine business transactions and functioning of the systems. This must be counter prevented to protect the honest peers. The link between an honest peer and a Sybil peer is known as an attack edge. As each edge involved resembles a human-established trust, it is difficult for the adversary to introduce an excessive number of attack edges. The only known promising defense against Sybil attack is to use social networks to perform user admission control and limit the number of bogus identities admitted to a system. The use of social networks between two peers represents real-world trust relationship between users. In addition, authentication-based mechanisms are used to verify the identities of the peers using shared encryption keys, or location information.

1.3 LITRATURE SURVEY:

KEEP YOUR FRIENDS CLOSE: INCORPORATING TRUST INTO SOCIAL NETWORK-BASED SYBIL DEFENSES

AUTHOR: A. Mohaisen, N. Hopper, and Y. Kim

PUBLISH: Proc. IEEE Int. Conf. Comput. Commun., 2011, pp. 1–9.

EXPLANATION:

Social network-based Sybil defenses exploit the algorithmic properties of social graphs to infer the extent to which an arbitrary node in such a graph should be trusted. However, these systems do not consider the different amounts of trust represented by different graphs, and different levels of trust between nodes, though trust is being a crucial requirement in these systems. For instance, co-authors in an academic collaboration graph are trusted in a different manner than social friends. Furthermore, some social friends are more trusted than others. However, previous designs for social network-based Sybil defenses have not considered the inherent trust properties of the graphs they use. In this paper we introduce several designs to tune the performance of Sybil defenses by accounting for differential trust in social graphs and modeling these trust values by biasing random walks performed on these graphs. Surprisingly, we find that the cost function, the required length of random walks to accept all honest nodes with overwhelming probability, is much greater in graphs with high trust values, such as co-author graphs, than in graphs with low trust values such as online social networks. We show that this behavior is due to the community structure in high-trust graphs, requiring longer walk to traverse multiple communities. Furthermore, we show that our proposed designs to account for trust, while increase the cost function of graphs with low trust value, decrease the advantage of attacker.

FOOTPRINT: DETECTING SYBIL ATTACKS IN URBAN VEHICULAR NETWORKS

AUTHOR: S. Chang, Y. Qi, H. Zhu, J. Zhao, and X. Shen

PUBLISH: IEEE Trans. Parallel Distrib. Syst., vol. 23, no. 6, pp. 1103–1114, Jun. 2012.

EXPLANATION:

In urban vehicular networks, where privacy, especially the location privacy of anonymous vehicles is highly concerned, anonymous verification of vehicles is indispensable. Consequently, an attacker who succeeds in forging multiple hostile identifies can easily launch a Sybil attack, gaining a disproportionately large influence. In this paper, we propose a novel Sybil attack detection mechanism, Footprint, using the trajectories of vehicles for identification while still preserving their location privacy. More specifically, when a vehicle approaches a road-side unit (RSU), it actively demands an authorized message from the RSU as the proof of the appearance time at this RSU. We design a location-hidden authorized message generation scheme for two objectives: first, RSU signatures on messages are signer ambiguous so that the RSU location information is concealed from the resulted authorized message; second, two authorized messages signed by the same RSU within the same given period of time (temporarily linkable) are recognizable so that they can be used for identification. With the temporal limitation on the linkability of two authorized messages, authorized messages used for long-term identification are prohibited. With this scheme, vehicles can generate a location-hidden trajectory for location-privacy-preserved identification by collecting a consecutive series of authorized messages. Utilizing social relationship among trajectories according to the similarity definition of two trajectories, Footprint can recognize and therefore dismiss “communities” of Sybil trajectories. Rigorous security analysis and extensive trace-driven simulations demonstrate the efficacy of Footprint.

SYBILLIMIT: A NEAROPTIMAL SOCIAL NETWORK DEFENSE AGAINST SYBIL ATTACK

AUTHOR: H. Yu, P. Gibbons, M. Kaminsky, and F. Xiao

PUBLISH: IEEE/ACM Trans. Netw., vol. 18, no. 3, pp. 3–17, Jun. 2010.

EXPLANATION:

Decentralized distributed systems such as peer-to-peer systems are particularly vulnerable to sybil attacks, where a malicious user pretends to have multiple identities (called sybil nodes). Without a trusted central authority, defending against sybil attacks is quite challenging. Among the small number of decentralized approaches, our recent SybilGuard protocol [H. Yu et al., 2006] leverages a key insight on social networks to bound the number of sybil nodes accepted. Although its direction is promising, SybilGuard can allow a large number of sybil nodes to be accepted. Furthermore, SybilGuard assumes that social networks are fast mixing, which has never been confirmed in the real world. This paper presents the novel SybilLimit protocol that leverages the same insight as SybilGuard but offers dramatically improved and near-optimal guarantees. The number of sybil nodes accepted is reduced by a factor of ominus(radicn), or around 200 times in our experiments for a million-node system. We further prove that SybilLimit’s guarantee is at most a log n factor away from optimal, when considering approaches based on fast-mixing social networks. Finally, based on three large-scale real-world social networks, we provide the first evidence that real-world social networks are indeed fast mixing. This validates the fundamental assumption behind SybilLimit’s and SybilGuard’s approach.

CHAPTER 2

2.0 SYSTEM ANALYSIS

2.1 EXISTING SYSTEM:

Existing work on Sybil attack makes use of social networks to eliminate Sybil attack, and the findings are based on preventing Sybil identities. In this paper, we propose the use of neighbor similarity trust in a group P2P ecommerce based on interest relationships, to eliminate maliciousness among the peers. This is referred to as SybilTrust. In SybilTrust, the interest based group infrastructure peers have a neighbor similarity trust between each other, hence they are able to prevent Sybil attack. SybilTrust gives a better relationship in e-commerce transactions as the peers create a link between peer neighbors. This provides an important avenue for peers to advertise their products to other interested peers and to know new market destinations and contacts as well. In addition, the group enables a peer to join P2P e-commerce network and makes identity more difficult.

Peers use self-certifying identifiers that are exchanged when they initially come into contact. These can be used as public keys to verify digital signatures on the messages sent by their neighbors. We note that, all communications between peers are digitally signed. In this kind of relationship, we use neighbors as our point of reference to address Sybil attack. In a group, whatever admission we set, there are honest, malicious, and Sybil peers who are authenticated by an admission control mechanism to join the group. More honest peers are admitted compared to malicious peers, where the trust association is aimed at positive results. The knowledge of the graph may reside in a single party, or be distributed across all users.

2.1.0 DISADVANTAGES:

Sybil peer trades with very few unsuccessful transactions, we can deduce the peer is a Sybil peer. This is supported by our approach which proposes peers existing in a group have six types of keys.

The keys which exist mostly are pairwise keys supported by the group keys. We also note if an honest group has a link with another group which has Sybil peers, the Sybil group tend to have information which is not complete.

Fake Users Enters Easy.
This makes Sybil attacks.

2.2 PROPOSED SYSTEM:

In this paper, we assume there are three kinds of peers in the system: legitimate peers, malicious peers, and Sybil peers. Each malicious peer cheats its neighbors by creating multiple identity, referred to as Sybil peers. In this paper, P2P e-commerce communities are in several groups. A group can be either open or restrictive depending on the interest of the peers. We investigate the peers belonging to a certain interest group. In each group, there is a group leader who is responsible for managing coordination of activities in a group.

The principal building block of Sybil Trust approach is the identifier distribution process. In the approach, all the peers with similar behavior in a group can be used as identifier source. They can send identifiers to others as the system regulates. If a peer sends less or more, the system can be having a Sybil attack peer. The information can be broadcast to the rest of the peers in a group. When peers join a group, they acquire different identities in reference to the group. Each peer has neighbors in the group and outside the group. Sybil attack peers forged by the same malicious peer have the same set of physical neighbors that a malicious peer has.

Each neighbor is connected to the peers by the success of the transaction it makes or the trust evaluation level. To detect the Sybil attack, where a peer can have different identity, a peer is evaluated in reference to its trustworthiness and the similarity to the neighbors. If the neighbors do not have same trust data as the concerned peer, including its position, it can be detected that the peer has multiple identity and is cheating

2.2.0 ADVANTAGES:

Our perception is that, the attacker controls a number of neighbor similarity peers, whereby a randomly chosen identifier source is relatively “far away” from most Sybil attack peer relationship. Every peer uses a “reversed” routing table. The source peer will always send some information to the peers which have neighbor similarity trust. However, if they do not reply, it can black list them. If they do reply and the source is overwhelmed by the overhead of such replies, then the adversary is effectively launching a DoS attack. Notice that the adversary can launch a DoS attack against the source. This enables two peers to propagate their public keys and IP addresses backward along the route to learn about the peers.

It is Helpful to find Sybil Attacks.
It is used to Find Fake UserID.
It is feasible to limit the number of attack edges in online social networks by relationship rating.

2.3 HARDWARE & SOFTWARE REQUIREMENTS:

2.3.0 HARDWARE REQUIREMENT:

v Processor – Pentium –IV

Speed – 1.1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 1.44 MB
- Key Board – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA

2.3.0 SOFTWARE REQUIREMENTS:

Operating System : Windows XP or Win7
Front End : JAVA JDK 1.7
Script : Java Script
Tools : Netbeans 7
Document : MS-Office 2007

CHAPTER 3

3.0 SYSTEM DESIGNS

3.1 ARCHITECTURE DIAGRAM:

3.2 DATAFLOW DIAGRAM:

LEVEL 0:

Neighbor Nodes

Source

LEVEL 1:

P2P Sybil Trust Mode

Send Data Request

LEVEL 2:

Data Receive

P2P ACK

Active Attack (Malicious Node)

Send Data Request

LEVEL 3:

3.3 UML DIAGRAMS

3.3.0 USECASE DIAGRAM:

SERVER CLIENT

3.3.1 CLASS DIAGRAM:

3.3.2 SEQUENCE DIAGRAM:

3.4 ACITVITY DIAGRAM:

CHAPTER 4

4.0 IMPLEMENTATION:

In this paper, P2P e-commerce communities are in several groups. A group can be either open or restrictive depending on the interest of the peers. We investigate the peers belonging to a certain interest group. In each group, there is a group leader who is responsible for managing coordination of activities in a group peers join a group; they acquire different identities in reference to the group. Each peer has neighbors in the group and outside the group. Sybil attack peers forged by the same malicious peer have the same set of physical neighbors that a malicious peer has. Each neighbor is connected to the peers by the success of the transaction it makes or the trust evaluation level. To detect the Sybil attack, where a peer can have different identity, a peer is evaluated in reference to its trustworthiness and the similarity to the neighbors. If the neighbors do not have same trust data as the concerned peer, including its position, it can be detected that the peer has multiple identity and is cheating. The method of detection of Sybil attack is depicted in Fig. 2. A1 and A2 refer to the same peer but with different identities.

Our approach, the identifiers are only propagated by the peers who exhibit neighbor similarity trust. Our perception is that, the attacker controls a number of neighbor similarity peers, whereby a randomly chosen identifier source is relatively “far away” from most Sybil attack peer relationship. Every peer uses a “reversed” routing table. The source peer will always send some information to the peers which have neighbor similarity trust. However, if they do not reply, it can black list them. If they do reply and the source is overwhelmed by the overhead of such replies, then the adversary is effectively launching a DoS attack. Notice that the adversary can launch a DoS attack against the source. This enables two peers to propagate their public keys and IP addresses backward along the route to learn about the peers. SybilTrust proposes that an honest peer should not have an excessive number of neighbors. The neighbors we refer should be member peers existing in a group. The restriction helps to bind the number of peers against any additional attack among the neighbors. If there are too many neighbors, SybilTrust will (internally) only use a subset of the peer’s edges while ignoring all others. Following Liben-Nowell and Kleinberg, we define the attributes of the given pair of peers as the intersection of the sets of similar products.

4.1 MODULES:

SIMILARITY TRUST RELATIONSHIP:

NEIGHBOR SIMILARITY TRUST:

DETECTION OF SYBIL ATTACK:

SECURITY AND PERFORMANCE:

4.2 MODULES DESCRIPTION:

SIMILARITY TRUST RELATIONSHIP:

We focus on the active attacks in P2P e-commerce. When a peer is compromised, all the information will be extracted. In our work, we have proposed use of SybilTrust which is based on neighbor similarity relationship of the peers. SybilTrust is efficient and scalable to group P2P e-commerce network. Sybil attack peers may attempt to compromise the edges or the peers of the group P2P e-commerce. The Sybil attack peers can execute further malicious actions in the network. The threat being addressed is the identity active attacks as peers are continuously doing the transactions in the peers to show that each controller only admitted the honest peers.

Our method makes assumptions that the controller undergoes synchronization to prove whether the peers which acted as distributor of identifiers had similarityor not. If a peer never had similarity, the peer is assumed to have been a Sybil attack peer. Pairing method is used to generate an expander graph with expansion factor of high probability. Every pair of neighbor peers share a unique symmetric secret key (the edge key), established out of band for authenticating each other peers may deliberately cause Byzantine faults in which their multiple identity and incorrect behavior ends up undetected.

The Sybil attack peers can create more non-existent links. The protocols and services for P2P, such as routing protocols must operate efficiently regardless of the group size. In the neighbor similarity trust, peers must have a self-healing in order to recover automatically from any state. Sybil attack can defeat replication and fragmentation performed in distributed hash tables. Geographic routing in P2P can also be a routing mechanism which can be compromised by Sybil peers.

NEIGHBOR SIMILARITY TRUST:

We present a Sybil identification algorithm that takes place in a neighbor similarity trust. The directed graph has edges and vertices. In our work, we assume V is the set of peers and E is the set of edges. The edges in a neighbor similarity have attack edges which are safeguarded from Sybil attacks. A peer u and a Sybil peer v can trade whether one is Sybil or not. Being in a group, comparison can be done to determine the number of peers which trade with peer. If the peer trades with very few unsuccessful transactions, we can deduce the peer is a Sybil peer. This is supported by our approach which proposes a peer existing in a group has six types of keys. The keys which exist mostly are pairwise keys supported by the group keys. We also note if an honest group has a link with another group which has Sybil peers, the Sybil group tend to have information which is not complete. Our algorithm adaptively tests the suspected peer while maintaining the neighbor similarity trust connection based on time.

DETECTION OF SYBIL ATTACK:

Sybil attack, a malicious peer must try to present multiple distinct identities. This can be achieved by either generating legal identities or by impersonating other normal peers. Some peers may launch arbitrary attacks to interfere with P2P e-commerce operations, or the normal functioning of the network. According to an attack can succeed to launch a Sybil attack by:

_ Heterogeneous configuration: in this case, malicious peers can have more communication and computation resources than the honest peers.

_ Message manipulation: the attacker can eavesdrop on nearby communications with other parties. This means a attacker gets and interpolates information needed to impersonate others. Major attacks in P2P e-commerce can be classified as passive and active attacks.

_ Passive attack: It listens to incoming and outgoing messages, in order to infer the relevant information from the transmitted recommendations, i.e., eavesdropping, but doesn’t harm the system. A peer can be in passive mode and later in active mode.

_ Active attack: When a malicious peer receives a recommendation for forwarding, it can modify, or when requested to provide recommendations on another peer, it can inflate or bad mouth. The bad mouthing is a situation where a malicious peer may collude with other malicious peers to revenge the honest peer. In the Sybil attack, a malicious peer generates a large number of identities and uses them together to disrupt normal operation.

SECURITY AND PERFORMANCE:

We evaluate the performance of the proposed SybilTrust. We measure two metrics, namely, non-trustworthy rate and detection rate. Non-trustworthy rate is the ratio of the number of honest peers which are erroneously marked as Sybil/malicious peer to the number of total honest peers. Detection rate is the proportion of detected Sybil/ malicious peers to the total Sybil/malicious peers. Communication Cost. The trust level is sent with the recommendation feedback from one peer to another. If a peer is compromised, the information is broadcasted to all peers as revocation of the trust level is being done. Computation Cost. The sybilTrust approach is efficient in the computation of polynomial evaluation. The calculation of the trust level evaluation is based on a pseudo-random function (PRF). PRF is a deterministic function.

In our simulation, we use C# .NET tool. Each honest and malicious peer interacted with a random number of peers defined by a uniform distribution. All the peers are restricted to the group. Our approach, P2P e-commerce community has a total of 3 different categories of interest. The transaction interactions between peers with similar interest can be defined as successful or unsuccessful, expressed as positive or negative respectively. The impact of the first two parameters on performance of the mechanism is evaluated in the percentage of malicious peers replied is randomly chosen by each malicious peer. Transactions with 10 to 40 percent malicious peers are done.

Our SybilTrust approach detects more malicious peers compared to Eigen Trust and Eigen Group Trust [26] as shown in Fig. 4. Fig. 4. shows the detection rates of the P2P when the number of malicious peers increases. When the number of deployed peers is small, e.g., 40 peers, the chance that no peers are around a malicious peer is high. Fig. 4 illustrates the variation of non-trustworthy rates of different numbers of honest peers as the number of malicious peer increases. It is shown that the non-trustworthy rate increases as the number of honest peers and malicious peers increase. The reason is that when there are more malicious peers, the number of target groups is larger. Moreover, this is because neighbor relationship is used to categorize peers in the

We proposed approach. The number of target-groups also increases when the number of honest peers is higher. As a result, the honest peers are examined more times, and the chance that an honest peer is erroneously determined as a Sybil/malicious peer increases, although more Sybil attack peer can also be identified. Fig. 4 displays the detection rate when the reply rate of each malicious peer is the same. The detection rate does not decrease when the reply rate is more than 80 percent, because of the enhancement.

The enhancement could still be found even when a malicious peer replies to almost all of its Sybil attack peer requests. Furthermore, the detection rate is higher as the number of malicious peers becomes more, which means the proposed mechanism is able to resist the Sybil attack from more malicious peers. The detection rate is still more than 80 percent in the sparse network, which according to the definition of a sparse network detection rate reaches 95 percent when the number of legitimate nodes is 300. It is also because the number of target groups increases as the number of malicious peer’s increases and the honest peers are examined more times. Therefore, the rate that an honest peer is erroneously identified as a Sybil/malicious peer also increases.

CHAPTER 5

5.0 SYSTEM STUDY:

5.1 FEASIBILITY STUDY:

Three key considerations involved in the feasibility analysis are

ECONOMICAL FEASIBILITY
TECHNICAL FEASIBILITY
SOCIAL FEASIBILITY

5.1.1 ECONOMICAL FEASIBILITY:

5.1.2 TECHNICAL FEASIBILITY:

5.1.3 SOCIAL FEASIBILITY:

5.2 SYSTEM TESTING:

Testing is a process of checking whether the developed system is working according to the original objectives and requirements. It is a set of activities that can be planned in advance and conducted systematically. Testing is vital to the success of the system. System testing makes a logical assumption that if all the parts of the system are correct, the global will be successfully achieved. In adequate testing if not testing leads to errors that may not appear even many months. This creates two problems, the time lag between the cause and the appearance of the problem and the effect of the system errors on the files and records within the system. A small system error can conceivably explode into a much larger Problem. Effective testing early in the purpose translates directly into long term cost savings from a reduced number of errors. Another reason for system testing is its utility, as a user-oriented vehicle before implementation. The best programs are worthless if it produces the correct outputs.

5.2.1 UNIT TESTING:

UNIT TESTING:

Description	Expected result
Test for application window properties.	All the properties of the windows are to be properly aligned and displayed.
Test for mouse operations.	All the mouse operations like click, drag, etc. must perform the necessary operations without any exceptions.

5.1.3 FUNCTIONAL TESTING:

FUNCTIONAL TESTING:

Description	Expected result
Test for all modules.	All peers should communicate in the group.
Test for various peer in a distributed network framework as it display all users available in the group.	The result after execution should give the accurate result.

5.1. 4 NON-FUNCTIONAL TESTING:

Load testing
Performance testing
Usability testing
Reliability testing
Security testing

5.1.5 LOAD TESTING:

Load Testing

Description	Expected result
It is necessary to ascertain that the application behaves correctly under loads when ‘Server busy’ response is received.	Should designate another active node as a Server.

5.1.5 PERFORMANCE TESTING:

PERFORMANCE TESTING:

Description	Expected result
This is required to assure that an application perforce adequately, having the capability to handle many peers, delivering its results in expected time and using an acceptable level of resource and it is an aspect of operational management.	Should handle large input values, and produce accurate result in a expected time.

5.1.6 RELIABILITY TESTING:

RELIABILITY TESTING:

Description	Expected result
This is to check that the server is rugged and reliable and can handle the failure of any of the components involved in provide the application.	In case of failure of the server an alternate server should take over the job.

5.1.7 SECURITY TESTING:

SECURITY TESTING:

Description	Expected result
Checking that the user identification is authenticated.	In case failure it should not be connected in the framework.
Check whether group keys in a tree are shared by all peers.	The peers should know group key in the same group.

5.1.7 WHITE BOX TESTING:

5.1.8 WHITE BOX TESTING:

Description	Expected result
Exercise all logical decisions on their true and false sides.	All the logical decisions must be valid.
Execute all loops at their boundaries and within their operational bounds.	All the loops must be finite.
Exercise internal data structures to ensure their validity.	All the data structures must be valid.

5.1.9 BLACK BOX TESTING:

5.1.10 BLACK BOX TESTING:

Description	Expected result
To check for incorrect or missing functions.	All the functions must be valid.
To check for interface errors.	The entire interface must function normally.
To check for errors in a data structures or external data base access.	The database updation and retrieval must be done.
To check for initialization and termination errors.	All the functions and data structures must be initialized properly and terminated normally.

All the above system testing strategies are carried out in as the development, documentation and institutionalization of the proposed goals and related policies is essential.

CHAPTER 7

7.0 SOFTWARE SPECIFICATION:

7.1 FEATURES OF .NET:

Microsoft .NET is a set of Microsoft software technologies for rapidly building and integrating XML Web services, Microsoft Windows-based applications, and Web solutions. The .NET Framework is a language-neutral platform for writing programs that can easily and securely interoperate. There’s no language barrier with .NET: there are numerous languages available to the developer including Managed C++, C#, Visual Basic and Java Script.

The .NET framework provides the foundation for components to interact seamlessly, whether locally or remotely on different platforms. It standardizes common data types and communications protocols so that components created in different languages can easily interoperate.

“.NET” is also the collective name given to various software components built upon the .NET platform. These will be both products (Visual Studio.NET and Windows.NET Server, for instance) and services (like Passport, .NET My Services, and so on).

7.2 THE .NET FRAMEWORK

The .NET Framework has two main parts:

1. The Common Language Runtime (CLR).

2. A hierarchical set of class libraries.

The CLR is described as the “execution engine” of .NET. It provides the environment within which programs run. The most important features are

Conversion from a low-level assembler-style language, called Intermediate Language (IL), into code native to the platform being executed on.
Memory management, notably including garbage collection.
Checking and enforcing security restrictions on the running code.
Loading and executing programs, with version control and other such features.
The following features of the .NET framework are also worth description:

Managed Code

The code that targets .NET, and which contains certain extra Information – “metadata” – to describe itself. Whilst both managed and unmanaged code can run in the runtime, only managed code contains the information that allows the CLR to guarantee, for instance, safe execution and interoperability.

Managed Data

With Managed Code comes Managed Data. CLR provides memory allocation and Deal location facilities, and garbage collection. Some .NET languages use Managed Data by default, such as C#, Visual Basic.NET and JScript.NET, whereas others, namely C++, do not. Targeting CLR can, depending on the language you’re using, impose certain constraints on the features available. As with managed and unmanaged code, one can have both managed and unmanaged data in .NET applications – data that doesn’t get garbage collected but instead is looked after by unmanaged code.

Common Type System

The CLR uses something called the Common Type System (CTS) to strictly enforce type-safety. This ensures that all classes are compatible with each other, by describing types in a common way. CTS define how types work within the runtime, which enables types in one language to interoperate with types in another language, including cross-language exception handling. As well as ensuring that types are only used in appropriate ways, the runtime also ensures that code doesn’t attempt to access memory that hasn’t been allocated to it.

Common Language Specification

The CLR provides built-in support for language interoperability. To ensure that you can develop managed code that can be fully used by developers using any programming language, a set of language features and rules for using them called the Common Language Specification (CLS) has been defined. Components that follow these rules and expose only CLS features are considered CLS-compliant.

7.3 THE CLASS LIBRARY

.NET provides a single-rooted hierarchy of classes, containing over 7000 types. The root of the namespace is called System; this contains basic types like Byte, Double, Boolean, and String, as well as Object. All objects derive from System. Object. As well as objects, there are value types. Value types can be allocated on the stack, which can provide useful flexibility. There are also efficient means of converting value types to object types if and when necessary.

The set of classes is pretty comprehensive, providing collections, file, screen, and network I/O, threading, and so on, as well as XML and database connectivity.

The class library is subdivided into a number of sets (or namespaces), each providing distinct areas of functionality, with dependencies between the namespaces kept to a minimum.

7.4 LANGUAGES SUPPORTED BY .NET

The multi-language capability of the .NET Framework and Visual Studio .NET enables developers to use their existing programming skills to build all types of applications and XML Web services. The .NET framework supports new versions of Microsoft’s old favorites Visual Basic and C++ (as VB.NET and Managed C++), but there are also a number of new additions to the family.

Visual Basic .NET has been updated to include many new and improved language features that make it a powerful object-oriented programming language. These features include inheritance, interfaces, and overloading, among others. Visual Basic also now supports structured exception handling, custom attributes and also supports multi-threading.

Visual Basic .NET is also CLS compliant, which means that any CLS-compliant language can use the classes, objects, and components you create in Visual Basic .NET.

Managed Extensions for C++ and attributed programming are just some of the enhancements made to the C++ language. Managed Extensions simplify the task of migrating existing C++ applications to the new .NET Framework.

C# is Microsoft’s new language. It’s a C-style language that is essentially “C++ for Rapid Application Development”. Unlike other languages, its specification is just the grammar of the language. It has no standard library of its own, and instead has been designed with the intention of using the .NET libraries as its own.

Microsoft Visual J# .NET provides the easiest transition for Java-language developers into the world of XML Web Services and dramatically improves the interoperability of Java-language programs with existing software written in a variety of other programming languages.

Active State has created Visual Perl and Visual Python, which enable .NET-aware applications to be built in either Perl or Python. Both products can be integrated into the Visual Studio .NET environment. Visual Perl includes support for Active State’s Perl Dev Kit.

Other languages for which .NET compilers are available include

FORTRAN
COBOL
Eiffel

ASP.NET XML WEB SERVICES	Windows Forms
Base Class Libraries
Common Language Runtime
Operating System

Fig1 .Net Framework

C#.NET is also compliant with CLS (Common Language Specification) and supports structured exception handling. CLS is set of rules and constructs that are supported by the CLR (Common Language Runtime). CLR is the runtime environment provided by the .NET Framework; it manages the execution of the code and also makes the development process easier by providing services.

C#.NET is a CLS-compliant language. Any objects, classes, or components that created in C#.NET can be used in any other CLS-compliant language. In addition, we can use objects, classes, and components created in other CLS-compliant languages in C#.NET .The use of CLS ensures complete interoperability among applications, regardless of the languages used to create the application.

CONSTRUCTORS AND DESTRUCTORS:

Constructors are used to initialize objects, whereas destructors are used to destroy them. In other words, destructors are used to release the resources allocated to the object. In C#.NET the sub finalize procedure is available. The sub finalize procedure is used to complete the tasks that must be performed when an object is destroyed. The sub finalize procedure is called automatically when an object is destroyed. In addition, the sub finalize procedure can be called only from the class it belongs to or from derived classes.

GARBAGE COLLECTION

Garbage Collection is another new feature in C#.NET. The .NET Framework monitors allocated resources, such as objects and variables. In addition, the .NET Framework automatically releases memory for reuse by destroying objects that are no longer in use.

In C#.NET, the garbage collector checks for the objects that are not currently in use by applications. When the garbage collector comes across an object that is marked for garbage collection, it releases the memory occupied by the object.

OVERLOADING

Overloading is another feature in C#. Overloading enables us to define multiple procedures with the same name, where each procedure has a different set of arguments. Besides using overloading for procedures, we can use it for constructors and properties in a class.

MULTITHREADING:

C#.NET also supports multithreading. An application that supports multithreading can handle multiple tasks simultaneously, we can use multithreading to decrease the time taken by an application to respond to user interaction.

STRUCTURED EXCEPTION HANDLING

C#.NET supports structured handling, which enables us to detect and remove errors at runtime. In C#.NET, we need to use Try…Catch…Finally statements to create exception handlers. Using Try…Catch…Finally statements, we can create robust and effective exception handlers to improve the performance of our application.

7.5 THE .NET FRAMEWORK

The .NET Framework is a new computing platform that simplifies application development in the highly distributed environment of the Internet.

OBJECTIVES OF .NET FRAMEWORK

1. To provide a consistent object-oriented programming environment whether object codes is stored and executed locally on Internet-distributed, or executed remotely.

2. To provide a code-execution environment to minimizes software deployment and guarantees safe execution of code.

3. Eliminates the performance problems.

There are different types of application, such as Windows-based applications and Web-based applications.

7.6 FEATURES OF SQL-SERVER

The OLAP Services feature available in SQL Server version 7.0 is now called SQL Server 2000 Analysis Services. The term OLAP Services has been replaced with the term Analysis Services. Analysis Services also includes a new data mining component. The Repository component available in SQL Server version 7.0 is now called Microsoft SQL Server 2000 Meta Data Services. References to the component now use the term Meta Data Services. The term repository is used only in reference to the repository engine within Meta Data Services

SQL-SERVER database consist of six type of objects,

They are,

1. TABLE

2. QUERY

3. FORM

4. REPORT

5. MACRO

7.7 TABLE:

A database is a collection of data about a specific topic.

VIEWS OF TABLE:

We can work with a table in two types,

1. Design View

2. Datasheet View

Design View

To build or modify the structure of a table we work in the table design view. We can specify what kind of data will be hold.

Datasheet View

To add, edit or analyses the data itself we work in tables datasheet view mode.

QUERY:

A query is a question that has to be asked the data. Access gathers data that answers the question from one or more table. The data that make up the answer is either dynaset (if you edit it) or a snapshot (it cannot be edited).Each time we run query, we get latest information in the dynaset. Access either displays the dynaset or snapshot for us to view or perform an action on it, such as deleting or updating.

CHAPTER 7

APPENDIX

7.1 SAMPLE SOURCE CODE

7.2 SAMPLE OUTPUT

CHAPTER 8

8.0 CONCLUSION AND FUTURE:

We presented SybilTrust, a defense against Sybil attack in P2P e-commerce. Compared to other approaches, our approach is based on neighborhood similarity trust in a group P2P e-commerce community. This approach exploits the relationship between peers in a neighborhood setting. Our results on real-world P2P e-commerce confirmed fastmixing property hence validated the fundamental assumption behind SybilGuard’s approach. We also describe defense types such as key validation, distribution, and position verification. This method can be done at in simultaneously with neighbor similarity trust which gives better defense mechanism. For the future work, we intend to implement SybilTrust within the context of peers which exist in many groups. Neighbor similarity trust helps to weed out the Sybil peers and isolate maliciousness to specific Sybil peer groups rather than allow attack in honest groups with all honest peers.