Real-Time Detection of Traffic From Twitter Stream Analysis

Real-Time Detection of Traffic FromTwitter Stream AnalysisAbstract—Social networks have been recently employed as asource of information for event detection, with particular referenceto road traffic congestion and car accidents. In this paper, wepresent a real-time monitoring system for traffic event detectionfrom Twitter stream analysis. The system fetches tweets fromTwitter according to several search criteria; processes tweets, byapplying text mining techniques; and finally performs the classificationof tweets. The aim is to assign the appropriate class label toeach tweet, as related to a traffic event or not. The traffic detectionsystem was employed for real-time monitoring of several areas ofthe Italian road network, allowing for detection of traffic eventsalmost in real time, often before online traffic news web sites. Weemployed the support vector machine as a classification model,and we achieved an accuracy value of 95.75% by solving a binaryclassification problem (traffic versus nontraffic tweets). We werealso able to discriminate if traffic is caused by an external event ornot, by solving a multiclass classification problem and obtainingan accuracy value of 88.89%.Index Terms—Traffic event detection, tweet classification, textmining, social sensing.I. INTRODUCTIONSOCIAL network sites, also called micro-blogging services(e.g., Twitter, Facebook, Google+), have spread in recentyears, becoming a new kind of real-time information channel.Their popularity stems from the characteristics of portabilitythanks to several social networks applications for smartphonesand tablets, easiness of use, and real-time nature [1], [2]. Peopleintensely use social networks to report (personal or public) reallifeevents happening around them or simply to express theiropinion on a given topic, through a public message. Socialnetworks allow people to create an identity and let them shareit in order to build a community. The resulting social networkis then a basis for maintaining social relationships, findingManuscript received July 2, 2014; revised October 7, 2014 and December 16,2014; accepted February 10, 2015. Date of publication March 10, 2015; date ofcurrent version July 31, 2015. This work was carried out in the frameworkof and was supported by the SMARTY project, funded by “ProgrammaOperativo Regionale (POR) 2007–2013”—objective “Competitività regionalee occupazione” of the Tuscany Region. The Associate Editor for this paper wasQ. Zhang.E. D’Andrea is with the Research Center “E. Piaggio,” University of Pisa,56122 Pisa, Italy (e-mail: eleonora.dandrea@for.unipi.it).P. Ducange is with the Faculty of Engineering, eCampus University, 22060Novedrate, Italy (e-mail: pietro.ducange@uniecampus.it).B. Lazzerini and F. Marcelloni are with the Dipartimento di Ingegneriadell’Informazione, University of Pisa, 56122 Pisa, Italy (e-mail: b.lazzerini@iet.unipi.it; f.marcelloni@iet.unipi.it).Color versions of one or more of the figures in this paper are available onlineat http://ieeexplore.ieee.org.Digital Object Identifier 10.1109/TITS.2015.2404431users with similar interests, and locating content and knowledgeentered by other users [3].The user message shared in social networks is called StatusUpdate Message (SUM), and it may contain, apart from thetext, meta-information such as timestamp, geographic coordinates(latitude and longitude), name of the user, links to otherresources, hashtags, and mentions. Several SUMs referring toa certain topic or related to a limited geographic area may provide,if correctly analyzed, great deal of valuable informationabout an event or a topic. In fact, we may regard social networkusers as social sensors [4], [5], and SUMs as sensor information[6], as it happens with traditional sensors.Recently, social networks and media platforms have beenwidely used as a source of information for the detection ofevents, such as traffic congestion, incidents, natural disasters(earthquakes, storms, fires, etc.), or other events. An eventcan be defined as a real-world occurrence that happens in aspecific time and space [1], [7]. In particular, regarding trafficrelatedevents, people often share by means of an SUM informationabout the current traffic situation around them whiledriving. For this reason, event detection from social networksis also often employed with Intelligent Transportation Systems(ITSs). An ITS is an infrastructure which, by integrating ICTs(Information and Communication Technologies) with transportnetworks, vehicles and users, allows improving safety and managementof transport networks. ITSs provide, e.g., real-timeinformation about weather, traffic congestion or regulation, orplan efficient (e.g., shortest, fast driving, least polluting) routes[4], [6], [8]–[14].However, event detection from social networks analysis isa more challenging problem than event detection from traditionalmedia like blogs, emails, etc., where texts are wellformatted[2]. In fact, SUMs are unstructured and irregulartexts, they contain informal or abbreviated words, misspellingsor grammatical errors [1]. Due to their nature, they are usuallyvery brief, thus becoming an incomplete source of information[2]. Furthermore, SUMs contain a huge amount of not usefulor meaningless information [15], which has to be filtered.According to Pear Analytics,1 it has been estimated that over40% of all Twitter2 SUMs (i.e., tweets) is pointless with nouseful information for the audience, as they refer to the personalsphere [16]. For all of these reasons, in order to analyze theinformation coming from social networks, we exploit text miningtechniques [17], which employ methods from the fields of1http://www.pearanalytics.com/, 2009.2https://twitter.com.1524-9050 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.2270 IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 16, NO. 4, AUGUST 2015data mining, machine learning, statistics, and Natural LanguageProcessing (NLP) to extract meaningful information [18].More in detail, text mining refers to the process of automaticextraction of meaningful information and knowledge from unstructuredtext. The main difficulty encountered in dealing withproblems of text mining is caused by the vagueness of naturallanguage. In fact, people, unlike computers, are perfectly able tounderstand idioms, grammatical variations, slang expressions,or to contextualize a given word. On the contrary, computershave the ability, lacking in humans, to quickly process largeamounts of information [19], [20].The text mining process is summarized in the following.First, the information content of the document is convertedinto a structured form (vector space representation). In fact,most of text mining techniques are based on the idea that adocument can be faithfully represented by the set of wordscontained in it (bag-of-words representation [21]). Accordingto this representation, each document j of a collection ofdocuments is represented as an M-dimensional vector Vj ={w(tj1), . . . , w(tji), . . . , w(tjM)}, where M is the number ofwords defined in the document collection, and w(tji) specifiesthe weight of the word ti in document j. The simplest weightingmethod assigns a binary value to w(tji), thus indicating theabsence or the presence of the word ti, while other methodsassign a real value to w(tji). During the text mining process,several operations can be performed [21], depending on the specificgoal, such as: i) linguistic analysis through the applicationof NLP techniques, indexing and statistical techniques, ii) textfiltering by means of specific keywords, iii) feature extraction,i.e., conversion of textual features (e.g., words) in numericfeatures (e.g., weights), that a machine learning algorithm isable to process, and iv) feature selection, i.e., reduction of thenumber of features in order to take into account only the mostrelevant ones. The feature selection is particularly important,since one of the main problems in text mining is the highdimensionality of the feature space _M. Then, data miningand machine learning algorithms (i.e., support vector machines(SVMs), decision trees, neural networks, etc.) are applied tothe documents in the vector space representation, to build classification,clustering or regression models. Finally, the resultsobtained by the model are interpreted by means of measuresof effectiveness (e.g., statistical-based measures) to verify theaccuracy achieved. Additionally, the obtained results may beimproved, e.g., by modifying the values of the parameters usedand repeating the whole process.Among social networks platforms, we took into accountTwitter, as the majority of works in the literature regardingevent detection focus on it. Twitter is nowadays the mostpopular micro-blogging service; it counts more than 600 millionactive users,3 sharing more than 400 million SUMs perday [1]. Regarding the aim of this paper, Twitter has severaladvantages over the similar micro-blogging services. First,tweets are up to 140 characters, enhancing the real-time andnews-oriented nature of the platform. In fact, the life-time oftweets is usually very short, thus Twitter is the social network3http://www.statisticbrain.com/twitter-statisticsplatform that is best suited to study SUMs related to real-timeevents [22]. Second, each tweet can be directly associated withmeta-information that constitutes additional information. Third,Twitter messages are public, i.e., they are directly available withno privacy limitations. For all of these reasons, Twitter is a goodsource of information for real-time event detection and analysis.In this paper, we propose an intelligent system, based on textmining and machine learning algorithms, for real-time detectionof traffic events from Twitter stream analysis. The system,after a feasibility study, has been designed and developed fromthe ground as an event-driven infrastructure, built on a ServiceOriented Architecture (SOA) [23]. The system exploits availabletechnologies based on state-of-the-art techniques for textanalysis and pattern classification. These technologies and techniqueshave been analyzed, tuned, adapted, and integrated inorder to build the intelligent system. In particular, we present anexperimental study, which has been performed for determiningthe most effective among different state-of-the-art approachesfor text classification. The chosen approach was integrated intothe final system and used for the on-the-field real-time detectionof traffic events.The paper has the following structure. Section II summarizesrelated work about event detection from social Twitter streamanalysis. Section III outlines the architecture of the proposedsystem for traffic detection, by describing the methodologyused to collect, elaborate, and classify SUMs, with particularreference to SUMs extracted from the Twitter stream.Section IV describes the setup of the system. Section V presentsthe results achieved with different classification models andprovides a comparison with similar works in the literature.Section VI presents the real-world monitoring application forreal-time detection of traffic events. Finally, Section VII providesconcluding remarks.II. RELATED WORKWith reference to current approaches for using social mediato extract useful information for event detection, we need todistinguish between small-scale events and large-scale events.Small-scale events (e.g., traffic, car crashes, fires, or localmanifestations) usually have a small number of SUMs relatedto them, belong to a precise geographic location, and areconcentrated in a small time interval. On the other hand, largescaleevents (e.g., earthquakes, tornados, or the election of apresident) are characterized by a huge number of SUMs, and bya wider temporal and geographic coverage [24]. Consequently,due to the smaller number of SUMs related to small-scaleevents, small-scale event detection is a non-trivial task. Severalworks in the literature deal with event detection from socialnetworks. Many works deal with large-scale event detection [6],[25]–[28] and only a few works focus on small-scale events [9],[12], [24], [29]–[31].Regarding large-scale event detection, Sakaki et al. [6] useTwitter streams to detect earthquakes and typhoons, by monitoringspecial trigger-keywords, and by applying an SVM as abinary classifier of positive events (earthquakes and typhoons)and negative events (non-events or other events). In [25],the authors present a method for detecting real-world events,D’ANDREA et al.: REAL-TIME DETECTION OF TRAFFIC FROM TWITTER STREAM ANALYSIS 2271such as natural disasters, by analyzing Twitter streams andby employing both NLP and term-frequency-based techniques.Chew et al. [26] analyze the content of tweets shared during theH1N1 (i.e., swine flu) outbreak, containing keywords and hashtagsrelated to the H1N1 event to determine the kind of informationexchanged by social media users. De Longueville et al.[27] analyze geo-tagged tweets to detect forest fire events andoutline the affected area.Regarding small-scale event detection, Agarwal et al. [29]focus on the detection of fires in a factory from Twitter streamanalysis, by using standard NLP techniques and a Naive Bayes(NB) classifier. In [30], information extracted from Twitterstreams is merged with information from emergency networksto detect and analyze small-scale incidents, such as fires.Wanichayapong et al. [12] extract, using NLP techniques andsyntactic analysis, traffic information from microblogs to detectand classify tweets containing place mentions and trafficinformation. Li et al. [31] propose a system, called TEDAS, toretrieve incident-related tweets. The system focuses on Crimeand Disaster-related Events (CDE) such as shootings, thunderstorms,and car accidents, and aims to classify tweets asCDE events by exploiting a filtering based on keywords, spatialand temporal information, number of followers of the user,number of retweets, hashtags, links, and mentions. Sakaki et al.[9] extract, based on keywords, real-time driving informationby analyzing Twitter’s SUMs, and use an SVM classifierto filter “noisy” tweets not related to road traffic events.Schulz et al. [24] detect small-scale car incidents from Twitterstream analysis, by employing semantic web technologies,along with NLP and machine learning techniques. They performthe experiments using SVM, NB, and RIPPER classifiers.In this paper, we focus on a particular small-scale event, i.e.,road traffic, and we aim to detect and analyze traffic eventsby processing users’ SUMs belonging to a certain area andwritten in the Italian language. To this aim, we propose a systemable to fetch, elaborate, and classify SUMs as related to a roadtraffic event or not. To the best of our knowledge, few papershave been proposed for traffic detection using Twitter streamanalysis. However, with respect to our work, all of them focuson languages different from Italian, employ different inputfeatures and/or feature selection algorithms, and consider onlybinary classifications. In addition, a few works employ machinelearning algorithms [9], [24], while the others rely on NLPtechniques only. The proposed system may approach both binaryand multi-class classification problems. As regards binaryclassification, we consider traffic-related tweets, and tweets notrelated with traffic. As regards multi-class classification, wesplit the traffic-related class into two classes, namely trafficcongestion or crash, and traffic due to external event. In thispaper, with external event we refer to a scheduled event (e.g.,a football match, a concert), or to an unexpected event (e.g.,a flash-mob, a political demonstration, a fire). In this way weaim to support traffic and city administrations for managingscheduled or unexpected events in the city.Moreover, the proposed system could work together withother traffic sensors (e.g., loop detectors, cameras, infraredcameras) and ITS monitoring systems for the detection of trafficdifficulties, providing a low-cost wide coverage of the roadFig. 1. System architecture for traffic detection from Twitter stream analysis.network, especially in those areas (e.g., urban and suburban)where traditional traffic sensors are missing.Concluding, the proposed ITS is characterized by the followingstrengths with respect to the current research aimed atdetecting traffic events from social networks: i) it performs amulti-class classification, which recognizes non-traffic, trafficdue to congestion or crash, and traffic due to external events;ii) it detects the traffic events in real-time; and iii) it is developedas an event-driven infrastructure, built on an SOA architecture.As regards the first strength, the proposed ITS could be a valuabletool for traffic and city administrations to regulate trafficand vehicular mobility, and to improve the management ofscheduled or unexpected events. For what concerns the secondstrength, the real-time detection capability allows obtaining reliableinformation about traffic events in a very short time, oftenbefore online news web sites and local newspapers. As far as thethird strength is concerned, with the chosen architecture, we areable to directly notify the traffic event occurrence to the driversregistered to the system, without the need for them to access officialnews websites or radio traffic news channels, to get trafficinformation. In addition, the SOA architecture permits to exploittwo important peculiarities, i.e., scalability of the service(e.g., by using a dedicated server for each geographic area), andeasy integration with other services (e.g., other ITS services).III. ARCHITECTURE OF THE TRAFFIC DETECTION SYSTEMIn this section, our traffic detection system based onTwitter streams analysis is presented. The system architectureis service-oriented and event-driven, and is composed of threemain modules, namely: i) “Fetch of SUMs and Pre-processing”,ii) “Elaboration of SUMs”, iii) “Classification of SUMs”. Thepurpose of the proposed system is to fetch SUMs from Twitter,to process SUMs by applying a few text mining steps, andto assign the appropriate class label to each SUM. Finally, asshown in Fig. 1, by analyzing the classified SUMs, the systemis able to notify the presence of a traffic event.The main tools we have exploited for developing the systemare: 1) Twitter’s API,4 which provides direct access to the4http://dev.twitter.com2272 IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 16, NO. 4, AUGUST 2015public stream of tweets; 2) Twitter4J,5 a Java library that weused as a wrapper for Twitter’s API; 3) the Java API providedbyWeka (Waikato Environment for Knowledge Analysis) [32],which we mainly employed for data pre-processing and textmining elaboration.We recall that both the “Elaboration of SUMs” and the“Classification of SUMs” modules require setting the optimalvalues of a few specific parameters, by means of a supervisedlearning stage. To this aim, we exploited a training setcomposed by a set of SUMs previously collected, elaborated,and manually labeled. Section IV describes in greater detailhow the specific parameters of each module are set during thesupervised learning stage.In the following, we discuss in depth the elaboration madeon the SUMs by each module of the traffic detection system.A. Fetch of SUMs and Pre-ProcessingThe first module, “Fetch of SUMs and Pre-processing”,extracts raw tweets from the Twitter stream, based on one ormore search criteria (e.g., geographic coordinates, keywordsappearing in the text of the tweet). Each fetched raw tweet contains:the user id, the timestamp, the geographic coordinates,a retweet flag, and the text of the tweet. The text may containadditional information, such as hashtags, links, mentions, andspecial characters. In this paper, we took only Italian languagetweets into account. However, the system can be easily adaptedto cope with different languages.After the SUMs have been fetched according to the specificsearch criteria, SUMs are pre-processed. In order to extract onlythe text of each raw tweet and remove all meta-informationassociated with it, a Regular Expression filter [33] is applied.More in detail, the meta-information discarded are: user id,timestamp, geographic coordinates, hashtags, links, mentions,and special characters. Finally, a case-folding operation isapplied to the texts, in order to convert all characters to lowercase. At the end of this elaboration, each fetched SUM appearsas a string, i.e., a sequence of characters. We denote the jthSUM pre-processed by the first module as SUMj , with j =1, . . . , N, where N is the total number of fetched SUMs.B. Elaboration of SUMsThe second processing module, “Elaboration of SUMs”, isdevoted to transforming the set of pre-processed SUMs, i.e., aset of strings, in a set of numeric vectors to be elaborated bythe “Classification of SUMs” module. To this aim, some textmining techniques are applied in sequence to the pre-processedSUMs. In the following, the text mining steps performed in thismodule are described in detail:a) tokenization is typically the first step of the text miningprocess, and consists in transforming a stream of charactersinto a stream of processing units called tokens (e.g.,syllables, words, or phrases). During this step, other operationsare usually performed, such as removal of punctua-5http://twitter4j.orgtion and other non-text characters [18], and normalizationof symbols (e.g., accents, apostrophes, hyphens, tabs andspaces). In the proposed system, the tokenizer removesall punctuation marks and splits each SUM into tokenscorresponding to words (bag-of-words representation). Atthe end of this step, each SUMj is represented as thesequence of words contained in it. We denote the jthtokenized SUM as SUMTj =_tTj1, . . . , tTjh, . . . , tTjHj_,where tTjh is the hth token and Hj is the total numberof tokens in SUMTj ;b) stop-word filtering consists in eliminating stop-words,i.e., words which provide little or no information to thetext analysis. Common stop-words are articles, conjunctions,prepositions, pronouns, etc. Other stop-words arethose having no statistical significance, that is, those thattypically appear very often in sentences of the consideredlanguage (language-specific stop-words), or in the set oftexts being analyzed (domain-specific stop-words), andcan therefore be considered as noise [34]. The authorsin [35] have shown that the 10 most frequent wordsin texts and documents of the English language areabout the 20–30% of the tokens in a given document.In the proposed system, the stop-word list for the Italianlanguage was freely downloaded from the SnowballTartarus website6 and extended with other ad hoc definedstop-words. At the end of this step, each SUMis thus reduced to a sequence of relevant tokens. Wedenote the jth stop-word filtered SUM as SUMSW_ j =tSWj1 , . . . , tSWjk , . . . , tSWjKj_, where tSWjk is the kth relevanttoken and Kj , with Kj Hj , is the total numberof relevant tokens in SUMSWj . We recall that a relevanttoken is a token that does not belong to the set of stopwords;c) stemming is the process of reducing each word (i.e.,token) to its stem or root form, by removing its suffix. Thepurpose of this step is to group words with the same themehaving closely related semantics. In the proposed system,the stemmer exploits the Snowball Tartarus Stemmer7 forthe Italian language, based on the Porter’s algorithm [36].Hence, at the end of this step each SUM is represented asa sequence of stems extracted from the tokens containedin it. We denote the jth stemmed SUM as SUMS_ j =tSj1, . . . , tSjl, . . . , tSjLj_, where tSjl is the lth stem and Lj ,with Lj Kj , is the total number of stems in SUMSj ;d) stem filtering consists in reducing the number of stems ofeach SUM. In particular, each SUM is filtered by removingfrom the set of stems the ones not belonging to theset of relevant stems. The set of F relevant stems RS ={ˆs1, . . . , ˆsf , . . . , ˆsF } is identified during the supervisedlearning stage that will be discussed in Section IV.At the end of this step, each SUM is represented asa sequence of relevant stems. We denote the jth filteredSUM as SUMSFj =_tSFj1 , . . . , tSFjp , . . . , tSFjPj_, where6http://snowball.tartarus.org/algorithms/italian/stop.txt7http://snowball.tartarus.org/algorithms/italian/stemmer.htmlD’ANDREA et al.: REAL-TIME DETECTION OF TRAFFIC FROM TWITTER STREAM ANALYSIS 2273Fig. 2. Steps of the text mining elaboration applied to a sample tweet.tSFjpRS is the pth relevant stem and Pj , with Pj Ljand Pj F, is the total number of relevant stems inSUMSFj ;e) feature representation consists in building, for eachSUM, the corresponding vector of numeric features. Indeed,in order to classify the SUMs, we have to representthem in the same feature space. In particular,we consider the F-dimensional set of features X ={X1, . . . , Xf, . . . , XF } corresponding to the set of relevantstems. For each SUMSFj we define the vectorxj = {xj1, . . . , xjf , . . . , xjF } where each element is setaccording to the following formula:xjf =_wf if stem ˆsf SUMSFj0 otherwise.(1)In (1), wf is the numeric weight associated to therelevant stem ˆsf : we will discuss how this weight iscomputed in Section IV.In Fig. 2, we summarize all the steps applied to a sampletweet by the “Elaboration of SUMs” module.C. Classification of SUMsThe third module, “Classification of SUMs”, assigns to eachelaborated SUM a class label related to traffic events. Thus, theoutput of this module is a collection of N labeled SUMs. To theaim of labeling each SUM, a classification model is employed.The parameters of the classification model have been identifiedduring the supervised learning stage. Actually, as it will bediscussed in Section V, different classification models havebeen considered and compared. The classifier that achievedthe most accurate results was finally employed for the realtimemonitoring with the proposed traffic detection system. Thesystem continuously monitors a specific region and notifies thepresence of a traffic event on the basis of a set of rules that canbe defined by the system administrator. For example, when thefirst tweet is recognized as a traffic-related tweet, the systemmay send a warning signal. Then, the actual notification of thetraffic event may be sent after the identification of a certainnumber of tweets with the same label.IV. SETUP OF THE SYSTEMAs stated previously, a supervised learning stage is requiredto perform the setup of the system. In particular, we need toidentify the set of relevant stems, the weights associated witheach of them, and the parameters that describe the classificationmodels. We employ a collection of Ntr labeled SUMs astraining set. During the learning stage, each SUM is elaboratedby applying the tokenization, stop-word filtering, and stemmingsteps. Then, the complete set of stems is built as follows:CS =⎛⎝N_trj=1SUMSj⎞⎠ = {s1, . . . , sq, . . . , sQ}. (2)CS is the union of all the stems extracted from the Ntr SUMsof the training set. We recall that SUMSj is the set of stemsthat describes the jth SUM after the stemming step in thetraining set.Then, we compute the weight of each stem in CS, whichallows us to establish the importance of each stem sq in thecollection of SUMs of the training set, by using the InverseDocument Frequency (IDF) index aswq = IDFq = ln(Ntr/Nq), (3)where Nq is the number of SUMs of the training set in whichthe stem sq occurs [37]. The IDF index is a simplified version ofthe TF-IDF (Term Frequency-IDF) index [38]–[40], where theTF part considers the frequency of a specific stem within eachSUM. In fact, we heuristically found that the same stem seldomappears more than once in an SUM. On the other hand, we performedseveral experiments also with the TF-IDF index and we2274 IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 16, NO. 4, AUGUST 2015verified that the performance in terms of classification accuracyis similar to the one obtained by using only the IDF index. Thus,we decided to adopt the simpler IDF index as weight.In order to select the set of relevant stems, a feature selectionalgorithm is applied. SUMs are described by a set{S1, . . . , Sq, . . . , SQ} of Q features, where each feature Sqcorresponds to the stem sq. The possible values of feature Sqare wq and 0.Then, as suggested in [41], to evaluate the quality of eachstem sq, we employ a method based on the computation ofthe Information Gain (IG) value between feature Sq and outputC = {c1, . . . , cr, . . . , cR}, where cr is one of the R possibleclass labels (two or three in our case). The IG value between Sqand C is calculated as IG(C, Sq) = H(C) H(C|Sq), whereH(C) represents the entropy of C, and H(C|Sq) represents theentropy of C after the observation of feature Sq.Finally, we identified the set of relevant stems RS by selectingall the stems which have a positive IG value. We recall thatthe stem selection process based on IG values is a standard andeffective method widely used in the literature [40], [42].The last part of the supervised learning stage regards theidentification of the most suited classification models and thesetting of their structural parameters. We took into accountseveral classification algorithms widely used in the literaturefor text classification tasks [43], namely, i) SVM [44], ii) NB[45], iii) C4.5 decision tree [46], iv) k-nearest neighbor (kNN)[47], and v) PART [48]. The learning algorithms used to buildthe aforementioned classifiers will be briefly discussed in thefollowing section.V. EVALUATION OF THE TRAFFIC DETECTION SYSTEMIn this section, we discuss the evaluation of the proposedsystem. We performed several experiments using two differentdatasets. For each dataset, we built and compared seven differentclassification models: SVM, NB, C4.5, kNN (with k equalto 1, 2, and 5), and PART. In the following, we describe howwe generated the datasets to complete the setup of the system,and we recall the employed classification models. Then, wepresent the achieved results, and the statistical metrics used toevaluate the performance of the classifiers. Finally, we providea comparison with some results extracted from other works inthe literature.A. Description of the DatasetsWe built two different datasets, i.e., a 2-class dataset, and a3-class dataset. For each dataset, tweets in the Italian languagewere collected using the “Fetch of SUMs and Pre-processing”module by setting some search criteria (e.g., presence of keywords,geographic coordinates, date and time of posting). Then,the SUMs were manually labeled, by assigning the correct classlabel.1) 2-Class Dataset: The first dataset consists of tweetsbelonging to two possible classes, namely i) road traffic-relatedtweets (traffic class), and ii) tweets not related with road traffic(non-traffic class). The tweets were fetched in a time span ofabout four hours from the same geographic area. First, wefetched candidate tweets for traffic class by using the followingsearch criteria:— geographic area of origin of the tweet: Italy. We setthe center of the area in Rome (latitude and longitudeequal to 4153’ 35” and 1228’ 58”, respectively)and we set a radius of about 600 km to cover approximatelythe whole country;— time and date of posting: tweets belong to a timespan of four evening hours of two weekend days ofMay 2013;— keywords contained in the text of the tweet: we applythe or-operator on the set of keywords S1, composedby the three most frequently used traffic-relatedkeywords, S1 = {“traffico”(traffic), “coda”(queue),“incidente”(crash)} , with the aim of selecting tweetscontaining at least one of the above keywords. Theresulting condition can be expressed by:CondA: “traffico” or “coda” or “incidente”.Then, we fetched the candidate tweets for non-traffic classusing the same search criteria for geographic area, and timeand date, but without setting any keyword. Obviously, this time,tweets containing traffic-related keywords from set S1, alreadyfound in the previous fetch, were discarded.Finally, the tweets were manually labeled with two possibleclass labels, i.e., as related to road traffic event (traffic), e.g.,accidents, jams, queues, or not (non-traffic). More in detail,first we read, interpreted, and correctly assigned a traffic classlabel to each candidate traffic class tweet. Among all candidatetraffic class tweets, we actually labeled 665 tweets with thetraffic class. About 4% of candidate traffic class tweets werenot labeled with the traffic class label.With the aim of correctlytraining the system, we added these tweets to the non-trafficclass. Indeed, we collected also a number of tweets containingthe traffic-related keywords defined in S1, but actually notconcerning road traffic events. Such tweets are related to, e.g.,illegal drug trade, network traffic, or organ trafficking. It isworth noting that, as it happens in the English language, severalwords in the Italian language, e.g., “traffic” or “incident”, aresuitable in several contexts. So, for instance, the events “trafficodi droga” (drug trade), “traffico di organi” (organ trafficking),“incidente diplomatico” (diplomatic scandal), “traffico dati”(network traffic) could be easily mistaken for road trafficrelatedevents.Then, in order to obtain a balanced dataset, we randomlyselected tweets from the candidate tweets of non-traffic classuntil reaching 665 non-traffic class tweets, and we manuallyverified that the selected tweets did not belong to the trafficclass. Thus, the final 2-class dataset consists of 1330 tweets andis balanced, i.e., it contains 665 tweets per class.Table I shows the textual part of a selection of tweets fetchedby the system with the corresponding, manually added, classlabel. In Table I, tweets #1, #2 and #3 are examples of trafficclass tweets, tweet #4 is an example of a non-traffic class tweet,tweets #5 and #6 are examples of tweets containing trafficrelatedkeywords, but belonging to the non-traffic class. In thetable, for an easier understanding, the keywords appearing inD’ANDREA et al.: REAL-TIME DETECTION OF TRAFFIC FROM TWITTER STREAM ANALYSIS 2275TABLE ISOME EXAMPLES OF TWEETS AND CORRESPONDING CLASSES FOR THE 2-CLASS DATASETTABLE IISIGNIFICANT FEATURES RELATED TO THE TRAFFIC CLASSthe text of each tweet are underlined. Table II shows some of themost important textual features (i.e., stems) and their meaning,related to the traffic class tweets, identified by the system forthis dataset.2) 3-Class Dataset: The second dataset consists of tweetsbelonging to three possible classes. In this case we want todiscriminate if traffic is caused by an external event (e.g., a footballmatch, a concert, a flash-mob, a political demonstration,a fire) or not. Even though the current release of the systemwas not designed to identify the specific event, knowing thatthe traffic difficulty is caused by an external event could beuseful to traffic and city administrations, for regulating trafficand vehicular mobility, or managing scheduled events in thecity. More in detail, we took into account four possible externalevents, namely, i) matches, ii) processions, iii) music concerts,and iv) demonstrations. Thus, in this dataset the three possibleclasses are: i) traffic due to external event, ii) traffic congestionor crash, and iii) non-traffic. The tweets were fetched in asimilar way as described before. More in detail, first, we fetchedcandidate road traffic-related tweets due to an external event(traffic due to external event class) according to the followingsearch criteria:— geographic area of origin of the tweet: Italy, parametersset as in the case of the 2-class dataset;— time and date of posting: parameters set as in the caseof the 2-class dataset, but different hours of the sameweekend days are used;— keywords contained in the text of the tweet: foreach external event aforementioned, we took into accountonly one keyword, thus obtaining the set S2 ={“partita”(match), “processione” (procession), “concerto”(concert), “manifestazione” (demonstration)}.Next we combined each keyword representing theexternal event with one of the traffic-related keywordsfrom set S3 = {“traffico”(traffic), “coda”(queue)}.Finally, we applied the and-operator between eachkeyword from set S2 and the conditionCondB expressed as:CondB: “traffico” or “coda”,thus obtaining the following conditions:CondC: CondB and “partita”,CondD: CondB and “processione”CondE: CondB and “concerto”,CondF : CondB and “manifestazione”.Then, we fetched candidate tweets related to traffic congestion,crashes, and jams (traffic congestion or crash class) byusing the following search criteria:— geographic area of origin of the tweet: Italy, parametersset as as in the case of the 2-class dataset;— time and date of posting: parameters set as in the caseof the 2-class dataset, but different hours of the sameweekend days are used;— keywords contained in the text of the tweet: we combinedthe mentioned above keywords from set S1 inthree possible sets: S4={“traffico”(traffic), “incidente”(crash)}, S5 = {“incidente”(crash), coda(queue)},and the already defined set S3. Then we used theand-operator to define the exploited conditions asfollows:CondG: “traffico” and “incidente”,CondH: “traffico” and “coda”,CondI : “incidente” and “coda”.Obviously, as done before, tweets containing external eventrelatedkeywords, already found in the previous fetch, werediscarded. Further, we fetched the candidate tweets of nontrafficclass using the same search criteria for geographic area,and time and date, but without setting any keyword. Again,tweets already found in previous fetches were discarded.Finally, the tweets were manually labeled with three possibleclass labels. We first labeled the candidate tweets of trafficdue to external event class (this set of tweets was the smallerone), and we identified 333 tweets actually associated with thisclass. Then, we randomly selected 333 tweets for each of the2276 IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 16, NO. 4, AUGUST 2015TABLE IIISOME EXAMPLES OF TWEETS AND CORRESPONDING CLASSES FOR THE 3-CLASS DATASETtwo remaining classes. Also, in this case, we manually verifiedthe correctness of the labels associated to the selected tweets.Finally, as done before, we added to the non-traffic class alsotweets containing keywords related to traffic congestion and toexternal events but not concerning road-traffic events. The final3-class dataset consists of 999 tweets and it is balanced, i.e., ithas 333 tweets per class.Table III shows a selection of tweets fetched by the systemfor the 3-class dataset, with the corresponding, manually added,class label. In Table III, tweets #1, #2, #3 and #4 are examplesof tweets belonging to the class traffic due to external event: inmore detail, #1 is related to a procession event, #2 is relatedto a match event, #3 is related to a concert event, and #4is related to a demonstration event. Tweet #5 is an exampleof a tweet belonging to the class traffic congestion or crash,while tweets #6 and #7 are examples of non-traffic class tweets.Words underlined in the text of each tweet represent involvedkeywords.B. Employed Classification ModelsIn the following we briefly describe the main properties ofthe employed and experimented classification models.SVMs, introduced for the first time in [49], are discriminativeclassification algorithms based on a separating hyper-planeaccording to which new samples can be classified. The besthyper-plane is the one with the maximum margin, i.e., thelargest minimum distance, from the training samples and iscomputed based on the support vectors (i.e., samples of thetraining set). The SVM classifier employed in this work is theimplementation described in [44].The NB classifier [45] is a probabilistic classification algorithmbased on the application of the Bayes’s theorem, andis characterized by a probability model which assumes independenceamong the input features. In other words, the modelassumes that the presence of a particular feature is unrelated tothe presence of any other feature.The C4.5 decision tree algorithm [46] generates a classificationdecision tree by recursively dividing up the training dataaccording to the values of the features. Non-terminal nodesof the decision tree represent tests on one or more features,while terminal nodes represent the predicted output, namely theclass. In the resulting decision tree each path (from the rootto a leaf) identifies a combination of feature values associatedwith a particular classification. At each level of the tree, thealgorithm chooses the feature that most effectively splits thedata, according to the highest information gain.The kNN algorithm [50] belongs to the family of “lazy”classification algorithms. The basic functioning principle is thefollowing: each unseen sample is compared with a number ofpre-classified training samples, and its similarity is evaluatedaccording to a simple distance measure (e.g., we employed thenormalized Euclidean distance), in order to find the associatedoutput class. The parameter k allows specifying the number ofneighbors, i.e., training samples to take into account for theclassification. We focus on three kNN models with k equal to1, 2, and 5. The kNN classifier employed in this work followsthe implementation described in [47].The PART algorithm [48] combines two rule generationmethods, i.e., RIPPER [51] and C4.5 [46]. It infers classificationrules by repeatedly building partial, i.e., incomplete,C4.5 decision trees and by using the separate-and-conquer rulelearning technique [52].C. Experimental ResultsIn this section, we present the classification results achievedby applying the classifiers mentioned in Section V-B to thetwo datasets described in Section V-A. For each classifier theexperiments were performed using an n-fold stratified crossvalidationmethodology. In n-fold stratified cross-validation,the dataset is randomly partitioned into n folds and the classesin each fold are represented with the same proportion as inthe original data. Then, the classification model is trained onn 1 folds, and the remaining fold is used for testing themodel. The procedure is repeated n times, using as test dataeach of the n folds exactly once. The n test results are finallyaveraged to produce an overall estimation. We repeated ann-fold stratified cross-validation, with n = 10, for two times,using two different seed values to randomly partition the datainto folds.D’ANDREA et al.: REAL-TIME DETECTION OF TRAFFIC FROM TWITTER STREAM ANALYSIS 2277TABLE IVSTATISTICAL METRICSWe recall that, for each fold, we consider a specific trainingset which is used in the supervised learning stage to learnboth the pre-processing (i.e., the set of relevant stems and theirweights) and the classification model parameters.To evaluate the achieved results, we employed the mostfrequently used statistical metrics, i.e., precision, accuracy,recall, and F-score. To explain the meaning of the metrics,we will refer, for the sake of simplicity, to the case of abinary classification, i.e., positive class versus negative class.In fact, in the case of a multi-class classification, the metricsare computed per class and the overall statistical measure issimply the average of the per-class measures. The correctness ofa classification can be evaluated according to four values: i) truepositives (TP): the number of real positive samples correctlyclassified as positive; ii) true negatives (TN): the number ofreal negative samples correctly classified as negative; iii) falsepositives (FP): the number of real negative samples incorrectlyclassified as positive; iv) false negatives (FN): the number ofreal positive samples incorrectly classified as negative.Based on the previous definitions, we can now formallydefine the employed statistical metrics and provide, in Table IV,the corresponding equations. Accuracy represents the overalleffectiveness of the classifier and corresponds to the number ofcorrectly classified samples over the total number of samples.Precision is the number of correctly classified samples of aclass, i.e., positive class, over the number of samples classifiedas belonging to that class. Recall is the number of correctlyclassified samples of a class, i.e., positive class, over the numberof samples of that class; it represents the effectiveness of theclassifier to identify positive samples. The F-score (typicallyused with β = 1 for class-balanced datasets) is the weightedharmonic mean of precision and recall and it is used to comparedifferent classifiers.In the first experiment, we performed a classification oftweets using the 2-class dataset (R = 2) consisting of 1330tweets, described in Section V-A. The aim is to assign a classlabel (traffic or non-traffic) to each tweet.Table V shows the average classification results obtained bythe classifiers on the 2-class dataset. More in detail, the tableshows for each classifier, the accuracy, and the per-class valueof recall, precision, and F-score. All the values are averagedover the 20 values obtained by repeating two times the 10-foldcross validation. The best classifier resulted to be the SVM withan average accuracy of 95.75%.As Table VI clearly shows, the results achieved by our SVMclassifier appreciably outperform those obtained in similarworks in the literature [9], [12], [24], [31] despite they refer todifferent datasets. More precisely, Wanichayapong et al. [12]obtained an accuracy of 91.75% by using an approach thatconsiders the presence of place mentions and special keywordsin the tweet. Li et al. [31] achieved an accuracy of 80% fordetecting incident-related tweets using Twitter specific features,such as hashtags, mentions, URLs, and spatial and temporalinformation. Sakaki et al. [9] employed an SVM to identifyheavy-traffic tweets and obtained an accuracy of 87%. Finally,Schulz et al. [24], by using SVM, RIPPER, and NB classifiers,obtained accuracies of 89.06%, 85.93%, and 86.25%, respectively.In the case of SVM, they used the following features:word n-grams, TF-IDF score, syntactic and semantic features.In the case of NB and RIPPER they employed the same set offeatures except semantic features.In the second experiment, we performed a classificationof tweets over three classes (R = 3), namely, traffic due toexternal event, traffic congestion or crash, and non-traffic, withthe aim of discriminating the cause of traffic. Thus, we employedthe 3-class dataset consisting of 999 tweets, describedin Section V-A. We employed again the classifiers previouslyintroduced and the obtained results are shown in Table VII.The best classifier resulted to be again SVM with an averageaccuracy of 88.89%.In order to verify if there exist statistical differences amongthe values of accuracy achieved by the seven classificationmodels, we performed a statistical analysis of the results. Wetook into account the model which obtains the best averageaccuracy, i.e., the SVM model. As suggested in [53], we appliednon-parametric statistical tests: for each classifier we generateda distribution consisting of the 20 values of the accuracieson the test set obtained by repeating two times the 10-foldcross validation. We statistically compared the results achievedby the SVM model with the ones achieved by the remainingmodels. We applied the Wilcoxon signed-rank test [54], whichdetects significant differences between two distributions. In allthe tests, we used α = 0.05 as level of significance. Tables VIIIand IX show the results of the Wilcoxon signed-rank test, relatedto the 2-class and the 3-class problems, respectively. In thetables R+ and Rdenote, respectively, the sum of ranks for thefolds in which the first model outperformed the second, andthe sum of ranks for the opposite condition. Since the p-valuesare always lower than the level of significance we can alwaysreject the statistical hypothesis of equivalence. For this reason,we can state that the SVM model statistically outperforms allthe other approaches on both the problems.VI. REAL-TIME DETECTION OF TRAFFIC EVENTSThe developed system was installed and tested for the realtimemonitoring of several areas of the Italian road network,by means of the analysis of the Twitter stream coming fromthose areas. The aim is to perform a continuous monitoring offrequently busy roads and highways in order to detect possibletraffic events in real-time or even in advance with respect to thetraditional news media [55], [56]. The system is implemented as2278 IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 16, NO. 4, AUGUST 2015TABLE VCLASSIFICATION RESULTS ON THE 2-CLASS DATASET (BEST VALUES IN BOLD)TABLE VIRESULTS OF THE CLASSIFICATION OF TWEETS IN OTHER WORKS IN THE LITERATURETABLE VIICLASSIFICATION RESULTS ON THE 3-CLASS DATASET (BEST VALUES IN BOLD)TABLE VIIIRESULTS OF THE WILCOXON SIGNED-RANK TEST ON THE ACCURACIESOBTAINED ON THE TEST SET FOR THE 2-CLASS DATASETa service of a wider service-oriented platform to be developedin the context of the SMARTY project [23]. The service canbe called by each user of the platform, who desires to knowTABLE IXRESULTS OF THE WILCOXON SIGNED-RANK TEST ON THE ACCURACIESOBTAINED ON THE TEST SET FOR THE 3-CLASS DATASETthe traffic conditions in a certain area. In this section, weaim to show the effectiveness of our system in determiningtraffic events in short time. We just present some results for theD’ANDREA et al.: REAL-TIME DETECTION OF TRAFFIC FROM TWITTER STREAM ANALYSIS 2279TABLE XREAL-TIME DETECTION OF TRAFFIC EVENTS2-class problem. For the setup of the system, we have employedas training set the overall dataset described in Section V-A.We adopt only the best performing classifier, i.e., the SVMclassifier. During the learning stage, we identified Q = 3227features, which were reduced to F = 582 features after thefeature selection step.2280 IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 16, NO. 4, AUGUST 2015TABLE X(CONTINUED.) REAL-TIME DETECTION OF TRAFFIC EVENTSThe system continuously performs the following operations:it i) fetches, with a time frequency of z minutes, tweets originatedfrom a given area, containing the keywords resulting fromCondA, ii) performs a real-time classification of the fetchedtweets, iii) detects a possible traffic-related event, by analyzingthe traffic class tweets from the considered area, and, if needed,sends one or more traffic warning signals with increasingintensity for that area. More in detail, a first low-intensitywarning signal is sent when m traffic class tweets are foundin the considered area in the same or in subsequent temporalwindows. Then, as the number of traffic class tweets grows,the warning signal becomes more reliable, thus more intense.The value of m was set based on heuristic considerations,depending, e.g., on the traffic density of the monitored area.In the experiments we set m = 1. As regards the fetching frequencyz, we heuristically found that z = 10 minutes representsa good compromise between fast event detection and systemscalability. In fact, z should be set depending on the number ofmonitored areas and on the volume of tweets fetched.With the aim of evaluating the effectiveness of our system,we need that each detected traffic-related event is appropriatelyvalidated. Validation can be performed in different wayswhich include: i) direct communication by a person, who waspresent at the moment of the event, ii) reports drawn up by thepolice and/or local administrations (available only in case ofincidents), iii) radio traffic news; iv) official real-time trafficnews web sites; v) local newspapers (often the day after theevent and only when the event is very significant).Direct communication is possible only if a person is presentat the event and can communicate this event to us. Although wehave tried to sensitize a number of users, we did not obtain anadequate feedback. Official reports are confidential: police andlocal administrations barely allow accessing to these reports,and, when this permission is granted, reports can be consultedonly after several days. Radio traffic news are in general quiteprecise in communicating traffic-related events in real time. Unfortunately,to monitor and store the events, we should dedicatea person or adopt some tool for audio analysis. We realizedhowever that the traffic-related events communicated on theradio are always mentioned also in the official real-time trafficnews web sites. Actually, on the radio, the speaker typicallyreads the news reported on the web sites. Local newspapersfocus on local traffic-related events and often provide eventswhich are not published on official traffic news web sites.Concluding, official real-time traffic news web sites and localnewspapers are the most reliable and effective sources of informationfor traffic-related events. Thus, we decided to analyzetwo of the most popular real-time traffic news web sites for theD’ANDREA et al.: REAL-TIME DETECTION OF TRAFFIC FROM TWITTER STREAM ANALYSIS 2281Italian road network, namely “CCISS Viaggiare informati”,8managed by the Italian government Ministry for infrastructuresand transports, and “Autostrade per l’Italia”,9 the official website of Italian highway road network. Further, we examinedlocal newspapers published in the zones where our system wasable to detect traffic-related events.Actually, it was really difficult to find realistic data to test theproposed system, basically for two reasons: on the one hand, wehave realized that real traffic events are not always notified inofficial news channels; on the other hand, situations of trafficslowdown may be detected by traditional traffic sensors but,at the same time, may not give rise to tweets. In particular,in relation to this latter reason, it is well known that driversusually share a tweet about a traffic event only when theevent is unexpected and really serious, i.e., it forces to stopthe car. So, for instance, they do not share a tweet in caseof road works, minor traffic difficulties, or usual traffic jams(same place and same time). In fact, in correspondence tominor traffic jams we rarely find tweets coming from the affectedarea.We have tried to build a meaningful set of traffic events,related to some major Italian cities, of which we have found anofficial confirmation. The selected set includes events correctlyidentified by the proposed system and confirmed via officialtraffic news web sites or local newspapers. The set of trafficevents, whose information is summarized in Table X, consistsof 70 events detected by our system. The events are relatedboth to highways and to urban roads, and were detected duringSeptember and early October 2014.Table X shows the information about the event, the time ofdetection from Twitter’s stream fetched by our system, the timeof detection from official news websites or local newspapers,and the difference between these two times. In the table, positivedifferences indicate a late detection with respect to officialnews web sites, while negative differences indicate an earlydetection. The symbol “-” indicates that we found the officialconfirmation of the event by reading local newspapers severalhours late. More precisely, the system detects in advance 20events out of 59 confirmed by news web sites, and 11 eventsconfirmed the day after by local newspapers. Regarding the39 events not detected in advance we can observe that 25 ofsuch events are detected within 15 minutes from their officialnotification, while the detection of the remaining 14 eventsoccurs beyond 15 minutes but within 50 minutes. We wish topoint out, however, that, even in the cases of late detection, oursystem directly and explicitly notifies the event occurrence tothe drivers or passengers registered to the SMARTY platform,on which our system runs. On the contrary, in order to get trafficinformation, the drivers or passengers usually need to searchand access the official news websites, which may take sometime and effort, or to wait for getting the information from theradio traffic news.As future work, we are planning to integrate our systemwith an application for analyzing the official traffic news websites, so as to capture traffic condition notifications in real-time.8http://www.cciss.it/9http://www.autostrade.it/autostrade-gis/gis.doThus, our system will be able to signal traffic-related eventsin the worst case at the same time of the notifications on theweb sites. Further, we are investigating the integration of oursystem into a more complex traffic detection infrastructure.This infrastructure may include both advanced physical sensorsand social sensors such as streams of tweets. In particular, socialsensors may provide a low-cost wide coverage of the roadnetwork, especially in those areas (e.g., urban and suburban)where traditional traffic sensors are missing.VII. CONCLUSIONIn this paper, we have proposed a system for real-timedetection of traffic-related events from Twitter stream analysis.The system, built on a SOA, is able to fetch and classify streamsof tweets and to notify the users of the presence of trafficevents. Furthermore, the system is also able to discriminate if atraffic event is due to an external cause, such as football match,procession and manifestation, or not.We have exploited available software packages and state-ofthe-art techniques for text analysis and pattern classification.These technologies and techniques have been analyzed, tuned,adapted and integrated in order to build the overall systemfor traffic event detection. Among the analyzed classifiers, wehave shown the superiority of the SVMs, which have achievedaccuracy of 95.75%, for the 2-class problem, and of 88.89%for the 3-class problem, in which we have also considered thetraffic due to external event class.The best classification model has been employed for realtimemonitoring of several areas of the Italian road network.Wehave shown the results of a monitoring campaign, performed inSeptember and early October 2014. We have discussed the capabilityof the system of detecting traffic events almost in realtime,often before online news web sites and local newspapers.ACKNOWLEDGMENTWe would like to thank Fabio Cempini for the implementationof some parts of the system presented in this paper.REFERENCES[1] F. Atefeh and W. Khreich, “A survey of techniques for event detection inTwitter,” Comput. Intell., vol. 31, no. 1, pp. 132–164, 2015.[2] P. Ruchi and K. Kamalakar, “ET: Events from tweets,” in Proc. 22ndInt. Conf. World Wide Web Comput., Rio de Janeiro, Brazil, 2013,pp. 613–620.[3] A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, andB. Bhattacharjee, “Measurement and analysis of online social networks,”in Proc. 7th ACM SIGCOMM Conf. Internet Meas., San Diego, CA,USA, 2007, pp. 29–42.[4] G. Anastasi et al., “Urban and social sensing for sustainable mobilityin smart cities,” in Proc. IFIP/IEEE Int. Conf. Sustainable Internet ICTSustainability, Palermo, Italy, 2013, pp. 1–4.[5] A. Rosi et al., “Social sensors and pervasive services: Approaches andperspectives,” in Proc. IEEE Int. Conf. PERCOM Workshops, Seattle,WA, USA, 2011, pp. 525–530.[6] T. Sakaki, M. Okazaki, and Y.Matsuo, “Tweet analysis for real-time eventdetection and earthquake reporting system development,” IEEE Trans.Knowl. Data Eng., vol. 25, no. 4, pp. 919–931, Apr. 2013.[7] J. Allan, Topic Detection and Tracking: Event-Based InformationOrganization. Norwell, MA, USA: Kluwer, 2002.[8] K. Perera and D. Dias, “An intelligent driver guidance tool using locationbased services,” in Proc. IEEE ICSDM, Fuzhou, China, 2011,pp. 246–251.2282 IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 16, NO. 4, AUGUST 2015[9] T. Sakaki, Y. Matsuo, T. Yanagihara, N. P. Chandrasiri, and K. Nawa,“Real-time event extraction for driving information from social sensors,”in Proc. IEEE Int. Conf. CYBER, Bangkok, Thailand, 2012,pp. 221–226.[10] B. Chen and H. H. Cheng, “A review of the applications of agent technologyin traffic and transportation systems,” IEEE Trans. Intell. Transp.Syst., vol. 11, no. 2, pp. 485–497, Jun. 2010.[11] A. Gonzalez, L. M. Bergasa, and J. J. Yebes, “Text detection and recognitionon traffic panels from street-level imagery using visual appearance,”IEEE Trans. Intell. Transp. Syst., vol. 15, no. 1, pp. 228–238,Feb. 2014.[12] N. Wanichayapong, W. Pruthipunyaskul, W. Pattara-Atikom, andP. Chaovalit, “Social-based traffic information extraction and classification,”in Proc. 11th Int. Conf. ITST, St. Petersburg, Russia, 2011,pp. 107–112.[13] P. M. d’Orey and M. Ferreira, “ITS for sustainable mobility: A surveyon applications and impact assessment tools,” IEEE Trans. Intell. Transp.Syst., vol. 15, no. 2, pp. 477–493, Apr. 2014.[14] K. Boriboonsomsin, M. Barth, W. Zhu, and A. Vu, “Eco-routing navigationsystem based on multisource historical and real-time trafficinformation,” IEEE Trans. Intell. Transp. Syst., vol. 13, no. 4,pp. 1694–1704, Dec. 2012.[15] J. Hurlock and M. L. Wilson, “Searching twitter: Separating the tweetfrom the chaff,” in Proc. 5th AAAI ICWSM, Barcelona, Spain, 2011,pp. 161–168.[16] J. Weng and B.-S. Lee, “Event detection in Twitter,” in Proc. 5th AAAIICWSM, Barcelona, Spain, 2011, pp. 401–408.[17] S. Weiss, N. Indurkhya, T. Zhang, and F. Damerau, Text Mining: PredictiveMethods for Analyzing Unstructured Information. Berlin, Germany:Springer-Verlag, 2004.[18] A. Hotho, A. Nürnberger, and G. Paaß, “A brief survey of text mining,”LDV Forum-GLDV J. Comput. Linguistics Lang. Technol., vol. 20, no. 1,pp. 19–62, May 2005.[19] V. Gupta, S. Gurpreet, and S. Lehal, “A survey of text mining techniquesand applications,” J. Emerging Technol. Web Intell., vol. 1, no. 1,pp. 60–76, Aug. 2009.[20] V. Ramanathan and T. Meyyappan, “Survey of text mining,” in Proc. Int.Conf. Technol. Bus. Manage., Dubai, UAE, 2013, pp. 508–514.[21] M.W. Berry and M. Castellanos, Survey of Text Mining. NewYork,NY,USA: Springer-Verlag, 2004.[22] H. Takemura and K. Tajima, “Tweet classification based on their lifetimeduration,” in Proc. 21st ACM Int. CIKM, Shanghai, China, 2012,pp. 2367–2370.[23] The Smarty project. [Online]. Available: http://www.smarty.toscana.it/[24] A. Schulz, P. Ristoski, and H. Paulheim, “I see a car crash: Real-timedetection of small scale incidents in microblogs,” in The Semantic Web:ESWC 2013 Satellite Events, vol. 7955. Berlin, Germany: Springer-Verlag, 2013, pp. 22–33.[25] M. Krstajic, C. Rohrdantz, M. Hund, and A. Weiler, “Getting there first:Real-time detection of real-world incidents on Twitter” in Proc. 2nd IEEEWork Interactive Vis. Text Anal.—Task-Driven Anal. Soc. Media IEEEVisWeek,” Seattle, WA, USA, 2012.[26] C. Chew and G. Eysenbach, “Pandemics in the age of Twitter: Contentanalysis of tweets during the 2009 H1N1 outbreak,” PLoS ONE, vol. 5,no. 11, pp. 1–13, Nov. 2010.[27] B. De Longueville, R. S. Smith, and G. Luraschi, “OMG, from here, I cansee the flames!: A use case of mining location based social networks toacquire spatio-temporal data on forest fires,” in Proc. Int. Work. LBSN,2009 Seattle, WA, USA, pp. 73–80.[28] J. Yin, A. Lampert, M. Cameron, B. Robinson, and R. Power, “Usingsocial media to enhance emergency situation awareness,” IEEE Intell.Syst., vol. 27, no. 6, pp. 52–59, Nov./Dec. 2012.[29] P. Agarwal, R. Vaithiyanathan, S. Sharma, and G. Shro, “Catching thelong-tail: Extracting local news events from Twitter,” in Proc. 6th AAAIICWSM, Dublin, Ireland, Jun. 2012, pp. 379–382.[30] F. Abel, C. Hauff, G.-J. Houben, R. Stronkman, and K. Tao,“Twitcident: fighting fire with information from social web streams,”in Proc. ACM 21st Int. Conf. Comp. WWW, Lyon, France, 2012,pp. 305–308.[31] R. Li, K. H. Lei, R. Khadiwala, and K. C.-C. Chang, “TEDAS: A Twitterbasedevent detection and analysis system,” in Proc. 28th IEEE ICDE,Washington, DC, USA, 2012, pp. 1273–1276.[32] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, andI. H. Witten, “The WEKA data mining software: An update,” SIGKDDExplor. Newsl., vol. 11, no. 1, pp. 10–18, Jun. 2009.[33] M. Habibi, Real World Regular Expressions with Java 1.4. Berlin,Germany: Springer-Verlag, 2004.[34] Y. Zhou and Z.-W. Cao, “Research on the construction and filter methodof stop-word list in text preprocessing,” in Proc. 4th ICICTA, Shenzhen,China, 2011, vol. 1, pp. 217–221.[35] W. Francis and H. Kucera, “Frequency analysis of English usage:Lexicon and grammar,” J. English Linguistics, vol. 18, no. 1, pp. 64–70,Apr. 1982.[36] M. F. Porter, “An algorithm for suffix stripping,” Program: Electron.Library Inf. Syst., vol. 14, no. 3, pp 130–137, 1980.[37] G. Salton and C. Buckley, “Term-weighting approaches in automatic textretrieval,” Inf. Process. Manage., vol. 24, no. 5, pp. 513–523, Aug. 1988.[38] L. M. Aiello et al., “Sensing trending topics in Twitter,” IEEE Trans.Multimedia, vol. 15, no. 6, pp. 1268–1282, Oct. 2013.[39] C. Shang, M. Li, S. Feng, Q. Jiang, and J. Fan, “Feature selection viamaximizing global information gain for text classification,” Knowl.-BasedSyst., vol. 54, pp. 298–309, Dec. 2013.[40] L. H. Patil and M. Atique, “A novel feature selection based on informationgain using WordNet,” in Proc. SAI Conf., London, U.K., 2013,pp. 625–629.[41] M. A. Hall and G. Holmes. “Benchmarking attribute selection techniquesfor discrete class data mining,” IEEE Trans. Knowl. Data Eng., vol. 15,no. 6, pp. 1437–1447, Nov./Dec. 2003.[42] H. U˘guz, “A two-stage feature selection method for text categorization byusing information gain, principal component analysis and genetic algorithm,”Knowl.-Based Syst., vol. 24, no. 7, pp. 1024–1032, Oct. 2011.[43] Y. Aphinyanaphongs et al., “A comprehensive empirical comparisonof modern supervised classification and feature selection methods fortext categorization,” J. Assoc. Inf. Sci. Technol., vol. 65, no. 10,pp. 1964–1987, Oct. 2014.[44] J. Platt, “Fast training of support vector machines using sequentialminimal optimization,” in Advances in Kernel Methods: Support VectorLearning, B. Schoelkopf, C. J. C. Burges and A. J. Smola, Eds.Cambridge, MA, USA, MIT Press, 1999, pp. 185–208.[45] G. H. John and P. Langley, “Estimating continuous distributions inBayesian classifiers,” in Proc. 11th Conf. Uncertainty Artif. Intell.,San Mateo, CA, 1995, pp. 338–345.[46] J. R. Quinlan, C4.5: Programs for Machine Learning. San Mateo, CA,USA: Morgan Kaufmann, 1993.[47] D. W. Aha, D. Kibler, and M. K. Albert, “Instance-based learningalgorithms,” Mach. Learn., vol. 6, no. 1, pp. 37–66, Jan. 1991.[48] E. Frank and I. H. Witten, “Generating accurate rule sets withoutglobal optimization,” in Proc. 15th ICML, Madison, WI, USA, 1998,pp. 144–151.[49] C. Cortes and V. Vapnik, “Support-vector networks,” Mach. Learn.,vol. 20, no. 3, pp. 273–297, Sep. 1995.[50] T. T. Cover and P. E. Hart, “Nearest neighbour pattern classification,”IEEE Trans. Inf. Theory, vol. IT-13, no. 1, pp. 21–27, Jan. 1967.[51] W. W. Cohen, “Fast effective rule induction,” in Proc. 12th ICML, TahoeCity, CA, USA, 1995, pp. 115–123.[52] G. Pagallo and D. Haussler, “Boolean feature discovery in empiricallearning,” Mach. Learn., vol. 5, no. 1, pp. 71–99, Mar. 1990.[53] J. Derrac, S. Garcia, D. Molina, and F. Herrera, “A practical tutorial onthe use of nonparametric statistical tests as a methodology for comparingevolutionary and swarm intelligence algorithms,” Swarm Evol. Comput.,vol. 1, no. 1, pp. 3–18, Mar. 2011.[54] F. Wilcoxon, “Individual comparisons by ranking methods,” BiometricsBull. , vol. 1, no. 6, pp. 80–83, Dec. 1945.[55] H. Becker, M. Naaman, and L. Gravano, “Beyond trending topics:Real-world event identification on Twitter,” in Proc. 5th AAAI ICWSM,Barcelona, Spain, 2011, pp. 438–441.[56] H. Kwak, C. Lee, H. Park, and S. Moon, “What is Twitter, a social networkor a news media?” in Proc. ACM 19th Int. Conf. WWW, Raleigh, NY,USA, 2010, pp. 591–600.Eleonora D’Andrea received the M.S. degree incomputer engineering for enterprise managementand the Ph.D. degree in information engineeringfrom University of Pisa, Pisa, Italy, in 2009 and 2013,respectively.She is a Research Fellow with the Research Center“E. Piaggio,” University of Pisa. She has coauthoredseveral papers in international journals and conferenceproceedings. Her main research interests includecomputational intelligence techniques for simulationand prediction, applied to various fields, suchas energy consumption in buildings or energy production in solar photovoltaicinstallations.D’ANDREA et al.: REAL-TIME DETECTION OF TRAFFIC FROM TWITTER STREAM ANALYSIS 2283Pietro Ducange received the M.Sc. degree in computerengineering and the Ph.D. degree in informationengineering from University of Pisa, Pisa, Italy,in 2005 and 2009, respectively.He is an Associate Professor with the Faculty ofEngineering, eCampus University, Novedrate, Italy.He has coauthored more than 30 papers in internationaljournals and conference proceedings. Hismain research interests focus on designing fuzzyrule-based systems with different tradeoffs betweenaccuracy and interpretability by using multiobjectiveevolutionary algorithms. He currently serves the following international journalsas a member of the Editorial Board: Soft Computing and InternationalJournal of Swarm Intelligence and Evolutionary Computation.Beatrice Lazzerini (M’98) is a Full Professor withthe Department of Information Engineering, Universityof Pisa, Pisa, Italy. She has cofounded theComputational Intelligence Group in the Departmentof Information Engineering, University of Pisa. Shehas coauthored seven books and has published over200 papers in international journals and conferences.She is a coeditor of two books. Her research interestsare in the field of computational intelligence and itsapplications to pattern classification, pattern recognition,risk analysis, risk management, diagnosis,forecasting, and multicriteria decision making. She was involved and hadroles of responsibility in several national and international research projects,conferences, and scientific events.Francesco Marcelloni (M’06) received the Laureadegree in electronics engineering and the Ph.D. degreein computer engineering from University ofPisa, Pisa, Italy, in 1991 and 1996, respectively.He is an Associate Professor with University ofPisa. He has cofounded the Computational IntelligenceGroup in the Department of Information Engineering,University of Pisa, in 2002. He is alsothe Founder and Head of the Competence Centreon MObile Value Added Services (MOVAS). Hehas coedited three volumes and four journal SpecialIssues and is the (co)author of a book and of more than 190 papers ininternational journals, books, and conference proceedings. His main researchinterests include multiobjective evolutionary fuzzy systems, situation-awareservice recommenders, energy-efficient data compression and aggregation inwireless sensor nodes, and monitoring systems for energy efficiency in buildings.Currently, he is an Associate Editor for Information Sciences (Elsevier)and Soft Computing (Springer) and is on the Editorial Board of four otherinternational journals.