Network-Based Modeling and Intelligent Data Mining of Social Media for Improving Care

Abstract—Intelligently extracting knowledge from social media has recently attracted great interest from the biomedical and health informatics community as a way to simultaneously improve healthcare outcomes and reduce costs using consumer-generated opinion. We propose a two-step analysis framework that focuses on positive and negative sentiment, as well as the side effects of treatment, in users’ forum posts, and identifies user communities (modules) and influential users for the purpose of ascertaining user opinion of cancer treatment. We used a self-organizing map to analyze word frequency data derived from users’ forum posts. We then introduced a novel network-based approach for modeling users’ forum interactions and employed a network partitioning method based on optimizing a stability quality measure. This allowed us to determine consumer opinion and identify influential users within the retrieved modules using information derived from both word-frequency data and network-based properties. Our approach can expand research into intelligently mining social media data for consumer opinion of various treatments to provide rapid, up-to-date information for the pharmaceutical industry, hospitals, and medical staff on the effectiveness (or ineffectiveness) of future treatments.

Index Terms—Data mining, complex networks, neural networks, semantic web, social computing.

I. INTRODUCTION

Social media is providing limitless opportunities for patients to discuss their experiences with drugs and devices, and for companies to receive feedback on their products and services [1]–[3]. Pharmaceutical companies are prioritizing social network monitoring within their IT departments, creating an opportunity for rapid dissemination and feedback of products and services to optimize and enhance delivery, increase turnover and profit, and reduce costs [4].
Social media data harvesting for bio-surveillance has also been reported [5].

Social media enables a virtual networking environment. Modeling social media using available network modeling and computational tools is one way of extracting knowledge and trends from the information ‘cloud’: a social network is a structure made of nodes and of edges that connect the nodes in various relationships. Graphical representation is the most common way to visualize this information. Network modeling could also be used to simulate and study network properties and internal dynamics.

Manuscript received January 24, 2014; revised May 4, 2014 and June 19, 2014; accepted June 30, 2014. Date of publication July 10, 2014; date of current version December 30, 2014.
A. Akay and B.-E. Erlandsson are with the School of Technology and Health, Royal Institute of Technology, Stockholm SE-141 52, Sweden (e-mail: altu@kth.se; bjorn-erik.erlandsson@sth.kth.se).
A. Dragomir is with the Department of Biomedical Engineering, University of Houston, Houston, TX 77204–5060 USA (e-mail: adragomir@uh.edu).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/JBHI.2014.2336251

A sociomatrix can be used to construct representations of a social network structure. Node degree, network density, and other large-scale parameters can be used to derive information about the importance of certain entities within the network. Groups of densely connected entities (communities) are referred to as clusters or modules. Specific algorithms can perform network clustering, one of the fundamental tasks in network analysis. Detecting particular user communities requires identifying specific, networked nodes that will allow information extraction. Healthcare providers could use patient opinion to improve their services. Physicians could collect feedback from other doctors and patients to improve their treatment recommendations and results.
Patients could use other consumers’ knowledge in making better-informed healthcare decisions.

The nature of social networks makes data collection difficult. Several methods have been employed, such as link mining [6], classification through links [7], predictions based on objects [8], links [9], existence [10], estimation [11], object [12], group [13], and subgroup detection [14], and mining the data [15], [16]. Link prediction, viral marketing, and online discussion groups (and rankings) allow for the development of solutions based on user feedback.

Traditional social sciences use surveys and involve subjects in the data collection process, resulting in small sample sizes per study. With social media, more content is readily available, particularly when combined with web-crawling and scraping software that would allow real-time monitoring of changes within the network.

Previous studies used technical solutions to extract user sentiment on influenza [17], technology stocks [18], context and sentence structure [19], online shopping [20], multiple classifications [21], government health monitoring [22], specific terms relating to consumer satisfaction [23], polarity of newspaper articles [24], and assessment of user satisfaction from companies [25], [26]. Despite the extensive literature, none of these studies have identified influential users or examined how forum relationships affect network dynamics.

In the first stage of our current study, we employ exploratory analysis using self-organizing maps (SOMs) to assess correlations between user posts and positive or negative opinion on the drug. In the second stage, we model the users and their posts using a network-based approach. We build on our previous study [27] and use an enhanced method for identifying user communities (modules) and influential users therein. The current approach effectively searches for potential levels of organization (scales) within the networks and uncovers dense modules using a partition stability quality measure [28]. The approach enables us to find the optimal network partition. We subsequently enrich the retrieved modules with word frequency information from the posts of the users contained in each module to derive local and global measures of user opinion and to raise flags on potential side effects of Erlotinib, a drug used in the treatment of one of the most prevalent cancers: lung cancer [29].

2168-2194 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
AKAY et al.: NETWORK-BASED MODELING AND INTELLIGENT DATA MINING OF SOCIAL MEDIA FOR IMPROVING CARE

Fig. 1. Processing tree in Rapidminer to ascertain the TF-IDF scores of words in the data.

II. METHODS

A. Initial Data Search and Collection

We first searched for the most popular cancer message boards. We initially focused on the number of posts on lung cancer. The table below gives the number of posts on lung cancer per forum:

Forum                      Posts on Lung Cancer
Cancer-forums.net          36 051
cancerforums.net           34 328
forums.stupidcancer.org    17
csn.cancer.org/forum       7959

We chose lung cancer because, according to the most recent statistics, it is the most commonly diagnosed cancer in the world for both sexes [30], and the second most prevalent cancer in the US for both sexes [31], [32]. We then compiled a list of drugs used by lung cancer patients to ascertain which drug was the most discussed in the forums. The drug Erlotinib (trade name Tarceva) was the most frequently discussed drug in the message boards. A further search revealed that Cancerforums.net, despite having slightly fewer posts on lung cancer, had more posts dedicated to Erlotinib than the other three message boards mentioned above.

Next, we performed a search of the drug, using both the trade name (Tarceva) and drug name (Erlotinib).
The trade name garnered more results (498) compared to the drug name (66). The search using the trade name returned 920 posts, from 2009 to the present date.

B. Initial Text Mining and Preprocessing

A Rapidminer (www.rapidminer.com) [33] data collection and processing tree was developed to look for the most common positive and negative words, and their term frequency-inverse document frequency (TF-IDF) scores within each post. Fig. 1 shows the data collection and processing tree. We initially uploaded the data into the first component (‘Read Excel’). The uploaded data was then processed in the second component (‘Process Documents to Data’) using several subcomponents (‘Extract Content’, ‘Tokenize’, ‘Transform Cases’, ‘Filter Stopwords’, and ‘Filter Tokens’, respectively) that filtered excess noise (misspelled words, common stop words, etc.) to ensure a uniform set of variables that can be measured. The final component (‘Processed Data’) contained the final word list, with each word carrying a specific TF-IDF score.

We then assigned weights to each of the words found in the user posts using the following formula:

$$\mathrm{weight}_{t,d} = \begin{cases} \log(\mathrm{tf}_{t,d} + 1)\,\log\dfrac{n}{x_t}, & \text{if } \mathrm{tf}_{t,d} \geq 1 \\ 0, & \text{otherwise} \end{cases}$$

in which $\mathrm{tf}_{t,d}$ represents the frequency of word $t$ in document $d$, $n$ represents the number of documents within the entire collection, and $x_t$ represents the number of documents where $t$ occurs [30].

C. Cataloging and Tagging Text Data

Text data containing the highest TF-IDF scores were tagged with a modified NLTK toolkit (http://www.nltk.org/) [34] using MATLAB to ensure that they reflected the negativity of a negative word and the positivity of a positive word in context. This approach has been used previously to place negative tags on positive words [35]. We additionally placed positive tags on negative words. We used the NLTK toolkit for the analysis and classification of words to match their exact meanings within the contextual settings.
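The tagging idea above can be illustrated with a minimal rule-based sketch. It does not use NLTK and does not reproduce the modified toolkit used in the study; the negator list, the small word subsets, the adverse cue words, and the three-token window are all assumptions chosen for illustration.

```python
# Illustrative subsets only; the study's wordlists are much larger.
NEGATORS = {"no", "not", "never", "cannot"}
POSITIVE = {"great", "good", "happy"}
ADVERSE_CUES = {"side", "pain", "nausea", "rash"}  # hypothetical cue words


def tag_context(tokens, window=3):
    """Return a copy of tokens with context tags appended.

    A positive word within `window` tokens after a negator gets '_n'.
    A negator directly followed by an adverse cue word gets '_p'.
    """
    tagged = list(tokens)
    for i, tok in enumerate(tokens):
        if tok.lower() not in NEGATORS:
            continue
        following = [t.lower() for t in tokens[i + 1 : i + 1 + window]]
        # Negating something adverse reads as positive: tag the negator.
        if following and following[0] in ADVERSE_CUES:
            tagged[i] = tok + "_p"
            continue
        # Otherwise, any positive word in the window is negated.
        for offset, nxt in enumerate(following, start=1):
            if nxt in POSITIVE:
                tagged[i + offset] = tokens[i + offset] + "_n"
    return tagged


negated = tag_context("I do not feel great".split())
relieved = tag_context("No side effects so I am happy".split())
```

A real tagger would also need punctuation handling and a principled negation scope; the fixed window here is only the simplest choice.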
For example, the context should be considered in phrases such as ‘I do not feel great’ so that the term ‘great’ would be adequately tagged as a negative one (in our case it is tagged as ‘great_n’ before it is returned to its specific position). Das and Chen used a similar approach in classifying words [18]. We went one step further and considered positive tags on negative words. A sentence that states ‘No side effects so I am happy!’ resulted in the word ‘No’ being tagged as ‘No_p’ (reflecting its positive context) before it is returned to its specific position. These tagged words were thus reclassified based on the context of the post.

We then reduced the number of similar words, both manually (checking the words using online dictionaries such as Merriam-Webster (http://www.merriam-webster.com/)) and automatically (using synonym database software such as the Thesaurus Synonym Database (http://www.language-databases.com/) and Google’s synonym search finder).

Our final wordlist was pruned using the aforementioned methods, with the results displayed in Table I, divided into positive and negative words.

We eliminated each word that appeared fewer than ten times. This allowed us to achieve a uniform set of measurements while eliminating statistically insignificant outliers. The end result was a modified wordlist of 110 words (55 positive words and 55 negative words) shown in Table I.

IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, VOL. 19, NO. 1, JANUARY 2015

TABLE I
FINAL POST ANALYSIS WORDLIST

Positive       Negative
Agree          Bad
Appreciate     Cannot
Beneficial     Concern
Benefit        Concerns
Comfort        Damage
Comfortable    Dangerous
Ease           Depression
Easier         Didn
Effective      Died
Enjoy          Difficult
Favorable      Discomfort
Favorably      Don’t
Feasible       Doubt
Good           Error
Grateful       Failure
Great          Fear
Greater        Hard
Greatest       Hasn
Greatly        Hate
Help           Hurt
Helped         Impossible
Helpful        Isn
Helping        Lack
Helps          Limited
Honest         Lose
Honestly       Loss
Hope           Lost
Hoped          Miss
Hopeful        Nasty
Hopefully      Nausea
Hopes          Negative
Hoping         No
Importance     Not
Important      Pain
Importantly    Painful
Impresses      Poor
Improve        Problem
Improved       Problems
Improvement    Sad
Improves       Sacred
Improving      Scary
Inspiration    Severe
Like           Sorry
Love           Sucks
Loved          Suffer
Positive       Suffering
Right          Terrible
Success        Unable
Successful     Unfortunately
Support        Wasn
Thank          Weak
Thanks         Worried
Useful         Worse
Well           Worst
Wonderful      Wrong

In a parallel procedure, we automatically browsed the user posts to look for side effects of Erlotinib. To this goal, we used the National Library of Medicine’s Medical Subject Headings (MeSH), which is a controlled vocabulary (http://www.nlm.nih.gov/mesh/) that consists of a hierarchy of descriptors and qualifiers that are used to annotate medical terms. A custom-designed program was used to map words in the forum to the MeSH database. A list of words present in forum posts that were associated with treatment side effects was thus compiled. This was done by selecting the words simultaneously annotated with a specific list of qualifiers in MeSH (CI: chemically induced; CO: complications; DI: diagnosis; PA: pathology; and PP: physiopathology). We then compared the full list of side effect words with the results that were fed into the Rapidminer processing tree: we kept the side effect words with the highest TF-IDF scores (ensuring that each word appeared at least ten times in the forum posts).

Fig. 2. Thread model where nodes represent users/posts and the edges represent information transferred among users.

TABLE II
FINAL SIDE EFFECTS WORDLIST

Acne
Cachexia
Headaches
Itching
Lesion
Pneumonia
Rash
Tremor
Weakness
Vomiting

Table II shows the final wordlist of the side effects.
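As a rough illustration of the scoring and frequency filtering used to build the wordlists, the sketch below computes the weights defined in Section II-B and keeps only candidate terms that occur often enough. It is not the Rapidminer process; the function names, the toy posts, and the use of the natural logarithm are assumptions (the text does not state the log base).

```python
import math
from collections import Counter


def tfidf_weights(documents):
    """Return one {term: weight} dict per tokenized document.

    weight(t, d) = log(tf + 1) * log(n / x_t) when tf >= 1; terms that do
    not occur in a document are simply absent (weight 0).
    """
    n = len(documents)
    # x_t: number of documents in which term t occurs at least once
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)
        weights.append(
            {t: math.log(f + 1) * math.log(n / doc_freq[t]) for t, f in tf.items()}
        )
    return weights


def frequent_terms(documents, candidates, min_count=10):
    """Candidate terms whose total count over all documents is >= min_count."""
    totals = Counter(term for doc in documents for term in doc)
    return sorted(t for t in candidates if totals[t] >= min_count)


# Hypothetical, already tokenized and lowercased posts
posts = [["rash", "bad", "rash"], ["good", "hope"], ["good", "rash"]]
weights = tfidf_weights(posts)
kept = frequent_terms(posts, {"rash", "good", "tremor"}, min_count=2)
```

The study uses a threshold of ten occurrences; the toy call lowers it to two only because the example corpus is tiny.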
We subjected the initial side effect wordlist to the same methods that were used for Table I.

After these preprocessing steps, our forum data was represented as two sets of vectors containing the TF-IDF scores of the words in the two wordlists. Namely, each user post in the forum was transformed into a vector of 110 variables representing the TF-IDF scores of positive and negative words, and a 10-variable vector containing the TF-IDF scores corresponding to the side effect terms (see Fig. 3, steps A1-A3).

D. Consumer Sentiment Using a SOM

For this part of the analysis, all posts were manually labeled as positive or negative, according to the general user opinion observed within the post, before feeding the collected data for exploratory analysis via SOMs. The manual labels allowed us to validate the results.

SOMs are neural networks that produce a low-dimensional representation of high-dimensional data [33]. Within this network, a layer represents the output space, with each neuron assigned a specific weight vector. The weight values reflect the cluster content. As data are presented to the network, similar inputs are brought together on the same or similar neurons. These capabilities have been demonstrated in cases where, despite the reduction of the space size, the information and identification schema of the clusters remained the same [36].

When new data is fed into the network, the closest weights matching the data change to reflect the new data. The neurons farther from the new data rarely change. This process continues until data is no longer fed, resulting in a two-dimensional map.

The SOM toolbox (www.cis.hut.fi/projects/somtoolbox) [37] was used and the SOM was fed with the TF-IDF vectors of our first wordlist (see Table I). The purpose was to assess the existence of clusters in the data and how the SOM weights of these clusters would correlate to positive and negative opinion.
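The update behaviour described in this section (the best matching neuron and its map neighbors move toward each new input, while distant neurons barely change) can be sketched in plain Python. This is not the SOM Toolbox; the linear decay schedules, the Gaussian neighborhood, and all function and parameter names are assumptions made for illustration.

```python
import math
import random


def best_matching_unit(weights, x):
    """Grid position (row, col) of the neuron whose weight vector is closest to x."""
    return min(
        weights,
        key=lambda rc: sum((w - v) ** 2 for w, v in zip(weights[rc], x)),
    )


def train_som(data, rows, cols, epochs=20, lr0=0.5, sigma0=None, seed=0):
    """Train a rows x cols map; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    sigma0 = sigma0 or max(rows, cols) / 2
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in rng.sample(data, len(data)):  # shuffled pass over the data
            frac = step / total_steps
            lr = lr0 * (1 - frac)                # learning rate decays to 0
            sigma = sigma0 * (1 - frac) + 1e-3   # neighborhood radius shrinks
            br, bc = best_matching_unit(weights, x)
            for (r, c), w in weights.items():
                grid_dist2 = (r - br) ** 2 + (c - bc) ** 2
                h = math.exp(-grid_dist2 / (2 * sigma ** 2))
                for k in range(dim):
                    w[k] += lr * h * (x[k] - w[k])  # pull toward the input
            step += 1
    return weights


# Hypothetical 2-D inputs standing in for the 110-variable TF-IDF vectors
toy_data = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
som = train_som(toy_data, rows=3, cols=3)
```

Posts can then be assigned to map regions by calling `best_matching_unit(som, vector)` on each post vector.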
The SOM was trained using various map sizes, using quantization and topographic errors as validation measures. The former is the average distance between every input vector and its best matching neuron (BMN), and measures how well the trained map fits the input data [33]. The latter measures how well the map preserves the topology of the data: it is calculated as the proportion of input vectors whose first and second BMNs are not adjacent on the map.

The best map size was based on the minimum values of the quantization (0.24) and topographic (105) errors. The wordlist data was mapped and the emerging weights were analyzed for positive and negative variable correlations of the wordlist. Words of no interest, and groups containing three or fewer words, were eliminated.

Subgroups were visually identified and analyzed for further information on the consumer opinion of Erlotinib.

E. Modeling Forum Postings Using Network Analysis

Discovering influential users was the next step in our analysis. To this goal, we built networks from forum posts and their replies, while accounting for content-based grouping of posts resulting from the existing forum threads. Networks are composed of nodes and their connections: they are either nondirectional (a connection between two nodes without a direction) or directional (a connection with an origin and an end). In a directional network, the nodal degree is split into the number of incoming connections (in-degree) and outgoing connections (out-degree). Four node types have been identified [38] within a network: isolated, transmitter, receptor, and carrier. The network’s density measures the number of existing connections relative to the number of possible connections.

Network-based analysis is widely used in social network analysis because of its ability to both model and analyze intersocial dynamics.
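The node types and density mentioned in Section II-E can be computed directly from an edge list. The sketch below assumes the usual degree-based definitions (isolated: no edges; transmitter: outgoing edges only; receptor: incoming edges only; carrier: both) and takes density as the fraction of possible directed edges that are present. The function names and the toy network are illustrative.

```python
def classify_nodes(nodes, edges):
    """Label each node by its in- and out-degree in a directed network."""
    indeg = {n: 0 for n in nodes}
    outdeg = {n: 0 for n in nodes}
    for src, dst in edges:
        outdeg[src] += 1
        indeg[dst] += 1
    labels = {}
    for n in nodes:
        if indeg[n] == 0 and outdeg[n] == 0:
            labels[n] = "isolated"
        elif indeg[n] == 0:
            labels[n] = "transmitter"  # only sends information
        elif outdeg[n] == 0:
            labels[n] = "receptor"     # only receives information
        else:
            labels[n] = "carrier"      # both receives and sends
    return labels


def density(nodes, edges):
    """Fraction of the n(n-1) possible directed edges that are present."""
    n = len(nodes)
    return len(set(edges)) / (n * (n - 1)) if n > 1 else 0.0


# Hypothetical four-user network
toy_nodes = [1, 2, 3, 4]
toy_edges = [(1, 2), (2, 3), (1, 3)]
labels = classify_nodes(toy_nodes, toy_edges)
toy_density = density(toy_nodes, toy_edges)
```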
We devised a directional network model due to the nature of the forum under scrutiny (multiple threads with multiple thread initiators) and its internal dynamics among the members (members reply to thread initiators as well as to other users). Fig. 2 describes the approach we chose to build our network, showing how each posting-reply pair is modeled. Based on the nature of the forum, all of the posters within each thread are context posters for the thread initiator (e.g., Node 1 is the thread initiator in Fig. 2 and Nodes 2, 3, 4, and 5 represent context posters). Thus, all of the posters receive an incoming edge from the thread initiator. Some context posters respond directly to another poster, using the forum option ‘Reply.’ We used bidirectional edges to reflect the ensuing information transfer from the poster to the replier and vice versa (in Fig. 2, Node 5 is a direct replier to Node 4, as is Node 3 to Node 2). This user-interaction model allowed us to build a network that faithfully reflects the information content of the forum.

Fig. 3. Diagram describing the framework of our network-based analysis. First, the posts collected from the forum via Rapidminer are preprocessed using the NLTK Toolbox (Step A1) and transformed into two wordlists (Step A2). For this step, direct mapping to the MeSH vocabulary is used to identify words representing side-effects. Based on the two wordlists, forum posts are transformed into numerical vectors containing word-frequency based TF-IDF scores (Step A3). In parallel, forum posts and replies are modeled as a directed network (Step B1). The obtained network is further refined to identify communities/modules of highly interacting users, based on the MCSD method [28] (Step B2). Finally, the two wordlist vector datasets (their info reflecting the forum information content) are overlaid onto the network modules to identify influential users and highlight side-effects intensively discussed within the modules, respectively (Step B3).

F. Identifying Subgraphs

Our modeling framework has consequently converted the forum posts into several large directional networks containing a number of densely connected units (or modules) (see Fig. 3, step B1). These modules have the characteristic that they are more densely connected internally (within the unit) than externally (outside the unit). We chose a multiscale method that uses local and global criteria for identifying the modules, while maximizing a partition quality measure called stability [28].

The stability measure considers the network as a Markov chain, with nodes representing states and edges being possible transitions among these states. In [28], the authors proposed an approach in which transition probabilities for a random walk of length $t$ ($t$ being the Markov time) enable multiscale analysis. With increasing scale $t$, larger and larger modules are found. The stability of a walk of length $t$ can be expressed as

$$Q_{M_t} = \frac{1}{2m} \sum_{i,j} \left[ (A_t)_{ij} - \frac{d_i d_j}{2m} \right] \delta(i,j) \tag{1}$$

where $A_t$ is the adjacency matrix, $t$ is the length of the random walk, $m$ is the number of edges, $i$ and $j$ are nodes, $d_i$ and $d_j$ are the strengths of nodes $i$ and $j$, and $\delta(i,j)$ equals one if nodes $i$ and $j$ belong to the same module and zero otherwise. $A_t$ is computed as follows (in order to account for the random walk): $A_t = D \cdot M^t$, where $M = D^{-1} \cdot A$ ($D$ being the diagonal matrix containing the degree vector giving for each node its degree) [28].

The method for identifying the optimal modules is based on alternating local and global criteria that expand modules by adding neighbor nodes, reassigning nodes to different modules, and merging significantly overlapping modules until no further optimization is feasible, according to (1).
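To make (1) concrete, the following sketch evaluates the stability of a given partition. It is only an evaluation of the quality measure, not the module search of [28]; it assumes a symmetric (undirected) adjacency matrix with no isolated nodes and an integer walk length, whereas the study also uses fractional values of t. All names are illustrative.

```python
def mat_mul(a, b):
    """Plain-Python matrix product of two lists of lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def stability(adj, modules, t=1):
    """Stability (1) of a partition for a random walk of integer length t.

    adj     : symmetric adjacency matrix, every node with degree > 0
    modules : modules[i] is the module label of node i
    """
    n = len(adj)
    deg = [sum(row) for row in adj]
    two_m = sum(deg)
    # M = D^{-1} A: one-step transition probabilities of the random walk
    m_step = [[adj[i][j] / deg[i] for j in range(n)] for i in range(n)]
    # M^t by repeated multiplication, starting from the identity
    m_t = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(t):
        m_t = mat_mul(m_t, m_step)
    q = 0.0
    for i in range(n):
        for j in range(n):
            if modules[i] == modules[j]:       # delta(i, j) = 1
                a_t_ij = deg[i] * m_t[i][j]    # entry of A_t = D M^t
                q += a_t_ij - deg[i] * deg[j] / two_m
    return q / two_m


# Hypothetical network: two disconnected triangles
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
two_triangles = [row + [0, 0, 0] for row in triangle] + [
    [0, 0, 0] + row for row in triangle
]
q_split = stability(two_triangles, [0, 0, 0, 1, 1, 1], t=1)
q_single = stability(two_triangles, [0] * 6, t=2)
```

For t = 1 the measure reduces to the familiar modularity, which is why the natural two-module split of the toy network scores higher than lumping all nodes together.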
The approach follows similar methods presented in [28], [39], and [40].

Several partitioning schemes were obtained depending on the range of scales employed by the method, with the optimal partitioning having the largest stability. We named the modules thus retrieved information modules (see step B2 in Fig. 3).

G. Module Average Opinion and User Average Opinion

We then proceeded to refine the information modules by feeding them with the information obtained from the forum posts (using the wordlist vectors). In a first step, we aimed at identifying influential users within our networks. Influential users are users who broker most of the information transfer within network modules and whose opinion, in terms of positive or negative sentiment towards the treatment, is ‘spread’ to the other users within their containing modules. To this goal, we enriched the information modules obtained as described in Section II-F with the TF-IDF scores of the user posts corresponding to the users found in each module. The TF-IDF scores from the wordlist of positive and negative words (see Table I) were used to build two forms of measurement. The global measure (pertaining to the whole information module) is represented by the module average opinion (MAO). It examines the TF-IDF scores of postings matching the nodes in a specific module:

$$\mathrm{MAO} = \frac{\mathrm{Sum}_{+} - \mathrm{Sum}_{-}}{\mathrm{Sum}_{\mathrm{all}}}$$

$\mathrm{Sum}_{+} = \sum_{i} \sum_{j} x_{ij}$ is the total sum of the TF-IDF scores matching the positive words in the wordlist vectors within the module. The index $i$ is the post index. The index $j$ is the wordlist index (running over the positive words in the list).

$\mathrm{Sum}_{-} = \sum_{i} \sum_{j} x_{ij}$ is the total sum of the TF-IDF scores matching the negative words in the wordlist vectors within the module. The index $i$ is the post index. The index $j$ is the wordlist index (running over the negative words in the list).

$\mathrm{Sum}_{\mathrm{all}} = \sum_{i=1}^{N} \sum_{k=1}^{M} x_{ik}$ is the sum of both of the aforementioned sums.
The index $k$ runs across all variables of the entire wordlist.

The local measure, the user average opinion (UAO), captures the opinion of the specific user at each node in the module. It is computed from the TF-IDF scores averaged over the collected posts of the user:

$$\mathrm{UAO}_i = \frac{\mathrm{Sum}_{i+} - \mathrm{Sum}_{i-}}{\mathrm{Sum}_{i,\mathrm{all}}}$$

$\mathrm{Sum}_{i+} = \sum_{j \in P} x_{ij}$ is the sum of the TF-IDF scores matching positive words in the $i$th user’s wordlist vector. $P$ is the index set denoting the wordlist’s positive variables.

$\mathrm{Sum}_{i-} = \sum_{j \in N} x_{ij}$ is the sum of the TF-IDF scores matching negative words in the $i$th user’s wordlist vector. $N$ is the index set denoting the wordlist’s negative variables.

$\mathrm{Sum}_{i,\mathrm{all}} = \sum_{j=1}^{M} x_{ij}$ is the total of both sums, and $j$ is the index over the whole wordlist.

Fig. 4. U-Matrix of the posts from the Cancerforums.net forum.

H. Information Brokers Within the Information Modules

We first ranked individual nodes in terms of their total number of connecting edges (in- and out-degree) to identify influential users within the modules. We then selected nodes in each module based on the following criteria:
1) The nodes have the highest degrees within the module (highest number of edges).
2) The sign of the node’s UAO score matches the sign of the MAO of the containing module.

The nodes that satisfied these criteria were dubbed information brokers. Their large nodal degrees ensure increased information transfer compared to other nodes, while their matching UAO and MAO scores reflect consistency of positive or negative opinion within the containing module.

I. Network-Based Identification of Side Effects

In the second step of our network-based analysis, we devised a strategy for identifying potential side effects that occur during the treatment and that user posts on the forum highlight. To this goal, we overlay the TF-IDF scores of the second wordlist (see Table II) onto modules obtained in Section II-F.
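As an aside on Sections II-G and II-H, the two opinion measures and the broker criteria can be made concrete with a short sketch. It assumes one averaged wordlist vector per user and reads both measures as (positive sum minus negative sum) divided by the total; the `top_k` cutoff for "highest degree" is hypothetical, since the text gives no threshold, and all names are illustrative.

```python
def opinion_score(vectors, positive_idx, negative_idx):
    """(Sum+ - Sum-) / Sum_all over one or more wordlist vectors."""
    pos = sum(v[j] for v in vectors for j in positive_idx)
    neg = sum(v[j] for v in vectors for j in negative_idx)
    total = sum(sum(v) for v in vectors)
    return (pos - neg) / total if total else 0.0


def information_brokers(users, user_vectors, degree, positive_idx, negative_idx, top_k=2):
    """Return (MAO, brokers) for one module.

    users        : user ids in the module
    user_vectors : {user: averaged TF-IDF wordlist vector}
    degree       : {user: in-degree + out-degree}
    """
    mao = opinion_score([user_vectors[u] for u in users], positive_idx, negative_idx)
    # Criterion 1: the highest-degree nodes in the module (top_k is assumed)
    ranked = sorted(users, key=lambda u: degree[u], reverse=True)
    brokers = []
    for u in ranked[:top_k]:
        uao = opinion_score([user_vectors[u]], positive_idx, negative_idx)
        # Criterion 2: the UAO sign matches the sign of the module's MAO
        if uao * mao > 0:
            brokers.append(u)
    return mao, brokers


# Hypothetical module: columns 0-1 are positive words, 2-3 negative words
vectors = {"a": [2, 1, 0, 1], "b": [0, 0, 1, 1], "c": [1, 1, 0, 0]}
degrees = {"a": 5, "b": 4, "c": 1}
mao, brokers = information_brokers(["a", "b", "c"], vectors, degrees, [0, 1], [2, 3])
```

In this toy module user "b" is well connected but holds an opinion opposite to the module average, so only "a" qualifies.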
The TF-IDF scores within each module will thus directly reflect how frequently a certain side effect is mentioned in module posts. Subsequently, a statistical test (such as the t-test, for example) can be used to compare the values of the TF-IDF scores within the module to those of the overall forum population and identify variables (side effects) that have significantly higher scores. Fig. 3 presents a diagram that visually describes the steps in our network-based analysis.

III. RESULTS

TABLE III
USER OPINION OF ERLOTINIB

Satisfaction                           Dissatisfaction
70 percent                             30 percent

BREAKDOWN OF USER OPINION
Fully Satisfied (23)                   Full Dissatisfaction (4)
Satisfied Despite Side Effects (37)    Dissatisfaction because of Side Effects (20)
Satisfied Despite Costs (10)           Dissatisfaction because of Costs (6)

Fig. 4 shows the unified matrix resulting from the SOM analysis for the wordlist vectors corresponding to the positive and negative terms from the message board Cancerforums.net. A subset consisting of 30% of the data was used for training the SOM. We used a 12 × 12 map size with 110 variables corresponding to the positive and negative terms to ascertain how the weights of the words corresponded to the opinion of the drug Erlotinib. As mentioned in Section II, each word from the list appeared at least ten times.
This achieved a uniform measurement set while eliminating statistically insignificant outliers. Most of the users’ posts converged on three areas of the map. We checked the respective nodes’ correlation with their weight vectors’ values corresponding to positive or negative words to define the positive and negative areas of the map.

The user opinion of Erlotinib was overall satisfactory, as summarized in Table III. According to the table, and from our readings of both the user posts and the SOM, the most pressing concern from both camps was the side effects, which are extensively documented in the medical literature [41]–[46]. The cost of the drug was another matter of concern (albeit limited).

We then proceeded to identify influential users. Our modeling approach initially yielded a single loosely connected network, linking all users within the forum. Subsequent module identification using the methods described in Section II-F yielded an optimal partitioning containing five densely connected modules. We varied our scale parameter within the interval t ∈ [0, 2] in 0.1 increments, as suggested by [28]. Varying the scale parameter resulted in a set of partitions ranging from modules based on single individual users (for scale parameter t = 0) to large modules (for values of t close to the upper limit of the interval). The optimal partition (maximizing the quality measure in (1)) was obtained for t = 1.

On the Cancerforums.net message board, ten users across the 920 posts were identified as information brokers, as shown in Fig. 5(a)–(e).

Densities of the retrieved modules range from 0.2 to 0.6. These density values were within the interval of density values generally noted in social networks (towards the upper limit), thus confirming our network modeling approach [47], [48].
Information brokers were identified following the procedure described in Sections II-G and II-H. Further scrutinizing these users and their containing modules, we confirmed that their connections were the densest. A thorough reading of these ten users’ posts throughout the threads they started and participated in revealed that they were informative and actively interacting with users across many threads. Other members sought out these ten posters for their wisdom and experience. Their forum ‘behavior’ confirmed to us that these users were the premier information brokers on the drug Erlotinib on the Cancerforums.net forum.

Fig. 5. Ten users were identified as information brokers on the Cancerforums.net forum. Modules in parts (a)–(e) show where these ten users reside in the forum.

TABLE IV
SIDE-EFFECT FREQUENCY AND LOCATION IN SELECTED MODULES

Module          Side effect (significance)
Module 1 (A)    ‘rash’ (p-value < 0.01); ‘itch’ (p-value < 0.05)
Module 2 (B)    ‘rash’ (p-value < 0.05)
Module 5 (E)    ‘rash’ (p-value < 0.01)

In the last part of our analysis, we investigated which modules were significantly involved in discussing specific side effects. As described in Section II-I, retrieved modules were enriched with the TF-IDF scores corresponding to the side-effect wordlist vectors. For each module and each side-effect score sample, t-tests were performed to assess the significance of the difference between the in-module sample and the overall forum population scores. Rash and itching were identified as the side effect terms with significantly higher scores in Modules 1, 2, and 5 when compared to the overall score population in the forum, as described in Table IV. This reflects the fact that users grouped within these modules repeatedly discussed these side effects in their posts. This was confirmed by subsequent scrutiny of the respective posts.
A literature search confirmed that rash and itching are indeed two of the most common side effects of Erlotinib, with as many as 70% of patients affected, as indicated by clinical studies [44]–[46].

IV. DISCUSSION

We converted a forum focused on oncology into weighted vectors to measure consumer thoughts on the drug Erlotinib using positive and negative terms alongside another list containing the side effects. Our methods were able to investigate positive and negative sentiment on lung cancer treatment using the drug by mapping the high-dimensional data onto a lower dimensional space using the SOM. Most of the user data was clustered in the area of the map linked to positive sentiment, thus reflecting the general positive view of the users. Subsequent network-based modeling of the forum yielded interesting insights on the underlying information exchange among users. Modules of strongly interacting users were identified using a multiscale community detection method described in [28]. By overlaying these modules with content-based information in the form of word-frequency scores retrieved from user posts, we were able to identify information brokers who seem to play important roles in shaping the information content of the forum. Additionally, we were able to identify potential side effects consistently discussed by groups of users. Such an approach could be used to raise red flags in future clinical surveillance operations, as well as to highlight various other treatment-related issues. The results have opened new possibilities for developing advanced solutions, as well as revealing challenges in developing such solutions.

The consensus on Erlotinib depends on individual patient experience. Social media, by its nature, will bring together different individuals with different experiences and viewpoints. We sifted through the data to find positive and negative sentiment, which was later confirmed by research that emerged regarding Erlotinib’s effectiveness and side effects.
Future studies will require more up-to-date information for a clearer picture of user feedback on drugs and services.

Future solutions will require more advanced detection of intersocial dynamics and their effects on the members: such interests of study may include rankings, ‘likes’ of posts, and friendships. Further emphasis on context posting will require formal language dictionaries that include medical terms for specific diseases, and informal language terms (‘slang’) to clarify posts. Finally, different platforms will allow up-to-date information on the status of the drug in case one social platform ceases to discuss the drug. Another solution can look at multiple wordlists that can include multiple treatments that, when combined with contextual posting and medical lexical dictionaries, can pinpoint the source (or multiple sources) of user satisfaction (or dissatisfaction), which can open the door towards mapping consumer sentiment of multidrug therapies for advanced diseases.

The combined solutions can open new avenues of postmarketing surveillance research as companies seek real-time, ‘intelligent’ data on their products and services to remain competitive.

This solution can be envisioned on future medical devices that can serve as a postmarketing feedback loop that consumers can use to express their satisfaction (or dissatisfaction) directly to the company. The company benefits from real-time feedback that can then be used to assess if there are any problems and rapidly address such problems.

Social media can open the door for the health care sector to address cost reduction, product and service optimization, and patient care.