Social media has transformed the way we communicate and share information. Microblogging websites, such as Twitter, have emerged as a powerful tool for individuals and businesses to express opinions, complaints, and current issues in real-time. These platforms have become a goldmine of information for marketers who are constantly searching for ways to understand customer behavior and preferences. In fact, many companies now use microblogging websites as a source of data to gauge overall sentiment towards their products or services.
In this post, we take a closer look at how to analyze Twitter feeds in real-time using the R programming language. We explore how to classify tweets into positive, negative, and neutral sentiments using data science techniques. This will enable us to create a histogram and perform word cloud analysis to gain insights into the sentiments of Twitter users.
In our previous post, Getting started with Live Twitter Data and Trends in R-Project, we showed how to extract live Twitter feeds. In this post, we build on that work and show how to analyze live Twitter feeds mathematically.
To get started:
Step 1:- Install R-Project on your PC.
Step 2:- Install and load the required libraries as listed below.
list.of.packages <- c("SnowballC", "wordcloud", "tm", "stringr", "plyr", "ggplot2", "twitteR")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
library('SnowballC')
library('wordcloud')
library('tm')
library('stringr')
library('plyr')
library('ggplot2')
library('twitteR')
Step 3:- Connect R with Twitter – if you need help, please click here.
your_api_key <- "YOUR API KEY"
your_api_secret <- "YOUR API SECRET"
your_access_token <- "YOUR ACCESS TOKEN"
your_access_token_secret <- "YOUR ACCESS TOKEN SECRET"
setup_twitter_oauth(your_api_key, your_api_secret, your_access_token, your_access_token_secret)
Step 4:- Import data from Twitter and normalize it for analysis using the code below. In this example, we import 1,000 live tweets related to India's GDP. You can change the variable “key_search” as per your requirements.
key_search <- "IndiaGDP"
insta <- searchTwitter(key_search, n = 1000, lang = "en")
insta_text <- sapply(insta, function(x) x$getText())
insta_text_corpus <- Corpus(VectorSource(insta_text))
insta_text_corpus <- tm_map(insta_text_corpus, removePunctuation)
insta_text_corpus <- tm_map(insta_text_corpus, content_transformer(tolower))
insta_text_corpus <- tm_map(insta_text_corpus, function(x) removeWords(x, stopwords()))
insta_text_corpus <- tm_map(insta_text_corpus, removeWords, c("RT", "are", "that"))
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
insta_text_corpus <- tm_map(insta_text_corpus, content_transformer(removeURL))
insta_2 <- TermDocumentMatrix(insta_text_corpus)
insta_2 <- as.matrix(insta_2)
insta_2 <- sort(rowSums(insta_2), decreasing = TRUE)
insta_2 <- data.frame(word = names(insta_2), freq = insta_2)
head(insta_2, 10)
set.seed(1234)
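A quick word on the `removeURL` helper: it strips `http` followed by any run of alphanumerics. Because punctuation is removed earlier in the pipeline, a shortened link such as `https://t.co/Abc123` arrives at this step as `httpstcoAbc123`, which the pattern catches whole. A small base-R check on a made-up tweet (the text here is illustrative, not real data):

```r
# Same helper as in the pipeline above
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)

# Punctuation has already been stripped, so the t.co link is one alphanumeric run
tweet <- "India GDP grows httpstcoAbc123"
cleaned <- removeURL(tweet)
cleaned  # "India GDP grows "
```

Note that if this helper were run before punctuation removal, only the leading `https` of a URL would match, so the order of the `tm_map` steps matters.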
Step 5:- Import Positive and Negative word list.
Click to download Positive words CSV file.
Click to download Negative words CSV file.
Please download the above files into the working directory of R-Project. To find your working directory, type getwd() in the R console.
# scan() reads each word list as a flat character vector, which is what
# the matching step below needs
pos.words <- scan("positive-word.csv", what = 'character')
neg.words <- scan("negative-word.csv", what = 'character')
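We use scan() rather than read.csv() here because scan() returns a plain character vector (one entry per token), whereas read.csv() would return a data frame. A quick sketch of the difference, using an in-memory connection in place of the downloaded files:

```r
# scan() yields a character vector, one entry per whitespace-separated token
pos.words <- scan(textConnection("good\ngreat\nstrong"), what = 'character', quiet = TRUE)
pos.words          # c("good", "great", "strong")
length(pos.words)  # 3
```

The match() calls in the scoring function in the next step expect exactly this flat-vector shape.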
Step 6:- Let us create a function to normalize the live Twitter data and match each tweet against the positive and negative word lists.
score.sentiment = function(sentences, pos.words, neg.words, .progress = 'none')
{
  require(plyr)
  require(stringr)

  # we got a vector of sentences. plyr will handle a list or a vector
  # as an "l" for us. we want a simple array ("a") of scores back,
  # so we use "l" + "a" + "ply" = "laply":
  scores = laply(sentences, function(sentence, pos.words, neg.words) {

    # clean up sentences with R's regex-driven global substitute, gsub():
    sentence = gsub('[[:punct:]]', '', sentence)
    sentence = gsub('[[:cntrl:]]', '', sentence)
    sentence = gsub('\\d+', '', sentence)
    # and convert to lower case:
    sentence = tolower(sentence)

    # split into words. str_split is in the stringr package
    word.list = str_split(sentence, '\\s+')
    # sometimes a list() is one level of hierarchy too much
    words = unlist(word.list)

    # compare our words to the dictionaries of positive & negative terms
    pos.matches = match(words, pos.words)
    neg.matches = match(words, neg.words)

    # match() returns the position of the matched term or NA
    # we just want a TRUE/FALSE:
    pos.matches = !is.na(pos.matches)
    neg.matches = !is.na(neg.matches)

    score.p = sum(pos.matches)
    score.n = sum(neg.matches)
    score = score.p - score.n

    return(score)
  }, pos.words, neg.words, .progress = .progress)

  scores.df = data.frame(score = scores, text = sentences)
  return(scores.df)
}
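To sanity-check the scoring logic before pointing it at live data, here is the same per-sentence score (positive matches minus negative matches) written in base R, run on two made-up sentences against tiny illustrative word lists. The function above does the same thing, just vectorized over many tweets via plyr::laply and stringr::str_split:

```r
# Base-R version of the per-sentence score: count of positive-list words
# minus count of negative-list words, after the same clean-up steps.
score_one <- function(sentence, pos.words, neg.words) {
  sentence <- gsub('[[:punct:]]', '', sentence)
  sentence <- gsub('\\d+', '', sentence)
  sentence <- tolower(sentence)
  words <- unlist(strsplit(sentence, '\\s+'))
  sum(!is.na(match(words, pos.words))) - sum(!is.na(match(words, neg.words)))
}

# Toy dictionaries for illustration only; the real lists come from the CSV files
pos <- c("strong", "growth", "good")
neg <- c("weak", "slowdown", "bad")

score_one("Strong GDP growth, good signs!", pos, neg)      # 3
score_one("Weak quarter, fears of a slowdown.", pos, neg)  # -2
```

A score above zero means the tweet leans positive, below zero negative, and exactly zero neutral (or no dictionary words matched at all).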
Step 7:- To view the scores and a summary, use the code below.
test <- ldply(insta, function(t) t$toDataFrame())
result <- score.sentiment(test$text, pos.words, neg.words)
summary(result$score)
Step 8:- To plot the word cloud, please use the code below.
wordcloud(insta_text_corpus, min.freq = 1, max.words = 80, scale = c(2.2, 1),
          colors = brewer.pal(8, "Dark2"))
Step 9:- To plot the histogram of the results, please use the code below.
qplot(result$score, xlab = "Score of tweets", ylab = key_search)
To view the frequency count, please use the code below.
count(result$score)
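count() here comes from the plyr package and tabulates how many tweets landed on each score. Base R's table() gives the same breakdown; a sketch on a made-up score vector (illustrative values, not real results):

```r
# Hypothetical scores for illustration; table() counts tweets per score value
scores <- c(-1, 0, 0, 1, 1, 1, 2)
table(scores)
# scores
# -1  0  1  2
#  1  2  3  1
```

Large counts at positive scores suggest the search term is being discussed favorably, and vice versa.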