Monday 10 August 2015

How to perform Twitter analytics in R


There is an R package called “sentiment” by Timothy Jurka. The package provides two particularly useful functions:
classify_emotion
classify_emotion classifies text into one of the following emotion categories –
  1. anger,
  2. disgust,
  3. fear,
  4. joy,
  5. sadness, and
  6. surprise
This is done by one of two algorithms –
  1. a naive Bayes classifier trained on Carlo Strapparava and Alessandro Valitutti’s emotions lexicon, or
  2. a simple voter procedure.
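To make this concrete, here is a minimal sketch of classify_emotion on two made-up sentences (assuming the package is already installed; see the note further down). The sentences are invented, and the best-fit label sits in column 7 of the returned matrix, the same column used in Step 4 below; it can be NA when no emotion clearly wins.
# minimal sketch: classify two made-up sentences by emotion
library(sentiment)
docs = c("I am so happy and excited today", "this is scary and it makes me angry")
emo_scores = classify_emotion(docs, algorithm="bayes", prior=1.0)
# one score column per emotion plus a best-fit column (column 7), possibly NA
emo_scores[,7]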
classify_polarity
In contrast to the classification of emotions, the classify_polarity function allows us to classify text as positive or negative. In this case, the classification can be done either with a naive Bayes algorithm trained on Janyce Wiebe’s subjectivity lexicon or with a simple voter algorithm.
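Similarly, a minimal sketch of classify_polarity, reusing the docs vector from the previous sketch; the best-fit label sits in column 4 of the returned matrix, as used in Step 4 below.
# minimal sketch: classify the same made-up sentences by polarity
pol_scores = classify_polarity(docs, algorithm="bayes")
# positive/negative scores plus a best-fit column (column 4)
pol_scores[,4]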
Important Note:
The R package “sentiment” depends on Duncan Temple Lang’s Rstem package, which is only available from Omegahat (it is not on CRAN).
At the time of this writing, I’m using version 0.4-1 of “sentiment” (I downloaded and installed the tar.gz file from the package website).
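If you want to script the installation, a minimal sketch is shown below; the Omegahat repository URL and the exact tar.gz file name are assumptions, so adjust them to match what you actually downloaded.
# install Rstem from the Omegahat repository (URL assumed; not on CRAN)
install.packages("Rstem", repos="http://www.omegahat.org/R", type="source")
# install sentiment from a locally downloaded source archive (file name assumed)
install.packages("sentiment_0.4-1.tar.gz", repos=NULL, type="source")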
Example with tweets talking about “starbucks”
Step 1: Load the necessary packages
# required packages
library(twitteR)
library(sentiment)
library(plyr)
library(ggplot2)
library(wordcloud)
library(RColorBrewer)
library(tm)   # used in Step 7 for removeWords, Corpus and TermDocumentMatrix
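Before Step 2 will work, twitteR needs an authenticated session. A minimal sketch, assuming you have created a Twitter app and stored its credentials in the four placeholder variables below:
# authenticate with the Twitter API (the four values are placeholders
# for your own app's credentials)
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)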
Step 2: Let’s collect some tweets containing the term “starbucks”
# harvest some tweets
some_tweets = searchTwitter("starbucks", n=1500, lang="en")
 
# get the text
some_txt = sapply(some_tweets, function(x) x$getText())
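Note that searchTwitter may return fewer tweets than requested, so it is worth checking how many actually came back:
# check how many tweets were actually returned
length(some_txt)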
Step 3: Prepare the text for sentiment analysis
# remove retweet entities
some_txt = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", some_txt)
# remove at people
some_txt = gsub("@\\w+", "", some_txt)
# remove punctuation
some_txt = gsub("[[:punct:]]", "", some_txt)
# remove numbers
some_txt = gsub("[[:digit:]]", "", some_txt)
# remove html links
some_txt = gsub("http\\w+", "", some_txt)
# collapse runs of spaces/tabs into a single space
some_txt = gsub("[ \t]{2,}", " ", some_txt)
some_txt = gsub("^\\s+|\\s+$", "", some_txt)
 
# define "tolower error handling" function 
try.error = function(x)
{
   # create missing value
   y = NA
   # tryCatch error
   try_error = tryCatch(tolower(x), error=function(e) e)
   # if not an error
   if (!inherits(try_error, "error"))
   y = tolower(x)
   # result
   return(y)
}
# lower case using try.error with sapply 
some_txt = sapply(some_txt, try.error)
 
# remove NAs in some_txt
some_txt = some_txt[!is.na(some_txt)]
names(some_txt) = NULL
Step 4: Perform Sentiment Analysis
# classify emotion
class_emo = classify_emotion(some_txt, algorithm="bayes", prior=1.0)
# get emotion best fit (column 7 of the result is the best-fit category)
emotion = class_emo[,7]
# substitute NA's by "unknown"
emotion[is.na(emotion)] = "unknown"
 
# classify polarity
class_pol = classify_polarity(some_txt, algorithm="bayes")
# get polarity best fit (column 4 of the result is the best-fit category)
polarity = class_pol[,4]
Step 5: Create data frame with the results and obtain some general statistics
# data frame with results
sent_df = data.frame(text=some_txt, emotion=emotion,
polarity=polarity, stringsAsFactors=FALSE)
 
# order emotion levels by decreasing frequency (for nicer plots)
sent_df = within(sent_df,
  emotion <- factor(emotion, levels=names(sort(table(emotion), decreasing=TRUE))))
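The step title promises some general statistics; a simple way to get them is to tabulate the emotion and polarity counts:
# counts of tweets per emotion and per polarity class
table(sent_df$emotion)
table(sent_df$polarity)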
This is what the first 15 rows of sent_df would look like:
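To reproduce that preview in your own session:
# preview the first 15 rows of the results data frame
head(sent_df, 15)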
Step 6: Let’s do some plots of the obtained results
# plot distribution of emotions
ggplot(sent_df, aes(x=emotion)) +
geom_bar(aes(y=..count.., fill=emotion)) +
scale_fill_brewer(palette="Dark2") +
labs(x="emotion categories", y="number of tweets",
     title="Sentiment Analysis of Tweets about Starbucks\n(classification by emotion)") +
theme(plot.title = element_text(size=12))
# plot distribution of polarity
ggplot(sent_df, aes(x=polarity)) +
geom_bar(aes(y=..count.., fill=polarity)) +
scale_fill_brewer(palette="RdGy") +
labs(x="polarity categories", y="number of tweets",
     title="Sentiment Analysis of Tweets about Starbucks\n(classification by polarity)") +
theme(plot.title = element_text(size=12))
Step 7: Separate the text by emotions and visualize the words with a comparison cloud
# separating text by emotion
emos = levels(factor(sent_df$emotion))
nemo = length(emos)
emo.docs = rep("", nemo)
for (i in 1:nemo)
{
   tmp = some_txt[emotion == emos[i]]
   emo.docs[i] = paste(tmp, collapse=" ")
}
 
# remove stopwords
emo.docs = removeWords(emo.docs, stopwords("english"))
# create corpus
corpus = Corpus(VectorSource(emo.docs))
tdm = TermDocumentMatrix(corpus)
tdm = as.matrix(tdm)
colnames(tdm) = emos
 
# comparison word cloud
comparison.cloud(tdm, colors = brewer.pal(nemo, "Dark2"),
   scale = c(3,.5), random.order = FALSE, title.size = 1.5)
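If you want to see which words drive each cloud, a small follow-up sketch (no new packages needed):
# list the ten most frequent terms for each emotion column
apply(tdm, 2, function(counts) head(sort(counts, decreasing=TRUE), 10))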
