SMA EXP 2023
EXPERIMENT NO. 1.
Aim: Study various
i) Social Media platforms (Facebook, Twitter, YouTube etc.)
ii) Social Media analytics tools (Facebook Insights, Google Analytics, Netlytic etc.)
iii) Social Media analytics techniques and engagement metrics (page level, post level, member level)
iv) Applications of social media analytics for business, e.g. Google Analytics
https://marketingplatform.google.com/about/analytics/
https://netlytic.org
Theory:
Social media has become an indispensable part of our lives. It has transformed the way we communicate
and interact with each other. Social media platforms like Facebook, Twitter, and YouTube have millions of
users who share their views, opinions, and experiences. Businesses have recognized the importance of social
media in engaging with their customers and promoting their products and services. Social media analytics is the
process of analyzing social media data to gain insights into customer behavior, market trends, and brand
perception. In this article, we will study various social media platforms, social media analytics tools, and
techniques, and their applications in business.

Social Media Platforms:
There are various social media platforms, and each has its own unique features and
characteristics. Some of the popular social media platforms are Facebook, Twitter, Instagram, LinkedIn,
and YouTube.
Facebook is the most popular social media platform, with over 2 billion monthly active users.
Twitter is a microblogging platform that allows users to post short messages called tweets. Instagram is
a visual platform that focuses on photos and videos. LinkedIn is a professional networking platform that
connects professionals and businesses. YouTube is a video-sharing platform that allows users to upload
and share videos.

Social Media Analytics Tools:
Social media analytics tools are software applications that help businesses track, measure, and
analyze social media data. Some of the popular social media analytics tools are Facebook Insights,
Google Analytics, and Netlytic.
Facebook Insights is a free tool that provides businesses with data on their Facebook page's
performance. Google Analytics is a web analytics tool that tracks and reports website traffic. Netlytic is
a social media analytics tool that analyzes social media data to identify trends, topics, and influencers.

Social Media Analytics Techniques and Engagement Metrics:
Social media analytics techniques and engagement metrics are used to measure social media
performance. Some of the popular social media analytics techniques and engagement metrics are page-level metrics, post-level metrics, and member-level metrics.
Page-level metrics measure the overall performance of a social media page. Post-level metrics
measure the performance of individual posts. Member-level metrics measure the engagement level of
individual members.
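For illustration, the short Python sketch below computes one post-level metric (engagement rate per post) and a few page-level aggregates from a small, invented dataset; the column names (impressions, likes, comments, shares) and the follower count are assumptions for the example, not an actual platform export.
import pandas as pd

# Hypothetical post-level export from a social media page (column names are assumed)
posts = pd.DataFrame({
    "post_id": [1, 2, 3],
    "impressions": [1200, 800, 2500],
    "likes": [150, 40, 600],
    "comments": [12, 5, 48],
    "shares": [7, 2, 30],
})
follower_count = 10_000  # page-level figure, assumed for the example

# Post-level metric: engagement rate per post (interactions / impressions)
posts["engagement_rate"] = (posts["likes"] + posts["comments"] + posts["shares"]) / posts["impressions"]

# Page-level metrics: totals and averages across all posts
page_metrics = {
    "total_impressions": posts["impressions"].sum(),
    "avg_engagement_rate": posts["engagement_rate"].mean(),
    "engagement_per_follower": posts[["likes", "comments", "shares"]].sum().sum() / follower_count,
}
print(posts)
print(page_metrics)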

Applications of Social Media Analytics for Business:
Social media analytics has various applications in business. It can be used to measure brand
awareness, customer engagement, and market trends. It can also be used to identify influencers and track
competitor activity. Google Analytics is a popular tool used by businesses to track website traffic and
user behavior.
It provides businesses with insights into their website's performance, such as the number of
visitors, bounce rate, and conversion rate. Netlytic is a social media analytics tool that helps businesses
identify trends, topics, and influencers in social media data.
Conclusion:
Social media analytics has become an essential tool for businesses to measure their social media
performance and gain insights into customer behavior, market trends, and brand perception.
Social media platforms like Facebook, Twitter, and YouTube have millions of users, and businesses can
leverage this audience to promote their products and services.
Social media analytics tools like Facebook Insights, Google Analytics, and Netlytic provide businesses
with data and insights to make informed decisions.
Social media analytics techniques and engagement metrics can help businesses measure their social
media performance and optimize their social media strategy.
EXPERIMENT NO. 2.
Aim: Data Collection - Select the social media platforms of your choice (Twitter, Facebook, LinkedIn, YouTube,
web blogs etc.), connect to and capture social media data for business (scraping, crawling, parsing).
Introduction:
Social media has become a crucial source of data for businesses to understand their customers and their
preferences. With millions of users sharing their thoughts, opinions, and experiences on social media platforms
like Twitter, Facebook, LinkedIn, YouTube, and web blogs, businesses can gain valuable insights into customer
behavior, market trends, and brand perception. In this article, we will discuss data collection methods for social
media platforms, including scraping, crawling, and parsing.
Data Collection Methods:
There are various data collection methods for social media platforms, and each method has its own advantages
and disadvantages. Some of the popular data collection methods are scraping, crawling, and parsing.
1. Scraping:
Scraping is the process of extracting data from websites or social media platforms. Scraping is an
effective way to collect data from social media platforms like Twitter, Facebook, and LinkedIn. The data
collected through scraping can be used for various purposes, such as sentiment analysis, trend analysis,
and competitor analysis.
2. Crawling:
Crawling is the process of systematically discovering and indexing web pages using automated bots (crawlers), such as those used by search engines. It involves using automated tools or software to scan websites and social media platforms for new content.
Crawling is an effective way to collect data from web blogs and news websites. The data collected
through crawling can be used for various purposes, such as content analysis, trend analysis, and
competitor analysis.
3. Parsing:
Parsing is the process of extracting data from structured or unstructured data sources. It involves
using automated tools or software to analyze data and extract relevant information. Parsing is an
effective way to collect data from social media platforms like Twitter and Facebook. The data collected
through parsing can be used for various purposes, such as sentiment analysis, trend analysis, and
competitor analysis.
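As a minimal illustration of scraping and parsing, the sketch below fetches a page with requests and parses its headings and links with Beautiful Soup. The URL is a placeholder; real social media platforms generally require their official APIs and impose rate limits and terms of use that must be respected.
import requests
from bs4 import BeautifulSoup

# Placeholder URL; substitute a page you are permitted to scrape
url = "https://example.com/blog"

response = requests.get(url, timeout=10)             # fetch the raw HTML (scraping)
soup = BeautifulSoup(response.text, "html.parser")   # parse the HTML (parsing)

# Extract headings and links as simple structured records
for heading in soup.find_all(["h1", "h2"]):
    print(heading.get_text(strip=True))
for link in soup.find_all("a", href=True):
    print(link["href"])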
Execution Steps:
1. Identify the social media platform(s) of interest: Choose the social media platform(s) that you want to
collect data from, based on your business needs and objectives.
2. Set up a data collection tool: Choose a data collection tool that suits your needs and set it up. There are
various data collection tools available, such as Octoparse, ParseHub, and Beautiful Soup.
3. Define the data fields: Define the data fields that you want to collect from the social media platform. For
example, if you are collecting data from Twitter, you may want to collect data on tweets, hashtags, user
profiles, and location.
4. Configure the data collection tool: Configure the data collection tool to collect the data fields that you
have defined.
5. Run the data collection tool: Run the data collection tool to collect the data from the social media
platform.
6. Store and analyze the data: Store the data in a database or spreadsheet and analyze it using data analysis
tools like Excel or Python.
Code:
# Import required libraries
import pandas as pd
import requests
from textblob import TextBlob

# Set YouTube video ID, maximum number of comments to retrieve, and API key
video_id = "Q33TkQKlIMg"
max_result = 50
api_key = "AIzaSyC_4xZTiNuz1O-Qu5kYnlg82riP30KRIxY"

# Retrieve video information
video_info_url = f"https://www.googleapis.com/youtube/v3/videos?part=id%2Csnippet&id={video_id}&key={api_key}"
video_info_response = requests.get(video_info_url)
video_info_data = video_info_response.json()

# Retrieve video components (comments)
comments_url = (f"https://www.googleapis.com/youtube/v3/commentThreads?key={api_key}"
                f"&videoId={video_id}&part=snippet&maxResults={max_result}")
comments_response = requests.get(comments_url)
comments_data = comments_response.json()

# Create pandas DataFrame from comments data
df = pd.DataFrame(comments_data['items'])

# Extract the text of each top-level comment from the DataFrame
comments = []
for i in range(len(df)):
    txt = df['snippet'][i]['topLevelComment']['snippet']['textOriginal']
    comments.append(txt)
print(comments)

# Define function to perform sentiment analysis on a given comment
def get_comment_sentiment(comment):
    analysis = TextBlob(comment)
    if analysis.sentiment.polarity > 0:
        return "Positive"
    elif analysis.sentiment.polarity == 0:
        return "neutral"
    else:
        return "negative"

# Perform sentiment analysis on all comments and create a new DataFrame
comment_list = []
sentiment_list = []
for comment in comments:
    sentiment = get_comment_sentiment(comment)
    comment_list.append(comment)
    sentiment_list.append(sentiment)
    print(f"{comment} : {sentiment}")

sentiment_df = pd.DataFrame({"Comments": comment_list, "Sentiment": sentiment_list})
sentiment_df.head()

# Save DataFrame to a CSV file
sentiment_df.to_csv("Ashish_YouTube_Comments_Sentiment.csv")
Output:
golden voice morgan freeman : Positive
It's shows only our earth not Universe : neutral
"Little minds are tamed and subdued by misfortune; but great minds rise above it." -Washington Irving : Positive
"I find that when you have a real interest in life and a curious life, that sleep is
not the most important thing." --Martha Stewart : Positive
But they can't beat the @melodysheep : neutral
Hore..hore film ultramen!! : neutral
I mean.. life is strange : negative
So amazing 😢 : Positive
N for nothing original : Positive
His narration is institutionalized. ❤❤❤❤ : neutral
Morgan Freeman u beauty ❤ : neutral
All most of pieces come from France television watch a doc called "aux Frontier de
l'univers "I bet you will like it but also Morgan he ' s one of the best : Positive
Absolute poppycock!
It makes me love God all the more!
Trying to explain the universe while claiming God doesn’t exist…..
Let’s just say that in heaven, they watch shows like this on comedy night. : Positive
No Money, No Honey.
Blow up the damn universe. Who cares. : neutral
Ashish attenborough would’ve ate this up : neutral
And humans are biggest mistake by nature 😂😂😂😂 : neutral
I wish this could be not only on netflix and on other websites in the internet to
watch it or on tv in Poland where I live. I love Mr Freeman voice and science shows I
watched with him : Positive
I got goosebumps watching this trailer. : neutral
Half of trailer copy from national geographic channel😅 : negative
its not a documentry. its an exploration of theoretical events. : neutral
damn, the big bang arc hit hard.
when theia-chan and earth-kun finally joined together i was crying : negative
This proved to me that we have a CREATOR, this world was not made by chance, it was
intentional and perfect. : Positive
God is the creator, the universe things you put in here don't make sense, had to skip
the parts : neutral
Kudos to tech , We can see back the full history : Positive
This is not the real story..... : Positive
Seeing ur work gettin some recognition bring tears : neutral
watch The throne of Allah. : neutral
Honestly was praying NOT to hear Morgan Freeman : Positive
Is this the history of our universe or it's the history that you made up? : neutral
The earth is flat : negative
And also don't forget world's elite have an agenda to believe lies. GOD created
everything keep in mind, and this universe GOD laws still exist. Amen : neutral
Gurl y’all got God to narrate this lmaoo : neutral
BIDEN R U WATCHING THIS ? : neutral
Надеюсь нам не будут напоминать что планеты бесполые... : neutral
Watching the creation of our Lord God Almighty. He created all things and through the
Lord Jesus were made. The world was made in 6 days 💯❤✝️ : neutral
Looks so good : Positive
The voice of ''MORGAN FREEMAN'' is not just a voice it's an emotion that connects us
with him and also connects us with the scenes. ❤❤❤ : neutral
Wow : Positive
13.8 billion years, says who? Allah states that He created the universe, the heavens
and earth and all that is in, on and between them in six days, as He has stated in
several Ayat in the Qur'an. : neutral
I get nostalgia of animal planet : neutral
Why it feels more like some discovery stuff then a Netflix show😅 : Positive
This is explosive ❤❤ : neutral
Ashish.... epic voice : Positive
This line 👉We are connected to the start ❤
Yes actually we are the start and last i think, what do you think?
Different animals have different mind but anyone want to live with their mothers(🌎),
but we Are humans we are Destroyed trees, killing animals, increasing population,
waste water , waste food, polute air,😔😓 etc and want to live long ..... How funny
that you guys 😅 : Positive
Proud of Indian 🚩🇮🇳 : Positive
That's why I subscribe Netflix : neutral
Morgan Freeman is GOD : neutral
me suena a una imitación de cosmos espero que este buena : neutral
Explanation:
 This code is for retrieving comments from a YouTube video and performing sentiment analysis on the comments using the TextBlob library in Python.
 The code imports the pandas, requests, and TextBlob libraries, and sets the video ID, maximum number of comments to retrieve, and API key for accessing the YouTube API.
 The code then sends an API request to retrieve information about the video and another API request to retrieve the comments for the video, using the video ID and API key.
 The retrieved data is then converted into a pandas DataFrame for further analysis. The code then extracts the comments from the DataFrame and performs sentiment analysis on each comment using the TextBlob library.
 Finally, the code creates a new pandas DataFrame with the comments and their corresponding sentiment and saves the data to a CSV file named "Ashish_YouTube_Comments_Sentiment.csv".
Conclusion:
Data collection from social media platforms is essential for businesses to gain insights into customer
behavior, market trends, and brand perception. There are various data collection methods available, including
scraping, crawling, and parsing. Each method has its own advantages and disadvantages, and businesses should
choose the method that suits their needs and objectives. The execution steps for data collection involve
identifying the social media platform(s) of interest, setting up a data collection tool, defining the data fields,
configuring the data collection tool, running the data collection tool, and storing and analyzing the data.
EXPERIMENT NO. 3.
Aim: Data Cleaning and Storage - Preprocess, filter and store social media data for business (using Python,
MongoDB, R, etc.).
Theory:
After collecting social media data for businesses, the next step is data cleaning and storage. The data
collected from social media platforms is often unstructured and noisy, making it difficult to analyze. In this
article, we will discuss the theory behind data cleaning and storage and the execution steps to preprocess, filter,
and store social media data for businesses.
Data cleaning is the process of identifying and correcting errors and inconsistencies in data. Social
media data is often noisy, containing irrelevant or inaccurate information, such as misspellings, abbreviations,
and emoticons. Data cleaning involves removing such information to make the data more accurate and reliable.
Data storage involves storing the cleaned data in a database or other storage systems. Social media data is often
big and unstructured, making it difficult to store in traditional databases. Therefore, businesses can use NoSQL
databases, such as MongoDB or Apache Cassandra, to store social media data.
Execution Steps:
1. Preprocess the data: Preprocess the data by removing irrelevant or inaccurate information, such as
misspellings, abbreviations, and emoticons. Businesses can use data cleaning tools, such as TextBlob or
NLTK, to preprocess social media data.
2. Filter the data: Filter the data by selecting only relevant data fields, such as tweets, hashtags, user
profiles, and location. Businesses can use filtering tools, such as pandas or dplyr, to filter social media
data.
3. Store the data: Store the cleaned and filtered data in a database or other storage systems. Businesses
can use NoSQL databases, such as MongoDB or Apache Cassandra, to store social media data.
4. Index the data: Index the data to make it searchable and retrievable. Businesses can use indexing tools,
such as Elasticsearch or Solr, to index social media data.
5. Analyze the data: Analyze the data using data analysis tools, such as Excel, Python, or R. Businesses
can use data analysis tools to gain insights into customer behavior, market trends, and brand perception.
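A minimal sketch of steps 1-3 is shown below, assuming a local MongoDB server and a hypothetical database and collection name (social_media_db, cleaned_tweets); the raw records are invented for illustration.
import re
from pymongo import MongoClient

# Hypothetical raw records collected from a social media platform
raw_tweets = [
    {"user": "alice", "text": "Loving the new phone!!! 😍 http://t.co/xyz", "location": "Mumbai"},
    {"user": "bob", "text": "wrst service evr @support", "location": "Pune"},
]

def clean_text(text):
    """Remove URLs, mentions, symbols/emoticons, and extra spaces."""
    text = re.sub(r"http\S+|@\w+", "", text)        # strip links and mentions
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)      # strip emoticons and punctuation
    return re.sub(r"\s+", " ", text).strip().lower()

# Filter to the relevant fields and clean the text
cleaned = [{"user": t["user"], "location": t["location"], "text": clean_text(t["text"])}
           for t in raw_tweets]

# Store the cleaned records in MongoDB (assumes a server running on localhost:27017)
client = MongoClient("mongodb://localhost:27017/")
collection = client["social_media_db"]["cleaned_tweets"]
collection.insert_many(cleaned)
print(collection.count_documents({}))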
Code:
# Import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
import nltk
import string
import re
%matplotlib inline
pd.set_option('display.max_colwidth', 100)

# Load dataset
def load_data():
    data = pd.read_csv('../input/Ashish_Data.csv')
    return data

tweet_df = load_data()
tweet_df.head()
print('Dataset size:', tweet_df.shape)
print('Columns are:', tweet_df.columns)
tweet_df.info()
sns.countplot(x='ADR_label', data=tweet_df)

# Exploratory Data Analysis
# Wordcloud Visualization
df = pd.DataFrame(tweet_df[['UserId', 'Tweet']])
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

# Split tweets by class and join them into single strings
df_ADR = tweet_df[tweet_df['ADR_label'] == 1]
df_NADR = tweet_df[tweet_df['ADR_label'] == 0]
tweet_All = " ".join(review for review in df.Tweet)
tweet_ADR = " ".join(review for review in df_ADR.Tweet)
tweet_NADR = " ".join(review for review in df_NADR.Tweet)

fig, ax = plt.subplots(3, 1, figsize=(30, 30))
# Create and generate the word cloud images:
wordcloud_ALL = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(tweet_All)
wordcloud_ADR = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(tweet_ADR)
wordcloud_NADR = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(tweet_NADR)

# Display the generated images:
ax[0].imshow(wordcloud_ALL, interpolation='bilinear')
ax[0].set_title('All Tweets', fontsize=30)
ax[0].axis('off')
ax[1].imshow(wordcloud_ADR, interpolation='bilinear')
ax[1].set_title('Tweets under ADR Class', fontsize=30)
ax[1].axis('off')
ax[2].imshow(wordcloud_NADR, interpolation='bilinear')
ax[2].set_title('Tweets under None - ADR Class', fontsize=30)
ax[2].axis('off')
# wordcloud.to_file("img/first_review.png")

string.punctuation

# Remove punctuation and digits
def remove_punct(text):
    text = "".join([char for char in text if char not in string.punctuation])
    text = re.sub(r'[0-9]+', '', text)
    return text

df['Tweet_punct'] = df['Tweet'].apply(lambda x: remove_punct(x))
df.head(10)

# Tokenization
def tokenization(text):
    text = re.split(r'\W+', text)
    return text

df['Tweet_tokenized'] = df['Tweet_punct'].apply(lambda x: tokenization(x.lower()))
df.head()

# Remove stopwords
stopword = nltk.corpus.stopwords.words('english')
# stopword.extend(['yr', 'year', 'woman', 'man', 'girl', 'boy', 'one', 'two', 'sixteen',
#                  'yearold', 'fu', 'weeks', 'week', 'treatment', 'associated', 'patients',
#                  'may', 'day', 'case', 'old'])
def remove_stopwords(text):
    text = [word for word in text if word not in stopword]
    return text

df['Tweet_nonstop'] = df['Tweet_tokenized'].apply(lambda x: remove_stopwords(x))
df.head(10)

# Stemming and Lemmatization
# Ex - developed, development
ps = nltk.PorterStemmer()
def stemming(text):
    text = [ps.stem(word) for word in text]
    return text

df['Tweet_stemmed'] = df['Tweet_nonstop'].apply(lambda x: stemming(x))
df.head()

wn = nltk.WordNetLemmatizer()
def lemmatizer(text):
    text = [wn.lemmatize(word) for word in text]
    return text

df['Tweet_lemmatized'] = df['Tweet_nonstop'].apply(lambda x: lemmatizer(x))
df.head()

def clean_text(text):
    text_lc = "".join([word.lower() for word in text if word not in string.punctuation])  # remove punctuation
    text_rc = re.sub(r'[0-9]+', '', text_lc)
    tokens = re.split(r'\W+', text_rc)  # tokenization
    text = [ps.stem(word) for word in tokens if word not in stopword]  # remove stopwords and stem
    return text

# Vectorisation
countVectorizer = CountVectorizer(analyzer=clean_text)
countVector = countVectorizer.fit_transform(df['Tweet'])
print('{} Number of tweets has {} words'.format(countVector.shape[0], countVector.shape[1]))
# print(countVectorizer.get_feature_names_out())
count_vect_df = pd.DataFrame(countVector.toarray(), columns=countVectorizer.get_feature_names_out())
# count_vect_df.head()

# Feature Creation: tweet length (excluding spaces) per class
ADR_tweet_1 = tweet_df[tweet_df['ADR_label'] == 1]['Tweet'].apply(lambda x: len(x) - x.count(' '))
ADR_tweet_0 = tweet_df[tweet_df['ADR_label'] == 0]['Tweet'].apply(lambda x: len(x) - x.count(' '))
bins_ = np.linspace(0, 450, 70)
plt.hist(ADR_tweet_1, bins=bins_, density=True, alpha=0.5, label='ADR')
plt.hist(ADR_tweet_0, bins=bins_, density=True, alpha=0.1, label='None_ADR')
plt.legend()
Output:
23516 Number of tweets has 14323 words
Explanation:
This code is written in Python and is used for Exploratory Data Analysis (EDA) on a dataset containing
tweets related to drugs. Here's a summary of the code:
 Libraries such as pandas, numpy, seaborn, matplotlib, and nltk are imported.
 The dataset is loaded using the load_data function, and the size and columns of the dataset are printed.
 A countplot is plotted to show the number of tweets in each ADR class (0 or 1).
 A WordCloud is plotted to visualize the most frequent words in all tweets, tweets under the ADR class, and tweets under the None-ADR class.
 The tweets are preprocessed by removing punctuation, tokenizing the words, removing stopwords, and stemming and lemmatizing the words.
 The preprocessed tweets are vectorized using the CountVectorizer class.
 The length of tweets in both ADR classes is plotted using histograms to show the frequency distribution of tweet lengths.
The code performs the EDA on the tweet dataset to gain insights and prepare the data for machine learning
models that can predict ADR classes of tweets based on their contents.
Conclusion:
Data cleaning and storage are essential steps in the social media data analysis process. The data
collected from social media platforms is often unstructured and noisy, making it difficult to analyze. Therefore,
businesses need to preprocess, filter, and store social media data to make it more accurate and reliable. They can
use data cleaning tools, such as TextBlob or NLTK, and filtering tools, such as pandas or dplyr, to preprocess
and filter social media data. Businesses can also use NoSQL databases, such as MongoDB or Apache
Cassandra, to store social media data. Finally, businesses can use data analysis tools, such as Excel, Python, or
R, to gain insights into customer behavior, market trends, and brand perception.
EXPERIMENT NO. 4.
Aim: Exploratory Data Analysis and visualization of Social Media Data for business.
Theory:
Exploratory Data Analysis (EDA) is an approach to analyze the data using visual techniques. It is used to
discover trends, patterns, or to check assumptions with the help of statistical summary and graphical
representations.
An EDA is a thorough examination meant to uncover the underlying structure of a data set and is important
for a company because it exposes trends, patterns, and relationships that are not readily apparent.
The four types of EDA are
1. Univariate non-graphical,
2. Multivariate non-graphical,
3. Univariate graphical,
4. Multivariate graphical.
Techniques and Tools:
There are a number of tools that are useful for EDA, but EDA is characterized more by the attitude taken
than by particular techniques.
Typical graphical techniques used in EDA are:
 Box plot
 Histogram
 Multi-vari chart
 Run chart
 Pareto chart
 Scatter plot (2D/3D)
 Stem-and-leaf plot
 Parallel coordinates
 Odds ratio
 Heat map
 Bar chart
 Horizon graph
Dimensionality reduction techniques:
 Multidimensional scaling
 Principal component analysis (PCA)
 Multilinear PCA
 Iconography of correlations
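As a small illustration of the dimensionality-reduction techniques listed above, the following sketch standardizes a few invented numeric engagement columns and projects them onto two principal components with scikit-learn; the column names and values are assumptions made for the example.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric engagement features per post
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "likes": rng.integers(10, 2000, 50),
    "comments": rng.integers(0, 100, 50),
    "shares": rng.integers(0, 50, 50),
    "followers": rng.integers(500, 30000, 50),
})

# Standardize the features, then reduce to two principal components
scaled = StandardScaler().fit_transform(data)
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)

print("Explained variance ratio:", pca.explained_variance_ratio_)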
Code:
# Import necessary libraries
import pandas as pd # For data manipulation and analysis
import numpy as np # For numerical computing
import matplotlib.pyplot as plt # For data visualization
import seaborn as sns # For advanced data visualization
# Load the social media data into a pandas DataFrame
social_media_data = pd.read_csv('ashish_social_media_data.csv')
# Explore the data using summary statistics
print(social_media_data.describe())
# Visualize the data using histograms and box plots
plt.figure(figsize=(10,6))
sns.histplot(social_media_data['likes'], kde=True) # Histogram of the 'likes' column
plt.title('Distribution of Likes') # Add title
plt.xlabel('Number of Likes') # Add x-axis label
plt.ylabel('Frequency') # Add y-axis label
plt.show()
plt.figure(figsize=(10,6))
sns.boxplot(x='platform', y='followers', data=social_media_data) # Box plot of 'followers' column by 'platform' column
plt.title('Followers by Platform') # Add title
plt.xlabel('Platform') # Add x-axis label
plt.ylabel('Number of Followers') # Add y-axis label
plt.show()
# Identify correlations between variables
corr_matrix = social_media_data.corr(numeric_only=True) # Create a correlation matrix over numeric columns
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm') # Heatmap of the correlation matrix
plt.title('Correlation Matrix') # Add title
plt.show()
# Visualize the data using scatter plots
plt.figure(figsize=(10,6))
sns.scatterplot(x='comments', y='engagement_rate', data=social_media_data) # Scatter plot of 'comments' column vs 'engagement_rate' column
plt.title('Comments vs Engagement Rate') # Add title
plt.xlabel('Number of Comments') # Add x-axis label
plt.ylabel('Engagement Rate') # Add y-axis label
plt.show()
# Identify outliers using box plots
plt.figure(figsize=(10,6))
sns.boxplot(x='platform', y='likes', data=social_media_data) # Box plot of 'likes' column by 'platform' column
plt.title('Likes by Platform') # Add title
plt.xlabel('Platform') # Add x-axis label
plt.ylabel('Number of Likes') # Add y-axis label
plt.show()
# Create a pairplot to visualize the relationships between all variables
sns.pairplot(data=social_media_data) # Pairplot of all columns
plt.show()
Output:
          followers        likes    comments  engagement_rate
count     10.000000    10.000000   10.000000          10.0000
mean   13200.000000   700.000000   46.500000           4.4500
std     9199.033766   637.268476   35.043624           1.9352
min     2000.000000    50.000000    5.000000           2.0000
25%     6250.000000   262.500000   21.250000           2.6250
50%    10000.000000   500.000000   40.000000           4.5000
75%    18750.000000   950.000000   68.750000           5.8250
max    30000.000000  2000.000000  100.000000           7.5000
Explanation:
This code performs exploratory data analysis on a social media dataset. It first imports necessary
libraries, loads the dataset into a Pandas DataFrame, and then uses summary statistics and data visualization
techniques such as histograms, box plots, scatter plots, and correlation matrices to explore the dataset.
The code also identifies outliers using box plots and creates a pairplot to visualize the relationships
between all variables in the dataset. Overall, the code helps to gain insights and better understand the social
media data.
Conclusion:
Exploratory Data Analysis and visualization of social media data are important steps in the data analysis
process. EDA helps businesses to understand the patterns and trends in social media data and identify
relationships between variables.
Visualization is an important part of EDA because it helps businesses to communicate insights and
findings effectively. Businesses can use Python, R, or other data analysis tools to perform EDA and create
visualizations of social media data. Finally, businesses can interpret the findings from EDA and communicate
the insights to relevant stakeholders.
EXPERIMENT NO. 5.
Aim: Develop Content (text, emoticons, image, audio, video) based social media analytics model for business.
(E.g. Content Based Analysis: Topic, Issue, Trend, sentiment/opinion analysis, audio, video, image
analytics).
Theory:
Developing a content-based social media analytics model for businesses is a crucial step in
understanding their audience and improving their social media presence. In this article, we will discuss the
theory behind content-based analysis, the execution steps to develop the model, and the benefits it can offer for
businesses.
Content-based social media analytics is a process of analyzing the content shared on social media
platforms by a business or its customers. It helps businesses to understand their audience, identify emerging
trends, and improve their social media strategy. Content-based analysis can be used for text, emoticons, image,
audio, and video analytics.
 Topic Analysis: It is the process of identifying the topics and themes in social media content. This can help businesses to understand the interests of their audience and create content that resonates with them.
 Issue Analysis: It is the process of identifying the issues and concerns of the audience. This can help businesses to address the concerns of their audience and improve their social media presence.
 Trend Analysis: It is the process of identifying the emerging trends in social media content. This can help businesses to stay up-to-date with the latest trends and adapt their social media strategy accordingly.
 Sentiment Analysis: It is the process of identifying the sentiment or opinion expressed in social media content. This can help businesses to understand the sentiment of their audience towards their brand, products, or services.
 Audio, Video, and Image Analytics: It is the process of analyzing audio, video, and image content shared on social media platforms. This can help businesses to identify the type of content that resonates with their audience and create similar content to improve engagement.
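As a brief illustration of topic analysis, the sketch below fits a two-topic LDA model on a handful of invented posts using scikit-learn; the sample texts and the number of topics are assumptions for the example, not part of this experiment's dataset.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# A few example social media posts (illustrative only)
posts = [
    "great battery life on the new phone, camera is amazing",
    "delivery was late and customer support never replied",
    "love the camera quality, best phone I have owned",
    "refund still pending, very poor support experience",
]

# Convert the posts to a bag-of-words matrix and fit a 2-topic LDA model
vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(posts)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(dtm)

# Print the top words for each discovered topic
terms = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top_words = [terms[i] for i in topic.argsort()[-5:]]
    print(f"Topic {idx}: {top_words}")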
Execution Steps:
1. Data Collection: Collect data from social media platforms, including text, emoticons, images, audio,
and video content.
2. Preprocessing: Clean and preprocess the data by removing stop words, converting text to lowercase,
and removing special characters.
3. Feature Extraction: Extract relevant features from the data, such as topics, issues, trends, sentiment,
and image, audio, and video content.
4. Model Development: Develop a machine learning model to analyze the extracted features and provide
insights about the content shared on social media platforms.
5. Model Evaluation: Evaluate the performance of the model using metrics such as accuracy, precision,
and recall.
6. Implementation: Implement the model to analyze the content shared on social media platforms and
provide insights to improve the social media strategy.
Code:
# DataFrame
import pandas as pd
# Matplot
import matplotlib.pyplot as plt
%matplotlib inline
# Scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.manifold import TSNE
from sklearn.feature_extraction.text import TfidfVectorizer
# Keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Activation, Dense, Dropout, Embedding, Flatten, Conv1D, MaxPooling1D, LSTM
from keras import utils
from keras.callbacks import ReduceLROnPlateau, EarlyStopping
# nltk
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
# Word2vec
import gensim
# Utility
import re
import numpy as np
import os
from collections import Counter
import logging
import time
import pickle
import itertools
# Set log
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
nltk.download('stopwords')
# Settings
# DATASET
DATASET_COLUMNS = ["target", "ids", "date", "flag", "user", "text"]
DATASET_ENCODING = "ISO-8859-1"
TRAIN_SIZE = 0.8
# TEXT CLEANING
TEXT_CLEANING_RE = "@\S+|https?:\S+|http?:\S|[^A-Za-z0-9]+"
# WORD2VEC
W2V_SIZE = 300
W2V_WINDOW = 7
W2V_EPOCH = 32
W2V_MIN_COUNT = 10
# KERAS
SEQUENCE_LENGTH = 300
EPOCHS = 8
BATCH_SIZE = 1024
# SENTIMENT
POSITIVE = "POSITIVE"
NEGATIVE = "NEGATIVE"
NEUTRAL = "NEUTRAL"
SENTIMENT_THRESHOLDS = (0.4, 0.7)
# EXPORT
KERAS_MODEL = "model.h5"
WORD2VEC_MODEL = "model.w2v"
TOKENIZER_MODEL = "tokenizer.pkl"
ENCODER_MODEL = "encoder.pkl"
dataset_filename = os.listdir("../input")[0]
dataset_path = os.path.join("..","input",dataset_filename)
print("Open file:", dataset_path)
df = pd.read_csv(dataset_path, encoding =DATASET_ENCODING , names=DATASET_COLUMNS)
print("Dataset size:", len(df))
# df.head(5)
# Map target label to String
# 0 -> NEGATIVE
# 2 -> NEUTRAL
# 4 -> POSITIVE
decode_map = {0: "NEGATIVE", 2: "NEUTRAL", 4: "POSITIVE"}
def decode_sentiment(label):
    return decode_map[int(label)]
%%time
df.target = df.target.apply(lambda x: decode_sentiment(x))
target_cnt = Counter(df.target)
plt.figure(figsize=(16,8))
plt.bar(target_cnt.keys(), target_cnt.values())
plt.title("Dataset labels distribuition")
# Pre-Process dataset
stop_words = stopwords.words("english")
stemmer = SnowballStemmer("english")
def preprocess(text, stem=False):
    # Remove links, user mentions and special characters
    text = re.sub(TEXT_CLEANING_RE, ' ', str(text).lower()).strip()
    tokens = []
    for token in text.split():
        if token not in stop_words:
            if stem:
                tokens.append(stemmer.stem(token))
            else:
                tokens.append(token)
    return " ".join(tokens)
%%time
df.text = df.text.apply(lambda x: preprocess(x))
df_train, df_test = train_test_split(df, test_size=1-TRAIN_SIZE, random_state=42)
print("TRAIN size:", len(df_train))
print("TEST size:", len(df_test))
# Word2Vec
%%time
documents = [_text.split() for _text in df_train.text]
w2v_model = gensim.models.word2vec.Word2Vec(size=W2V_SIZE,
window=W2V_WINDOW,
min_count=W2V_MIN_COUNT,
workers=8)
w2v_model.build_vocab(documents)
words = w2v_model.wv.vocab.keys()
vocab_size = len(words)
print("Vocab size", vocab_size)
%%time
w2v_model.train(documents, total_examples=len(documents), epochs=W2V_EPOCH)
w2v_model.most_similar("love")
%%time
tokenizer = Tokenizer()
tokenizer.fit_on_texts(df_train.text)
vocab_size = len(tokenizer.word_index) + 1
print("Total words", vocab_size)
%%time
x_train = pad_sequences(tokenizer.texts_to_sequences(df_train.text), maxlen=SEQUENCE_LENGTH)
x_test = pad_sequences(tokenizer.texts_to_sequences(df_test.text), maxlen=SEQUENCE_LENGTH)
# Label Encoder
labels = df_train.target.unique().tolist()
labels.append(NEUTRAL)
labels
encoder = LabelEncoder()
encoder.fit(df_train.target.tolist())
y_train = encoder.transform(df_train.target.tolist())
y_test = encoder.transform(df_test.target.tolist())
y_train = y_train.reshape(-1,1)
y_test = y_test.reshape(-1,1)
print("y_train",y_train.shape)
print("y_test",y_test.shape)
print("x_train", x_train.shape)
print("y_train", y_train.shape)
print()
print("x_test", x_test.shape)
print("y_test", y_test.shape)
y_train[:10]
# Embedding layer
embedding_matrix = np.zeros((vocab_size, W2V_SIZE))
for word, i in tokenizer.word_index.items():
    if word in w2v_model.wv:
        embedding_matrix[i] = w2v_model.wv[word]
print(embedding_matrix.shape)
embedding_layer = Embedding(vocab_size, W2V_SIZE, weights=[embedding_matrix], input_length=SEQUENCE_LENGTH, trainable=False)
# Build Model
model = Sequential()
model.add(embedding_layer)
model.add(Dropout(0.5))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
model.summary()
# Compile model
model.compile(loss='binary_crossentropy',
optimizer="adam",
metrics=['accuracy'])
# Callbacks
callbacks = [ ReduceLROnPlateau(monitor='val_loss', patience=5, cooldown=0),
EarlyStopping(monitor='val_acc', min_delta=1e-4, patience=5)]
# Train
%%time
history = model.fit(x_train, y_train,
batch_size=BATCH_SIZE,
epochs=EPOCHS,
validation_split=0.1,
verbose=1,
callbacks=callbacks)
# Evaluate
%%time
score = model.evaluate(x_test, y_test, batch_size=BATCH_SIZE)
print()
print("ACCURACY:",score[1])
print("LOSS:",score[0])
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(len(acc))
plt.plot(epochs, acc, 'b', label='Training acc')
plt.plot(epochs, val_acc, 'r', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()
plt.figure()
plt.plot(epochs, loss, 'b', label='Training loss')
plt.plot(epochs, val_loss, 'r', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()
# Predict
def decode_sentiment(score, include_neutral=True):
    if include_neutral:
        label = NEUTRAL
        if score <= SENTIMENT_THRESHOLDS[0]:
            label = NEGATIVE
        elif score >= SENTIMENT_THRESHOLDS[1]:
            label = POSITIVE
        return label
    else:
        return NEGATIVE if score < 0.5 else POSITIVE

def predict(text, include_neutral=True):
    start_at = time.time()
    # Tokenize text
    x_test = pad_sequences(tokenizer.texts_to_sequences([text]), maxlen=SEQUENCE_LENGTH)
    # Predict
    score = model.predict([x_test])[0]
    # Decode sentiment
    label = decode_sentiment(score, include_neutral=include_neutral)
    return {"label": label, "score": float(score),
            "elapsed_time": time.time() - start_at}
predict("I love the music")
predict("I hate the rain")
predict("i don't know what i'm doing")
# Confusion Matrix
%%time
y_pred_1d = []
y_test_1d = list(df_test.target)
scores = model.predict(x_test, verbose=1, batch_size=8000)
y_pred_1d = [decode_sentiment(score, include_neutral=False) for score in scores]
def plot_confusion_matrix(cm, classes,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title, fontsize=30)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=90, fontsize=22)
    plt.yticks(tick_marks, classes, fontsize=22)
    fmt = '.2f'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.ylabel('True label', fontsize=25)
    plt.xlabel('Predicted label', fontsize=25)
%%time
cnf_matrix = confusion_matrix(y_test_1d, y_pred_1d)
plt.figure(figsize=(12,12))
plot_confusion_matrix(cnf_matrix, classes=df_train.target.unique(), title="Confusion matrix")
plt.show()
# Classification Report
print(classification_report(y_test_1d, y_pred_1d))
# Accuracy Score
accuracy_score(y_test_1d, y_pred_1d)
# Save model
model.save(KERAS_MODEL)
w2v_model.save(WORD2VEC_MODEL)
pickle.dump(tokenizer, open(TOKENIZER_MODEL, "wb"), protocol=0)
pickle.dump(encoder, open(ENCODER_MODEL, "wb"), protocol=0)
Output:
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 300, 300)          87125700
_________________________________________________________________
dropout_1 (Dropout)          (None, 300, 300)          0
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               160400
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 101
=================================================================
ACCURACY: 0.791134375
LOSS: 0.4442952796936035
CPU times: user 2min 25s, sys: 16.9 s, total: 2min 42s
Wall time: 1min 52s
CPU times: user 1.39 s, sys: 256 ms, total: 1.64 s
Wall time: 1.38 s

              precision    recall  f1-score   support

    NEGATIVE       0.79      0.79      0.79    159494
    POSITIVE       0.80      0.79      0.79    160506

   micro avg       0.79      0.79      0.79    320000
   macro avg       0.79      0.79      0.79    320000
weighted avg       0.79      0.79      0.79    320000
Explanation:
This is a Python script for performing sentiment analysis on a Twitter dataset using machine learning
techniques. The script imports various libraries such as pandas, matplotlib, scikit-learn, Keras, nltk, and gensim.
The dataset is cleaned, pre-processed, and transformed into a format that can be used to train and test machine
learning models.
The script then trains a machine learning model using Keras, and the trained model is used to make
predictions on the test set. The performance of the model is evaluated using various metrics such as confusion
matrix, classification report, and accuracy score. Finally, the trained model is exported for future use.
Conclusion:
Content-based social media analytics is a crucial step in understanding the audience and improving the
social media presence of businesses. It involves analyzing the content shared on social media platforms,
including text, emoticons, images, audio, and video content, to identify topics, issues, trends, sentiment, and
image, audio, and video content.
Developing a content-based social media analytics model involves data collection, preprocessing,
feature extraction, model development, model evaluation, and implementation. By using content-based social
media analytics, businesses can improve their social media strategy and engage with their audience effectively.
EXPERIMENT NO. 6.
Aim: Develop Structure based social media analytics model for any business. (E.g. Structure Based Models - community detection, influence analysis).
Theory:
Developing a structure-based social media analytics model for any business is a critical step in
understanding the network of their audience and identifying influential users. In this article, we will discuss the
theory behind structure-based models, the execution steps to develop the model, and the benefits it can offer for
businesses.
Structure-based social media analytics is a process of analyzing the social network structure of a
business or its customers. It helps businesses to understand the relationships between their audience and identify
influential users. Structure-based analysis can be used for community detection and influence analysis.
 Community Detection: It is the process of identifying groups or communities of users within a social
network. This can help businesses to understand the interests and preferences of different user groups
and create targeted content to improve engagement.
 Influence Analysis: It is the process of identifying influential users within a social network. This can
help businesses to identify users who have a significant impact on their audience and engage with them
to improve their social media presence.
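The sketch below illustrates both ideas on a tiny invented interaction graph using NetworkX: communities are detected with a greedy modularity algorithm, and influence is approximated with degree and betweenness centrality. The edge list and user names are assumptions for illustration only; on real data the graph would be built from the interactions collected in step 1 of the execution steps that follow.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Hypothetical interaction graph: an edge means one user interacted with another
edges = [("amit", "bina"), ("bina", "chetan"), ("amit", "chetan"),
         ("deepa", "esha"), ("esha", "farid"), ("deepa", "farid"),
         ("chetan", "deepa")]
G = nx.Graph(edges)

# Community detection: group users into densely connected clusters
communities = greedy_modularity_communities(G)
for idx, community in enumerate(communities):
    print(f"Community {idx}: {sorted(community)}")

# Influence analysis: rank users by degree and betweenness centrality
degree = nx.degree_centrality(G)
betweenness = nx.betweenness_centrality(G)
print("Most connected users:", sorted(degree, key=degree.get, reverse=True)[:3])
print("Bridge users (betweenness):", sorted(betweenness, key=betweenness.get, reverse=True)[:3])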
Execution Steps:
1. Data Collection: Collect data from social media platforms, including user profiles, followers, and
interactions between users.
2. Network Construction: Construct a social network graph based on the collected data, with nodes
representing users and edges representing interactions between them.
3. Community Detection: Use community detection algorithms to identify groups or communities of
users within the social network.
4. Influence Analysis: Use influence analysis algorithms to identify influential users within the social
network.
5. Model Development: Develop a machine learning model to analyze the network structure and provide
insights about the social network.
6. Model Evaluation: Evaluate the performance of the model using metrics such as modularity and
centrality.
7. Implementation: Implement the model to analyze the social network structure and provide insights to
improve the social media strategy.
Code:
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import re

df_i = pd.read_csv('C:/Users/THEASHISHGAVADE/Downloads/social media influencers - instagram.csv')

def convert(x):
    return re.findall(r'\d+\.?\d*', x)

def change(df, list1):
    # Convert follower/engagement strings such as "48.5M" or "120K" into numbers
    for i in list1:
        df['new' + i] = df[i].apply(convert)
        df['new' + i] = df['new' + i].apply(lambda x: "".join(x))
        df['new' + i] = pd.to_numeric(df['new' + i])
        df['new' + i] = np.where(['M' in j for j in df[i]], df['new' + i] * 1000000,
                                 np.where(['K' in j1 for j1 in df[i]], df['new' + i] * 1000, df['new' + i]))
    return df

df_i.head(2)
df_i.rename({'category_1': 'Category', 'Audience country(mostly)': 'Audience Country'}, axis=1, inplace=True)
df_i.head(2)
df_i.isnull().sum()
df_i.drop_duplicates(subset=['Influencer insta name'], inplace=True)
df_i.shape
df_i.drop(labels=['Influencer insta name', 'Authentic engagement\r\n'], axis=1, inplace=True)
df_i.head(2)

# Columns holding "M"/"K"-suffixed counts to convert (assumed from the dataset columns)
li = ['Followers', 'Engagement avg\r\n']
change(df_i, li)

# Engagement rate: the percentage of followers who really engage with the content posted by influencers
df_i['Engagement Rate'] = np.round((df_i['newEngagement avg\r\n'] / df_i['newFollowers']) * 100, 3)
print(df_i['Followers'].str[-1].unique())

# For convenience, express followers in millions
df_i['newFollowers'] = df_i['newFollowers'] / 1000000
df_i.drop(labels=['Engagement avg\r\n', 'newEngagement avg\r\n'], axis=1, inplace=True)
df_i.head(5)

# Top 15 most followed celebrities on Instagram
df_i.sort_values(by='newFollowers', ascending=False, ignore_index=True).iloc[0:15, [0, 1, 3, -1]]
plt.title('Top 15 most followed celebrity on instagram')
plt.xlabel('Followers in Million')
sns.barplot(y='instagram name', x='newFollowers', data=df_i.sort_values(by='newFollowers', ascending=False).head(15))

pallete = ['red', 'green', 'yellow', 'salmon', 'cyan', 'blue', 'orange']

def plot(df):
    plt.figure(figsize=(8, 6))
    plt.xlabel('number of times category occured')
    plt.ylabel('Category')
    df['Category'].value_counts().sort_values(ascending=True).plot.barh(color=pallete)

# Top categories followed on Instagram (popular categories on Instagram)
plot(df_i)

# Decide where you want to place ads
def plot_c(df):
    plt.figure(figsize=(10, 8))
    plt.xlabel('number of times category occured')
    df['Audience Country'].value_counts().sort_values().plot.barh(color=pallete)

# For understanding where the demand for a product is
def demand(data, category):
    return data[data['Category'] == category]['Audience Country'].value_counts().sort_values(ascending=True).plot.barh(color=pallete)

demand(df_i, 'Lifestyle')
df_i['newFollowers'].describe()
df_i['newFollowers'].quantile(0.94)
df_i.head(2)

def for_mini_followers_instagram(coun, cat):
    df1 = df_i[df_i['Audience Country'] == coun]
    df1_mini = df1[df1['newFollowers'] < 60]
    return df1_mini.sort_values(by='Engagement Rate', ascending=False).groupby('Category').get_group(cat).iloc[:, [0, 3, -1]]

for_mini_followers_instagram('India', 'Music')

def for_mega_followers_instagram(coun, cat):
    df1 = df_i[df_i['Audience Country'] == coun]
    df1_mega = df1[df1['newFollowers'] > 60]
    return df1_mega.sort_values(by='Engagement Rate', ascending=False).groupby('Category').get_group(cat).iloc[:, [0, 3, -1]]

for_mega_followers_instagram('India', 'Music')
for_mini_followers_instagram('India', 'Beauty')
for_mini_followers_instagram('India', 'Shows')
for_mini_followers_instagram('India', 'Sports with a ball')
for_mega_followers_instagram('India', 'Sports with a ball')
for_mega_followers_instagram('Brazil', 'Sports with a ball')
Output:
df_i.isnull().sum():
Influencer insta name          0
instagram name                21
Category                     108
category_2                   713
Followers                      0
Audience Country              14
Authentic engagement\r\n       0
Engagement avg\r\n             0
dtype: int64

df_i.head(5):
   instagram name   Category            category_2  Followers  Audience Country  newFollowers  Engagement Rate
0  433              Sports with a ball  NaN         48.5M      Spain             48.5          1.313
1  TAEYANG          Music               NaN         12.7M      Indonesia         12.7          4.270
2  НАСТЯ ИВЛЕЕВА    Shows               NaN         18.8M      Russia            18.8          2.010
3  Joy              Lifestyle           NaN         13.5M      Indonesia         13.5          10.370
4  Jaehyun          NaN                 NaN         11.1M      Indonesia         11.1          27.928

df_i['newFollowers'].describe():
count    997.000000
mean      25.539619
std       40.586338
min        2.600000
25%        9.000000
50%       14.600000
75%       26.500000
max      487.200000
Name: newFollowers, dtype: float64
Explanation:
This code is analyzing data related to social media influencers on Instagram. It uses various Python
libraries such as NumPy, Pandas, Matplotlib, and Seaborn for data processing, visualization, and analysis. The
code loads a CSV file containing data related to social media influencers and performs various data cleaning
and manipulation operations such as dropping duplicates, converting data types, and renaming columns.
After cleaning the data, the code generates various visualizations such as a bar plot of the top 15 most
followed celebrities on Instagram, a bar plot of the most popular categories followed on Instagram, and a bar
plot of the countries with the highest demand for different product categories.
Lastly, the code defines several functions that can be used to filter the data based on different criteria
such as country, category, and number of followers. These functions can be used to analyze the engagement rate
of influencers, identify the most popular categories among followers, and find influencers with high
engagement rates in specific categories and countries.
Conclusion:
Structure-based social media analytics is a crucial step in understanding the network of the audience and
identifying influential users for businesses. It involves analyzing the social network structure based on user
profiles, followers, and interactions between users.
Developing a structure-based social media analytics model involves data collection, network
construction, community detection, influence analysis, model development, model evaluation, and
implementation. By using structure-based social media analytics, businesses can improve their social media
strategy, identify influential users, and engage with them to improve their social media presence.
EXPERIMENT NO. 7.
Aim: Develop a dashboard and reporting tool based on real time social media data.
Theory:
Developing a dashboard and reporting tool based on real-time social media data is a critical step in
monitoring the performance of a business's social media strategy. In this article, we will discuss the theory
behind social media dashboards, the execution steps to develop the dashboard and reporting tool, and the
benefits it can offer for businesses.
A social media dashboard is a tool that provides real-time monitoring of social media activities,
including engagement, reach, and impressions. It helps businesses to track the performance of their social media
strategy and identify areas for improvement. The dashboard can provide data visualization, including charts and
graphs, to enable users to interpret the data easily.
Reporting tools are used to create regular reports based on the data gathered by the dashboard. These
reports can provide insights into the performance of the social media strategy, identify trends, and suggest areas
for improvement.
Execution Steps:
1. Identify Key Performance Indicators (KPIs): Identify the KPIs that are important for the business, such
as engagement rate, reach, and impressions.
2. Data Collection: Collect data from social media platforms, including user profiles, followers, and
interactions between users.
3. Data Processing: Pre-process the data to remove any irrelevant data and prepare it for analysis.
4. Data Visualization: Use data visualization tools such as charts and graphs to represent the data in an
easily understandable format.
5. Dashboard Development: Develop a dashboard using a tool such as Tableau, Power BI, or Google Data
Studio.
6. Reporting Tool Development: Develop a reporting tool that generates regular reports based on the data
collected by the dashboard.
7. Implementation: Implement the dashboard and reporting tool to provide real-time monitoring of social
media activities and generate regular reports for the business.
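As a small sketch of steps 1, 3 and 4, assuming a dataframe of collected tweets with columns like those produced by the scraper in the code below (date, likeCount, retweetCount, replyCount), engagement can be aggregated per hour and plotted as a chart that a dashboard tool such as Tableau, Power BI, or Google Data Studio could embed; the sample rows are invented for illustration.
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical slice of collected tweets; real data would come from the scraper below
tweets = pd.DataFrame({
    "date": pd.to_datetime(["2023-04-01 10:05", "2023-04-01 10:40",
                            "2023-04-01 11:15", "2023-04-01 12:20"]),
    "likeCount": [12, 3, 40, 7],
    "retweetCount": [2, 0, 11, 1],
    "replyCount": [1, 0, 5, 0],
})

# KPI: total engagement per tweet (likes + retweets + replies)
tweets["engagement"] = tweets[["likeCount", "retweetCount", "replyCount"]].sum(axis=1)

# Aggregate the KPI per hour for the dashboard/report
hourly = tweets.set_index("date").resample("1H")["engagement"].sum()

# Plot the hourly engagement trend
hourly.plot(kind="bar", title="Hourly engagement")
plt.xlabel("Hour")
plt.ylabel("Engagement")
plt.tight_layout()
plt.show()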
Code:
import pandas as pd
import snscrape.modules.twitter as sntwitter
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
import string
import re
import textblob
from textblob import TextBlob
from wordcloud import WordCloud, STOPWORDS
from wordcloud import ImageColorGenerator
import warnings
%matplotlib inline
import os
# Using the os library to call the snscrape CLI from Python
os.system("snscrape --jsonl --max-results 10000 --since 2023-03-13 twitter-search 'CHATGPT4' > text-chatGPT4-tweets.json")
# Create a pandas dataframe
tweets_df_chatGPT4 = pd.read_json('text-chatGPT4-tweets.json', lines=True)
#tweets_df_chatGPT4
# 2. Data Loading
df_chatGPT4 = tweets_df_chatGPT4[["date", "rawContent", "renderedContent", "user", "replyCount", "retweetCount", "likeCount", "lang", "place", "hashtags", "viewCount"]]
print(df_chatGPT4.shape)
# 3. Twitter Data Cleaning, Preprocessing and Exploratory Data Analysis
df2 = df_chatGPT4.drop_duplicates('renderedContent')
# shape of DataFrame
print(df2.shape)
df2.head()
df2.info()
df2.date.value_counts()
#Heatmap for missing values
plt.figure(figsize=(17,5))
sns.heatmap(df2.isnull(),cbar=True,yticklabels=False)
plt.xlabel("Column_Name", size=14,weight="bold")
plt.title("Places of missing values is cloumn",fontweight="bold",size=17)
plt.show()
Output:
df2.head() (rawContent and user columns omitted here for readability):

   date                       renderedContent                                     replyCount  retweetCount  likeCount  lang  place  hashtags                                            viewCount
0  2023-04-01 12:46:53+00:00  So I keep using up all the ChatGPT4 20 questio...  0           0             0          en    None   None                                                NaN
1  2023-04-01 12:46:32+00:00  @theDontGetRekt @mreflow No in ChatGPT4. You c...   0           0             0          en    None   None                                                1.0
2  2023-04-01 12:43:04+00:00  Italia Berencana Memblokir ChatGPT, Kenapa Ya?...   0           0             0          in    None   [ChatGPT, chatgpt4, Italia, OpenAIChatGPT, Ope...  6.0
3  2023-04-01 12:40:41+00:00  essa parada do chat gpt4 é sinistra, daqui a p...   0           0             0          pt    None   None                                                7.0
4  2023-04-01 12:40:29+00:00  Basically : I had expressed my feelings. 🤪...       1           0             0          en    None   [ChatGPT4, ChatGPT5, technology, OpenAI, ChatG...  13.0

df2.info():
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3335 entries, 0 to 3368
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   date             3335 non-null   datetime64[ns, UTC]
 1   rawContent       3335 non-null   object
 2   renderedContent  3335 non-null   object
 3   user             3335 non-null   object
 4   replyCount       3335 non-null   int64
 5   retweetCount     3335 non-null   int64
 6   likeCount        3335 non-null   int64
 7   lang             3335 non-null   object
 8   place            54 non-null     object
 9   hashtags         1949 non-null   object
 10  viewCount        3332 non-null   float64
dtypes: datetime64[ns, UTC](1), float64(1), int64(3), object(6)
memory usage: 312.7+ KB
Explanation:
 This code is for collecting and analyzing tweets related to a specific topic or search query.
 The code uses the Python libraries pandas, snscrape, numpy, matplotlib, seaborn, nltk, re, textblob, and wordcloud to perform data cleaning, preprocessing, and exploratory data analysis on the collected tweets.
 The snscrape library is used to scrape Twitter data based on a search query, in this case tweets containing the term "CHATGPT4". The collected data is stored in a JSON file and then read into a pandas dataframe.
 The dataframe is then cleaned to remove duplicate tweets and missing values. Exploratory data analysis is performed using visualizations such as a heatmap to show the places of missing values in the dataframe.
 The code also imports various natural language processing utilities for tokenization, stopword removal, stemming, and sentiment analysis using the textblob library, and the wordcloud library can be used to visualize the most frequent words in the collected tweets.
Conclusion:
Developing a dashboard and reporting tool based on real-time social media data is a critical step in
monitoring the performance of a business's social media strategy. It involves identifying the KPIs, collecting
and pre-processing data, visualizing data, and developing the dashboard and reporting tool.
By using a social media dashboard and reporting tool, businesses can monitor the performance of their
social media strategy in real-time, identify areas for improvement, and generate regular reports to improve their
social media presence.
EXPERIMENT NO. 8.
Aim: Design the creative content for promotion of your business on social media platform.
Theory:
Designing creative content for the promotion of your business on a social media platform is a crucial aspect of
building a strong online presence. In this article, we will discuss the theory behind designing creative content
for social media, the execution steps to create effective content, and the benefits it can offer businesses.
Designing creative content for social media requires a good understanding of your target audience and
the social media platform you plan to use. Effective content must capture the audience's attention and
communicate the brand's message. Content can take various forms, such as text, images, videos, or
infographics. The content must be engaging, informative, and relevant to the target audience.
Execution Steps:
1. Define the Target Audience: Identify the target audience, including their interests, preferences, and
behaviors, to create content that resonates with them.
2. Choose the Social Media Platform: Choose the social media platform based on the target audience and
the business's goals. Different platforms have different formats and audience demographics.
3. Develop a Content Strategy: Develop a content strategy that aligns with the business's goals and the
target audience's needs. The strategy should include the type of content, frequency, and tone.
4. Create Content: Create content that aligns with the content strategy, using various formats such as text,
images, videos, or infographics. Ensure that the content is relevant, informative, and engaging.
5. Optimize Content: Optimize the content for the chosen social media platform, such as using the right
hashtags, keywords, and image sizes (a small illustrative check follows this list).
6. Schedule Content: Schedule the content using social media management tools such as Hootsuite, Buffer,
or Sprout Social.
7. Analyze Performance: Analyze the performance of the content using social media analytics tools such as
Facebook Insights, Twitter Analytics, or Google Analytics. Use the insights to refine the content
strategy.
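As a rough, purely illustrative sketch of step 5, the helper below checks a draft post against assumed per-platform limits; the numbers in PLATFORM_LIMITS and the function check_post are assumptions made for illustration, not official platform specifications.
PLATFORM_LIMITS = {
    'twitter':   {'max_chars': 280,  'max_hashtags': 2},   # assumed limits, illustrative only
    'instagram': {'max_chars': 2200, 'max_hashtags': 30},
    'linkedin':  {'max_chars': 3000, 'max_hashtags': 5},
}

def check_post(text, platform):
    """Return a list of warnings for a draft post on the given platform."""
    limits = PLATFORM_LIMITS[platform]
    warnings = []
    if len(text) > limits['max_chars']:
        warnings.append("Text exceeds %d characters." % limits['max_chars'])
    hashtags = [w for w in text.split() if w.startswith('#')]
    if len(hashtags) > limits['max_hashtags']:
        warnings.append("Too many hashtags (%d)." % len(hashtags))
    return warnings

print(check_post("Upgrade your tech game with TechBoost! #TechBoost #laptops #gaming", 'twitter'))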
A social media ad campaign to promote a fictional technology business called "TechBoost":
Ad #1: Image: A sleek, modern laptop with the TechBoost logo on the screen.
"Upgrade your tech game with TechBoost. Our high-performance laptops are designed to keep up with your
busy lifestyle, whether you're a student, professional, or gamer. Shop now and experience the power of
TechBoost!"
Ad #2: Image: A person working on their laptop in a coffee shop, with the TechBoost logo visible on the
back of their laptop screen.
"Take your work with you wherever you go. With TechBoost, you can work from anywhere with ease.
Our laptops are lightweight and easy to carry, so you can stay productive on the go."
Ad #3: Image: A person gaming on a TechBoost laptop, with a high-resolution game visible on the screen.
"Experience the ultimate gaming performance with TechBoost. Our laptops are equipped with the latest
graphics and processors, so you can enjoy your favorite games at the highest level. Get yours now and take
your gaming to the next level!"
Overall, the campaign showcases TechBoost's high-performance laptops, portability, and versatility. The
use of sleek, modern visuals and bold copy is intended to appeal to tech-savvy individuals who are looking
for high-quality products.
Conclusion:
Designing creative content for the promotion of your business on a social media platform is critical to building a
strong online presence.
It involves understanding the target audience, choosing the right social media platform, developing a
content strategy, creating and optimizing the content, scheduling the content, and analyzing its performance. By
creating engaging and relevant content, businesses can attract and retain the target audience, increase brand
awareness, and achieve their business goals.
EXPERIMENT NO. 9.
Aim: Analyze competitor activities using social media data.
Theory:
Analyzing competitor activities using social media data is an essential part of a business's social media
strategy. It helps businesses understand their competitors' strengths and weaknesses, identify new opportunities,
and improve their own social media performance. In this article, we will discuss the theory behind analyzing
competitor activities using social media data, the execution steps to perform this analysis, and the benefits it can
offer for businesses.
Analyzing competitor activities using social media data involves monitoring and analyzing their social
media activities, such as their content, engagement metrics, audience demographics, and advertising campaigns.
It helps businesses gain insights into their competitors' social media strategies, benchmark their own
performance, and identify areas for improvement. Effective competitor analysis requires a good understanding
of the social media platforms used by competitors and the tools available to monitor their activities.
Execution Steps:
1. Identify Competitors: Identify the competitors that the business wants to analyze based on their
industry, target audience, and social media presence.
2. Determine Social Media Platforms: Determine the social media platforms used by the competitors and
the frequency and types of content they post.
3. Monitor Competitor Activity: Monitor the competitors' social media activities using social media
management tools such as Hootsuite, Buffer, or Sprout Social. This helps identify the frequency and
type of content posted by the competitors.
4. Analyze Engagement Metrics: Analyze engagement metrics such as likes, comments, shares, and
followers, to identify the type of content that resonates with the audience and the level of audience
engagement.
5. Evaluate Advertising Campaigns: Evaluate the competitors' advertising campaigns using tools such as
Facebook Ads Library or Twitter Ads Transparency Center, to determine the target audience and ad
spend.
6. Benchmark Performance: Benchmark the business's social media performance against the competitors
using metrics such as audience growth, engagement rate, and advertising spend.
7. Identify Opportunities: Identify new opportunities for the business based on the insights gained from
the competitor analysis, such as new content ideas, audience demographics, or advertising strategies.
Code:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
df = pd.read_csv('/kaggle/input/linkedin-influencers-data/ashish_influencers_data.csv')
df.head(3)
#df.tail(3)
df.shape
df.columns
df.info()
df.describe()
df['name'].unique()
df['name'].unique().shape
df.isna().sum()
df = df.drop(columns=['views', 'votes', 'media_type', 'content', 'connections', 'location'])
df.isna().sum()
# focus on the first influencer
df_first = df[df['name']=='Ashish Gavade'].copy()
df_first.head()
df_first.info()
# drop rows with missing follower counts before casting to int
df_first = df_first.dropna(subset=['followers'])
df_first.isna().sum()
df_first['followers'] = df_first['followers'].astype(int)
fig, ax = plt.subplots(figsize=(20,8))
ax.bar(df_first['time_spent'], df_first['followers'], color='gray')
ax.xaxis.set_major_locator(plt.MaxNLocator(10))
ax.set_xlabel('Time Spent', fontsize='11')
ax.set_ylabel('Number of people', fontsize='11')
plt.title('Followers', fontsize='25')
plt.grid()
plt.show()
fig, ax = plt.subplots(figsize=(20,8))
ax.bar(df_first['time_spent'], df_first['reactions'], color='forestgreen')
ax.bar(df_first['time_spent'], df_first['comments'], color='Blue')
ax.xaxis.set_major_locator(plt.MaxNLocator(10))
ax.set_xlabel('Time Spent', fontsize='11')
ax.set_ylabel('Number of people', fontsize='11')
plt.title('Reaction Vs. Comments', fontsize='25')
plt.legend(['Reactions', 'Comments'])
plt.grid()
plt.show()
fig, ax = plt.subplots(figsize=(20,8))
ax.bar(df_first['time_spent'], df_first['num_hashtags'], color='Purple')
ax.bar(df_first['time_spent'], df_first['hashtag_followers'], color='Lightseagreen')
ax.xaxis.set_major_locator(plt.MaxNLocator(10))
ax.set_xlabel('Time Spent', fontsize='11')
ax.set_ylabel('Numbers', fontsize='11')
plt.title('Number of Hashtags Vs. Hashtag Followers', fontsize='25')
plt.legend(['Number of Hashtags', 'Hashtag Followers'])
plt.grid()
plt.show()
# repeat the analysis for the competitor influencer
df_tom = df[df['name']=='Tom Goodwin'].copy()
# df_tom.head()
df_tom = df_tom.dropna(subset=['followers'])
df_tom.isna().sum()
df_tom.info()
fig, ax = plt.subplots(figsize=(20,8))
ax.bar(df_tom['time_spent'], df_tom['followers'], color='gray')
ax.xaxis.set_major_locator(plt.MaxNLocator(10))
ax.set_xlabel('Time Spent', fontsize='11')
ax.set_ylabel('Number of people', fontsize='11')
plt.title('Followers', fontsize='25')
plt.grid()
plt.show()
fig, ax = plt.subplots(figsize=(20,8))
ax.bar(df_tom['time_spent'], df_tom['reactions'], color='forestgreen')
ax.bar(df_tom['time_spent'], df_tom['comments'], color='Blue')
ax.xaxis.set_major_locator(plt.MaxNLocator(10))
ax.set_xlabel('Time Spent', fontsize='11')
ax.set_ylabel('Number of people', fontsize='11')
plt.title('Reaction Vs. Comments', fontsize='25')
plt.legend(['Reactions', 'Comments'])
plt.grid()
plt.show()
fig, ax = plt.subplots(figsize=(20,8))
ax.bar(df_tom['time_spent'], df_tom['num_hashtags'], color='Purple')
ax.bar(df_tom['time_spent'], df_tom['hashtag_followers'], color='Lightseagreen')
ax.xaxis.set_major_locator(plt.MaxNLocator(10))
ax.set_xlabel('Time Spent', fontsize='11')
ax.set_ylabel('Numbers', fontsize='11')
plt.title('Number of Hashtags Vs. Hashtag Followers', fontsize='25')
plt.legend(['Number of Hashtags', 'Hashtag Followers'])
plt.grid()
plt.show()
Output:
df.describe() summary statistics:

        Unnamed: 0     followers      num_hashtags  hashtag_followers  reactions       comments      views
count   34012.000000   3.397000e+04   34012.000000  34012.0            34012.000000    34012.000000  0.0
mean    17005.500000   1.125922e+06   2.099788      0.0                472.956486      26.977273     NaN
std     9818.563014    3.057750e+06   3.517457      0.0                4163.929944     216.364372    NaN
min     0.000000       1.710000e+02   0.000000      0.0                0.000000        0.000000      NaN
25%     8502.750000    9.914800e+04   0.000000      0.0                7.000000        0.000000      NaN
50%     17005.500000   4.082540e+05   0.000000      0.0                36.000000       2.000000      NaN
75%     25508.250000   7.193340e+05   3.000000      0.0                143.000000      14.000000     NaN
max     34011.000000   1.828935e+07   48.000000     0.0                391498.000000   32907.000000  NaN

df['name'].unique():
array(['Ashish Gavade', 'Jonathan Wolfer', 'Karen Gross',
'Kaia Niambi Shivers Ph.D.', "Daniel Cohen-I'm Flyering",
'Natalie Riso', 'Dale Corley', 'James Calder', 'Yasi Baiani',
'Julie Kliger', 'Stephanie C. O.', 'Michelle de Haaff',
'Bertalan Meskó, MD, PhD', 'Michelle Chaffee', 'Beth Seidenberg',
'Russell Benaroya', 'Richard Branson', 'Mohamed El-Erian',
'James Altucher', 'Bernard Marr', 'Ian Bremmer', 'Sramana Mitra',
'Lynne Everatt', 'Justin Bariso', 'Carson Tate', 'Gary Frisch',
'James Bareham', 'Tai T.', 'Glenn Leibowitz', 'Marianne Griebler',
'Tom Goodwin', 'Katie Martell', 'Shama Hyder',
'Barry W. Enderwick', 'Steve Blakeman', 'Gillian Zoe Segal',
'Tom Foremski', 'Kiara Imani Williams, Esq.', 'Kellye Whitney',
'Simon Owens', 'Rachel Jones', 'Vikram Mansharamani',
'🏉 Pascal Bouvier', 'Geoffrey Garrett', 'Ben Casselman',
'Tamal Bandyopadhyay', 'Karen Webster', 'Jody Padar',
'Hansi Mehrotra', 'Nick Ciubotariu', 'Neil Hughes', 'Nir Eyal',
'Shelly Palmer', 'Lee Naik', 'Danielle Newnham', 'Vani Kola',
'Chris McCann', 'Andrew Yang', 'Lisa Abeyta', 'Juliet de Baubigny',
'Sarah Kauss', 'Pocket Sun', 'Chantel Soumis', 'String Nguyen',
'Quentin Michael Allums', 'AJ Wilcox', "Kevin O'Leary",
'Amy Blaschka', 'Simon Sinek'], dtype=object)
Explanation:
This code analyzes LinkedIn influencers' data by reading a CSV file containing each influencer's name,
time spent on LinkedIn, number of followers, reactions, comments, hashtags, and hashtag followers.
The code drops some columns from the DataFrame, cleans the data, and creates visualizations to
compare two influencers, Ashish Gavade and Tom Goodwin. The visualizations show the
number of followers, reactions, comments, number of hashtags, and hashtag followers against the time spent on
LinkedIn for each influencer.
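Building on the same DataFrame, the short sketch below shows one way the two influencers could be benchmarked on a simple engagement-rate metric (reactions plus comments per follower); the helper name engagement_summary is illustrative and not part of the original code.
import pandas as pd

def engagement_summary(frame, name):
    # keep the influencer's posts that have a usable follower count
    sub = frame[frame['name'] == name].dropna(subset=['followers'])
    sub = sub[sub['followers'] > 0]
    rate = (sub['reactions'] + sub['comments']) / sub['followers']
    return pd.Series({
        'posts': len(sub),
        'avg_reactions': sub['reactions'].mean(),
        'avg_comments': sub['comments'].mean(),
        'avg_engagement_rate': rate.mean(),
    })

# assumes df is the cleaned influencers DataFrame loaded above
benchmark = pd.DataFrame({
    'Ashish Gavade': engagement_summary(df, 'Ashish Gavade'),
    'Tom Goodwin': engagement_summary(df, 'Tom Goodwin'),
})
print(benchmark)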
Conclusion:
Analyzing competitor activities using social media data is crucial for businesses to improve their social
media performance, gain new insights, and identify new opportunities.
It involves identifying competitors, monitoring their social media activities, analyzing engagement
metrics, evaluating advertising campaigns, benchmarking performance, and identifying opportunities. By
performing effective competitor analysis, businesses can gain a competitive edge and achieve their social media
goals.
EXPERIMENT NO. 10.
Aim: Develop social media text analytics models for improving an existing product/service by analyzing customers'
reviews/comments.
Theory:
Developing social media text analytics models for improving existing products/services by analyzing
customer reviews/comments is an essential part of a business's social media strategy. It helps businesses
understand customer sentiment and feedback, identify areas for improvement, and take necessary actions to
improve their products/services. In this article, we will discuss the theory behind developing social media text
analytics models, the execution steps to perform this analysis, and the benefits it can offer for businesses.
Developing social media text analytics models involves collecting and analyzing customer
reviews/comments from social media platforms such as Facebook, Twitter, and LinkedIn. Text analytics
techniques such as sentiment analysis, topic modeling, and opinion mining can be used to gain insights into
customer sentiment and feedback, identify areas for improvement, and improve existing products/services.
Execution Steps:
1. Collect Customer Reviews/Comments: Collect customer reviews/comments from social media
platforms such as Facebook, Twitter, and LinkedIn using social media management tools such as
Hootsuite, Buffer, or Sprout Social.
2. Preprocess the Data: Preprocess the data by removing irrelevant information such as URLs, hashtags,
and mentions, and perform text normalization techniques such as tokenization, stemming, and
lemmatization.
3. Perform Sentiment Analysis: Perform sentiment analysis to identify the polarity of customer
reviews/comments, whether positive, negative, or neutral, using techniques such as rule-based, machine
learning, or hybrid approaches.
4. Perform Topic Modeling: Perform topic modeling to identify the topics mentioned in customer
reviews/comments using techniques such as Latent Dirichlet Allocation (LDA) or Non-Negative Matrix
Factorization (NMF).
5. Perform Opinion Mining: Perform opinion mining to identify the opinion holders and their views on
specific product/service aspects, such as quality, price, or customer service.
6. Identify Areas for Improvement: Identify the areas for improvement based on the insights gained from
the analysis, such as improving product quality, pricing strategy, or customer service.
7. Take Necessary Actions: Take necessary actions to address the identified areas for improvement, such
as revising product/service features, pricing, or customer support.
Code:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import warnings
warnings.filterwarnings("ignore")
from nltk.corpus import stopwords
import nltk
import re
# Input data files are available in the read-only "../input/" directory
import os
for dirname, _, filenames in os.walk('/input/THEASHISHGAVADE'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
import seaborn as sns
reviews_data = pd.read_csv("/input/THEASHISHGAVADE/kindle-reviews/ashish_kindle_reviews.csv")
reviews_data.shape
reviews_data.columns
reviews_data.head()
# Renaming Columns (Names)
reviews_data.rename(columns={'asin':'ProductId', 'overall':'Score', 'reviewText':'Text', 'reviewerID':'UserId', 'summary':'Summary', 'unixReviewTime':'Time'}, inplace=True)
reviews_data.head()
# Drop Unwanted Columns
reviews_data.drop(['Unnamed: 0','helpful','reviewTime','reviewerName'], axis=1, inplace=True)
reviews_data.head()
# Distribution of Reviews Per Score
category_dist = reviews_data['Score'].value_counts()
plt.figure(figsize=(10,6))
my_colors = ['g', 'r', 'b', 'm', 'y']
category_dist.plot(kind='bar', color=my_colors)
plt.grid()
plt.xlabel("Scores")
plt.ylabel("Number of Reviews Per Score")
plt.title("Distribution of Reviews Per Score")
plt.show()
reviews_data = reviews_data.loc[reviews_data['Score'] != 3]
reviews_data.shape
# give reviews with a Score > 3 a positive rating and reviews with a score < 3 a negative rating
def partition(x):
    if x < 3:
        return 'Negative'
    else:
        return 'Positive'
actualScore = reviews_data['Score']
pos_neg = actualScore.map(partition)
reviews_data['Score'] = pos_neg
reviews_data.head()
reviews_data = reviews_data.head(50000) #considering only 50k rows
category_dist = reviews_data['Score'].value_counts()
plt.figure(figsize=(10,6))
my_colors = ['g', 'r']
category_dist.plot(kind='bar', color=my_colors)
plt.grid()
plt.xlabel("Scores")
plt.ylabel("Number of Reviews")
plt.title("Distribution of Reviews")
plt.show()
reviews_data[reviews_data['UserId']=='A3SPTOKDG7WBLN']
#sorting data according to ProductId in ascending order
reviews_data = reviews_data.sort_values('ProductId', axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
#deduplication of entries
reviews_data = reviews_data.drop_duplicates(subset={"ProductId","UserId","Time","Text"}, keep='first', inplace=False)
reviews_data.shape
# Data Preprocessing
# load the stop words from the NLTK library
stop_words = set(stopwords.words('english'))
def txt_preprocessing(total_text, index, column, df):
    if type(total_text) is not int:
        string = ""
        # replace every special character with a space
        total_text = re.sub(r'[^a-zA-Z0-9\n]', ' ', total_text)
        # replace multiple spaces with a single space
        total_text = re.sub(r'\s+', ' ', total_text)
        # convert all characters to lower case
        total_text = total_text.lower()
        for word in total_text.split():
            # retain the word only if it is not a stop word
            if word not in stop_words:
                string += word + " "
        df[column][index] = string
for index, row in reviews_data.iterrows():
    if type(row['Text']) is str:
        txt_preprocessing(row['Text'], index, 'Text', reviews_data)
    else:
        print("THERE IS NO TEXT DESCRIPTION FOR ID :", index)
reviews_data.head()
#checking null values
reviews_data.isna().sum()
#removing null values(row)
reviews_data.dropna(axis=0, inplace=True)
reviews_data.isna().sum()
reviews_data.shape
reviews_data.tail()
reviews_data['Score'].value_counts()
from sklearn.utils import resample
#create two different dataframe of majority and minority class
cls_majority = reviews_data[(reviews_data['Score']=='Positive')]
cls_minority = reviews_data[(reviews_data['Score']=='Negative')]
# upsample minority class
cls_minority_upsampled = resample(cls_minority,
                                  replace=True,     # sample with replacement
                                  n_samples=44381,  # to match majority class
                                  random_state=42)  # reproducible results
# Combine majority class with upsampled minority class
upsampled_data = pd.concat([cls_minority_upsampled, cls_majority])
upsampled_data.head()
upsampled_data.shape
upsampled_data['Score'].value_counts()
# Train Test Split
from sklearn.model_selection import train_test_split
X = upsampled_data['Text']
Y = upsampled_data['Score']
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.20, stratify=Y, random_state=0)
X_train, X_cv, y_train, y_cv = train_test_split(X_train, y_train, test_size=0.20, stratify=y_train, random_state=0)
print("NUMBER OF DATA POINTS IN TRAIN DATA :", X_train.shape[0])
print("NUMBER OF DATA POINTS IN CROSS VALIDATION DATA :", X_cv.shape[0])
print("NUMBER OF DATA POINTS IN TEST DATA :", X_test.shape[0])
# TF-IDF
# perform TF-IDF vectorization of the text data
from sklearn.feature_extraction.text import TfidfVectorizer
text_vec = TfidfVectorizer(min_df=10, max_features=5000)
text_vec.fit(X_train.values)
train_text = text_vec.transform(X_train.values)
test_text = text_vec.transform(X_test.values)
cv_text = text_vec.transform(X_cv.values)
print("Shape of Matrix - TFIDF")
print(train_text.shape)
print(test_text.shape)
print(cv_text.shape)
# Confusion / Precision / Recall Matrix
# this function plots the confusion, precision and recall matrices given y_i and predicted y_i
from sklearn.metrics import confusion_matrix
import seaborn as sns
def plot_confusion_matrix(test_y, predict_y):
    C = confusion_matrix(test_y, predict_y)  # confusion matrix
    A = ((C.T)/(C.sum(axis=1))).T            # recall matrix
    B = C/C.sum(axis=0)                      # precision matrix
    labels = [0, 1]
    # represent C in heatmap format
    print("-"*40, "Confusion Matrix", "-"*40)
    plt.figure(figsize=(8,5))
    sns.heatmap(C, annot=True, cmap="YlGnBu", fmt=".3f", xticklabels=labels, yticklabels=labels)
    plt.xlabel('Predicted Class')
    plt.ylabel('Original Class')
    plt.show()
    # represent B in heatmap format
    print("-"*40, "Precision Matrix (Column Sum=1)", "-"*40)
    plt.figure(figsize=(8,5))
    sns.heatmap(B, annot=True, cmap="YlGnBu", fmt=".3f", xticklabels=labels, yticklabels=labels)
    plt.xlabel('Predicted Class')
    plt.ylabel('Original Class')
    plt.show()
    # represent A in heatmap format
    print("-"*40, "Recall Matrix (Row Sum=1)", "-"*40)
    plt.figure(figsize=(8,5))
    sns.heatmap(A, annot=True, cmap="YlGnBu", fmt=".3f", xticklabels=labels, yticklabels=labels)
    plt.xlabel('Predicted Class')
    plt.ylabel('Original Class')
    plt.show()
# Logistic Regression Model
# train a logistic regression + calibration model using text features which are tfidf encoded
from sklearn.linear_model import SGDClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
alpha = [10 ** x for x in range(-5, 1)]
cv_log_error_array = []
for i in alpha:
    clf = SGDClassifier(alpha=i, penalty='l2', loss='log', random_state=42)
    clf.fit(train_text, y_train)
    sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
    sig_clf.fit(train_text, y_train)
    predict_y = sig_clf.predict_proba(cv_text)
    cv_log_error_array.append(log_loss(y_cv, predict_y, labels=clf.classes_, eps=1e-15))
    print('For Values of Alpha =', i, "The Log Loss is:", log_loss(y_cv, predict_y, labels=clf.classes_, eps=1e-15))
fig, ax = plt.subplots()
ax.plot(alpha, cv_log_error_array, c='r')
for i, txt in enumerate(np.round(cv_log_error_array, 3)):
    ax.annotate((alpha[i], np.round(txt, 3)), (alpha[i], cv_log_error_array[i]))
plt.grid()
plt.title("Cross Validation Error for Each Alpha")
plt.xlabel("Alpha i's")
plt.ylabel("Error Measure")
plt.show()
best_alpha = np.argmin(cv_log_error_array)
clf = SGDClassifier(alpha=alpha[best_alpha], penalty='l2', loss='log', random_state=42)
clf.fit(train_text, y_train)
lr_sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
lr_sig_clf.fit(train_text, y_train)
predict_y = lr_sig_clf.predict_proba(train_text)
print('For Values of Best Alpha =', alpha[best_alpha], "The Train Log Loss is:", log_loss(y_train, predict_y, labels=clf.classes_, eps=1e-15))
predict_y = lr_sig_clf.predict_proba(test_text)
print('For Values of Best Alpha =', alpha[best_alpha], "The Test Log Loss is:", log_loss(y_test, predict_y, labels=clf.classes_, eps=1e-15))
predict_y = lr_sig_clf.predict_proba(cv_text)
print('For Values of Best Alpha =', alpha[best_alpha], "The Cross Validation Log Loss is:", log_loss(y_cv, predict_y, labels=clf.classes_, eps=1e-15))
lr_train_accuracy = (lr_sig_clf.score(train_text, y_train)*100)
lr_test_accuracy = (lr_sig_clf.score(test_text, y_test)*100)
lr_cv_accuracy = (lr_sig_clf.score(cv_text, y_cv)*100)
print("Logistic Regression Train Accuracy -",lr_train_accuracy)
print("Logistic Regression Test Accuracy -",lr_test_accuracy)
print("Logistic Regression CV Accuracy -",lr_cv_accuracy)
plot_confusion_matrix(y_cv, lr_sig_clf.predict(cv_text.toarray()))
# Predict - Test Data
test_pred = lr_sig_clf.predict(test_text)
from sklearn.metrics import classification_report
print(classification_report(y_test, test_pred))
test_pred_list = test_pred.tolist()
test_pred_list[:5]
final_test_df = pd.DataFrame({'Text':X_test, 'Review':test_pred_list})
final_test_df.head(10)
final_test_df.values[5]
Output:
For Values of Best Alpha = 1e-05 The Train Log Loss is: 0.15253560593949955
For Values of Best Alpha = 1e-05 The Test Log Loss is: 0.18438127714047778
For Values of Best Alpha = 1e-05 The Cross Validation Log Loss is: 0.1874222604802398
[The confusion, precision, and recall matrices are printed and displayed as heatmaps.]
Explanation:
The code is an implementation of sentiment analysis on Kindle product reviews. The dataset contains
information about Kindle products, including the reviews given by customers for each product. The code
preprocesses the data by removing unwanted columns, cleaning the text, handling null values, and balancing the
dataset. The balanced dataset is then split into training, validation, and testing sets.
The sentiment analysis itself uses a logistic-regression-style classifier: an SGDClassifier with log loss
and an L2 penalty, trained on TF-IDF features and calibrated with CalibratedClassifierCV, with the
regularization strength alpha chosen by cross-validation log loss.
The code uses the Pandas library for data manipulation and Matplotlib for data visualization. It also uses
the NLTK library to remove stop words and perform text cleaning. Finally, it uses Scikit-learn's train_test_split
and resample functions for dataset splitting and balancing, respectively.
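The execution steps also call for topic modeling (step 4), which the code above does not cover. A minimal sketch using scikit-learn's Latent Dirichlet Allocation on the preprocessed review text, assuming the cleaned reviews_data DataFrame from the code above, might look like this (the choice of 5 topics and 10 terms per topic is illustrative):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# bag-of-words term counts over the already-cleaned review text
count_vec = CountVectorizer(max_features=5000, stop_words='english')
doc_term = count_vec.fit_transform(reviews_data['Text'])

# fit an LDA model with an illustrative number of topics
lda = LatentDirichletAllocation(n_components=5, random_state=42)
lda.fit(doc_term)

# print the top 10 terms for each topic (get_feature_names_out needs scikit-learn >= 1.0)
terms = count_vec.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top_terms = [terms[i] for i in topic.argsort()[-10:][::-1]]
    print("Topic %d: %s" % (idx, ", ".join(top_terms)))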
Conclusion:
Developing social media text analytics models for improving existing products/services by analyzing
customer reviews/comments is crucial for businesses to improve customer satisfaction, gain new insights, and
identify areas for improvement.
It involves collecting customer reviews/comments, preprocessing the data, performing sentiment
analysis, topic modeling, and opinion mining, identifying areas for improvement, and taking necessary actions.
By performing effective text analytics, businesses can gain a competitive edge, improve customer loyalty, and
achieve their business goals.