EXPERIMENT NO. 1.

Aim: Study various
i) Social media platforms (Facebook, Twitter, YouTube, etc.)
ii) Social media analytics tools (Facebook Insights, Google Analytics, Netlytic, etc.)
iii) Social media analytics techniques and engagement metrics (page level, post level, member level)
iv) Applications of social media analytics for business.
e.g. Google Analytics
https://marketingplatform.google.com/about/analytics/
https://netlytic.org

Theory:
Social media has become an indispensable part of our lives. It has transformed the way we communicate and interact with each other. Social media platforms like Facebook, Twitter, and YouTube have millions of users who share their views, opinions, and experiences. Businesses have recognized the importance of social media in engaging with their customers and promoting their products and services. Social media analytics is the process of analyzing social media data to gain insights into customer behavior, market trends, and brand perception. In this experiment, we study various social media platforms, social media analytics tools and techniques, and their applications in business.

Social Media Platforms:
There are various social media platforms, each with its own features and characteristics. Some of the popular ones are Facebook, Twitter, Instagram, LinkedIn, and YouTube. Facebook is the most widely used social media platform, with over 2 billion monthly active users. Twitter is a microblogging platform that allows users to post short messages called tweets. Instagram is a visual platform that focuses on photos and videos. LinkedIn is a professional networking platform that connects professionals and businesses. YouTube is a video-sharing platform that allows users to upload and share videos.

Social Media Analytics Tools:
Social media analytics tools are software applications that help businesses track, measure, and analyze social media data. Some of the popular tools are Facebook Insights, Google Analytics, and Netlytic. Facebook Insights is a free tool that provides businesses with data on their Facebook page's performance. Google Analytics is a web analytics tool that tracks and reports website traffic. Netlytic is a social media analytics tool that analyzes social media data to identify trends, topics, and influencers.

Social Media Analytics Techniques and Engagement Metrics:
Social media analytics techniques and engagement metrics are used to measure social media performance. The commonly used levels of measurement are page-level metrics, post-level metrics, and member-level metrics. Page-level metrics measure the overall performance of a social media page. Post-level metrics measure the performance of individual posts. Member-level metrics measure the engagement level of individual members. A small sketch at the end of this section illustrates these three levels.

Applications of Social Media Analytics for Business:
Social media analytics has various applications in business. It can be used to measure brand awareness, customer engagement, and market trends. It can also be used to identify influencers and track competitor activity. Google Analytics is a popular tool used by businesses to track website traffic and user behavior; it provides insights into a website's performance, such as the number of visitors, bounce rate, and conversion rate. Netlytic helps businesses identify trends, topics, and influencers in social media data.
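To make the three levels of metrics concrete, the following minimal Python sketch computes one metric at each level from a handful of invented posts. The field names and the engagement formula (likes plus comments plus shares, divided by followers) are illustrative assumptions, not the definition used by any particular analytics tool.

from collections import Counter

# Hypothetical post data for one page; all numbers are made up for illustration.
posts = [
    {"post_id": 1, "likes": 120, "comments": 14, "shares": 9, "author": "alice"},
    {"post_id": 2, "likes": 45, "comments": 3, "shares": 1, "author": "bob"},
    {"post_id": 3, "likes": 230, "comments": 40, "shares": 22, "author": "alice"},
]
page_followers = 5000  # assumed page-level follower count

# Post-level metric: engagement per post.
for p in posts:
    p["engagement"] = p["likes"] + p["comments"] + p["shares"]
    print(f"Post {p['post_id']} engagement: {p['engagement']}")

# Page-level metric: total engagement across posts, normalized by followers.
total_engagement = sum(p["engagement"] for p in posts)
page_engagement_rate = 100.0 * total_engagement / page_followers
print(f"Page engagement rate: {page_engagement_rate:.2f}%")

# Member-level metric: engagement attributed to each contributing member.
by_member = Counter()
for p in posts:
    by_member[p["author"]] += p["engagement"]
print(dict(by_member))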
Conclusion:
Social media analytics has become an essential tool for businesses to measure their social media performance and gain insights into customer behavior, market trends, and brand perception. Social media platforms like Facebook, Twitter, and YouTube have millions of users, and businesses can leverage this audience to promote their products and services. Social media analytics tools like Facebook Insights, Google Analytics, and Netlytic provide businesses with data and insights to make informed decisions. Social media analytics techniques and engagement metrics can help businesses measure their social media performance and optimize their social media strategy.

EXPERIMENT NO. 2.

Aim: Data Collection: Select the social media platforms of your choice (Twitter, Facebook, LinkedIn, YouTube, web blogs, etc.), connect to them, and capture social media data for business (scraping, crawling, parsing).

Introduction:
Social media has become a crucial source of data for businesses to understand their customers and their preferences. With millions of users sharing their thoughts, opinions, and experiences on platforms like Twitter, Facebook, LinkedIn, YouTube, and web blogs, businesses can gain valuable insights into customer behavior, market trends, and brand perception. In this experiment, we discuss data collection methods for social media platforms, including scraping, crawling, and parsing.

Data Collection Methods:
There are various data collection methods for social media platforms, and each method has its own advantages and disadvantages. The most common methods are scraping, crawling, and parsing.

1. Scraping: Scraping is the process of extracting data from websites or social media platforms. It is an effective way to collect data from platforms like Twitter, Facebook, and LinkedIn. The data collected through scraping can be used for purposes such as sentiment analysis, trend analysis, and competitor analysis.
2. Crawling: Crawling is the process of discovering and indexing web pages, as search engines do. It involves using automated tools or software to scan websites and social media platforms for new content. Crawling is an effective way to collect data from web blogs and news websites, and the collected data can be used for content analysis, trend analysis, and competitor analysis.
3. Parsing: Parsing is the process of extracting data from structured or unstructured data sources. It involves using automated tools or software to analyze data and extract the relevant information. Parsing is an effective way to process data collected from platforms like Twitter and Facebook, and the results can be used for sentiment analysis, trend analysis, and competitor analysis. (A minimal scraping-and-parsing sketch appears after this experiment's code explanation.)

Execution Steps:
1. Identify the social media platform(s) of interest: Choose the platform(s) that you want to collect data from, based on your business needs and objectives.
2. Set up a data collection tool: Choose a data collection tool that suits your needs and set it up. Various tools are available, such as Octoparse, ParseHub, and Beautiful Soup.
3. Define the data fields: Define the data fields that you want to collect from the platform. For example, if you are collecting data from Twitter, you may want to collect tweets, hashtags, user profiles, and locations.
4. Configure the data collection tool: Configure the tool to collect the data fields that you have defined.
5. Run the data collection tool: Run the tool to collect the data from the social media platform.
6. Store and analyze the data: Store the data in a database or spreadsheet and analyze it using data analysis tools like Excel or Python.

Code:

# Import required libraries
import pandas as pd
import requests
from textblob import TextBlob

# Set YouTube video ID, maximum number of comments to retrieve, and API key
video_id = "Q33TkQKlIMg"
max_result = 50
api_key = "YOUR_API_KEY"  # replace with your own YouTube Data API v3 key

# Retrieve video information
video_info_url = (
    f"https://www.googleapis.com/youtube/v3/videos"
    f"?part=id%2Csnippet&id={video_id}&key={api_key}"
)
video_info_response = requests.get(video_info_url)
video_info_data = video_info_response.json()

# Retrieve video components (comments)
comments_url = (
    f"https://www.googleapis.com/youtube/v3/commentThreads"
    f"?key={api_key}&videoId={video_id}&part=snippet&maxResults={max_result}"
)
comments_response = requests.get(comments_url)
comments_data = comments_response.json()

# Create a pandas DataFrame from the comments data
df = pd.DataFrame(comments_data['items'])

# Extract the text of each top-level comment from the DataFrame
comments = []
for i in range(len(df)):
    txt = df['snippet'][i]['topLevelComment']['snippet']['textOriginal']
    comments.append(txt)
print(comments)

# Define a function to perform sentiment analysis on a given comment
def get_comment_sentiment(comment):
    analysis = TextBlob(comment)
    if analysis.sentiment.polarity > 0:
        return "Positive"
    elif analysis.sentiment.polarity == 0:
        return "neutral"
    else:
        return "negative"

# Perform sentiment analysis on all comments and create a new DataFrame
comment_list = []
sentiment_list = []
for comment in comments:
    sentiment = get_comment_sentiment(comment)
    comment_list.append(comment)
    sentiment_list.append(sentiment)
    print(f"{comment} : {sentiment}")

sentiment_df = pd.DataFrame({"Comments": comment_list, "Sentiment": sentiment_list})
sentiment_df.head()

# Save the DataFrame to a CSV file
sentiment_df.to_csv("Ashish_YouTube_Comments_Sentiment.csv")

Output:
golden voice morgan freeman : Positive
It's shows only our earth not Universe : neutral
"Little minds are tamed and subdued by misfortune; but great minds rise above it." -Washington Irving : Positive
"I find that when you have a real interest in life and a curious life, that sleep is not the most important thing." --Martha Stewart : Positive
But they can't beat the @melodysheep : neutral
Hore..hore film ultramen!! : neutral
I mean.. life is strange : negative
So amazing 😢 : Positive
N for nothing original : Positive
His narration is institutionalized. ❤❤❤❤ : neutral
Morgan Freeman u beauty ❤ : neutral
All most of pieces come from France television watch a doc called "aux Frontier de l'univers "I bet you will like it but also Morgan he ' s one of the best : Positive
Absolute poppycock! It makes me love God all the more! Trying to explain the universe while claiming God doesn’t exist….. Let’s just say that in heaven, they watch shows like this on comedy night. : Positive
No Money, No Honey. Blow up the damn universe. Who cares. : neutral
Ashish attenborough would’ve ate this up : neutral
And humans are biggest mistake by nature 😂😂😂😂 : neutral
I wish this could be not only on netflix and on other websites in the internet to watch it or on tv in Poland where I live. I love Mr Freeman voice and science shows I watched with him : Positive
I got goosebumps watching this trailer. : neutral
Half of trailer copy from national geographic channel😅 : negative
its not a documentry. its an exploration of theoretical events. : neutral
damn, the big bang arc hit hard. when theia-chan and earth-kun finally joined together i was crying : negative
This proved to me that we have a CREATOR, this world was not made by chance, it was intentional and perfect. : Positive
God is the creator, the universe things you put in here don't make sense, had to skip the parts : neutral
Kudos to tech , We can see back the full history : Positive
This is not the real story..... : Positive
Seeing ur work gettin some recognition bring tears : neutral
watch The throne of Allah. : neutral
Honestly was praying NOT to hear Morgan Freeman : Positive
Is this the history of our universe or it's the history that you made up? : neutral
The earth is flat : negative
And also don't forget world's elite have an agenda to believe lies. GOD created everything keep in mind, and this universe GOD laws still exist. Amen : neutral
Gurl y’all got God to narrate this lmaoo : neutral
BIDEN R U WATCHING THIS ? : neutral
Надеюсь нам не будут напоминать что планеты бесполые... : neutral
Watching the creation of our Lord God Almighty. He created all things and through the Lord Jesus were made. The world was made in 6 days 💯❤✝️ : neutral
Looks so good : Positive
The voice of ''MORGAN FREEMAN'' is not just a voice it's an emotion that connects us with him and also connects us with the scenes. ❤❤❤ : neutral
Wow : Positive
13.8 billion years, says who? Allah states that He created the universe, the heavens and earth and all that is in, on and between them in six days, as He has stated in several Ayat in the Qur'an. : neutral
I get nostalgia of animal planet : neutral
Why it feels more like some discovery stuff then a Netflix show😅 : Positive
This is explosive ❤❤ : neutral
Ashish.... epic voice : Positive
This line 👉We are connected to the start ❤ Yes actually we are the start and last i think, what do you think? Different animals have different mind but anyone want to live with their mothers(🌎), but we Are humans we are Destroyed trees, killing animals, increasing population, waste water , waste food, polute air,😔😓 etc and want to live long ..... How funny that you guys 😅 : Positive
Proud of Indian 🚩🇮🇳 : Positive
That's why I subscribe Netflix : neutral
Morgan Freeman is GOD : neutral
me suena a una imitación de cosmos espero que este buena : neutral

Explanation:
This code retrieves comments from a YouTube video and performs sentiment analysis on them using the TextBlob library in Python. The code imports the pandas, requests, and TextBlob libraries, and sets the video ID, the maximum number of comments to retrieve, and the API key for accessing the YouTube API. It then sends one API request to retrieve information about the video and another to retrieve the comments for the video, using the video ID and API key. The retrieved data is converted into a pandas DataFrame for further analysis. The code extracts the comments from the DataFrame and performs sentiment analysis on each comment using TextBlob. Finally, it creates a new pandas DataFrame with the comments and their corresponding sentiment and saves the data to a CSV file named "Ashish_YouTube_Comments_Sentiment.csv".
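The YouTube Data API used above returns structured JSON, so no HTML parsing is needed. For sources without an API, such as the web blogs mentioned under the data collection methods, scraping and parsing can be combined; the sketch below uses requests and Beautiful Soup. The URL and the article/h2/a page structure are assumptions about a hypothetical page, not a recipe for any real site.

import requests
from bs4 import BeautifulSoup

# Hypothetical blog URL used purely for illustration.
url = "https://example.com/blog"
response = requests.get(url, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

# Parse out post titles and links; the tag and heading names are assumptions
# about the page structure and must be adapted to the real site.
records = []
for article in soup.find_all("article"):
    title_tag = article.find("h2")
    link_tag = article.find("a")
    if title_tag and link_tag:
        records.append({"title": title_tag.get_text(strip=True),
                        "url": link_tag.get("href")})
print(records)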
Conclusion:
Data collection from social media platforms is essential for businesses to gain insights into customer behavior, market trends, and brand perception. There are various data collection methods available, including scraping, crawling, and parsing. Each method has its own advantages and disadvantages, and businesses should choose the method that suits their needs and objectives. The execution steps for data collection involve identifying the social media platform(s) of interest, setting up a data collection tool, defining the data fields, configuring the tool, running it, and storing and analyzing the data.

EXPERIMENT NO. 3.

Aim: Data Cleaning and Storage: Preprocess, filter, and store social media data for business (using Python, MongoDB, R, etc.).

Theory:
After collecting social media data for businesses, the next step is data cleaning and storage. The data collected from social media platforms is often unstructured and noisy, making it difficult to analyze. In this experiment, we discuss the theory behind data cleaning and storage and the execution steps to preprocess, filter, and store social media data for businesses.

Data cleaning is the process of identifying and correcting errors and inconsistencies in data. Social media data is often noisy, containing irrelevant or inaccurate information such as misspellings, abbreviations, and emoticons. Data cleaning involves removing such information to make the data more accurate and reliable. Data storage involves storing the cleaned data in a database or other storage system. Social media data is often large and unstructured, making it difficult to store in traditional relational databases. Therefore, businesses can use NoSQL databases, such as MongoDB or Apache Cassandra, to store social media data.

Execution Steps:
1. Preprocess the data: Preprocess the data by removing irrelevant or inaccurate information, such as misspellings, abbreviations, and emoticons. Businesses can use text processing tools, such as TextBlob or NLTK, to preprocess social media data.
2. Filter the data: Filter the data by selecting only the relevant data fields, such as tweets, hashtags, user profiles, and location. Businesses can use filtering tools, such as pandas or dplyr, to filter social media data.
3. Store the data: Store the cleaned and filtered data in a database or other storage system. Businesses can use NoSQL databases, such as MongoDB or Apache Cassandra, to store social media data (see the sketch after these steps).
4. Index the data: Index the data to make it searchable and retrievable. Businesses can use indexing tools, such as Elasticsearch or Solr, to index social media data.
5. Analyze the data: Analyze the data using data analysis tools, such as Excel, Python, or R, to gain insights into customer behavior, market trends, and brand perception.
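The code below concentrates on cleaning and exploratory analysis. To illustrate the storage step (step 3), here is a minimal sketch that inserts cleaned records into MongoDB using pymongo; the connection URI, database and collection names, and the sample document are illustrative assumptions.

from pymongo import MongoClient

# Assumed local MongoDB instance; adjust the URI for your deployment.
client = MongoClient("mongodb://localhost:27017/")
db = client["social_media"]        # hypothetical database name
collection = db["cleaned_tweets"]  # hypothetical collection name

# A cleaned record as produced by a preprocessing step (illustrative fields).
doc = {
    "user_id": 42,
    "tweet": "great service from the support team",
    "tokens": ["great", "service", "support", "team"],
    "adr_label": 0,
}
result = collection.insert_one(doc)
print("Inserted id:", result.inserted_id)

# Simple filtered retrieval: the first few documents with a given label.
for d in collection.find({"adr_label": 0}).limit(5):
    print(d["tweet"])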
Code:

# Import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
import nltk
import string
import re
%matplotlib inline
pd.set_option('display.max_colwidth', 100)

# Load dataset
def load_data():
    data = pd.read_csv('../input/Ashish_Data.csv')
    return data

tweet_df = load_data()
tweet_df.head()
print('Dataset size:', tweet_df.shape)
print('Columns are:', tweet_df.columns)
tweet_df.info()
sns.countplot(x='ADR_label', data=tweet_df)

# Exploratory Data Analysis
# Wordcloud visualization
df = pd.DataFrame(tweet_df[['UserId', 'Tweet']])
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

# Split the tweets by class and join them into single strings:
df_ADR = tweet_df[tweet_df['ADR_label'] == 1]
df_NADR = tweet_df[tweet_df['ADR_label'] == 0]
tweet_All = " ".join(review for review in df.Tweet)
tweet_ADR = " ".join(review for review in df_ADR.Tweet)
tweet_NADR = " ".join(review for review in df_NADR.Tweet)

fig, ax = plt.subplots(3, 1, figsize=(30, 30))
# Create and generate a word cloud image for each group:
wordcloud_ALL = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(tweet_All)
wordcloud_ADR = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(tweet_ADR)
wordcloud_NADR = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(tweet_NADR)

# Display the generated images:
ax[0].imshow(wordcloud_ALL, interpolation='bilinear')
ax[0].set_title('All Tweets', fontsize=30)
ax[0].axis('off')
ax[1].imshow(wordcloud_ADR, interpolation='bilinear')
ax[1].set_title('Tweets under ADR Class', fontsize=30)
ax[1].axis('off')
ax[2].imshow(wordcloud_NADR, interpolation='bilinear')
ax[2].set_title('Tweets under None-ADR Class', fontsize=30)
ax[2].axis('off')
# wordcloud.to_file("img/first_review.png")

string.punctuation

# Remove punctuation and digits
def remove_punct(text):
    text = "".join([char for char in text if char not in string.punctuation])
    text = re.sub('[0-9]+', '', text)
    return text

df['Tweet_punct'] = df['Tweet'].apply(lambda x: remove_punct(x))
df.head(10)

# Tokenization
def tokenization(text):
    text = re.split(r'\W+', text)
    return text

df['Tweet_tokenized'] = df['Tweet_punct'].apply(lambda x: tokenization(x.lower()))
df.head()

# Remove stopwords
stopword = nltk.corpus.stopwords.words('english')
# stopword.extend(['yr', 'year', 'woman', 'man', 'girl', 'boy', 'one', 'two', 'sixteen', 'yearold', 'fu',
#                  'weeks', 'week', 'treatment', 'associated', 'patients', 'may', 'day', 'case', 'old'])

def remove_stopwords(text):
    text = [word for word in text if word not in stopword]
    return text

df['Tweet_nonstop'] = df['Tweet_tokenized'].apply(lambda x: remove_stopwords(x))
df.head(10)

# Stemming and Lemmatization
# e.g. developed, development
ps = nltk.PorterStemmer()

def stemming(text):
    text = [ps.stem(word) for word in text]
    return text

df['Tweet_stemmed'] = df['Tweet_nonstop'].apply(lambda x: stemming(x))
df.head()

wn = nltk.WordNetLemmatizer()

def lemmatizer(text):
    text = [wn.lemmatize(word) for word in text]
    return text

df['Tweet_lemmatized'] = df['Tweet_nonstop'].apply(lambda x: lemmatizer(x))
df.head()

# Combined cleaning function used as the vectorizer's analyzer
def clean_text(text):
    text_lc = "".join([word.lower() for word in text if word not in string.punctuation])  # remove punctuation
    text_rc = re.sub('[0-9]+', '', text_lc)
    tokens = re.split(r'\W+', text_rc)  # tokenization
    text = [ps.stem(word) for word in tokens if word not in stopword]  # remove stopwords and apply stemming
    return text

# Vectorization
countVectorizer = CountVectorizer(analyzer=clean_text)
countVector = countVectorizer.fit_transform(df['Tweet'])
print('{} Number of tweets has {} words'.format(countVector.shape[0], countVector.shape[1]))
# print(countVectorizer.get_feature_names())
count_vect_df = pd.DataFrame(countVector.toarray(), columns=countVectorizer.get_feature_names())
# count_vect_df.head()

# Feature Creation
ADR_tweet_1 = tweet_df[tweet_df['ADR_label'] == 1]['Tweet'].apply(len)  # tweet length in characters
ADR_tweet_0 = tweet_df[tweet_df['ADR_label'] == 0]['Tweet'].apply(len)
bins_ = np.linspace(0, 450, 70)
plt.hist(ADR_tweet_1, bins=bins_, density=True, alpha=0.5, label='ADR')
plt.hist(ADR_tweet_0, bins=bins_, density=True, alpha=0.1, label='None_ADR')
plt.legend()

Output:
23516 Number of tweets has 14323 words

Explanation:
This code is written in Python and is used for Exploratory Data Analysis (EDA) on a dataset containing tweets related to drugs. Here is a summary of the code: Libraries such as pandas, numpy, seaborn, matplotlib, and nltk are imported. The dataset is loaded using the load_data function, and the size and columns of the dataset are printed. A countplot shows the number of tweets in each ADR class (0 or 1). A word cloud visualizes the most frequent words in all tweets, tweets under the ADR class, and tweets under the None-ADR class. The tweets are preprocessed by removing punctuation, tokenizing the words, removing stopwords, and stemming and lemmatizing the words. The preprocessed tweets are vectorized using the CountVectorizer class. The length of tweets in both ADR classes is plotted using histograms to show the frequency distribution of tweet lengths. Overall, the code performs EDA on the tweet dataset to gain insights and prepare the data for machine learning models that can predict the ADR class of a tweet from its contents.

Conclusion:
Data cleaning and storage are essential steps in the social media data analysis process. The data collected from social media platforms is often unstructured and noisy, making it difficult to analyze. Therefore, businesses need to preprocess, filter, and store social media data to make it more accurate and reliable. They can use text processing tools, such as TextBlob or NLTK, and filtering tools, such as pandas or dplyr, to preprocess and filter social media data. Businesses can also use NoSQL databases, such as MongoDB or Apache Cassandra, to store social media data. Finally, businesses can use data analysis tools, such as Excel, Python, or R, to gain insights into customer behavior, market trends, and brand perception.

EXPERIMENT NO. 4.

Aim: Exploratory Data Analysis and visualization of social media data for business.

Theory:
Exploratory Data Analysis (EDA) is an approach to analyzing data using visual techniques. It is used to discover trends and patterns, or to check assumptions, with the help of statistical summaries and graphical representations. An EDA is a thorough examination meant to uncover the underlying structure of a data set, and it is important for a company because it exposes trends, patterns, and relationships that are not readily apparent. The four types of EDA are:
1. Univariate non-graphical
2. Multivariate non-graphical
3. Univariate graphical
4. Multivariate graphical

Techniques and Tools:
There are a number of tools that are useful for EDA, but EDA is characterized more by the attitude taken than by particular techniques.
Typical graphical techniques used in EDA are:
- Box plot
- Histogram
- Multi-vari chart
- Run chart
- Pareto chart
- Scatter plot (2D/3D)
- Stem-and-leaf plot
- Parallel coordinates
- Odds ratio
- Heat map
- Bar chart
- Horizon graph

Dimensionality reduction techniques include:
- Multidimensional scaling
- Principal component analysis (PCA) (illustrated in the sketch after the code below)
- Multilinear PCA
- Iconography of correlations

Code:

# Import necessary libraries
import pandas as pd              # For data manipulation and analysis
import numpy as np               # For numerical computing
import matplotlib.pyplot as plt  # For data visualization
import seaborn as sns            # For advanced data visualization

# Load the social media data into a pandas DataFrame
social_media_data = pd.read_csv('ashish_social_media_data.csv')

# Explore the data using summary statistics
print(social_media_data.describe())

# Visualize the data using histograms and box plots
plt.figure(figsize=(10, 6))
sns.histplot(social_media_data['likes'], kde=True)  # Histogram of the 'likes' column
plt.title('Distribution of Likes')
plt.xlabel('Number of Likes')
plt.ylabel('Frequency')
plt.show()

plt.figure(figsize=(10, 6))
sns.boxplot(x='platform', y='followers', data=social_media_data)  # Box plot of 'followers' by 'platform'
plt.title('Followers by Platform')
plt.xlabel('Platform')
plt.ylabel('Number of Followers')
plt.show()

# Identify correlations between variables
corr_matrix = social_media_data.corr(numeric_only=True)  # Correlation matrix of the numeric columns
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')    # Heatmap of the correlation matrix
plt.title('Correlation Matrix')
plt.show()

# Visualize the data using scatter plots
plt.figure(figsize=(10, 6))
sns.scatterplot(x='comments', y='engagement_rate', data=social_media_data)  # 'comments' vs 'engagement_rate'
plt.title('Comments vs Engagement Rate')
plt.xlabel('Number of Comments')
plt.ylabel('Engagement Rate')
plt.show()

# Identify outliers using box plots
plt.figure(figsize=(10, 6))
sns.boxplot(x='platform', y='likes', data=social_media_data)  # Box plot of 'likes' by 'platform'
plt.title('Likes by Platform')
plt.xlabel('Platform')
plt.ylabel('Number of Likes')
plt.show()

# Create a pairplot to visualize the relationships between all variables
sns.pairplot(data=social_media_data)
plt.show()

Output:

                  count          mean          std          min        25%           50%           75%           max
followers     10.000000  13200.000000  9199.033766  2000.000000  6250.000000  10000.000000  18750.000000  30000.000000
likes         10.000000    700.000000   637.268476    50.000000   262.500000    500.000000    950.000000   2000.000000
comments      10.000000     46.500000    35.043624     5.000000    21.250000     40.000000     68.750000    100.000000
engagement_rate  10.0000      4.4500       1.9352       2.0000      2.6250        4.5000        5.8250        7.5000

Explanation:
This code performs exploratory data analysis on a social media dataset. It first imports the necessary libraries, loads the dataset into a pandas DataFrame, and then uses summary statistics and visualization techniques such as histograms, box plots, scatter plots, and a correlation matrix to explore the dataset. The code also identifies outliers using box plots and creates a pairplot to visualize the relationships between all variables in the dataset. Overall, the code helps to gain insights into and better understand the social media data.
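The technique list above names principal component analysis, which the EDA code does not demonstrate. Below is a minimal, self-contained sketch of PCA on the same four numeric columns; the sample values are invented and merely mimic the dataset summarized above.

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Small invented sample with the same numeric columns as the dataset above.
social_media_sample = pd.DataFrame({
    "followers":       [2000, 6000, 9000, 15000, 30000, 12000],
    "likes":           [50, 300, 500, 900, 2000, 700],
    "comments":        [5, 20, 40, 70, 100, 45],
    "engagement_rate": [2.0, 2.6, 4.5, 5.8, 7.5, 4.4],
})

# Standardize the features, then project them onto two principal components.
X_scaled = StandardScaler().fit_transform(social_media_sample)
pca = PCA(n_components=2)
components = pca.fit_transform(X_scaled)

print("Explained variance ratio:", pca.explained_variance_ratio_)
print(pd.DataFrame(components, columns=["PC1", "PC2"]).head())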
Conclusion:
Exploratory Data Analysis and visualization of social media data are important steps in the data analysis process. EDA helps businesses to understand the patterns and trends in social media data and to identify relationships between variables. Visualization is an important part of EDA because it helps businesses to communicate insights and findings effectively. Businesses can use Python, R, or other data analysis tools to perform EDA and create visualizations of social media data, interpret the findings, and communicate the insights to the relevant stakeholders.

EXPERIMENT NO. 5.

Aim: Develop a content-based (text, emoticons, image, audio, video) social media analytics model for business (e.g. content-based analysis: topic, issue, trend, sentiment/opinion analysis; audio, video, and image analytics).

Theory:
Developing a content-based social media analytics model for businesses is a crucial step in understanding their audience and improving their social media presence. In this experiment, we discuss the theory behind content-based analysis, the execution steps to develop the model, and the benefits it can offer for businesses. Content-based social media analytics is the process of analyzing the content shared on social media platforms by a business or its customers. It helps businesses to understand their audience, identify emerging trends, and improve their social media strategy. Content-based analysis can be applied to text, emoticons, image, audio, and video data.

Topic Analysis: This is the process of identifying the topics and themes in social media content. It can help businesses to understand the interests of their audience and create content that resonates with them (a small topic-analysis sketch follows the execution steps below).

Issue Analysis: This is the process of identifying the issues and concerns of the audience. It can help businesses to address those concerns and improve their social media presence.

Trend Analysis: This is the process of identifying emerging trends in social media content. It can help businesses to stay up to date with the latest trends and adapt their social media strategy accordingly.

Sentiment Analysis: This is the process of identifying the sentiment or opinion expressed in social media content. It can help businesses to understand the sentiment of their audience towards their brand, products, or services.

Audio, Video, and Image Analytics: This is the process of analyzing audio, video, and image content shared on social media platforms. It can help businesses to identify the type of content that resonates with their audience and create similar content to improve engagement.

Execution Steps:
1. Data Collection: Collect data from social media platforms, including text, emoticons, images, audio, and video content.
2. Preprocessing: Clean and preprocess the data by removing stop words, converting text to lowercase, and removing special characters.
3. Feature Extraction: Extract relevant features from the data, such as topics, issues, trends, sentiment, and image, audio, and video features.
4. Model Development: Develop a machine learning model to analyze the extracted features and provide insights about the content shared on social media platforms.
5. Model Evaluation: Evaluate the performance of the model using metrics such as accuracy, precision, and recall.
6. Implementation: Implement the model to analyze the content shared on social media platforms and provide insights to improve the social media strategy.
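The full listing that follows implements the sentiment-analysis branch of this model. As a complement, here is a minimal topic-analysis sketch using scikit-learn's LatentDirichletAllocation on a few invented posts; the texts and the choice of two topics are assumptions for demonstration only.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# A few toy posts standing in for collected social media content.
posts = [
    "new phone camera is amazing, love the photos",
    "battery life on this phone is terrible",
    "great photos and video quality from the new camera",
    "support team resolved my billing issue quickly",
    "billing problem again, waiting on customer support",
]

# Vectorize the text into term counts (execution steps 2-3).
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(posts)

# Fit a two-topic LDA model and print the top words per topic (step 4).
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)
terms = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-5:][::-1]]
    print(f"Topic {idx}: {', '.join(top)}")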
Code:

# DataFrame
import pandas as pd
# Matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
# Scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.manifold import TSNE
from sklearn.feature_extraction.text import TfidfVectorizer
# Keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Activation, Dense, Dropout, Embedding, Flatten, Conv1D, MaxPooling1D, LSTM
from keras import utils
from keras.callbacks import ReduceLROnPlateau, EarlyStopping
# nltk
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
# Word2vec
import gensim
# Utility
import re
import numpy as np
import os
from collections import Counter
import logging
import time
import pickle
import itertools

# Set up logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
nltk.download('stopwords')

# Settings
# DATASET
DATASET_COLUMNS = ["target", "ids", "date", "flag", "user", "text"]
DATASET_ENCODING = "ISO-8859-1"
TRAIN_SIZE = 0.8
# TEXT CLEANING
TEXT_CLEANING_RE = r"@\S+|https?:\S+|http?:\S|[^A-Za-z0-9]+"
# WORD2VEC
W2V_SIZE = 300
W2V_WINDOW = 7
W2V_EPOCH = 32
W2V_MIN_COUNT = 10
# KERAS
SEQUENCE_LENGTH = 300
EPOCHS = 8
BATCH_SIZE = 1024
# SENTIMENT
POSITIVE = "POSITIVE"
NEGATIVE = "NEGATIVE"
NEUTRAL = "NEUTRAL"
SENTIMENT_THRESHOLDS = (0.4, 0.7)
# EXPORT
KERAS_MODEL = "model.h5"
WORD2VEC_MODEL = "model.w2v"
TOKENIZER_MODEL = "tokenizer.pkl"
ENCODER_MODEL = "encoder.pkl"

dataset_filename = os.listdir("../input")[0]
dataset_path = os.path.join("..", "input", dataset_filename)
print("Open file:", dataset_path)
df = pd.read_csv(dataset_path, encoding=DATASET_ENCODING, names=DATASET_COLUMNS)
print("Dataset size:", len(df))
# df.head(5)

# Map target label to string:
# 0 -> NEGATIVE, 2 -> NEUTRAL, 4 -> POSITIVE
decode_map = {0: "NEGATIVE", 2: "NEUTRAL", 4: "POSITIVE"}
def decode_sentiment(label):
    return decode_map[int(label)]

%%time
df.target = df.target.apply(lambda x: decode_sentiment(x))

target_cnt = Counter(df.target)
plt.figure(figsize=(16, 8))
plt.bar(target_cnt.keys(), target_cnt.values())
plt.title("Dataset labels distribution")

# Pre-process dataset
stop_words = stopwords.words("english")
stemmer = SnowballStemmer("english")

def preprocess(text, stem=False):
    # Remove links, user mentions and special characters
    text = re.sub(TEXT_CLEANING_RE, ' ', str(text).lower()).strip()
    tokens = []
    for token in text.split():
        if token not in stop_words:
            if stem:
                tokens.append(stemmer.stem(token))
            else:
                tokens.append(token)
    return " ".join(tokens)

%%time
df.text = df.text.apply(lambda x: preprocess(x))

df_train, df_test = train_test_split(df, test_size=1-TRAIN_SIZE, random_state=42)
print("TRAIN size:", len(df_train))
print("TEST size:", len(df_test))

# Word2Vec
%%time
documents = [_text.split() for _text in df_train.text]
w2v_model = gensim.models.word2vec.Word2Vec(size=W2V_SIZE,
                                            window=W2V_WINDOW,
                                            min_count=W2V_MIN_COUNT,
                                            workers=8)
w2v_model.build_vocab(documents)
words = w2v_model.wv.vocab.keys()
vocab_size = len(words)
print("Vocab size", vocab_size)

%%time
w2v_model.train(documents, total_examples=len(documents), epochs=W2V_EPOCH)
w2v_model.most_similar("love")

%%time
tokenizer = Tokenizer()
tokenizer.fit_on_texts(df_train.text)
vocab_size = len(tokenizer.word_index) + 1
print("Total words", vocab_size)

%%time
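# Note (added for clarity): pad_sequences below converts each tweet's list of
# token ids to a fixed-length vector of SEQUENCE_LENGTH entries, left-padding
# with zeros and truncating longer tweets, so e.g. [12, 7, 45] becomes
# [0, 0, ..., 12, 7, 45]. This gives every training example the same shape,
# which the Embedding and LSTM layers require.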
x_train = pad_sequences(tokenizer.texts_to_sequences(df_train.text), maxlen=SEQUENCE_LENGTH)
x_test = pad_sequences(tokenizer.texts_to_sequences(df_test.text), maxlen=SEQUENCE_LENGTH)

# Label Encoder
labels = df_train.target.unique().tolist()
labels.append(NEUTRAL)
labels
encoder = LabelEncoder()
encoder.fit(df_train.target.tolist())
y_train = encoder.transform(df_train.target.tolist())
y_test = encoder.transform(df_test.target.tolist())
y_train = y_train.reshape(-1, 1)
y_test = y_test.reshape(-1, 1)
print("y_train", y_train.shape)
print("y_test", y_test.shape)
print("x_train", x_train.shape)
print("y_train", y_train.shape)
print()
print("x_test", x_test.shape)
print("y_test", y_test.shape)
y_train[:10]

# Embedding layer
embedding_matrix = np.zeros((vocab_size, W2V_SIZE))
for word, i in tokenizer.word_index.items():
    if word in w2v_model.wv:
        embedding_matrix[i] = w2v_model.wv[word]
print(embedding_matrix.shape)
embedding_layer = Embedding(vocab_size, W2V_SIZE, weights=[embedding_matrix],
                            input_length=SEQUENCE_LENGTH, trainable=False)

# Build model
model = Sequential()
model.add(embedding_layer)
model.add(Dropout(0.5))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
model.summary()

# Compile model
model.compile(loss='binary_crossentropy', optimizer="adam", metrics=['accuracy'])

# Callbacks
callbacks = [ReduceLROnPlateau(monitor='val_loss', patience=5, cooldown=0),
             EarlyStopping(monitor='val_acc', min_delta=1e-4, patience=5)]

# Train
%%time
history = model.fit(x_train, y_train,
                    batch_size=BATCH_SIZE,
                    epochs=EPOCHS,
                    validation_split=0.1,
                    verbose=1,
                    callbacks=callbacks)

# Evaluate
%%time
score = model.evaluate(x_test, y_test, batch_size=BATCH_SIZE)
print()
print("ACCURACY:", score[1])
print("LOSS:", score[0])

acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(len(acc))
plt.plot(epochs, acc, 'b', label='Training acc')
plt.plot(epochs, val_acc, 'r', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()
plt.figure()
plt.plot(epochs, loss, 'b', label='Training loss')
plt.plot(epochs, val_loss, 'r', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()

# Predict
def decode_sentiment(score, include_neutral=True):
    if include_neutral:
        label = NEUTRAL
        if score <= SENTIMENT_THRESHOLDS[0]:
            label = NEGATIVE
        elif score >= SENTIMENT_THRESHOLDS[1]:
            label = POSITIVE
        return label
    else:
        return NEGATIVE if score < 0.5 else POSITIVE

def predict(text, include_neutral=True):
    start_at = time.time()
    # Tokenize the text
    x_test = pad_sequences(tokenizer.texts_to_sequences([text]), maxlen=SEQUENCE_LENGTH)
    # Predict
    score = model.predict([x_test])[0]
    # Decode sentiment
    label = decode_sentiment(score, include_neutral=include_neutral)
    return {"label": label, "score": float(score),
            "elapsed_time": time.time() - start_at}

predict("I love the music")
predict("I hate the rain")
predict("i don't know what i'm doing")

# Confusion Matrix
%%time
y_pred_1d = []
y_test_1d = list(df_test.target)
scores = model.predict(x_test, verbose=1, batch_size=8000)
y_pred_1d = [decode_sentiment(score, include_neutral=False) for score in scores]

def plot_confusion_matrix(cm, classes, title='Confusion matrix', cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
""" cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis] plt.imshow(cm, interpolation='nearest', cmap=cmap) plt.title(title, fontsize=30) plt.colorbar() tick_marks = np.arange(len(classes)) plt.xticks(tick_marks, classes, rotation=90, fontsize=22) plt.yticks(tick_marks, classes, fontsize=22) fmt = '.2f' thresh = cm.max() / 2. for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])): plt.text(j, i, format(cm[i, j], fmt), horizontalalignment="center", color="white" if cm[i, j] > thresh else "black") plt.ylabel('True label', fontsize=25) plt.xlabel('Predicted label', fontsize=25) %%time cnf_matrix = confusion_matrix(y_test_1d, y_pred_1d) plt.figure(figsize=(12,12)) plot_confusion_matrix(cnf_matrix, classes=df_train.target.unique(), title="Confusion matrix") plt.show() # Classification Report print(classification_report(y_test_1d, y_pred_1d)) # Accuracy Score accuracy_score(y_test_1d, y_pred_1d) # Save model model.save(KERAS_MODEL) w2v_model.save(WORD2VEC_MODEL) pickle.dump(tokenizer, open(TOKENIZER_MODEL, "wb"), protocol=0) pickle.dump(encoder, open(ENCODER_MODEL, "wb"), protocol=0) Output: Layer (type) Output Shape Param # ================================================================= embedding_1 (Embedding) (None, 300, 300) 87125700 _________________________________________________________________ dropout_1 (Dropout) (None, 300, 300) 0 _________________________________________________________________ lstm_1 (LSTM) (None, 100) 160400 _________________________________________________________________ dense_1 (Dense) (None, 1) 101 ================================================================= ACCURACY: 0.791134375 LOSS: 0.4442952796936035 CPU times: user 2min 25s, sys: 16.9 s, total: 2min 42s Wall time: 1min 52s CPU times: user 1.39 s, sys: 256 ms, total: 1.64 s Wall time: 1.38 s precision recall f1-score support NEGATIVE POSITIVE 0.79 0.79 0.79 0.80 0.79 0.79 159494 160506 micro avg macro avg weighted avg 0.79 0.79 0.79 0.79 0.79 0.79 0.79 0.79 0.79 320000 320000 320000 Explanation: This is a Python script for performing sentiment analysis on a Twitter dataset using machine learning techniques. The script imports various libraries such as pandas, matplotlib, scikit-learn, Keras, nltk, and gensim. The dataset is cleaned, pre-processed, and transformed into a format that can be used to train and test machine learning models. The script then trains a machine learning model using Keras, and the trained model is used to make predictions on the test set. The performance of the model is evaluated using various metrics such as confusion matrix, classification report, and accuracy score. Finally, the trained model is exported for future use. Conclusion: Content-based social media analytics is a crucial step in understanding the audience and improving the social media presence of businesses. It involves analyzing the content shared on social media platforms, including text, emoticons, images, audio, and video content, to identify topics, issues, trends, sentiment, and image, audio, and video content. Developing a content-based social media analytics model involves data collection, preprocessing, feature extraction, model development, model evaluation, and implementation. By using content-based social media analytics, businesses can improve their social media strategy and engage with their audience effectively. EXPERIMENT NO. 6. Aim: Develop Structure based social media analytics model for any business. (E.g. Structure Based Models community detection, influence analysis). 
Theory:
Developing a structure-based social media analytics model for any business is a critical step in understanding the network of its audience and identifying influential users. In this experiment, we discuss the theory behind structure-based models, the execution steps to develop the model, and the benefits it can offer for businesses. Structure-based social media analytics is the process of analyzing the social network structure of a business or its customers. It helps businesses to understand the relationships within their audience and identify influential users. Structure-based analysis can be used for community detection and influence analysis.

Community Detection: This is the process of identifying groups or communities of users within a social network. It can help businesses to understand the interests and preferences of different user groups and create targeted content to improve engagement.

Influence Analysis: This is the process of identifying influential users within a social network. It can help businesses to identify users who have a significant impact on their audience and engage with them to improve their social media presence.

Execution Steps:
1. Data Collection: Collect data from social media platforms, including user profiles, followers, and interactions between users.
2. Network Construction: Construct a social network graph based on the collected data, with nodes representing users and edges representing interactions between them.
3. Community Detection: Use community detection algorithms to identify groups or communities of users within the social network (a small illustration follows these steps).
4. Influence Analysis: Use influence analysis algorithms to identify influential users within the social network.
5. Model Development: Develop a machine learning model to analyze the network structure and provide insights about the social network.
6. Model Evaluation: Evaluate the performance of the model using metrics such as modularity and centrality.
7. Implementation: Implement the model to analyze the social network structure and provide insights to improve the social media strategy.
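The listing that follows takes a tabular, influencer-level approach. For the graph-based techniques named above (execution steps 3 and 4), here is a minimal sketch using networkx on an invented interaction network; the edge list, the modularity-based algorithm, and the use of betweenness centrality as an influence measure are illustrative choices, not the experiment's prescribed method.

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Hypothetical interaction graph: an edge means one user replied to another.
edges = [("a", "b"), ("b", "c"), ("a", "c"),   # one tight cluster
         ("d", "e"), ("e", "f"), ("d", "f"),   # a second cluster
         ("c", "d")]                           # a bridge between them
G = nx.Graph(edges)

# Community detection (execution step 3): modularity-based grouping.
communities = greedy_modularity_communities(G)
for i, com in enumerate(communities):
    print(f"Community {i}: {sorted(com)}")

# Influence analysis (execution step 4): betweenness centrality flags
# users who bridge communities, such as 'c' and 'd' here.
centrality = nx.betweenness_centrality(G)
top = sorted(centrality.items(), key=lambda kv: kv[1], reverse=True)[:3]
print("Most influential (betweenness):", top)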
Code:

import numpy as np   # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import re

df_i = pd.read_csv('C:/Users/THEASHISHGAVADE/Downloads/social media influencers - instagram.csv')

def convert(x):
    return re.findall(r'\d+\.?\d*', x)

# Convert follower/engagement strings such as '48.5M' or '120K' to numbers
def change(df, list1):
    for i in list1:
        df['new'+i] = df[i].apply(convert)
        df['new'+i] = df['new'+i].apply(lambda x: "".join(x))
        df['new'+i] = pd.to_numeric(df['new'+i])
        df['new'+i] = np.where(['M' in j for j in df[i]], df['new'+i]*1000000,
                      np.where(['K' in j1 for j1 in df[i]], df['new'+i]*1000, df['new'+i]))
    return df

df_i.head(2)
df_i.rename({'category_1': 'Category', 'Audience country(mostly)': 'Audience Country'}, axis=1, inplace=True)
df_i.head(2)
df_i.isnull().sum()
df_i.drop_duplicates(subset=['Influencer insta name'], inplace=True)
df_i.shape
df_i.drop(labels=['Influencer insta name', 'Authentic engagement\r\n'], axis=1, inplace=True)
df_i.head(2)

li = ['Followers', 'Engagement avg\r\n']  # columns to convert; inferred from the fields used below
change(df_i, li)

# Engagement rate: the percentage of followers who really engage with the content posted by influencers
df_i['Engagement Rate'] = np.round((df_i['newEngagement avg\r\n'] / df_i['newFollowers']) * 100, 3)
print(df_i['Followers'].str[-1].unique())

# For convenience, express followers in millions
df_i['newFollowers'] = df_i['newFollowers'] / 1000000
df_i.drop(labels=['Engagement avg\r\n', 'newEngagement avg\r\n'], axis=1, inplace=True)
df_i.head(5)

# Top 15 most followed celebrities on Instagram
df_i.sort_values(by='newFollowers', ascending=False, ignore_index=True).iloc[0:15, [0, 1, 3, -1]]
plt.title('Top 15 most followed celebrities on Instagram')
plt.xlabel('Followers in Millions')
sns.barplot(y='instagram name', x='newFollowers',
            data=df_i.sort_values(by='newFollowers', ascending=False).head(15))

pallete = ['red', 'green', 'yellow', 'salmon', 'cyan', 'blue', 'orange']

def plot(df):
    plt.figure(figsize=(8, 6))
    plt.xlabel('number of times category occurred')
    plt.ylabel('Category')
    df['Category'].value_counts().sort_values(ascending=True).plot.barh(color=pallete)

# Top categories followed on Instagram (popular categories on Instagram)
plot(df_i)

# Decide where you want to place ads
def plot_c(df):
    plt.figure(figsize=(10, 8))
    plt.xlabel('number of times category occurred')
    df['Audience Country'].value_counts().sort_values().plot.barh(color=pallete)

# For understanding where the demand for a product is
def demand(data, category):
    return data[data['Category'] == category]['Audience Country'].value_counts().sort_values(ascending=True).plot.barh(color=pallete)

demand(df_i, 'Lifestyle')
df_i['newFollowers'].describe()
df_i['newFollowers'].quantile(0.94)
df_i.head(2)

def for_mini_followers_instagram(coun, cat):
    df1 = df_i[df_i['Audience Country'] == coun]
    df1_mini = df1[df1['newFollowers'] < 60]
    return df1_mini.sort_values(by='Engagement Rate', ascending=False).groupby('Category').get_group(cat).iloc[:, [0, 3, -1]]

for_mini_followers_instagram('India', 'Music')

def for_mega_followers_instagram(coun, cat):
    df1 = df_i[df_i['Audience Country'] == coun]
    df1_mega = df1[df1['newFollowers'] > 60]
    return df1_mega.sort_values(by='Engagement Rate', ascending=False).groupby('Category').get_group(cat).iloc[:, [0, 3, -1]]

for_mega_followers_instagram('India', 'Music')
for_mini_followers_instagram('India', 'Beauty')
for_mini_followers_instagram('India', 'Shows')
for_mini_followers_instagram('India', 'Sports with a ball')
for_mega_followers_instagram('India', 'Sports with a ball')
for_mega_followers_instagram('Brazil', 'Sports with a ball')

Output:

df_i.isnull().sum():
Influencer insta name          0
instagram name                21
Category                     108
category_2                   713
Followers                      0
Audience Country              14
Authentic engagement\r\n       0
Engagement avg\r\n             0
dtype: int64
df_i.head(5):
   instagram name   Category            category_2  Followers  Audience Country  newFollowers  Engagement Rate
0  433              Sports with a ball  NaN         48.5M      Spain             48.5          1.313
1  TAEYANG          Music               NaN         12.7M      Indonesia         12.7          4.270
2  НАСТЯ ИВЛЕЕВА    Shows               NaN         18.8M      Russia            18.8          2.010
3  Joy              Lifestyle           NaN         13.5M      Indonesia         13.5          10.370
4  Jaehyun          NaN                 NaN         11.1M      Indonesia         11.1          27.928

df_i['newFollowers'].describe():
count    997.000000
mean      25.539619
std       40.586338
min        2.600000
25%        9.000000
50%       14.600000
75%       26.500000
max      487.200000
Name: newFollowers, dtype: float64

Explanation:
This code analyzes data on social media influencers on Instagram. It uses Python libraries such as NumPy, pandas, Matplotlib, and Seaborn for data processing, visualization, and analysis. The code loads a CSV file containing influencer data and performs various data cleaning and manipulation operations, such as dropping duplicates, converting data types, and renaming columns. After cleaning the data, the code generates several visualizations: a bar plot of the top 15 most followed celebrities on Instagram, a bar plot of the most popular categories followed on Instagram, and bar plots of the countries with the highest demand for different product categories. Lastly, the code defines several functions that filter the data by criteria such as country, category, and number of followers. These functions can be used to analyze the engagement rate of influencers, identify the most popular categories among followers, and find influencers with high engagement rates in specific categories and countries.

Conclusion:
Structure-based social media analytics is a crucial step in understanding the network of the audience and identifying influential users for businesses. It involves analyzing the social network structure based on user profiles, followers, and interactions between users. Developing a structure-based social media analytics model involves data collection, network construction, community detection, influence analysis, model development, model evaluation, and implementation. By using structure-based social media analytics, businesses can improve their social media strategy, identify influential users, and engage with them to improve their social media presence.

EXPERIMENT NO. 7.

Aim: Develop a dashboard and reporting tool based on real-time social media data.

Theory:
Developing a dashboard and reporting tool based on real-time social media data is a critical step in monitoring the performance of a business's social media strategy. In this experiment, we discuss the theory behind social media dashboards, the execution steps to develop the dashboard and reporting tool, and the benefits it can offer for businesses. A social media dashboard is a tool that provides real-time monitoring of social media activities, including engagement, reach, and impressions. It helps businesses to track the performance of their social media strategy and identify areas for improvement. The dashboard can provide data visualizations, including charts and graphs, to enable users to interpret the data easily. Reporting tools are used to create regular reports based on the data gathered by the dashboard. These reports can provide insights into the performance of the social media strategy, identify trends, and suggest areas for improvement.

Execution Steps:
1. Identify Key Performance Indicators (KPIs): Identify the KPIs that are important for the business, such as engagement rate, reach, and impressions.
2. Data Collection: Collect data from social media platforms, including user profiles, followers, and interactions between users.
3. Data Processing: Pre-process the data to remove any irrelevant data and prepare it for analysis.
4. Data Visualization: Use data visualization tools such as charts and graphs to represent the data in an easily understandable format.
5. Dashboard Development: Develop a dashboard using a tool such as Tableau, Power BI, or Google Data Studio.
6. Reporting Tool Development: Develop a reporting tool that generates regular reports based on the data collected by the dashboard.
7. Implementation: Implement the dashboard and reporting tool to provide real-time monitoring of social media activities and generate regular reports for the business.

Code:

import pandas as pd
import snscrape.modules.twitter as sntwitter
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
import string
import re
import textblob
from textblob import TextBlob
from wordcloud import WordCloud, STOPWORDS
from wordcloud import ImageColorGenerator
import warnings
%matplotlib inline

import os
# Use the os library to call CLI commands from Python
os.system("snscrape --jsonl --max-results 10000 --since 2023-03-13 twitter-search 'CHATGPT4' > text-chatGPT4-tweets.json")

# Create a pandas DataFrame
tweets_df_chatGPT4 = pd.read_json('text-chatGPT4-tweets.json', lines=True)
# tweets_df_chatGPT4

# 2. Data Loading
df_chatGPT4 = tweets_df_chatGPT4[["date", "rawContent", "renderedContent", "user", "replyCount",
                                  "retweetCount", "likeCount", "lang", "place", "hashtags", "viewCount"]]
print(df_chatGPT4.shape)

# 3. Twitter Data Cleaning, Preprocessing and Exploratory Data Analysis
df2 = df_chatGPT4.drop_duplicates('renderedContent')
# Shape of the DataFrame
print(df2.shape)
df2.head()
df2.info()
df2.date.value_counts()

# Heatmap for missing values
plt.figure(figsize=(17, 5))
sns.heatmap(df2.isnull(), cbar=True, yticklabels=False)
plt.xlabel("Column_Name", size=14, weight="bold")
plt.title("Positions of missing values by column", fontweight="bold", size=17)
plt.show()

Output:

df2.head() (rawContent duplicates renderedContent and the user column holds snscrape User objects; both are omitted here for readability):
        date                       renderedContent                                     replyCount  retweetCount  likeCount  lang  place  hashtags                                            viewCount
0  2023-04-01 12:46:53+00:00  So I keep using up all the ChatGPT4 20 questio...       0           0             0          en    None   None                                                NaN
1  2023-04-01 12:46:32+00:00  @theDontGetRekt @mreflow No in ChatGPT4. You c...       0           0             0          en    None   None                                                1.0
2  2023-04-01 12:43:04+00:00  Italia Berencana Memblokir ChatGPT, Kenapa Ya?...       0           0             0          in    None   [ChatGPT, chatgpt4, Italia, OpenAIChatGPT, Ope...   6.0
3  2023-04-01 12:40:41+00:00  essa parada do chat gpt4 é sinistra, daqui a p...       0           0             0          pt    None   None                                                7.0
4  2023-04-01 12:40:29+00:00  Basically : I had expressed my feelings. 🤪...           1           0             0          en    None   [ChatGPT4, ChatGPT5, technology, OpenAI, ChatG...   13.0
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3335 entries, 0 to 3368
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   date             3335 non-null   datetime64[ns, UTC]
 1   rawContent       3335 non-null   object
 2   renderedContent  3335 non-null   object
 3   user             3335 non-null   object
 4   replyCount       3335 non-null   int64
 5   retweetCount     3335 non-null   int64
 6   likeCount        3335 non-null   int64
 7   lang             3335 non-null   object
 8   place            54 non-null     object
 9   hashtags         1949 non-null   object
 10  viewCount        3332 non-null   float64
dtypes: datetime64[ns, UTC](1), float64(1), int64(3), object(6)
memory usage: 312.7+ KB

Explanation:
This code collects and analyzes tweets related to a specific topic. It uses the Python libraries pandas, snscrape, numpy, matplotlib, seaborn, nltk, re, textblob, and wordcloud to perform data cleaning, preprocessing, and exploratory data analysis on the collected tweets. The snscrape library scrapes Twitter data based on a search query, in this case tweets matching the search term 'CHATGPT4'. The collected data is stored in a JSON file and then read into a pandas DataFrame. The DataFrame is cleaned to remove duplicate tweets, and exploratory data analysis is performed using visualizations such as a heatmap showing the positions of missing values in the DataFrame. The workflow also supports natural language processing techniques such as tokenization, stopword removal, stemming, and sentiment analysis using the textblob library, and a word cloud can be generated to visualize the most frequent words in the collected tweets.

Conclusion:
Developing a dashboard and reporting tool based on real-time social media data is a critical step in monitoring the performance of a business's social media strategy. It involves identifying the KPIs, collecting and pre-processing data, visualizing the data, and developing the dashboard and reporting tool. By using a social media dashboard and reporting tool, businesses can monitor the performance of their social media strategy in real time, identify areas for improvement, and generate regular reports to improve their social media presence.

EXPERIMENT NO. 8.

Aim: Design creative content for the promotion of your business on a social media platform.

Theory:
Designing creative content for the promotion of your business on a social media platform is a crucial aspect of building a strong online presence. In this experiment, we discuss the theory behind designing creative content for social media, the execution steps to create effective content, and the benefits it can offer for businesses. Designing creative content for social media requires a good understanding of your target audience and the social media platform you plan to use. Effective content must capture the audience's attention and communicate the brand's message. Content can take various forms, such as text, images, videos, or infographics, and it must be engaging, informative, and relevant to the target audience.

Execution Steps:
1. Define the Target Audience: Identify the target audience, including their interests, preferences, and behaviors, to create content that resonates with them.
2. Choose the Social Media Platform: Choose the social media platform based on the target audience and the business's goals. Different platforms have different formats and audience demographics.
A sample social media ad campaign to promote a fictional technology business called "TechBoost":

Ad #1
Image: A sleek, modern laptop with the TechBoost logo on the screen.
Copy: "Upgrade your tech game with TechBoost. Our high-performance laptops are designed to keep up with your busy lifestyle, whether you're a student, professional, or gamer. Shop now and experience the power of TechBoost!"

Ad #2
Image: A person working on their laptop in a coffee shop, with the TechBoost logo visible on the back of the laptop.
Copy: "Take your work with you wherever you go. With TechBoost, you can work from anywhere with ease. Our laptops are lightweight and easy to carry, so you can stay productive on the go."

Ad #3
Image: A person gaming on a TechBoost laptop, with a high-resolution game visible on the screen.
Copy: "Experience ultimate gaming performance with TechBoost. Our laptops are equipped with the latest graphics cards and processors, so you can enjoy your favorite games at the highest settings. Get yours now and take your gaming to the next level!"

Overall, the campaign showcases TechBoost's high-performance laptops, portability, and versatility. The sleek, modern visuals and bold copy are intended to appeal to tech-savvy buyers looking for high-quality products.

Conclusion: Designing creative content to promote a business on social media is critical to building a strong online presence. It involves understanding the target audience, choosing the right platform, developing a content strategy, creating and optimizing the content, scheduling it, and analyzing its performance. By creating engaging and relevant content, businesses can attract and retain their target audience, increase brand awareness, and achieve their business goals.

EXPERIMENT NO. 9.

Aim: Analyze competitor activities using social media data.

Theory: Analyzing competitor activities using social media data is an essential part of a business's social media strategy. It helps businesses understand their competitors' strengths and weaknesses, identify new opportunities, and improve their own social media performance. In this article, we will discuss the theory behind analyzing competitor activities using social media data, the execution steps to perform this analysis, and the benefits it offers businesses. Analyzing competitor activities involves monitoring and analyzing competitors' social media output, including their content, engagement metrics, audience demographics, and advertising campaigns.
It gives businesses insight into their competitors' social media strategies, lets them benchmark their own performance, and highlights areas for improvement. Effective competitor analysis requires a good understanding of the social media platforms competitors use and of the tools available to monitor their activities.

Execution Steps:
1. Identify Competitors: Identify the competitors to analyze based on their industry, target audience, and social media presence.
2. Determine Social Media Platforms: Determine which platforms the competitors use and the frequency and types of content they post.
3. Monitor Competitor Activity: Monitor competitors' social media activity using management tools such as Hootsuite, Buffer, or Sprout Social to identify how often and what they post.
4. Analyze Engagement Metrics: Analyze likes, comments, shares, and follower counts to identify which content resonates with the audience and how engaged that audience is.
5. Evaluate Advertising Campaigns: Evaluate competitors' advertising using tools such as the Facebook Ads Library or the Twitter Ads Transparency Center to gauge target audience and ad spend.
6. Benchmark Performance: Benchmark the business's own social media performance against competitors using metrics such as audience growth, engagement rate, and advertising spend.
7. Identify Opportunities: Identify new opportunities based on the insights gained from the competitor analysis, such as new content ideas, audience segments, or advertising strategies.

Code:

import numpy as np               # linear algebra
import pandas as pd              # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt

df = pd.read_csv('/kaggle/input/linkedin-influencers-data/ashish_influencers_data.csv')
df.head(3)
df.shape
df.columns
df.info()
df.describe()
df['name'].unique()
df['name'].unique().shape
df.isna().sum()

# Drop columns that are not needed for the comparison
df = df.drop(columns=['views', 'votes', 'media_type', 'content',
                      'connections', 'location'])
df.isna().sum()

# Helper to avoid repeating the same bar-chart code for each influencer
def bar_chart(x, series_list, title, ylabel, colors, legend=None):
    fig, ax = plt.subplots(figsize=(20, 8))
    for series, color in zip(series_list, colors):
        ax.bar(x, series, color=color)
    ax.xaxis.set_major_locator(plt.MaxNLocator(10))
    ax.set_xlabel('Time Spent', fontsize=11)
    ax.set_ylabel(ylabel, fontsize=11)
    plt.title(title, fontsize=25)
    if legend:
        plt.legend(legend)
    plt.grid()
    plt.show()

# First influencer: Ashish Gavade
df_first = df[df['name'] == 'Ashish Gavade'].copy()
df_first.head()
df_first.info()
df_first = df_first.dropna(subset=['followers'])   # drop rows with missing followers
df_first.isna().sum()
df_first['followers'] = df_first['followers'].astype(int)

bar_chart(df_first['time_spent'], [df_first['followers']],
          'Followers', 'Number of people', ['gray'])
bar_chart(df_first['time_spent'], [df_first['reactions'], df_first['comments']],
          'Reaction Vs. Comments', 'Number of people',
          ['forestgreen', 'blue'], ['Reactions', 'Comments'])
bar_chart(df_first['time_spent'], [df_first['num_hashtags'], df_first['hashtag_followers']],
          'Number of Hashtags Vs. Hashtag Followers', 'Numbers',
          ['purple', 'lightseagreen'], ['Number of Hashtags', 'Hashtag Followers'])
# Second influencer: Tom Goodwin
df_tom = df[df['name'] == 'Tom Goodwin'].copy()
df_tom = df_tom.dropna(subset=['followers'])
df_tom.isna().sum()
df_tom.info()

bar_chart(df_tom['time_spent'], [df_tom['followers']],
          'Followers', 'Number of people', ['gray'])
bar_chart(df_tom['time_spent'], [df_tom['reactions'], df_tom['comments']],
          'Reaction Vs. Comments', 'Number of people',
          ['forestgreen', 'blue'], ['Reactions', 'Comments'])
bar_chart(df_tom['time_spent'], [df_tom['num_hashtags'], df_tom['hashtag_followers']],
          'Number of Hashtags Vs. Hashtag Followers', 'Numbers',
          ['purple', 'lightseagreen'], ['Number of Hashtags', 'Hashtag Followers'])

Output (abridged): df.describe() (Unnamed: 0 is the row index; hashtag_followers is zero throughout and views is empty):

         Unnamed: 0     followers  num_hashtags  hashtag_followers    reactions      comments  views
count  34012.000000  3.397000e+04  34012.000000            34012.0  34012.000000  34012.000000    0.0
mean   17005.500000  1.125922e+06      2.099788                0.0    472.956486     26.977273    NaN
std     9818.563014  3.057750e+06      3.517457                0.0   4163.929944    216.364372    NaN
min        0.000000  1.710000e+02      0.000000                0.0      0.000000      0.000000    NaN
25%     8502.750000  9.914800e+04      0.000000                0.0      7.000000      0.000000    NaN
50%    17005.500000  4.082540e+05      0.000000                0.0     36.000000      2.000000    NaN
75%    25508.250000  7.193340e+05      3.000000                0.0    143.000000     14.000000    NaN
max    34011.000000  1.828935e+07     48.000000                0.0 391498.000000  32907.000000    NaN

df['name'].unique() lists the influencers in the dataset:

array(['Ashish Gavade', 'Jonathan Wolfer', 'Karen Gross',
       'Kaia Niambi Shivers Ph.D.', "Daniel Cohen-I'm Flyering",
       'Natalie Riso', 'Dale Corley', 'James Calder', 'Yasi Baiani',
       'Julie Kliger', 'Stephanie C. O.', 'Michelle de Haaff',
       'Bertalan Meskó, MD, PhD', 'Michelle Chaffee', 'Beth Seidenberg',
       'Russell Benaroya', 'Richard Branson', 'Mohamed El-Erian',
       'James Altucher', 'Bernard Marr', 'Ian Bremmer', 'Sramana Mitra',
       'Lynne Everatt', 'Justin Bariso', 'Carson Tate', 'Gary Frisch',
       'James Bareham', 'Tai T.', 'Glenn Leibowitz', 'Marianne Griebler',
       'Tom Goodwin', 'Katie Martell', 'Shama Hyder',
       'Barry W. Enderwick', 'Steve Blakeman', 'Gillian Zoe Segal',
       'Tom Foremski', 'Kiara Imani Williams, Esq.', 'Kellye Whitney',
       'Simon Owens', 'Rachel Jones', 'Vikram Mansharamani',
       '🏉 Pascal Bouvier', 'Geoffrey Garrett', 'Ben Casselman',
       'Tamal Bandyopadhyay', 'Karen Webster', 'Jody Padar',
       'Hansi Mehrotra', 'Nick Ciubotariu', 'Neil Hughes', 'Nir Eyal',
       'Shelly Palmer', 'Lee Naik', 'Danielle Newnham', 'Vani Kola',
       'Chris McCann', 'Andrew Yang', 'Lisa Abeyta', 'Juliet de Baubigny',
       'Sarah Kauss', 'Pocket Sun', 'Chantel Soumis', 'String Nguyen',
       'Quentin Michael Allums', 'AJ Wilcox', "Kevin O'Leary",
       'Amy Blaschka', 'Simon Sinek'], dtype=object)
Explanation: This code analyzes LinkedIn influencer data by reading a CSV file containing each influencer's name, time spent on LinkedIn, number of followers, reactions, comments, hashtags, and hashtag followers. The code drops unneeded columns, cleans the data, and creates visualizations comparing the two influencers Ashish Gavade and Tom Goodwin. The charts plot the number of followers, reactions versus comments, and number of hashtags versus hashtag followers against time spent on LinkedIn for each influencer.
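Step 6 calls for benchmarking performance with metrics such as engagement rate, which the code above does not compute directly. A minimal sketch, assuming the df loaded in the code above; engagement rate is taken here as reactions plus comments divided by followers, averaged per influencer.

# Benchmark engagement rate per influencer; assumes df from the code above
bench = df.dropna(subset=['followers']).copy()
bench = bench[bench['followers'] > 0]            # avoid division by zero
bench['engagement_rate'] = (bench['reactions'] + bench['comments']) / bench['followers']

summary = (bench.groupby('name')['engagement_rate']
                .agg(['mean', 'median', 'count'])
                .sort_values('mean', ascending=False))
print(summary.head(10))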
Conclusion: Analyzing competitor activities using social media data is crucial for businesses that want to improve their social media performance, gain new insights, and identify new opportunities. It involves identifying competitors, monitoring their social media activities, analyzing engagement metrics, evaluating advertising campaigns, benchmarking performance, and identifying opportunities. Through effective competitor analysis, businesses can gain a competitive edge and achieve their social media goals.

EXPERIMENT NO. 10.

Aim: Develop social media text analytics models for improving existing product/service by analyzing customer's reviews/comments.

Theory: Developing social media text analytics models to improve existing products and services by analyzing customer reviews and comments is an essential part of a business's social media strategy. It helps businesses understand customer sentiment and feedback, identify areas for improvement, and take action to improve their products and services. In this article, we will discuss the theory behind developing social media text analytics models, the execution steps to perform this analysis, and the benefits it offers businesses. Developing these models involves collecting and analyzing customer reviews and comments from social media platforms such as Facebook, Twitter, and LinkedIn. Text analytics techniques such as sentiment analysis, topic modeling, and opinion mining can then be used to gain insight into customer sentiment and feedback and to pinpoint areas for improvement.

Execution Steps:
1. Collect Customer Reviews/Comments: Collect customer reviews and comments from social media platforms such as Facebook, Twitter, and LinkedIn using management tools such as Hootsuite, Buffer, or Sprout Social.
2. Preprocess the Data: Remove irrelevant content such as URLs, hashtags, and mentions, and apply text normalization techniques such as tokenization, stemming, and lemmatization.
3. Perform Sentiment Analysis: Identify the polarity of each review or comment (positive, negative, or neutral) using rule-based, machine learning, or hybrid approaches.
4. Perform Topic Modeling: Identify the topics mentioned in the reviews using techniques such as Latent Dirichlet Allocation (LDA) or Non-Negative Matrix Factorization (NMF); a small LDA sketch follows this list.
5. Perform Opinion Mining: Identify the opinion holders and their views on specific product or service aspects, such as quality, price, or customer service.
6. Identify Areas for Improvement: Use the insights gained from the analysis to identify areas for improvement, such as product quality, pricing strategy, or customer service.
7. Take Necessary Actions: Address the identified areas, for example by revising product or service features, pricing, or customer support.
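As mentioned in step 4, topic modeling can surface the themes customers talk about. The experiment code below focuses on sentiment classification, so here is a minimal, self-contained LDA sketch using scikit-learn; the four sample reviews are invented for illustration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Invented sample reviews standing in for preprocessed customer comments
docs = [
    "battery life is great but the screen is dim",
    "screen resolution is sharp and the price is fair",
    "terrible customer service and slow delivery",
    "delivery was fast and the support team answered quickly",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=42)
lda.fit(X)

# Show the top words for each discovered topic
terms = vec.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"Topic {k}: {', '.join(top)}")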
Code:

import os
import re
import warnings
warnings.filterwarnings("ignore")

import numpy as np               # linear algebra
import pandas as pd              # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

import nltk
from nltk.corpus import stopwords

# Input data files are available in the read-only "../input/" directory
for dirname, _, filenames in os.walk('/input/THEASHISHGAVADE'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

reviews_data = pd.read_csv("/input/THEASHISHGAVADE/kindle-reviews/ashish_kindle_reviews.csv")
reviews_data.shape
reviews_data.columns
reviews_data.head()

# Rename columns
reviews_data.rename(columns={'asin': 'ProductId', 'overall': 'Score',
                             'reviewText': 'Text', 'reviewerID': 'UserId',
                             'summary': 'Summary', 'unixReviewTime': 'Time'},
                    inplace=True)

# Drop unwanted columns
reviews_data.drop(['Unnamed: 0', 'helpful', 'reviewTime', 'reviewerName'],
                  axis=1, inplace=True)
reviews_data.head()

# Distribution of reviews per score
category_dist = reviews_data['Score'].value_counts()
plt.figure(figsize=(10, 6))
my_colors = ['g', 'r', 'b', 'm', 'y']
category_dist.plot(kind='bar', color=my_colors)
plt.grid()
plt.xlabel("Scores")
plt.ylabel("Number of Reviews Per Score")
plt.title("Distribution of Reviews Per Score")
plt.show()

# Drop neutral (score 3) reviews, then label score > 3 as Positive and < 3 as Negative
reviews_data = reviews_data.loc[reviews_data['Score'] != 3]
reviews_data.shape

def partition(x):
    return 'Negative' if x < 3 else 'Positive'

reviews_data['Score'] = reviews_data['Score'].map(partition)
reviews_data.head()

reviews_data = reviews_data.head(50000)   # considering only the first 50k rows

category_dist = reviews_data['Score'].value_counts()
plt.figure(figsize=(10, 6))
category_dist.plot(kind='bar', color=['g', 'r'])
plt.grid()
plt.xlabel("Scores")
plt.ylabel("Number of Reviews")
plt.title("Distribution of Reviews")
plt.show()

reviews_data[reviews_data['UserId'] == 'A3SPTOKDG7WBLN']   # example: one user's reviews

# Sort by ProductId in ascending order, then deduplicate entries
reviews_data = reviews_data.sort_values('ProductId', axis=0, ascending=True,
                                        kind='quicksort', na_position='last')
reviews_data = reviews_data.drop_duplicates(subset=["ProductId", "UserId", "Time", "Text"],
                                            keep='first')
reviews_data.shape

# Data Preprocessing: load NLTK's English stop words
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def txt_preprocessing(total_text, index, column, df):
    if type(total_text) is not int:
        cleaned = ""
        # replace every special character with a space
        total_text = re.sub(r'[^a-zA-Z0-9\n]', ' ', total_text)
        # collapse multiple spaces into a single space
        total_text = re.sub(r'\s+', ' ', total_text)
        # convert all characters to lower case
        total_text = total_text.lower()
        for word in total_text.split():
            # retain the word only if it is not a stop word
            if word not in stop_words:
                cleaned += word + " "
        df.loc[index, column] = cleaned

for index, row in reviews_data.iterrows():
    if type(row['Text']) is str:
        txt_preprocessing(row['Text'], index, 'Text', reviews_data)
    else:
        print("THERE IS NO TEXT DESCRIPTION FOR ID :", index)
reviews_data.head()

# Check and remove null values (rows)
reviews_data.isna().sum()
reviews_data.dropna(axis=0, inplace=True)
reviews_data.shape
reviews_data['Score'].value_counts()

# Balance the classes by upsampling the minority (Negative) class
from sklearn.utils import resample
cls_majority = reviews_data[reviews_data['Score'] == 'Positive']
cls_minority = reviews_data[reviews_data['Score'] == 'Negative']
cls_minority_upsampled = resample(cls_minority,
                                  replace=True,      # sample with replacement
                                  n_samples=44381,   # to match the majority class
                                  random_state=42)   # reproducible results
# Combine the majority class with the upsampled minority class
upsampled_data = pd.concat([cls_minority_upsampled, cls_majority])
upsampled_data.head()
upsampled_data.shape
upsampled_data['Score'].value_counts()

# Train / cross-validation / test split
from sklearn.model_selection import train_test_split
X = upsampled_data['Text']
Y = upsampled_data['Score']
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.20,
                                                    stratify=Y, random_state=0)
X_train, X_cv, y_train, y_cv = train_test_split(X_train, y_train, test_size=0.20,
                                                stratify=y_train, random_state=0)
print("NUMBER OF DATA POINTS IN TRAIN DATA :", X_train.shape[0])
print("NUMBER OF DATA POINTS IN CROSS VALIDATION DATA :", X_cv.shape[0])
print("NUMBER OF DATA POINTS IN TEST DATA :", X_test.shape[0])

# TF-IDF vectorization of the review text
from sklearn.feature_extraction.text import TfidfVectorizer
text_vec = TfidfVectorizer(min_df=10, max_features=5000)
text_vec.fit(X_train.values)
train_text = text_vec.transform(X_train.values)
test_text = text_vec.transform(X_test.values)
cv_text = text_vec.transform(X_cv.values)
print("Shape of Matrix - TFIDF")
print(train_text.shape)
print(test_text.shape)
print(cv_text.shape)

# Confusion / precision / recall matrices: plots the matrices given y_i and y_i_hat
from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(test_y, predict_y):
    C = confusion_matrix(test_y, predict_y)
    A = ((C.T) / (C.sum(axis=1))).T   # recall matrix (rows sum to 1)
    B = C / C.sum(axis=0)             # precision matrix (columns sum to 1)
    # confusion_matrix orders classes alphabetically: Negative, Positive
    labels = ['Negative', 'Positive']
    print("-" * 40, "Confusion Matrix", "-" * 40)
    plt.figure(figsize=(8, 5))
    sns.heatmap(C, annot=True, cmap="YlGnBu", fmt=".3f",
                xticklabels=labels, yticklabels=labels)
    plt.xlabel('Predicted Class')
    plt.ylabel('Original Class')
    plt.show()
    print("-" * 40, "Precision Matrix (Column Sum=1)", "-" * 40)
    plt.figure(figsize=(8, 5))
    sns.heatmap(B, annot=True, cmap="YlGnBu", fmt=".3f",
                xticklabels=labels, yticklabels=labels)
    plt.xlabel('Predicted Class')
    plt.ylabel('Original Class')
    plt.show()
    print("-" * 40, "Recall Matrix (Row Sum=1)", "-" * 40)
    plt.figure(figsize=(8, 5))
    sns.heatmap(A, annot=True, cmap="YlGnBu", fmt=".3f",
                xticklabels=labels, yticklabels=labels)
    plt.xlabel('Predicted Class')
    plt.ylabel('Original Class')
    plt.show()

# Logistic Regression Model
# Train an SGD-based logistic regression with probability calibration on the TF-IDF features
from sklearn.linear_model import SGDClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import log_loss

alpha = [10 ** x for x in range(-5, 1)]
cv_log_error_array = []
for i in alpha:
    # loss='log' gives logistic regression (use loss='log_loss' on newer scikit-learn)
    clf = SGDClassifier(alpha=i, penalty='l2', loss='log', random_state=42)
    clf.fit(train_text, y_train)
    sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
    sig_clf.fit(train_text, y_train)
    predict_y = sig_clf.predict_proba(cv_text)
    cv_log_error_array.append(log_loss(y_cv, predict_y, labels=clf.classes_, eps=1e-15))
    print('For Values of Alpha =', i, "The Log Loss is:",
          log_loss(y_cv, predict_y, labels=clf.classes_, eps=1e-15))

fig, ax = plt.subplots()
ax.plot(alpha, cv_log_error_array, c='r')
for i, txt in enumerate(np.round(cv_log_error_array, 3)):
    ax.annotate((alpha[i], np.round(txt, 3)), (alpha[i], cv_log_error_array[i]))
plt.grid()
plt.title("Cross Validation Error for Each Alpha")
plt.xlabel("Alpha i's")
plt.ylabel("Error Measure")
plt.show()

# Retrain with the best alpha and report log loss on all three splits
best_alpha = np.argmin(cv_log_error_array)
clf = SGDClassifier(alpha=alpha[best_alpha], penalty='l2', loss='log', random_state=42)
clf.fit(train_text, y_train)
lr_sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
lr_sig_clf.fit(train_text, y_train)

predict_y = lr_sig_clf.predict_proba(train_text)
print('For Values of Best Alpha =', alpha[best_alpha], "The Train Log Loss is:",
      log_loss(y_train, predict_y, labels=clf.classes_, eps=1e-15))
predict_y = lr_sig_clf.predict_proba(test_text)
print('For Values of Best Alpha =', alpha[best_alpha], "The Test Log Loss is:",
      log_loss(y_test, predict_y, labels=clf.classes_, eps=1e-15))
predict_y = lr_sig_clf.predict_proba(cv_text)
print('For Values of Best Alpha =', alpha[best_alpha], "The Cross Validation Log Loss is:",
      log_loss(y_cv, predict_y, labels=clf.classes_, eps=1e-15))

lr_train_accuracy = lr_sig_clf.score(train_text, y_train) * 100
lr_test_accuracy = lr_sig_clf.score(test_text, y_test) * 100
lr_cv_accuracy = lr_sig_clf.score(cv_text, y_cv) * 100
print("Logistic Regression Train Accuracy -", lr_train_accuracy)
print("Logistic Regression Test Accuracy -", lr_test_accuracy)
print("Logistic Regression CV Accuracy -", lr_cv_accuracy)

plot_confusion_matrix(y_cv, lr_sig_clf.predict(cv_text))

# Predict on the test data and inspect a few predictions
test_pred = lr_sig_clf.predict(test_text)
from sklearn.metrics import classification_report
print(classification_report(y_test, test_pred))
final_test_df = pd.DataFrame({'Text': X_test, 'Review': test_pred.tolist()})
final_test_df.head(10)
final_test_df.values[5]

Output (abridged):

For Values of Best Alpha = 1e-05 The Train Log Loss is: 0.15253560593949955
For Values of Best Alpha = 1e-05 The Test Log Loss is: 0.18438127714047778
For Values of Best Alpha = 1e-05 The Cross Validation Log Loss is: 0.1874222604802398

(The confusion, precision, and recall matrices are rendered as heatmaps and are not reproduced here.)

Explanation: The code implements sentiment analysis on Kindle product reviews; the dataset contains the reviews customers gave for each product. The code preprocesses the data by dropping unwanted columns, cleaning the text, handling null values, and balancing the classes by upsampling the minority class, then splits the balanced dataset into training, cross-validation, and test sets. The review text is vectorized with TF-IDF, and a logistic regression model (an SGDClassifier with log loss wrapped in sigmoid probability calibration) is tuned over several values of the regularization strength alpha, retrained with the best value, and evaluated using log loss, accuracy, confusion/precision/recall matrices, and a classification report. Throughout, the code uses pandas for data manipulation, Matplotlib and Seaborn for visualization, NLTK for stop-word removal and text cleaning, and scikit-learn's train_test_split and resample functions for dataset splitting and balancing, respectively.
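Step 5 (opinion mining) is not implemented in the code above. A rough aspect-level sketch follows, assuming the reviews_data DataFrame built earlier; the aspect keywords are illustrative, and TextBlob polarity on the stop-word-stripped text gives only an approximate signal.

from textblob import TextBlob

aspects = ['price', 'quality', 'service']   # illustrative aspect keywords

def aspect_sentiment(texts, aspect):
    # Average TextBlob polarity over reviews that mention the aspect
    hits = [t for t in texts if aspect in t.lower()]
    if not hits:
        return None
    return sum(TextBlob(t).sentiment.polarity for t in hits) / len(hits)

texts = reviews_data['Text'].astype(str).tolist()
for a in aspects:
    print(a, '->', aspect_sentiment(texts, a))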
Conclusion: Developing social media text analytics models to improve existing products and services by analyzing customer reviews and comments is crucial for businesses seeking to improve customer satisfaction, gain new insights, and identify areas for improvement. It involves collecting customer reviews and comments, preprocessing the data, performing sentiment analysis, topic modeling, and opinion mining, identifying areas for improvement, and taking the necessary actions. Through effective text analytics, businesses can gain a competitive edge, improve customer loyalty, and achieve their business goals.