JSPM’S Jayawantrao Sawant College Of Engineering, Hadapsar, Pune-28
Department of Computer Engineering
Third Year (A.Y. 2021 – 2022)
310255 : Internship
Joel Silas - Roll number : 3259
T.E. B Computer Dept. JSPM’s JSCOE

We live in a world where we collect huge amounts of data, and traditional methods and techniques are no longer sufficient to process it. Alongside increasingly sophisticated computers, new ways of processing data are evolving. Data Science is an emerging multidisciplinary field that combines classical disciplines like statistics and mathematics with computer science. The main goal of Data Science is to turn large sets of both unstructured and structured data into useful information that can help organizations make data-driven decisions.

At a high level, data science can be described as a set of fundamental principles necessary for the successful extraction of information from data. Many powerful tools can help data scientists in this process, but to use them wisely, data scientists need a solid grounding in statistics, mathematics and computer science, and they also need to be able to see business problems from a data perspective. Data science uses scientific methods, processes, algorithms, statistics, data mining, databases, and distributed systems to extract knowledge and insights from data.

The main goals are :
(i) To present a short summary of the history and definition of data science.
(ii) To elaborate similarities and differences between Business Intelligence and Data Science.
(iii) To overview the life cycle of data science.
(iv) To outline the benefits and various applications of data.
“Machine Learning is a field of study that gives computers the ability to learn and determine the output without being explicitly programmed.” -Arthur Samuel (1959)

Machine learning (ML) is a sub-domain of artificial intelligence (AI) that allows software applications to become more accurate at predicting outcomes without being explicitly programmed to do so. Machine learning algorithms use historical data as input to predict new output values. The Python programming language provides a rich set of libraries for running machine learning algorithms.

Elite Techno Groups (ETG) is a company that organizes training programs and internships in an interactive mode, where various individual and team-based exercises are conducted in order to bridge the gap between academics and industry practices.
Vision : ETG aims to revolutionize the education scenario across the world by inculcating a pragmatic approach among students towards the academic knowledge imparted by institutions.
Mission : Elite Techno Groups emphasizes the intellectual development of students by providing them with a practical mode of learning, thereby channeling their technical knowledge towards innovative real-world applications.

The primary purpose of machine learning is to discover intricate patterns in user data and then make predictions based on these patterns, in order to answer business questions and solve business problems. Machine learning helps in analyzing data as well as identifying trends. It gives enterprises a view of trends in customer behavior and operational business patterns, and supports the development of new products. Ultimately it helps to explore, sort and analyze data from various sources and reach conclusions that optimize decisions and business processes. A further aim is to explore new learning methods and develop general learning algorithms independent of applications.

Objectives of the projects completed during this internship :
1.
Regression : to predict the value of a dependent variable based on independent variables.
2. Classification : to group large volumes of data into classes based on similarities.
3. Reinforcement Learning : a learning agent should be able to perceive and interpret its environment, take actions, and learn through trial and error.
4. Natural Language Processing (NLP) : to read, understand, and decode human language in a useful manner, so that humans and computers can communicate naturally.

Computer Model : Standard x86 (32-bit) or x86 (64-bit) compatible desktop or laptop computer
Memory : At least 1GB of RAM
Operating System Requirement (Any one)
• Windows 10, 32- or 64-bit versions
• Windows 8 or 8.1, 32- or 64-bit versions
• Windows 7, 32- or 64-bit versions
Software Requirement (Any one)
• Anaconda Navigator
• Google Colab
• Jupyter Notebook
• PyCharm IDE
• Visual Studio Code

1. Machine learning is the scientific process of training systems to act upon data without requiring explicit, programmed instructions.
2. As a subtype of Artificial Intelligence (AI), machine learning leverages algorithms and statistical models to identify patterns and predict future outcomes.
3. Most machine learning initiatives fit within one of three models: supervised, unsupervised and reinforcement learning.

Supervised machine learning : It begins with a known, labeled dataset (often called “training data”) and uses that data to make predictions, which are compared against actual outcomes in order to further refine the algorithm.
Unsupervised algorithms : These work on unlabeled data, leaving the algorithm to identify patterns and structure in the data on its own.
Reinforcement learning : A machine learning training method based on rewarding desired behaviors and/or punishing undesired ones.

1.
Inventory Management System (Python programming) – An inventory is the store-house of a shop; tracking it shows which stock is available and which is required based on market demand. Implementing this in Python makes the stock easy to manage. It also records information such as expiry dates so that products can be distributed carefully.
Project Link : https://github.com/joelsilas1816/Inventory-Management-System-for-Skill-India-AI-ML-internship/blob/main/IMS%20add%20products.ipynb
2. Data Analysis and Visualization – This involves steps such as data wrangling, in which the data is cleaned by removing missing values and outliers and the required features are extracted to make it ready for analysis. Graphical / pictorial representation of the data then helps identify meaningful insights in any Data Science project.
Project Link : https://github.com/joelsilas1816/Olympics-Analysis-Assignment-for-Skill-India-AI-ML-internship/blob/main/Summer.ipynb
3. Student Score Prediction based on number of hours of study (Linear Regression) – Prediction of a student's marks by determining the relation between the number of hours of study and the marks obtained, based on an analysis of students' performance in previous exams.
Project Link : https://github.com/joelsilas1816/Data-Science-Projects/blob/main/Prediction%20of%20students%20score/Project_1_Prediction_of_students_score_based_on_number_of_hours_of_study.ipynb
4. Prediction of Parkinson's Disease (XGBoost Classification) – Based on the characteristics and symptoms present in the medical record, the model predicts whether or not the person has the disease.
Project Link : https://github.com/joelsilas1816/Data-Science-Projects/blob/main/Detection%20of%20Parkinsons%20disease/Project_2_Detection_of_Parkinsons_Disease.ipynb
5.
Fake News Detection (PassiveAggressive Classifier) – News items are classified into REAL and FAKE categories based on the type of words used in the news.
Project Link : https://github.com/joelsilas1816/Data-Science-Projects/blob/main/Fake%20News%20Detection/Project_3_Fake_News_Detection.ipynb
6. Best Ad Prediction (Reinforcement Learning) – The algorithm helps determine which of many ads best attracts customers and is therefore most beneficial for the organization.
Project Link : https://github.com/joelsilas1816/Data-Science-Projects/blob/main/Best%20ad%20prediction/Project_1_Best_ad_prediction_Joel_Christopher_Silas%20(1).ipynb
7. Chatbot (NLP) – The model is trained on certain stories, incidents and facts, and when questions pertaining to them are asked, the chatbot is able to respond.
Project Link : https://github.com/joelsilas1816/Data-Science-Projects/blob/main/Chatbot/Project_2_Chatbot_Joel_Christopher_Silas%20(1).ipynb
8. Bank Customer Churn Factor Prediction – Various factors cause customers to discontinue the service they procure from particular agencies or companies. Factors such as income, age, and banking services like a credit card facility are evaluated to identify the cause of bank customer churn.
Project Link : https://github.com/joelsilas1816/Data-Science-Projects/blob/main/Bank%20Customer%20Churn%20Prediction/Joel_Silas_JSCOE__Bank_Customer_Churn_Problem%20(1).ipynb
9. Sentiment or Feedback Analysis (NLP) – Twitter, the social media platform, provides a facility to express views in the form of tweets.
These tweets are analyzed to understand whether the person intends a positive or a negative remark, which helps the company work on customer expectations and feedback.
Project Link : https://github.com/joelsilas1816/Data-Science-Projects/blob/main/Sentiment%20Analysis/Sentiment_Analysis.ipynb

Machine learning systems are designed to generate maximum business value from the ML models used in services and products. If you believe the media hype around AI, you could think that data scientists only focus on achieving state-of-the-art (SOTA) performance and designing ingenious model architectures. The reality is a bit different, and data scientists have many more objectives to accomplish. When building AI systems, it is good practice to take a divide-and-conquer approach to the goal. This means breaking the problem statement down into solvable components and studying how machine learning could help with each of them. A good understanding of the limitations helps in building better products.

The workflow involves understanding the data and performing various analyses to clean it, along with visualization to identify inherent patterns. Then, depending on the prediction required and the class of data, suitable models and algorithms are applied to train and test on the data and generate meaningful results. The process is repeated iteratively, up to a point, to improve accuracy. Many algorithms and data structures have already been implemented and simply need to be applied to various types of data for problem-solving.

1. Easily identifies trends and patterns
2. No human intervention needed (automation)
3. Continuous improvement
4. Handles multi-dimensional and multi-variate data
5. Wide applications in domains like Education, Healthcare, Armed Forces, Finance, etc.
6. Predictive analysis makes it possible to prepare for future consequences.

1.
Data Acquisition : Machine Learning requires massive data sets to train on, and these should be inclusive/unbiased and of good quality. There can also be times when one must wait for new data to be generated.
2. Time and Resources : ML needs enough time to let the algorithms learn and develop enough to fulfill their purpose with a considerable amount of accuracy and relevancy. It also needs massive resources to function, which can mean additional requirements for computing power.
3. Certain analysis falls short of human understanding : This includes sarcasm in tweets and the interpretation of an exclamation mark, which may express anger, excitement or praise.

Executing these projects gave me a deep understanding of Analytics and its mechanisms, and of how it is implemented in the Python programming language. This was an extremely valuable learning experience in which I could work on many new concepts with hands-on practice and gain insightful knowledge. The faculty, mentors, supervisors and support staff were very cooperative and knowledgeable, and addressed each and every concern the students had. The sessions were very interactive even though they were conducted virtually. I am glad to have exercised my skills in this domain.
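
The linear regression idea behind the student score prediction project listed earlier can be illustrated with a short Python sketch. This is only a minimal illustration under stated assumptions, not the code from the project notebook: the function names fit_line and predict are hypothetical, and the hours and scores below are made-up values chosen to lie exactly on a straight line.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the common 1/n factor cancels).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Predict y for a new x using the fitted line."""
    return slope * x + intercept


# Made-up illustrative data: hours of study and marks obtained.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [30.0, 40.0, 50.0, 60.0, 70.0]

slope, intercept = fit_line(hours, scores)
print("slope:", slope, "intercept:", intercept)
print("predicted score for 6 hours:", predict(slope, intercept, 6.0))
```

Because the made-up points lie exactly on a line, the fit recovers a slope of 10 and an intercept of 20. Real exam data would be noisy, and a library such as scikit-learn would normally be used instead of a hand-written fit.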