Machine Learning Fundamentals

Machine Learning (ML) is a subset of Artificial Intelligence that enables computers to learn patterns from data and make predictions or decisions without being explicitly programmed. In traditional programming, rules are hardcoded by humans; in ML, the rules are learned from data.

Formula: Data + Algorithm → Model → Predictions

A Brief History of Machine Learning
• 1950s: Alan Turing proposes the Turing Test to measure machine intelligence.
• 1957: Frank Rosenblatt develops the Perceptron, one of the earliest neural network models.
• 1980s–1990s: Neural networks and Support Vector Machines gain popularity.
• 2000s–Present: Advances in data storage, cloud computing, and GPUs lead to the deep learning revolution.

The Machine Learning Cycle
ML development is a continuous process with the following stages:
1. Define the problem clearly.
2. Collect relevant data.
3. Preprocess and clean the data.
4. Select the right model.
5. Train the model with data.
6. Evaluate performance using metrics.
7. Deploy the model into production.
8. Monitor and maintain the model to ensure accuracy over time.

Step 1: Problem Definition
Clearly identify what you want to solve. The problem definition should specify:
• The objective (e.g., predict house prices)
• The input features (e.g., location, size, number of rooms)
• The expected output (e.g., price)
A well-defined problem prevents wasted effort and makes it clear how the model's accuracy should be judged.

Step 2: Data Collection
Data is the foundation of any ML model.
Sources of Data:
• Public datasets (Kaggle, UCI ML Repository)
• APIs (Twitter API, Weather API)
• Web scraping
• IoT sensors and devices
Data Types: Structured (tables), unstructured (images, text, audio)

Step 3: Data Preprocessing
Raw data often contains missing values, duplicates, or irrelevant features. Preprocessing improves data quality.
Common Steps:
• Remove or fill missing values.
• Normalize numerical values.
• Encode categorical data.
• Split into training, validation, and test sets.
Good preprocessing helps the model learn accurate patterns.

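
The common preprocessing steps from Step 3 can be sketched in plain Python. This is only a minimal illustration: the toy records, the column meanings (size, city, price), and the helper names preprocess and split are assumptions made for the example, not part of the original notes. In practice, libraries such as pandas and scikit-learn are normally used for these steps.

```python
import random

# Hypothetical toy records: (size_m2, city, price). None marks a missing value.
rows = [
    (50.0, "Manila", 100.0),
    (80.0, "Cebu", 150.0),
    (None, "Manila", 120.0),
    (120.0, "Davao", 210.0),
    (80.0, "Cebu", 150.0),  # exact duplicate of the second record
    (65.0, "Davao", 130.0),
]


def preprocess(rows):
    # Remove exact duplicates while keeping the original order.
    unique = list(dict.fromkeys(rows))
    # Fill missing sizes with the mean of the observed sizes.
    known = [r[0] for r in unique if r[0] is not None]
    mean_size = sum(known) / len(known)
    sizes = [mean_size if r[0] is None else r[0] for r in unique]
    # Min-max normalize the numerical column into [0, 1].
    lo, hi = min(sizes), max(sizes)
    scaled = [(s - lo) / (hi - lo) for s in sizes]
    # One-hot encode the categorical column.
    cities = sorted({r[1] for r in unique})
    encoded = [[1.0 if r[1] == c else 0.0 for c in cities] for r in unique]
    features = [[s] + e for s, e in zip(scaled, encoded)]
    targets = [r[2] for r in unique]
    return features, targets, cities


def split(features, targets, val_frac=0.2, test_frac=0.2, seed=0):
    # Shuffle the indices, then cut them into test, validation, and training parts.
    idx = list(range(len(features)))
    random.Random(seed).shuffle(idx)
    n_test = max(1, int(len(idx) * test_frac))
    n_val = max(1, int(len(idx) * val_frac))

    def pick(ids):
        return [features[i] for i in ids], [targets[i] for i in ids]

    return (pick(idx[n_test + n_val:]),
            pick(idx[n_test:n_test + n_val]),
            pick(idx[:n_test]))


features, targets, cities = preprocess(rows)
train, val, test = split(features, targets)
```

The sketch follows the order of the list above for simplicity. In real projects the fill and scaling statistics are usually computed from the training split only, so that no information from the validation or test sets leaks into training.
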
Step 4: Model Selection
Choosing the right algorithm depends on the type of problem:
• Supervised Learning: Predict outcomes using labeled data (e.g., Linear Regression, Decision Trees, k-NN).
• Unsupervised Learning: Find patterns in unlabeled data (e.g., k-Means, PCA).
Tools such as Weka, Google Colab, and even Excel can be used to test different models.

Step 5: Model Training
Training is the process in which the algorithm learns from the training data.
• The model adjusts its parameters to minimize errors.
• Hyperparameters (e.g., learning rate, number of neighbors in k-NN) are tuned to improve performance.
A good training process avoids both underfitting (the model is too simple to capture the pattern) and overfitting (the model fits the training data so closely that it generalizes poorly).

Step 6: Model Evaluation
We test the model on unseen data to check its performance.
Common Metrics:
• Accuracy: Percentage of correct predictions.
• Precision & Recall: More informative than accuracy on imbalanced datasets.
• RMSE (Root Mean Squared Error): For regression problems.
Visualization tools like a confusion matrix help interpret results.

Step 7: Deployment
Once the model performs well, it is deployed so it can make predictions on new data, often in real time.
Deployment Platforms:
• Web apps (Flask, FastAPI)
• Mobile apps
• Cloud services (AWS, Azure, GCP)

Step 8: Monitoring & Maintenance
Models degrade over time as data patterns change (model drift).
Best Practices:
• Monitor predictions for accuracy.
• Retrain with new data.
• Keep improving features and algorithms.

Understanding Data in ML
Types of data used in ML:
• Numerical: Age, temperature, price
• Categorical: Gender, city, brand
• Text: Reviews, articles
• Image/Video: Photos, CCTV footage
• Audio: Voice commands, music
The data type determines preprocessing steps and model choice.

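
To make the Step 6 metrics concrete, the following sketch computes accuracy, precision, recall, and RMSE by hand. The label lists and the function names are made up for illustration; scikit-learn provides equivalent, better-tested functions.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    # Count true/false positives and negatives for a binary problem.
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    # Of everything predicted positive, how much was really positive?
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    # Of everything really positive, how much did the model find?
    return tp / (tp + fn) if (tp + fn) else 0.0


def rmse(y_true, y_pred):
    # Root of the mean squared difference, used for regression outputs.
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical classification labels (1 = positive class).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
acc = accuracy(tp, tn, fp, fn)
prec = precision(tp, fp)
rec = recall(tp, fn)

# Hypothetical regression targets and predictions.
error = rmse([3.0, 5.0, 2.0], [2.0, 5.0, 4.0])
```

On this toy data the counts are 3 true positives, 3 true negatives, 1 false positive, and 1 false negative, so accuracy, precision, and recall all come out to 0.75.
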
Python Libraries for Machine Learning
• Data Handling: pandas, numpy – Load and manipulate datasets
• Visualization: matplotlib, seaborn, plotly – Create charts and graphs
• ML Algorithms: scikit-learn – Train and evaluate models
• Deep Learning: tensorflow, keras, pytorch – Build neural networks
• Data Collection: requests, BeautifulSoup – Get data from the web

Summary
• Machine Learning allows computers to learn from data without explicit programming.
• The process includes defining the problem, collecting and preparing data, choosing and training a model, evaluating performance, deploying, and maintaining it.
• Good quality data and the right tools are key to success.

The k-Nearest Neighbors (kNN) algorithm is one of the simplest supervised machine learning methods.
● A new object is assigned to the class that most of its k nearest neighbors belong to.
● It is used for both classification and regression tasks.

Common distance metrics:
• Euclidean distance
• Manhattan distance
• Minkowski distance
• Hamming distance

Choosing the value of k:
If k is too small, the prediction is sensitive to noise points.
If k is too large, the neighborhood may include points from other classes.
Rule of thumb: k = sqrt(N), where N is the number of training points.

EUCLIDEAN DISTANCE
Use when:
● Data is continuous (real numbers).
● Features are on the same scale (e.g., all measured in cm, seconds, etc.).
● You expect data to follow a circular or spherical geometry (points close in a straight-line sense are more similar).
Examples:
● Image recognition (pixel intensity values).
● Sensor readings (temperature, humidity, etc.).
● Geometric problems where “straight-line distance” makes sense.

MANHATTAN DISTANCE
Use when:
● Data is continuous, but you expect differences along individual dimensions to matter more than the overall straight-line distance.
● Geometry is more like a grid (like moving in a city with streets).
● Data is high-dimensional.
Examples:
● Predicting taxi fares in a city grid (distance = blocks traveled).
● Text mining with word counts (Bag-of-Words models).
● High-dimensional features where sparse differences matter.

HAMMING DISTANCE
Use when:
● Data is categorical (not numerical).
● Features are strings, binary vectors, or discrete symbols.
● You only care about exact matches vs. mismatches, not the magnitude of the difference.
Examples:
● Comparing DNA sequences (A, T, C, G).
● Error detection in communication (bit strings).
● Categorical survey data (Yes/No, Male/Female).

WHEN TO USE KNN
● Datasets are relatively small to medium in size
● Decision boundaries are irregular
● Interpretability is important
● No strong assumptions about the data distribution can be made
● Applications where similarity matters
○ Medical diagnosis: A new patient is compared with past patients.
○ Recommendation: Suggest items similar to a user’s past choices.
○ Anomaly detection: Flag unusual cases by checking whether they are far from normal clusters.

WHEN NOT TO USE KNN
● With very large datasets (each prediction compares the new point against the stored training points)
● With high-dimensional data (distances become less informative as dimensions grow)
● Without feature scaling (features with large ranges dominate the distance)

WHAT HAPPENS WHEN THERE’S NO MAJORITY?
Common tie-breaking approaches in kNN:
● Choose the class of the closest neighbor
● Distance-weighted voting
● Reduce k (prefer odd values of k for two-class problems)
● Random selection among tied classes
● Domain-specific rules

IMPROVING KNN
● Experiment with different values of k.
● Use weighted kNN (closer neighbors have more influence).
● Apply feature engineering (categorizing BMI, grouping ages).
● Reduce dimensionality (PCA) for large feature spaces.

APPLICATIONS
● Healthcare: Predicting diseases (diabetes, heart disease)
● Finance: Detecting fraudulent transactions
● E-commerce: Recommending similar products
● Security: Detecting abnormal network behavior
● Image Recognition: Identifying handwritten digits

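
The kNN ideas above (distance metrics, majority voting, and tie-breaking) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the toy points, the labels "A" and "B", and the function name knn_predict are invented for the example, and ties are broken with the "class of the closest neighbor" approach from the list above.

```python
import math
from collections import Counter


def euclidean(a, b):
    # Straight-line distance between two numeric vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    # Sum of absolute differences along each dimension (grid distance).
    return sum(abs(x - y) for x, y in zip(a, b))


def hamming(a, b):
    # Number of positions at which two equal-length sequences differ.
    return sum(1 for x, y in zip(a, b) if x != y)


def knn_predict(train_X, train_y, query, k=3, distance=euclidean):
    # Sort the training points by distance to the query and keep the k closest.
    ranked = sorted(zip(train_X, train_y),
                    key=lambda pair: distance(pair[0], query))
    neighbors = ranked[:k]
    # Majority vote among the k neighbors.
    votes = Counter(label for _, label in neighbors)
    top = max(votes.values())
    tied = {label for label, count in votes.items() if count == top}
    # Tie-break: return the tied class whose member is closest to the query.
    for _, label in neighbors:
        if label in tied:
            return label


# Hypothetical two-class training set.
train_X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0),
           (6.0, 6.0), (7.0, 7.0), (6.5, 8.0)]
train_y = ["A", "A", "A", "B", "B", "B"]

near_a = knn_predict(train_X, train_y, (2.0, 2.0), k=3)
near_b = knn_predict(train_X, train_y, (6.0, 7.0), k=3)
# With k=2 the vote for this query is split 1-1, so the closest neighbor decides.
tie_case = knn_predict(train_X, train_y, (4.0, 4.0), k=2)
```

Passing distance=manhattan or distance=hamming swaps the metric without changing the voting logic, which is a convenient way to compare metrics on the same data. As noted above, features should be scaled before distances are computed.
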
FORMULAS FOR THE F1-SCORE

\[ \mathrm{Acc} = \frac{\text{True Positive} + \text{True Negative}}{\text{True Positive} + \text{True Negative} + \text{False Positive} + \text{False Negative}} \]

\[ \mathrm{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} \]

\[ \mathrm{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} \]

\[ \mathrm{F1\ Score} = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \]

Confusion matrix layout (columns: predicted class; rows: actual class):
Predicted
TN | FP
FN | TP

APRIORI ALGORITHM

ASSOCIATION RULE MINING
● Association rule mining is a technique used to identify patterns in large data sets.
● It involves finding relationships between variables in the data and using those relationships to make predictions or decisions.
● The goal of association rule mining is to uncover rules that describe the relationships between different items in the data set.

APRIORI ALGORITHM
● Apriori is an algorithm designed to extract frequent itemsets from transactional databases and generate association rules.
● It is based on the principle that if an itemset is frequent, all its subsets must also be frequent. Equivalently, if an itemset is infrequent, every larger itemset that contains it can be skipped.

SUPPORT: The proportion of transactions in the dataset that contain an item or itemset.
CONFIDENCE: The likelihood that item B is purchased when item A is purchased. Confidence(A → B) = Support(A and B together) / Support(A)
LIFT: The strength of a rule, measuring how much more likely item B is bought when item A is bought compared to when the two are bought independently. Lift(A → B) = Confidence(A → B) / Support(B)
Lift > 1 → Positive association
Lift = 1 → No association
Lift < 1 → Negative association

MINIMUM SUPPORT: the minimum proportion of transactions in which an itemset must appear to count as frequent.
Purpose: to eliminate infrequent itemsets early in the process
MINIMUM CONFIDENCE: the minimum proportion of transactions containing itemset A that must also contain itemset B for the rule A → B to be kept.
Purpose: to filter out association rules that are not strong or reliable

APPLICATIONS
• Market basket analysis: Retailers use Apriori to analyze purchase patterns, helping them arrange products to encourage combined purchases.
• Recommendation systems: Online platforms use Apriori to suggest products based on previous purchases.
• Anomaly detection: Rules mined with Apriori describe expected patterns, so transactions that deviate from them can be flagged as unusual.
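
As a rough illustration of support, confidence, lift, and the Apriori pruning principle, the sketch below works on five made-up shopping baskets. The item names, the min_support value of 0.6, and the function names are assumptions for the example. It is a simplified level-wise search, not an optimized Apriori implementation.

```python
from itertools import combinations

# Hypothetical transactions (one set of items per basket).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def support(itemset, transactions):
    # Fraction of transactions that contain every item in the itemset.
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(a, b, transactions):
    # Support of A and B together, divided by the support of A.
    return support(set(a) | set(b), transactions) / support(a, transactions)


def lift(a, b, transactions):
    # Confidence of A -> B relative to how common B is on its own.
    return confidence(a, b, transactions) / support(b, transactions)


def frequent_itemsets(transactions, min_support):
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([i]) for i in items
               if support([i], transactions) >= min_support]
    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset, transactions)
        k += 1
        previous = set(current)
        # Join step: combine frequent (k-1)-itemsets into size-k candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k-1)-subsets are
        # frequent, then check its own support against the threshold.
        current = [c for c in candidates
                   if all(frozenset(s) in previous
                          for s in combinations(c, k - 1))
                   and support(c, transactions) >= min_support]
    return frequent


frequent = frequent_itemsets(transactions, min_support=0.6)
conf_rule = confidence({"diapers"}, {"beer"}, transactions)
lift_rule = lift({"diapers"}, {"beer"}, transactions)
```

For the rule diapers → beer in this toy data, the confidence is about 0.75 and the lift about 1.25, so by the rule above (Lift > 1) the two items are positively associated. Eggs and cola never pass the minimum support, so no larger itemset containing them is ever counted.
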
