Experiment 2
Write the code for this: split the dataset into training and
testing sets in different ratios.
Below is sample Python code for loading the Iris, COVID-19 symptoms, or SMS Spam classification
dataset and preprocessing it, including splitting the data into training and testing sets with customizable ratios.
Choose the section that matches your assigned dataset.
Option 1: Iris Dataset
# Iris Dataset: Download, preprocess, split
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd
# Load the dataset
iris = load_iris(as_frame=True)
X = iris.data
y = iris.target
# Custom train-test split ratio
train_ratio = 0.8 # set to e.g., 0.7 for 70% train, 30% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=train_ratio, stratify=y, random_state=42
)
# Preprocessing: scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Convert back to DataFrame for easier viewing (keep the row index so rows stay aligned with y)
X_train_df = pd.DataFrame(X_train_scaled, columns=X.columns, index=X_train.index)
X_test_df = pd.DataFrame(X_test_scaled, columns=X.columns, index=X_test.index)
print("Train shape:", X_train_df.shape, y_train.shape)
print("Test shape:", X_test_df.shape, y_test.shape)
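Because the experiment asks for several ratios, the Iris split can also be run in a loop. The following is a minimal sketch, not a definitive implementation: the ratios 0.6, 0.7, and 0.8 and the `split_summary` dictionary are illustrative choices that do not appear in the script above.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris(as_frame=True)
X, y = iris.data, iris.target

# Hypothetical helper dict: train ratio -> split sizes and per-class counts
split_summary = {}
for train_ratio in (0.6, 0.7, 0.8):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=train_ratio, stratify=y, random_state=42
    )
    split_summary[train_ratio] = {
        "n_train": len(X_train),
        "n_test": len(X_test),
        # With stratify=y, each class keeps its share of the training set
        "train_class_counts": dict(Counter(y_train.tolist())),
    }
    print(train_ratio, split_summary[train_ratio])
```

Iris has 50 samples per class, so a stratified split should keep the three classes equally represented in the training set at every ratio.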
Option 2: COVID-19 Symptoms Classification (Tabular Example)
# COVID-19 Symptoms Dataset: Preprocessing and split
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
# Load your dataset (replace the filename!)
df = pd.read_csv('covid_data.csv')
X = df.drop(columns=['target']) # Set your target column
y = df['target']
# Custom split ratio
train_ratio = 0.7 # Change as desired
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=train_ratio, stratify=y, random_state=42
)
# Identify feature types
num_cols = X_train.select_dtypes(include=['number']).columns
cat_cols = X_train.select_dtypes(include=['object', 'category', 'bool']).columns
# Impute and scale numerics; encode categoricals
num_imputer = SimpleImputer(strategy='mean')
cat_imputer = SimpleImputer(strategy='most_frequent')
scaler = StandardScaler()
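# scikit-learn >= 1.2 names this argument sparse_output; older versions use sparse=False
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
# Fit imputers, scaler, and encoder on the training set only, then reuse them on the test set
X_train_num = scaler.fit_transform(num_imputer.fit_transform(X_train[num_cols]))
X_test_num = scaler.transform(num_imputer.transform(X_test[num_cols]))
X_train_cat = ohe.fit_transform(cat_imputer.fit_transform(X_train[cat_cols]))
X_test_cat = ohe.transform(cat_imputer.transform(X_test[cat_cols]))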
import numpy as np
X_train_prep = np.hstack([X_train_num, X_train_cat])
X_test_prep = np.hstack([X_test_num, X_test_cat])
print("Train shape:", X_train_prep.shape, y_train.shape)
print("Test shape:", X_test_prep.shape, y_test.shape)
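The same impute, scale, and encode steps can also be bundled with scikit-learn's `Pipeline` and `ColumnTransformer`, so that one object is fitted on the training set and reused on the test set. This is a sketch under stated assumptions: the synthetic table and its column names (`age`, `temperature`, `cough`, `target`) are hypothetical stand-ins for a real COVID-19 symptoms file, and the 0.75 ratio is only an example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in data with a few missing values
n = 40
age = [20.0 + i for i in range(n)]
age[3] = np.nan
cough = ["yes" if i % 2 == 0 else "no" for i in range(n)]
cough[5] = np.nan
df = pd.DataFrame({
    "age": age,
    "temperature": [36.5 + (i % 5) * 0.5 for i in range(n)],
    "cough": cough,
    "target": [1 if i % 4 < 2 else 0 for i in range(n)],
})

X = df.drop(columns=["target"])
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, stratify=y, random_state=42
)

num_cols = ["age", "temperature"]
cat_cols = ["cough"]

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

# sparse_threshold=0 asks ColumnTransformer to always return a dense array
preprocess = ColumnTransformer(
    [("num", numeric, num_cols), ("cat", categorical, cat_cols)],
    sparse_threshold=0,
)

X_train_prep = preprocess.fit_transform(X_train)  # statistics learned from train only
X_test_prep = preprocess.transform(X_test)        # reused unchanged on test

print("Train shape:", X_train_prep.shape, y_train.shape)
print("Test shape:", X_test_prep.shape, y_test.shape)
```

Keeping the steps in one fitted object makes it harder to fit an imputer or scaler on test data by accident.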
Option 3: SMS Spam Classification
# SMS Spam Dataset: Preprocessing and split
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
# Load the dataset (replace the filename!)
df = pd.read_csv('sms_spam.csv') # Columns: 'label', 'message'
df = df[['label', 'message']].dropna().drop_duplicates()
# Custom split ratio
train_ratio = 0.75 # Change as desired
X_train, X_test, y_train, y_test = train_test_split(
    df['message'], df['label'], train_size=train_ratio, stratify=df['label'], random_state=42
)
# Preprocessing: Clean and vectorize text
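import re
def clean_text(s):
    s = s.lower()
    # Use a lowercase placeholder: an uppercase 'URL' would be stripped by the next filter
    s = re.sub(r'http\S+|www\S+', ' url ', s)
    s = re.sub(r'[^a-z0-9\s]', ' ', s)   # replace punctuation and symbols with spaces
    s = re.sub(r'\s+', ' ', s).strip()   # collapse repeated whitespace
    return s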
X_train_clean = X_train.apply(clean_text)
X_test_clean = X_test.apply(clean_text)
tfidf = TfidfVectorizer(ngram_range=(1,2), min_df=2, max_df=0.95)
X_train_vec = tfidf.fit_transform(X_train_clean)
X_test_vec = tfidf.transform(X_test_clean)
print("Train shape:", X_train_vec.shape, y_train.shape)
print("Test shape:", X_test_vec.shape, y_test.shape)
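To see what the text-cleaning step does before vectorizing, it can be run on a couple of messages. This is a small sketch using only the standard library; the two sample messages are invented for illustration. The URL placeholder is written in lowercase here, since an uppercase placeholder would be removed by the later `[^a-z0-9\s]` filter.

```python
import re


def clean_text(s):
    s = s.lower()
    # Replace links with a lowercase token so the next filter keeps it
    s = re.sub(r"http\S+|www\S+", " url ", s)
    # Replace everything except lowercase letters, digits, and whitespace
    s = re.sub(r"[^a-z0-9\s]", " ", s)
    # Collapse runs of whitespace and trim the ends
    s = re.sub(r"\s+", " ", s).strip()
    return s


# Invented sample messages (not taken from any dataset)
samples = [
    "WIN a FREE prize!!! Visit http://example.com/win NOW",
    "Call 08001234567 now, it's FREE",
]
cleaned = [clean_text(m) for m in samples]
for raw, out in zip(samples, cleaned):
    print(repr(raw), "->", repr(out))
```

Note that apostrophes are replaced with spaces, so "it's" becomes the two tokens "it" and "s".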
Modify train_ratio in any script to get different train–test splits (e.g., 0.7, 0.8, 0.6). Save the preprocessed
data to files or proceed to modeling as needed.
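One possible way to save a preprocessed split is to write each part to CSV with pandas. The sketch below assumes the Iris data; the temporary output directory and the file names `iris_train.csv` and `iris_test.csv` are hypothetical and should be replaced with your own paths.

```python
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = load_iris(as_frame=True)
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, stratify=y, random_state=42
)

# Fit the scaler on the training set only, then apply it to both parts
scaler = StandardScaler()
X_train_df = pd.DataFrame(
    scaler.fit_transform(X_train), columns=X.columns, index=X_train.index
)
X_test_df = pd.DataFrame(
    scaler.transform(X_test), columns=X.columns, index=X_test.index
)

# Hypothetical output location; a temporary directory keeps the sketch self-contained
out_dir = Path(tempfile.mkdtemp())
train_path = out_dir / "iris_train.csv"
test_path = out_dir / "iris_test.csv"

# Store features and label together; rows are aligned on the shared index
X_train_df.assign(target=y_train).to_csv(train_path, index=False)
X_test_df.assign(target=y_test).to_csv(test_path, index=False)

# Reload to confirm the files hold the expected rows and columns
train_back = pd.read_csv(train_path)
test_back = pd.read_csv(test_path)
print(train_back.shape, test_back.shape)
```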