Group member 1: Tony Braxton Tchio Ngoumeza
Group member 2: Kalpan Patel
Data Mining Report
Table of Contents
1. Introduction
2. Classification Approach
2.1 Classification Process with Decision Tree
3. Regression Approach
3.1 Key Environmental Metrics
3.2 Regression Process using Decision Tree Regressor
4. Conclusion
1. Introduction
Data mining plays a crucial role in the analysis of smart farming datasets by providing
valuable insights into crop health, soil quality, and environmental conditions. By applying
data mining techniques, we aim to optimize farming practices and help farmers make
informed decisions. Through the use of various environmental and soil-based indices,
farmers can better understand crop performance, environmental stress, and resource
management. These indices are critical for crop prediction, environmental stress
analysis, and precision agriculture. In this project, we employed both classification and
regression approaches to extract relevant data from our dataset and develop predictive
models that assist in optimizing agricultural practices.
2. Classification Approach
In the classification analysis, we trained a model that recommends crops to farmers. The
task is to predict the crop type from various environmental and soil attributes. The
target variable, label, represents 22 different crop options, with each crop
corresponding to a specific set of growing conditions. The features used for prediction
include the nutrients in the soil, namely nitrogen (N), phosphorus (P), and potassium (K),
along with environmental factors such as humidity, temperature, and CO2 concentration.
Fig 1: Heat map obtained during initial exploratory data analysis
A heatmap of the feature correlations showed that phosphorus (P) and potassium (K) have
the highest correlation with the target label, suggesting that these two nutrients are
the strongest indicators of which crop is most suitable for a given set of conditions.
Nitrogen, humidity, temperature, and co2_concentration show a lower correlation with the
crop label, but they were kept as features because they still contribute to the accuracy
of the model's predictions. Using all of these features, the model reached 94% accuracy
for crop recommendation.
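The correlation behind such a heatmap can be sketched in plain Python. This is only an illustration: it assumes the crop label was converted to integer codes before correlating (the report does not say how the label was encoded), and the helper names and toy values below are hypothetical rather than taken from the project's dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no linear relationship to report
    return cov / math.sqrt(var_x * var_y)

def encode_labels(labels):
    """Assumption: map each distinct label to an integer code in sorted order."""
    codes = {name: i for i, name in enumerate(sorted(set(labels)))}
    return [codes[name] for name in labels]

# Hypothetical toy rows, not values from the project's dataset.
phosphorus = [20, 25, 60, 65, 130, 140]
crops = ["apple", "apple", "maize", "maize", "rice", "rice"]

r = pearson(phosphorus, encode_labels(crops))
print(f"correlation between P and encoded label: {r:.3f}")
```

Because integer codes impose an arbitrary order on the crops, a correlation computed this way is best read as a rough screening signal rather than a measure of feature importance.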
2.1 Classification Process with Decision Tree
In the classification part of the crop recommendation system, the decision tree partitions
the dataset into subsets based on feature values, with the goal of minimizing impurity
(entropy) within each subset. For instance, the tree may begin by splitting the data on
phosphorus (P) or potassium (K), as these nutrients show the strongest correlation with
the crop types. At each node, the algorithm evaluates candidate thresholds for every
feature and selects the feature and threshold that best separate the classes, which in
this case are the different crop types. The splitting continues recursively until a
stopping condition is met: either a maximum tree depth is reached or no further
meaningful split can be made. Once the tree has been constructed, it makes a prediction
by following the splits from the root node down to a leaf node. Because this is a
multiclass problem, each leaf is assigned the most frequent class label among its
training samples, and that label is returned as the predicted crop type for the new
instance.
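The process described above can be sketched as a small entropy-based tree in plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the project: the function names, the max_depth default, the midpoint thresholds, and the toy [P, K] rows are all hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def best_split(rows, labels):
    """Return the (feature_index, threshold) with the lowest weighted entropy,
    or None when no candidate split reduces the entropy of the node."""
    best, best_score = None, entropy(labels)
    for f in range(len(rows[0])):
        values = sorted({row[f] for row in rows})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    """Grow the tree recursively; a leaf stores the majority class of its samples."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    left = [i for i, row in enumerate(rows) if row[f] <= t]
    right = [i for i, row in enumerate(rows) if row[f] > t]
    return (
        f,
        t,
        build_tree([rows[i] for i in left], [labels[i] for i in left], depth + 1, max_depth),
        build_tree([rows[i] for i in right], [labels[i] for i in right], depth + 1, max_depth),
    )

def predict(tree, row):
    """Follow the splits from the root until a leaf (a bare label) is reached."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

# Hypothetical toy samples of [P, K]; not values from the project's dataset.
rows = [[15, 20], [20, 25], [60, 40], [65, 45], [130, 200], [140, 205]]
labels = ["lentil", "lentil", "maize", "maize", "grapes", "grapes"]

tree = build_tree(rows, labels)
print(predict(tree, [62, 41]))
```

Both stopping conditions from the text appear here: build_tree stops at max_depth, and best_split returns None when no threshold lowers the entropy.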
3. Regression Approach
In the regression analysis, we focused on evaluating the impact of various
environmental conditions on crop performance. Several key metrics were used to model
these relationships, using regression techniques to predict or assess factors that
influence crop growth, soil health, and overall productivity.
3.1 Key Environmental Metrics
The following metrics were used to develop the regression models:
Metric | Category | Data Mining Approach
Temperature-Humidity Index (THI) | Predicting crop stress | Regression
Nutrient Balance Ratio (NBR) | Computing the soil's nutrient balance ratio from nitrogen, phosphorus, and potassium levels | Regression
Water Availability Index (WAI) | Forecasting water availability | Regression
Photosynthesis Potential (PP) | Predicting photosynthetic efficiency | Regression
Soil Fertility Index (SFI) | Assessing soil fertility from organic matter content and nitrogen, phosphorus, and potassium levels | Regression
Table 2: Environmental Conditions – Crop Performance Evaluation.
Temperature-Humidity Index (THI) helps predict crop stress due to heat and moisture
conditions. It is essential for managing temperature-related stress using techniques like
irrigation and shading.
Nutrient Balance Ratio (NBR) assesses the balance of nitrogen, phosphorus, and
potassium in the soil. This index helps identify nutrient deficiencies or excesses,
allowing for better fertilizer management.
Water Availability Index (WAI) is used to forecast water resources based on soil
moisture and rainfall data. It guides decisions regarding irrigation and water
conservation.
Photosynthesis Potential (PP) estimates the efficiency of photosynthesis based on light
exposure, CO2 concentration, and temperature, helping optimize crop growth
conditions.
Soil Fertility Index (SFI) assesses soil quality based on organic matter content and
essential nutrients, providing critical insights for soil management and crop yield
optimization.
3.2 Regression Process using Decision Tree Regressor
The decision tree regressor partitions the dataset into subsets based on feature values,
aiming to minimize the variance of the target within each subset. The regression process
works as follows:
Data Splitting: The decision tree splits the data into subsets based on feature values.
For example, in the case of THI, it divides the data using temperature and humidity
values. At each node, the tree selects the feature and threshold value that minimize the
variance of the target within the resulting subsets.
Recursive Partitioning: The tree recursively divides the data into smaller groups, forming
branches based on the best feature splits. This process continues until a stopping
condition is met, such as a maximum tree depth or a minimum number of samples per
leaf node.
Prediction: After the tree is built, it makes a prediction for a new data point by
following the splits down the tree until it reaches a leaf node. The prediction is the
average target value of the training data points in that leaf node.
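The three steps above can be sketched as a small variance-based regression tree in plain Python. This is a minimal illustration under stated assumptions, not the project's implementation: the function names, the max_depth and min_leaf defaults, and the toy [temperature, humidity] rows with their target values are hypothetical.

```python
def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def best_variance_split(rows, targets, min_leaf):
    """Return the (feature_index, threshold) with the lowest weighted variance,
    or None when no allowed split improves on the unsplit node."""
    best, best_score = None, variance(targets)
    for f in range(len(rows[0])):
        values = sorted({row[f] for row in rows})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for row, y in zip(rows, targets) if row[f] <= t]
            right = [y for row, y in zip(rows, targets) if row[f] > t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue  # enforce the minimum number of samples per leaf
            score = (len(left) * variance(left) + len(right) * variance(right)) / len(targets)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def build_regression_tree(rows, targets, depth=0, max_depth=2, min_leaf=2):
    """Grow the tree recursively; a leaf stores the mean target of its samples."""
    split = best_variance_split(rows, targets, min_leaf) if depth < max_depth else None
    if split is None:
        return sum(targets) / len(targets)
    f, t = split
    left = [i for i, row in enumerate(rows) if row[f] <= t]
    right = [i for i, row in enumerate(rows) if row[f] > t]
    return (
        f,
        t,
        build_regression_tree([rows[i] for i in left], [targets[i] for i in left],
                              depth + 1, max_depth, min_leaf),
        build_regression_tree([rows[i] for i in right], [targets[i] for i in right],
                              depth + 1, max_depth, min_leaf),
    )

def predict_value(tree, row):
    """Follow the splits down to a leaf and return its stored mean."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

# Hypothetical toy samples of [temperature, humidity] with made-up target values.
rows = [[20, 50], [22, 55], [30, 60], [32, 65], [38, 80], [40, 85]]
targets = [1.0, 1.2, 3.0, 3.2, 6.0, 6.4]

reg_tree = build_regression_tree(rows, targets)
print(predict_value(reg_tree, [31, 62]))
```

The only differences from a classification tree are the split criterion (variance instead of entropy) and the leaf value (mean instead of majority class).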
Fig 2: Derived attributes correlation heat map
4. Conclusion
The goal of this project is to empower farmers with valuable information derived from
environmental, agricultural, and soil metrics, enabling them to adopt optimized
agricultural practices for sustainable crop growth. By leveraging data mining techniques
such as classification and regression, we can provide predictive insights into crop
recommendation, crop performance, environmental stress, and resource management.
The use of indices like the Temperature-Humidity Index (THI), Nutrient Balance Ratio
(NBR), and Soil Fertility Index (SFI) equips farmers with the tools to make data-driven
decisions, improving both crop yield and farm sustainability. In addition, the crop
recommendation system will help farmers select the most suitable crops based on soil and
environmental conditions, thereby optimizing crop selection for their specific region.
This approach supports precision agriculture, with the aim of higher yields, more
efficient use of resources, and sustainable farming practices that benefit both the
environment and farming communities.