Improving the classification of imbalanced soil data using machine learning algorithms in part of the lands of Zanjan Province

Rahimi Mashkale, Mastaneh; Delavar, Mohammad Amir; Jamshidi, Mohammad; Sharififar, Amin

doi:10.22055/agen.2023.43838.1667

Document Type: Research Paper

Authors

¹ Department of Soil Science, Faculty of Agriculture, University of Zanjan, Zanjan, Iran

² Associate Professor, Department of Soil Science, Faculty of Agriculture, University of Zanjan, Zanjan, Iran

³ Assistant Professor, Soil and Water Research Institute, Agricultural Research, Education and Extension Organization, Karaj, Iran

⁴ Researcher, Department of Soil Science, Faculty of Agriculture, University of Tehran, Iran

https://doi.org/10.22055/agen.2023.43838.1667

Abstract

Introduction: Digital soil maps are widely used, but class imbalance degrades the classification performance of many machine learning algorithms, and the problem has therefore attracted the attention of many researchers. The aim of this research is to improve the classification of imbalanced soil data by applying a resampling pretreatment to three predictive models, random forest (RF), boosted regression trees (BRT), and multinomial logistic regression (MNLR), in part of the lands of Zanjan Province, Iran.
Materials and Methods: Sampling followed a regular grid pattern with 500-meter intervals, and 148 soil surfaces were randomly studied and classified. At the subgroup level, the soils of the region fell into five classes with an imbalanced distribution: Typic Calcixerepts, Typic Haploxerepts, Gypsic Haploxerepts, Typic Xerorthents, and Lithic Xerorthents. Candidate environmental covariates were derived from geomorphological and geological maps, a digital elevation model (DEM), and remote sensing (RS) data. Using principal component analysis (PCA) and expert knowledge, geomorphological map information, geological information, and features extracted from the DEM were selected as the most effective covariates for predicting soil classes and were used as model inputs. The covariates were extracted in ENVI and SAGA GIS, and the soil-landscape relationship was modeled with the three algorithms in RStudio. The resampling technique was applied to the minority and majority soil classes prior to modeling.
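The resampling step described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the abstract does not say which resampling implementation was used, the study was carried out in R, and Python appears here only for illustration. The function name `resample_to_target`, the choice of the mean class size as the common target, and the per-class counts are all assumptions; the abstract reports only the total of 148 observations and the five subgroup names.

```python
import random
from collections import Counter, defaultdict


def resample_to_target(samples, labels, target=None, seed=0):
    """Randomly over-sample small classes and under-sample large ones.

    Every class ends up with exactly `target` observations.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    if target is None:
        # Assumption: use the mean class size as the common target.
        target = round(len(labels) / len(by_class))
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        if len(items) >= target:
            # Majority class: draw without replacement (under-sampling).
            chosen = rng.sample(items, target)
        else:
            # Minority class: keep all items, then duplicate at random
            # (over-sampling with replacement).
            chosen = items + rng.choices(items, k=target - len(items))
        out_samples.extend(chosen)
        out_labels.extend([label] * target)
    return out_samples, out_labels


# Hypothetical class counts, chosen only so that they sum to 148.
hypothetical_counts = {
    "Typic Calcixerepts": 70,
    "Typic Haploxerepts": 40,
    "Typic Xerorthents": 24,
    "Gypsic Haploxerepts": 8,
    "Lithic Xerorthents": 6,
}
labels = [name for name, n in hypothetical_counts.items() for _ in range(n)]
samples = list(range(len(labels)))  # stand-ins for covariate vectors

balanced_x, balanced_y = resample_to_target(samples, labels)
print(Counter(labels))
print(Counter(balanced_y))
```

Because over-sampling duplicates observations, a common precaution is to resample only the training portion of the data so that copies of the same observation do not appear in both training and validation sets.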
Results and Discussion: Mapping with the original, imbalanced data led to loss of the minority classes and to relatively low overall accuracy and Kappa values for the RF (overall accuracy = 65%, Kappa = 0.32) and BRT (overall accuracy = 60%, Kappa = 0.35) models. After resampling, both overall accuracy and the Kappa coefficient increased in all models. The BRT model gave an acceptable spatial prediction of soil subgroups, maintaining the minority classes with a Kappa coefficient of 0.64 and an overall accuracy of 75%. The producer's accuracy (PA) and user's accuracy (UA) showed that Gypsic Haploxerepts and Lithic Xerorthents, the two classes that were lost when the RF and BRT algorithms were trained on imbalanced data, improved significantly after balancing. With the treated data, they were well predicted by RF (UA = 100% and 78%) and BRT (UA = 60% and 70%). Their producer's accuracy was 75% and 88% with RF and 100% and 78% with BRT, compared with zero when training on imbalanced data. For the MNLR algorithm, by contrast, validation showed that although the minority classes were maintained after balancing, they were predicted with lower accuracy.
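The four statistics reported above (overall accuracy, Kappa, PA, and UA) can all be derived from a confusion matrix. The sketch below shows one way to compute them, assuming rows hold the reference (observed) classes and columns hold the predicted classes. The function name `accuracy_report` and the three-class matrix are hypothetical and are not the study's validation data; the matrix is constructed so that the minority class is never predicted, which mirrors the loss of minority classes described for the imbalanced data.

```python
def accuracy_report(matrix, class_names):
    """Compute overall accuracy, Cohen's Kappa, PA and UA.

    matrix[i][j] is the number of validation samples whose reference
    class is i and whose predicted class is j.
    """
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    diag = [matrix[i][i] for i in range(k)]
    row_tot = [sum(matrix[i]) for i in range(k)]  # reference totals
    col_tot = [sum(matrix[i][j] for i in range(k)) for j in range(k)]  # predicted totals

    overall = sum(diag) / n
    # Agreement expected by chance, from the marginal totals.
    expected = sum(r * c for r, c in zip(row_tot, col_tot)) / (n * n)
    kappa = (overall - expected) / (1 - expected)

    # Producer's accuracy: correct / reference total for the class.
    # User's accuracy: correct / predicted total for the class.
    # A class with a zero total is reported as 0.0 here by convention.
    producer = {name: (diag[i] / row_tot[i] if row_tot[i] else 0.0)
                for i, name in enumerate(class_names)}
    user = {name: (diag[i] / col_tot[i] if col_tot[i] else 0.0)
            for i, name in enumerate(class_names)}
    return overall, kappa, producer, user


# Hypothetical validation matrix: the minority class is never predicted.
names = ["majority", "intermediate", "minority"]
confusion = [
    [18, 2, 0],
    [4, 6, 0],
    [3, 2, 0],
]
overall, kappa, producer, user = accuracy_report(confusion, names)
print(f"overall accuracy = {overall:.2f}, Kappa = {kappa:.3f}")
print("PA:", producer)
print("UA:", user)
```

In this example, overall accuracy is about 0.69 while Kappa is only 0.384 and the minority class has a PA of zero, which shows why overall accuracy alone can hide the loss of minority classes.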
Conclusion: Modeling with an imbalanced distribution of class observations produced uncertain maps in which minority classes were lost and accuracy was relatively poor. After treating the data with over- and under-sampling, all models showed significant improvement in maintaining the minority classes in the evaluations. Data resampling can be a useful way of dealing with imbalanced class observations and producing more certain digital soil maps.

Main Subjects

Soil Genesis and Classification

Volume 46, Issue 1, June 2023, Pages 61-82
