Article type: Research Paper
Authors
1 Ph.D. Student, Department of Soil Science, Faculty of Agriculture, University of Zanjan, Iran
2 Associate Professor, Department of Soil Science, Faculty of Agriculture, University of Zanjan, Iran
3 Assistant Professor, Soil and Water Research Institute, Agricultural Research, Education and Extension Organization, Karaj, Iran
Abstract
Classification of imbalanced data has become an important research topic in data mining. The aim of this study was to correctly identify minority-class samples and to increase the classification accuracy of soil classes using an ensemble model approach in part of the lands of southwestern Zanjan province. A total of 148 soil profiles were excavated on a regular grid pattern with an average spacing of 500 m, described, and classified to the family level on the basis of laboratory analyses. The most suitable environmental covariates for predicting soil classes were selected, based on expert opinion and principal component analysis, from among 57 variables comprising information from geomorphological and geological maps, a digital elevation model, and data derived from Landsat 8 satellite images. The soil-landscape relationship was modeled using the random forest, boosted regression tree, and multinomial logistic regression learning algorithms and an ensemble model (after data balancing) in the "RStudio" software environment. Overall accuracy and the Kappa coefficient for soil classes at the subgroup level were, respectively, 65% and 0.41 for the individual multinomial logistic regression model, 65% and 0.32 for random forest, 60% and 0.35 for the boosted regression tree, and 70% and 0.62 for the ensemble model. The user's accuracy and producer's accuracy results showed that, among the individual models, the multinomial logistic regression model predicts soil classes with higher accuracy.
Keywords
Subjects
Article Title [English]
Using Ensemble Model Approach for Spatial Modeling of Soil Imbalanced Classes
Authors [English]
- Mastaneh Rahimi Mashkaleh 1
- Mohammad Amir Delavar 2
- Mohammad Jamshidi 3
1 Ph.D. Student, Department of Soil Science, Faculty of Agriculture, University of Zanjan, Zanjan, Iran
2 Associate Professor, Department of Soil Science, Faculty of Agriculture, University of Zanjan, Zanjan, Iran
3 Assistant Professor, Soil and Water Research Institute, Agricultural Research, Education and Extension Organization, Karaj, Iran
Abstract [English]
Introduction: Imbalanced data remains a widespread challenge for machine learning algorithms, and the classification of imbalanced data has therefore emerged as a crucial research area in data mining. The problem arises when one class contains only a few instances while the other classes contain many, which makes the minority class difficult for a learning algorithm to recognize. Data mining and machine learning practitioners are consequently working to refine methods and models that classify imbalanced data more accurately. The principal objective of this study is to correctly identify samples from the minority class and thereby improve the accuracy of soil class prediction. The research was conducted in part of the southwestern lands of Zanjan province.
Materials and Methods: A total of 148 soil profiles were excavated on a regular grid pattern with an average spacing of 500 meters (and, in some locations, up to 700 meters based on expert recommendations). The samples were air-dried and then transported to the laboratory. Physical and chemical analyses were conducted on all collected samples, including soil texture, soil pH, calcium carbonate equivalent, cation exchange capacity, electrical conductivity, organic carbon content, and gypsum content. The soils were then described and classified to the family level according to the standards of the soil classification system. The most appropriate covariates for predicting soil classes were selected from among 57 covariates, including geomorphological and geological maps, a digital elevation model (DEM), and data from Landsat 8 satellite images, using principal component analysis (PCA) and expert knowledge. SAGA GIS and ENVI software were used to extract the environmental covariates. The soil-landscape relationship was modeled with three algorithms, namely multinomial logistic regression (MNLR), random forest (RF), and boosted regression tree (BRT), and with an ensemble model (after data balancing) in "RStudio" software. To check the accuracy of the models, the data were randomly divided into training and validation sets: 80% of the data (118 profiles) were used for model training and 20% (30 profiles) were used as validation data for evaluation.
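The random 80/20 split, the balancing step, and the combination of model outputs described above could be sketched in Python as follows. This is an illustrative sketch only: the abstract does not state which balancing technique or which combination rule was used, so random oversampling and majority voting are assumptions, and the function names (`random_split`, `oversample`, `majority_vote`) and the synthetic labels are hypothetical. The study itself was carried out in R.

```python
import random
from collections import Counter, defaultdict


def random_split(n, validation_fraction=0.2, seed=0):
    """Randomly split indices 0..n-1 into (train, validation) lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_valid = round(n * validation_fraction)
    return idx[n_valid:], idx[:n_valid]


def oversample(indices, labels, seed=0):
    """Random oversampling (assumed technique): resample every class,
    with replacement, up to the size of the largest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i in indices:
        by_class[labels[i]].append(i)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced


def majority_vote(predictions):
    """Combine per-model label lists by majority vote (assumed rule).
    Ties go to the earliest-listed model whose label is among the tied ones."""
    combined = []
    for votes in zip(*predictions):
        counts = Counter(votes)
        best = max(counts.values())
        combined.append(next(v for v in votes if counts[v] == best))
    return combined


# Synthetic labels for illustration only; these are NOT the study's class counts.
labels = ["major"] * 100 + ["mid"] * 40 + ["minor"] * 8
train, valid = random_split(len(labels))    # 118 training and 30 validation indices
balanced_train = oversample(train, labels)  # balance the training data only
```

Balancing is applied to the training indices only, so that the validation profiles keep their original class distribution and the accuracy estimates are not inflated by duplicated samples.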
Results and Discussion: Covariate selection identified 10 input variables: information from the geomorphological and geological maps, and features extracted from the digital elevation model (DEM), including analytical hill shading (AHS), sunrise, valley depth (VD), LS factor, channel network distance (CND), topographic wetness index (TWI), and multi-resolution ridge top flatness (MRRTF). Based on the profile analyses, the soils of the region were grouped at the subgroup level into five classes with an imbalanced distribution: Typic Calcixerepts, Typic Haploxerepts, Gypsic Haploxerepts, Typic Xerorthents, and Lithic Xerorthents. Overall accuracy and the Kappa index were 65% and 0.32 for the RF algorithm, 60% and 0.35 for the BRT algorithm, and 65% and 0.41 for the MNLR algorithm; after balancing the data, the ensemble model approach achieved 70% and 0.62, respectively. The user's accuracy and producer's accuracy statistics showed that, among the individual models, the multinomial logistic regression model has higher accuracy in predicting soil classes. Although the ensemble model predicted the minority soil classes well, it showed lower values than the individual multinomial logistic regression model for some of the majority soil classes, especially Typic Haploxerepts and Typic Xerorthents, because the two weaker models (RF and BRT) are also involved in the ensemble.
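For readers unfamiliar with the four evaluation statistics reported above, the following Python sketch shows how overall accuracy, the Kappa index, producer's accuracy, and user's accuracy are conventionally derived from a confusion matrix. The function names and the toy labels are hypothetical and serve only as an illustration; they do not reproduce the study's data or its R code.

```python
def confusion_matrix(reference, predicted, classes):
    """Rows are reference (observed) classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    m = [[0] * len(classes) for _ in classes]
    for r, p in zip(reference, predicted):
        m[index[r]][index[p]] += 1
    return m


def overall_accuracy(m):
    """Share of all samples that lie on the diagonal."""
    total = sum(sum(row) for row in m)
    return sum(m[i][i] for i in range(len(m))) / total


def cohen_kappa(m):
    """Agreement corrected for chance: (po - pe) / (1 - pe)."""
    n = len(m)
    total = sum(sum(row) for row in m)
    po = sum(m[i][i] for i in range(n)) / total
    row_tot = [sum(row) for row in m]
    col_tot = [sum(m[i][j] for i in range(n)) for j in range(n)]
    pe = sum(r * c for r, c in zip(row_tot, col_tot)) / total ** 2
    return float("nan") if pe == 1 else (po - pe) / (1 - pe)


def producers_accuracy(m, classes):
    """Per class: correct / reference total (None if the class was never observed)."""
    return {c: (m[i][i] / sum(m[i]) if sum(m[i]) else None)
            for i, c in enumerate(classes)}


def users_accuracy(m, classes):
    """Per class: correct / predicted total (None if the class was never predicted)."""
    out = {}
    for j, c in enumerate(classes):
        col = sum(row[j] for row in m)
        out[c] = m[j][j] / col if col else None
    return out


# Toy labels for illustration only (not the study's validation data).
classes = ["A", "B", "C"]
reference = ["A", "A", "A", "B", "B", "C"]
predicted = ["A", "A", "B", "B", "B", "A"]
m = confusion_matrix(reference, predicted, classes)
```

Producer's accuracy answers "how often is a soil class that occurs in the field mapped correctly", while user's accuracy answers "how often is a mapped class actually present in the field"; a minority class that is never predicted has no defined user's accuracy, which is why the sketch returns `None` in that case.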
Conclusions: The results demonstrated that the learning algorithms, when applied individually, do not achieve high accuracy in the spatial prediction of soil classes. When these algorithms are combined into an ensemble model, however, overall performance and accuracy in spatial soil class prediction exceed those of the individual models. The ensemble model also enhances prediction accuracy and reduces misclassification, especially at the subgroup level. While each individual model excels in predicting particular soil classes, the ensemble model outperforms the individual models in overall performance and accuracy, underscoring the effectiveness of ensemble modeling for improving spatial soil classification.
Keywords [English]
- Boosted Regression Trees
- Data balancing
- Imbalanced dataset
- Minority