Article type: Research article
Authors
1 Ph.D. Student in Soil Science, College of Agriculture, Shiraz University, Shiraz, Iran
2 Professor, Department of Soil Science, College of Agriculture, Shiraz University, Shiraz, Iran
3 Professor, Department of Soil Science, College of Agriculture, Shiraz University, Shiraz, Iran
4 Associate Professor, Department of Soil Science, College of Agriculture, Shahid Bahonar University of Kerman, Kerman, Iran
Abstract
The number of environmental variables used in digital soil mapping has increased rapidly, which makes it challenging to select and focus on the most important covariates. On the other hand, identifying all environmental variables is useful for obtaining spatial information that improves predictions. In this regard, feature selection algorithms help reduce the dimensionality of the predictive model by identifying the relevant covariates. In the present study, four feature selection techniques, namely the variance inflation factor (VIF), principal component analysis (PCA), Boruta, and recursive feature elimination (RFE), were applied to produce an optimal set of covariates for the spatial prediction of soil classes at the great group level using a random forest model. The techniques were compared using accuracy and the Kappa coefficient between observed and predicted values. The results showed that using the variables selected by the different feature selection methods, rather than all variables, increased prediction accuracy to some extent. The improvement in predictive performance also differed among the four approaches. VIF and PCA had the highest and lowest accuracy and Kappa coefficient, respectively, while Boruta, with the fewest variables, improved model performance second only to VIF. Overall, the findings indicate that feature selection methods can exploit the significant dependence among relevant covariates to predict soil classes and improve modeling accuracy.
Keywords
Subjects
Article title [English]
Evaluation of different feature selection algorithms for improving the spatial prediction of soil classes
Authors [English]
- Vahideh Sadeghizadeh 1
- Seyed Ali Abtahi 2
- Majid Baghernejad 3
- Azam Jafari 4
- Seyed Ali Akbar Moosavi 3
1 Ph.D. Student, Department of Soil Science, College of Agriculture, Shiraz University, Shiraz, Iran
2 Professor, Department of Soil Science, College of Agriculture, Shiraz University, Shiraz, Iran
3 Professor, Department of Soil Science, College of Agriculture, Shiraz University, Shiraz, Iran
4 Associate Professor, Department of Soil Science, College of Agriculture, Shahid Bahonar University of Kerman, Kerman, Iran
Abstract [English]
Introduction The number of environmental variables used in digital soil mapping has increased rapidly, which makes it challenging to select and focus on the most important covariates. Not all environmental covariates have the same predictive power in modeling, and some may introduce noise that reduces the predictive power of the models used. On the other hand, it is beneficial to identify all environmental variables to obtain spatial information that can improve predictions. In this regard, feature selection algorithms help reduce the dimensionality of the predictive model by identifying the relevant covariates. Therefore, this study aims to investigate different feature selection algorithms for selecting auxiliary variables and to evaluate their effect on the predictive model.
Materials and Methods The study area is a part of Darab city in the southeast of Fars province and covers about 31,000 hectares. In the study area, 140 profiles were located and excavated according to the diversity of geomorphological units and, consequently, of soil types. After the profiles were excavated and the morphological characteristics of each one were described, soil samples were collected from the genetic horizons and transported to the laboratory for further analysis. Selected physical and chemical properties of the soils were measured using standard methods after the samples were air-dried and passed through a 2 mm sieve. Finally, all profiles were classified to the great group level according to the U.S. Soil Taxonomy, based on field observations and the results of laboratory analysis. The environmental variables comprised parameters derived from the Digital Elevation Model, Landsat 8 images, and geology and geomorphology maps of the study area. All parameters were derived using ArcGIS, SAGA GIS, and ENVI software. In the present study, four feature selection techniques, namely Variance Inflation Factor (VIF), Principal Component Analysis (PCA), Boruta, and Recursive Feature Elimination (RFE), were used to identify an optimal set of covariates for the spatial prediction of soil classes at the great group level. In addition, a Random Forest (RF) model with 10-fold cross-validation repeated five times was used to compare the feature selection strategies in soil class mapping. The comparison of the feature selection techniques in estimating soil classes was based on accuracy and the Kappa coefficient between observed and predicted values.
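As an illustration of the VIF step named above, the following is a minimal sketch of VIF-based covariate filtering, not the authors' implementation. It relies on the identity that the VIF of covariate j equals the j-th diagonal element of the inverse correlation matrix. The threshold of 10 is a common rule of thumb and an assumption here (the abstract does not state the cutoff used), and the covariate names and synthetic data are hypothetical.

```python
import numpy as np


def vif_scores(X):
    """Return the variance inflation factor of each column of X.

    VIF_j = 1 / (1 - R_j^2), which equals the j-th diagonal element
    of the inverse of the correlation matrix of the covariates.
    """
    X = np.asarray(X, dtype=float)
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))


def vif_filter(X, names, threshold=10.0):
    """Drop the covariate with the largest VIF until all VIFs <= threshold.

    The threshold value is an assumption (a common rule of thumb).
    """
    X = np.asarray(X, dtype=float)
    keep = list(range(X.shape[1]))
    while len(keep) > 1:
        scores = vif_scores(X[:, keep])
        worst = int(np.argmax(scores))
        if scores[worst] <= threshold:
            break
        del keep[worst]
    return [names[i] for i in keep]


# Hypothetical covariates: two are nearly identical, two are independent.
rng = np.random.default_rng(0)
elevation = rng.normal(size=200)
elevation_smoothed = elevation + 0.01 * rng.normal(size=200)
slope = rng.normal(size=200)
twi = rng.normal(size=200)

names = ["elevation", "elevation_smoothed", "slope", "twi"]
X = np.column_stack([elevation, elevation_smoothed, slope, twi])

selected = vif_filter(X, names)
print(selected)  # one of the two near-duplicate covariates is removed
```

The retained covariates could then be passed to a classifier such as a random forest and evaluated with repeated cross-validation, as described in the paragraph above.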
Results and Discussion The results showed that prediction accuracy increased when the model used the variables selected by the feature selection methods rather than all variables. In addition, the improvement in predictive performance differed among the four feature selection methods. The VIF and PCA methods had the highest and lowest accuracy and Kappa coefficient, respectively. The Boruta method, with the fewest variables, improved the model's performance second only to the VIF method. However, the Kappa coefficient showed poor agreement between predicted and observed values for all approaches. The imbalance of soil classes could be one reason for the low accuracy and Kappa coefficient. Nevertheless, the random forest model, with and without feature selection, identified all soil great groups in the study area. Therefore, it can be concluded that the Random Forest algorithm is a powerful technique for the spatial prediction of soil classes in the study area. Although the performance of the model varied with the feature selection algorithm, the predicted soil maps had similar spatial patterns. The map predicted by the model with the VIF-selected variables indicates that Ustorthents are mainly located in high-altitude regions with steep slopes. The Haplustepts, Calciustepts, and Calciusterts great groups have developed in places with low to medium slopes. Haplosalids have developed downstream of the salt dome. Ustifluvents were found in fluvial sedimentary plains. Endoaquepts were found in the floodplains and had the smallest area on the predicted map.
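The effect of class imbalance on the two evaluation criteria can be made concrete with a small sketch. The function below computes overall accuracy and Cohen's Kappa from observed and predicted class labels; the label counts in the example are invented for illustration and are not data from the study.

```python
from collections import Counter


def accuracy_and_kappa(observed, predicted):
    """Return (overall accuracy, Cohen's Kappa) for two label sequences."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("observed and predicted must be non-empty and equal in length")
    n = len(observed)
    # Observed agreement: share of samples whose predicted class is correct.
    p_o = sum(o == p for o, p in zip(observed, predicted)) / n
    # Chance agreement: product of the marginal class frequencies, summed over classes.
    obs_counts = Counter(observed)
    pred_counts = Counter(predicted)
    p_e = sum(obs_counts[c] * pred_counts[c] for c in obs_counts) / (n * n)
    kappa = (p_o - p_e) / (1 - p_e) if p_e < 1 else 0.0
    return p_o, kappa


# Hypothetical imbalanced case: the model always predicts the dominant class.
observed = ["Haplustepts"] * 8 + ["Endoaquepts"] * 2
predicted = ["Haplustepts"] * 10

acc, kappa = accuracy_and_kappa(observed, predicted)
print(acc, kappa)  # accuracy is 0.8 while Kappa is 0.0
```

In this toy case the accuracy looks acceptable only because one class dominates, while Kappa shows that the agreement is no better than chance.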
Conclusion Overall, the findings indicate that feature selection methods can exploit significant dependencies among relevant covariates to predict soil classes and improve modeling accuracy. In the current study, the environmental factors derived from the Digital Elevation Model were selected as key variables, which shows the importance of topography and morphology in the classification of soil types in the area. Although the selected variables improved the performance of the model, the prediction of soil classes remained close to random, as reflected by the poor Kappa agreement. This could be attributed to the imbalance of soil classes.
Keywords [English]
- Digital Soil Mapping
- Feature Selection
- Covariates
- Random Forest