Article type: Research article

Authors

1 Department of Soil Science, Faculty of Agriculture, University of Zanjan, Iran

2 Faculty member, University of Zanjan

3 Soil and Water Research Institute, Agricultural Research, Education and Extension Organization, Karaj, Iran

4 Researcher, Department of Soil Science, College of Agriculture, University of Tehran, Iran

Abstract

Despite the widespread use of digital soil mapping methods in soil science studies, limitations caused by imbalanced soil classes have hindered the successful performance of many machine learning algorithms in these methods. The aim of this study was therefore to improve the modeling of imbalanced soil data by using resampling as a pretreatment in three predictive models (random forest, boosted decision trees, and multinomial logistic regression) in part of the lands of Zanjan Province. To this end, 148 observation soil profiles were located on a regular grid with 500 m spacing, excavated, and then described and classified according to the standards of Soil Taxonomy. The environmental variables comprised information from geomorphological and geological maps, a digital elevation model, and data derived from Landsat 8 satellite imagery; based on expert judgment and principal component analysis, a subset of these was selected as the most influential environmental variables and used as model input. Modeling with the imbalanced data led to loss of the classes with few observations in all three models. Under these conditions, the multinomial logistic regression model showed the highest accuracy (66%) and kappa coefficient (0.41) of the three models. After the data were resampled as part of the balancing process, the boosted decision tree model retained the low-frequency classes and gave an acceptable estimate in the spatial prediction of soil subgroups, with an overall accuracy of 75% and a kappa coefficient of 0.64.
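
The two summary statistics reported in the abstract, overall accuracy and the kappa coefficient, can be illustrated with a minimal Python sketch. The function name `overall_accuracy_and_kappa` and the toy labels are hypothetical and are not taken from the study; the sketch only shows, under the standard definition of Cohen's kappa, why a model that ignores a minority class can still reach a high overall accuracy while its kappa stays near zero.

```python
from collections import Counter


def overall_accuracy_and_kappa(observed, predicted):
    """Return (overall accuracy, Cohen's kappa) for two equal-length label lists."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("observed and predicted must be non-empty and equally long")
    n = len(observed)
    # Observed agreement: share of sites where the prediction matches the observation.
    p_o = sum(o == p for o, p in zip(observed, predicted)) / n
    # Chance agreement: sum over classes of (observed share) * (predicted share).
    obs_counts = Counter(observed)
    pred_counts = Counter(predicted)
    p_e = sum(obs_counts[c] * pred_counts[c] for c in obs_counts) / (n * n)
    # Kappa rescales the observed agreement by the agreement expected by chance.
    kappa = (p_o - p_e) / (1 - p_e) if p_e < 1 else 0.0
    return p_o, kappa


# Toy example (made-up labels): the model always predicts the majority class.
observed = ["majority"] * 8 + ["minority"] * 2
predicted = ["majority"] * 10
accuracy, kappa = overall_accuracy_and_kappa(observed, predicted)
print(accuracy, kappa)
```

In this toy case the overall accuracy is 0.8 although the minority class is never predicted, and kappa is 0, which is the kind of behaviour that motivates reporting both statistics for imbalanced soil classes.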

Article title [English]

Improving the classification of imbalanced soil data using machine learning algorithms in part of the lands of Zanjan Province

Authors [English]

  • Mastaneh Rahimi Mashkale 1
  • Mohammad Amir Delavar 2
  • Mohammad Jamshidi 3
  • Amin Sharififar 4

1 Department of Soil Science, Faculty of Agriculture, University of Zanjan, Zanjan, Iran

2 Associate Professor, Department of Soil Science, Faculty of Agriculture, University of Zanjan, Zanjan, Iran

3 Assistant Professor, Soil and Water Research Institute, Agricultural Research, Education and Extension Organization, Karaj, Iran

4 Researcher, Department of Soil Science, Faculty of Agriculture, University of Tehran, Iran

Abstract [English]

Despite the widespread use of digital soil mapping, class imbalance degrades the classification performance of many machine learning algorithms, and the problem has therefore attracted the attention of many researchers. The aim of this research was to improve the classification of imbalanced soil data by using a resampling pretreatment in three predictive models, namely random forest (RF), boosted regression trees (BRT), and multinomial logistic regression (MNLR), in part of the lands of Zanjan Province, Iran. Sampling followed a regular grid pattern with 500 m intervals, and 148 soil profiles were studied and classified. At the subgroup level, the soils of the region fell into five classes with an imbalanced distribution: Typic Calcixerepts, Typic Haploxerepts, Gypsic Haploxerepts, Typic Xerorthents, and Lithic Xerorthents. Environmental covariates were derived from geomorphological and geological maps, a digital elevation model (DEM), and remote sensing (RS) data. Using principal component analysis (PCA) and expert knowledge, a subset of covariates, including geomorphological map information, geological information, and features extracted from the DEM, was selected as the most effective predictors of soil classes and used as model input. The covariates were extracted in ENVI and SAGA GIS, and the soil-landscape relationship was modeled with the three algorithms in RStudio. Resampling was applied to the minority and majority soil classes prior to modeling. Modeling with the original, imbalanced data resulted in loss of the minority classes and in relatively low overall accuracy and Kappa values for the RF (overall accuracy = 65%, Kappa = 0.32) and BRT (overall accuracy = 60%, Kappa = 0.35) models.
After resampling, both overall accuracy and the Kappa coefficient increased in all models. The BRT model retained the minority classes and gave an acceptable spatial prediction of soil subgroups, with a Kappa coefficient of 0.64 and an overall accuracy of 75%. The producer's accuracy (PA) and user's accuracy (UA) showed that Gypsic Haploxerepts and Lithic Xerorthents, the two classes that were lost when RF and BRT were trained on the imbalanced dataset, improved considerably after the data were balanced. With the treated data, these two classes were well predicted by RF (UA = 100% and 78%) and BRT (UA = 60% and 70%), and their producer's accuracy reached 75% and 88% in RF and 100% and 78% in BRT, compared with zero when the models were trained on imbalanced data. In contrast, validation of the MNLR model showed that, although it retained the minority classes after balancing, it predicted them less accurately. Overall, modeling with an imbalanced distribution of class observations produced uncertain maps, with minority classes lost and relatively poor accuracy. After the data were treated by over- and under-sampling, all models improved considerably in retaining the minority classes in the evaluations. Data resampling can therefore be a useful way of dealing with imbalanced class observations and of producing digital soil maps with lower uncertainty.
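
The over- and under-sampling pretreatment described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' procedure: the abstract does not give the resampling algorithm or the target class size, so plain random resampling to a common size is assumed here, and the function name `resample_to_target`, the `target` parameter, and the toy labels are all hypothetical.

```python
import random
from collections import defaultdict


def resample_to_target(samples, labels, target, seed=0):
    """Balance classes by random resampling (assumed scheme, for illustration).

    Classes with more than `target` observations are under-sampled without
    replacement; classes with fewer are over-sampled by keeping every original
    observation and adding random duplicates.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)

    out_samples, out_labels = [], []
    for y, items in by_class.items():
        if len(items) >= target:
            # Majority class: keep a random subset of size `target`.
            chosen = rng.sample(items, target)
        else:
            # Minority class: keep all originals, then draw duplicates.
            chosen = items + rng.choices(items, k=target - len(items))
        out_samples.extend(chosen)
        out_labels.extend([y] * target)
    return out_samples, out_labels


# Toy example with made-up class sizes (6, 3 and 1 observations).
sample_ids = list(range(10))
classes = ["majority"] * 6 + ["middle"] * 3 + ["minority"] * 1
balanced_ids, balanced_classes = resample_to_target(sample_ids, classes, target=3)
print(sorted(zip(balanced_classes, balanced_ids)))
```

In a workflow of this kind, resampling is normally applied to the training observations only, so that the validation statistics are still computed on data with the original class distribution.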

Keywords [English]

  • Boosted Regression Trees
  • Data Pretreatment
  • Oversampling
  • Resampling Methods
  • Minority Class