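1. School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093
2. School of Computer Science, Fudan University, Shanghai 200438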
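LIU Wenchang (1998- ), male, master's student at the School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology. His main research interests include machine learning and data classification.
WEI Yun (1976- ), female, Ph.D., associate professor at University of Shanghai for Science and Technology. Her main research interests include distributed systems and network information control.
YUAN Haoxuan (1996- ), male, Ph.D. candidate at the School of Computer Science, Fudan University. His main research interests include deep learning and intelligent spectrum sensing.
GAO Yue (1978- ), male, Ph.D., professor at Fudan University. His main research interests include satellite internet, space-air-ground integrated networks, compressed sensing and machine learning, and smart antennas.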
Print publication date: 2023-06-30
Online publication date: 2023-06
LIU Wenchang, WEI Yun, YUAN Haoxuan, et al. Research on medical small sample data classification based on SMOTE and gcForest[J]. Chinese Journal on Internet of Things, 2023, 7(2): 76-87. DOI: 10.11959/j.issn.2096-3750.2023.00337.
Traditional machine learning models classify small-sample medical data poorly because their model structures are shallow and the data features are complex. To address this problem, a combined multi-grained improved cascade forest (cgicForest) model is proposed. A random sampling step is added to multi-grained scanning and the transformed features are optimized to strengthen the model's representation learning ability, and the layer structure of the cascade forest is revised to improve its classification ability. For datasets with class imbalance, a safe-borderline-SMOTE (SBS) algorithm is proposed that performs dynamic interpolation around minority-class samples belonging to the safe borderline, which improves the quality of the training data. Training cgicForest on the resampled data yields the SBS-cgicForest classification model, which supports imbalanced small-sample medical data. The SBS-cgicForest model was tested on three medical datasets. The results show that, on small-sample medical data with complex features, the performance metrics of cgicForest were 4.1 to 5.4 percentage points higher than those of the multi-grained cascade forest (gcForest) model. After combination with the SBS algorithm, the performance metrics improved by 6.6 to 11.2 percentage points, and the F1 score was 2 to 2.5 percentage points higher than that obtained with traditional sampling methods. The work provides a reference for classifying small-sample medical data and supports Internet of Things applications in smart healthcare scenarios.
medical data; small sample; SMOTE; gcForest
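The abstract does not spell out how SBS decides which minority samples lie on the safe borderline, so the sketch below substitutes the classic Borderline-SMOTE rule as a stand-in: a minority sample is treated as borderline when at least half, but not all, of its k nearest neighbours belong to other classes. It is meant only to illustrate interpolating around boundary minority samples, not to reproduce the authors' algorithm. The function name `borderline_smote` and its parameters are hypothetical.

```python
import numpy as np


def borderline_smote(X, y, minority_label, k=5, n_new=10, seed=0):
    """Generate synthetic minority samples near the class boundary.

    Assumption: uses the classic Borderline-SMOTE selection rule as a
    stand-in; the paper's own safe-borderline criterion is not given
    in the abstract.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    minority_idx = np.flatnonzero(y == minority_label)
    X_min = X[minority_idx]

    # Distances from each minority sample to every sample in the dataset.
    dists = np.linalg.norm(X_min[:, None, :] - X[None, :, :], axis=2)
    dists[np.arange(len(minority_idx)), minority_idx] = np.inf  # exclude self
    neighbors = np.argsort(dists, axis=1)[:, :k]

    # Count how many of the k nearest neighbours are not minority-class.
    n_majority = (y[neighbors] != minority_label).sum(axis=1)
    # Borderline: at least half, but not all, neighbours are majority.
    borderline = (n_majority >= k / 2) & (n_majority < k)
    seeds = X_min[borderline]
    if len(seeds) == 0 or len(X_min) < 2:
        return np.empty((0, X.shape[1]))

    synthetic = []
    for _ in range(n_new):
        base = seeds[rng.integers(len(seeds))]
        # Pick one of the nearest minority samples (position 0 is the base).
        d = np.linalg.norm(X_min - base, axis=1)
        candidates = np.argsort(d)[1:k + 1]
        mate = X_min[rng.choice(candidates)]
        # Place the new point at a random position on the segment base -> mate.
        gap = rng.random()
        synthetic.append(base + gap * (mate - base))
    return np.array(synthetic)
```

The returned rows would be appended to the training set with the minority label before fitting a classifier. Because each synthetic point is a convex combination of two minority samples, it always falls on the segment between them.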