[1] LIU Peng, YE Bin. Linear Discriminant Analysis for High-Dimensional Dataset with Missing Observations[J]. Journal of Changzhou University (Natural Science Edition), 2020, 32(02): 31-37. [doi:10.3969/j.issn.2095-0411.2020.02.004]

Linear Discriminant Analysis for High-Dimensional Dataset with Missing Observations

Journal of Changzhou University (Natural Science Edition) [ISSN: 2095-0411 / CN: 32-1822/N]

Volume:
Vol. 32
Issue:
2020, No. 02
Pages:
31-37
Section:
Computer and Information Engineering
Publication date:
2020-03-28

Article Info

Title:
Linear Discriminant Analysis for High-Dimensional Dataset with Missing Observations
Article ID:
2095-0411(2020)02-0031-07
Author(s):
LIU Peng, YE Bin
(Engineering Research Center of Intelligent Control for Underground Space, Ministry of Education, China University of Mining and Technology, Xuzhou 221116, China; School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China)
Keywords:
linear discriminant analysis; missing data; high-dimensional data; Lasso estimation
CLC number:
TP 181
DOI:
10.3969/j.issn.2095-0411.2020.02.004
Document code:
A
Abstract:
Although linear discriminant analysis (LDA) performs well in many practical applications, it performs poorly on high-dimensional datasets with missing observations. There are two reasons: LDA cannot accurately predict or impute the missing values, and in high dimensions the sample covariance matrix used by LDA is no longer a good estimator of the population covariance matrix. As a result, the computed discriminant function values deviate substantially. Drawing on results from random matrix theory and using a Lasso estimator of the population covariance matrix, this paper proposes an improved LDA classifier for high-dimensional datasets with missing observations. Simulation results on a variety of synthetic and real datasets show that the proposed method achieves higher classification accuracy than competing algorithms.
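
The abstract gives no formulas, so the following is only a rough sketch of the kind of pipeline it describes, not the authors' algorithm. It assumes that entries are missing completely at random with one common observation rate, that the covariance of the zero-filled data is debiased in the style of the estimator for missing observations studied by Lounici (Bernoulli, 2014), and that the "Lasso estimator" can be taken to be eigenvalue soft-thresholding, which is the closed-form solution of a nuclear-norm-penalized least-squares problem. All function names and the parameter `lam` are hypothetical.

```python
import numpy as np

def debiased_covariance(X_centered, mask):
    """Bias-corrected covariance from zero-filled data (assumes MCAR missingness).

    X_centered: (n, p) centered data, may hold NaN where mask is False.
    mask: (n, p) boolean array, True where the entry was observed.
    """
    delta = mask.mean()                        # estimated observation rate
    Y = np.where(mask, X_centered, 0.0)        # zero-fill missing entries
    S = Y.T @ Y / Y.shape[0]                   # covariance of zero-filled data
    D = np.diag(np.diag(S))
    # Off-diagonal terms are scaled by delta^2, diagonal terms by delta.
    return (1.0 / delta - 1.0 / delta**2) * D + S / delta**2

def spectral_soft_threshold(S, lam):
    """Shrink every eigenvalue by lam / 2 and clip at zero."""
    w, V = np.linalg.eigh((S + S.T) / 2.0)
    w = np.maximum(w - lam / 2.0, 0.0)
    return (V * w) @ V.T

def fit_lda_missing(X, y, lam):
    """Fit LDA on data X whose missing entries are marked with NaN."""
    mask = ~np.isnan(X)
    classes = np.unique(y)
    # Class means use only the observed entries of each feature.
    means = np.vstack([np.nanmean(X[y == c], axis=0) for c in classes])
    centered = X - means[np.searchsorted(classes, y)]
    sigma = spectral_soft_threshold(debiased_covariance(centered, mask), lam)
    precision = np.linalg.pinv(sigma)          # sigma may be rank deficient
    priors = np.array([(y == c).mean() for c in classes])
    return classes, means, precision, priors

def predict_lda(model, X_new):
    """Assign each complete row of X_new to the class with the largest LDA score."""
    classes, means, precision, priors = model
    A = means @ precision                      # one row per class
    scores = X_new @ A.T - 0.5 * np.sum(A * means, axis=1) + np.log(priors)
    return classes[np.argmax(scores, axis=1)]
```

A possible usage is `model = fit_lda_missing(X_train, y_train, lam=0.1)` followed by `predict_lda(model, X_test)`. In this sketch `lam` would have to be tuned, for example by cross-validation, and test rows are assumed to be complete.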

References:

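[1]ZHANG Rui, JIANG Chenzhi, SU Jianbo. Facial expression recognition based on sparse feature selection and probabilistic linear discriminant analysis[J]. Acta Electronica Sinica, 2018, 46(7): 1710-1718. (in Chinese)
[2]ZHANG Jing, HU Xuegang, LI Peipei, et al. Informative gene selection for tumor classification based on iterative Lasso[J]. Pattern Recognition and Artificial Intelligence, 2014, 27(1): 49-59. (in Chinese)
[3]LIU Liping. Estimation and application of financial covariance matrices for large-dimensional data[J]. Systems Engineering - Theory & Practice, 2017, 37(3): 597-606. (in Chinese)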
[4]FRIEDMAN J H. Regularized discriminant analysis[J]. Journal of the American Statistical Association, 1989, 84(405): 165-175.
[5]SUN P Y, BAO K W, LI H H, et al. An efficient classification method for fuel and crude oil types based on m/z 256 mass chromatography by COW-PCA-LDA[J]. Fuel, 2018, 222: 416-423.
[6]GANTAYAT S S, MISRA A, PANDA B S. A study of incomplete data-a review[C]//Proceedings of the International Conference on Frontiers of Intelligent Computing: Theory and Applications(FICTA)2013. Cham: Springer, 2014: 401-408.
[7]ENDERS C K. Applied missing data analysis[M]. New York: Guilford Press, 2010.
[8]OUNPRASEUTH S, MOORE P C, YOUNG D M. Imputation techniques for incomplete data in quadratic discriminant analysis[J]. Journal of Statistical Computation and Simulation, 2012, 82(6): 863-877.
[9]EL KAROUI N. Spectrum estimation for large dimensional covariance matrices using random matrix theory[J]. The Annals of Statistics, 2008, 36(6): 2757-2790.
[10]JOHNSTONE I M, MA Z. Fast approach to the Tracy-Widom law at the edge of GOE and GUE[J]. Annals of Applied Probability, 2012, 22(5): 1962-1988.
[11]VERSHYNIN R. Introduction to the non-asymptotic analysis of random matrices[R]. Paris: Institut Henri Poincaré, 2011.
[12]LOUNICI K. High-dimensional covariance matrix estimation with missing observations[J]. Bernoulli, 2014, 20(3): 1029-1058.
[13]LOUNICI K. Sup-norm convergence rate and sign concentration property of Lasso and Dantzig estimators[J]. Electronic Journal of Statistics, 2008, 2: 90-102.
[14]BICKEL P J, RITOV Y, TSYBAKOV A B. Simultaneous analysis of Lasso and Dantzig selector[J]. The Annals of Statistics, 2009, 37(4): 1705-1732.
[15]GRANT M, BOYD S, YE Y. CVX: Matlab software for disciplined convex programming[EB/OL]. (2013-09-01)[2018-04-10]. http://cvxr.com/cvx/.
[16]GUO Y, HASTIE T, TIBSHIRANI R. Regularized discriminant analysis and its application in microarrays[J]. Biostatistics, 2005, 1(1): 1-18.
[17]RAMEY J A, STEIN C K, YOUNG P D, et al. High-dimensional regularized discriminant analysis[R]. New York: arXiv, 2016:1602.01182.
[18]SRIVASTAVA M S, KUBOKAWA T. Comparison of discrimination methods for high dimensional data[J]. Journal of the Japan Statistical Society, 2007, 37(1): 123-134.
[19]KOWARIK A, TEMPL M. Imputation with the R package VIM[J]. Journal of Statistical Software, 2016, 74(7): 1-16.
[20]JOSSE J, HUSSON F. missMDA: a package for handling missing values in multivariate data analysis[J]. Journal of Statistical Software, 2016, 70(1): 1-31.
[21]DUA D, KARRA T E. UCI machine learning repository[EB/OL].(1995-11-01)[2018-03-20]. http://archive.ics.uci.edu/ml.

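Memo

Received: 2019-11-21.
Funding: Supported by the Xuzhou Applied Basic Research Program (KC18069).
Author biography: LIU Peng (born 1992), male, from Xuzhou, Jiangsu, master's student. Corresponding author: YE Bin (born 1980), E-mail: yebin@cumt.edu.cn
Last Update: 2020-04-28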