Journal of Inorganic Materials ›› 2022, Vol. 37 ›› Issue (12): 1311-1320. DOI: 10.15541/jim20220149

• RESEARCH ARTICLE •

Detection Method on Data Accuracy Incorporating Materials Domain Knowledge

SHI Siqi1,2,5, SUN Shiyu1, MA Shuchang3, ZOU Xinxin3, QIAN Quan3,4,5, LIU Yue3,4,5

    1. Materials Genome Institute, Shanghai University, Shanghai 200444, China
    2. School of Materials Science and Engineering, Shanghai University, Shanghai 200444, China
    3. School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China
    4. Shanghai Engineering Research Center of Intelligent Computing System, Shanghai University, Shanghai 200444, China
    5. Zhejiang Laboratory, Hangzhou 311100, China
  • Received: 2022-03-21 Revised: 2022-05-06 Published: 2022-12-20 Online: 2022-05-27
  • Contact: LIU Yue, professor. E-mail: yueliu@shu.edu.cn
  • About author: SHI Siqi (1978-), male, PhD, professor. E-mail: sqshi@shu.edu.cn
  • Supported by:
    National Key Research and Development Program of China (2021YFB3802101); National Natural Science Foundation of China (52073169); Key Research Project of Zhejiang Laboratory (2021PE0AC02)

Abstract:

Materials data are typically small in sample size, high in dimension, and noisy, so machine learning models built on them often produce results that are inconsistent with the judgment of domain experts. Embedding materials domain knowledge throughout the machine learning process is one solution to this problem. Because the accuracy of materials data directly affects the reliability of data-driven materials performance prediction, this work focuses on the data preprocessing stage of the machine learning workflow and proposes a data accuracy detection method that incorporates materials domain knowledge. First, a materials domain knowledge database is constructed from the knowledge of materials experts. Second, this database is coordinated with data-driven accuracy detection methods so that, from both the data and the domain knowledge perspectives, three checks are performed: single-dimensional data accuracy detection based on rules for descriptor values, multi-dimensional data correlation detection based on rules for correlations between descriptors, and full-dimensional data reliability detection based on a multi-dimensional similar-sample identification strategy. The anomalous data identified at each stage are corrected using materials domain knowledge. Domain knowledge is thereby incorporated into the whole detection process, which ensures high accuracy of the dataset from the initial stage. Finally, experiments on a dataset for predicting the activation energy of NASICON-type solid electrolytes demonstrate that the method effectively identifies anomalous data and corrects them reasonably. Compared with the original dataset, the revised dataset improves the prediction accuracy of all six machine learning models to varying degrees, with R² improving by 33% on the best-performing model.
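
The three detection stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the descriptor names (`occupancy`, `bottleneck_A`, `activation_energy_eV`), the value ranges, the expected negative correlation, and the tolerances are all hypothetical placeholders standing in for rules that a materials expert would supply, and the sample values are invented purely for illustration.

```python
from itertools import combinations
from math import sqrt

# Hypothetical value rules: descriptor -> (min, max) allowed by domain knowledge.
VALUE_RULES = {"activation_energy_eV": (0.0, 2.0), "occupancy": (0.0, 1.0)}

def check_value_rules(samples, rules):
    """Stage 1: return (row index, descriptor) pairs that violate a value rule."""
    return [(i, name)
            for i, row in enumerate(samples)
            for name, (lo, hi) in rules.items()
            if not lo <= row[name] <= hi]

def pearson(xs, ys):
    """Pearson correlation coefficient (assumes neither column is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def check_correlation_rule(samples, a, b, expected_sign):
    """Stage 2: compare the observed correlation sign of a and b with a rule."""
    r = pearson([s[a] for s in samples], [s[b] for s in samples])
    return (r > 0) == (expected_sign > 0), r

def find_conflicting_pairs(samples, features, target, feat_tol, target_tol):
    """Stage 3: flag pairs with near-identical descriptors but distant targets."""
    conflicts = []
    for (i, si), (j, sj) in combinations(enumerate(samples), 2):
        dist = sqrt(sum((si[f] - sj[f]) ** 2 for f in features))
        if dist <= feat_tol and abs(si[target] - sj[target]) > target_tol:
            conflicts.append((i, j))
    return conflicts

# Invented toy data, for illustration only.
samples = [
    {"occupancy": 0.50, "bottleneck_A": 2.10, "activation_energy_eV": 0.40},
    {"occupancy": 0.50, "bottleneck_A": 2.10, "activation_energy_eV": 0.95},
    {"occupancy": 1.30, "bottleneck_A": 2.30, "activation_energy_eV": 0.30},
    {"occupancy": 0.80, "bottleneck_A": 2.50, "activation_energy_eV": 0.25},
]

value_violations = check_value_rules(samples, VALUE_RULES)
rule_holds, r = check_correlation_rule(
    samples, "bottleneck_A", "activation_energy_eV", expected_sign=-1)
conflicts = find_conflicting_pairs(
    samples, ["occupancy", "bottleneck_A"], "activation_energy_eV",
    feat_tol=0.05, target_tol=0.3)

print(value_violations, rule_holds, round(r, 3), conflicts)
```

In this sketch the flagged rows and pairs are only reported; in the method summarized above, the correction of each anomaly is guided by domain knowledge rather than automated, so that step is deliberately left out.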

Key words: machine learning, materials science, data quality, domain knowledge
