In high dimensions, data becomes increasingly sparse, and conventional outlier detection methods lose their effectiveness. I will discuss how two open source Python libraries can be combined into an application that reveals outliers in high-dimensional, noisy data from ASML sensors.
Statistical outlier mining depends on identifying a contrast between inliers and outliers. As the dimensionality of the data increases, this contrast shrinks owing to the curse of dimensionality: pairwise distances concentrate, and the data points become nearly equidistant. One way to mitigate this is to apply dimensionality reduction.
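The loss of contrast can be illustrated with a small experiment. The sketch below is only an illustration on synthetic uniform data, not the ASML sensor data; the function name `relative_contrast` and the sample sizes are assumptions chosen for the example. It measures how much farther the farthest point is from a query than the nearest one, relative to the nearest distance.

```python
import math
import random


def relative_contrast(n_points, n_dims, seed=0):
    """Return (d_max - d_min) / d_min for one random query point.

    Points and query are drawn uniformly from the unit hypercube
    (synthetic data, used here only for illustration).
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        dists.append(math.dist(query, point))  # Euclidean distance
    return (max(dists) - min(dists)) / min(dists)


# Compare the contrast at a few dimensionalities.
contrasts = {d: relative_contrast(500, d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"dims={d:5d}  relative contrast={c:.3f}")
```

In two dimensions the nearest point is far closer than the farthest, so the ratio is large; at a thousand dimensions the nearest and farthest points lie at almost the same distance, which is why distance-based outlier scores lose their meaning.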
Here we build an application, using open source Python packages, to detect outliers in the noisy data collected by ASML sensors. We first use a neighbor-graph-based algorithm to embed the high-dimensional data into a low-dimensional (2-3 dimensions) space, which also makes the data easy to visualize. We then cluster the embedded data to reveal the distinct types of outliers. The entire pipeline is unsupervised.
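The clustering step can be sketched in plain Python. The abstract does not name the libraries or the clustering algorithm used, so the code below swaps in a minimal DBSCAN-style density clustering as a stand-in, and it runs on hypothetical 2-D points that merely imitate an embedding: one large inlier group and two small outlier groups. The function name `dbscan` and the parameters `eps` and `min_pts` are assumptions for the example.

```python
import math
import random
from collections import Counter


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: one label per point, -1 meaning noise."""
    n = len(points)
    # Brute-force eps-neighborhoods (a point counts as its own neighbor).
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # provisional noise, may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point
            if labels[j] is not None:
                continue  # already assigned, do not expand again
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # j is a core point, keep growing
    return labels


# Hypothetical "embedded" data: one inlier blob and two outlier groups.
rng = random.Random(42)
embedded = [(rng.gauss(0, 0.3), rng.gauss(0, 0.3)) for _ in range(200)]
embedded += [(rng.gauss(5, 0.2), rng.gauss(5, 0.2)) for _ in range(15)]
embedded += [(rng.gauss(-5, 0.2), rng.gauss(4, 0.2)) for _ in range(15)]

labels = dbscan(embedded, eps=0.8, min_pts=5)
sizes = Counter(label for label in labels if label != -1)
inlier_label, inlier_size = sizes.most_common(1)[0]
outlier_groups = {lab: n for lab, n in sizes.items() if lab != inlier_label}
print("inlier cluster size:", inlier_size)
print("candidate outlier groups:", outlier_groups)
```

Treating the largest cluster as the inliers and every smaller cluster as a distinct outlier type is a design choice made for this sketch; in practice an expert would inspect each group before labeling it.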
The application provides valuable clues to an expert user trying to assess the sanity of the data. Correlating the physical pattern of the detected outliers with the location or time stamp of the measurements helps determine why the measurements failed.