Anomaly Detection with Autoencoder
Debasish Deb

Dataset
• 984 files
• Each file contains 20,480 rows and 4 columns

The IMS-Rexnord Bearing Data includes three datasets, each describing a test-to-failure experiment. The datasets consist of 1-second vibration signal snapshots recorded at specific intervals, with each file containing 20,480 points sampled at 20 kHz. Using the second test dataset, we derived one data point every 10 minutes by taking the mean absolute value of all data in each file.

* The data was generated by the NSF I/UCR Center for Intelligent Maintenance Systems with support from Rexnord Corp.

Input data timeline (figure)

Dense Autoencoder
• The model consists of four fully connected layers.
• The input layer has 3 nodes with the LeakyReLU activation function.
• The hidden layers have 2 and 3 nodes respectively, also with the LeakyReLU activation function.
• The output layer has the same number of nodes as the input layer and uses a linear activation function.
• The model is compiled with the Adam optimizer, the mean squared error loss function, and the R-squared metric.
• The model is trained on the training data for 100 epochs with a batch size of 10 and early stopping.
• The data is shuffled, and 5% of it is used for validation during training.

LSTM Autoencoder
• A sequence-to-sequence autoencoder is built from LSTM layers in Python using the Keras library.
• The encoder reduces the 3D input tensor to a lower-dimensional representation using two LSTM layers, while the decoder reconstructs the original input from that representation using one RepeatVector layer and two LSTM layers.
• The model is defined by the create_model function, which takes a 3D input tensor X and returns a Keras model object. The output layer uses the TimeDistributed wrapper to apply a dense layer with input_dim units to each time step of the sequence.
• The model is compiled using the Adam optimizer and the mean squared error loss function.
Additionally, a custom metric called r_square is defined and included in the metrics list.
• The autoencoder model is trained using the fit function with the training data and a validation split of 5%.

Reshape data from 2D to 3D
An LSTM expects its input in a 3D format [samples, timesteps, features]. Here, we have two datasets, train_data and test_data. Using NumPy, we reshape both with the reshape() function. For instance, train_data is reshaped into train_X with dimensions [samples, timesteps, features]: the first dimension is the number of samples, the second is the number of time steps, and the third is the number of features. We read the dimensions of the original dataset from its shape attribute and pass them as arguments to reshape(). Finally, we print the shapes of the reshaped train_X and test_X arrays using the shape attribute.

Distribution of loss
• We can use the autoencoder to reconstruct the input data and compute the Mean Absolute Error (MAE) between the predicted and actual values.
• Here, we have trained the autoencoder on the training data and visualized the loss distribution using a histogram.
• The train_pred variable contains the training data as reconstructed by the trained autoencoder model.
• We calculate the MAE loss between the predicted and actual training data and store it in the train_result variable.
• We plot the loss distribution as a histogram with the Seaborn library.
• The histogram shows the distribution of the MAE loss values, with the x-axis representing the loss values and the y-axis representing the frequency of occurrence.
• We have limited the x-axis to values between 0 and 0.02 and the y-axis to values between 0 and 200 for better visualization.
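
The reshaping and loss computation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's code: the arrays are random stand-ins for train_data and test_data, four features are assumed from the four columns per file, one time step per sample is assumed, and fake_reconstruct is a hypothetical placeholder for the trained model's predict call.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins for the 2D datasets [samples, features] (assumption).
train_data = rng.random((100, 4))
test_data = rng.random((50, 4))

# Reshape to [samples, timesteps, features], assuming one time step per sample.
train_X = train_data.reshape(train_data.shape[0], 1, train_data.shape[1])
test_X = test_data.reshape(test_data.shape[0], 1, test_data.shape[1])
print(train_X.shape, test_X.shape)


def fake_reconstruct(X):
    """Hypothetical placeholder for model.predict(X): returns X plus small noise."""
    return X + rng.normal(scale=0.005, size=X.shape)


train_pred = fake_reconstruct(train_X)

# Flatten back to 2D, then take the mean absolute error per sample across features.
pred_2d = train_pred.reshape(train_pred.shape[0], train_pred.shape[2])
true_2d = train_X.reshape(train_X.shape[0], train_X.shape[2])
train_result = np.mean(np.abs(pred_2d - true_2d), axis=1)

# Histogram counts over the 0 to 0.02 range used for the plot's x-axis.
counts, bin_edges = np.histogram(train_result, bins=20, range=(0.0, 0.02))
```

With Seaborn available, train_result could be passed to a histogram plot and the axes limited as described; plotting is left out here to keep the sketch light on dependencies.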
Anomaly Detection
Anomaly detection on test data using the autoencoder

Step 1: Predict the test data using the trained autoencoder model
Step 2: Calculate the mean absolute error (MAE) between the predicted and original test data
Step 3: Define a threshold value to mark a data point as anomalous
Step 4: Create a DataFrame with the calculated loss MAE, threshold, and anomaly status for each data point in the test set
Step 5: View the last 500 data points in the test set and their anomaly status in the created DataFrame

Anomalies are marked as True if their loss MAE is greater than the defined threshold, and False otherwise.
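
The five steps above might look like the following sketch. Everything specific in it is an assumption made for illustration: the data is synthetic, the reconstruction is simulated rather than produced by a trained model, the THRESHOLD value of 0.01 is hypothetical, and the column names loss_mae, threshold, and anomaly are chosen here rather than taken from the original code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic stand-in for the test set [samples, features] (assumption).
test_data = rng.random((600, 4)) * 0.1

# Step 1 (simulated): placeholder for the model's reconstruction of the test data.
test_pred = test_data + rng.normal(scale=0.002, size=test_data.shape)
# Shift the last 20 reconstructions to mimic poor reconstruction near failure.
test_pred[-20:] += 0.05

# Step 2: mean absolute error per data point across features.
loss_mae = np.mean(np.abs(test_pred - test_data), axis=1)

# Step 3: hypothetical threshold; in practice it would be read off the
# training-loss distribution.
THRESHOLD = 0.01

# Step 4: collect loss, threshold, and anomaly flag in one DataFrame.
result = pd.DataFrame({"loss_mae": loss_mae})
result["threshold"] = THRESHOLD
result["anomaly"] = result["loss_mae"] > result["threshold"]

# Step 5: inspect the last 500 data points.
print(result.tail(500))
```

Storing the threshold as a column keeps each row self-describing, which makes it easy to plot the loss against the threshold over time.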