Hi,

The concerns about the augmentation techniques have been reviewed. I modified the code so that the original images are also kept when augmentation=True is set, and I make sure the same augmentation is applied to the corresponding labels, i.e. table_mask and column_mask. I think it would be good to lay down the preprocessing steps, so that if there are any discrepancies they will be easy for you to spot and correct.

Mask Generation:
1. To create the labels I initialize an array with np.zeros and dtype np.uint8, so the range of values is [0, 255].
2. Using the coordinates of every table in the document (x_min, x_max, y_min, y_max), extracted from the .xml files, I fill the initialized array with 255.
3. Hence, the image segmentation task becomes a two-class classification.

Preprocessing:
1. For table detection, or detection of the table's structure, the colors of the image can be treated as noise.
2. To eliminate the color, the input image is converted to grayscale and then back to RGB.
3. To improve the contrast of the image, histogram equalization is applied.
4. Finally, the image is resized to (1024, 1024, 3).

All images are stored in .jpeg format before being sent into the network.

Architecture:
1. There are three upsampling steps in each of the decoder branches.
2. Except for the last Conv2DTranspose, I use 128 filters with no activation.
3. For the last Conv2DTranspose, which is the output of the branch, I use 1 filter with sigmoid activation. We want the probability of every pixel belonging to class 1 for calculating binary_crossentropy, so the output shape is (None, 1024, 1024, 1).

For the loss of each branch I used binary_crossentropy, but you suggested trying out sparse_categorical_crossentropy:
1. binary_crossentropy: It is analogous to log loss. The labels must be either 1 or 0, and the final output must be the probability of each pixel belonging to class 1; hence the activation is sigmoid.
2. sparse_categorical_crossentropy: It is the same as categorical_crossentropy, but the labels must be in integer format (0, 1, 2, and so on). The final output must be the probability of each of the two classes separately for every pixel; hence the activation is softmax with 2 filters.

Since this is not a multi-class classification task, I'm sticking with binary_crossentropy.

Extensions:
In order to extend the case study I also used DenseNet121, ResNet50, and MobileNet_v2 as the feature extractor. Among them DenseNet121 proved best. I took two sample examples and plotted the prediction of each feature extractor; DenseNet121 seems to work best even for documents containing a larger number of tables. In the research paper, when TableNet was fine-tuned on the ICDAR dataset, the F1 score is 0.95 for table mask detection and 0.90 for column mask detection; with DenseNet121 I got 0.7885 and 0.6547 respectively on the Marmot validation data. Based on this, I'm going to continue the case study with the DenseNet121 feature extractor alone.
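The mask-generation steps described above can be sketched roughly as follows (make_mask and the (x_min, x_max, y_min, y_max) box tuples are illustrative names for what is parsed from the .xml annotations, not the actual code):

```python
import numpy as np

def make_mask(height, width, boxes):
    """Build a binary segmentation label for one document page.

    boxes: list of (x_min, x_max, y_min, y_max) table (or column)
    coordinates extracted from the .xml annotation file.
    """
    # np.uint8 keeps pixel values in [0, 255], matching the image format
    mask = np.zeros((height, width), dtype=np.uint8)
    for x_min, x_max, y_min, y_max in boxes:
        # foreground (table / column) pixels are filled with 255
        mask[y_min:y_max, x_min:x_max] = 255
    return mask
```

The same function serves both table_mask and column_mask; only the boxes differ, which is what makes the task a two-class (foreground vs. background) segmentation.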
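The four preprocessing steps (grayscale, histogram equalization, back to RGB, resize) can be sketched with Pillow; this is a minimal illustration of the pipeline, not the actual implementation:

```python
import numpy as np
from PIL import Image, ImageOps

def preprocess(img: Image.Image, size=(1024, 1024)) -> np.ndarray:
    """Color removal + contrast enhancement + resize, per the steps above."""
    gray = img.convert("L")           # drop color information (treated as noise)
    gray = ImageOps.equalize(gray)    # histogram equalization to improve contrast
    rgb = gray.convert("RGB")         # back to 3 channels for the pretrained encoder
    rgb = rgb.resize(size)            # final shape (1024, 1024, 3)
    return np.asarray(rgb)
```

Converting grayscale back to RGB keeps the 3-channel input shape that ImageNet-pretrained feature extractors such as DenseNet121 expect, while all three channels carry identical (color-free) values.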
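One decoder branch as described in the Architecture section could look like the sketch below. It assumes, for illustration only, that the shared encoder emits a 128 x 128 feature map with 256 channels, so that three stride-2 Conv2DTranspose layers reach 1024 x 1024; the actual channel counts and skip connections in the code may differ:

```python
import tensorflow as tf
from tensorflow.keras import layers

def decoder_branch(feature_map):
    """One decoder branch (table or column): three upsampling steps."""
    # first two upsamplings: 128 filters, no activation
    x = layers.Conv2DTranspose(128, 3, strides=2, padding="same")(feature_map)  # -> 256
    x = layers.Conv2DTranspose(128, 3, strides=2, padding="same")(x)            # -> 512
    # last upsampling: 1 filter + sigmoid -> per-pixel P(class 1)
    return layers.Conv2DTranspose(1, 3, strides=2, padding="same",
                                  activation="sigmoid")(x)                      # -> 1024

inp = layers.Input((128, 128, 256))
model = tf.keras.Model(inp, decoder_branch(inp))
```

The sigmoid output of shape (None, 1024, 1024, 1) is exactly what binary_crossentropy expects, one probability per pixel.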
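To make the loss comparison concrete: with only two classes, sigmoid + binary_crossentropy and a 2-way softmax + sparse_categorical_crossentropy compute the same per-pixel log loss. A small NumPy check (toy values, not project code):

```python
import numpy as np

def binary_crossentropy(y_true, p, eps=1e-7):
    """Per-pixel log loss; p is the sigmoid output, P(pixel belongs to class 1)."""
    p = np.clip(p, eps, 1 - eps)
    return -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)).mean()

def sparse_categorical_crossentropy(y_true, probs, eps=1e-7):
    """probs has a trailing axis of size 2 (softmax over the two classes);
    y_true holds integer class ids (0 or 1)."""
    picked = np.take_along_axis(probs, y_true[..., None].astype(int), axis=-1)
    return -np.log(np.clip(picked, eps, 1.0)).mean()

# a toy 2x2 "mask" and prediction
y = np.array([[0, 1], [1, 0]], dtype=np.float64)
p = np.array([[0.1, 0.8], [0.6, 0.3]])
softmax_probs = np.stack([1 - p, p], axis=-1)  # class-0 and class-1 probabilities
```

Since the two losses coincide here, sticking with binary_crossentropy simply avoids doubling the output channels.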