Responses to Reviewer Comments

Title: A New Method for Multi-Oriented Graphics-Scene-3D Text Classification in Video (PR-D-14-010304R1)

Reviewer #1

Comment 1.1: Still the motivation of the paper is not convincing. However, it can be a contribution to the enhancement of performance in text detection research if the text types are classified. It would be much better if samples of scene/graphics/2D/3D texts which are difficult to detect or recognize by different types of methods were given in the introductory section. For example, it would be very convincing if there were a 3D text which cannot be handled by existing 2D detectors. Just a little bit of difficulty is not enough to support the motivation of the paper. Examples that strongly resist different types of detectors would provide a strong motivation for the research.

Response 1.1: Thank you very much for your comment and suggestion to improve the quality of the paper. Since capturing graphics and 2D scene texts is common, most existing text detection and binarization methods target graphics or 2D scene texts rather than 3D scene texts. As a result, these methods work well for graphics and 2D scene texts but give poor results for 3D scene texts. To expand the ability to detect and recognize 3D scene texts, the classification of graphics-scene texts and of 2D-3D scene texts is needed; this is the main motivation of our work. As per your suggestion, we have illustrated the inability of different existing text detection and recognition methods to handle graphics, 2D scene and 3D scene text images in Fig. 1 and Fig. 2 in the Introduction Section of the revised manuscript. It is clear from Fig. 1 and Fig. 2 that the existing methods give reasonably good results for graphics and 2D scene texts but poor results for 3D scene texts. We have also modified the content of the Introduction Section according to the new Fig. 1 and Fig. 2 in the revised manuscript.

Reviewer #2

Comment 2.1: While the description becomes clearer, it raises another issue: scene text and graphics text could both be 2D and 3D. From the description in the abstract and in the introduction, it seems that 2D and 3D are entirely different categories. Why is it so? I would suggest a proper justification for this. The authors call it 3D scene text; why are these separate categories?

Response 2.1: Thank you very much for your comment. We believe that, with the advent of advanced technologies, one day 3D TVs will show either 3D graphics texts or 2D/3D natural scene texts, much as analog TVs have been replaced by smart TVs in the past decade. In this situation, we can expect both 3D graphics texts and 3D scene texts in video. However, due to the non-availability of 3D news video at present, the current scope of the work is limited to 2D graphics texts; 3D graphics texts are not discussed in the present work. We have added suitable explanations to the Introduction Section in the revised manuscript. As a result, we can expect both 2D and 3D text in the case of scene text but only 2D in the case of graphics text, as the latter is edited while the former is part of natural images. Because of depth information, a 3D text shows a 3D effect, which results in extra edges and covers more background information. In comparison, a 2D scene text looks flat and produces no extra edges. When a text produces extra edges, the shape of the character is affected. Since all the existing text detection and recognition methods depend directly or indirectly on shape information, these methods give poor results for 3D text detection and recognition.
This is the main reason to develop a classification method, so that the conventional methods can be tuned to achieve good accuracies for both 2D and 3D scene texts appearing in real life. As suggested by Reviewer 1, we have also illustrated the inability of the existing text detection and recognition methods to handle 3D scene text in Fig. 1 and Fig. 2, which clearly explain the reason and motivation for our classification.

Comment 2.2: Serious observation regarding the presentation of the paper: (a) The presentation of equations 2, 3, 4 and 5 does not look good. (b) The flow chart needs a better presentation distinguishing the two steps by highlighting them. (c) Fig. 16: all columns need proper captions. (d) In general, the figures have one caption for all N images in a row; each sub-image requires a caption to better illustrate its purpose.

Response 2.2: Thanks for pointing out these mistakes in presentation. For (a), we have rewritten equations (2)-(5) in Section 3.3 in the revised manuscript. For (b), we have modified the flow chart to show clear steps for the classification of graphics-scene texts and 2D-3D scene texts in Section 3 in the revised manuscript. For (c), we have included captions for each column of Fig. 16 in the revised manuscript. For (d), we have corrected such mistakes throughout the revised manuscript.

Comment 2.3: Thanks a lot for making the changes; now the motivation is very clear. Please see Comment 1 to justify the categorization.

Response 2.3: Thanks for your compliment. The justification of the categories is illustrated in Fig. 1 and Fig. 2 in the Introduction Section in the revised manuscript. Please also refer to the response to Comment 1.1.

Comment 2.4: Thanks a lot for the illustrations.

Response 2.4: Thank you very much.

Comment 2.5: Thanks for the explanation.

Response 2.5: Thank you very much.

Comment 2.6: Thanks for the explanation.
However, it is still not clear which step(s) in your method boost(s) the detection of 3D scene text. The authors must show the important component of their approach that allows 3D scene text to be detected where other methods fail.

Response 2.6: Thank you for your comment. Text detection involves text candidate, potential text candidate and merging steps for extracting full texts in images. As discussed in Sections 3.2, 3.3 and 3.4, the method uses gradient information and k-means clustering for text candidate selection from the results of the static and dynamic clusters. The gradient gives high values for high-contrast pixels and low values for low-contrast pixels. Since the method extracts the gradient at the pixel level, there is no difference between 2D and 3D texts. Likewise, since k-means clustering is unsupervised, it separates high gradient values into one cluster and low gradient values into another; the cluster with high values is considered the text cluster. Similarly, for potential text candidates, the proposed method considers inter- and intra-character symmetry. This symmetry does not change as the text type changes because of the uniform spacing between characters. For extracting a full text, the proposed method looks for the nearest neighbor based on the geometrical properties of components. Since the spacing between characters is smaller than the spacing between text lines, the method works well irrespective of text type. Therefore, we can conclude that the gradient with k-means clustering, the symmetry between character components and the uniform spacing between character components are the important components for detecting text regardless of text type. Compared with the existing methods, such as Epstein et al. and others, which use edge information at the component level for text detection, the proposed method gives better results, as shown in Table 3. It is also evident from Table 13 that the existing methods give poor accuracies for 3D text compared to 2D text. This shows that the existing methods are not adequate to achieve good accuracies for both 2D and 3D texts. We have included one paragraph about the important components that work well irrespective of text type at the end of Section 3.4 in the revised manuscript.

Comment 2.7: In general, it seems that the method might not be applicable in every scenario, but only given an appropriate sequence of video frames which meets the prerequisites of the method. Videos generally exhibit temporal coherency, but no justification is provided for what happens when this fails.

Response 2.7: Thank you very much for raising this point. To our knowledge, video captures 25-30 frames per second, and a text appears on a few temporal frames at the same location, with or without slight movements, in order to be readable. For this work, a video containing texts is the input for text detection and classification; in this sense, the video generally provides a sequence of temporal information. If the video does not have temporal coherency and sequence, the iterative procedure presented in Section 3.1 still automatically determines the static and dynamic clusters based on the converging criterion, because it requires at least two temporal frames. This may lead to quick convergence. At the same time, the iterative procedure determines the number of frames used for the classification of clusters. If there is no temporal coherency, the iterative procedure checks all the 25-30 frames and then terminates. Since the features used by the iterative procedure are deviation-based at the pixel level, different text types do not affect the iterative procedure.
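The text-candidate step described in Response 2.6 (pixel-level gradient magnitude split into two clusters by unsupervised k-means, keeping the high-gradient cluster) can be sketched as follows. This is an illustrative Python sketch, not the authors' implementation: the function name, the seeding of the two centers at the minimum and maximum magnitudes, and the fixed iteration count are our own assumptions.

```python
import numpy as np

def text_candidate_mask(gray, iters=10):
    """Cluster pixels into text candidates by gradient magnitude.

    Illustrative sketch of the described idea (gradient + 2-class
    k-means), not the paper's exact implementation: the cluster with
    the higher mean gradient is taken as the text-candidate cluster.
    """
    gy, gx = np.gradient(gray.astype(float))
    mag = np.hypot(gx, gy).ravel()

    # Simple 1-D k-means with k=2, seeded at the min and max magnitude.
    c = np.array([mag.min(), mag.max()], dtype=float)
    for _ in range(iters):
        labels = np.abs(mag[:, None] - c[None, :]).argmin(axis=1)
        for k in (0, 1):
            if np.any(labels == k):
                c[k] = mag[labels == k].mean()

    text_cluster = int(c.argmax())          # high-gradient cluster
    return (labels == text_cluster).reshape(gray.shape)
```

Because the assignment operates on per-pixel magnitudes only, the same split applies whether the high-contrast pixels come from 2D or 3D text, which is the point made in the response.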
We have included a few sentences at the end of Section 3.1 in the revised manuscript.

Comment 2.8: Overall, the method considers videos. A general comment is that the method might fail because it depends on some unrealistic assumptions. The assumptions are: First, a keyframe is selected and then the temporal frames are used to refine the clusters. If the text which appears in the keyframe does not appear anymore, how is the performance affected? The method depends heavily on the similarity of the keyframe to the neighboring frames. Second, the direction of the spatial neighborhood has not been defined, whether it is bidirectional or unidirectional; if it is unidirectional, the instability might increase. Finally, what are the limits for the method to be stable, i.e., the number of required frames? Is there any bound?

Response 2.8: Thank you very much for your comment. As discussed in Response 2.7, the input for this work is a video that contains text information. Since the scope of the work is the classification of graphics-scene and 2D-3D scene texts, the method considers a video which lasts one second as the input. As a result, the method considers the first frame as the keyframe and uses a unidirectional search over the neighboring temporal frames. It is true that selecting a keyframe and ensuring temporal coherency is not so easy; this is beyond the scope of our work, and we will consider your suggestion in future work. There are methods available in the literature for text frame identification from a large number of frames in video, and we can utilize them for this purpose. The iterative procedure presented in Section 3.1 determines the number of frames automatically based on the converging criterion. Since an input video lasts one second, the iterative procedure may search all of the 30 frames if the text moves arbitrarily, because the iterative procedure expects unidirectional text motion with only slight movements.
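The iterative procedure discussed in Responses 2.7 and 2.8 (adding temporal frames until a converging criterion is met, bounded by roughly one second of video, then classifying pixels into static and dynamic clusters by their deviation) can be sketched as follows. This is a hedged illustration only: `tol`, the mean-deviation convergence test and the thresholding of the deviation map are our own assumptions, not the paper's exact criterion.

```python
import numpy as np

def static_dynamic_clusters(frames, tol=1.0, max_frames=30):
    """Iteratively add temporal frames until the per-pixel deviation
    stabilises, then split pixels into static / dynamic clusters.

    Sketch of the described behaviour: at least two frames are
    required; at most `max_frames` (about one second of video) are
    examined before the procedure terminates.
    """
    frames = np.asarray(frames, dtype=float)
    prev = None
    for n in range(2, min(max_frames, len(frames)) + 1):
        dev = frames[:n].std(axis=0)        # per-pixel temporal deviation
        if prev is not None and abs(dev.mean() - prev) < tol:
            break                           # converged: deviations stable
        prev = dev.mean()
    static = dev <= dev.mean()              # low deviation -> static cluster
    dynamic = ~static                       # high deviation -> dynamic cluster
    return static, dynamic, n               # n = frames actually used
```

Because the deviation is computed per pixel rather than from edge or shape features, the split is unaffected by whether the underlying text is graphics, 2D scene or 3D scene text, which is the property the responses rely on.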
We have added suitable changes to Section 3.1, Section 3.7 and Section 5 in the revised manuscript.

Comment 2.9: An overall question regarding the paper: the clusters which the authors show in the illustrations could be obtained by an appropriate edge detection, which could further be used for the selection of text and non-text. The authors need to provide a valid justification that applying their method without clustering on the edge image is better when the temporal classification is exploited.

Response 2.9: Thank you very much for your comment and suggestions. We agree that we could explore edge properties for obtaining static and dynamic clusters with the help of temporal information, which could then be used for classification. However, as discussed in the Introduction Section, edge properties may not work well for classifying static and dynamic clusters because of the effect of 3D texts and the variations in scene texts. Therefore, we have proposed using the deviation of each pixel for obtaining static and dynamic clusters, which works well for any type of text. In this way, the static and dynamic clusters are classified with the help of the deviation.

Comment 2.10: In a nutshell, this paper still needs to go through a major revision.

Response 2.10: Thank you very much for your valuable suggestions, which have helped us greatly improve both the quality and the clarity of the paper. According to your comments and suggestions, we have modified the paper, and the changes can be seen in the revised manuscript.