Response-R2

Responses to Reviewer Comments
Title: A New Method for Multi-Oriented Graphics-Scene-3D Text Classification in Video
(PR-D-14-010304R1)
Reviewer#1
Comment 1.1: Still the motivation of the paper is not convincing. However, it can be a contribution to the
enhancement of performance in text detection research if the text types are classified. It would be much
better if samples of scene/graphics/2D/3D texts which are difficult to be detected or recognized by
different types of methods are given in the introductory section. For example, it would be very convincing
if there is a 3D text which cannot be handled with existing 2D detectors. Just a little bit of difficulty is not
enough for supporting the motivation of the paper. Such strongly resisting to different types of detectors
will provide a strong motivation of the research.
Response 1.1: Thank you very much for your comment and suggestion to improve the quality of the
paper. Since capturing graphics and 2D scene texts is common, most of the existing text detection and
binarization methods target either graphics or 2D scene texts rather than 3D scene texts. As a result, the
developed methods work well for graphics and 2D scene texts but give poor results for 3D scene texts. In
order to expand the ability to detect and recognize 3D scene texts, there is a need for the classification of
graphics-scene texts and 2D-3D scene texts. This is the main motivation of our work. As per your
suggestion, we have illustrated the inability of different existing text detection and recognition methods
on graphics, 2D scene and 3D scene text images in Fig. 1 and Fig. 2 in the Introduction Section of the
revised manuscript. It is clear from Fig. 1 and Fig. 2 that the existing text detection and recognition
methods give reasonably good results for graphics and 2D scene texts but poor results for 3D scene
texts. We have also modified the content in the Introduction Section according to the new Fig. 1 and Fig.
2 in the revised manuscript.
Reviewer#2
Comment 2.1: While the description becomes clearer, it raises another issue, the scene text and graphics
both could be 2D and 3D. From the description, in the abstract and in the introduction, it seems that 2D
and 3D are entirely different categories. why is it so? I would suggest a proper justification for this.
Authors mentioned it 3D scene text, why are these separate categories?
Response 2.1: Thank you very much for your comment. We believe that, with the advent of advanced
display technologies, 3D TVs showing either 3D graphics texts or 2D/3D natural scene texts will become
common in the near future, much as analog TVs have been replaced by smart TVs over the past decade.
In that situation, we can expect both 3D graphics texts and 3D scene texts in video. However, due to the
non-availability of 3D news video at present, the current scope of the work is limited to 2D graphics texts;
3D graphics texts are not discussed in the present work. We have added suitable explanations to the
Introduction Section in the revised manuscript. As a result, we can expect both 2D and 3D text only in
the case of scene text and only 2D text in the case of graphics text, as the latter is edited while the former
is a part of natural images.
Because of depth information, a 3D text shows a 3D effect, which produces extra edges and includes
more background information. In contrast, a 2D scene text looks flat and does not produce such extra
edges. When a text produces extra edges, the shape of the characters is affected. Since all the existing text
detection and recognition methods depend directly or indirectly on shape information, these methods give
poor results for 3D text detection and recognition. This is the main reason to develop a classification
method, so that the conventional methods can be tuned to achieve good accuracies for both 2D and 3D
scene texts appearing in real life. As suggested by Reviewer 1, we have also illustrated the inability of the
existing text detection and recognition methods on 3D scene text in Fig. 1 and Fig. 2, which clearly
explain the reason and motivation for our classification.
Comment 2.2: Serious Observation regarding the presentation of the paper:
a- Presentation of the equations 2,3,4, and 5; they look not so good.
b- Flow chart needs a better presentation distinguishing the both steps by highlighting them.
c- Fig 16; all columns need proper captions
d- In general, the figures are having one caption for all 'N' images in a row, each sub-image requires
a caption to better illustrate the purpose.
Response 2.2: Thanks for pointing out these presentation issues. For (a), we have rewritten equations
(2)-(5) in Section 3.3 in the revised manuscript. For (b), we have modified the flow chart to clearly
distinguish the steps for the classification of graphics-scene texts and 2D-3D scene texts in Section 3 in
the revised manuscript. For (c), we have included captions for each column of Fig. 16 in the revised
manuscript. For (d), we have taken care of such mistakes throughout the revised manuscript.
Comment 2.3: Thanks a lot for making the changes, now the motivation is very clear. Please see
comment 1 to justify the categorization.
Response 2.3: Thanks for your compliment. The justification of the categories is illustrated in Fig. 1 and
Fig. 2 in the Introduction Section in the revised manuscript. Please also refer to the response to Comment 1.1.
Comment 2.4: Thanks a lot for the illustrations.
Response 2.4: Thank you very much.
Comment 2.5: Thanks for the explanation.
Response 2.5: Thank you very much.
Comment 2.6: Thanks for the explanation. However, it is still not clear which step(s)
in your methods boost(s) the detection of 3D scene text. The authors must show what is the important
component in their approach which allows us to detect 3D scene text which other methods are unable to
detect.
Response 2.6: Thank you for your comment. Text detection involves text candidate selection, potential
text candidate selection and merging steps for extracting full text lines in images. As discussed in
Sections 3.2, 3.3 and 3.4, the method uses gradient information and k-means clustering for text candidate
selection from the results of the static and dynamic clusters. The gradient provides high values for
high-contrast pixels and low values for low-contrast pixels. Since the method extracts the gradient at the
pixel level, there is no difference between 2D and 3D texts at this stage. In the same way, since k-means
clustering is unsupervised, it classifies high gradient values into one cluster and low gradient values into
another cluster. The cluster which gives high values is considered as the text cluster. Similarly, for
potential text candidates, the proposed method considers the symmetry between intra-character and
inter-character components. This symmetry does not change with text type because of the uniform
spacing between characters. For extracting a full text line, the proposed method looks for the nearest
neighbor based on geometrical properties of components. Since the spacing between characters is smaller
than the spacing between text lines, the method works well irrespective of text type. Therefore, we can
conclude that the gradient with k-means clustering, the symmetry between character components and the
uniform spacing between character components are the important components for detecting text
regardless of text type. When we compare with the existing methods, such as Epstein et al. and others
which use edge information at the component level for text detection, the proposed method gives better
results, as shown in Table 3. It is also evident from Table 13 that the existing methods give poor
accuracies for 3D text compared to 2D text. This shows that the existing methods are not adequate to
achieve good accuracies for both 2D and 3D texts. We have included one paragraph about the important
components which work well irrespective of text type for text detection at the end of Section 3.4 in the
revised manuscript.
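For illustration only, a minimal Python sketch of the gradient and k-means based text candidate selection
described above is given below; the function name text_candidates, the use of scikit-learn's KMeans and
the higher-mean decision rule are assumptions made for this sketch, not the exact implementation reported
in the manuscript.

    # Hypothetical sketch of the gradient + k-means text candidate step; names
    # and parameters are illustrative, not the authors' code.
    import numpy as np
    from sklearn.cluster import KMeans

    def text_candidates(gray):
        """Return a binary mask of text candidate pixels for one grayscale frame."""
        # Pixel-level gradient magnitude: high for high-contrast (text-like)
        # pixels, low for low-contrast background, regardless of 2D/3D text type.
        gy, gx = np.gradient(gray.astype(np.float64))
        grad = np.sqrt(gx ** 2 + gy ** 2)

        # Unsupervised k-means (k = 2) separates high and low gradient values.
        km = KMeans(n_clusters=2, n_init=10, random_state=0)
        labels = km.fit_predict(grad.reshape(-1, 1)).reshape(grad.shape)

        # The cluster with the higher mean gradient is taken as the text cluster.
        means = [grad[labels == k].mean() for k in (0, 1)]
        return labels == int(np.argmax(means))

Because the decision is made per pixel on gradient magnitude alone, the same rule applies to 2D and 3D
texts, which is the point made above.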
Comment 2.7: In general, it seems that the method might not be applicable in any scenario, except, given
appropriate sequence of videos which meet the prerequisites of the method. It happens generally in the
videos that there exists a temporal coherency but what happens when it fails, no justification is provided.
Response 2.7: Thank you very much for raising this point. According to our knowledge, video captures
25-30 frames per second. We also know that a text appears on a few temporal frames at the same location,
with or without slight movement, in order to be readable. For this work, a video containing text is input
for text detection and classification. In this sense, video generally provides a sequence of temporal
information. If the video does not have temporal coherency, the iterative procedure presented in Section
3.1 still automatically determines static and dynamic clusters based on the converging criterion, because
it requires only two temporal frames at minimum. This may lead to quick convergence. At the same time,
the iterative procedure determines the number of frames used for the classification of clusters. If there is
no temporal coherency, the iterative procedure checks all the 25-30 frames and then terminates. Since the
features used for the iterative procedure are deviation-based at the pixel level, different text types do not
affect the iterative procedure. We have included a few sentences at the end of Section 3.1 in the revised
manuscript.
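For illustration only, the following Python outline sketches how such an iterative procedure could classify
pixels into static and dynamic clusters from the temporal frames; the deviation measure, the mean-based
threshold, the convergence tolerance and the 30-frame limit are assumptions made for this sketch rather
than the exact converging criterion of Section 3.1.

    # Illustrative outline of an iterative static/dynamic clustering over
    # temporal frames; parameters are assumptions, not the paper's criterion.
    import numpy as np

    def static_dynamic_clusters(frames, max_frames=30, tol=1e-3):
        """Classify keyframe pixels into static and dynamic clusters using
        per-pixel deviation accumulated over the temporal frames."""
        key = frames[0].astype(np.float64)       # first frame taken as the keyframe
        deviation = np.zeros_like(key)
        static = np.zeros(key.shape, dtype=bool)
        prev_static = None

        # At least two temporal frames are needed; the loop stops as soon as the
        # static/dynamic partition converges, or after all 25-30 frames are used.
        for t, frame in enumerate(frames[1:max_frames + 1], start=1):
            deviation += np.abs(frame.astype(np.float64) - key)
            mean_dev = deviation / t
            static = mean_dev < mean_dev.mean()  # low-deviation pixels -> static

            if prev_static is not None and np.mean(static != prev_static) < tol:
                break                            # converging criterion satisfied
            prev_static = static

        return static, ~static                   # static cluster, dynamic cluster

Since the deviation is computed per pixel, a video without strong temporal coherency simply exhausts the
available frames before terminating, as explained above.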
Comment 2.8: Overall, the method considers videos. A general comment is, this method might fail as it
depends on some unrealistic assumptions. The assumptions are
- First of all a keyframe has been selected, then the temporal frames are used to refine the clusters. If the
text which appears in the keyframe does not appear anymore, how the performance is affected. The
methods depend heavily on the similarity of the keyframe with other neighboring frames.
-Second, the direction of the spatial neighboring has not been defined, whether it is bidirectional or
unidirectional, if it is unidirectional--the instability might increase.
- Finally, what are the limits of the methods to be stable, the number of required frames? Is there any
bound?
Response 2.8: Thank you very much for your comment. As discussed in Response 2.7, the input for this
work is a video that contains text information. Since the scope of the work is the classification of
graphics-scene and 2D-3D scene texts, the method considers a video clip lasting one second as the input.
As a result, the method considers the first frame as the keyframe and uses a unidirectional search for the
neighboring temporal frames.
It is true that selecting a keyframe and ensuring temporal coherency is not easy; this is beyond the scope
of our work, and we will consider your suggestion in our future work. There are methods available in the
literature for text frame identification from a large number of frames in video, and we can utilize these
methods for text frame identification.
The iterative procedure presented in Section 3.1 determines the number of frames automatically based on
the converging criterion. Since an input video lasts one second, the iterative procedure may search all 30
frames if the text moves arbitrarily, because the iterative procedure expects unidirectional text movement
with only slight arbitrary motion. We have added suitable changes to Section 3.1, Section 3.7 and Section
5 in the revised manuscript.
Comment 2.9: Overall question regarding the paper: The clusters which authors have shown in the
illustrations, could be obtained by an appropriate edge detection, which could further be used for the
selection of text and non-text. The authors need to provide a valid justification that applying their method
without clustering on the edge image is better when we exploit the temporal classification.
Response 2.9: Thank you very much for your comment and suggestion. We agree that edge properties
could be explored for obtaining static and dynamic clusters with the help of temporal information, which
could then be used for classification. However, as discussed in the Introduction Section, edge properties
may not work well for classifying static and dynamic clusters due to the effect of 3D texts and the
variations in scene texts. Therefore, we have proposed the deviation of each pixel for obtaining static and
dynamic clusters, which works well for any type of text. In this way, the static and dynamic clusters are
classified with the help of the deviation.
Comment 2.10: In a nutshell, this paper still needs to go through a major revision.
Response 2.10: Thank you very much for your valuable suggestions, which have greatly helped us
improve both the quality and clarity of the paper. According to your comments and suggestions, we have
modified the paper, and the changes can be seen in the revised manuscript.