Paper Number: 021102
An ASAE Meeting Presentation

Measurement of 3-D Locations of Fruit by Binocular Stereo Vision for Apple Harvesting in an Orchard

Teruo Takahashi, Dr., Hirosaki University, 3 Bunkyo-cho, Hirosaki, Japan, teruo@cc.hirosaki-u.ac.jp
Shuhuai Zhang, Dr., Hirosaki University, 3 Bunkyo-cho, Hirosaki, Japan, zhang@cc.hirosaki-u.ac.jp
Hiroshi Fukuchi, Hirosaki University, 3 Bunkyo-cho, Hirosaki, Japan, fukuchi@cc.hirosaki-u.ac.jp

Written for presentation at the 2002 ASAE Annual International Meeting / CIGR XVth World Congress
Sponsored by ASAE and CIGR
Hyatt Regency Chicago, Chicago, Illinois, USA, July 28-July 31, 2002

Abstract. This paper describes the results of measuring the 3-D locations of fruit by binocular stereo vision for apple harvesting in an orchard. In the image-processing method, a 3-D space is divided into a number of cross sections at an interval based on disparity calculated from a gaze distance, and is reconstructed by integrating central composite images obtained from stereo pairs. Three measures to restrict false images were proposed: (1) setting a narrow search range, (2) comparing the amount of a color feature on the half-sides of a common area, and (3) central composition of the half-sides. Experiments with a trial stereo system were conducted in an orchard on ripe apples in red (search distances ranging from 1.5 m to 5.5 m) and in yellow-green (search range of 2 m to 4 m). The results showed that measures (1) and (3) were effective, whereas measure (2) was effective only when there was little influence of background color similar to that of the objects. The rate of fruit discrimination was about 90% or higher in images with 20 to 30 red fruits, and from 65% to 70% in images dense with red fruit and in images of yellow-green apples. The errors of distance measurement were about ±5%.

Keywords. Fruit harvesting, Binocular stereo vision, Image processing, Correspondence problem, Central composite image

The authors are solely responsible for the content of this technical presentation. The technical presentation does not necessarily reflect the official position of the American Society of Agricultural Engineers (ASAE), and its printing and distribution does not constitute an endorsement of views which may be expressed. Technical presentations are not subject to the formal peer review process by ASAE editorial committees; therefore, they are not to be presented as refereed publications. Citation of this work should state that it is from an ASAE meeting paper. EXAMPLE: Author's Last Name, Initials. 2002. Title of Presentation. ASAE Meeting Paper No. 02xxxx. St. Joseph, Mich.: ASAE. For information about securing permission to reprint or reproduce a technical presentation, please contact ASAE at hq@asae.org or 616-429-0300 (2950 Niles Road, St. Joseph, MI 49085-9659 USA).

Introduction

An automated machine or robot for apple harvesting must have the ability to discriminate an apple from its surroundings, to sense the apple's relative location, and to measure its three-dimensional (3-D) coordinate values from the base point of the machine. Binocular stereo vision is a feasible measuring method for obtaining such 3-D information in an outdoor field. However, practical solutions to the correspondence problem of stereopsis have not yet been found. Thus, use of a stereo vision system with an automated machine remains impracticable.
Fruit images obtained in apple orchards frequently exhibit the correspondence problem or other image problems, such as similar features (as in a row of similar fruits or overlapping fruits) and occluded or transformed images in which the fruit is hidden by leaves and branches. It is necessary to find a solution to, or take measures against, this type of problem in order to apply the method of binocular stereo vision to such images.

A major point of conventional methods of measuring distance by binocular stereo vision is the search for corresponding points of the same object on a pair of images from a left camera and a right camera, respectively, in order to calculate distance based on the triangulation principle. Since each image of the pair is taken by an independent camera, it is difficult to determine the corresponding points, especially when several similar objects exist on an epipolar plane, and the results are said to be deficient in reliability. A premise of the method used in the present study is to establish a form of monocular vision in which the two cameras are used in cooperation under the triangulation principle, with the aim of improving reliability. A central image of monocular vision is obtained by composing a stereo pair of images on a cross section of a search space in the direction of depth (Takahashi et al., 1998 and 2000a). Therefore, preparation of several central images involves dividing the space into slices, and integration of these slices uses 2-D images to reconstruct a 3-D version of the object. The correspondence problem will be reduced if the central composite images are obtained under the above-mentioned condition from the outset. However, a stereo vision system built from currently available cameras, which are not designed to serve this function, is apt to create false images even if the method of composing central images is adopted. Recently, we proposed three measures to restrict the occurrence of false images (Takahashi et al., 2000b). This paper describes the discrimination performance and measurement accuracy obtained by applying these measures to images containing many apples in red and in yellow-green.

Central composite image and measures to restrict occurrence of false images

A typical example of images demonstrating the correspondence problem is shown in Figure 1, where images of four circular plates (P11 to P44) in space were obtained in the same order from the left and the right, as shown in Figure 1(a). In the top view of Figure 1(b), the four genuine images are illustrated by thick bars, and twelve false images are created regularly at the positions indicated by thin bars under combinations of the object images. For this case, the central composite images and the measures to restrict the occurrence of false images are explained as follows.

Figure 1. A typical example of the correspondence problem in binocular stereo vision: (a) a stereo pair of object images; (b) measures, introduced herein, to address the correspondence problem.
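To illustrate the scale of the problem, the short Python sketch below counts the depth candidates that arise when several similar objects lie on one epipolar line: each pairing of a left image point with a right image point implies a depth by triangulation, so N identical objects yield N genuine and N(N-1) false candidates (twelve false candidates for the four plates of Figure 1). The camera geometry and object positions used here are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch (not the authors' implementation): why N similar objects on
# one epipolar line yield N genuine and N*(N-1) false stereo candidates.
f = 0.0137      # focal length, m (illustrative)
b = 0.300       # optical interval (baseline) of the two cameras, m (illustrative)

# Four identical, fruit-sized objects 8 cm apart on one epipolar line at 3.5 m.
objects = [(-0.12, 3.5), (-0.04, 3.5), (0.04, 3.5), (0.12, 3.5)]

# Pinhole projection: left camera at X = -b/2, right camera at X = +b/2.
left = [f * (X + b / 2) / Z for X, Z in objects]
right = [f * (X - b / 2) / Z for X, Z in objects]

genuine, false = [], []
for i, xl in enumerate(left):           # every left image point ...
    for j, xr in enumerate(right):      # ... can pair with every right one
        s = xl - xr                     # disparity of the candidate pair
        if s <= 0:
            continue                    # geometrically impossible pairing
        Z = f * b / s                   # depth implied by triangulation
        (genuine if i == j else false).append(round(Z, 2))

print(len(genuine), "genuine depths:", genuine)         # 4 candidates at 3.5 m
print(len(false), "false depths:", sorted(set(false)))  # 12 false candidates
```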
Central composite image based on a stereo pair of images

An image of monocular vision is obtained on a central screen through a visual point C, and the relationships of coordinates among the central image (x_c, y_c), a left image (x_l, y_l), a right image (x_r, y_r), and the disparity s are represented by the following equations:

x_l = x_c + s/2,  x_r = x_c - s/2,  y_l = y_r = y_c.    (1)

The relationships of coordinates between the central image and the space (X, Y, Z) are represented by the equations:

X = ab * x_c / s,  Y = ab * y_c / s,  Z = Cd * f * ab / s,    (2)

where ab is the optical interval of the two cameras, f is the focal length, and Cd is a scale factor of the central screen.

When the coordinate Z33 is given as a gaze distance, the corresponding disparity value, s33, is calculated back from equation (2) and is constant. The cross section at Z33 is projected onto the central screen by the relations of equation (1). In the images of P11 and P33 on the central screen, the whole area is clear because their left and right images overlap completely with each other. In the images of P22 and P44, the central parts of their areas are clear, but the sides are vague because their left and right images overlap with slight discrepancies. Therefore, if clarity is detected on a section, information regarding its color, position, and distance is obtained. The search space is divided into a number of cross sections calculated from disparities in front of and behind s33, and the space image is reconstructed from the central composite images by integrating the clear sections. The distance resolution is determined by the disparity interval, whose minimum value is one pixel.

To enhance the detection of clarity, composite images were made by alternately selecting the even lines of the left image and the odd lines of the right image. An index of clarity, Ch_i,j, of a pixel (i, j) is represented by the variance of density in HSI (hue, hu; saturation, sa; and intensity, it) as follows:

Ch_i,j = sqrt{ Σ_k [ (hu_i,k - hu_a)^2 + (sa_i,k - sa_a)^2 + (it_i,k - it_a)^2 ] },  (k = j - m to j + m)    (3)

where the subscript a represents the average from j - m to j + m. When the left and right images of an object overlap well, the color difference in the vertical direction is smallest, and the average value, Ch_a, over the whole area of the object reaches a minimum.

Measures to restrict the occurrence of false images

In Figure 1(b), if a cross section is located at Z32, three clear false images appear on the central screen because the image of P33 on the left screen and the image of P22 on the right screen overlap on that cross section, as is the case with the other false images. There is no difference between the genuine images and the false images if the color and shape of their objects are equal to each other. Based on how these false images occur, three measures to restrict them are proposed as follows.

If a search range includes the cross section at Z33 but does not include that at Z32, the false images on Z32 do not appear on the central screen onto which the range is projected. Therefore, the first measure is to set a search range in the direction of depth that is narrow enough to prevent false images from occurring, and to move the range gradually from near to far.
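To make the procedure above concrete, the following Python sketch composes a central image for one cross section, evaluates the clarity index of equation (3), and generates the narrow set of cross sections of the first measure. It is a minimal illustration under simplifying assumptions (rectified images with horizontal epipolar lines, integer pixel shifts with wrap-around ignored, and an assumed scale constant K = Cd * f * ab expressed in pixel-metres); it is not the authors' implementation.

```python
import numpy as np

# Assumed scale constant of equation (2): Z = K / s, with K = Cd * f * ab
# expressed in pixel-metres.  K is illustrative, not a value from the paper.
K = 280.0

def disparity_for_distance(Z):
    """Disparity (pixels) of a cross section at distance Z (m), from eq. (2)."""
    return K / Z

def distance_for_disparity(s):
    """Distance (m) of a cross section with disparity s (pixels), from eq. (2)."""
    return K / s

def central_composite(left, right, s):
    """Compose a central image for disparity s: shift the left image by -s/2
    and the right image by +s/2 (eq. (1)), then take the even rows from the
    shifted left image and the odd rows from the shifted right image."""
    half = int(round(s / 2))
    shifted_l = np.roll(left, -half, axis=1)   # x_l = x_c + s/2
    shifted_r = np.roll(right, +half, axis=1)  # x_r = x_c - s/2
    central = shifted_l.copy()
    central[1::2] = shifted_r[1::2]            # odd rows from the right image
    return central

def clarity_index(central_hsi, m=4):
    """Clarity Ch of eq. (3): per pixel, the root of the summed squared
    deviations of hue, saturation and intensity from their averages over a
    vertical window of 2*m + 1 rows.  Low values mean the left and right
    images overlap well at this cross section.  m is an assumed window size."""
    rows, cols, _ = central_hsi.shape
    ch = np.full((rows, cols), np.inf)
    for j in range(m, rows - m):
        window = central_hsi[j - m:j + m + 1]            # (2m+1, cols, 3)
        dev = window - window.mean(axis=0, keepdims=True)
        ch[j] = np.sqrt((dev ** 2).sum(axis=(0, 2)))
    return ch

def narrow_search_range(gaze_Z, step_px=3, n_sections=11):
    """First measure: a narrow search range around the gaze distance made of
    n_sections cross sections spaced step_px pixels apart in disparity.
    Moving this window gradually from near to far covers the working distance
    while keeping false images out of any single range."""
    s0 = disparity_for_distance(gaze_Z)
    offsets = (np.arange(n_sections) - n_sections // 2) * step_px
    sections = [(s0 + ds, distance_for_disparity(s0 + ds)) for ds in offsets]
    return sorted(sections, key=lambda sz: sz[1])        # near to far in Z
```

As a rough check with the assumed K: if K is about 280 pixel-metres, a gaze distance of 3.4 m corresponds to a disparity of roughly 82 pixels, and eleven cross sections at a 3-pixel interval then cover approximately 2.9 m to 4.2 m, which is consistent with the search range reported for Figure 2 below.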
The second measure is to compare the amount of a color feature on an object. In Figure 1(b), the amount of the color feature at the left side of a line, l3, of the left camera through the genuine image P33 is equal to that at the left side of a line, r3, of the right camera. The same condition exists on the right side of both lines. In the case of a line, r2, through the false image on Z32, the amounts of the color feature at the left side of both lines, and at the right side of both lines, do not agree with each other, respectively. Here, the symbol Sl_ij represents the difference between the amount of the color feature at the left side of l_i and that at the left side of r_j, and the symbol Sr_ij represents the corresponding difference at the right side of both lines. The symbol Sa_ij is the sum of their absolute values, as follows:

Sa_ij = |Sl_ij| + |Sr_ij|.    (4)

Thus, the value of Sa_ij becomes a minimum when l_i and r_j intersect at a genuine image. This relation is useful for judging genuine central images when the object images on the left and right screens are rows in the same order.

The third measure involves composition on each half-side of an object image. In Figure 1(b), the dotted area at the left side of P33 is the common area between the left side of l3 and that of r3, and the dotted area at the right side of P33 is the common area between the right sides of these lines. When the left side of a central image is composed from the respective left half-sides of the left and right images, and its right side is composed from the respective right half-sides, the whole image becomes clear only in these common areas. On the borders and outside of these common areas, the image becomes unclear, and the average variance of the color feature increases because the overlap of the left and right images is inevitably incomplete. In the case of Figure 1(b), four genuine images and only two false images among the total number of created images are clear, and the others are restricted. However, if this method is applied to a false image, the genuine images decrease and, conversely, the false images increase.
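The sketch below illustrates the second measure under the same assumptions as the earlier snippets. The red-hue range and the pairing routine are hypothetical illustrations (the paper states only that hue values of the red range were used for discrimination); this is not the authors' implementation.

```python
import numpy as np

def red_feature(image_hsi, hue_lo=0.95, hue_hi=0.05):
    """Binary mask of the color feature used for discrimination.  The red-hue
    range (wrapping around 0 on a 0-1 hue scale) is a hypothetical choice."""
    hue = image_hsi[..., 0]
    return (hue >= hue_lo) | (hue <= hue_hi)

def sa_value(left_hsi, right_hsi, xl, xr):
    """Second measure, eq. (4).  xl and xr are the column positions of a left
    eye-line l_i and a right eye-line r_j.  Sl is the difference between the
    amounts of the color feature to the left of the two lines, Sr the same on
    the right; Sa = |Sl| + |Sr| is smallest when l_i and r_j pass through the
    same, genuine object."""
    fl, fr = red_feature(left_hsi), red_feature(right_hsi)
    sl = int(fl[:, :xl].sum()) - int(fr[:, :xr].sum())
    sr = int(fl[:, xl:].sum()) - int(fr[:, xr:].sum())
    return abs(sl) + abs(sr)

def best_pairing(left_hsi, right_hsi, left_lines, right_lines):
    """Hypothetical use of Sa: pair each left eye-line with the right eye-line
    that minimises Sa, so that pairings corresponding to false images are
    rejected when the background contributes little of the color feature."""
    pairs = []
    for xl in left_lines:
        best = min(right_lines,
                   key=lambda xr: sa_value(left_hsi, right_hsi, xl, xr))
        pairs.append((xl, best))
    return pairs
```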
Experiments

Instrumentation

The main components of a trial system of binocular stereo vision were two CCD color video cameras (SONY EVI-310, 1/3" CCD element, interline type, 768x494 pixels, focal length of 5.9 mm to 47.2 mm, F1.4), a video capture cable (RGB, 24 bit, 640x480 pixels maximum), and a laptop computer (Pentium 266 MHz, 64 MB RAM). The zoom lens mechanism of the cameras was controlled by the computer through a video system control architecture (VISCA) network over an RS-232C interface. The convergence angle of the two cameras was 0.2 degrees. In image processing, correction of lens aberration was conducted for each image.

Experimental method

The objects were red ripe 'Fuji' apples and yellow-green 'Orin' apples in an orchard. Images of these objects were obtained under front-lighting conditions in sunny weather. The cameras' distance from the objects ranged from about 1.5 m to 5.5 m. Image size was 320x240 pixels. Adjustments to focus, exposure, and white balance were conducted automatically. In the image processing, composite images were produced at a size of 260x220 pixels in 24-bit RGB. The average variance of the color feature at each pixel on the composite images was calculated over a range of pixels equivalent to an actual length of 8 cm, which was equivalent to the average diameter of the fruit.

Results and discussion

The results of processing red apple images

Figure 2 is a pair of original images of red variety 'Fuji' apples. The fruits were located at a range of 3.4 m to 4.5 m from the cameras. The focal length of the cameras was 13.7 mm, and the optical interval was 300 mm. The fruit was about 80 mm to 85 mm in diameter. The image processing was carried out at a gaze distance of 3.4 m, a disparity interval of 3 pixels, and 11 cross sections by disparity (a search range of 2.9 m to 4.2 m); hue values of the red range and the average variance were used for discrimination. The central images were composed by overlapping the right images on the left images at each cross section of distance by disparity. In the frame of the left image, the total number of fruits was 33; the number of horizontal rows was 3; the number of overlapping fruits was 7; the number of fruits occluded by other fruits, leaves, or branches was 10; and the number of fruits with complete shape was 8.

Figure 3 shows a process of central images composed from the same half-sides of the left and right images. The images in Figure 3(a) are of cross sections at 3,201 mm, 3,429 mm, and 3,691 mm, respectively. The images of the fruits became clear at the cross section of 3,429 mm, near the actual distances of the objects. The central vertical lines in the images of 3,201 mm and 3,691 mm show the borders between the left half-side and the right half-side of the common area. The values of Sa are plotted against distance by disparity in Figure 3(b). They decreased in front of the actual distance and fell to a minimum at a distance about 0.2 m shorter than the actual distance of the object. This tendency is useful for determining a gaze distance. The average variance, Cha, of color density became a minimum at a distance near the actual distance. Since the minimum value of the average variance increases as the gaze distance moves away from the actual distance, comparison of this value is useful for measuring distance.

Figure 2. A pair of original images (left and right) of red variety 'Fuji' apples.

Figure 3. A process of central images composed from the same half-sides of left and right images: (a) central images of three cross sections of 3,201 mm, 3,429 mm, and 3,691 mm; (b) relationships among distance by disparity, Sa, and Cha.

The final composite image, shown in Figure 4(a), was obtained by gathering the pixels whose average variance was minimum among the images of the cross sections. The color and shape of the composite fruit images in the search range were approximately clear. The shape of the gazed-at fruit image in the center of the central composite screen was gradually transformed as the gaze distance moved farther from the actual distance. Figure 4(b) shows the distribution of average variances in the area of fruit images. The divided values are represented by the symbols 'A' to 'M' and 'n' to 'z' in ascending order. In the results, twenty-seven fruit images were marked, and eight images had two or more symbols because a single fruit image was divided into several areas that were measured at different distances. Three images where fruit images overlapped were not separated. The interval of the cross sections and their positions are related to the separation of these images. Figure 4(c) shows the fruit images marked with symbols 'A' to 'z' that represent 52 divisions of the distance from 1.3 m to 5.0 m. Twenty-six fruit images were each marked with a symbol. There were two areas where fruit images overlapped (one area had two fruit images and the other had three fruit images), each represented by one symbol. No image was marked with two or more symbols. The number of cases in which the marked point was far from the center of the fruit area was five, because the integration of areas on the relevant fruit images was unsuitable. Thus, composite images other than the gazed-at image may partly contain false images.

Figure 4. The results of image processing by central composition: (a) the final composite image; (b) symbol divisions of average variances in the area of fruit images; (c) fruit images marked by symbols 'A' to 'z' that represent the distance division in ascending order.
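As an illustration of this integration step, the sketch below (continuing the assumptions of the earlier snippets, and again not the authors' implementation) gathers, per pixel, the cross section with the minimum average variance to form the final composite image and a distance map, and then marks distances with the symbols 'A' to 'z'. The assumption that the 52 divisions from 1.3 m to 5.0 m are equal in distance is illustrative only.

```python
import string
import numpy as np

# 52 symbols, 'A'..'Z' then 'a'..'z', for the distance divisions of Figure 4(c).
SYMBOLS = string.ascii_uppercase + string.ascii_lowercase

def integrate_sections(sections):
    """sections: list of (Z, composite_rgb, cha) per cross section, where cha
    is the per-pixel average variance (eq. (3)).  For every pixel the section
    with the minimum variance is kept, giving the final composite image of
    Figure 4(a) and a per-pixel distance map."""
    cha_stack = np.stack([cha for _, _, cha in sections])        # (n, rows, cols)
    best = cha_stack.argmin(axis=0)                              # (rows, cols)
    distances = np.array([Z for Z, _, _ in sections])[best]
    rgb_stack = np.stack([rgb for _, rgb, _ in sections])        # (n, rows, cols, 3)
    rows, cols = best.shape
    final = rgb_stack[best,
                      np.arange(rows)[:, None],
                      np.arange(cols)[None, :]]
    return final, distances

def distance_symbol(Z, z_min=1.3, z_max=5.0, n_div=52):
    """Map a measured distance (m) to one of the 52 division symbols,
    assuming equal-width divisions for illustration."""
    idx = int((Z - z_min) / (z_max - z_min) * n_div)
    return SYMBOLS[min(max(idx, 0), n_div - 1)]
```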
An example of images densely containing red fruits is shown in Figure 5. Figure 5(a) is a pair of original images of red variety 'Fuji' apples. The fruits were located at a range of 4.7 m to 5.5 m from the cameras. The focal length of the cameras was 13.7 mm, and the optical interval was 250 mm. The image processing was carried out at a gaze distance of 4.8 m, a disparity interval of one pixel, and with 11 cross sections (a search range of 4.4 m to 5.4 m). In the white frame area, the total number of fruits was 46; the number of horizontal rows was 7; the number of overlapping fruit images was 8; the number of fruits occluded by other fruits, leaves, or branches was 4; and the number of fruits with complete shape was 14.

Figure 5(b) shows the distribution of average variances in the area of fruit images. Forty fruit images were marked, and twenty-one images had two or more symbols on each fruit image. Three images where fruit images overlapped each other were not separated, and each is represented by a single symbol. Since the distance resolution of the stereo vision system becomes lower as the objects get farther away (a short note on this relationship follows Figure 5), the difficulty in separating the overlapped areas increases. However, discrimination of fruit images by the present method tended to be better than that by the method of simple composition. The tendency of Sa, based on the amount of the color feature, was found to differ from expectation owing to the influence of background color similar to that of the objects. The calculation of distance was conducted on the areas of fruit images that were marked with the symbol of average variance. Figure 5(c) shows the fruit images marked with symbols 'A' to 'L', which represent the distance divisions of 4.4 m to 5.5 m. Thirty-one fruit images were marked. Three areas where fruit images overlapped each other were each represented by one symbol. Two fruit images were marked with two or more symbols. The number of cases in which the marked point was far from the center of the fruit area was 9.

Figure 5. An example of processing images densely containing fruits: (a) a pair of original images of red variety 'Fuji' apples; (b) the distribution of average variances in the area of fruit images; (c) the fruit images marked by symbols 'A' to 'L' that represent the distance division in ascending order.
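A note on the distance resolution mentioned above (this relation is derived from equation (2) and is not stated explicitly in the paper): since Z = Cd * f * ab / s, a change of Δs pixels in disparity corresponds to a change in distance of approximately

|ΔZ| ≈ Z^2 * Δs / (Cd * f * ab),

so the distance resolution degrades with the square of the distance and improves with a longer optical interval ab. This is consistent with the settings reported above: for Figure 2, eleven cross sections at a 3-pixel interval covered 2.9 m to 4.2 m around a 3.4 m gaze distance (about 0.13 m per step), whereas for Figure 5, eleven cross sections at a 1-pixel interval covered 4.4 m to 5.4 m around a 4.8 m gaze distance (about 0.10 m per step), even though the disparity step was three times finer.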
The results of images of yellow-green apples

In Figure 6(a), an example of original images of yellow-green variety 'Orin' apples is shown. The fruits were located at a range of 2.3 m to 4.5 m from the cameras. The focal length of the cameras was 8.6 mm, and the optical interval was 250 mm. The image processing was carried out at a gaze distance of 2.8 m, a disparity interval of 2 pixels, and with 11 cross sections (a search range of 2.3 m to 3.6 m in distance). In the white frame area, the total number of fruits was 20; the number of horizontal rows was one; the number of overlapping fruits was 3; the number of fruits occluded by other fruits, leaves, or branches was 8; and the number of fruits with complete shape was 10.

Figure 6(b) shows the distribution of average variances in the area of fruit images. Sixteen fruit images were marked, and no fruit image had two or more symbols. Two areas of fruit images that overlapped each other were not separated and were represented by a single symbol. Four areas of leaves were marked with symbols. Figure 6(c) shows the fruit images marked by the same symbols as those in Figure 4(c). Thirteen fruit images were marked. One area in which fruit images overlapped each other is represented by one symbol. Four areas of leaves were marked with symbols. The number of cases in which the marked point was far from the center of the fruit area was 5.

Figure 6. An example of processing images of yellow-green variety 'Orin' apples: (a) a pair of original images; (b) the distribution of average variances in the area of fruit images; (c) the fruit images marked by symbols 'A' to 'z' that represent the distance division in ascending order.

Figure 7. Relationship between photographing distance and error of distance (series: Fuji-2.3, Fuji-3.4, Fuji-4.7, Orin-3.0).

Errors of measured distances

A laser range meter with an accuracy of 3 mm was used to calibrate the values measured by the trial stereo system. In Figure 7, the results for the red apples showed that the error in determining distances was about ±5% (an average value of -1.3% and a standard deviation of 3.5%) over a distance range of 1.7 m to 5.2 m. In cases in which fruit images overlapped each other, the error was larger because false images appeared in unsuitable correspondence with the stereo pair. The number of false images tended to decrease when the search range was narrowed in the direction of depth. The results for the yellow-green apples showed that the errors of distance were about ±5% (an average value of 1.6% and a standard deviation of 3.6%) over a distance range of 2.3 m to 4.3 m when the color variance of the fruit image differed from that of the surrounding leaves.

Conclusion

The results of the three measures to address the correspondence problem showed that two of these measures (the setting of a narrow search range and the central composition of the same half-sides of the common area) were effective, whereas comparison of the amount of the color feature on the same half-sides was not always effective because of the influence of background color. The rate of fruit discrimination was about 90% or higher in images with 20 to 30 fruits, and from 65% to 70% in dense red fruit images and in yellow-green apple images; the errors of distance measurement were about ±5%. Based on these results, the method of image processing utilizing these measures will be useful in research on the application of a stereo vision system.

References

Takahashi, T., S. Zhang, M. Sun, and H. Fukuchi. 1998. New Method of Image Processing for Distance Measurement by a Passive Stereo Vision. ASAE Meeting Paper No. 983031. St. Joseph, Mich.: ASAE.

Takahashi, T., S. Zhang, H. Fukuchi, and E. Bekki. 2000a. Binocular Stereo Vision System for Measuring Distance of Apples in Orchard (Part 1) - Method due to composition of left and right images -. JSAM 62(1): 89-99 (in Japanese).

_____. 2000b.
Binocular Stereo Vision System for Measuring Distance of Apples in Orchard (Part 2) - Analysis of and solution to the correspondence problem -. JSAM 62(3): 94-102 (in Japanese).