Resolving Automated Perception System Failures in Bin-Picking Tasks Using Assistance from Remote Human Operators

Krishnanand N. Kaipa, Srudeep Somnaath Thevendria-Karthic, Shaurya Shriyam, Ariyan M. Kabir, Joshua D. Langsfeld, and Satyandra K. Gupta
Maryland Robotics Center, University of Maryland, College Park, MD 20742
Email: skgupta@umd.edu

Abstract— We present an approach to resolve automated perception failures during bin-picking operations in hybrid assembly cells. Our model exploits the complementary strengths of humans and robots. Whereas the robot performs bin-picking and proceeds to the subsequent operation, such as kitting or assembly, a remotely located human assists the robot in critical situations by resolving any automated perception problems encountered during bin-picking. We present the design details of our overall system, comprising an automated part recognition system and a remote user interface that allows effective information exchange between the human and the robot, geared toward solutions that minimize the human operator's time in resolving the detected perception failures. We use illustrative real-robot experiments to show that human-robot information exchange leads to improved bin-picking performance.

I. INTRODUCTION

The National Association of Manufacturers estimates that the United States has close to 300,000 small and medium manufacturers (SMMs), representing a very important segment of the manufacturing sector. Currently, many manufacturing operations at SMMs are largely manual. Examples include machine loading/unloading, part inspection, part cleaning, bin-picking, and assembly. In contrast, these operations are often performed by robots on mass production lines. This clearly shows the potential of robots in manufacturing. However, current industrial robots are not considered useful in small-production-volume operations. Hence, SMMs have largely refrained from using them. As we move toward shorter product life cycles and customized products, the future of manufacturing in the US will depend on the SMMs' ability to remain cost competitive. High labor costs make it difficult for SMMs to remain cost competitive in high-wage markets. However, setting up purely robotic cells is not a viable option for most SMMs.

Recently, several advances have been made in industrial robots that make them safer for humans [1], [2], [3], presenting an opportunity for creating hybrid work cells where humans and robots can collaborate in close physical proximity [4], [5], [6]. The underlying idea behind such cells is to decompose assembly operations into tasks such that humans and robots can collaborate by performing the tasks that are especially suitable for them. Several new low-cost robots have been introduced in the market over the last three years, making them cost effective in many manufacturing applications where utilization may not be very high. This makes the idea of hybrid cells economically viable in small-volume production.

In this paper, we present an approach to address perception failures during bin-picking tasks. The bin-picking operation involves identifying, locating, and picking a desired part from a container of randomly scattered parts. Usually, this operation is followed by either a kitting operation or an assembly operation. Many research groups have addressed the problem of enabling robots, guided by machine vision and other sensor modalities, to carry out bin-picking tasks [7], [8], [9].
The problem is very challenging and still not fully solved due to the severe conditions commonly found in factory environments [10], [11]. In particular, unstructured bins present diverse scenarios affording varying degrees of part recognition accuracy: 1) parts may assume widely different postures, 2) parts may overlap with other parts, and 3) parts may be either partially or completely occluded. The problem is compounded by factors such as background clutter, shadows, complex reflectance properties of parts made of various materials, and poorly lit conditions.

Our approach to this problem primarily exploits the fact that humans and robots have complementary strengths in performing tasks. Whereas robots can repetitively perform routine pick-and-place operations without any fatigue, humans excel in perception and prediction in unstructured environments. They are able to recognize and locate a part in a bin of miscellaneous parts, and their sensory and mental-rehearsal capabilities enable them to respond to unexpected situations. Accordingly, a deficit-compensation model can be designed as follows: the robot performs bin-picking under normal conditions and subsequently proceeds to assembly, while the human bails the robot out in critical situations by resolving any perception problems encountered during bin-picking.

Figure 1 shows a schematic of the envisioned hybrid work cell for a kitting operation consisting of four robots and two human operators. In this paper, we restrict ourselves to only one robot in the work cell and only one remotely located human operator. The collaboration is achieved by developing techniques for effective information exchange between the human and the robot. We assume that human operators will not have any programming expertise and hence will need to exchange information with robots without writing code. We will need to figure out the least time-consuming way to elicit the required information from human operators and the least confusing way to deliver information to the human operators. We mainly focus on the structure of the information and the best mode to obtain and deliver it. Primary research issues in this context include:

• What is the most convenient way for robots to seek assistance from human operators in the assembly cell?
• What is the most convenient way for humans to provide information to robots when robots need assistance from humans in completing a task?
• What is the most convenient way for robots to assist humans in recovering from an error?

Fig. 1. Hybrid work cell for kitting operations with four robots and two humans.

Primary means by which information can be delivered to human operators include speech, text, graphics [12], [13], [14], [15], virtual 3D environments [16], [17], and augmented reality [18], [19], [20]. Examples of augmented reality systems include a tracked head-worn display that augments a human operator's view with text, labels, arrows, and animations [19] and a laser pointer mounted on a robot that highlights where a cable must be inserted [18]. Humans usually deliver task-specific information to the robot either by teleoperation or through a graphical user interface [21].

II. APPROACH

The robot uses an automated part recognition system to recognize a part and estimate its posture, and then plans its motion to grasp and transfer the part from the bin to the assembly area.
However, if the robot determines that the part recognition from the current scene is uncertain, then it initiates a collaboration with a remotely located human operator. The particular bin scenario determines the specific nature of the collaboration between the robot and the human. In particular, we address the problem of how the remote human can extract the relevant information that can be effectively used to resolve issues of part recognition and posture determination. For this purpose, we have developed a user interface with controls that allow a human operator to provide approximate postural information. A 3D matching algorithm uses the solution provided by the human as an initial seed and generates better estimates. A flowchart of the information exchange scheme is shown in Fig. 2. A brief description of each subsystem follows.

Fig. 2. Flowchart of information exchange scheme between robot and remote human operator.

A. Automated Perception System

The baseline automated perception system used in this paper is built using Ensenso [22], a 3D stereo camera that provides point clouds of observed scenes. The Ensenso camera works on the "projected texture stereo vision" principle. It has two integrated CMOS sensors and a projector that casts a random point pattern onto objects in the scene. This pattern enables capturing images of surfaces without any texture. The camera is interfaced to Halcon [23], a machine vision software package that compares these point clouds with the CAD model of a target part to find part instances and the corresponding postures in the scene. The success rate of this system for easy-to-perceive and difficult-to-perceive parts is around 90% and 60%, respectively.

Perception failures by the automated system are mainly due to uncertainty in the sensed point cloud owing to several factors such as background clutter, occlusions, shadows, and complex reflectance properties. Moreover, different postures may result in point clouds of varying quality, especially for parts with arbitrary geometries (some illustrative examples are shown in Section III). These issues make it difficult for 3D-registration algorithms to find good matches.

B. Remote User Interface

When the automated perception fails, the robot seeks help from the human operator by sending the raw camera image of the scene, the corresponding point cloud, and the index of the desired part to be picked. Accordingly, the user interface consists of the following display fields:
1) Raw camera image of the scene comprising the bin of parts (sent by the automated perception system)
2) Point cloud of the scene obtained using the 3D camera
3) Perspective view of the 3D CAD model of the target part
4) Display field to visualize the match between the CAD model and the point cloud of the scene

The human operator primarily provides two inputs:
1) Selecting the region of interest. The human initially crops the raw image around the region containing the desired part. This information is used by the interface to remove as many points in the point cloud as possible that do not correspond to the desired part.
2) Generating an initial seed for the matching algorithm. The human adjusts the posture of the 3D model until it lies in the vicinity of the reduced point cloud of the selected region. This posture is used to initialize the matching algorithm.

The above actions are enabled by the following user controls: (1) cursor-based initialization of the position of the CAD model, (2) icon-based selection of the initial orientation of the CAD model, (3) a joystick interface to control the roll, pitch, and yaw of the CAD model, (4) cursor-based region-of-interest selection, and (5) a trigger button to initiate the back-end matching algorithm. The user interface was implemented in MATLAB.

C. 3D-Matching Algorithm

We use a variant of the Iterative Closest Point (ICP) algorithm [24] as the matching algorithm that runs in the back end of the user interface. The ICP implementation [25] available on the MATLAB Central file exchange was used for this purpose. Variants of ICP are usually obtained by modifying different stages of the algorithm, including selection of points in one or both meshes, matching type (e.g., brute force, Delaunay, k-d tree), weighting of pairs, rejection of certain pairs, assignment of an error metric, and minimization of the error metric [26].

The standard version of the ICP algorithm takes the full point cloud sets of the reference model and the observed scene as arguments. However, in the ICP variant that we use in this paper, we create different subsets of the CAD model of the target part corresponding to different (maximally separated) views of the part and compare the cropped point cloud with each of these subsets to find the best match. We use the k-d tree matching type for faster computation. Currently, we use a constant weight for all point pairs. Certain pairs are rejected based on Euclidean distance in order to remove outliers. The error metric of sum-of-squared distances between corresponding points, together with singular value decomposition, is used to find the transformation that minimizes the error. The extrapolation option is used, in which the iteration direction is evaluated and extrapolated if possible using the method described in [24].
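To make the matching step concrete, the following is a minimal MATLAB-style sketch of a human-seeded, multi-view ICP match. It is an illustration only, not the implementation used in this paper (which relies on the MATLAB Central ICP code [25] with k-d tree matching and extrapolation): the function names, the brute-force nearest-neighbor search, and the 3x-median rejection threshold are assumptions made for the sketch.

function [Rbest, tbest, errBest] = seededMultiViewICP(scenePts, modelViews, R0, t0)
% scenePts   : Nx3 point cloud cropped to the operator-selected region
% modelViews : cell array of Mx3 CAD-model subsets, one per stored view
% R0, t0     : 3x3 rotation and 1x3 translation of the operator's seed pose
% All names and thresholds here are illustrative placeholders.
    Rbest = R0; tbest = t0; errBest = inf;
    for v = 1:numel(modelViews)
        [R, t, e] = icpPointToPoint(modelViews{v}, scenePts, R0, t0, 30);
        if e < errBest                       % keep the best-matching view subset
            Rbest = R; tbest = t; errBest = e;
        end
    end
end

function [R, t, err] = icpPointToPoint(modelPts, scenePts, R, t, maxIter)
% Basic point-to-point ICP: nearest-neighbor correspondences, Euclidean
% distance based outlier rejection, and SVD minimization of the
% sum-of-squared-distances error metric.
    for iter = 1:maxIter
        moved = modelPts * R' + t;           % model points under the current pose
        n = size(moved, 1);
        d2 = zeros(n, 1); idx = zeros(n, 1);
        for i = 1:n                          % brute-force nearest neighbor
            [d2(i), idx(i)] = min(sum((scenePts - moved(i, :)).^2, 2));
        end
        keep = d2 < 9 * median(d2);          % reject pairs beyond 3x the median distance
        p = modelPts(keep, :);
        q = scenePts(idx(keep), :);
        % Closed-form rigid alignment (Kabsch/SVD) of p onto q.
        pc = p - mean(p, 1); qc = q - mean(q, 1);
        [U, ~, V] = svd(pc' * qc);
        R = V * diag([1, 1, sign(det(V * U'))]) * U';
        t = mean(q, 1) - mean(p, 1) * R';
        err = mean(sum((p * R' + t - q).^2, 2));
    end
end

A call such as seededMultiViewICP(croppedCloud, views, Rseed, tseed) would return the refined pose that the interface relays back to the robot; in the actual system this role is played by the ICP implementation [25].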
D. Accuracy/Time Tradeoff

There is a tradeoff between accuracy and the time needed to extract the data. Orientation accuracy, in turn, impacts grasping performance. The accuracy needed to successfully grasp a part depends on its shape complexity and its particular posture. This information is pre-determined for each part and conveyed to the human operator so that he/she can stop the estimation process once a good enough orientation accuracy is obtained. For this purpose, we placed a single instance of the target part on a tripod and used a digital inclination meter to set the orientation of the part at a known posture. In one sample experiment, we used a nominal orientation of 30 degrees about the longitudinal axis of the part and 35 degrees about the lateral axis of the part. We then manually introduced 2-degree increments of perception error about each axis and observed the impact on grasping performance. For the part shown in Fig. 4(b), we noticed that the robot was able to successfully grasp up to an error of ±8 degrees about the longitudinal axis. We noticed a high asymmetry about the lateral axis, with successful grasping up to 8 degrees in the clockwise direction and only 2 degrees in the counterclockwise direction.
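The stopping criterion implied by this tradeoff can be expressed compactly. The sketch below, again illustrative MATLAB rather than the paper's code, encodes the per-part tolerances measured above for the part in Fig. 4(b): ±8 degrees about the longitudinal axis and an asymmetric +8/-2 degree band about the lateral axis. The function name and the error-sign convention are assumptions.

function ok = withinGraspTolerance(longErrDeg, latErrDeg)
% Illustrative check of whether the current orientation estimate is already
% accurate enough for grasping the part in Fig. 4(b), using the pre-determined
% tolerances reported above. Positive latErrDeg is taken to mean a clockwise
% error about the lateral axis (sign convention assumed for this sketch).
    longitudinalOK = abs(longErrDeg) <= 8;                % +/- 8 deg tolerance
    lateralOK = (latErrDeg <= 8) && (latErrDeg >= -2);    % +8 deg CW, -2 deg CCW
    ok = longitudinalOK && lateralOK;
end

Once such a predicate holds for the part being estimated, the operator can stop refining the seed and trigger the matching algorithm.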
III. ILLUSTRATIVE EXPERIMENTS

The experimental setup consists of a Baxter robot, an automated perception system built using the Ensenso 3D camera and Halcon software, and a user interface that communicates with the perception system remotely via the Internet (Figs. 3(a) and 3(b)). We considered representative industrial parts (Fig. 4(a)) that afford different recognition and grasping complexities to illustrate various challenges encountered during the bin-picking task. In this paper, we focused our experiments on the part shown in Fig. 4(b). This part presents both recognition and grasping complexities. In particular, the quality of the point cloud corresponding to this part is heavily influenced by its orientation relative to the 3D camera. Whereas the part is symmetric about its longitudinal axis, it is asymmetric about its lateral axis, making the grasping problem nontrivial. We consider the following two bin scenarios.

Fig. 3. (a) Baxter robot equipped with an automated perception system built using Ensenso 3D camera and Halcon software. (b) Remote user interface.

Fig. 4. (a) Set of industrial parts used in bin-picking experiments. (b) CAD model of the target part to be picked by the robot.

A. Uniform Bins

In this regime, bins contain parts of the same type. Figure 5(a) shows one such example. The automated perception system succeeds in detecting one instance of the desired part to be picked and its postural information (Fig. 5(b)). The Baxter robot uses this information to find a motion plan and pick the detected part (Fig. 5(c)). Figure 6(a) shows an example in which the automated system fails to find a match. This triggers the sending of the relevant data to the remote human operator. Figures 7(a) and 7(b) show snapshots of the part match found by the human operator through manual adjustment and invocation of the ICP algorithm. This information is relayed in real time to the robot. Next, the robot proceeds with picking up the part (Fig. 6(c)).

Fig. 5. (a) Uniform bin: scene 1. (b) Part match found by automated perception system. Pose [x, y, z, roll, pitch, yaw] = (0.141415, -0.116390, 0.720805, 273.484056, 315.422986, 104.475811). (c) Robot uses the detected postural information to pick up the target part.

Fig. 6. (a) Uniform bin: scene 2. (b) Perception failure by automated perception system. (c) Robot uses the postural information relayed by the remote human operator to pick up the target part.

Fig. 7. Failure case of uniform bin scenario resolved by remote human operator using the user interface: (a) Snapshot of initial display. (b) Snapshot of scene point cloud and CAD model after match is found. (c) Point cloud of the cropped scene. (d) Posture of the matched CAD model. (e) Display of the final posture values of the CAD model.

B. Mixed Bins

In this regime, bins contain different types of parts. Figure 9(a) shows one such example. The automated perception system succeeds in detecting one instance of the desired part to be picked and its postural information (Fig. 9(b)). The Baxter robot uses this information to find a motion plan and pick the detected part (Fig. 9(c)). Figure 10(a) shows an example in this regime in which the automated system fails to find a match. This triggers the sending of the relevant data to the remote human operator. Figures 8(a) and 8(b) show snapshots of the part match found by the human operator through manual adjustment and invocation of the ICP algorithm. This information is relayed in real time to the robot. Subsequently, the robot proceeds with picking up the part (Fig. 10(c)).

Fig. 8. Failure case of mixed bin scenario resolved by remote human operator using the user interface: (a) Snapshot of initial display. (b) Snapshot of scene point cloud and CAD model after match is found. (c) Point cloud of the cropped scene. (d) Posture of the matched CAD model. (e) Display of the final posture values of the CAD model.

Fig. 9. (a) Mixed bin: scene 1. (b) Part match found by automated perception system. Pose = (0.141415, -0.116390, 0.720805, 273.484056, 315.422986, 104.475811). (c) Robot uses the detected postural information to pick up the target part.

Fig. 10. (a) Mixed bin: scene 2. (b) Perception failure by automated perception system. (c) Robot uses the postural information relayed by the remote human operator to pick up the target part.
We have observed that in both failure cases the human operator was able to find part matches and postural information in a matter of a few seconds.

IV. CONCLUSIONS

We presented the design details of our approach, which enables resolution of automated perception failures in bin-picking tasks using assistance from remote human operators. We used illustrative experiments to present different regimes in which human-robot information exchange can take place to resolve perception problems encountered during bin-picking. In this paper, we considered bin-picking used for assembly tasks. However, our approach can be extended to the general problem of bin-picking as applied to other industrial tasks such as packaging. More extensive experiment-based empirical evaluations are needed to systematically test the ideas presented in this paper.

The human-robot collaborative bin-picking described in this paper is one of the key modules required to achieve hybrid work cells for industrial tasks. In our previous work, we have developed other related modules, including sequence planning for complex assemblies [27], instruction generation for human operations [28], ensuring human safety [29], and a framework for replanning to recover from errors [30]. As part of future work, we plan to integrate these individual modules in order to realize the overall operation of the envisioned hybrid cell.

REFERENCES

[1] Baxter, “Baxter - Rethink Robotics”. [Online: 2012]. http://www.rethinkrobotics.com/products/baxter/.
[2] Kuka, “Kuka LBR IV”. [Online: 2013]. http://www.kukalabs.com/en/medical robotics/lightweight robotics/.
[3] ABB, “ABB Friendly Robot for Industrial Dual-Arm FRIDA”. [Online: 2013]. http://www.abb.us/cawp/abbzh254/8657f5e05ede6ac5c1257861002c8ed2.aspx.
[4] Krüger, J., Lien, T., and Verl, A., 2009. “Cooperation of human and machines in assembly lines”. CIRP Annals - Manufacturing Technology, 58(2), pp. 628–646.
[5] Shi, J., Jimmerson, G., Pearson, T., and Menassa, R., 2012. “Levels of human and robot collaboration for automotive manufacturing”. In Proc. Workshop on Performance Metrics for Intelligent Systems, pp. 95–100.
[6] Shi, J., and Menassa, R., 2012. “Transitional or partnership human and robot collaboration for automotive assembly”. In Proc. Workshop on Performance Metrics for Intelligent Systems, pp. 187–194.
[7] Buchholz, D., Winkelbach, S., and Wahl, F.M., 2010. “RANSAM for industrial bin-picking”. In Proc. International Symposium on Robotics and German Conference on Robotics.
[8] Balakirsky, S., Kootbally, Z., Schlenoff, C., Kramer, T., and Gupta, S.K., 2012. “An industrial robotic knowledge representation for kit building applications”. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2012), pp. 1365–1370.
[9] Schyja, A., Hypki, A., and Kuhlenkotter, B., 2012. “A modular and extensible framework for real and virtual bin-picking environments”. In Proceedings of the IEEE International Conference on Robotics and Automation, pp. 5246–5251.
[10] Liu, M.-Y., Tuzel, O., Veeraraghavan, A., Taguchi, Y., Marks, T.K., and Chellappa, R., 2012. “Fast object localization and pose estimation in heavy clutter for robotic bin picking”. The International Journal of Robotics Research, 31(8), pp. 951–973.
[11] Marvel, J.A., Saidi, K., Eastman, R., Hong, T., Cheok, G., and Messina, E., 2012. “Technology readiness levels for randomized bin picking”. In Proceedings of the Workshop on Performance Metrics for Intelligent Systems, pp. 109–113.
[12] J. Heiser, D. Phan, M. Agrawala, B. Tversky, and P. Hanrahan, “Identification and validation of cognitive design principles for automated generation of assembly instructions,” in Proceedings of the Working Conference on Advanced Visual Interfaces, ser. AVI ’04. New York, NY, USA: ACM, 2004, pp. 311–319.
[13] M. Dalal, S. Feiner, K. McKeown, S. Pan, M. Zhou, T. Höllerer, J. Shaw, Y. Feng, and J. Fromer, “Negotiation for automated generation of temporal multimedia presentations,” in Proceedings of the Fourth ACM International Conference on Multimedia, ser. MULTIMEDIA ’96. New York, NY, USA: ACM, 1996, pp. 55–64.
[14] G. Zimmerman, J. Barnes, and L. Leventhal, “A comparison of the usability and effectiveness of web-based delivery of instructions for inherently-3D construction tasks on handheld and desktop computers,” in Proc. International Conference on 3D Web Technology, New York, NY, USA: ACM, 2003, pp. 49–54.
[15] S. Kim, I. Woo, R. Maciejewski, D. S. Ebert, T. D. Ropp, and K. Thomas, “Evaluating the effectiveness of visualization techniques for schematic diagrams in maintenance tasks,” in Proceedings of the 7th Symposium on Applied Perception in Graphics and Visualization, ser. APGV ’10. New York, NY, USA: ACM, 2010, pp. 33–40.
[16] D. Dionne, S. de la Puente, C. León, R. Hervás, and P. Gervás, “A model for human readable instruction generation using level-based discourse planning and dynamic inference of attributes disambiguation,” in Proceedings of the 12th European Workshop on Natural Language Generation, ser. ENLG ’09. Stroudsburg, PA, USA: Association for Computational Linguistics, 2009, pp. 66–73.
[17] J. E. Brough, M. Schwartz, S. K. Gupta, D. K. Anand, R. Kavetsky, and R. Pettersen, “Towards the development of a virtual environment-based training system for mechanical assembly operations,” Virtual Reality, vol. 11, no. 4, pp. 189–206, 2007.
[18] F. Duan, J. Tan, J. G. Tong, R. Kato, and T. Arai, “Application of the assembly skill transfer system in an actual cellular manufacturing system,” IEEE Transactions on Automation Science and Engineering, vol. 9, no. 1, pp. 31–41, Jan 2012.
[19] S. Henderson and S. Feiner, “Exploring the benefits of augmented reality documentation for maintenance and repair,” IEEE Transactions on Visualization and Computer Graphics, vol. 17, no. 10, pp. 1355–1368, Oct 2011.
[20] D. Kalkofen, M. Tatzgern, and D. Schmalstieg, “Explosion diagrams in augmented reality,” in Proc. IEEE Virtual Reality Conference (VR 2009), March 2009, pp. 71–78.
[21] A. Pichler and C. Wogerer, “Towards robot systems for small batch manufacturing,” in Proc. IEEE International Symposium on Assembly and Manufacturing (ISAM), May 2011, pp. 1–6.
[22] Ensenso 3D Camera, “Ensenso N10 3D Camera - IDS Imaging Development Systems GmbH”. https://en.idsimaging.com/store/produkte/kameras/ensenso-n10-3d-usb-2-0.html.
[23] Halcon Software, “Halcon 12.0 - MVTec Software GmbH”. http://www.halcon.com/.
[24] Besl, P.J. and McKay, N.D., 1992. “A method for registration of 3-D shapes”. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(2), pp. 239–256.
[25] Wilm, J. “ICP code - MATLAB Central File Exchange”. http://www.mathworks.com/matlabcentral/fileexchange/27804-iterativeclosest-point/content//icp.m
[26] Rusinkiewicz, S., and Levoy, M., 2001. “Efficient variants of the ICP algorithm”. In Proceedings of the Third International Conference on 3D Digital Imaging and Modeling, pp. 145–152.
[27] Morato, C., Kaipa, K. N., and Gupta, S. K., 2013. “Improving assembly precedence constraint generation by utilizing motion planning and part interaction clusters”. Journal of Computer-Aided Design, 45(11), pp. 1349–1364.
[28] Kaipa, K. N., Morato, C., Zhao, B., and Gupta, S. K. “Instruction generation for assembly operations performed by humans”. In ASME Computers and Information in Engineering Conference, Chicago, IL, August 2012.
[29] Morato, C., Kaipa, K. N., and Gupta, S. K., 2014. “Toward safe human robot collaboration by using multiple Kinects based real-time human tracking”. Journal of Computing and Information Science in Engineering, 14(1), p. 011006.
[30] Morato, C., Kaipa, K. N., Liu, J., and Gupta, S. K., 2014. “A framework for hybrid cells that support safe and efficient human-robot collaboration in assembly operations”. In Proceedings of the ASME International Design Engineering Technical Conferences & Computers and Information in Engineering Conference, Buffalo, New York.