
[Paper Notes] Learning to Grasp with Primitive Shaped Object Policies


  • Abstract
    • A. Guided Policy Search
    • B. Convolutional Neural Network Policy
    • C. Cost function
    • A. Parameters
    • B. Generalization
    • C. Robustness


In this paper, we employed a reinforcement learning method based on policy search, called Guided Policy Search (GPS), to learn policies for the grasping problem.

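The goal was to evaluate whether policies trained with simple primitive-shaped objects can achieve good grasping performance on more complex shapes. In other words, the paper mainly tests how well the learned policy generalizes.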

Additionally, a robustness test was conducted to show that the visual component of the policy helps to guide the system when there is an error in the estimation of the target object pose.



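However, to compute the grasping posture, the planner needs an accurate model of each component. This requirement limits what the planner can handle.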

To cope with this problem, we propose a novel approach in which the assembly planner itself is solved for primitive shapes and, at the time of performing the grasp, the shape difference between the actual object and its primitive model is handled with reinforcement learning.

We used a policy search algorithm called Guided Policy Search (GPS). The algorithm learns a policy to control a robotic arm based on visual feedback and the estimated grasp pose.

In our method, we propose representing the policy using a convolutional neural network (CNN) that accepts an image, the grasp pose (joint and velocity parameters), and the control parameters of the robot arm.



Most of these works use manually engineered representations to design specialized policy classes or features. However, the high-dimensional complexity of robotic systems and unstructured environments demands additional constraints on the general reinforcement learning formulation to enable its application in the real world.

Moreover, end-to-end learning approaches have been proposed where the system learns a joint model using vision input in a data-driven manner, such that, after collecting thousands of samples of successful and unsuccessful manipulations, the robot learns a model which controls the manipulation directly from input images.

Representative end-to-end grasping work:

  • Pinto & Gupta trained a convolutional neural network (CNN) for the task of predicting grasp locations from self-supervised trial and error; building the dataset took 50K grasp attempts and 700 robot hours. Paper: L. Pinto and A. Gupta, “Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours,” in Robotics and Automation (ICRA), 2016 IEEE International Conference on, pp. 3406–3413, IEEE, 2016.
  • Levine et al. trained a deep CNN to predict the probability that the gripper will result in successful grasps, using a dataset of 900K grasp attempts to learn hand-eye coordination for grasping; collecting this dataset took over two months with eight robots working simultaneously. Paper: S. Levine, P. Pastor, A. Krizhevsky, and D. Quillen, “Learning hand-eye coordination for robotic grasping with large-scale data collection,” in International Symposium on Experimental Robotics, pp. 173–184, Springer, 2016.

These end-to-end approaches:

  • reduce the burden of manual feature engineering;
  • typically require a massive amount of training data;
  • are therefore not practical for application.

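Guided Policy Search (GPS) methods seek to address this challenge by decomposing policy search into trajectory optimization and supervised learning of the policy. These algorithms turn the policy search problem into a supervised learning problem, with the supervision provided by simple trajectory-centric reinforcement learning methods.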


Thus, starting from an initial policy, which can be a random Gaussian controller, the GPS method iterates over three steps: drawing samples from the current policy, using these samples to fit the dynamics that are used to improve the trajectory distribution, and training the policy using the trajectories as training data.
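The dynamics-fitting step of this loop can be illustrated on a toy problem. The sketch below is a minimal example under strong assumptions: it fits a scalar, time-invariant linear model $x_{t+1} \approx a x_t + b u_t$ by least squares from sampled transitions, whereas GPS implementations typically fit richer time-varying linear-Gaussian models. The names `fit_linear_dynamics` and `sample_transitions` and the toy system are hypothetical and are not taken from the paper.

```python
def fit_linear_dynamics(transitions):
    """Least-squares fit of x_next ~= a * x + b * u (scalar toy model).

    transitions: iterable of (x, u, x_next) tuples sampled from rollouts.
    Returns (a, b). Hypothetical helper, not the paper's implementation.
    """
    # Accumulate the entries of the 2x2 normal equations.
    sxx = sxu = suu = sxy = suy = 0.0
    for x, u, x_next in transitions:
        sxx += x * x
        sxu += x * u
        suu += u * u
        sxy += x * x_next
        suy += u * x_next
    det = sxx * suu - sxu * sxu
    if abs(det) < 1e-12:
        raise ValueError("samples do not excite x and u independently")
    # Solve the normal equations with Cramer's rule.
    a = (sxy * suu - sxu * suy) / det
    b = (sxx * suy - sxu * sxy) / det
    return a, b


def sample_transitions(a_true, b_true, controls, x0=1.0):
    """Roll a noiseless toy system forward and record (x, u, x_next)."""
    x, out = x0, []
    for u in controls:
        x_next = a_true * x + b_true * u
        out.append((x, u, x_next))
        x = x_next
    return out


# Assumed toy system: recover its coefficients from five sampled steps.
samples = sample_transitions(0.9, 0.5, [1.0, -1.0, 0.5, 0.0, 2.0])
a_hat, b_hat = fit_linear_dynamics(samples)
```

In a full GPS iteration the fitted model would then be used to improve the trajectory distribution, and the resulting trajectories would serve as training data for the policy.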

On the other hand, our method computes a grasping posture for the primitive-shaped model, and the difference between the actual object and its primitive-shaped model is resolved by reinforcement learning.


We consider the robot manipulation task of grasping a target object using a robot arm equipped with a simple parallel gripper.

  • A grasp/assembly planner is used to compute the grasping posture (the arm and gripper pose) for the target object.
  • The pose of the target object is assumed to be known in order to compute the grasping posture.
  • Error in this pose estimation can be mitigated thanks to the visual component of the policy.
  • The grasping posture is computed by approximating the object's shape with a set of primitive shapes and actions.

A. Guided Policy Search

In policy search algorithms, the goal is to learn a policy $\pi(u_{t} \mid o_{t})$ for taking actions $u_{t}$ conditioned on the observations $o_{t}$ to control a dynamical system. Given stochastic dynamics $p(x_{t+1} \mid x_{t}, u_{t})$ and a cost function $l(x, u)$, the goal is to minimize the expected value, under the policy's trajectory distribution, of the total cost $\sum_{t=1}^{T} l(x_{t}, u_{t})$.

In guided policy search, this optimization problem is addressed by dividing the problem into two components: a trajectory optimization part and a supervised learning one.

These samples are stored as trajectories of the form $\{x_{t}, u_{t}, x_{t+1}\}$. In other words, each stored tuple is what deep reinforcement learning informally calls an "experience".
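The objective and the stored samples can be made concrete with a small Monte Carlo sketch. It rolls out a policy on a toy one-dimensional system, records each step as a tuple $(x_t, u_t, x_{t+1})$, and averages the summed cost over several rollouts to estimate the expected cost. The toy dynamics, the proportional policy, the cost, and all function names are illustrative assumptions, not part of the paper.

```python
import random


def rollout(policy, dynamics, x0, horizon):
    """Run one episode and return its transitions as (x, u, x_next) tuples."""
    x, transitions = x0, []
    for _ in range(horizon):
        u = policy(x)
        x_next = dynamics(x, u)
        transitions.append((x, u, x_next))
        x = x_next
    return transitions


def total_cost(transitions, cost):
    """Sum l(x_t, u_t) over the steps of one trajectory."""
    return sum(cost(x, u) for x, u, _ in transitions)


def expected_cost(policy, dynamics, cost, x0, horizon, n_rollouts):
    """Monte Carlo estimate of the expected total cost under the policy."""
    totals = [total_cost(rollout(policy, dynamics, x0, horizon), cost)
              for _ in range(n_rollouts)]
    return sum(totals) / n_rollouts


# Toy setup (assumed): a noisy integrator driven toward the origin.
rng = random.Random(0)
policy = lambda x: -0.5 * x
dynamics = lambda x, u: x + u + rng.gauss(0.0, 0.01)
cost = lambda x, u: x * x + 0.1 * u * u

estimate = expected_cost(policy, dynamics, cost,
                         x0=1.0, horizon=20, n_rollouts=50)
```

Averaging over more rollouts reduces the variance of the estimate; the policy search problem is to choose the policy that makes this quantity as small as possible.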

Then, in the trajectory optimization part, linear controllers are learned, with the optimization constrained to stay close to the trajectory described by the policy.


B. Convolutional Neural Network Policy



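  • We collected a dataset of about 2,000 images containing primitive-shaped objects and the robot arm at various arbitrary positions.
  • We build a CNN with the same structure as the one mentioned earlier: a spatial softmax layer and a fully connected layer, followed by an activation function that produces the predicted positions. The authors found this activation to perform better than rectified linear units (ReLU).
  • The first-layer filters are initialized with the weights of a pre-trained model (apparently Inception) trained on a classification dataset.
  • The gripper and object poses are each encoded as three points.
  • The CNN is trained in batches with the Adam optimizer.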
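To illustrate the spatial softmax layer mentioned above, here is a minimal pure-Python sketch for a single feature map. It applies a softmax over all pixels and returns the expected pixel coordinates, which is the usual way such a layer turns a response map into a feature point. The function name and the use of raw pixel indices as coordinates are assumptions; the notes do not specify the paper's exact formulation.

```python
import math


def spatial_softmax(feature_map):
    """Return the expected (x, y) pixel position of one 2-D feature map.

    feature_map: list of rows, each a list of floats (height x width).
    """
    # Subtract the maximum for numerical stability before exponentiating.
    peak = max(max(row) for row in feature_map)
    weights = [[math.exp(v - peak) for v in row] for row in feature_map]
    total = sum(sum(row) for row in weights)

    expected_x = 0.0
    expected_y = 0.0
    for y, row in enumerate(weights):
        for x, w in enumerate(row):
            p = w / total  # softmax probability of this pixel
            expected_x += p * x
            expected_y += p * y
    return expected_x, expected_y


# A strong response in the top-right corner pulls the feature point there.
fx, fy = spatial_softmax([[0.0, 0.0, 50.0],
                          [0.0, 0.0, 0.0],
                          [0.0, 0.0, 0.0]])
```

In a CNN this operation would be applied to every channel of the last convolutional layer, giving one (x, y) pair per channel.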

C. Cost function

In this work, the cost function for the grasping task was defined in terms of the distance to the target and then whether the grasp is successful:

$$l(x_{t},u_{t})=\omega_{l_{2}}d_{t}^{2}+\omega_{log}\log(d_{t}^{2}+\alpha)+\omega_{u}\|u_{t}\|^{2}+\omega_{g}C_{grasp}$$

Here $d_{t}$ is the distance between three points in end-effector space and their target positions. The weights are set to $\omega_{l_{1}} = 1.0$, $\omega_{l_{2}} = 10.0$, $\omega_{u} = 1.0$, and $\omega_{g} = 1.0$.

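  • The quadratic term encourages movement toward the target when the target is far away;
  • The logarithmic term encourages precise placement at the target position;
  • The term involving the action $u_{t}$ encourages the agent to find an efficient trajectory;
  • The $C_{grasp}$ term is defined as a reward for a successful grasp and is given only at the last step of each episode.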

$$C_{grasp} = \begin{cases} -10, & \text{if the grasp is successful} \\ 1, & \text{otherwise} \end{cases}$$

A grasp is considered successful if the object is still held by the robot after it attempts the grasp and returns the arm to its initial position.
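The cost above translates almost directly into code. The following is a minimal sketch, assuming the distance $d_t$ and the action $u_t$ are already available at each step. The notes give no value for $\alpha$ and none for the weight of the log term, so every weight is passed in by the caller; the function names are hypothetical.

```python
import math


def grasp_term(success):
    """C_grasp: -10 for a successful grasp, 1 otherwise."""
    return -10.0 if success else 1.0


def step_cost(d, u, w_l2, w_log, w_u, w_g, alpha,
              last_step=False, success=False):
    """Cost l(x_t, u_t) for one time step.

    d: distance to the target at this step; u: sequence of action values.
    The grasp term is only added on the last step of the episode.
    """
    cost = w_l2 * d ** 2                      # quadratic term
    cost += w_log * math.log(d ** 2 + alpha)  # log term, sharp near d = 0
    cost += w_u * sum(ui * ui for ui in u)    # action penalty ||u||^2
    if last_step:
        cost += w_g * grasp_term(success)
    return cost


# Placeholder weights and alpha, chosen only to exercise the function.
final_cost = step_cost(0.0, [0.0, 0.0], w_l2=10.0, w_log=1.0, w_u=1.0,
                       w_g=1.0, alpha=1.0, last_step=True, success=True)
```

Because the grasp term is a cost, a successful grasp lowers the total by 10 while a failed one raises it by 1.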


A. Parameters

All experiments were run on a simulated Baxter robot in the Gazebo simulator. The robot is torque-controlled at 40 Hz. The state of the robot is defined as follows:

$$x= \begin{bmatrix} q \\ \dot{q} \\ eef \\ \dot{eef} \\ vf \end{bmatrix}$$

  • $q$: the joint angles of the robot, here the seven joint angles of Baxter's right arm;
  • $eef$: the end-effector term, expressed as a difference from the current pose, where $gp$ is the current end-effector pose at any given time;
  • $vf$: the visual features extracted by the CNN layers from an input RGB image of size $150 \times 150 \times 3$.

The camera stays fixed in every experiment. Each episode lasts 100 steps.
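As a rough illustration of how such a state could be assembled, the sketch below concatenates the five components into one flat vector and computes the episode duration implied by 100 steps at 40 Hz. Only the seven joint angles, the control rate, and the episode length come from the notes; the other component sizes (9 values for the end-effector term, 32 visual features) are placeholders chosen for the example, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List

CONTROL_RATE_HZ = 40   # from the notes
EPISODE_STEPS = 100    # from the notes


@dataclass
class RobotState:
    q: List[float]        # joint angles (7 for Baxter's right arm)
    q_dot: List[float]    # joint velocities
    eef: List[float]      # end-effector term (size assumed)
    eef_dot: List[float]  # its time derivative (size assumed)
    vf: List[float]       # visual features from the CNN (size assumed)

    def to_vector(self) -> List[float]:
        """Stack the components in the order [q, q_dot, eef, eef_dot, vf]."""
        return self.q + self.q_dot + self.eef + self.eef_dot + self.vf


def episode_duration_s(steps: int = EPISODE_STEPS,
                       rate_hz: int = CONTROL_RATE_HZ) -> float:
    """Length of one episode in simulated seconds."""
    return steps / rate_hz


state = RobotState(q=[0.0] * 7, q_dot=[0.0] * 7,
                   eef=[0.0] * 9, eef_dot=[0.0] * 9, vf=[0.0] * 32)
```

With these numbers an episode covers 2.5 seconds of simulated time.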

B. Generalization





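It seems that the radially symmetric shape of these objects makes it easier for the policy to achieve a grasp. When the policy is trained with a cube, the gripper must approach the cube in a specific orientation to succeed, which makes the task harder. On the other hand, a policy trained with a set of different shapes is better able to capture the features common to all training objects.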

C. Robustness

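The aim is to evaluate how well the policy, through its visual component, adapts to errors in the estimated target pose. The test feeds the policy a grasp pose that includes an error offset from the actual position of the target object.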
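One way to picture this test is to perturb the true object position before handing it to the policy. The sketch below adds a bounded random offset to each coordinate. The uniform distribution, the 2 cm bound, the example position, and the function name are assumptions for illustration, since the notes do not state how the offsets were generated.

```python
import random


def add_pose_error(position, max_offset, rng):
    """Return a copy of `position` with a uniform offset in
    [-max_offset, max_offset] added to each axis."""
    return [p + rng.uniform(-max_offset, max_offset) for p in position]


rng = random.Random(42)
true_position = [0.60, -0.20, 0.10]   # metres, hypothetical object location
noisy_position = add_pose_error(true_position, max_offset=0.02, rng=rng)
```

The policy would then receive `noisy_position` in place of the true one, and the grasp success rate could be recorded as a function of `max_offset`.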



