End-to-end optimization of goal-driven and visually grounded dialogue systems — arXiv2