We focus on the problem of still image-based human action recognition, which essentially involves making predictions by analyzing human poses and their interactions with objects in the scene. In addition to image-level action labels (e.g., riding, phoning), existing works usually require human bounding boxes as additional input during both training and testing to facilitate the characterization of the underlying human–object interactions. We argue that this additional input requirement might severely discourage potential applications and is not strictly necessary.
IEEE Transactions on Image Processing, vol. 25, no. 11, pp. 5479-5490, Nov. 2016, doi: 10.1109/TIP.2016.2605305