Action recognition in video sequences is an interesting field for many computer vision applications, including behavior analysis, event recognition, and video surveillance. In this article, a method based on 2D skeleton and two-branch stacked Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) cells is proposed. Unlike 3D skeletons, usually generated by RGB-D cameras, the 2D skeletons adopted in this article are reconstructed starting fromRGBvideo streams, therefore allowing the use of the proposed approach in both indoor and outdoor environments. Moreover, any case of missing skeletal data is managed by exploiting 3D-Convolutional Neural Networks (3DCNNs). Comparative experiments with several key works on KTH and Weizmann datasets show that the method described in this paper outperforms the current state-of-the-art. Additional experiments on UCF Sports and IXMAS datasets demonstrate the effectiveness of our method in the presence of noisy data and perspective changes, respectively. Further investigations on UCF Sports, HMDB51, UCF101, and Kinetics400 highlight how the combination between the proposed two-branch stacked LSTM and the 3D-CNN-based network can manage missing skeleton information, greatly improving the overall accuracy. Moreover, additional tests on KTH and UCF Sports datasets also show the robustness of our approach in the presence of partial body occlusions. Finally, comparisons on UT-Kinect and NTU-RGB+D datasets show that the accuracy of the proposed method is fully comparable to that of works based on 3D skeletons.