Gesture recognition is applied in a variety of intelligent scenarios. In this paper, we propose a multi-modality fusion temporal segment networks (MMFTSN) model for dynamic gesture recognition. Video data from three gesture modalities, RGB, depth, and optical flow (OF), are divided into equal segments, and frames are randomly sampled from each segment. The sampled frames are then classified with a convolutional neural network, and finally the classification results of the three modalities are fused. On the Chalearn LAP IsoGD gesture database, MMFTSN achieves a recognition accuracy of 60.2%, surpassing the results of related algorithms and demonstrating the improved performance of our model.
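The two core steps of the pipeline, equal-segment random frame sampling and late fusion of per-modality class scores, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the uniform fusion weights, and the use of plain score averaging are all assumptions.

```python
import random

def sample_segment_indices(num_frames, num_segments, rng=None):
    """Divide a video into equal segments and randomly pick one frame per segment
    (TSN-style sparse sampling; illustrative, not the paper's exact code)."""
    rng = rng or random.Random(0)
    seg_len = num_frames // num_segments
    return [k * seg_len + rng.randrange(max(1, seg_len))
            for k in range(num_segments)]

def fuse_scores(modality_scores, weights=None):
    """Late fusion: weighted average of per-modality class-score vectors
    (e.g. RGB, depth, and optical-flow streams). Uniform weights assumed."""
    n = len(modality_scores)
    weights = weights or [1.0 / n] * n
    num_classes = len(modality_scores[0])
    return [sum(w * scores[c] for w, scores in zip(weights, modality_scores))
            for c in range(num_classes)]
```

For example, fusing hypothetical two-class score vectors from three modality streams with `fuse_scores([[0.2, 0.8], [0.6, 0.4], [0.4, 0.6]])` averages the scores class-wise, and the predicted gesture is the arg-max of the fused vector.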