Recommender systems play a vital role in boosting sales for e-commerce platforms. However, much research on recommender systems offers limited explainability and little insight into user preferences. This study addresses this gap by proposing a Review-enhanced Multimodal Neural Attentive (RMNA) model for explainable recommendations. Specifically, the RMNA model uses user reviews as a supervisory signal and applies attention networks to product images and descriptive text to capture users’ fine-grained multimodal preferences across different image regions and text elements. Inspired by cognitive style theory, the RMNA model also measures the influence of textual and visual information on individual purchase behavior. Experimental results demonstrate the effectiveness of our approach in producing personalized and interpretable recommendations. This study offers insight into users’ purchasing decisions and into improving user satisfaction by exposing the logic and reasons behind the recommendations.