Current methods for analyzing simultaneous single-cell Hi-C and RNA-seq data rely on separate single-modality embeddings and therefore fail to capture the intrinsic regulatory connections between chromatin architecture and gene expression. Here, we present scMUT, a transformer-based cross-modality representation learning framework that uses contrastive learning to align scRNA-seq and scHi-C data in a unified feature space. We demonstrate that scMUT effectively integrates multimodal information and supports transfer learning to downstream tasks, including scHi-C resolution enhancement, scRNA-seq denoising, and cell-type annotation. Furthermore, scMUT yields biologically meaningful insights into the relationship between genome structure and transcription, and identifies a previously uncharacterized blood cell subtype during early embryonic development. Our approach provides a versatile tool for the joint analysis of simultaneous single-cell multi-omics data.
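
To make the idea of contrastive alignment concrete, the following is a minimal, dependency-free sketch of a symmetric InfoNCE objective, a common choice for pulling paired embeddings from two modalities together. The abstract states only that contrastive learning is used, so the specific loss, the `temperature` value, and all function and variable names here are assumptions for illustration and not the scMUT implementation. Each row of `rna_emb` and `hic_emb` is assumed to be the embedding of the same cell in the two modalities.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if its norm is zero)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_info_nce(rna_emb, hic_emb, temperature=0.1):
    """Symmetric InfoNCE loss over paired embeddings (assumed objective).

    Row i of rna_emb and row i of hic_emb are treated as a positive pair;
    every other row in the batch serves as a negative.
    """
    rna = [l2_normalize(v) for v in rna_emb]
    hic = [l2_normalize(v) for v in hic_emb]
    # Cosine similarity between every RNA/Hi-C pair, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(r, h)) / temperature for h in hic]
        for r in rna
    ]
    logits_t = [list(col) for col in zip(*logits)]  # Hi-C -> RNA direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


# Toy data: 8 hypothetical cells with 16-dimensional embeddings per modality.
random.seed(0)
rna = [[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]
hic_paired = [[x + 0.01 * random.gauss(0, 1) for x in v] for v in rna]
hic_mispaired = hic_paired[::-1]  # reversing breaks every cell pairing

loss_paired = symmetric_info_nce(rna, hic_paired)
loss_mispaired = symmetric_info_nce(rna, hic_mispaired)
print(f"paired: {loss_paired:.4f}  mispaired: {loss_mispaired:.4f}")
```

In this toy setting the loss is lower when each cell's two embeddings are correctly paired than when the pairing is broken, which is the property a contrastive objective exploits to place both modalities in one feature space. In practice the embeddings would come from trained encoders rather than random vectors.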