In urban environments, the positioning performance of global navigation satellite systems (GNSS) degrades severely because of the harsh observation environment; integrating a strap-down inertial navigation system (SINS) mitigates this degradation. However, because errors accumulate, the pose accuracy of GNSS/SINS integration still degrades significantly when gross errors and cycle slips occur frequently in GNSS-challenged environments. Fusing visual information can further enhance the accuracy. Multi-sensor fusion, in turn, introduces unmodelled errors when the spatial parameters are inaccurate, whether from coarse calibration or from slight changes in the mechanical structure. Moreover, an unsynchronized temporal relationship between sensors also degrades the pose and extrinsic accuracy. To reduce the negative influence of spatial and temporal misalignment, we propose a vision-aided GNSS/SINS integration system with online spatial and temporal compensation, in which the time delay and the extrinsic parameters between the SINS and the camera are estimated and compensated online together with the SINS state. The influence of temporal compensation and GNSS information on extrinsic accuracy is analyzed with a Monte Carlo simulation and a real-world experiment. In-field experiments were conducted to evaluate the positioning performance of the proposed multi-sensor fusion system. The results show that: 1) with the support of vision and SINS, GNSS accuracy is considerably improved in complex environments, and the fused solution also outperforms GNSS/SINS integration; 2) online spatial-temporal compensation significantly improves the pose accuracy.