Multimodal Spiking-Mixer with robustness-improved ODE-neuron

Propose the first vision-language multimodal Spiking Neural Network (SNN) for image-caption applications
Demonstrate that an SNN can be viewed as a Neural ODE, and that analyzing the stability of this ODE yields adversarial robustness
Extend multimodal vision-language adversarial attacks to the SNN domain and demonstrate their effectiveness compared with a naive unimodal implementation
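
The Neural-ODE view in the second point can be illustrated with a minimal sketch. It assumes a standard leaky integrate-and-fire (LIF) neuron, dv/dt = (-v + I) / tau, discretized with forward Euler; this is not necessarily the ODE-neuron used in this work. The function names, the hard reset to zero, and all parameter values are hypothetical choices made for illustration. For this linear leak, a perturbation of the membrane potential is scaled by |1 - dt/tau| at every sub-threshold step, so it decays whenever 0 < dt < 2*tau.

```python
import numpy as np

def lif_euler(current, tau=2.0, dt=1.0, v_th=1.0, v0=0.0):
    """Forward-Euler simulation of an assumed LIF neuron, dv/dt = (-v + I) / tau.

    Returns the spike train and the membrane-potential trace.
    The hard reset to 0 after a spike is an illustrative assumption.
    """
    v = v0
    spikes = np.zeros(len(current))
    trace = np.zeros(len(current))
    for t, i_t in enumerate(current):
        v = v + (dt / tau) * (-v + i_t)  # one Euler step of the leak ODE
        if v >= v_th:                    # threshold crossing emits a spike
            spikes[t] = 1.0
            v = 0.0                      # hard reset (assumption)
        trace[t] = v
    return spikes, trace

def perturbation_gain(tau, dt):
    """Per-step factor applied to a sub-threshold perturbation of v.

    A value below 1 means the discretized ODE contracts perturbations.
    """
    return abs(1.0 - dt / tau)

# Two runs with the same sub-threshold input but different initial potentials.
current = np.full(5, 0.5)
_, trace_clean = lif_euler(current, v0=0.0)
_, trace_pert = lif_euler(current, v0=0.1)

# Gap between the two membrane traces after five steps.
gap = abs(trace_pert[-1] - trace_clean[-1])
print("per-step gain:", perturbation_gain(tau=2.0, dt=1.0))
print("gap after 5 steps:", gap)
```

In this sketch the initial gap of 0.1 is halved at every step because the gain is 0.5. Choosing dt larger than 2*tau pushes the gain above 1, in which case the same perturbation would grow instead of decaying.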