Can a Transformer take 2-dimensional input like [1, 3, 224, 224]?

I would like to ask whether a Transformer (self-attention network) can take an input of size [1, 3, 224, 224].
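
For context, self-attention layers usually expect a sequence shaped like [batch, tokens, features], so a tensor of this shape would typically be cut into patches and flattened first, as ViT-style models do. Here is a rough sketch of the shape arithmetic only. The function name `patch_sequence_shape` and the patch size of 16 are illustrative assumptions, not part of any library:

```python
def patch_sequence_shape(input_shape, patch_size=16):
    """Return (batch, num_tokens, token_dim) for an image batch split into square patches.

    input_shape is assumed to be (batch, channels, height, width).
    patch_size is an illustrative choice, not a required value.
    """
    batch, channels, height, width = input_shape
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be divisible by patch_size")
    # Each non-overlapping patch becomes one token in the sequence.
    num_tokens = (height // patch_size) * (width // patch_size)
    # Each token is the flattened patch: all channels times patch area.
    token_dim = channels * patch_size * patch_size
    return (batch, num_tokens, token_dim)


print(patch_sequence_shape((1, 3, 224, 224)))  # (1, 196, 768)
```

Under these assumptions the [1, 3, 224, 224] input would become a sequence of 196 tokens with 768 features each, which is a shape a self-attention layer can consume.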