SVT

In this paper, we propose self-supervised training for video transformers using unlabelled video data. From a given video, we create local and global spatiotemporal views with varying spatial sizes and frame rates. Our self-supervised objective seeks to match the features of these different views of the same video, so that the learned representation is invariant to spatiotemporal variations in actions. To the best of our knowledge, the proposed approach, Self-supervised Video Transformer (SVT), is the first to alleviate the dependency on negative samples or dedicated memory banks in self-supervised training of video transformers. Further, owing to the flexibility of Transformer models, SVT supports slow-fast video processing within a single architecture using dynamically adjusted positional encodings, and it supports long-term relationship modeling along spatiotemporal dimensions. Our approach performs well on four action recognition benchmarks (Kinetics-400, UCF-101, HMDB-51, and SSv2) and converges faster with small batch sizes.
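The view construction and feature matching described above can be illustrated with a minimal sketch. It is not the released SVT code: the helper names `sample_view` and `matching_loss`, the frame counts, strides, crop sizes, and temperatures are all assumptions chosen for illustration, and the teacher-student cross-entropy is one common way to match views without negative samples, not necessarily the exact objective used here.

```python
import math
import random


def sample_view(num_video_frames, num_frames, stride, crop_size, rng):
    """Pick strided frame indices and a square crop size for one view.

    A larger stride corresponds to a lower frame rate. All argument
    values used below are illustrative assumptions.
    """
    span = (num_frames - 1) * stride + 1
    if span > num_video_frames:
        raise ValueError("clip span exceeds video length")
    start = rng.randrange(num_video_frames - span + 1)
    frames = [start + i * stride for i in range(num_frames)]
    return {"frames": frames, "crop": crop_size}


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_norm for x in scaled]


def matching_loss(student_logits, teacher_logits,
                  t_student=0.1, t_teacher=0.04):
    """Cross-entropy between teacher and student distributions.

    Only two views of the same video are compared, so no negative
    samples or memory bank are involved. Temperatures are assumed values.
    """
    log_p_student = log_softmax(student_logits, t_student)
    p_teacher = [math.exp(v) for v in log_softmax(teacher_logits, t_teacher)]
    return -sum(pt * ls for pt, ls in zip(p_teacher, log_p_student))


rng = random.Random(0)
# Global view: more frames, larger crop (hypothetical settings).
global_view = sample_view(64, num_frames=16, stride=4, crop_size=224, rng=rng)
# Local view: fewer frames, smaller crop (hypothetical settings).
local_view = sample_view(64, num_frames=4, stride=2, crop_size=96, rng=rng)

# Toy output vectors standing in for network outputs on the two views.
aligned = matching_loss([2.0, 0.0, 0.0], [2.0, 0.0, 0.0])
misaligned = matching_loss([0.0, 2.0, 0.0], [2.0, 0.0, 0.0])
```

In this toy setup the loss is small when the student output for one view agrees with the teacher output for another view, and large when they disagree, which is the behaviour a view-matching objective needs.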

Method	Backbone	TFLOPs	S-Res	T-Res	Epochs	UCF-101 Linear	UCF-101 Fine-tune	HMDB-51 Linear	HMDB-51 Fine-tune
MemDPC	R2D3D-34	-	224	64	-	54.1	86.1	30.5	54.5
CoCLR	S3D	0.07	128	32	100	77.8	87.9	52.4	54.6
ELo	R(2+1)D	17.5	224	-	100	-	84.2	-	53.7
RSPNet	S3D-G	0.07	112	16	200	-	89.9	-	59.6
VideoMoCo	R(2+1)D	17.5	112	32	200	78.7	-		-
BE	I3D	2.22	224	16	50	-	87.1	-	56.2
CMD	R(2+1)D-26	-	112	16	120	-	85.7	-	54.0
CVRL	R3D-50	3.19	224	32	800	89.2	92.2	57.3	66.7
MoDist	R3D-50	3.19	224	8	100	91.5	94.0	63.0	67.4
BraVe	R3D-50	3.19	224	16	-	92.5	95.1	68.3	74.6
Vi2CLR	S3D	0.07	128	32	300	75.4	89.1	47.3	55.7
ASCNet	S3D-G	0.07	224	64	200	-	90.8	-	60.5
TEC	S3D-G	0.07	128	32	200	-	88.2	-	63.5
LSFD	C3D	-	224	16	-	-	79.8	-	52.1
MCN	R3D	3.19	128	32	50	73.1	89.7	42.9	59.3
CORP	R3D-50	3.19	224	16	800	90.2	93.5	58.7	68.0
SVT (Ours)	ViT-B	0.59	224	16	20	90.8	93.7	57.8	67.2
