I wanted to explore different ways to optimize PyTorch models for inference, so I played a little bit with TorchScript, ONNX Runtime and classic PyTorch eager-mode and compared their performance. I use pre-trained RoBERTa model (trained for sentiment analysis from tweets) along with BERT tokenizer. Both models are available here.
I wrote 14 short-to-medium length text sequences (7 with positive and 7 with negative sentiments) and I used them for model prediction. To obtain more reliable results, I repeated that process 1000 times (1000 times x 14 sequences = 14K runs for a single model configuration).
Check out results and charts in my GitHub repository.