



John Osorio Ríos<sup>\*†§</sup>, Adrià Armejach<sup>\*†</sup>, Eric Petit<sup>§</sup>, Greg Henry<sup>§</sup> and Marc Casas<sup>\*†</sup>



**Barcelona** Supercomputing Center Centro Nacional de Supercomputación


### **Evaluating Reduced Numerical Datatypes to Train Deep Neural Networks using PIN**

<sup>\*</sup> Barcelona Supercomputing Center (BSC), <sup>†</sup> Universitat Politècnica de Catalunya (UPC), <sup>§</sup> Intel


This work has been partially supported by Intel under the BSC-Intel collaboration.





A. Canziani, E. Culurciello, A. Paszke, « An Analysis of Deep Neural Networks Models for Practical Applications », in The 2017 IEEE International Symposium on Circuits & Systems, Baltimore, USA, May 2017.



#### **DNNs Overview**

- The use of Deep Neural Networks (DNNs) is becoming ubiquitous.
- Medicine, sports, chemistry and physics are fields where DNNs are widely used nowadays.
- Models and datasets continue to grow deeper and larger, increasing computational needs.



#### Motivation

- Training Deep Neural Networks (DNNs) is a costly task in terms of computational resources like execution time or power.
- There are approaches able to reduce training costs without reducing DNN accuracy. These approaches rely on reduced computer number formats.
- We propose:
  - A method to evaluate several reduced precision datatype approaches (FASE).
  - A technique to dynamically adapt the numerical precision during the training phase.
  - A set of compound datatypes relying on a specific datatype.





#### Outline

- A Fast, Accurate and Seamless Emulator for Custom Numerical Formats (FASE) (Link)
- Dynamically Adapting Floating-Point Precision to Accelerate Deep Neural Network Training (Link)
- A BF16 FMA is All You Need for DNN Training (Link)







# A Fast, Accurate and Seamless Emulator for Custom Numerical Formats (FASE)



#### **A Fast, Accurate and Seamless Emulator for Custom Numerical Formats**

- FASE is a tool that enables the emulation of custom numerical formats on any application.
- It enables HW architects to understand numerical behavior before committing to costly HW implementations.
- It is based on Intel PIN.

| Features          | RPE [7] | QPyTorch [39] | Verificarlo [4] | FASE                    |
|-------------------|---------|---------------|-----------------|-------------------------|
| Fast              | ×       | ✓             | ✓               | ✓                       |
| Accurate          | ✓       | ×             | ✓               | ✓                       |
| Seamless          | ×       | ×             | ×               | × (recompilation)       |
| Dynamic Libraries | ×       | ×             | ×               | × (lib. recompilation)  |
| Independent       | ×       | ×             | ✓               | × (compiler dep.)       |





#### **A Fast, Accurate and Seamless Emulator for Custom Numerical Formats**

- There are various state-of-the-art techniques to emulate reduced precision approaches:
  - Coarse-grain granularity (function level)
  - Fine-grain granularity (instruction level)





#### **Design Principles**

- Simplicity is the most important feature of FASE.
- It enables fast, accurate and seamless emulation of custom numerical formats.
- It emulates code of external dynamically linked libraries.



#### **Workload Characterization**

- We analyze DNN workloads.







#### Implementation







#### **Emulation Accuracy**

- **Methodology:** We use SGEMM to multiply two matrices using the Intel Math Kernel Library.
- **Results:** The figure compares the relative error when employing fine-grain and coarse-grain emulation on the Intel MKL SGEMM kernel.
- Using FASE (fine-grain):
  - It is close to what would be observed on real HW.
  - It is able to track errors that accumulate per instruction.
- Using coarse-grain:
  - Results are more accurate than they should be.
  - It cannot capture errors that accumulate per instruction.
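The difference between the two granularities can be illustrated with a small NumPy sketch. This is not FASE itself (which instruments binaries through Intel PIN) and it does not call MKL: the helper names `to_bf16`, `matmul_coarse` and `matmul_fine` are hypothetical, and the fine-grain variant assumes that every multiply and every add result is rounded to BF16 with round-to-nearest-even.

```python
import numpy as np

def to_bf16(x):
    """Round float32 values to the nearest BF16 value (ties to even).

    The result stays in float32 storage with the low 16 bits cleared.
    NaN handling is omitted; this is an illustration only.
    """
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    bias = np.uint32(0x7FFF) + ((bits >> np.uint32(16)) & np.uint32(1))
    return ((bits + bias) & np.uint32(0xFFFF0000)).view(np.float32)

def matmul_coarse(a, b):
    """Coarse-grain emulation: round only at the function boundary."""
    return to_bf16(to_bf16(a) @ to_bf16(b))

def matmul_fine(a, b):
    """Fine-grain emulation: round the result of every multiply and add."""
    a, b = to_bf16(a), to_bf16(b)
    acc = np.zeros((a.shape[0], b.shape[1]), dtype=np.float32)
    for k in range(a.shape[1]):
        prod = to_bf16(a[:, k:k + 1] * b[k:k + 1, :])  # rounded products
        acc = to_bf16(acc + prod)                      # rounded accumulation
    return acc

def rel_error(c, ref):
    return float(np.linalg.norm(c - ref) / np.linalg.norm(ref))

rng = np.random.default_rng(0)
A = rng.random((16, 256), dtype=np.float32)
B = rng.random((256, 16), dtype=np.float32)
ref = A.astype(np.float64) @ B.astype(np.float64)

err_coarse = rel_error(matmul_coarse(A, B), ref)
err_fine = rel_error(matmul_fine(A, B), ref)
print(f"coarse-grain: {err_coarse:.2e}  fine-grain: {err_fine:.2e}")
```

In this sketch the coarse-grain result looks better than a BF16 datapath could deliver, because the accumulation silently runs in FP32; the fine-grain loop exposes the per-instruction rounding error.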





#### **Emulation Overhead Measurement**

- **Results:** The table shows the emulation latencies introduced by FASE when converting in a fine-grain manner the input and output operands to BF16 with RNE rounding.

| Workload (framework)  | FASE Instr. | Latency: Unopt | Latency: Opt1 (basic block) | Latency: Opt2 (vectorization) | Latency: Full |
|-----------------------|-------------|----------------|-----------------------------|-------------------------------|---------------|
| SGEMM (MKL)           | 15×         | 1809×          | 880×                        | 82×                           | 39×           |
| ResNet50 (Caffe)      | 11×         | 1131×          | 553×                        | 76×                           | 30×           |
| 3DGan (Tensorflow)    | 7×          | 714×           | 340×                        | 66×                           | 28×           |
| LSTM (PyTorch)        | 18×         | 1096×          | 551×                        | 70×                           | 29×           |
| Transformer (PyTorch) | 8×          | 818×           | 423×                        | 36×                           | 17×           |



#### Large Scale Experiments

- **Methodology:** To show FASE supports real workloads we perform a set of large-scale experiments. These tests consider the use of several DNN models, datasets and numerical datatypes.
- **Results:** The table shows the results of using FASE for several full DNN training workloads.

| Model               | Dataset  | Accuracy (FP32) | Accuracy (BF16) | Accuracy (MP) |
|---------------------|----------|-----------------|-----------------|---------------|
| ResNet18            | CIFAR100 | 71.91%          | 71.46%          | 71.89%        |
| ResNet34            | CIFAR100 | 73.21%          | 72.83%          | 73.86%        |
| ResNet50            | CIFAR100 | 74.78%          | 69.24%          | 74.25%        |
| ResNet101           | CIFAR100 | 75.93%          | 67.10%          | 75.65%        |
| MobileNetV2         | CIFAR100 | 75.04%          | 73.92%          | 75.16%        |
| AlexNet             | ImageNet | 60.79%          | 57.80%          | 60.18%        |
| Inception           | ImageNet | 74.01%          | 72.03%          | 73.73%        |
| LSTMx2 (Perplexity) | PTB      | 86.86           | 137.69          | 87.09         |
| Transformers (BLEU) | IWSLT16  | 34.53           | 34.86           | 34.66         |





#### Conclusions

- We propose FASE, an emulation tool for custom numerical formats. FASE is accurate, fast, and **seamless**.
- Our evaluation demonstrates that FASE is more accurate than other state-of-the-art proposals that employ coarse-grain emulation, uncovering relative errors that appear only in fine-grain emulation.
- We demonstrate that by applying both the basic block and vectorization optimizations, FASE latency overheads are manageable, ranging from 17× to 39× for a wide variety of workloads.





# Dynamically Adapting Floating-Point Precision to Accelerate Deep Neural Network Training






### **State-of-the-Art FMAs for Training**

| Training      | Inputs A, B | Input C | Output D | Multiply  | Accum. |
|---------------|-------------|---------|----------|-----------|--------|
| Tensor cores  | FP16/BF16   | FP32    | FP32     | FP16/BF16 | FP32   |
| Google TPU v3 | BF16        | FP32    | FP32     | BF16      | FP32   |
| AVX512-BF16   | BF16        | FP32    | FP32     | FP32      | FP32   |
| Full BF16     | BF16        | BF16    | BF16     | BF16      | BF16   |









Static Techniques on ResNet-50

#### **Analysis for Evaluated DNNs**



#### **Dynamic Precision Training**

| Pseudocode | Comment |
|------------|---------|
| $numBatchesMP \leftarrow 10$ | Number of consecutive MP batches |
| $numBatchesBF16 \leftarrow 1000$ | Number of consecutive BF16 batches |
| $emaThreshold \leftarrow 0.04$ | Defines EMA reduction threshold |
| $precisionModeBF16 \leftarrow False$ | Indicates current precision mode, True means BF16 |
| $countBatchesBF16 \leftarrow 0$ | Counts how many numBatchesBF16 have been executed |
| $numBatchesTrain \leftarrow numBatchesMP$ | Number of batches per training loop iteration |
| **for** $i = 0$ **to** $niter$ **do** | |
| &emsp;$train.step(numBatchesTrain)$ | Runs numBatchesTrain batches in the mode given by precisionModeBF16 |
| &emsp;$trainingLoss[i] \leftarrow train.trainingLoss$ | |
| &emsp;**if** $i = 5$ **then** | Initial history to calculate EMA |
| &emsp;&emsp;$EMA \leftarrow average(trainingLoss)$ | |
| &emsp;**if** $i > 5$ **then** | |
| &emsp;&emsp;$EMAprev \leftarrow EMA$ | |
| &emsp;&emsp;$EMA \leftarrow emaCalculation(trainingLoss, EMAprev)$ | Each numBatchesMP |
| &emsp;&emsp;**if** $(precisionModeBF16 \neq True)$ **then** | |
| &emsp;&emsp;&emsp;**if** $((EMAprev - EMA) > emaThreshold)$ **then** | If training loss goes down |
| &emsp;&emsp;&emsp;&emsp;$precisionModeBF16 \leftarrow True$ | |
| &emsp;&emsp;&emsp;&emsp;$changeToBF16()$ | Switch precision to BF16 |
| &emsp;&emsp;**else** | |
| &emsp;&emsp;&emsp;$countBatchesBF16 \leftarrow countBatchesBF16$ … | |
| &emsp;&emsp;&emsp;**if** $(countBatchesBF16 =$ … | |
| &emsp;&emsp;&emsp;&emsp;**if** $((EMAprev - EMA)$ … | |
| &emsp;&emsp;&emsp;&emsp;&emsp;$countBatchesBF16 \leftarrow$ … | |
| &emsp;&emsp;&emsp;&emsp;**else** | |
| &emsp;&emsp;&emsp;&emsp;&emsp;$precisionModeBF16$ … | |
| &emsp;&emsp;&emsp;&emsp;&emsp;$changeToMP()$ | |
| &emsp;&emsp;&emsp;&emsp;&emsp;$countBatchesBF16 \leftarrow$ … | |
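The control logic of the algorithm can be sketched in Python as a small state machine. This is an illustrative sketch, not the authors' implementation. The parts the slide cuts off are filled with clearly labeled assumptions: the EMA is computed as a standard exponential moving average with a hypothetical smoothing factor `alpha`, the BF16 counter grows by `num_batches_mp` per iteration, the counter is compared with `>=` and reset to 0, and the actual `changeToBF16()` / `changeToMP()` hooks are replaced by a boolean flag.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DynamicPrecisionController:
    num_batches_mp: int = 10       # batches per loop iteration (from the slide)
    num_batches_bf16: int = 1000   # BF16 batches before re-checking (from the slide)
    ema_threshold: float = 0.04    # required EMA reduction (from the slide)
    warmup: int = 5                # iterations of history before the first EMA
    alpha: float = 0.5             # ASSUMPTION: EMA smoothing factor
    bf16: bool = False             # current mode, True means BF16
    count_bf16: int = 0
    ema: Optional[float] = None
    history: List[float] = field(default_factory=list)

    def update(self, loss: float) -> bool:
        """Record one training loss; return True if the next step uses BF16."""
        self.history.append(loss)
        i = len(self.history) - 1
        if i == self.warmup:
            # Initial history to calculate the EMA
            self.ema = sum(self.history) / len(self.history)
        elif i > self.warmup:
            ema_prev = self.ema
            # ASSUMPTION: standard exponential moving average
            self.ema = self.alpha * loss + (1.0 - self.alpha) * ema_prev
            improving = (ema_prev - self.ema) > self.ema_threshold
            if not self.bf16:
                if improving:              # training loss goes down
                    self.bf16 = True       # stands in for changeToBF16()
            else:
                # ASSUMPTION: the counter advances by one iteration's batches
                self.count_bf16 += self.num_batches_mp
                if self.count_bf16 >= self.num_batches_bf16:
                    if not improving:
                        self.bf16 = False  # stands in for changeToMP()
                    self.count_bf16 = 0    # ASSUMPTION: reset in both branches
        return self.bf16
```

A training loop would call `controller.update(loss)` once per iteration and pick the FMA mode for the next `num_batches_mp` batches from the returned flag.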



#### **Object Classification DNNs**

[Figure panels: AlexNet, Inception, ResNet-50]



#### **Object Classification DNNs**

| Model     | Epoch | FP32 Top-1 | FP32 Top-5 | MP Top-1 | MP Top-5 | Dynamic Top-1 | Dynamic Top-5 | Dynamic BF16 FMA | BF16 Top-1 | BF16 Top-5 |
|-----------|-------|------------|------------|----------|----------|---------------|---------------|------------------|------------|------------|
| AlexNet   | 32    | 60.79%     | 84.50%     | 60.18%   | 84.43%   | 60.32%        | 84.02%        | 94.60%           | 57.80%     | 82.56%     |
| Inception | 16    | 74.01%     | 92.36%     | 73.73%   | 92.67%   | 72.80%        | 92.02%        | 95.55%           | 72.03%     | 92.05%     |
| ResNet-50 | 32    | 75.96%     | 93.37%     | 75.70%   | 93.20%   | 74.20%        | 92.70%        | 96.40%           | 72.97%     | 92.30%     |







#### Conclusions

- Full BF16 FMA instructions fail to deliver comparable accuracy levels.
- We propose a *Dynamic* training technique that performs between 94.6% and 96.4% of FMAs using full BF16 ones.
- We used Caffe and PyTorch to show the versatility of FASE to work seamlessly on different DNN frameworks.



# A BF16 FMA is All You Need for DNN Training






#### Introduction

- First approach to train state-of-the-art DNNs entirely using the BF16 format.
- We propose a new class of FMA operators, $FMA_{N\_M}^{BF16}$.
  - They represent operands A and B using N BF16 literals (BF16xN).
  - Input C and output D use M BF16 literals (BF16xM).







#### The BF16xN Data Representation

- The BF16xN data representation format is a compound datatype composed of N BF16 literals. The BF16x1 format uses 1-bit and 8-bit storage for sign and exponent, like FP32, and 7 explicit mantissa bits.
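One way to picture a BF16xN value is as a sum of N BF16 literals, each capturing what the previous ones missed. The sketch below assumes that construction (round to BF16, subtract, repeat) with round-to-nearest-even; the slides do not spell out the exact decomposition, and the names `to_bf16` and `split_bf16xn` are hypothetical.

```python
import numpy as np

def to_bf16(x):
    """Round float32 values to the nearest BF16 value (ties to even).

    The result stays in float32 storage with the low 16 bits cleared.
    NaN handling is omitted; this is an illustration only.
    """
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    bias = np.uint32(0x7FFF) + ((bits >> np.uint32(16)) & np.uint32(1))
    return ((bits + bias) & np.uint32(0xFFFF0000)).view(np.float32)

def split_bf16xn(x, n):
    """ASSUMED decomposition: n BF16 literals whose sum approximates x."""
    residual = np.asarray(x, dtype=np.float32)
    literals = []
    for _ in range(n):
        literal = to_bf16(residual)
        literals.append(literal)
        residual = residual - literal  # what the literals so far still miss
    return literals

rng = np.random.default_rng(0)
x = rng.standard_normal(1000).astype(np.float32)

max_err = []
for n in (1, 2, 3):
    approx = sum(lit.astype(np.float64) for lit in split_bf16xn(x, n))
    max_err.append(float(np.max(np.abs(approx - x.astype(np.float64)))))
print(max_err)  # the error shrinks as literals are added
```

Under these assumptions each literal contributes 8 significant bits, so three literals cover the 24-bit significand of the FP32 inputs used here.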



#### Characterization of $FMA_{N\_M}^{BF16}$ Units

|                                     | $FMA_{1\_1}^{BF16}$ | $FMA_{1\_2}^{BF16}$ | $FMA_{1\_3}^{BF16}$ | $FMA_{2\_2}^{BF16}\{3\}$ | $FMA_{2\_2}^{BF16}\{4\}$ | $FMA_{3\_3}^{BF16}\{6\}$ | $FMA_{3\_3}^{BF16}\{9\}$ |
|-------------------------------------|------|------|------|--------------|------|----------------|------|
| Multiplier mantissa bits            | 8    | 8    | 8    | [15, 16\*]   | 16   | [23, 24\*\*]   | 24   |
| Maximum input bitwidth              | 16   | 32   | 48   | 32           | 32   | 48             | 48   |
| # BF16 multiplications              | 1    | 1    | 1    | 3            | 4    | 6              | 9    |
| # Area units                        | 64   | 64   | 64   | 192          | 256  | 384            | 576  |
| Speed-up wrt FP32 (equivalent area) | 9.0× | 9.0× | 9.0× | 3.0×         | 2.3× | 1.5×           | 1.0× |

- To characterize our $FMA_{N\_M}^{BF16}$ units we use the observation that the area of an FMA is dominated by the multiplier, as it grows quadratically with mantissa size. An FP32 FMA requires $24^2 = 576$ area units, while an FMA with BF16 multiplier inputs would require just $8^2 = 64$ units.
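The area and speed-up rows of the table follow from this quadratic model with a few lines of arithmetic. The sketch below is an illustration only: the function names are hypothetical, and the rule that the reduced variants keep the partial products $a_i \cdot b_j$ with $i + j < N$ is an assumption chosen because it reproduces the multiplication counts 3 and 6 in the table.

```python
BF16_MULT_AREA = 8 ** 2   # 64 area units per BF16 multiplier (8-bit mantissa)
FP32_FMA_AREA = 24 ** 2   # 576 area units for an FP32 FMA (24-bit mantissa)

def num_bf16_multiplications(n, drop_low_order=False):
    """Partial products needed to multiply two BF16xN operands.

    ASSUMPTION: the reduced variants keep only the products a_i * b_j
    with i + j < n, giving n * (n + 1) / 2 multiplications instead of n * n.
    """
    return n * (n + 1) // 2 if drop_low_order else n * n

def area_units(n, drop_low_order=False):
    return num_bf16_multiplications(n, drop_low_order) * BF16_MULT_AREA

def speedup_vs_fp32(n, drop_low_order=False):
    """Throughput gain when the FP32 FMA area is filled with these units."""
    return FP32_FMA_AREA / area_units(n, drop_low_order)

for n, drop in [(1, False), (2, True), (2, False), (3, True), (3, False)]:
    print(n, num_bf16_multiplications(n, drop), area_units(n, drop),
          speedup_vs_fp32(n, drop))
```

For the four-multiplication variant the model gives 576 / 256 = 2.25, which the table reports rounded as 2.3×.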









#### **Evaluation**

- The figure shows the results obtained when training ResNet101 using the CIFAR100 dataset.
- $FMA_{2\_2}^{BF16}\{3\}$ outperforms the other operators while using BF16 during the whole training time.

[Figure: x-axis Epochs; series FP32, MP, $FMA_{1\_1}^{BF16}$, $FMA_{1\_2}^{BF16}$, $FMA_{1\_3}^{BF16}$, $FMA_{2\_2}^{BF16}\{3\}$, $FMA_{2\_2}^{BF16}\{4\}$, $FMA_{3\_3}^{BF16}\{6\}$, $FMA_{3\_3}^{BF16}\{9\}$]



#### Conclusions

- We propose a new class of FMA operators, $FMA_{N\_M}^{BF16}$, that entirely relies on BF16 FMA hardware instructions but delivers FP32 training accuracy.
  - In contrast with previous implementations, we do not employ FP32 routines.
  - All FMA instructions use BF16 arithmetic for the whole training process.
- We evaluate the operators on seven different DNN workloads:
  - ResNet18, ResNet34, ResNet50, ResNet101 and MobileNetV2 on CIFAR10/100
  - LSTMx2 on the PTB dataset
  - A transformer-based model on the IWSLT16 dataset





#### **Future Work**

- Support AMX extensions on FASE
- Evaluate other reduced precision datatypes
  - FP8, INT8, INT4
  - Dynamic compound datatypes
  - Evaluation of possible new numerical datatypes










#### THANKS

John Osorio Ríos (john.osorio.rios@intel.com)

Adrià Armejach (adria.armejach@bsc.es)