Adam is an adaptive learning rate optimization algorithm that uses both momentum and per-parameter scaling of the step, combining the benefits of RMSProp and SGD with Momentum. The optimizer is designed for non-stationary objectives and for problems with very noisy and/or sparse gradients. The weight updates are performed as:

$$w_{t} = w_{t-1} - \eta\frac{\hat{m}_{t}}{\sqrt{\hat{v}_{t}} + \epsilon}$$

with

$$\hat{m}_{t} = \frac{m_{t}}{1-\beta^{t}_{1}}$$

$$\hat{v}_{t} = \frac{v_{t}}{1-\beta^{t}_{2}}$$

$$m_{t} = \beta_{1}m_{t-1} + (1-\beta_{1})g_{t}$$

$$v_{t} = \beta_{2}v_{t-1} + (1-\beta_{2})g_{t}^{2}$$
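
As a quick check of why the bias correction is needed, assume the moment estimates start at zero, $m_{0} = v_{0} = 0$ (the initialization used in the original paper). The first step then gives:

$$m_{1} = (1-\beta_{1})g_{1}, \qquad \hat{m}_{1} = \frac{m_{1}}{1-\beta_{1}} = g_{1}$$

$$v_{1} = (1-\beta_{2})g_{1}^{2}, \qquad \hat{v}_{1} = \frac{v_{1}}{1-\beta_{2}} = g_{1}^{2}$$

Without the correction, $m_{1}$ and $v_{1}$ would be shrunk toward zero by the factors $1-\beta_{1}$ and $1-\beta_{2}$. With it, the first update is approximately $-\eta\,\mathrm{sign}(g_{1})$ elementwise, so its magnitude is set by $\eta$ rather than by the scale of the gradient.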

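$g_{t}$ is the gradient of the objective with respect to the weights at step $t$, and $g_{t}^{2}$ denotes its elementwise square; $m_{t}$ and $v_{t}$ are initialized to zero. $\eta$ is the step size/learning rate, with 1e-3 suggested as a default in the original paper. $\epsilon$ is a small number, typically 1e-8 or 1e-10, to prevent dividing by zero. $\beta_{1}$ and $\beta_{2}$ are the exponential decay rates (forgetting parameters) of the moment estimates $m_{t}$ and $v_{t}$, with typical values 0.9 and 0.999, respectively.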
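
The update equations can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names `adam_step` and `minimize_quadratic`, the toy quadratic objective, and the larger learning rate used in the demo are all assumptions made for this example, and the moment estimates are assumed to start at zero.

```python
import numpy as np


def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update.

    w: current weights, g: gradient at w, m/v: running moment estimates,
    t: 1-based step counter (needed for the bias correction).
    """
    m = beta1 * m + (1.0 - beta1) * g        # first-moment (momentum) estimate
    v = beta2 * v + (1.0 - beta2) * g ** 2   # second-moment estimate, elementwise square
    m_hat = m / (1.0 - beta1 ** t)           # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)           # bias-corrected second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v


def minimize_quadratic(steps=2000, lr=1e-2):
    """Toy demo (hypothetical): minimize f(w) = sum((w - target)^2) with Adam."""
    target = np.array([3.0, -1.0])
    w = np.zeros(2)
    m = np.zeros_like(w)  # assumed zero initialization of the moments
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = 2.0 * (w - target)  # analytic gradient of the toy objective
        w, m, v = adam_step(w, g, m, v, t, lr=lr)
    return w


if __name__ == "__main__":
    print(minimize_quadratic())
```

The state (`m`, `v`, and the step counter `t`) is passed in and returned explicitly so that the function maps one-to-one onto the equations; a practical optimizer would usually keep this state per parameter inside an object.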