Automatic Differentiation with Keras in R

machine learning deep learning non-linear programming keras r

In this post we will explore the keras API and how it handles low-level mathematical operations to perform automatic differentiation.

Blake Conrad https://github.com/conradbm
2022-07-17

Automatic Differentiation Examples

Automatic differentiation is arguably one of the great inventions of modern computing, and it has completely changed the face of deep learning. Some of the world's hardest problems, like self-driving cars, become computationally feasible (and with relative ease) because of this breakthrough. I had never personally dug into it, though I had known of it for years. Today the goal is to walk through some boilerplate from the developers themselves, then expand on it with a small example to open up the mind. Let's get started.

Basic Example

We are going to walk through several examples using the GradientTape API, which the keras and tensorflow R packages expose from TensorFlow. These snippets are adapted from the TensorFlow for R tutorial in the references. The Sandbox section will then apply some of these ideas to a small example use case.

# A variable matrix x, initialized to ones
x <- tf$ones(shape(2, 2))
x
tf.Tensor(
[[1. 1.]
 [1. 1.]], shape=(2, 2), dtype=float32)
with(tf$GradientTape() %as% tape, {
  
  # Observe operations on x
  tape$watch(x)
  
  y <- tf$reduce_sum(x) # y(x) = sum(x)
  z <- tf$multiply(y, y) # z(y) = y(x)^2
  # z = sum{x}^2
})

# Now z is a function of x on the tape
dz_dx = tape$gradient(z, x) # dz/dx_j = 2*y * dy/dx_j = 2*4*1 = 8, for every entry of x [2x2]
dz_dx
tf.Tensor(
[[8. 8.]
 [8. 8.]], shape=(2, 2), dtype=float32)

\[ \frac{dz}{dx_j} = 2\,y(x)\,\frac{dy}{dx_j} = 2\left(\sum_{i}{x_i}\right)\cdot 1 = 2\cdot 4 = 8 \quad \forall\, x_j \]
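As a language-agnostic sanity check, the same gradient can be approximated with a central finite difference in plain Python (no TensorFlow involved); each entry of the 2x2 input should come out near 8:

```python
# Finite-difference check of dz/dx for z = (sum of x)^2 on a 2x2 ones matrix.
# The analytic gradient is 2 * sum(x) = 8 for every entry.
def z(flat_x):
    s = sum(flat_x)
    return s * s

x = [1.0] * 4          # the 2x2 ones matrix, flattened
h = 1e-5
grads = []
for j in range(len(x)):
    xp, xm = x[:], x[:]
    xp[j] += h
    xm[j] -= h
    grads.append((z(xp) - z(xm)) / (2 * h))  # central difference

print([round(g, 6) for g in grads])  # [8.0, 8.0, 8.0, 8.0]
```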

x <- tf$ones(tensorflow::shape(2,2))

with(tf$GradientTape() %as% tape, {
  tape$watch(x)
  y <- tf$reduce_sum(x) # y(x) = sum(x)
  z <- tf$multiply(y, y) # z(y) = y(x)^2
})

dz_dy = tape$gradient(z, y) # dz_dy = 2*y(x) = 2*4 = 8; for all y [just 1]
dz_dy
tf.Tensor(8.0, shape=(), dtype=float32)

\[ \frac{dz}{dy} = 2*y(x) = 2\sum{x}=2 * 4 = 8 \]

x <- tf$constant(3)

with(tf$GradientTape(persistent = TRUE) %as% tape, {
  tape$watch(x)
  y <- x * x # y(x) = x^2
  z <- y * y # z(y) = y(x)^2
})

dz_dx <- tape$gradient(z, x) # dz_dx = 2*y(x)*dy_dx = 2*x^2*2*x = 2*(3^2)*2*3
dz_dx
tf.Tensor(108.0, shape=(), dtype=float32)

\[ \frac{dz}{dx} = 2*y(x)*\frac{dy}{dx} = 2*x^2*2*x = 2*(3^2)*2*3=108 \]

dy_dx <- tape$gradient(y, x) # 6.0
dy_dx
tf.Tensor(6.0, shape=(), dtype=float32)

\[ \frac{dy}{dx} = 2x=2*3=6 \]
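Both results can be cross-checked numerically with a central finite difference in plain Python; since z(x) = (x^2)^2 = x^4, we expect dz/dx = 4x^3 = 108 and dy/dx = 2x = 6 at x = 3:

```python
# Central finite differences for the chain-rule example at x = 3.
def diff(f, x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2 * h)

dz_dx = diff(lambda v: (v * v) ** 2, 3.0)  # z = x^4, expect 4*27 = 108
dy_dx = diff(lambda v: v * v, 3.0)         # y = x^2, expect 2*3 = 6

print(round(dz_dx, 4), round(dy_dx, 4))  # 108.0 6.0
```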

f <- function(x, y) {
  output <- 1
  for (i in seq_len(y)) {
    if (i > 2 && i <= 5)
      output <- tf$multiply(output, x)
  }
  output
}

grad <- function(x, y) {
  with(tf$GradientTape() %as% t, {
    t$watch(x)
    out <- f(x, y)
  })
  t$gradient(out, x)
}

x <- tf$constant(2)
grad(x, 6)
tf.Tensor(12.0, shape=(), dtype=float32)
grad(x, 5)
tf.Tensor(12.0, shape=(), dtype=float32)
grad(x, 4)
tf.Tensor(4.0, shape=(), dtype=float32)
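The point of this example is that autodiff differentiates straight through ordinary control flow like loops and conditionals. A minimal way to see the mechanics is a forward-mode sketch with dual numbers in plain Python (note: GradientTape actually uses reverse mode; this is just the simplest illustration of the same principle):

```python
# Forward-mode autodiff via dual numbers: each value carries its derivative.
class Dual:
    def __init__(self, val, dot=0.0):
        self.val = val  # primal value
        self.dot = dot  # derivative with respect to the input

    def __mul__(self, other):
        if not isinstance(other, Dual):
            other = Dual(other)
        # product rule: (u*v)' = u'*v + u*v'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

    __rmul__ = __mul__

def f(x, y):
    out = Dual(1.0)          # the constant 1 has derivative 0
    for i in range(1, y + 1):
        if 2 < i <= 5:       # multiply by x on iterations 3..5 only
            out = out * x
    return out

x = Dual(2.0, 1.0)           # seed dx/dx = 1
print(f(x, 6).dot)  # f = x^3, so f' = 3x^2 = 12.0
print(f(x, 5).dot)  # also x^3 -> 12.0
print(f(x, 4).dot)  # f = x^2, so f' = 2x = 4.0
```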

Advanced Example

x <- tf$Variable(1.0)  # Create a TensorFlow variable initialized to 1.0
x
<tf.Variable 'Variable:0' shape=() dtype=float32, numpy=1.0>
with(tf$GradientTape() %as% t, {
  
  with(tf$GradientTape() %as% t2, {
    y <- x*x*x
  })
  
  # Compute the gradient inside the 't' context manager
  # which means the gradient computation is differentiable as well.
  dy_dx <- t2$gradient(y, x)
  
})

d2y_dx2 <- t$gradient(dy_dx, x)

# Second derivative (curvature)
d2y_dx2
tf.Tensor(6.0, shape=(), dtype=float32)
# First derivative (rate of change)
dy_dx
tf.Tensor(3.0, shape=(), dtype=float32)
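Analytically, with y(x) = x^3 evaluated at x = 1, both tapes agree with calculus:

\[ \frac{dy}{dx} = 3x^2 = 3, \qquad \frac{d^2y}{dx^2} = 6x = 6 \]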

Sandbox

Let’s imagine for a second that we had a really complex function, we knew the types of variables we were going to feed into it, but calculating the gradient by hand would be extremely difficult. Say we wanted to calculate a 10x10 matrix (or tensor) of derivatives.

X = tf$Variable(matrix(rnorm(100), nrow=10, ncol=10))
X
<tf.Variable 'Variable:0' shape=(10, 10) dtype=float64, numpy=
array([[-1.73309684, -0.08322157,  0.54220219, -2.74995815,  1.51813321,
         0.54784681,  0.15225096,  1.13755715,  0.38011554,  0.24894704],
       [-2.41029604,  0.09891428, -0.35209566, -0.57222969, -0.53634412,
         0.71086853, -0.7006611 , -1.0918934 , -0.518368  ,  0.47158885],
       [ 1.95843168, -0.38393546, -0.17446505, -0.75403039,  0.52051699,
        -0.59089743, -0.08355002,  1.84192727, -1.67958722, -0.03590348],
       [ 0.64239372,  0.96948581,  1.18550871,  0.31853438, -0.01353529,
        -1.50215088,  1.16325942,  2.53564633, -0.14070148,  0.80219709],
       [-1.58774512, -0.36109838, -1.17602407,  2.56813718,  2.17165653,
         0.31225676, -0.07452   , -2.58692503,  0.24899347, -1.11514666],
       [ 0.16152645,  0.71037364,  0.00398511, -0.30448759,  0.2536505 ,
         0.62954953, -1.05880117,  0.26962006,  0.05717959, -0.72830774],
       [-0.39773135, -0.82909675, -0.25358638, -0.59576887, -1.21472404,
         0.23462385,  0.64637421, -1.13077943,  0.12405466,  1.48119644],
       [ 1.2187064 ,  0.9232517 , -1.30404339,  0.21532797, -1.01642429,
        -0.57886662,  0.37538024, -0.50451042, -0.61835005,  1.41595008],
       [-0.09359497, -1.87030098,  2.09480425, -0.5700717 , -0.71985003,
        -1.42181604, -0.45254259,  0.04929961,  1.39375553, -0.82066383],
       [ 1.55365989, -0.47480949, -0.08264721, -1.49560904,  0.95268159,
        -0.08616004,  0.13853932,  0.45007053, -0.82517614, -0.07131134]])>

These are the initial values of the matrix, but we want this matrix to match a response matrix as closely as possible.

R = matrix(rbinom(100, 4, 0.25), nrow = 10, ncol = 10)
R
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    3    0    1    2    0    1    2    2    1     1
 [2,]    2    2    0    0    1    1    0    1    2     0
 [3,]    2    2    2    2    0    1    0    1    1     0
 [4,]    1    0    1    1    1    3    1    1    2     2
 [5,]    0    2    2    1    1    1    0    0    1     2
 [6,]    1    1    0    0    1    1    2    0    1     1
 [7,]    2    2    1    0    0    1    0    1    0     0
 [8,]    0    0    2    0    1    2    2    0    1     1
 [9,]    1    2    0    2    0    0    1    3    0     2
[10,]    1    1    1    1    0    2    2    2    0     0

So the goal is to nudge X toward R, as you might in, say, a recommender system. Let’s build a loss function and go for it.
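Concretely, the objective below is the squared reconstruction error plus an L1 penalty, weighted by lambda, on the entries of X:

\[ L(X) = \sum_{i,j}\left(R_{ij} - X_{ij}\right)^2 + \lambda \sum_{i,j}\left|X_{ij}\right| \]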

X = tf$Variable(matrix(rnorm(100), nrow=10, ncol=10))


# Consider a loss function L: squared reconstruction error
# plus an L1 penalty on the entries of X
lambda <- 0.2

# For 1000 steps, go descend
for (i in 1:1000) {

  # Re-record the tape every step so the gradient is
  # evaluated at the current value of X
  with(tf$GradientTape() %as% t, {
    t$watch(X)

    L <- tf$reduce_sum(
            tf$square(
              tf$subtract(R, X)
            )
        )

    # tf$sqrt(tf$square(X)) is |X|, so this is lambda * sum(|X|)
    reg <- lambda * tf$reduce_sum(tf$sqrt(tf$square(X)))

    L <- tf$add(L, reg)
  })

  # decaying learning rate
  alpha <- 0.1 / i

  # gradient of L with respect to X
  dL_dX <- t$gradient(L, X)

  # gradient descent update
  X$assign_sub(alpha * dL_dX)
}

X
<tf.Variable 'Variable:0' shape=(10, 10) dtype=float64, numpy=
array([[ 5.02293261,  0.30973319,  0.99186468,  3.41146814,  0.31291584,
         1.34556413,  3.44241753,  2.49430854,  1.22441611,  2.15290765],
       [ 3.48498694,  3.89164399,  0.18661404,  0.32108998,  0.76755848,
         2.19275776,  0.19756001,  1.83416693,  3.15623643,  0.60512031],
       [ 3.20935864,  3.6486702 ,  3.15430177,  4.32273572, -0.65953477,
         1.21454801, -0.25736091,  1.65204631,  1.75664996, -0.85084353],
       [ 1.99710618, -0.58446189,  1.84274513,  2.23608402,  0.58083008,
         4.89535097,  0.82008945,  1.96283637,  4.25830029,  2.60318369],
       [ 0.4709209 ,  3.21099017,  3.75649051,  1.31540595,  0.24795211,
         1.23813381,  0.29598465, -0.19368636,  1.05168273,  2.51631485],
       [ 2.28950946,  1.78471934, -0.30089427,  0.22914342,  0.57931684,
         1.71634635,  2.10374215, -0.60281344,  1.23714651,  0.53926587],
       [ 3.21948567,  2.61709118,  1.73669846,  0.18467006, -0.82253731,
         0.89098891, -0.31231532,  1.74370668, -0.19812191, -0.79820555],
       [-0.33663196,  0.16062981,  3.35775843,  0.47308834,  1.01396689,
         3.37754752,  2.47726712, -0.49023551,  1.11100834,  0.79643726],
       [ 0.78515   ,  3.17968732, -0.6235009 ,  2.72424966, -0.22137149,
        -0.7216236 ,  1.85288206,  2.83845993, -0.50285423,  3.79311637],
       [ 1.310561  ,  1.96721478,  0.71569936,  1.20825038,  0.59959347,
         3.1721326 ,  2.63536597,  3.76609742, -0.48195342,  1.0897197 ]])>
R
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    3    0    1    2    0    1    2    2    1     1
 [2,]    2    2    0    0    1    1    0    1    2     0
 [3,]    2    2    2    2    0    1    0    1    1     0
 [4,]    1    0    1    1    1    3    1    1    2     2
 [5,]    0    2    2    1    1    1    0    0    1     2
 [6,]    1    1    0    0    1    1    2    0    1     1
 [7,]    2    2    1    0    0    1    0    1    0     0
 [8,]    0    0    2    0    1    2    2    0    1     1
 [9,]    1    2    0    2    0    0    1    3    0     2
[10,]    1    1    1    1    0    2    2    2    0     0
# Final squared error per cell, reshaped for plotting (requires dplyr, tidyr, ggplot2)
E <- (as.matrix(X) - R)^2 %>%
  as.data.frame %>%
  dplyr::mutate(from=paste("V",row_number(),sep="")) %>% 
  tidyr::gather(., key = "to", value = "error", -from)
ggplot(data=E) + 
  geom_density(aes(x=error, fill=to)) +
  facet_wrap(~to) +
  theme_classic() +
  labs(fill="Destination Variable") +
  xlab("Error") +
  ylab("Probability") +
  ggtitle("Reconstruction Matrix Error", subtitle = paste("Total Error: ", round(sum((as.matrix(X) - R)^2), 2), " units", sep="" ))

E %>% ggplot(., aes(x=to, y=from, fill=error)) + 
  geom_tile() + 
  scale_fill_gradient(low = "lightgreen", high = "red") +
  labs(fill = "Reconstruction Error") +
  theme_classic() +
  ggtitle("Reconstruction Matrix Error", subtitle = paste("Total Error: ", round(sum((as.matrix(X) - R)^2), 2), " units", sep="" ))

Pretty remarkable. Gradient descent walked our X matrix toward the R data matrix, minimizing squared error with an L1 penalty that keeps the entries from growing large and noisy.
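The same update rule can be seen in miniature in plain Python: for a single cell with target r, the loss (r - x)^2 + lambda*|x| has the analytic gradient 2(x - r) + lambda*sign(x), and gradient descent settles at the shrunken value r - lambda/2 (here 1.9 for r = 2, lambda = 0.2). This is a hand-rolled sketch, not the TensorFlow code above:

```python
# Scalar version of the sandbox descent: minimize (r - x)^2 + lam * |x|.
r, lam, alpha = 2.0, 0.2, 0.1
x = 0.0
for _ in range(300):
    sign = (x > 0) - (x < 0)           # sign(x), 0 at x == 0
    grad = 2.0 * (x - r) + lam * sign  # analytic gradient of the loss
    x -= alpha * grad                  # gradient descent step

print(round(x, 6))  # 1.9  (the L1 penalty shrinks x slightly below r)
```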

References

https://tensorflow.rstudio.com/tutorials/advanced/customization/autodiff/
https://keras.rstudio.com/articles/eager_guide.html