In this post we will explore the Keras API and how it handles low-level mathematical operations to perform automatic differentiation.
Automatic differentiation is arguably one of the most consequential ideas behind modern deep learning. Some of the world's hardest problems, like self-driving cars, become computationally feasible because of this breakthrough. I had known of it for years from passing mentions, but never dove into using it myself. Today the goal is to walk through some boilerplate from the developers themselves, then expand on it with a small example to open up the mind. Let's get started.
We are going to walk through several examples of using the GradientTape function built into the Keras API. These snippets are graciously taken from the Keras tutorial in the references. The Sandbox section then applies some of these ideas to a small example use case.
# A variable matrix x, initialized to ones
x <- tf$ones(shape(2, 2))
x
tf.Tensor(
[[1. 1.]
[1. 1.]], shape=(2, 2), dtype=float32)
with(tf$GradientTape() %as% tape, {
# Observe operations on x
tape$watch(x)
y <- tf$reduce_sum(x) # y(x) = sum(x)
z <- tf$multiply(y, y) # z(y) = y(x)^2
# z = sum{x}^2
})
# Now z is a function of x on the tape
dz_dx <- tape$gradient(z, x) # dz_dx = 2*y(x)*dy_dx = 2*4*1 = 8; for every entry of x [2x2]
dz_dx
tf.Tensor(
[[8. 8.]
[8. 8.]], shape=(2, 2), dtype=float32)
\[ \frac{dz}{dx_j} = 2\,y(x)\,\frac{dy}{dx_j} = 2\left(\sum_{i}{x_i}\right)\cdot\frac{dy}{dx_j} = 2\cdot 4\cdot 1 = 8 \:\forall x_j \]
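As a quick sanity check on that 8, we can approximate the same derivative numerically in base R with a central difference (this check is my own addition and does not require TensorFlow): holding the other three entries of the all-ones matrix fixed, z as a function of a single entry is (x_j + 3)^2.

```r
# z as a function of one entry x_j, with the other three entries fixed at 1:
# sum(x) = x_j + 3, so z = (x_j + 3)^2
z_of_xj <- function(xj) (xj + 3)^2

# Central finite difference at x_j = 1
h <- 1e-5
fd <- (z_of_xj(1 + h) - z_of_xj(1 - h)) / (2 * h)
fd  # approximately 8, matching dz_dx
```

Because z is quadratic in x_j, the central difference recovers the derivative essentially exactly.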
x <- tf$ones(tensorflow::shape(2,2))
with(tf$GradientTape() %as% tape, {
tape$watch(x)
y <- tf$reduce_sum(x) # y(x) = sum(x)
z <- tf$multiply(y, y) # z(y) = y(x)^2
})
dz_dy <- tape$gradient(z, y) # dz_dy = 2*y(x) = 2*4 = 8; a single scalar
dz_dy
tf.Tensor(8.0, shape=(), dtype=float32)
\[ \frac{dz}{dy} = 2*y(x) = 2\sum{x}=2 * 4 = 8 \]
x <- tf$constant(3)
with(tf$GradientTape(persistent = TRUE) %as% tape, {
tape$watch(x)
y <- x * x # y(x) = x^2
z <- y * y # z(y) = y(x)^2
})
dz_dx <- tape$gradient(z, x) # dz_dx = 2*y(x)*dy_dx = 2*x^2*2*x = 2*(3^2)*2*3
dz_dx
tf.Tensor(108.0, shape=(), dtype=float32)
\[ \frac{dz}{dx} = 2*y(x)*\frac{dy}{dx} = 2*x^2*2*x = 2*(3^2)*2*3=108 \]
dy_dx <- tape$gradient(y, x) # 6.0
dy_dx
tf.Tensor(6.0, shape=(), dtype=float32)
\[ \frac{dy}{dx} = 2x=2*3=6 \]
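The same numerical cross-check works for the persistent-tape example (again a base-R addition of mine, not from the tutorial): composing the two recorded operations gives z = (x^2)^2 = x^4, so dz/dx = 4x^3 = 108 at x = 3.

```r
# z composed directly: z = (x^2)^2 = x^4
z_of_x <- function(x) (x * x)^2

# Central finite difference at x = 3
h <- 1e-4
fd <- (z_of_x(3 + h) - z_of_x(3 - h)) / (2 * h)
fd  # approximately 108 (= 4 * 3^3)
```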
f <- function(x, y) {
output <- 1
for (i in seq_len(y)) {
if (i > 2 && i <= 5)
output <- tf$multiply(output, x)
}
output
}
grad <- function(x, y) {
with(tf$GradientTape() %as% t, {
t$watch(x)
out <- f(x, y)
})
t$gradient(out, x)
}
x <- tf$constant(2)
grad(x, 6)
tf.Tensor(12.0, shape=(), dtype=float32)
grad(x, 5)
tf.Tensor(12.0, shape=(), dtype=float32)
grad(x, 4)
tf.Tensor(4.0, shape=(), dtype=float32)
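Those three results make sense once you notice that f multiplies output by x once for each i in 3..min(y, 5), so f(x, y) = x^k with k = min(y, 5) - 2 (for y >= 3). A small base-R check of the closed-form gradient k * x^(k-1) (my own addition, no TensorFlow required):

```r
# Closed-form gradient of f(x, y) = x^k, where k counts the loop
# iterations that multiply by x
grad_manual <- function(x, y) {
  k <- max(0, min(y, 5) - 2)   # number of multiplications by x
  k * x^(k - 1)
}
grad_manual(2, 6)  # 3 * 2^2 = 12
grad_manual(2, 5)  # 12
grad_manual(2, 4)  # 2 * 2^1 = 4
```

This matches the tape's answers: GradientTape records the operations that actually execute, so the control flow is differentiated "as run."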
x <- tf$Variable(1.0) # Create a TensorFlow variable initialized to 1.0
x
<tf.Variable 'Variable:0' shape=() dtype=float32, numpy=1.0>
with(tf$GradientTape() %as% t, {
with(tf$GradientTape() %as% t2, {
y <- x*x*x
})
# Compute the gradient inside the 't' context manager
# which means the gradient computation is differentiable as well.
dy_dx <- t2$gradient(y, x)
})
d2y_dx2 <- t$gradient(dy_dx, x)
# Second derivative (curvature)
d2y_dx2
tf.Tensor(6.0, shape=(), dtype=float32)
# Rate of change
dy_dx
tf.Tensor(3.0, shape=(), dtype=float32)
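Written out in the same style as the earlier examples, with \(y = x^3\) evaluated at \(x = 1\):

\[ \frac{dy}{dx} = 3x^2 = 3 \qquad \frac{d^2y}{dx^2} = 6x = 6 \]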
Let’s imagine for a second that we had a really complex function: we know what variables we will feed into it, but calculating the gradient by hand would be extremely difficult. Say we want a 10x10 matrix (or tensor) of derivatives.
<tf.Variable 'Variable:0' shape=(10, 10) dtype=float64, numpy=
array([[-1.73309684, -0.08322157, 0.54220219, -2.74995815, 1.51813321,
0.54784681, 0.15225096, 1.13755715, 0.38011554, 0.24894704],
[-2.41029604, 0.09891428, -0.35209566, -0.57222969, -0.53634412,
0.71086853, -0.7006611 , -1.0918934 , -0.518368 , 0.47158885],
[ 1.95843168, -0.38393546, -0.17446505, -0.75403039, 0.52051699,
-0.59089743, -0.08355002, 1.84192727, -1.67958722, -0.03590348],
[ 0.64239372, 0.96948581, 1.18550871, 0.31853438, -0.01353529,
-1.50215088, 1.16325942, 2.53564633, -0.14070148, 0.80219709],
[-1.58774512, -0.36109838, -1.17602407, 2.56813718, 2.17165653,
0.31225676, -0.07452 , -2.58692503, 0.24899347, -1.11514666],
[ 0.16152645, 0.71037364, 0.00398511, -0.30448759, 0.2536505 ,
0.62954953, -1.05880117, 0.26962006, 0.05717959, -0.72830774],
[-0.39773135, -0.82909675, -0.25358638, -0.59576887, -1.21472404,
0.23462385, 0.64637421, -1.13077943, 0.12405466, 1.48119644],
[ 1.2187064 , 0.9232517 , -1.30404339, 0.21532797, -1.01642429,
-0.57886662, 0.37538024, -0.50451042, -0.61835005, 1.41595008],
[-0.09359497, -1.87030098, 2.09480425, -0.5700717 , -0.71985003,
-1.42181604, -0.45254259, 0.04929961, 1.39375553, -0.82066383],
[ 1.55365989, -0.47480949, -0.08264721, -1.49560904, 0.95268159,
-0.08616004, 0.13853932, 0.45007053, -0.82517614, -0.07131134]])>
These are the initial values of the matrix, but we want this matrix to match a response matrix as closely as possible.
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 3 0 1 2 0 1 2 2 1 1
[2,] 2 2 0 0 1 1 0 1 2 0
[3,] 2 2 2 2 0 1 0 1 1 0
[4,] 1 0 1 1 1 3 1 1 2 2
[5,] 0 2 2 1 1 1 0 0 1 2
[6,] 1 1 0 0 1 1 2 0 1 1
[7,] 2 2 1 0 0 1 0 1 0 0
[8,] 0 0 2 0 1 2 2 0 1 1
[9,] 1 2 0 2 0 0 1 3 0 2
[10,] 1 1 1 1 0 2 2 2 0 0
So the goal is to fit this matrix to the response, a setup that shows up in, say, recommender systems. Let’s build a loss function and go for it.
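Concretely, the objective is a squared reconstruction error plus an L1 penalty on the entries of X (written out here for reference; the gradient on the right is exactly what the tape computes for us):

\[ L(X) = \sum_{ij}(R_{ij}-X_{ij})^2 + \lambda\sum_{ij}|X_{ij}| \qquad \frac{\partial L}{\partial X_{ij}} = -2\,(R_{ij}-X_{ij}) + \lambda\,\operatorname{sign}(X_{ij}) \]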
X <- tf$Variable(matrix(rnorm(100), nrow = 10, ncol = 10))
lambda <- 0.2
# Gradient descent for 1000 steps. The tape is rebuilt inside the loop so the
# gradient is evaluated at the current value of X on every step (a tape built
# once outside the loop would replay the same gradient forever). Variables are
# watched automatically, so no t$watch(X) is needed.
for (i in 1:1000) {
with(tf$GradientTape() %as% t, {
# Squared-error loss L plus an L1 penalty (sqrt(x^2) = |x|) on X
L <- tf$reduce_sum(tf$square(tf$subtract(R, X)))
reg <- lambda * tf$reduce_sum(tf$sqrt(tf$square(X)))
L <- tf$add(L, reg)
})
# decaying learning rate
alpha <- 0.1 / i
# gradient of the loss with respect to X
dL_dX <- t$gradient(L, X)
# gradient step
X$assign_sub(alpha * dL_dX)
}
X
<tf.Variable 'Variable:0' shape=(10, 10) dtype=float64, numpy=
array([[ 5.02293261, 0.30973319, 0.99186468, 3.41146814, 0.31291584,
1.34556413, 3.44241753, 2.49430854, 1.22441611, 2.15290765],
[ 3.48498694, 3.89164399, 0.18661404, 0.32108998, 0.76755848,
2.19275776, 0.19756001, 1.83416693, 3.15623643, 0.60512031],
[ 3.20935864, 3.6486702 , 3.15430177, 4.32273572, -0.65953477,
1.21454801, -0.25736091, 1.65204631, 1.75664996, -0.85084353],
[ 1.99710618, -0.58446189, 1.84274513, 2.23608402, 0.58083008,
4.89535097, 0.82008945, 1.96283637, 4.25830029, 2.60318369],
[ 0.4709209 , 3.21099017, 3.75649051, 1.31540595, 0.24795211,
1.23813381, 0.29598465, -0.19368636, 1.05168273, 2.51631485],
[ 2.28950946, 1.78471934, -0.30089427, 0.22914342, 0.57931684,
1.71634635, 2.10374215, -0.60281344, 1.23714651, 0.53926587],
[ 3.21948567, 2.61709118, 1.73669846, 0.18467006, -0.82253731,
0.89098891, -0.31231532, 1.74370668, -0.19812191, -0.79820555],
[-0.33663196, 0.16062981, 3.35775843, 0.47308834, 1.01396689,
3.37754752, 2.47726712, -0.49023551, 1.11100834, 0.79643726],
[ 0.78515 , 3.17968732, -0.6235009 , 2.72424966, -0.22137149,
-0.7216236 , 1.85288206, 2.83845993, -0.50285423, 3.79311637],
[ 1.310561 , 1.96721478, 0.71569936, 1.20825038, 0.59959347,
3.1721326 , 2.63536597, 3.76609742, -0.48195342, 1.0897197 ]])>
R
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 3 0 1 2 0 1 2 2 1 1
[2,] 2 2 0 0 1 1 0 1 2 0
[3,] 2 2 2 2 0 1 0 1 1 0
[4,] 1 0 1 1 1 3 1 1 2 2
[5,] 0 2 2 1 1 1 0 0 1 2
[6,] 1 1 0 0 1 1 2 0 1 1
[7,] 2 2 1 0 0 1 0 1 0 0
[8,] 0 0 2 0 1 2 2 0 1 1
[9,] 1 2 0 2 0 0 1 3 0 2
[10,] 1 1 1 1 0 2 2 2 0 0
library(ggplot2)
# Build a long-format error table, one row per cell of the residual X - R.
# (The post does not show this step; the columns assumed here follow the
# aesthetics used below: `from` = row index, `to` = column index,
# `error` = residual.)
E <- expand.grid(from = 1:10, to = 1:10)
E$error <- as.vector(as.matrix(X) - R)
E$to <- factor(E$to)
ggplot(data = E) +
geom_density(aes(x = error, fill = to)) +
facet_wrap(~to) +
theme_classic() +
labs(fill = "Destination Variable") +
xlab("Error") +
ylab("Probability") +
ggtitle("Reconstruction Matrix Error", subtitle = paste0("Total Error: ", round(sum((as.matrix(X) - R)^2), 2), " units"))

ggplot(E, aes(x = to, y = from, fill = error)) +
geom_tile() +
scale_fill_gradient(low = "lightgreen", high = "red") +
labs(fill = "Reconstruction Error") +
theme_classic() +
ggtitle("Reconstruction Matrix Error", subtitle = paste0("Total Error: ", round(sum((as.matrix(X) - R)^2), 2), " units"))

Pretty remarkable. Gradient descent learned to match our R data matrix, minimizing squared error with a penalty that discourages large, noisy entries.
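Because this particular objective separates per entry, we can even cross-check the descent against an exact answer: minimizing (r - x)^2 + lambda*|x| for a single cell gives the soft-threshold solution x* = sign(r) * max(|r| - lambda/2, 0). A base-R sketch (my own addition, not part of the original tutorial):

```r
# Per-cell minimizer of (r - x)^2 + lambda * |x| (soft thresholding)
lambda <- 0.2
soft_threshold <- function(r) sign(r) * pmax(abs(r) - lambda / 2, 0)
soft_threshold(3)  # 2.9: shrunk slightly toward zero by the penalty
soft_threshold(0)  # 0
```

So each converged entry of X should sit just inside the corresponding entry of R, which is exactly the kind of shrinkage the L1 penalty is meant to produce.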
References:
https://tensorflow.rstudio.com/tutorials/advanced/customization/autodiff/
https://keras.rstudio.com/articles/eager_guide.html