# Difference between revisions of "GradientLess Descent"

(→Critiques) |
|||

(41 intermediate revisions by 10 users not shown) | |||

Line 1: | Line 1: | ||

+ | == Presented By == | ||

+ | Jose Avilez | ||

==Introduction== | ==Introduction== | ||

+ | |||

+ | In this presentation, we are interested in minimising a smooth convex function without ever computing its derivatives. | ||

===Motivation and Set-up=== | ===Motivation and Set-up=== | ||

Line 25: | Line 29: | ||

f(y) \geq f(x) + \left\langle \nabla f(x), y-x\right\rangle + \frac{\alpha}{2} ||y - x||^2 | f(y) \geq f(x) + \left\langle \nabla f(x), y-x\right\rangle + \frac{\alpha}{2} ||y - x||^2 | ||

\end{align*} | \end{align*} | ||

− | + | <math display="inline"> \forall x,y \in K </math>. It is called <math display="inline">\beta</math>-smooth for <math display="inline">\beta > 0</math> if | |

\begin{align*} | \begin{align*} | ||

f(y) \leq f(x) + \left\langle \nabla f(x), y-x\right\rangle + \frac{\beta}{2} || y - x||^2 | f(y) \leq f(x) + \left\langle \nabla f(x), y-x\right\rangle + \frac{\beta}{2} || y - x||^2 | ||

\end{align*} | \end{align*} | ||

− | + | <math display="inline"> \forall x,y \in K </math> | |

− | We remark that if <math display="inline">f</math> is twice continuously differentiable, this is simply equivalent to the eigenvalues of the Hessian matrix <math display="inline"> | + | We remark that if <math display="inline">f</math> is twice continuously differentiable, then this is simply equivalent to the eigenvalues of the Hessian matrix <math display="inline">\textbf{H}(f)</math> being bounded between <math display="inline">\alpha</math> and <math display="inline">\beta</math>. Further intuition can be gained from the image below, showing how such a function can be contained within quadratic bounds. |

[[File:ConvexSmooth.PNG|frame|Relationship between convexity and smoothness.]] | [[File:ConvexSmooth.PNG|frame|Relationship between convexity and smoothness.]] | ||

In convex analysis, one usually says that a function has condition number <math display="inline">Q</math> if it is both <math display="inline">\alpha</math>-strongly convex, and <math display="inline">\beta</math>-smooth, and <math display="inline">\frac{\beta}{\alpha} \leq Q</math>. | In convex analysis, one usually says that a function has condition number <math display="inline">Q</math> if it is both <math display="inline">\alpha</math>-strongly convex, and <math display="inline">\beta</math>-smooth, and <math display="inline">\frac{\beta}{\alpha} \leq Q</math>. | ||

− | The authors of this paper consider the more general case where <math display="inline">f</math> is a monotone transformation of | + | The authors of this paper consider the more general case where <math display="inline">f</math> is a monotone transformation of an <math display="inline">\alpha</math>-strongly convex and <math display="inline">\beta</math>-smooth function; for simplicity and transparency, we shall not consider these extensions here, but shall note that their proofs are quite elementary. |

− | |||

===Zeroth-Order Optimisation=== | ===Zeroth-Order Optimisation=== | ||

− | In zeroth-order optimisation, we are interested in minimising a function without computing its derivatives. This is important in many practical applications in which derivatives may not be available or they may be difficult to compute, such as: | + | In zeroth-order optimisation, we are interested in minimising a function without computing its derivatives. This is important in many practical applications in which derivatives may not be available, or they may be difficult to compute, such as: |

* Combinatorial (i.e. discrete) optimisation | * Combinatorial (i.e. discrete) optimisation | ||

Line 51: | Line 54: | ||

Curiously, a large amount of this approach focuses on approximating gradients and then using first-order optimisation algorithms. | Curiously, a large amount of this approach focuses on approximating gradients and then using first-order optimisation algorithms. | ||

− | This paper presents a purely gradientless algorithm, proposes a geometric approach to analyse the algorithm, and proves a <math display="inline">O( k Q \log (n / \epsilon ))</math> convergence bound. | + | This paper presents a purely gradientless algorithm, proposes a geometric approach to analyse the algorithm, and proves a <math display="inline">O( k Q \log (n / \epsilon ))</math> convergence bound. Here the latent dimension is <math display="inline"> k </math> and <math display="inline"> k < n </math>, where <math display="inline"> n </math> is the input dimension. |

==GradientLess Descent Algorithm== | ==GradientLess Descent Algorithm== | ||

Line 59: | Line 62: | ||

[[File:GLD1.PNG|frame|Gradientless Descent with Binary Search.]] | [[File:GLD1.PNG|frame|Gradientless Descent with Binary Search.]] | ||

− | Observe that at each step, we perform binary search over several concentric circles and randomly sample points, in the hopes that if we take a small step in a random direction this will reduce the value of the objective function. | + | Observe that at each step, we perform a binary search over several concentric circles and randomly sample points, in the hopes that if we take a small step in a random direction this will reduce the value of the objective function. |

− | ==Proof of correctness== | + | ===Proof of correctness=== |

The correctness of this algorithm hinges on two observations. The first one is about the volume of the intersection of high-dimensional balls; we call this intersection a hyperspherical cap. | The correctness of this algorithm hinges on two observations. The first one is about the volume of the intersection of high-dimensional balls; we call this intersection a hyperspherical cap. | ||

Line 68: | Line 71: | ||

Let <math display="inline">B_1, B_2 \subseteq \mathbb{R}^n</math> be balls of radii <math display="inline">r_1, r_2</math>. Let <math display="inline">\ell</math> be the distance between the centres. If <math display="inline">r_1 \in \left[ \frac{\ell}{2 \sqrt{n}} , \frac{\ell}{\sqrt{n}} \right]</math> and <math display="inline">r_2 \geq \ell - \frac{\ell}{4n}</math>, then <math display="inline">\lambda (B_1 \cap B_2) \geq c_n \lambda (B_1)</math>, where <math display="inline">c_n \geq \frac{1}{4}</math>. | Let <math display="inline">B_1, B_2 \subseteq \mathbb{R}^n</math> be balls of radii <math display="inline">r_1, r_2</math>. Let <math display="inline">\ell</math> be the distance between the centres. If <math display="inline">r_1 \in \left[ \frac{\ell}{2 \sqrt{n}} , \frac{\ell}{\sqrt{n}} \right]</math> and <math display="inline">r_2 \geq \ell - \frac{\ell}{4n}</math>, then <math display="inline">\lambda (B_1 \cap B_2) \geq c_n \lambda (B_1)</math>, where <math display="inline">c_n \geq \frac{1}{4}</math>. | ||

+ | |||

+ | |||

+ | Using this theorem about random searches in high dimensions, we can prove the correctness of our algorithm. | ||

+ | |||

+ | '''Theorem 2''' | ||

+ | |||

+ | <math display="inline"> \forall x \in K</math> s.t. <math display="inline">\frac{3}{5Q} ||x - x^*|| \in [C_1, C_2]</math>, we can find integers <math display="inline">0 \leq k_1, k_2 < \log \frac{C_2}{C_1}</math> such that if <math display="inline">r = 2^{k_1}C_1</math> or <math display="inline">r = 2^{-k_2}C_2</math>, then a sample <math display="inline">y</math> from the uniform distribution on <math display="inline">B_x = B\left( x, \frac{r}{\sqrt{n}} \right) </math> satisfies | ||

+ | \begin{align*} | ||

+ | f(y) - f(x^*) \leq (f(x) - f(x^*)) \left( 1- \frac{1}{5nQ} \right) | ||

+ | \end{align*} | ||

+ | with probability at least <math display="inline">\frac{1}{4}</math>. | ||

+ | |||

+ | |||

+ | Notice how the second theorem implies that with a quarter probability, <math display="inline">f(y)</math> is closer to the optimum,<math display="inline"> f(x^*), </math> than <math display="inline">f(x)</math> is. | ||

+ | |||

+ | For proof of these theorems, please watch my talk. | ||

+ | |||

+ | [[File: GLD2.PNG|frame| Gradientless Descent with Fast Binary Search.]] | ||

+ | |||

+ | In the current form of GradientLess Descent Algorithm presented here, the lower and upper limits of the search radius i.e. <math display="inline">[r, R]</math> remain unchanged for the entire run of the algorithm. As proven by the correctness of this algorithm, this does ensure convergence but this version of the algorithm does not take advantage of the upper bound of the condition number <math display="inline">Q</math> and therefore, has an extra factor of <math display="inline">\log \frac{1}{\epsilon}</math> in its overall cost. | ||

+ | |||

+ | A variation of this algorithm termed '''Gradientless Descent with Fast Binary Search (GLD-Fast)''', eliminates this additional factor from the overall cost through reduction in the range of the binary search by shrinking <math display="inline">R</math> in half after every <math display="inline">H</math> iterations (where <math display="inline">H</math> is determined by <math display="inline">Q</math>). | ||

+ | |||

+ | |||

+ | |||

+ | For determining K and H, use the following equations: | ||

+ | |||

+ | K = log(4√Q) | ||

+ | |||

+ | H = nQ log(Q) | ||

+ | |||

+ | ==Results== | ||

+ | |||

+ | We compare the GradientLess Descent algorithm to a benchmark established by the Augmented Randomised Search algorithm proposed in 2011. | ||

+ | |||

+ | [[File:GLDBeatsARS.PNG|1000px|]] | ||

+ | |||

+ | For this comparison, we defined the function <math display="inline">f(x) = \frac{1}{2} x^T H x </math> where <math display="inline">H</math> is a diagonal matrix with eigenvalues linearly interpolating the interval <math display="inline">[\alpha , \beta]</math>. We observe that in most scenarios, GradientLess Descent beats the benchmark. | ||

+ | |||

+ | ==Conclusion== | ||

+ | This research paper has analysed a randomised algorithm where a search direction is sampled from the standard Gaussian. This is a direct search-based algorithm, which yields the convergence rate that is polylogarithmically dependent on dimensionality for any monotone transform of a smooth and strongly convex objective with a low-dimensional structure. In this algorithm, the step-size is considered as an approximate line to search all the possible values of a grid spanning an interval with uniform spacing on a log-scale. They show a geometric decrease of the function value regret, up to a constant defined by the minimum step-size, on strongly convex functions with Lipschitz smooth gradient. | ||

+ | |||

+ | ==Critiques== | ||

+ | |||

+ | 1- Although the theoretical guarantees presented in the paper are interesting, this is not clear how this algorithm is applicable in practice. This is because this paper assumes we do not have access to the objective function, and they are only able to use function evaluations. Besides, there is a strong assumption that the function is smooth and strongly convex. Considering this, my main concern is how we can make sure the objective function is smooth and strongly convex while we do not have access to it explicitly? (if we have explicit access to the function and this is smooth and strongly convex, why shouldn't we use gradient-based methods?!) Further, what happens if the objective function violates the smooth and strongly convex condition? Can we still employ this algorithm? | ||

+ | |||

+ | In response to the above comments: | ||

+ | |||

+ | This algorithm has many practical applications especially in the field of reinforcement learning. A major concept in reinforcement learning is the concept of a reward function (which we either wish to minimise or maximise). In particular, the reward function may be hidden behind a black box. For example, consider a "theoretical" slot machine where we only see how much money we get if we win, but do not know how the amount is determined. It is true that in general, these objective functions may not be smooth or strongly convex, but one is able to either make certain assumptions about the reward function or relax certain conditions about the state of the world in order to create a reward function that is smooth or convex. Additionally, certain gradients may not have an analytical form, in which case numerical calculation for gradients may be computationally expensive. This method allows a way to bypass the gradient computations altogether! | ||

+ | |||

+ | To back the response: | ||

+ | They have demonstrated that their algorithm can be successfully applied to '''MuJoCo''' benchmarks, where the objective function is '''not''' strongly convex and smooth. | ||

+ | - providing more graphical representation in proving lemmas, would make the paper more fathomable. | ||

+ | |||

+ | 2 - Currently only a single synthetic function is observed and experimented within the paper ( It is also under a single monotone exponential transformation ). Using a different function we would be sure the algorithm works well in practice. | ||

+ | |||

+ | ==Bibliography== | ||

+ | |||

+ | 1. Daniel Golovin et al. Gradientless descent: High-dimensional zeroth-order optimisation". In: arXiv preprint arXiv:1911.06317 (2019). | ||

+ | |||

+ | 2. Shengqiao Li. Concise formulas for the area and volume of a hyperspherical cap". In: Asian Journal of Mathematics and Statistics 4.1 (2011), pp. 66-70. | ||

+ | |||

+ | 3. Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimisation of convex functions. In: Foundations of Computational Mathematics 17.2 (2017), pp. 527-566. | ||

+ | |||

+ | 4. R Tyrrell Rockafellar. Convex analysis. 28. Princeton university press, 1970. |

## Latest revision as of 21:06, 9 December 2020

## Contents

## Presented By

Jose Avilez

## Introduction

In this presentation, we are interested in minimising a smooth convex function without ever computing its derivatives.

### Motivation and Set-up

A general optimisation question can be formulated by asking to minimise an objective function [math]f : \mathbb{R}^n \to \mathbb{R}[/math], which means finding: \begin{align*} x^* = \mathrm{argmin}_{x \in \mathbb{R}^n} f(x) \end{align*}

Depending on the nature of [math]f[/math], different settings may be considered:

- Convex vs non-convex objective functions;
- Differentiable vs non-differentiable objective functions;
- Allowed function or gradient computations;
- Noisy/Stochastic oracle access.

For the purpose of this paper, we consider convex smooth objective noiseless functions, where we have access to function computations but not gradient computations. This class of functions is quite common in practice; for instance, they make special appearances in the reinforcement learning literature.

To be even more precise, in our context we let [math]K \subseteq \mathbb{R}^n[/math] be compact [math]f : K \to \mathbb{R}[/math] be [math]\beta[/math]-smooth and [math]\alpha[/math]-strongly convex.

**Definition 1**

A convex continuously differentiable function [math]f : K \to \mathbb{R}[/math] is [math]\alpha[/math]-strongly convex for [math]\alpha \gt 0[/math] if \begin{align*} f(y) \geq f(x) + \left\langle \nabla f(x), y-x\right\rangle + \frac{\alpha}{2} ||y - x||^2 \end{align*} [math] \forall x,y \in K [/math]. It is called [math]\beta[/math]-smooth for [math]\beta \gt 0[/math] if \begin{align*} f(y) \leq f(x) + \left\langle \nabla f(x), y-x\right\rangle + \frac{\beta}{2} || y - x||^2 \end{align*} [math] \forall x,y \in K [/math]

We remark that if [math]f[/math] is twice continuously differentiable, then this is simply equivalent to the eigenvalues of the Hessian matrix [math]\textbf{H}(f)[/math] being bounded between [math]\alpha[/math] and [math]\beta[/math]. Further intuition can be gained from the image below, showing how such a function can be contained within quadratic bounds.

In convex analysis, one usually says that a function has condition number [math]Q[/math] if it is both [math]\alpha[/math]-strongly convex, and [math]\beta[/math]-smooth, and [math]\frac{\beta}{\alpha} \leq Q[/math]. The authors of this paper consider the more general case where [math]f[/math] is a monotone transformation of an [math]\alpha[/math]-strongly convex and [math]\beta[/math]-smooth function; for simplicity and transparency, we shall not consider these extensions here, but shall note that their proofs are quite elementary.

### Zeroth-Order Optimisation

In zeroth-order optimisation, we are interested in minimising a function without computing its derivatives. This is important in many practical applications in which derivatives may not be available, or they may be difficult to compute, such as:

- Combinatorial (i.e. discrete) optimisation
- Instances of non-analytic loss functions (e.g. hyperparameter tuning)
- Adversarial attacks
- Reinforcement learning

Curiously, a large amount of this approach focuses on approximating gradients and then using first-order optimisation algorithms.

This paper presents a purely gradientless algorithm, proposes a geometric approach to analyse the algorithm, and proves a [math]O( k Q \log (n / \epsilon ))[/math] convergence bound. Here the latent dimension is [math] k [/math] and [math] k \lt n [/math], where [math] n [/math] is the input dimension.

## GradientLess Descent Algorithm

The proposed algorithm is given in the picture below.

Observe that at each step, we perform a binary search over several concentric circles and randomly sample points, in the hopes that if we take a small step in a random direction this will reduce the value of the objective function.

### Proof of correctness

The correctness of this algorithm hinges on two observations. The first one is about the volume of the intersection of high-dimensional balls; we call this intersection a hyperspherical cap.

**Theorem 1**

Let [math]B_1, B_2 \subseteq \mathbb{R}^n[/math] be balls of radii [math]r_1, r_2[/math]. Let [math]\ell[/math] be the distance between the centres. If [math]r_1 \in \left[ \frac{\ell}{2 \sqrt{n}} , \frac{\ell}{\sqrt{n}} \right][/math] and [math]r_2 \geq \ell - \frac{\ell}{4n}[/math], then [math]\lambda (B_1 \cap B_2) \geq c_n \lambda (B_1)[/math], where [math]c_n \geq \frac{1}{4}[/math].

Using this theorem about random searches in high dimensions, we can prove the correctness of our algorithm.

**Theorem 2**

[math] \forall x \in K[/math] s.t. [math]\frac{3}{5Q} ||x - x^*|| \in [C_1, C_2][/math], we can find integers [math]0 \leq k_1, k_2 \lt \log \frac{C_2}{C_1}[/math] such that if [math]r = 2^{k_1}C_1[/math] or [math]r = 2^{-k_2}C_2[/math], then a sample [math]y[/math] from the uniform distribution on [math]B_x = B\left( x, \frac{r}{\sqrt{n}} \right) [/math] satisfies \begin{align*} f(y) - f(x^*) \leq (f(x) - f(x^*)) \left( 1- \frac{1}{5nQ} \right) \end{align*} with probability at least [math]\frac{1}{4}[/math].

Notice how the second theorem implies that with a quarter probability, [math]f(y)[/math] is closer to the optimum,[math] f(x^*), [/math] than [math]f(x)[/math] is.

For proof of these theorems, please watch my talk.

In the current form of GradientLess Descent Algorithm presented here, the lower and upper limits of the search radius i.e. [math][r, R][/math] remain unchanged for the entire run of the algorithm. As proven by the correctness of this algorithm, this does ensure convergence but this version of the algorithm does not take advantage of the upper bound of the condition number [math]Q[/math] and therefore, has an extra factor of [math]\log \frac{1}{\epsilon}[/math] in its overall cost.

A variation of this algorithm termed **Gradientless Descent with Fast Binary Search (GLD-Fast)**, eliminates this additional factor from the overall cost through reduction in the range of the binary search by shrinking [math]R[/math] in half after every [math]H[/math] iterations (where [math]H[/math] is determined by [math]Q[/math]).

For determining K and H, use the following equations:

K = log(4√Q)

H = nQ log(Q)

## Results

We compare the GradientLess Descent algorithm to a benchmark established by the Augmented Randomised Search algorithm proposed in 2011.

For this comparison, we defined the function [math]f(x) = \frac{1}{2} x^T H x [/math] where [math]H[/math] is a diagonal matrix with eigenvalues linearly interpolating the interval [math][\alpha , \beta][/math]. We observe that in most scenarios, GradientLess Descent beats the benchmark.

## Conclusion

This research paper has analysed a randomised algorithm where a search direction is sampled from the standard Gaussian. This is a direct search-based algorithm, which yields the convergence rate that is polylogarithmically dependent on dimensionality for any monotone transform of a smooth and strongly convex objective with a low-dimensional structure. In this algorithm, the step-size is considered as an approximate line to search all the possible values of a grid spanning an interval with uniform spacing on a log-scale. They show a geometric decrease of the function value regret, up to a constant defined by the minimum step-size, on strongly convex functions with Lipschitz smooth gradient.

## Critiques

1- Although the theoretical guarantees presented in the paper are interesting, this is not clear how this algorithm is applicable in practice. This is because this paper assumes we do not have access to the objective function, and they are only able to use function evaluations. Besides, there is a strong assumption that the function is smooth and strongly convex. Considering this, my main concern is how we can make sure the objective function is smooth and strongly convex while we do not have access to it explicitly? (if we have explicit access to the function and this is smooth and strongly convex, why shouldn't we use gradient-based methods?!) Further, what happens if the objective function violates the smooth and strongly convex condition? Can we still employ this algorithm?

In response to the above comments:

This algorithm has many practical applications especially in the field of reinforcement learning. A major concept in reinforcement learning is the concept of a reward function (which we either wish to minimise or maximise). In particular, the reward function may be hidden behind a black box. For example, consider a "theoretical" slot machine where we only see how much money we get if we win, but do not know how the amount is determined. It is true that in general, these objective functions may not be smooth or strongly convex, but one is able to either make certain assumptions about the reward function or relax certain conditions about the state of the world in order to create a reward function that is smooth or convex. Additionally, certain gradients may not have an analytical form, in which case numerical calculation for gradients may be computationally expensive. This method allows a way to bypass the gradient computations altogether!

To back the response:
They have demonstrated that their algorithm can be successfully applied to **MuJoCo** benchmarks, where the objective function is **not** strongly convex and smooth.
- providing more graphical representation in proving lemmas, would make the paper more fathomable.

2 - Currently only a single synthetic function is observed and experimented within the paper ( It is also under a single monotone exponential transformation ). Using a different function we would be sure the algorithm works well in practice.

## Bibliography

1. Daniel Golovin et al. Gradientless descent: High-dimensional zeroth-order optimisation". In: arXiv preprint arXiv:1911.06317 (2019).

2. Shengqiao Li. Concise formulas for the area and volume of a hyperspherical cap". In: Asian Journal of Mathematics and Statistics 4.1 (2011), pp. 66-70.

3. Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimisation of convex functions. In: Foundations of Computational Mathematics 17.2 (2017), pp. 527-566.

4. R Tyrrell Rockafellar. Convex analysis. 28. Princeton university press, 1970.