On Convexity, Smoothness, and Strong Convexity

This note is meant as a single point of reference for the facts I find useful about each of the three function classes. Many of the properties leveraged when analyzing the convergence rates of iterative optimizers on these classes resemble each other, and I think it’s worth listing them in one place so that I (and the reader) can understand:

What the general “shape” of a property is
How it changes as we move between the classes
Optionally, what it buys us for rate analysis.
I’m hoping this note will also help build intuition about which properties are worth taking as starting definitions of the function class.

The Starting Point: First-Order Approximation Error

As with most things in optimization, we open with the first-order Taylor approximation to a function. Here, we explicitly compute the error between the true function value at some $y$ and the first-order approximation of it built at some $x$ .

D_{f} (y, x) ≜ f (y) - f (x) - ⟨ \nabla f (x), y - x ⟩
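As a quick worked example of the definition, take the (assumed, purely illustrative) function $f (x) = \frac{1}{2} ∣∣ x ∣∣^{2}$ , whose gradient is $x$ . The deviation then works out to exactly half the squared distance between the two points:

```latex
\begin{align*}
f(x) &= \tfrac{1}{2} \|x\|^{2}, \qquad \nabla f(x) = x \\
D_{f}(y, x) &= \tfrac{1}{2}\|y\|^{2} - \tfrac{1}{2}\|x\|^{2} - \langle x, y - x \rangle \\
&= \tfrac{1}{2}\|y\|^{2} - \langle x, y \rangle + \tfrac{1}{2}\|x\|^{2} \\
&= \tfrac{1}{2}\|y - x\|^{2}
\end{align*}
```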

Each of the three classes makes a claim about $D_{f} (y, x)$ :

Convexity - $D_{f} (y, x) \geq 0$ : The tangent plane always lies below the function, which should sit well with our image of convex functions.
$μ -$ strongly convex - $D_{f} (y, x) \geq \frac{μ}{2} ∣∣ y - x ∣∣^{2}$ : The gap between the tangent plane and the function grows at least quadratically, a stronger lower bound than the one convexity gives. This should evoke the image of a function that grows much more quickly than a merely convex one.
$L -$ smooth - $D_{f} (y, x) \leq \frac{L}{2} ∣∣ y - x ∣∣^{2}$ : While convexity lower-bounds the error of the Taylor approximation (telling you how bad it must be), $L -$ smoothness upper-bounds it (telling you how good it is guaranteed to be), since the error is at most some quadratic (it’s still quadratic, but it’s something).
Thus, for an $L -$ smooth, $μ -$ strongly-convex function, we have something like the following:

f (x) + \nabla f (x)^{T} (y - x) + \frac{μ}{2} ∣∣ y - x ∣ ∣^{2} \leq f (y) \leq f (x) + \nabla f (x)^{T} (y - x) + \frac{L}{2} ∣∣ y - x ∣ ∣^{2}
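A small numerical sanity check can make the sandwich concrete. The sketch below assumes a hypothetical diagonal quadratic $f (x) = \frac{1}{2} \sum_{i} a_{i} x_{i}^{2}$ with $a = (1, 10)$ , for which $μ = 1$ and $L = 10$ ; the function, constants, and helper names are illustrative choices, not part of the note itself.

```python
import random

# Hypothetical test function: f(x) = 0.5 * sum(a_i * x_i^2) with a = (1, 10).
# Its Hessian is diag(a), so mu = min(a) = 1 and L = max(a) = 10.
A = (1.0, 10.0)
MU, L = min(A), max(A)

def f(x):
    return 0.5 * sum(a * xi * xi for a, xi in zip(A, x))

def grad_f(x):
    return [a * xi for a, xi in zip(A, x)]

def taylor_gap(y, x):
    """D_f(y, x) = f(y) - f(x) - <grad f(x), y - x>."""
    inner = sum(g * (yi - xi) for g, yi, xi in zip(grad_f(x), y, x))
    return f(y) - f(x) - inner

def sandwich_holds(y, x, tol=1e-9):
    """Check (mu/2)|y-x|^2 <= D_f(y, x) <= (L/2)|y-x|^2 up to rounding."""
    sq_dist = sum((yi - xi) ** 2 for yi, xi in zip(y, x))
    gap = taylor_gap(y, x)
    return 0.5 * MU * sq_dist - tol <= gap <= 0.5 * L * sq_dist + tol

rng = random.Random(0)
points = [[rng.uniform(-5, 5) for _ in A] for _ in range(200)]
all_ok = all(sandwich_holds(y, x) for x in points for y in points)
print(all_ok)
```

A random check like this cannot prove the bounds, but a single failing pair would show that a claimed $μ$ or $L$ is wrong.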

Hessian Properties

Class	Hessian Condition
Convex	$\nabla^{2} f (x) ⪰ 0$
$μ -$ strongly convex	$\nabla^{2} f (x) ⪰ μ I$
$L -$ smooth	$\nabla^{2} f (x) ⪯ L I$
Convex + $L -$ smooth	$0 ⪯ \nabla^{2} f (x) ⪯ L I$
$μ -$ strongly convex + $L -$ smooth	$μ I ⪯ \nabla^{2} f (x) ⪯ L I$
Proofs for these properties starting from the gradient-geometry definitions can be found [[hessian-1|here]].
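In one dimension the Hessian is just $f'' (t)$ , so the conditions in the table reduce to $μ \leq f'' (t) \leq L$ . The sketch below estimates both constants on a grid for an assumed test function $f (t) = t^{2} + \log (1 + e^{t})$ ; the function and the grid are illustrative choices, and a grid scan only estimates the true constants.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def second_derivative(t):
    # f(t) = t^2 + log(1 + e^t)  =>  f''(t) = 2 + sigmoid(t) * (1 - sigmoid(t))
    s = sigmoid(t)
    return 2.0 + s * (1.0 - s)

# Scan f'' on a grid over [-20, 20]; the smallest and largest values
# estimate the strong-convexity and smoothness constants respectively.
grid = [i / 10.0 for i in range(-200, 201)]
curvatures = [second_derivative(t) for t in grid]
mu_est, L_est = min(curvatures), max(curvatures)
print(mu_est, L_est)
```

For this function the curvature peaks at $t = 0$ and decays towards $2$ in both tails, so it is both strongly convex and smooth, with $μ = 2$ and $L = 2.25$ .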

Gradient Geometry

For all the gradient-geometry properties, we work with the following inner product:

⟨ y - x, \nabla f (y) - \nabla f (x)⟩

Each of the following properties can be arrived at by starting from the Taylor Deviation definitions.
Convexity:

f (y) - f (x) \geq ⟨ \nabla f (x), y - x ⟩

Swapping $x$ and $y$ , we get

f (x) - f (y) \geq ⟨ \nabla f (y), x - y ⟩

Adding those two gives us $⟨ \nabla f (x) - \nabla f (y), y - x ⟩ \leq 0$ or, equivalently, $⟨ y - x, \nabla f (y) - \nabla f (x)⟩ \geq 0$ .

$μ$ -Strong Convexity:

f (y) - f (x) \geq ⟨ \nabla f (x), y - x ⟩ + \frac{μ}{2} ∣∣ y - x ∣ ∣^{2}

Swapping $x$ and $y$ , we get

f (x) - f (y) \geq ⟨ \nabla f (y), x - y ⟩ + \frac{μ}{2} ∣∣ y - x ∣ ∣^{2}

Adding and simplifying gives us $⟨ \nabla f (x) - \nabla f (y), x - y ⟩ \geq μ ∣∣ x - y ∣ ∣^{2}$

$L$ -Smoothness:
The same procedure as in the two cases above, applied to the $L -$ smoothness deviation bound, gives us $⟨ \nabla f (x) - \nabla f (y), x - y ⟩ \leq L ∣∣ x - y ∣∣^{2}$ .
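For completeness, the $L -$ smoothness steps can be spelled out as follows. This simply mirrors the two derivations above with the inequality flipped, starting from the upper bound on the Taylor deviation:

```latex
\begin{align*}
f(y) - f(x) &\leq \langle \nabla f(x), y - x \rangle + \tfrac{L}{2}\|y - x\|^{2} \\
f(x) - f(y) &\leq \langle \nabla f(y), x - y \rangle + \tfrac{L}{2}\|y - x\|^{2} \\
0 &\leq \langle \nabla f(x) - \nabla f(y), y - x \rangle + L\|y - x\|^{2}
  && \text{(sum of the two lines)} \\
\langle \nabla f(x) - \nabla f(y), x - y \rangle &\leq L\|x - y\|^{2}
\end{align*}
```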

Gradient Growth Bounds

For each of these, we can start with the gradient-geometry properties and make statements about how the norm of the gradient difference grows with the distance between the inputs:

$μ -$ Strong Convexity:
A simple application of Cauchy-Schwarz gives us

∣∣\nabla f (x) - \nabla f (y) ∣∣ \cdot ∣∣ x - y ∣∣ \geq ⟨ \nabla f (x) - \nabla f (y), x - y ⟩ \geq μ ∣∣ x - y ∣ ∣^{2}

And thus, we have that $∣∣\nabla f (y) - \nabla f (x) ∣∣ \geq μ ∣∣ y - x ∣∣$ .
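The chain of inequalities for the strongly convex case can be checked numerically. The sketch below assumes a hypothetical separable test function $f (x) = \sum_{i} (x_{i}^{2} + \log (1 + e^{x_{i}}))$ , which is $2$ -strongly convex; the function, the sampling range, and the helper names are illustrative assumptions.

```python
import math
import random

MU = 2.0  # strong-convexity constant of the test function below

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def grad_f(x):
    # f(x) = sum_i (x_i^2 + log(1 + e^{x_i}))  =>  df/dx_i = 2 x_i + sigmoid(x_i)
    return [2.0 * xi + sigmoid(xi) for xi in x]

def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))

def check_pair(x, y, tol=1e-9):
    dx = [xi - yi for xi, yi in zip(x, y)]
    dg = [gx - gy for gx, gy in zip(grad_f(x), grad_f(y))]
    inner = sum(a * b for a, b in zip(dg, dx))
    # Gradient geometry: <grad f(x) - grad f(y), x - y> >= mu |x - y|^2
    geometry_ok = inner >= MU * norm(dx) ** 2 - tol
    # Cauchy-Schwarz: |dg| |dx| >= <dg, dx>
    cauchy_schwarz_ok = norm(dg) * norm(dx) >= inner - tol
    # Resulting growth bound: |grad f(x) - grad f(y)| >= mu |x - y|
    growth_ok = norm(dg) >= MU * norm(dx) - tol
    return geometry_ok and cauchy_schwarz_ok and growth_ok

rng = random.Random(0)
points = [[rng.uniform(-5, 5) for _ in range(3)] for _ in range(100)]
all_ok = all(check_pair(x, y) for x in points for y in points)
print(all_ok)
```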

Unfortunately, for the convex case, the same procedure only tells us that $∣∣\nabla f (x) - \nabla f (y) ∣∣ \cdot ∣∣ x - y ∣∣ \geq 0$ , which is trivially true of any product of norms.

Even worse, for the $L -$ smooth case there is no useful bound to be had this way: a (legal) application of Cauchy-Schwarz to the gradient-geometry inequality gives us $- ∣∣\nabla f (x) - \nabla f (y) ∣∣ \leq L ∣∣ x - y ∣∣$ , which is weaker than the trivially true fact that a norm is $\geq 0$ .
Clearly, we need some other property to use as the starting point for proving a meaningful gradient growth bound for smooth functions.

We’ll leverage the fact that the Taylor deviation bound for smooth functions is actually a two-sided bound:¹

- \frac{L}{2} ∣∣ y - x ∣ ∣^{2} \leq D_{f} (y, x) \leq \frac{L}{2} ∣∣ y - x ∣ ∣^{2}

Or equivalently, that $∣ v^{T} \nabla^{2} f (x) v ∣ \leq L ∣∣ v ∣ ∣^{2}$ for any $v$ .²
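To see why the lower half of the two-sided bound matters, it helps to look at a smooth function that is not convex. The sketch below assumes $f (t) = \sin (t)$ as an illustrative example, for which $∣ f'' (t) ∣ = ∣ \sin (t) ∣ \leq 1$ , so $L = 1$ ; it checks that the Taylor deviation stays within $\pm \frac{L}{2} (y - x)^{2}$ and that it does go negative for some pairs.

```python
import math
import random

L = 1.0  # |f''(t)| = |sin(t)| <= 1 for f(t) = sin(t)

def taylor_gap(y, x):
    """D_f(y, x) for f = sin, whose derivative is cos."""
    return math.sin(y) - math.sin(x) - math.cos(x) * (y - x)

rng = random.Random(0)
pairs = [(rng.uniform(-10, 10), rng.uniform(-10, 10)) for _ in range(10_000)]

# Two-sided bound: |D_f(y, x)| <= (L/2) (y - x)^2, with a small rounding tolerance.
two_sided_ok = all(abs(taylor_gap(y, x)) <= 0.5 * L * (y - x) ** 2 + 1e-12
                   for x, y in pairs)
# Without convexity the deviation can be negative, so the lower bound is not vacuous.
some_gap_negative = any(taylor_gap(y, x) < 0 for x, y in pairs)
# Scalar version of the Hessian bound |v^T H v| <= L |v|^2.
hessian_ok = all(abs(-math.sin(x)) <= L for x, _ in pairs)
print(two_sided_ok, some_gap_negative, hessian_ok)
```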
Applying the Fundamental Theorem of Calculus to the gradients:

\nabla f (y) - \nabla f (x) = \int_{0}^{1} \nabla^{2} f (x + r t) r d t

where $r = y - x$ . Taking the $ℓ_{2}$ norm of both sides and applying Jensen’s inequality gives us

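∣∣\nabla f (y) - \nabla f (x) ∣∣ \leq \int_{0}^{1} ∣∣ \nabla^{2} f (x + r t) r ∣∣ d t \leq \int_{0}^{1} L ∣∣ r ∣∣ d t = L ∣∣ y - x ∣∣

The second inequality uses the Hessian bound: for a symmetric Hessian, $∣ v^{T} \nabla^{2} f (x) v ∣ \leq L ∣∣ v ∣∣^{2}$ for all $v$ means every eigenvalue lies in $[- L, L]$ , so $∣∣ \nabla^{2} f (x) r ∣∣ \leq L ∣∣ r ∣∣$ .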
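The argument rests on the identity $\nabla f (y) - \nabla f (x) = \int_{0}^{1} \nabla^{2} f (x + r t) r d t$ and aims at the Lipschitz-gradient bound $∣∣\nabla f (y) - \nabla f (x) ∣∣ \leq L ∣∣ y - x ∣∣$ . Both can be checked numerically; the sketch below assumes the non-convex test function $f (x) = \sum_{i} \sin (x_{i})$ (so $L = 1$ ) and approximates the integral with a midpoint rule. The function, the two points, and the step count are illustrative assumptions.

```python
import math

L = 1.0  # f(x) = sum_i sin(x_i) has Hessian diag(-sin(x_i)), so |v^T H v| <= |v|^2

def grad_f(x):
    return [math.cos(xi) for xi in x]

def hessian_times(x, v):
    # The Hessian is diagonal with entries -sin(x_i).
    return [-math.sin(xi) * vi for xi, vi in zip(x, v)]

def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))

def gradient_difference_via_integral(x, y, steps=1000):
    """Midpoint-rule estimate of the integral of H(x + t r) r over t in [0, 1]."""
    r = [yi - xi for xi, yi in zip(x, y)]
    total = [0.0] * len(x)
    for k in range(steps):
        t = (k + 0.5) / steps
        point = [xi + t * ri for xi, ri in zip(x, r)]
        total = [acc + h / steps for acc, h in zip(total, hessian_times(point, r))]
    return total

x, y = [0.3, -1.2], [2.0, 0.5]
exact = [gy - gx for gx, gy in zip(grad_f(x), grad_f(y))]
approx = gradient_difference_via_integral(x, y)
r_norm = norm([yi - xi for xi, yi in zip(x, y)])
print(exact, approx)
print(norm(exact) <= L * r_norm)
```

The two printed vectors should agree up to the quadrature error of the midpoint rule.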

¹ Generally we don’t worry about the lower bound because we often work with functions that are at least convex (on top of being smooth), so the lower bound on $D_{f} (y, x)$ is $0$ , which is tighter than what smoothness alone gives us. Here, however, we want to prove something that holds for all $L -$ smooth functions, so we cannot use the convexity lower bound.
² So we’re actually starting from the Hessian bound on smooth functions.

Pranav's Notes

Explorer

On Convexity, Smoothness, and Strong Convexity

The Starting Point: First-Order Approximation Error

Hessian Properties

Gradient Geometry

Gradient Growth Bounds

Graph View

Pranav's Notes

Explorer

On Convexity, Smoothness, and Strong Convexity

The Starting Point: First-Order Approximation Error

Hessian Properties

Gradient Geometry

Gradient Growth Bounds

Footnotes

Graph View