
Big Idea

Newton’s method converges quadratically, but only if you start close enough. Globalization modifies Newton to converge from farther away by controlling step sizes. The two main approaches differ in what they monitor: the residual $\|\mathbf{F}(\mathbf{x})\|$ (backward error) or the Newton correction $\|\Delta\mathbf{x}\|$ (forward error estimate).

The Problem with Pure Newton

Newton’s method has a fundamental tension:

Example 1 (Newton Diverging)

Consider $f(x) = \arctan(x)$ with root at $x^* = 0$.

Newton’s method: $x_{k+1} = x_k - (1+x_k^2)\arctan(x_k)$

Starting from $x_0 = 2$: $x_1 \approx -3.54$, $x_2 \approx 13.95$, $x_3 \approx -279$, and the iterates keep growing. The full Newton step overshoots dramatically.

[Figure: Newton’s method on $f(x) = \arctan(x)$ from $x_0 = 2$; the tangent line at $x_0$ overshoots the root.]
Newton iterates:
   k         x_k    |f(x_k)|   |f/f'|(x_k)
   0      2.0000    1.11e+00      5.54e+00
   1     -3.5357    1.30e+00      1.75e+01
   2     13.9510    1.50e+00      2.93e+02
   3   -279.3441    1.57e+00      1.22e+05

The iterates diverge: |x_k| grows at each step.
The residual |f(x_k)| also grows (though bounded by pi/2 for arctan).
The Newton correction |f/f'| grows even faster, reflecting the true error.
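The divergent run above can be reproduced in a few lines; a minimal sketch (NumPy; the helper name `newton_arctan` is ours):

```python
import numpy as np

def newton_arctan(x0, n_steps=4):
    """Plain Newton iteration for f(x) = arctan(x), with f'(x) = 1/(1 + x^2)."""
    x = x0
    history = [x]
    for _ in range(n_steps):
        x = x - (1 + x**2) * np.arctan(x)   # full (undamped) Newton step
        history.append(x)
    return history

for k, xk in enumerate(newton_arctan(2.0)):
    print(f"{k}  x_k = {xk:12.4f}   |f(x_k)| = {abs(np.arctan(xk)):.2e}")
```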

Quantitative Convergence: The Newton-Kantorovich Theorem

The local convergence theorem guarantees convergence if $\mathbf{x}_0$ is “close enough” to $\mathbf{x}^*$. But how close is close enough? The Newton-Kantorovich theorem answers this using only information available at the initial point.

Theorem 1 (Newton-Kantorovich Theorem)

Let $\mathbf{F}$ be continuously differentiable with:

  • $D\mathbf{F}(\mathbf{x}_0)$ nonsingular with $\|[D\mathbf{F}(\mathbf{x}_0)]^{-1}\| \leq \beta$

  • $\|[D\mathbf{F}(\mathbf{x}_0)]^{-1}\mathbf{F}(\mathbf{x}_0)\| \leq \eta$ (size of first Newton step)

  • $\|D\mathbf{F}(\mathbf{x}) - D\mathbf{F}(\mathbf{y})\| \leq L\|\mathbf{x} - \mathbf{y}\|$ (Lipschitz constant $L$)

If $h = \beta\eta L \leq \frac{1}{2}$, then Newton’s method converges to a root $\mathbf{x}^*$ with:

$$\|\mathbf{x}^* - \mathbf{x}_0\| \leq \frac{1 - \sqrt{1 - 2h}}{h}\eta$$

The condition $h \leq 1/2$ combines three quantities:

  • How small the initial step is ($\eta$)

  • How sensitive the Jacobian is to changes ($L$)

  • How large the Jacobian inverse is ($\beta$)

Small $h$ means we’re in the “safe zone” for convergence. All three quantities can be computed (or bounded) at the starting point, without knowing the root.
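As a sketch of how the test might be evaluated in practice, assuming the user supplies a Lipschitz bound $L$ for the Jacobian (the helper `kantorovich_h` and the example system are ours):

```python
import numpy as np

def kantorovich_h(F, DF, x0, L):
    """Evaluate h = beta * eta * L at the starting point x0.

    beta bounds ||DF(x0)^{-1}||, eta = ||DF(x0)^{-1} F(x0)|| is the size of the
    first Newton step, and L is a user-supplied Lipschitz constant for DF.
    """
    J0 = DF(x0)
    beta = np.linalg.norm(np.linalg.inv(J0), 2)        # ||DF(x0)^{-1}|| in the 2-norm
    eta = np.linalg.norm(np.linalg.solve(J0, F(x0)))   # size of the first Newton step
    h = beta * eta * L
    return h, h <= 0.5

# Example: F(x, y) = (x^2 + y^2 - 1, x - y); here ||DF(u) - DF(v)|| = 2||u - v||, so L = 2.
F  = lambda v: np.array([v[0]**2 + v[1]**2 - 1.0, v[0] - v[1]])
DF = lambda v: np.array([[2*v[0], 2*v[1]], [1.0, -1.0]])
print(kantorovich_h(F, DF, np.array([0.8, 0.8]), L=2.0))   # h is about 0.18: convergence guaranteed
```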

The Kantorovich theorem tells us when convergence is guaranteed. When the conditions are not satisfied, we need strategies to make the method converge from farther away.

The fix is visible in the plot above: the tangent line is a good local approximation, but the full step overshoots because $\arctan$ flattens out far from the root. If we had taken only a fraction of the Newton step, the iterate would have moved toward zero instead of away from it. This motivates damped Newton: instead of taking the full step $\Delta\mathbf{x}$, take a fraction:

$$\mathbf{x}_{k+1} = \mathbf{x}_k + \lambda_k \Delta\mathbf{x}_k$$

where $\lambda_k \in (0, 1]$ is the damping factor. The question is how to choose $\lambda_k$, and this is where the two approaches diverge.

Approach 1: Residual-Based Backtracking

The simplest strategy: define a merit function that measures progress, and require each step to decrease it.

For root-finding, the natural merit function is the squared residual:

$$\phi(\mathbf{x}) = \frac{1}{2}\|\mathbf{F}(\mathbf{x})\|_2^2$$

Recall from the error analysis that the residual $\|\mathbf{F}(\hat{\mathbf{x}})\|$ is the backward error: it measures how well $\hat{\mathbf{x}}$ satisfies the equation. So backtracking on $\phi$ amounts to requiring each step to reduce the backward error.

The gradient of $\phi$ is $\nabla\phi(\mathbf{x}) = D\mathbf{F}(\mathbf{x})^T\mathbf{F}(\mathbf{x})$, connecting root-finding to optimization.

The Armijo Condition

Start with $\lambda = 1$ (full Newton step) and reduce until we see sufficient decrease in $\phi$.

Definition 1 (The Armijo Condition)

Accept step size $\lambda$ if:

$$\phi(\mathbf{x}_k + \lambda\Delta\mathbf{x}) \leq \phi(\mathbf{x}_k) + c\lambda \nabla\phi(\mathbf{x}_k)^T\Delta\mathbf{x}$$

where $c \in (0, 1)$ is a small constant (typically $c = 10^{-4}$).

The right-hand side is a linear prediction of $\phi$ along the search direction, scaled by $c$. Since the Newton direction is a descent direction ($\nabla\phi^T\Delta\mathbf{x} < 0$ when $D\mathbf{F}$ is nonsingular), this line has negative slope. The small $c$ makes the condition easy to satisfy: we just need some decrease, not optimal decrease.

If the condition fails, the step is shrunk by a factor $\rho$ and tested again. Typical values: $c = 10^{-4}$, $\rho = 0.5$.
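A minimal sketch of damped Newton with Armijo backtracking on $\phi$, assuming a dense Jacobian and using `numpy.linalg.solve` (the function name and the simple stopping test are ours):

```python
import numpy as np

def newton_armijo(F, DF, x0, tol=1e-10, max_iter=50, c=1e-4, rho=0.5):
    """Damped Newton: backtrack on phi(x) = 0.5*||F(x)||^2 until the Armijo condition holds."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        Fx = F(x)
        if np.linalg.norm(Fx) < tol:
            break
        J = DF(x)
        dx = np.linalg.solve(J, -Fx)        # Newton direction
        phi = 0.5 * Fx @ Fx
        slope = (J.T @ Fx) @ dx             # grad(phi)^T dx, negative for a descent direction
        lam = 1.0
        while True:
            Ft = F(x + lam * dx)
            if 0.5 * Ft @ Ft <= phi + c * lam * slope or lam < 1e-12:
                break                        # sufficient decrease (or give up shrinking)
            lam *= rho
        x = x + lam * dx
    return x

# The arctan example from above now converges from x0 = 2 instead of diverging.
root = newton_armijo(lambda x: np.arctan(x),
                     lambda x: np.array([[1.0 / (1.0 + x[0]**2)]]),
                     np.array([2.0]))
```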

Properties

Theorem 2 (Convergence of Backtracking Newton)

Under mild conditions on $\mathbf{F}$, damped Newton with Armijo backtracking satisfies:

$$\lim_{k \to \infty} \|\nabla\phi(\mathbf{x}_k)\| = 0$$

If additionally the Jacobian is nonsingular at limit points, the method converges to a root.

Proof 1

Each step decreases $\phi$ by at least a fixed fraction (the Armijo condition guarantees this). Since $\phi \geq 0$, the decreases must eventually become small. This happens only when $\nabla\phi \to 0$, which for our merit function means $D\mathbf{F}^T\mathbf{F} \to 0$.

Remark 1 (What “Global Convergence” Actually Means)

This result is sometimes called “global convergence,” meaning convergence from any starting point (as opposed to Newton’s local convergence, which requires starting near the root). However, the guarantee is weaker than it sounds: $\nabla\phi \to 0$ means $D\mathbf{F}^T\mathbf{F} \to 0$, which could be satisfied at a stationary point of $\phi$ where $\|\mathbf{F}\| > 0$ (a local minimum of the residual that is not a root). True convergence to a root requires the additional condition that $D\mathbf{F}$ is nonsingular at the limit point.

From Armijo to NLEQ-RES

The simple Armijo backtracking above halves $\lambda$ blindly until sufficient decrease is achieved. Deuflhard’s NLEQ-RES refines this with the same adaptive prediction machinery used by NLEQ-ERR (described below), but applied to the residual. The prediction uses a quadratic model along the Newton direction:

$$\mu_k = \frac{\tfrac{1}{2}\|\mathbf{F}(\mathbf{x}_k)\|\,\lambda_k^2}{\|\mathbf{F}(\mathbf{x}^{\text{trial}}) - (1 - \lambda_k)\mathbf{F}(\mathbf{x}_k)\|}$$

The reduced damping factor is $\lambda_k \gets \min(\mu_k, \lambda_k / 2)$, and the monotonicity test can be either standard ($\theta < 1$) or restricted ($\theta < 1 - \lambda/4$). This gives NLEQ-RES faster convergence than naive backtracking while still monitoring the residual.
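A sketch of that rejected-step update, translating the formula above directly (the function name is ours, not Deuflhard’s code):

```python
import numpy as np

def nleqres_reduce(F_k, F_trial, lam):
    """Rejected-step update for residual-based damping.

    Computes mu = (0.5 * ||F(x_k)|| * lam^2) / ||F(x_trial) - (1 - lam) * F(x_k)||
    and returns the reduced damping factor min(mu, lam / 2).
    """
    mu = 0.5 * np.linalg.norm(F_k) * lam**2 / np.linalg.norm(F_trial - (1.0 - lam) * F_k)
    return min(mu, lam / 2.0)
```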

Limitations

Despite these improvements, damping based on $\|\mathbf{F}\|$ has a subtle flaw: it is not affine invariant. To understand why this matters, we first need to define what affine invariance means.

Affine Invariance

Definition 2 (Affine Invariance)

An iterative method for solving $\mathbf{F}(\mathbf{x}) = \mathbf{0}$ is affine invariant if it produces the same iterates (up to coordinate change) when applied to the transformed problem $\tilde{\mathbf{F}}(\mathbf{x}) = A\mathbf{F}(\mathbf{x})$ for any nonsingular matrix $A$.

Why should we care? The matrix $A$ represents an arbitrary scaling or rotation of the equations. If we multiply the first equation by 1000, the mathematical problem is the same, but a non-invariant method may behave completely differently. An affine invariant method is insensitive to such choices.

Remark 2 (Newton’s Method is Affine Invariant)

Pure Newton’s method (with $\lambda = 1$) is affine invariant. Under $\tilde{\mathbf{F}} = A\mathbf{F}$:

$$D\tilde{\mathbf{F}}^{-1}\tilde{\mathbf{F}} = (A\,D\mathbf{F})^{-1}(A\mathbf{F}) = D\mathbf{F}^{-1}\mathbf{F}$$

The Newton correction $\Delta\mathbf{x} = -D\mathbf{F}^{-1}\mathbf{F}$ is unchanged. But the residual norm $\|\tilde{\mathbf{F}}\| = \|A\mathbf{F}\| \neq \|\mathbf{F}\|$ in general.

So Newton itself is affine invariant, but Armijo backtracking on $\|\mathbf{F}\|$ breaks this invariance.
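A quick numerical check of this asymmetry; the $2\times 2$ system and the scaling matrix are arbitrary illustrations:

```python
import numpy as np

F  = lambda v: np.array([v[0]**2 + v[1]**2 - 1.0, v[0] - v[1]])   # an arbitrary 2x2 system
DF = lambda v: np.array([[2*v[0], 2*v[1]], [1.0, -1.0]])

A = np.diag([1000.0, 1.0])          # rescale the first equation by 1000
x = np.array([2.0, 0.5])

dx        = np.linalg.solve(DF(x),     -F(x))         # Newton step for F
dx_scaled = np.linalg.solve(A @ DF(x), -(A @ F(x)))   # Newton step for A*F
print(np.allclose(dx, dx_scaled))                     # True: the correction is unchanged
print(np.linalg.norm(F(x)), np.linalg.norm(A @ F(x))) # the residual norms differ wildly
```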

Approach 2: Error-Oriented Step Control (NLEQ-ERR)

Deuflhard’s insight: instead of monitoring the residual $\|\mathbf{F}\|$ (backward error), monitor the Newton correction $\|\Delta\mathbf{x}\|$ (forward error estimate). Since the Newton correction is affine invariant, this preserves the invariance of the overall method.

The Natural Level Function

The natural level function for error-oriented Newton is:

$$h(\mathbf{x}) = \|D\mathbf{F}(\mathbf{x})^{-1}\mathbf{F}(\mathbf{x})\| = \|\Delta\mathbf{x}\|$$

To see why this estimates the forward error, note that by the mean value theorem:

$$\mathbf{F}(\mathbf{x}) = \mathbf{F}(\mathbf{x}) - \mathbf{F}(\mathbf{x}^*) = D\mathbf{F}(\xi)(\mathbf{x} - \mathbf{x}^*)$$

for some $\xi$ between $\mathbf{x}$ and $\mathbf{x}^*$. Inverting gives:

$$\mathbf{x} - \mathbf{x}^* = D\mathbf{F}(\xi)^{-1}\mathbf{F}(\mathbf{x}) \approx D\mathbf{F}(\mathbf{x})^{-1}\mathbf{F}(\mathbf{x}) = -\Delta\mathbf{x}$$

So the Newton correction is literally the linearized estimate of the error $\mathbf{x} - \mathbf{x}^*$, and this estimate improves as $\mathbf{x} \to \mathbf{x}^*$. This is the same principle as the step size lemma from the scalar case: if the iteration is contractive with factor $\theta < 1$, then $\|\mathbf{x}_k - \mathbf{x}^*\| \leq \|\Delta\mathbf{x}_k\|/(1-\theta)$. The contraction factor $\theta_k$ below plays exactly this role.
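A tiny numerical illustration with the scalar arctan example (root at zero, so the true error is simply $|x|$):

```python
import numpy as np

# For f(x) = arctan(x) the root is x* = 0, so the true error at x is just |x|.
# The Newton correction |f(x)/f'(x)| = (1 + x^2)*|arctan(x)| tracks it better and better.
for x in [1.0, 0.5, 0.1, 0.01]:
    correction = (1 + x**2) * abs(np.arctan(x))
    print(f"x = {x:5.2f}   true error = {x:.4f}   |Delta x| = {correction:.4f}")
```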

Monitoring $\|\Delta\mathbf{x}\|$ is the right choice because:

  1. It estimates what we actually care about (distance to the root).

  2. It is affine invariant ($A$ cancels in $D\mathbf{F}^{-1}\mathbf{F}$).

  3. Near the root, $\|\Delta\mathbf{x}\| \to 0$ is exactly the convergence criterion.

The Contraction Factor

After a damped step $\mathbf{x}_{k+1} = \mathbf{x}_k + \lambda_k\Delta\mathbf{x}_k$ with $\lambda_k \in (0, 1]$, compute the simplified Newton correction at the trial point using the old Jacobian:

$$\overline{\Delta\mathbf{x}}_{k+1} = -D\mathbf{F}(\mathbf{x}_k)^{-1}\mathbf{F}(\mathbf{x}_{k+1})$$

The contraction factor is:

$$\theta_k = \frac{\|\overline{\Delta\mathbf{x}}_{k+1}\|}{\|\Delta\mathbf{x}_k\|}$$

This measures how much the error estimate shrank in one step. From the Banach fixed point theorem, convergence requires $\theta_k < 1$ (the iteration must be a contraction). Deuflhard uses a restricted monotonicity test:

$$\theta_k < 1 - \frac{\lambda_k}{4}$$

which is stronger than $\theta_k < 1$ and ensures well-controlled convergence.

Adaptive Step Size Prediction

A key feature of NLEQ-ERR: predict the optimal $\lambda$ rather than just halving when the test fails.

Two predictions are used. Within the damping loop (when a trial step is rejected), the prediction is based on the deviation between the actual and predicted corrections:

$$\mu_k^{\text{damp}} = \frac{\tfrac{1}{2}\|\Delta\mathbf{x}_k\|\,\lambda_k^2}{\|\overline{\Delta\mathbf{x}}_{k+1} - (1 - \lambda_k)\Delta\mathbf{x}_k\|}$$

The reduced damping factor is $\lambda_k \gets \min(\mu_k^{\text{damp}}, \lambda_k / 2)$.

Between iterations (once a step is accepted), the prediction for the next iteration uses the ratio of successive corrections:

$$\mu_k^{\text{next}} = \frac{\|\Delta\mathbf{x}_{k-1}\| \cdot \|\overline{\Delta\mathbf{x}}_k\|}{\|\overline{\Delta\mathbf{x}}_k - \Delta\mathbf{x}_k\| \cdot \|\Delta\mathbf{x}_k\|} \cdot \lambda_k$$

The next damping factor is $\lambda_{k+1} = \min(1, \mu_k^{\text{next}})$. Both predictions adapt aggressively: when the linearization is accurate ($\mu$ is large), $\lambda$ ramps up quickly toward 1. When it is poor, $\lambda$ stays small.

An additional safeguard: if the predicted $\lambda$ would increase by a factor of 4 or more and no reduction occurred in the damping loop, the algorithm retries with the larger $\lambda$ before accepting the current step. This avoids accepting an overly conservative step when the prediction says a much larger step would succeed.

The Algorithm

Algorithm 2 (NLEQ-ERR, Simplified)

Input: $\mathbf{F}$, $D\mathbf{F}$, initial guess $\mathbf{x}_0$, tolerance $\varepsilon$

Output: Approximate root $\mathbf{x}$

  1. Choose initial $\lambda_0 \in (0, 1]$ (based on a nonlinearity estimate: $\lambda_0 = 1$ for mildly nonlinear, $\lambda_0 = 10^{-2}$ for highly nonlinear, $\lambda_0 = 10^{-4}$ for extremely nonlinear)

  2. for $k = 0, 1, 2, \ldots$:

  3. $\qquad$ Solve $D\mathbf{F}(\mathbf{x}_k)\Delta\mathbf{x}_k = -\mathbf{F}(\mathbf{x}_k)$

  4. $\qquad$ if $\|\Delta\mathbf{x}_k\| < \varepsilon$: return $\mathbf{x}_k + \Delta\mathbf{x}_k$

  5. $\qquad$ Damping loop:

  6. $\qquad\qquad$ $\mathbf{x}^{\text{trial}} \gets \mathbf{x}_k + \lambda_k\Delta\mathbf{x}_k$

  7. $\qquad\qquad$ Solve $D\mathbf{F}(\mathbf{x}_k)\overline{\Delta\mathbf{x}} = -\mathbf{F}(\mathbf{x}^{\text{trial}})$ (reuse Jacobian)

  8. $\qquad\qquad$ Compute $\theta_k = \|\overline{\Delta\mathbf{x}}\|/\|\Delta\mathbf{x}_k\|$

  9. $\qquad\qquad$ if $\theta_k < 1 - \lambda_k/4$: accept step, predict $\lambda_{k+1}$ from $\mu_k^{\text{next}}$

  10. $\qquad\qquad$ else: $\lambda_k \gets \min(\mu_k^{\text{damp}}, \lambda_k/2)$, repeat damping loop

  11. $\qquad$ $\mathbf{x}_{k+1} \gets \mathbf{x}^{\text{trial}}$

Note that step 7 reuses the Jacobian from step 3. The factored Jacobian is already available, so computing $\overline{\Delta\mathbf{x}}$ requires only a forward/backward substitution (cheap compared to a new factorization).
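Putting the pieces together, here is a sketch of the simplified algorithm in NumPy. It is a teaching translation of the steps above, not Deuflhard’s production NLEQ-ERR code: scaling, rank-1 Jacobian updates, and the retry safeguard are omitted, and all names are ours.

```python
import numpy as np

def nleq_err(F, DF, x0, eps=1e-10, lam0=1.0, max_iter=50):
    """Simplified error-oriented damped Newton in the spirit of Algorithm 2.

    lam0 is the initial damping factor (1 for mildly nonlinear problems,
    1e-2 / 1e-4 for highly / extremely nonlinear ones).
    """
    x = np.asarray(x0, dtype=float)
    lam = lam0
    dx_prev = dx_bar_prev = None     # Delta x_{k-1} and the simplified correction at x_k
    for _ in range(max_iter):
        J = DF(x)
        dx = np.linalg.solve(J, -F(x))               # ordinary Newton correction Delta x_k
        if np.linalg.norm(dx) < eps:
            return x + dx                            # converged (step 4)
        if dx_prev is not None:                      # predict lambda_k from the last iteration
            mu_next = (np.linalg.norm(dx_prev) * np.linalg.norm(dx_bar_prev)
                       / (np.linalg.norm(dx_bar_prev - dx) * np.linalg.norm(dx)) * lam)
            lam = min(1.0, mu_next)
        while True:                                  # damping loop (steps 5-10)
            x_trial = x + lam * dx
            dx_bar = np.linalg.solve(J, -F(x_trial)) # simplified correction, old Jacobian
            theta = np.linalg.norm(dx_bar) / np.linalg.norm(dx)
            if theta < 1 - lam / 4:                  # restricted monotonicity test: accept
                break
            mu_damp = (0.5 * np.linalg.norm(dx) * lam**2
                       / np.linalg.norm(dx_bar - (1 - lam) * dx))
            lam = min(mu_damp, lam / 2)              # rejected: reduce the damping factor
            if lam < 1e-10:
                raise RuntimeError("damping factor underflow: no acceptable step found")
        x, dx_prev, dx_bar_prev = x_trial, dx, dx_bar
    return x

# Example usage on the arctan problem, which pure Newton cannot solve from x0 = 2.
root = nleq_err(lambda x: np.arctan(x),
                lambda x: np.array([[1.0 / (1.0 + x[0]**2)]]),
                np.array([2.0]))
```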

Theorem 3 (Local Quadratic Convergence Recovery)

For both methods: if $\mathbf{x}_k \to \mathbf{x}^*$ where $\mathbf{F}(\mathbf{x}^*) = \mathbf{0}$ and $D\mathbf{F}(\mathbf{x}^*)$ is nonsingular, then $\lambda_k = 1$ for all $k$ sufficiently large. Quadratic convergence is recovered near the root.

Proof 2

Near the solution, the Newton step becomes very accurate. For backtracking, the full step easily satisfies the Armijo condition. For NLEQ-ERR, the contraction factor $\theta_k \to 0$ (since Newton converges quadratically), so the monotonicity test is satisfied with $\lambda = 1$.

Visualization: Newton Fractals

The difference between these approaches is dramatically visible in the complex plane. Applying Newton’s method to $f(z) = z^3 - 1$ from a grid of starting points and coloring by which root the iteration converges to produces the classic Newton fractal. Backtracking on $|f(z)|$ does not eliminate the fractal (because $|f|$ has multiple valleys), while NLEQ-ERR’s error-oriented monitoring significantly smooths the basin boundaries.
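A sketch of how such a basin picture can be generated for pure Newton (plotting omitted; the grid resolution and iteration count are arbitrary choices):

```python
import numpy as np

def newton_basins(n=400, extent=1.5, max_iter=40):
    """Assign each starting point in a grid to the root of z^3 - 1 it converges to."""
    roots = np.array([1.0, np.exp(2j * np.pi / 3), np.exp(-2j * np.pi / 3)])
    xs = np.linspace(-extent, extent, n)
    z = xs[None, :] + 1j * xs[:, None]          # grid of complex starting points
    for _ in range(max_iter):
        fp = 3 * z**2                           # f'(z) for f(z) = z^3 - 1
        z = np.where(np.abs(fp) > 1e-30, z - (z**3 - 1) / fp, z)   # pure Newton step
    # Label each point by the nearest root: the basin index that gets colored.
    return np.argmin(np.abs(z[..., None] - roots[None, None, :]), axis=-1)

basins = newton_basins()   # e.g. plt.imshow(basins) shows the classic three-lobed fractal
```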

See Also

Newton’s Method in the Complex Plane: interactive comparison of standard Newton, residual-based damping (NLEQ-RES), and error-oriented damping (NLEQ-ERR) applied to $z^3 - 1 = 0$.

Iteration Trajectories

The following example shows the iteration paths for four methods (pure Newton, simple Armijo backtracking, NLEQ-RES, and NLEQ-ERR) applied to $f(z) = z^3 - 1$ from $z_0 = 0.35 + 0.22i$, a starting point near a basin boundary where pure Newton bounces erratically before recovering. The top panel shows the trajectories in the complex plane overlaid on the Newton correction landscape $|f(z)/f'(z)|$. The bottom panels compare the convergence histories.


Iteration trajectories for $f(z) = z^3 - 1$ from $z_0 = 0.35 + 0.22i$. Top: Paths in the complex plane overlaid on the Newton correction landscape $|f/f'|$ (darker = closer to a root). Pure Newton (red) bounces between all three basins before settling. Simple Armijo backtracking (orange) damps by halving $\lambda$ until $|f|$ decreases. NLEQ-RES (black) uses Deuflhard’s residual-based damping with predicted step sizes. NLEQ-ERR (blue) damps based on the correction norm. Bottom left: The residual $|f(z_k)|$ (backward error). Bottom right: The Newton correction $|f(z_k)/f'(z_k)|$ (forward error estimate, matching the landscape above). The damped methods converge monotonically while undamped Newton oscillates.

Reference: P. Deuflhard, Newton Methods for Nonlinear Problems: Affine Invariance and Adaptive Algorithms, Springer, 2011.