{"id":2278,"date":"2025-10-04T23:48:02","date_gmt":"2025-10-04T21:48:02","guid":{"rendered":"https:\/\/mpr-projects.com\/?p=2278"},"modified":"2025-10-04T23:48:02","modified_gmt":"2025-10-04T21:48:02","slug":"computer-vision-non-linear-optimization-methods","status":"publish","type":"post","link":"https:\/\/mpr-projects.com\/index.php\/2025\/10\/04\/computer-vision-non-linear-optimization-methods\/","title":{"rendered":"Computer Vision: Non-Linear Optimization Methods"},"content":{"rendered":"\n<p>This is a brief summary of some of the optimization methods used in Computer Vision topics like Camera Calibration and 3D-Reconstructions.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<p>We&#8217;re going to start with a relatively simple method and work our way to the more complicated but also a lot more powerful method that&#8217;s actually used in computer vision topics.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Newton&#8217;s Method<\/h3>\n\n\n\n<p>The simplest method we can use is Newton&#8217;s method. We first look at it in 1 dimension, then in higher dimensions and finally we consider some extensions.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Newton&#8217;s Method in 1D<\/h4>\n\n\n\n<p>Given a function<\/p>\n\n\n\n<p>$$f: \\mathbb{R} \\rightarrow \\mathbb{R}$$<\/p>\n\n\n\n<p>find<\/p>\n\n\n\n<p>$$\\min_{x \\in \\mathbb{R}} f(x).$$<\/p>\n\n\n\n<p>To solve this problem we look for the offset $dx$ that minimizes $f(x + dx)$. We start by writing the Taylor expansion of $f(x + dx)$,<\/p>\n\n\n\n<p>$$f(x + dx) = f(x) + f'(x) dx + \\frac{1}{2} f&#8221;(x) dx^2 + O(dx^3).$$<\/p>\n\n\n\n<p>The Taylor expansion can be written with (infinitely) many terms but in Newton&#8217;s method we only keep the first three terms (and we assume the others are negligible). 
To minimize this function we first take its derivative,<\/p>\n\n\n\n<p>$$\\frac{d}{dx} f(x + dx) = f'(x) + f&#8221;(x) dx$$<\/p>\n\n\n\n<p>and set it to zero<\/p>\n\n\n\n<p>$$f'(x) + f&#8221;(x) dx = 0.$$<\/p>\n\n\n\n<p>Solving for $dx$ gives<\/p>\n\n\n\n<p>$$ dx = &#8211; \\frac{f'(x)}{f&#8221;(x)}.$$<\/p>\n\n\n\n<p>By moving $x$ in the direction of $dx$ we change the function value. If $f&#8221;(x)$ is positive then $f(x + dx) &lt; f(x)$ (unless we overshoot), if it&#8217;s negative then $f(x + dx) > f(x)$ and if it&#8217;s zero then we&#8217;re in a locally linear section of $f$ and we can&#8217;t actually compute $dx$. So for a minimization method we ideally want to have $f&#8221;(x) > 0$<sup data-fn=\"37d596ca-ec9e-4cbc-8b86-3e1e7cc4ca55\" class=\"fn\"><a href=\"#37d596ca-ec9e-4cbc-8b86-3e1e7cc4ca55\" id=\"37d596ca-ec9e-4cbc-8b86-3e1e7cc4ca55-link\">1<\/a><\/sup>.<\/p>\n\n\n\n<p>After we&#8217;ve computed $dx$ we evaluate $f$ at $x + dx$, but we&#8217;re most probably not at the minimum of $f$. And that&#8217;s because we used the information $f'(x)$ and $f&#8221;(x)$ at point $x$ to compute the offset. But the function value changes as soon as we move away from $x$ and thus the first and second derivatives at $x$ will no longer be accurate. So what we computed above is really just an estimate. To find a minimum we have to repeat the same process again, so we start at some estimate $x_0$ and we compute<\/p>\n\n\n\n<p>$$x_{k+1} = x_{k}\\ -\\ \\frac{f'(x_k)}{f&#8221;(x_k)}$$<\/p>\n\n\n\n<p>until <\/p>\n\n\n\n<p>$$\\left| \\frac{f'(x_k)}{f&#8221;(x_k)} \\right| \\approx 0.$$<\/p>\n\n\n\n<p>Following this recipe allows us to find a local optimum (unless we get stuck in a locally linear section). 
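The iteration above can be sketched in a few lines (a minimal Python sketch; the function name and the caller-supplied derivatives are assumptions for illustration):

```python
def newton_1d(f_prime, f_double_prime, x0, tol=1e-10, max_iter=100):
    """Minimize f by iterating x <- x - f'(x) / f''(x) until the step is tiny."""
    x = x0
    for _ in range(max_iter):
        step = f_prime(x) / f_double_prime(x)
        x = x - step
        if abs(step) < tol:
            break
    return x

# Example: f(x) = (x - 3)^2 + 1 has f'(x) = 2(x - 3), f''(x) = 2.
x_min = newton_1d(lambda x: 2.0 * (x - 3.0), lambda x: 2.0, x0=0.0)
```

Because the example is quadratic, the Taylor model is exact and a single step lands on the minimum at $x = 3$.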
There is no guarantee that we&#8217;ll find the global optimum.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Newton&#8217;s Method in ND<\/h4>\n\n\n\n<p>In higher dimensions we have<\/p>\n\n\n\n<p>$$f: \\mathbb{R}^N \\rightarrow \\mathbb{R}.$$<\/p>\n\n\n\n<p>The first derivative is replaced by the gradient and the second by the Hessian matrix,<\/p>\n\n\n\n<p>$$<br>\\nabla f =<br>\\begin{bmatrix}<br>\\frac{\\partial f}{\\partial x_1} \\\\<br>\\frac{\\partial f}{\\partial x_2} \\\\<br>\\vdots \\\\<br>\\frac{\\partial f}{\\partial x_N}<br>\\end{bmatrix},<br>\\quad<br>\\nabla^2 f =<br>\\begin{bmatrix}<br>\\frac{\\partial^2 f}{\\partial x_1^2} &amp; \\frac{\\partial^2 f}{\\partial x_1 \\partial x_2} &amp; \\cdots &amp; \\frac{\\partial^2 f}{\\partial x_1 \\partial x_N} \\\\<br>\\frac{\\partial^2 f}{\\partial x_2 \\partial x_1} &amp; \\frac{\\partial^2 f}{\\partial x_2^2} &amp; \\cdots &amp; \\frac{\\partial^2 f}{\\partial x_2 \\partial x_N} \\\\<br>\\vdots &amp; \\vdots &amp; \\ddots &amp; \\vdots \\\\<br>\\frac{\\partial^2 f}{\\partial x_N \\partial x_1} &amp; \\frac{\\partial^2 f}{\\partial x_N \\partial x_2} &amp; \\cdots &amp; \\frac{\\partial^2 f}{\\partial x_N^2}<br>\\end{bmatrix}.<br>$$<\/p>\n\n\n\n<p>The Taylor expansion becomes<\/p>\n\n\n\n<p>$$f(\\underline{x} + \\underline{p}) = f(\\underline{x}) + \\nabla f(\\underline{x})^T \\underline{p} + \\frac{1}{2} \\underline{p}^T \\nabla^2 f(\\underline{x}) \\underline{p}$$<\/p>\n\n\n\n<p>and its derivative becomes<\/p>\n\n\n\n<p>$$\\frac{d}{d\\underline{p}} f(\\underline{x} + \\underline{p}) = \\nabla f(\\underline{x}) + \\nabla^2 f(\\underline{x}) \\underline{p}.$$<\/p>\n\n\n\n<p>Setting it to zero and solving for $\\underline{p}$ gives<\/p>\n\n\n\n<p>$$\\underline{x}_{k+1} = \\underline{x}_k\\ -\\ \\left[ \\nabla^2 f(\\underline{x}) \\right]^{-1} \\nabla f(\\underline{x}).$$<\/p>\n\n\n\n<p>This requires us to compute the Hessian matrix (and it must be invertible). 
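One such update can be sketched with NumPy (assumed names; in practice we solve the linear system rather than inverting the Hessian explicitly):

```python
import numpy as np

def newton_step(grad, hess, x):
    """One Newton update: solve H p = -g instead of inverting the Hessian."""
    p = np.linalg.solve(hess(x), -grad(x))
    return x + p

# Example: f(x) = x_1^2 + 2 x_2^2 with gradient [2 x_1, 4 x_2] and a
# constant, positive-definite Hessian; one step reaches the minimum at 0.
grad = lambda x: np.array([2.0 * x[0], 4.0 * x[1]])
hess = lambda x: np.array([[2.0, 0.0], [0.0, 4.0]])
x1 = newton_step(grad, hess, np.array([1.0, 1.0]))
```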
To guarantee convergence to a local minimum the Hessian also needs to be positive-definite (i.e. all eigenvalues must be positive). If the Hessian is indefinite then we may converge to a saddle point or if it&#8217;s negative-definite then we may converge to a maximum. If it&#8217;s positive-definite but the eigenvalues are small then we may get huge or unstable steps.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Extensions<\/h4>\n\n\n\n<p>There are a few extensions to Newton&#8217;s method that make it more stable. I&#8217;m just putting them here so I can remember them, feel free to skip to the next method.<\/p>\n\n\n\n<h5 class=\"wp-block-heading\">Line Search<\/h5>\n\n\n\n<p>Rather than taking a full step<\/p>\n\n\n\n<p>$$\\underline{x}_{k+1} = \\underline{x}_k + \\underline{p}_k$$<\/p>\n\n\n\n<p>where<\/p>\n\n\n\n<p>$$ \\underline{p}_k = &#8211; \\left[ \\nabla^2 f(\\underline{x}) \\right]^{-1} \\nabla f(\\underline{x})$$<\/p>\n\n\n\n<p>we adjust the equation to<\/p>\n\n\n\n<p>$$\\underline{x}_{k+1} = \\underline{x}_k + \\lambda \\underline{p}_k \\text{ for some } \\lambda \\in \\mathbb{R}.$$<\/p>\n\n\n\n<p>Our objective is to find a $\\lambda$ that prevents us from overshooting but that also doesn&#8217;t stall (become too small).<\/p>\n\n\n\n<p>There are different ways to find $\\lambda$. 
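One of them, simple backtracking under a sufficient-decrease test, might look like this (a Python sketch with assumed names; $c_1$ and $\beta$ are illustrative choices):

```python
import numpy as np

def backtracking_line_search(f, grad, x, p, c1=1e-4, beta=0.5, max_iter=50):
    """Shrink lam until the sufficient-decrease (Armijo) condition holds."""
    lam = 1.0
    fx = f(x)
    slope = grad(x) @ p  # directional derivative at lam = 0, should be negative
    for _ in range(max_iter):
        if f(x + lam * p) <= fx + c1 * lam * slope:
            break
        lam *= beta  # reduce the step and test again
    return lam

# Example: a quadratic bowl with a full Newton step; the full step is accepted.
f = lambda x: 0.5 * x @ x
grad = lambda x: x
x0 = np.array([2.0, -1.0])
lam = backtracking_line_search(f, grad, x0, -grad(x0))
```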
A simple way is to start with $\\lambda = 1$ and to check if the Armijo (and Wolfe) conditions are satisfied.<\/p>\n\n\n\n<div class=\"wp-block-group has-global-padding is-layout-constrained wp-container-core-group-is-layout-c9115f92 wp-block-group-is-layout-constrained\" style=\"padding-right:var(--wp--preset--spacing--20);padding-left:var(--wp--preset--spacing--20)\">\n<p><strong>Armijo:<\/strong><\/p>\n\n\n\n<p>$$f(\\underline{x}_k + \\lambda \\underline{p}) \\leq f(\\underline{x}_k) + c_1 \\lambda \\left[ \\nabla f(\\underline{x}_k) \\right]^T \\underline{p} \\quad \\text{with } c_1 \\in (0, 1).$$<\/p>\n\n\n\n<p>Define $\\phi(\\lambda) = f(\\underline{x}_k + \\lambda \\underline{p})$ with $\\phi'(\\lambda) = \\left[ \\nabla f(\\underline{x}_k + \\lambda \\underline{p}) \\right]^T \\underline{p}$. The right-hand-side (RHS) of the equation above is a linear prediction of the function value (using the slope at $\\lambda = 0$). The LHS is the actual value at that point. If the actual value (LHS) is above the prediction (RHS) then our estimate is not accurate, so we reduce $\\lambda$ and check again. This ensures that we take stable steps (i.e. we don&#8217;t overshoot).<\/p>\n<\/div>\n\n\n\n<p>$\\lambda$ can be reduced by multiplying it by some $\\beta \\in (0, 1)$. Armijo may lead to very small steps. This can be prevented by using Wolfe&#8217;s condition. However, that would require a more complicated algorithm, which I won&#8217;t cover here.<\/p>\n\n\n\n<div class=\"wp-block-group has-global-padding is-layout-constrained wp-container-core-group-is-layout-c9115f92 wp-block-group-is-layout-constrained\" style=\"padding-right:var(--wp--preset--spacing--20);padding-left:var(--wp--preset--spacing--20)\">\n<p><strong>Wolfe:<\/strong><\/p>\n\n\n\n<p>$$ \\left[ \\nabla f(\\underline{x}_k + \\lambda \\underline{p}) \\right]^T \\underline{p} \\geq c_2 \\left[ \\nabla f(\\underline{x}_k) \\right]^T \\underline{p} \\quad \\text{ for } c_2 \\in (c_1, 1). $$<\/p>\n\n\n\n<p>The LHS is the slope at the new point. 
The RHS is a fraction ($c_2$) of the slope at $\\underline{x}_k$.<\/p>\n<\/div>\n\n\n\n<h5 class=\"wp-block-heading\">Trust-Region<\/h5>\n\n\n\n<p>We approximate $f$ locally with a quadratic model<\/p>\n\n\n\n<p>$$ m_k(\\underline{p}) = f(\\underline{x}_k) + \\left[\\nabla f(\\underline{x}_k) \\right]^T \\underline{p} + \\frac{1}{2} \\underline{p}^T \\nabla^2 f(\\underline{x}_k) \\underline{p}.$$<\/p>\n\n\n\n<p>Now we minimize<\/p>\n\n\n\n<p>$$ \\min_{\\underline{p}} m_k(\\underline{p}), \\quad \\text{ subject to } \\Vert \\underline{p} \\Vert \\leq \\Delta$$<\/p>\n\n\n\n<p>where $\\Delta$ is the trust-region radius. The update step $\\underline{p}_k$ is computed as before. If $\\Vert \\underline{p}_k \\Vert \\leq \\Delta_k$ then we take a full Newton step $\\underline{x}_{k+1} = \\underline{x}_k + \\underline{p}_k$. Otherwise the solution would be outside of the trust region. In that case we use an optimization method (e.g. <a href=\"https:\/\/en.wikipedia.org\/wiki\/Powell%27s_dog_leg_method\" data-type=\"link\" data-id=\"https:\/\/en.wikipedia.org\/wiki\/Powell%27s_dog_leg_method\" target=\"_blank\" rel=\"noreferrer noopener\">dogleg<\/a>) to find the point on the boundary of the trust region that minimizes $m_k(\\underline{p})$.<\/p>\n\n\n\n<p>Now we&#8217;ve got a candidate $\\underline{x}_{k+1}$ and we want to check if our quadratic model at $\\underline{x}_{k+1}$ is reliable. So we compute the predicted decrease of the function value $m_k(\\underline{0})\\ -\\ m_k(\\underline{p}_k)$ and the actual decrease $f(\\underline{x}_k)\\ -\\ f(\\underline{x}_k + \\underline{p}_k)$. If they are sufficiently similar then we trust the model, take the step, and increase $\\Delta$. 
Otherwise we don&#8217;t trust the model, reject the step and decrease $\\Delta$,<\/p>\n\n\n\n<p>$$<br>\\small<br>\\rho_k =<br>\\frac{f(\\underline{x}_k)\\ -\\ f(\\underline{x}_k + \\underline{p}_k)}<br>{m_k(\\underline{0})\\ -\\ m_k(\\underline{p}_k)},<br>\\quad<br>\\text{e.g.} \\quad<br>\\left\\{<br>\\begin{aligned}<br>\\rho_k &amp;&lt; \\tfrac{1}{4}<br>&amp;&amp;\\Rightarrow\\ \\text{reject},\\<br>   \\underline{x}_{k+1} = \\underline{x}_k,\\<br>   \\Delta_{k+1} = \\tfrac{1}{4}\\,\\Delta_k, \\\\[6pt]<br>\\rho_k &amp;\\ge \\tfrac{1}{4}<br>&amp;&amp;\\Rightarrow\\ \\text{accept},\\ <br>   \\underline{x}_{k+1} = \\underline{x}_k + \\underline{p}_k,\\ <br>   \\Delta_{k+1} = 2\\,\\Delta_k<br>\\end{aligned}<br>\\right.<br>$$<\/p>\n\n\n\n<p>This check also prevents us from taking a step in the wrong direction.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Gauss-Newton Method<\/h3>\n\n\n\n<p>So far we&#8217;ve looked at $f(\\underline{x})$ without assuming any special structure. In the common case of least squares problems $f(\\underline{x})$ is the sum of squared residuals,<\/p>\n\n\n\n<p>$$f( \\underline{x} ) = \\tfrac{1}{2} \\Vert r(\\underline{x}) \\Vert^2 = \\tfrac{1}{2} \\sum_{i=1}^M r_i^2(\\underline{x}), \\quad \\text{with } r: \\mathbb{R}^N \\rightarrow \\mathbb{R}^M.$$<\/p>\n\n\n\n<p>In this case we can specify the gradient,<\/p>\n\n\n\n<p>$$<br>\\nabla f =<br>\\begin{bmatrix}<br>\\frac{\\partial f}{\\partial x_1} \\\\<br>\\frac{\\partial f}{\\partial x_2} \\\\<br>\\vdots \\\\<br>\\frac{\\partial f}{\\partial x_N}<br>\\end{bmatrix}<br>=<br>\\begin{bmatrix}<br>\\frac{\\partial r_1}{\\partial x_1} &amp; \\frac{\\partial r_2}{\\partial x_1} &amp; \\cdots &amp; \\frac{\\partial r_M}{\\partial x_1} \\\\<br>\\frac{\\partial r_1}{\\partial x_2} &amp; \\frac{\\partial r_2}{\\partial x_2} &amp; \\cdots &amp; \\frac{\\partial r_M}{\\partial x_2} \\\\<br>\\vdots &amp; \\vdots &amp; \\ddots &amp; \\vdots \\\\<br>\\frac{\\partial r_1}{\\partial x_N} &amp; \\frac{\\partial 
r_2}{\\partial x_N} &amp; \\cdots &amp; \\frac{\\partial r_M}{\\partial x_N}<br>\\end{bmatrix}<br>\\begin{bmatrix}<br>r_1(x) \\\\<br>r_2(x) \\\\<br>\\vdots \\\\<br>r_M(x)<br>\\end{bmatrix}<br>=<br>J(\\underline{x})^{T} r(\\underline{x})<br>$$<\/p>\n\n\n\n<p>For example, the first line can be found as<\/p>\n\n\n\n<p>$$<br>\\begin{aligned}<br>\\frac{\\partial f(\\underline{x})}{\\partial x_1}<br>&amp;= \\frac{\\partial}{\\partial x_1}<br>\\left(<br>\\tfrac{1}{2} r_1^2(\\underline{x}) +<br>\\tfrac{1}{2} r_2^2(\\underline{x}) +<br>\\cdots +<br>\\tfrac{1}{2} r_M^2(\\underline{x})<br>\\right) \\\\[8pt]<br>&amp;= r_1(\\underline{x}) \\frac{\\partial r_1}{\\partial x_1}<br>+r_2(\\underline{x}) \\frac{\\partial r_2}{\\partial x_1}<br>+\\cdots<br>+r_M(\\underline{x}) \\frac{\\partial r_M}{\\partial x_1} \\\\[8pt]<br>&amp;=<br>\\begin{bmatrix}<br>\\frac{\\partial r_1}{\\partial x_1} &amp;<br>\\frac{\\partial r_2}{\\partial x_1} &amp;<br>\\cdots &amp;<br>\\frac{\\partial r_M}{\\partial x_1}<br>\\end{bmatrix}<br>\\begin{bmatrix}<br>r_1(\\underline{x}) \\\\[4pt]<br>r_2(\\underline{x}) \\\\[2pt]<br>\\vdots \\\\[2pt]<br>r_M(\\underline{x})<br>\\end{bmatrix}<br>\\end{aligned}.<br>$$<\/p>\n\n\n\n<p>The Hessian is <\/p>\n\n\n\n<p>$$<br>\\begin{aligned}<br>\\nabla^2 f(\\underline{x}) &amp;= \\nabla \\left( \\nabla f(\\underline{x}) \\right) \\\\<br>&amp;= \\nabla \\left( J^T(\\underline{x}) r(\\underline{x}) \\right) \\\\<br>&amp;= J^T(\\underline{x}) \\nabla r(\\underline{x}) + \\nabla J^T(\\underline{x}) r(\\underline{x}) \\\\<br>&amp;= J^T(\\underline{x}) J(\\underline{x}) + \\sum_{i=1}^M r_i(\\underline{x}) \\nabla^2 r_i(\\underline{x}).<br>\\end{aligned}<br>$$<\/p>\n\n\n\n<p>If the residuals are small, i.e. 
if we&#8217;re close to a solution, then the last term in the Hessian is negligible and we can approximate the Hessian with<\/p>\n\n\n\n<p>$$ \\nabla^2 f(\\underline{x}) \\approx J^T(\\underline{x}) J(\\underline{x}),$$<\/p>\n\n\n\n<p>where $J^T J$ is always positive semi-definite. The Gauss-Newton method makes this approximation. The Taylor expansion is then<\/p>\n\n\n\n<p>$$f(\\underline{x} + \\underline{p}) \\approx f(\\underline{x}) + \\left[ J^T(\\underline{x}) r(\\underline{x}) \\right]^T \\underline{p} + \\frac{1}{2} \\underline{p}^T J^T(\\underline{x}) J(\\underline{x}) \\underline{p}. $$<\/p>\n\n\n\n<p>The derivative is computed as<\/p>\n\n\n\n<p>$$ \\frac{df}{d\\underline{p}} \\approx J^T \\underline{r} + J^T J \\underline{p}$$<\/p>\n\n\n\n<p>and the update rule is given by<\/p>\n\n\n\n<p>$$ \\underline{p} = -\\left( J^T(\\underline{x}) J(\\underline{x})\\right)^{-1} J^T(\\underline{x}) \\underline{r}.$$<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Newton vs Gauss-Newton<\/h3>\n\n\n\n<p>Let&#8217;s briefly compare how these methods differ.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Cost<\/h4>\n\n\n\n<p>Newton requires the calculation of the Hessian which is $O(mn^2)$, Gauss-Newton only needs the Jacobian, which is $O(mn)$ and usually computed anyway.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Solving<\/h4>\n\n\n\n<p>Gauss-Newton&#8217;s $J^T J p = &#8211; J^T \\underline{r}$ is equivalent to $\\min_{p} \\Vert J\\underline{p} + \\underline{r}\\Vert^2$, which can be solved efficiently with QR, SVD or other decompositions. Newton&#8217;s $\\nabla^2 f \\underline{p} = -J^T \\underline{r}$ can&#8217;t be solved that way. We have to use the Cholesky decomposition if $\\nabla^2f$ is positive-definite and the $LDL^T$ decomposition otherwise.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Convergence<\/h4>\n\n\n\n<p>Newton is quadratic (in theory), Gauss-Newton is (super-)linear. 
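To make the comparison concrete, here is a sketch of the Gauss-Newton loop (assumed names; each step solves the linearized least-squares problem with NumPy instead of forming $J^T J$ explicitly):

```python
import numpy as np

def gauss_newton(residual, jac, x0, tol=1e-10, max_iter=100):
    """Gauss-Newton: repeatedly solve min_p ||J p + r||^2 and step."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        r, J = residual(x), jac(x)
        p, *_ = np.linalg.lstsq(J, -r, rcond=None)  # least-squares step
        x = x + p
        if np.linalg.norm(p) < tol:
            break
    return x

# Example: fit y = a * exp(b * t) to noise-free data generated with a=2, b=0.5.
t = np.linspace(0.0, 1.0, 20)
y = 2.0 * np.exp(0.5 * t)
residual = lambda x: x[0] * np.exp(x[1] * t) - y
jac = lambda x: np.column_stack([np.exp(x[1] * t),
                                 x[0] * t * np.exp(x[1] * t)])
x_fit = gauss_newton(residual, jac, [1.0, 0.0])
```

Since the data are noise-free this is a zero-residual problem, which is exactly the regime where the Gauss-Newton approximation of the Hessian is good.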
So if Newton works then it converges more quickly (although each step requires a lot more computation).<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Issues with Gauss-Newton<\/h3>\n\n\n\n<p>While Gauss-Newton is specialized for least-squares problems there are still some issues to consider:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Gauss-Newton drops the second term of the Hessian $\\sum_{i=1}^M r_i(\\underline{x}) \\nabla^2 r_i(\\underline{x})$. If we&#8217;re not close to the solution then this term may be large, which can cause the method to diverge.<\/li>\n\n\n\n<li>It requires $J(\\underline{x})$ to have full-rank, otherwise $J^T J$ may be singular and we won&#8217;t be able to compute the step (or steps may not be stable).<\/li>\n\n\n\n<li>Far from the solution we may not converge to any solution at all.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Levenberg-Marquardt<\/h3>\n\n\n\n<p>These issues are addressed in an extension of Gauss-Newton. It can be viewed in two ways, from a <em>damping-parameter<\/em> point of view and from a <em>trust-region<\/em> point of view.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Damping Parameter View<\/h4>\n\n\n\n<p>Here we adjust the Gauss-Newton update by including a damping parameter $\\lambda$,<\/p>\n\n\n\n<p>$$ \\left( J^T(\\underline{x}) J(\\underline{x}) + \\lambda \\mathbf{I} \\right) \\underline{p} = &#8211; J^T(\\underline{x}) r(\\underline{x}) \\quad \\text{with } \\lambda \\geq 0.$$<\/p>\n\n\n\n<p>If $\\lambda = 0$ then we take a pure Gauss-Newton step. 
If $\\lambda$ is large then we take a (small) step that&#8217;s similar to steepest descent, $\\underline{p} = &#8211; \\frac{1}{\\lambda} J^T \\underline{r}$.<\/p>\n\n\n\n<p>The algorithm goes as follows:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Compute $J(\\underline{x}_k)$ and $r(\\underline{x}_k)$.<\/li>\n\n\n\n<li>Solve the equation above for $\\underline{p}$ and compute the candidate $\\underline{\\tilde{x}}_{k+1}= \\underline{x}_k + \\underline{p}$.<\/li>\n\n\n\n<li>Compute the actual and expected decrease in the function value and their ratio<br>$$ \\rho_k = \\frac{f(\\underline{x}_k)\\ -\\ f(\\underline{\\tilde{x}}_{k+1})}{\\frac{1}{2} \\Vert r(\\underline{x}_k) \\Vert^2\\ -\\ \\frac{1}{2} \\Vert r(\\underline{x}_k) + J(\\underline{x}_k) \\underline{p} \\Vert^2 }. $$<\/li>\n\n\n\n<li>If $\\rho_k$ is close to 1 then accept the candidate. So $\\underline{x}_{k+1} = \\underline{\\tilde{x}}_{k+1}$ and $\\lambda_{k+1} = \\beta_1 \\lambda_k$ with $\\beta_1 \\in (0, 1)$. Otherwise reject the candidate. So $\\underline{x}_{k+1} = \\underline{x}_k$ and $\\lambda_{k+1} = \\beta_2 \\lambda_k$ with $\\beta_2 > 1$.<\/li>\n<\/ul>\n\n\n\n<p>This means that far from the solution Levenberg-Marquardt (LM) is similar to gradient descent and omitting the second term of the Hessian is not a problem. Close to the solution LM is similar to Gauss-Newton so it converges relatively quickly (super-linearly). 
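The loop above can be sketched as follows (assumed names; the gain-ratio threshold of $1/4$ and the factors $\beta_1 = 1/2$, $\beta_2 = 2$ are illustrative choices):

```python
import numpy as np

def levenberg_marquardt(residual, jac, x0, lam=1e-3, tol=1e-10, max_iter=200):
    """Damped Gauss-Newton (Levenberg-Marquardt) sketch."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        r, J = residual(x), jac(x)
        g = J.T @ r
        if np.linalg.norm(g) < tol:
            break
        p = np.linalg.solve(J.T @ J + lam * np.eye(x.size), -g)
        # predicted vs actual decrease of 0.5 * ||r||^2
        pred = 0.5 * (r @ r) - 0.5 * np.linalg.norm(r + J @ p) ** 2
        r_new = residual(x + p)
        actual = 0.5 * (r @ r) - 0.5 * (r_new @ r_new)
        if pred > 0 and actual / pred > 0.25:  # accept, trust the model more
            x, lam = x + p, lam * 0.5
        else:                                  # reject, damp more strongly
            lam *= 2.0
    return x

# Example: the same hypothetical exponential fit (data generated with a=2, b=0.5).
t = np.linspace(0.0, 1.0, 20)
y = 2.0 * np.exp(0.5 * t)
residual = lambda x: x[0] * np.exp(x[1] * t) - y
jac = lambda x: np.column_stack([np.exp(x[1] * t),
                                 x[0] * t * np.exp(x[1] * t)])
x_fit = levenberg_marquardt(residual, jac, [1.0, 0.0])
```

Accepted steps shrink $\lambda$ toward Gauss-Newton behaviour, rejected steps grow it toward small gradient-descent steps, which is the damping behaviour described above.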
As long as $\\lambda > 0$, $J^T J + \\lambda \\mathbf{I}$ is positive-definite, so the problem can be solved.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Trust-Region View<\/h4>\n\n\n\n<p>LM can also be written as<\/p>\n\n\n\n<p>$$\\min_{\\underline{p}} \\Vert J \\underline{p} + \\underline{r} \\Vert^2 \\quad \\text{subject to } \\Vert \\underline{p} \\Vert \\leq \\Delta.$$<\/p>\n\n\n\n<p>However, for the time being I&#8217;ll skip this view of LM (because I don&#8217;t feel like dealing with Lagrange multipliers right now, as I&#8217;m writing this ;).<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<p>Footnotes:<\/p>\n\n\n<ol class=\"wp-block-footnotes\"><li id=\"37d596ca-ec9e-4cbc-8b86-3e1e7cc4ca55\">It can happen that we first move towards a maximum, overshoot, and then converge to a minimum but in general if $f&#8221;(x) &lt; 0$ then we won&#8217;t converge to a minimum. <a href=\"#37d596ca-ec9e-4cbc-8b86-3e1e7cc4ca55-link\" aria-label=\"Jump to footnote reference 1\">\u21a9\ufe0e<\/a><\/li><\/ol>","protected":false},"excerpt":{"rendered":"<p>This is a brief summary of some of the optimization methods used in Computer Vision topics like Camera Calibration and 3D-Reconstructions. We&#8217;re going to start with a relatively simple method and work our way to the more complicated but also a lot more powerful method that&#8217;s actually used in computer vision topics. 
Newton&#8217;s Method The [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":2427,"comment_status":"open","ping_status":"open","sticky":false,"template":"wp-custom-template-single-with-sidebar-1","format":"standard","meta":{"_eb_attr":"","footnotes":"[{\"content\":\"It can happen that we first move towards a maximum, overshoot, and then converge to a minimum but in general if $f''(x) &lt; 0$ then we won't converge to a minimum.\",\"id\":\"37d596ca-ec9e-4cbc-8b86-3e1e7cc4ca55\"}]"},"categories":[38,39],"tags":[40,43],"class_list":["post-2278","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-computer-vision","category-theory","tag-computer-vision","tag-summary"],"_links":{"self":[{"href":"https:\/\/mpr-projects.com\/index.php\/wp-json\/wp\/v2\/posts\/2278","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mpr-projects.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mpr-projects.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mpr-projects.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mpr-projects.com\/index.php\/wp-json\/wp\/v2\/comments?post=2278"}],"version-history":[{"count":147,"href":"https:\/\/mpr-projects.com\/index.php\/wp-json\/wp\/v2\/posts\/2278\/revisions"}],"predecessor-version":[{"id":2426,"href":"https:\/\/mpr-projects.com\/index.php\/wp-json\/wp\/v2\/posts\/2278\/revisions\/2426"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/mpr-projects.com\/index.php\/wp-json\/wp\/v2\/media\/2427"}],"wp:attachment":[{"href":"https:\/\/mpr-projects.com\/index.php\/wp-json\/wp\/v2\/media?parent=2278"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mpr-projects.com\/index.php\/wp-json\/wp\/v2\/categories?post=2278"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mpr-projects.com\/index.php\/wp-json\/wp\/v2\/tags?post=2278"}],"curies"
:[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}