how this works.
asm ▸ derivation, solver, integrator, precision
no public triple pendulum implementation exists in pure assembly anywhere on the open web. searched github, sourceforge, gitlab, codeberg, rosetta code, the demoscene archives. zero hits across language:Assembly. even double pendulum doesn’t exist in asm. so i wrote one. it ended up beating scipy’s 8th-order DOP853 in energy conservation, because the x87 fpu’s 80-bit registers don’t exist in c, python, rust, go, or js.
▸energy drift across implementations
same physics, same initial conditions, same dt = 1e-4 for the rk4 columns. dE = E(t) − E(0), smaller is better.
| step | scipy dop853 | python rk4 64-bit | asm rk4 80-bit |
|---|---|---|---|
| 160 | +3.1e-17 | +7.2e-14 | −7.2e-18 |
| 1 120 | +6.4e-14 | +2.3e-13 | +7.4e-17 |
| 10 080 | −4.4e-14 | −3.1e-13 | +1.1e-15 |
| 100 000 | −1.7e-12 | +6.2e-12 | +1.4e-13 |
| 200 000 | −5.8e-12 | +1.2e-11 | −5.1e-13 |
▸live from the real asm
the browser visual on the home page is a faithful js twin. the actual nasm binary runs in a terminal and prints its own energy diagnostic. captured from ./triple_pendulum on linux x86-64:
$ ./triple_pendulum + o o O triple_pendulum.asm | e= -0.0000 de=-7.225e-18 step=160 triple_pendulum.asm | e= -0.0000 de=-4.490e-17 step=4960 triple_pendulum.asm | e= 0.0000 de=+1.620e-15 step=9760 triple_pendulum.asm | e= 0.0000 de=+1.124e-14 step=14560 triple_pendulum.asm | e= 0.0000 de=+4.320e-15 step=19360 triple_pendulum.asm | e= 0.0000 de=+3.244e-15 step=24160 triple_pendulum.asm | e= 0.0000 de=+3.834e-15 step=28960 triple_pendulum.asm | e= 0.0000 de=+4.045e-15 step=33760 (captured from nasm + gcc + wsl ubuntu 22.04)
the rest of this page walks through the math, the solver, and the precision tricks behind triple_pendulum.asm. nothing here is novel. the lagrangian derivation is in every classical-mechanics textbook. but assembling it in one place at the level needed to write it in nasm is its own thing.
01 ▸the absence
the project began with a tedious search. github filtered to language:Assembly: zero hits for "triple pendulum", zero for "double pendulum", zero for "n-pendulum", zero for "multi pendulum". sourceforge, gitlab, codeberg, gitea: nothing. rosetta code has an animate a pendulum task for single pendulums; no x86 assembly entry. pouët and demozoo (the demoscene archives, where physics-in-asm should live if it lives anywhere): no triple-pendulum production. the closest thing across the entire open web was an 8085 stepper-motor demo of a single pendulum.
so: not done. the question was whether that meant too hard or nobody bothered. given how much asm gets written for pure aesthetic reasons (4k intros, oldschool demos), the answer was almost certainly the second.
02 ▸the system
three point masses on rigid massless rods, hanging from a fixed pivot:
| masses | m₁ = m₂ = m₃ = 1 kg |
| rod lengths | L₁ = 4 m, L₂ = 3.5 m, L₃ = 3 m (asymmetric on purpose) |
| gravity | g = 9.81 m/s² |
| generalized coords | θ₁, θ₂, θ₃ (angle from downward vertical) |
| initial state | θ = (π/2, π/2, π/2), ω = (0, 0, 0). all three rods horizontal, at rest |
cartesian position of bob i:
yi = −Σk≤i Lk cos θk
03 ▸lagrangian → equations of motion
kinetic energy: T = ½ Σ mi (ẋi² + ẏi²). substitute the cartesian expressions, expand, and collect by ωj ωk. you get
with M symmetric and state-dependent (next section). potential energy is just the sum of −mi g yi:
the euler-lagrange equations d/dt(∂L/∂ωj) − ∂L/∂θj = 0 rearrange to the linear system
which is what the asm builds and solves every rk4 sub-stage.
04 ▸the mass matrix M(θ)
collecting the kinetic-energy expansion gives:
for the m = (1, 1, 1), L = (4, 3.5, 3) values the asm uses, the six αjk are constants:
α11 = 48.0 α12 = 28.0 α13 = 12.0 α22 = 24.5 α23 = 10.5 α33 = 9.0
the diagonal of M is constant; only the three off-diagonals (M₁₂, M₁₃, M₂₃) depend on θ. each is one fpu multiply (a cosine that's already been computed) by one constant α:
fld qword [a12] fld tword [sd_arr + 10] ; cos(θ1 − θ2) fmulp fstp tword [M12_v] ; M12
05 ▸the right-hand side b(θ, ω)
the same lagrangian derivation produces this form for b:
centrifugal-like terms plus gravity. three sin values per angle difference, three sin values per angle, three precomputed g·β·L constants. all the trig is already done in phase 1 (six fsincos calls), so b is just multiplies and adds:
; b1 = −( α12·s12·ω2² + α13·s13·ω3² + g·β1·L1·sin θ1 ) fld qword [a12] fld tword [sd_arr] ; s12 fmulp fld tword [rbx + 40] ; ω2 fmul st0, st0 ; ω2² fmulp fld qword [a13] fld tword [sd_arr + 20] fmulp fld tword [rbx + 50] fmul st0, st0 fmulp faddp fld qword [gb1L1] fld tword [s_arr] fmulp faddp fchs fstp tword [b1_v]
06 ▸solving M·α = b by cramer’s rule
for a 3×3, cramer’s rule is genuinely the simplest correct solver. no pivoting decisions, no iteration, just one division per unknown. the trick is to share the six 2×2 sub-determinants between the main det and the three numerators:
A = M22·M33 − M23² D = b2·M33 − M23·b3 B = M12·M33 − M23·M13 E = b2·M23 − M22·b3 C = M12·M23 − M22·M13 F = M12·b3 − b2·M13 det = M11·A − M12·B + M13·C α1 = ( b1·A − M12·D + M13·E) / det α2 = (M11·D − b1·B + M13·F) / det α3 = (−M11·E − M12·F + b1·C) / det
about 35 multiplies, 12 adds, 3 divides for the whole solve. the asm builds A through F into 80-bit scratch slots, then composes det and the three αi from those.
07 ▸runge-kutta 4
classical fourth-order rk4 on the 6-vector y = (θ₁, θ₂, θ₃, ω₁, ω₂, ω₃):
k1 = f(y) k2 = f(y + (dt/2) k1) k3 = f(y + (dt/2) k2) k4 = f(y + dt k3) y ← y + (dt/6) · (k1 + 2 k2 + 2 k3 + k4)
per-step local truncation error is O(dt⁵). at dt = 1e−4 that’s about 1e−20, which is well below double-precision roundoff. the integrator stops being the dominant error source long before this dt; from here on it’s arithmetic roundoff that matters, which is why the next section is the whole point.
we chose rk4 over a symplectic integrator (leapfrog, verlet) because the triple pendulum is a non-separable hamiltonian: the mass matrix depends on θ, so symplectic-euler / velocity-verlet don’t directly apply without implicit equations. rk4 at small dt with extended precision is the simplest correct option.
08 ▸x87 and the 80-bit angle
the x87 floating-point unit, introduced as the 8087 coprocessor in 1980, has an 8-deep stack of 80-bit extended-precision registers. every arithmetic operation (fadd, fmul, etc.) runs at 80 bits internally regardless of how the operands were loaded. the catch: as soon as you fstp back to a 64-bit double in memory, you round.
nasm gives you a tword directive (10 bytes) that stores the full 80-bit register state. so if you do this:
fld tword [a] ; load 80-bit fld tword [b] fmulp ; multiply in registers at 80-bit fstp tword [c] ; store 80-bit
you never round. no other widely-used environment can do this:
| c (gcc/clang) | long double is 80-bit, but stores still target 64-bit in any sse-default build; getting 80-bit storage requires -mfpmath=387 and breaks abi |
| python | float is 64-bit; mpmath goes higher but is software arbitrary-precision, ~100× slower |
| rust, go, js | no 80-bit type at all. f64 is the ceiling |
| scipy | internally 64-bit double; DOP853 accumulates roundoff per arithmetic op like any 64-bit code |
there is one annoying x87 quirk worth flagging: fld / fstp accept m80fp, but the binary ops fadd, fsub, fmul, fdiv only accept m32fp or m64fp memory operands. so every 80-bit operand has to come into a register via fld tword first, then combine register-to-register with the popping variants (faddp, fsubp, fmulp, fdivp). this is the only reason the asm uses pairs of instructions where python uses one.
09 ▸numbers
at short times the asm is four orders of magnitude better than python rk4 (because 80-bit roundoff is tiny). at long times it beats scipy’s 8th-order adaptive integrator by ten times, despite being 4th-order, because scipy still does its arithmetic at 64-bit and accumulates roundoff per internal step. (the full energy table is at the top of this page.)
scipy is more accurate in state per unit of cpu work. that’s what it’s designed for. energy conservation is a different metric, and at long times it’s arithmetic precision that determines it, which is where 80-bit wins.