ijtihed.
← back to sim

how this works.

asm ▸ derivation, solver, integrator, precision

no public triple pendulum implementation exists in pure assembly anywhere on the open web. searched github, sourceforge, gitlab, codeberg, rosetta code, the demoscene archives. zero hits across language:Assembly. even double pendulum doesn’t exist in asm. so i wrote one. it ended up beating scipy’s 8th-order DOP853 in energy conservation, because the x87 fpu’s 80-bit registers don’t exist in c, python, rust, go, or js.

energy drift across implementations

same physics, same initial conditions, same dt = 1e-4 for the rk4 columns. dE = E(t) − E(0), smaller is better.

stepscipy dop853python rk4 64-bitasm rk4 80-bit
160+3.1e-17+7.2e-14−7.2e-18
1 120+6.4e-14+2.3e-13+7.4e-17
10 080−4.4e-14−3.1e-13+1.1e-15
100 000−1.7e-12+6.2e-12+1.4e-13
200 000−5.8e-12+1.2e-11−5.1e-13

live from the real asm

the browser visual on the home page is a faithful js twin. the actual nasm binary runs in a terminal and prints its own energy diagnostic. captured from ./triple_pendulum on linux x86-64:

$ ./triple_pendulum
                                       +
                                       o
                                       o
                                       O
triple_pendulum.asm | e=  -0.0000  de=-7.225e-18  step=160
triple_pendulum.asm | e=  -0.0000  de=-4.490e-17  step=4960
triple_pendulum.asm | e=   0.0000  de=+1.620e-15  step=9760
triple_pendulum.asm | e=   0.0000  de=+1.124e-14  step=14560
triple_pendulum.asm | e=   0.0000  de=+4.320e-15  step=19360
triple_pendulum.asm | e=   0.0000  de=+3.244e-15  step=24160
triple_pendulum.asm | e=   0.0000  de=+3.834e-15  step=28960
triple_pendulum.asm | e=   0.0000  de=+4.045e-15  step=33760
(captured from nasm + gcc + wsl ubuntu 22.04)

the rest of this page walks through the math, the solver, and the precision tricks behind triple_pendulum.asm. nothing here is novel. the lagrangian derivation is in every classical-mechanics textbook. but assembling it in one place at the level needed to write it in nasm is its own thing.

contents
  1. the absence
  2. the system
  3. lagrangian, equations of motion
  4. the mass matrix M(θ)
  5. the right-hand side b(θ, ω)
  6. solving M·α = b by cramer’s rule
  7. runge-kutta 4
  8. x87 and the 80-bit angle
  9. numbers

01 ▸the absence

the project began with a tedious search. github filtered to language:Assembly: zero hits for "triple pendulum", zero for "double pendulum", zero for "n-pendulum", zero for "multi pendulum". sourceforge, gitlab, codeberg, gitea: nothing. rosetta code has an animate a pendulum task for single pendulums; no x86 assembly entry. pouët and demozoo (the demoscene archives, where physics-in-asm should live if it lives anywhere): no triple-pendulum production. the closest thing across the entire open web was an 8085 stepper-motor demo of a single pendulum.

so: not done. the question was whether that meant too hard or nobody bothered. given how much asm gets written for pure aesthetic reasons (4k intros, oldschool demos), the answer was almost certainly the second.

02 ▸the system

three point masses on rigid massless rods, hanging from a fixed pivot:

massesm₁ = m₂ = m₃ = 1 kg
rod lengthsL₁ = 4 m, L₂ = 3.5 m, L₃ = 3 m (asymmetric on purpose)
gravityg = 9.81 m/s²
generalized coordsθ₁, θ₂, θ₃ (angle from downward vertical)
initial stateθ = (π/2, π/2, π/2), ω = (0, 0, 0). all three rods horizontal, at rest

cartesian position of bob i:

xi  =  Σk≤i Lk sin θk
yi  =  −Σk≤i Lk cos θk

03 ▸lagrangian → equations of motion

kinetic energy: T = ½ Σ mi (ẋi² + ẏi²). substitute the cartesian expressions, expand, and collect by ωj ωk. you get

T  =  ½ Σj, k Mjk(θ) ωj ωk

with M symmetric and state-dependent (next section). potential energy is just the sum of −mi g yi:

V  =  −Σk βk Lk g cos θk βk = Σi ≥ k mi

the euler-lagrange equations d/dt(∂L/∂ωj) − ∂L/∂θj = 0 rearrange to the linear system

Σk Mjk αk  =  bj αk = ω̇k

which is what the asm builds and solves every rk4 sub-stage.

04 ▸the mass matrix M(θ)

collecting the kinetic-energy expansion gives:

Mjk(θ)  =  αjk · cos(θj − θk) αjk = (Σ mi, i ≥ max(j, k)) · Lj · Lk

for the m = (1, 1, 1), L = (4, 3.5, 3) values the asm uses, the six αjk are constants:

α11 = 48.0    α12 = 28.0    α13 = 12.0
α22 = 24.5    α23 = 10.5    α33 =  9.0

the diagonal of M is constant; only the three off-diagonals (M₁₂, M₁₃, M₂₃) depend on θ. each is one fpu multiply (a cosine that's already been computed) by one constant α:

fld     qword [a12]
fld     tword [sd_arr + 10]   ; cos(θ1 − θ2)
fmulp
fstp    tword [M12_v]          ; M12

05 ▸the right-hand side b(θ, ω)

the same lagrangian derivation produces this form for b:

bj  =  −Σk≠j αjk sin(θj − θk) ωk²  −  g βj Lj sin θj

centrifugal-like terms plus gravity. three sin values per angle difference, three sin values per angle, three precomputed g·β·L constants. all the trig is already done in phase 1 (six fsincos calls), so b is just multiplies and adds:

; b1 = −( α12·s12·ω2² + α13·s13·ω3² + g·β1·L1·sin θ1 )
fld     qword [a12]
fld     tword [sd_arr]          ; s12
fmulp
fld     tword [rbx + 40]       ; ω2
fmul    st0, st0                  ; ω2²
fmulp
fld     qword [a13]
fld     tword [sd_arr + 20]
fmulp
fld     tword [rbx + 50]
fmul    st0, st0
fmulp
faddp
fld     qword [gb1L1]
fld     tword [s_arr]
fmulp
faddp
fchs
fstp    tword [b1_v]

06 ▸solving M·α = b by cramer’s rule

for a 3×3, cramer’s rule is genuinely the simplest correct solver. no pivoting decisions, no iteration, just one division per unknown. the trick is to share the six 2×2 sub-determinants between the main det and the three numerators:

A = M22·M33 − M23²        D = b2·M33 − M23·b3
B = M12·M33 − M23·M13     E = b2·M23 − M22·b3
C = M12·M23 − M22·M13     F = M12·b3 − b2·M13

det  =  M11·A − M12·B + M13·C
α1   = ( b1·A − M12·D + M13·E) / det
α2   = (M11·D −  b1·B + M13·F) / det
α3   = (−M11·E − M12·F + b1·C) / det

about 35 multiplies, 12 adds, 3 divides for the whole solve. the asm builds A through F into 80-bit scratch slots, then composes det and the three αi from those.

07 ▸runge-kutta 4

classical fourth-order rk4 on the 6-vector y = (θ₁, θ₂, θ₃, ω₁, ω₂, ω₃):

k1 = f(y)
k2 = f(y + (dt/2) k1)
k3 = f(y + (dt/2) k2)
k4 = f(y + dt k3)
y  ← y + (dt/6) · (k1 + 2 k2 + 2 k3 + k4)

per-step local truncation error is O(dt⁵). at dt = 1e−4 that’s about 1e−20, which is well below double-precision roundoff. the integrator stops being the dominant error source long before this dt; from here on it’s arithmetic roundoff that matters, which is why the next section is the whole point.

we chose rk4 over a symplectic integrator (leapfrog, verlet) because the triple pendulum is a non-separable hamiltonian: the mass matrix depends on θ, so symplectic-euler / velocity-verlet don’t directly apply without implicit equations. rk4 at small dt with extended precision is the simplest correct option.

08 ▸x87 and the 80-bit angle

the x87 floating-point unit, introduced as the 8087 coprocessor in 1980, has an 8-deep stack of 80-bit extended-precision registers. every arithmetic operation (fadd, fmul, etc.) runs at 80 bits internally regardless of how the operands were loaded. the catch: as soon as you fstp back to a 64-bit double in memory, you round.

nasm gives you a tword directive (10 bytes) that stores the full 80-bit register state. so if you do this:

fld     tword [a]      ; load 80-bit
fld     tword [b]
fmulp                  ; multiply in registers at 80-bit
fstp    tword [c]      ; store 80-bit

you never round. no other widely-used environment can do this:

c (gcc/clang)long double is 80-bit, but stores still target 64-bit in any sse-default build; getting 80-bit storage requires -mfpmath=387 and breaks abi
pythonfloat is 64-bit; mpmath goes higher but is software arbitrary-precision, ~100× slower
rust, go, jsno 80-bit type at all. f64 is the ceiling
scipyinternally 64-bit double; DOP853 accumulates roundoff per arithmetic op like any 64-bit code

there is one annoying x87 quirk worth flagging: fld / fstp accept m80fp, but the binary ops fadd, fsub, fmul, fdiv only accept m32fp or m64fp memory operands. so every 80-bit operand has to come into a register via fld tword first, then combine register-to-register with the popping variants (faddp, fsubp, fmulp, fdivp). this is the only reason the asm uses pairs of instructions where python uses one.

09 ▸numbers

at short times the asm is four orders of magnitude better than python rk4 (because 80-bit roundoff is tiny). at long times it beats scipy’s 8th-order adaptive integrator by ten times, despite being 4th-order, because scipy still does its arithmetic at 64-bit and accumulates roundoff per internal step. (the full energy table is at the top of this page.)

scipy is more accurate in state per unit of cpu work. that’s what it’s designed for. energy conservation is a different metric, and at long times it’s arithmetic precision that determines it, which is where 80-bit wins.