how this works.

asm ▸ derivation, solver, integrator, precision

no public triple pendulum implementation exists in pure assembly anywhere on the open web. searched github, sourceforge, gitlab, codeberg, rosetta code, the demoscene archives. zero hits across language:Assembly. even double pendulum doesn’t exist in asm. so i wrote one. it ended up beating scipy’s 8th-order DOP853 in energy conservation, because the x87 fpu’s 80-bit registers don’t exist in c, python, rust, go, or js.

▸energy drift across implementations

same physics, same initial conditions, same dt = 1e-4 for the rk4 columns. dE = E(t) − E(0), smaller is better.

step	scipy dop853	python rk4 64-bit	asm rk4 80-bit
160	+3.1e-17	+7.2e-14	−7.2e-18
1 120	+6.4e-14	+2.3e-13	+7.4e-17
10 080	−4.4e-14	−3.1e-13	+1.1e-15
100 000	−1.7e-12	+6.2e-12	+1.4e-13
200 000	−5.8e-12	+1.2e-11	−5.1e-13

▸live from the real asm

the browser visual on the home page is a faithful js twin. the actual nasm binary runs in a terminal and prints its own energy diagnostic. captured from ./triple_pendulum on linux x86-64:

$ ./triple_pendulum
                                       +
                                       o
                                       o
                                       O
triple_pendulum.asm | e=  -0.0000  de=-7.225e-18  step=160
triple_pendulum.asm | e=  -0.0000  de=-4.490e-17  step=4960
triple_pendulum.asm | e=   0.0000  de=+1.620e-15  step=9760
triple_pendulum.asm | e=   0.0000  de=+1.124e-14  step=14560
triple_pendulum.asm | e=   0.0000  de=+4.320e-15  step=19360
triple_pendulum.asm | e=   0.0000  de=+3.244e-15  step=24160
triple_pendulum.asm | e=   0.0000  de=+3.834e-15  step=28960
triple_pendulum.asm | e=   0.0000  de=+4.045e-15  step=33760
(captured from nasm + gcc + wsl ubuntu 22.04)

the rest of this page walks through the math, the solver, and the precision tricks behind triple_pendulum.asm. nothing here is novel. the lagrangian derivation is in every classical-mechanics textbook. but assembling it in one place at the level needed to write it in nasm is its own thing.

contents

the absence
the system
lagrangian, equations of motion
the mass matrix M(θ)
the right-hand side b(θ, ω)
solving M·α = b by cramer’s rule
runge-kutta 4
x87 and the 80-bit angle
numbers

01 ▸the absence

the project began with a tedious search. github filtered to language:Assembly: zero hits for "triple pendulum", zero for "double pendulum", zero for "n-pendulum", zero for "multi pendulum". sourceforge, gitlab, codeberg, gitea: nothing. rosetta code has an animate a pendulum task for single pendulums; no x86 assembly entry. pouët and demozoo (the demoscene archives, where physics-in-asm should live if it lives anywhere): no triple-pendulum production. the closest thing across the entire open web was an 8085 stepper-motor demo of a single pendulum.

so: not done. the question was whether that meant too hard or nobody bothered. given how much asm gets written for pure aesthetic reasons (4k intros, oldschool demos), the answer was almost certainly the second.

02 ▸the system

three point masses on rigid massless rods, hanging from a fixed pivot:

masses	m₁ = m₂ = m₃ = 1 kg
rod lengths	L₁ = 4 m, L₂ = 3.5 m, L₃ = 3 m (asymmetric on purpose)
gravity	g = 9.81 m/s²
generalized coords	θ₁, θ₂, θ₃ (angle from downward vertical)
initial state	θ = (π/2, π/2, π/2), ω = (0, 0, 0). all three rods horizontal, at rest

cartesian position of bob i:

x_i = Σ_k≤i L_k sin θ_k
y_i = −Σ_k≤i L_k cos θ_k

03 ▸lagrangian → equations of motion

kinetic energy: T = ½ Σ m_i (ẋ_i² + ẏ_i²). substitute the cartesian expressions, expand, and collect by ω_j ω_k. you get

T = ½ Σ_{j, k} M_jk(θ) ω_j ω_k

with M symmetric and state-dependent (next section). potential energy is just the sum of −m_i g y_i:

V = −Σ_k β_k L_k g cos θ_k β_k = Σ_{i ≥ k} m_i

the euler-lagrange equations d/dt(∂L/∂ω_j) − ∂L/∂θ_j = 0 rearrange to the linear system

Σ_k M_jk α_k = b_j α_k = ω̇_k

which is what the asm builds and solves every rk4 sub-stage.

04 ▸the mass matrix M(θ)

collecting the kinetic-energy expansion gives:

M_jk(θ) = α_jk · cos(θ_j − θ_k) α_jk = (Σ m_i, i ≥ max(j, k)) · L_j · L_k

for the m = (1, 1, 1), L = (4, 3.5, 3) values the asm uses, the six α_jk are constants:

α11 = 48.0    α12 = 28.0    α13 = 12.0
α22 = 24.5    α23 = 10.5    α33 =  9.0

the diagonal of M is constant; only the three off-diagonals (M₁₂, M₁₃, M₂₃) depend on θ. each is one fpu multiply (a cosine that's already been computed) by one constant α:

fld     qword [a12]
fld     tword [sd_arr + 10]   ; cos(θ1 − θ2)
fmulp
fstp    tword [M12_v]          ; M12

05 ▸the right-hand side b(θ, ω)

the same lagrangian derivation produces this form for b:

b_j = −Σ_k≠j α_jk sin(θ_j − θ_k) ω_k² − g β_j L_j sin θ_j

centrifugal-like terms plus gravity. three sin values per angle difference, three sin values per angle, three precomputed g·β·L constants. all the trig is already done in phase 1 (six fsincos calls), so b is just multiplies and adds:

; b1 = −( α12·s12·ω2² + α13·s13·ω3² + g·β1·L1·sin θ1 )
fld     qword [a12]
fld     tword [sd_arr]          ; s12
fmulp
fld     tword [rbx + 40]       ; ω2
fmul    st0, st0                  ; ω2²
fmulp
fld     qword [a13]
fld     tword [sd_arr + 20]
fmulp
fld     tword [rbx + 50]
fmul    st0, st0
fmulp
faddp
fld     qword [gb1L1]
fld     tword [s_arr]
fmulp
faddp
fchs
fstp    tword [b1_v]

06 ▸solving M·α = b by cramer’s rule

for a 3×3, cramer’s rule is genuinely the simplest correct solver. no pivoting decisions, no iteration, just one division per unknown. the trick is to share the six 2×2 sub-determinants between the main det and the three numerators:

A = M22·M33 − M23²        D = b2·M33 − M23·b3
B = M12·M33 − M23·M13     E = b2·M23 − M22·b3
C = M12·M23 − M22·M13     F = M12·b3 − b2·M13

det  =  M11·A − M12·B + M13·C
α1   = ( b1·A − M12·D + M13·E) / det
α2   = (M11·D −  b1·B + M13·F) / det
α3   = (−M11·E − M12·F + b1·C) / det

about 35 multiplies, 12 adds, 3 divides for the whole solve. the asm builds A through F into 80-bit scratch slots, then composes det and the three α_i from those.

07 ▸runge-kutta 4

classical fourth-order rk4 on the 6-vector y = (θ₁, θ₂, θ₃, ω₁, ω₂, ω₃):

k1 = f(y)
k2 = f(y + (dt/2) k1)
k3 = f(y + (dt/2) k2)
k4 = f(y + dt k3)
y  ← y + (dt/6) · (k1 + 2 k2 + 2 k3 + k4)

per-step local truncation error is O(dt⁵). at dt = 1e−4 that’s about 1e−20, which is well below double-precision roundoff. the integrator stops being the dominant error source long before this dt; from here on it’s arithmetic roundoff that matters, which is why the next section is the whole point.

we chose rk4 over a symplectic integrator (leapfrog, verlet) because the triple pendulum is a non-separable hamiltonian: the mass matrix depends on θ, so symplectic-euler / velocity-verlet don’t directly apply without implicit equations. rk4 at small dt with extended precision is the simplest correct option.

08 ▸x87 and the 80-bit angle

the x87 floating-point unit, introduced as the 8087 coprocessor in 1980, has an 8-deep stack of 80-bit extended-precision registers. every arithmetic operation (fadd, fmul, etc.) runs at 80 bits internally regardless of how the operands were loaded. the catch: as soon as you fstp back to a 64-bit double in memory, you round.

nasm gives you a tword directive (10 bytes) that stores the full 80-bit register state. so if you do this:

fld     tword [a]      ; load 80-bit
fld     tword [b]
fmulp                  ; multiply in registers at 80-bit
fstp    tword [c]      ; store 80-bit

you never round. no other widely-used environment can do this:

c (gcc/clang)	`long double` is 80-bit, but stores still target 64-bit in any sse-default build; getting 80-bit storage requires `-mfpmath=387` and breaks abi
python	`float` is 64-bit; `mpmath` goes higher but is software arbitrary-precision, ~100× slower
rust, go, js	no 80-bit type at all. `f64` is the ceiling
scipy	internally 64-bit double; `DOP853` accumulates roundoff per arithmetic op like any 64-bit code

there is one annoying x87 quirk worth flagging: fld / fstp accept m80fp, but the binary ops fadd, fsub, fmul, fdiv only accept m32fp or m64fp memory operands. so every 80-bit operand has to come into a register via fld tword first, then combine register-to-register with the popping variants (faddp, fsubp, fmulp, fdivp). this is the only reason the asm uses pairs of instructions where python uses one.

09 ▸numbers

at short times the asm is four orders of magnitude better than python rk4 (because 80-bit roundoff is tiny). at long times it beats scipy’s 8th-order adaptive integrator by ten times, despite being 4th-order, because scipy still does its arithmetic at 64-bit and accumulates roundoff per internal step. (the full energy table is at the top of this page.)

scipy is more accurate in state per unit of cpu work. that’s what it’s designed for. energy conservation is a different metric, and at long times it’s arithmetic precision that determines it, which is where 80-bit wins.