A trip to main memory costs hundreds of clock cycles on commodity hardware. Processors use caching to decrease the costs of memory latency by orders of magnitude. These caches re-order pending memory operations
for the sake of performance. In other words, the reads and writes of a program are not necessarily performed in the order in which they are given to the processor. When data is immutable and/or confined to the scope of one thread these optimizations are harmless.
Combining these optimizations with symmetric multi-processing and shared mutable state on the other hand can be a nightmare. A program can behave non-deterministically when memory operations on shared mutable state are re-ordered.