cs161 2004 Lecture 16: Benchmarking

Benchmarks vs. microbenchmarks
  microbenchmarks try to measure the speed of a single operation
  benchmarks (generally) try to measure the speed of a "workload"
  lmbench focuses mostly on the OS rather than the architecture;
    although it is a microbenchmark suite, it tries to predict application performance

Difficulties in getting accurate answers:
  cache size variations
  compiler optimization (-O): use the data so the work isn't optimized away
  clock resolution
  variation across runs (also a caching effect?)

Bandwidth
  memory bandwidth: why does it matter? (sets an upper limit for cached file/web servers)
    10s of MB/s means that simple operations on copies are "free"
    (unless special hardware is used)
  ipc bandwidth
    penalty for copying (twice) through the kernel (copies here are smaller, cached?)
    no penalty would mean the kernel uses special ops (or very nice hand-coding)
    the penalty difference between pipe and TCP is interesting
  cached i/o
    models a cached file/web server
    compares mmap to read() of a file
    (mmap is usually faster, but not very close to raw memory read speed)
    page table handling overhead (mapping, and faults)
  (a bandwidth measurement sketch appears at the end of these notes)

Multiprocessor tests?
  cache ping-pong (more of an architecture test)
  atomic instructions (semaphores)

Latency
  memory latency
    chip cycle: 50 ns (also counts precharge)
    pin-to-pin: 200 ns (CPU pins, across the bus, and back)
    load-in-vacuum (might not count stalls)
    back-to-back (consecutive load/use, as in list walking)
  cache line tradeoff: a big line means
    more time to wait on random reads
    a bigger chance of false sharing
    more waste if the extra bytes go unused
  memory read latency technique (sketch at end)
    vary array size and stride:
      for (1M iterations) { p = *p }
    the array size at which latency jumps tells you total cache size
    the stride at which latency jumps tells you cache line size

OS entry
  syscalls: measure a one-word write to /dev/null (avoids optimizations; sketch at end)
    it would be nice to see function-call numbers for comparison
    Linux does well (factor of 2-4 vs. Solaris, Alpha)
  signal handling
    not super interesting, except that it may have found a bug
  process creation (sketch at end)
    simple fork vs. sh -c (for realism)
    a handful of milliseconds (goodbye, hundred connections per second)
    dynamic loading significantly slows down process creation
  context switching (sketch at end)
    pass a token around a ring of processes
    token? (a byte written to a pipe)
    pipe overhead is measured separately and removed from the results
    each process reads an array at the same VA each time it gets the token
      is this cost removed too?
    deliberately stresses the cache (per pid, same VA but different cache lines?)
  ipc latency
    similar to the 2-process context-switch benchmark, but reports total latency
    TCP & RPC/TCP vs. UDP & RPC/UDP is a study in pointless overhead
    good discussion of handshake overhead
  fs latency (sketch at end)
    create many files with small names, then delete them
    Linux wins again! (it cheated)
  disk latency
    measures SCSI command overhead: 1 ms or so
    why such a limited test? (they don't want to measure the disks themselves)
    skips the buffer cache by using raw devices

Today's machines
  2 GHz = 0.5 ns per cycle
  P4:        32 KB L1
  Pentium M: 64 KB L1 + 1 MB L2
  G4:        L2 + 2 MB L3
  G5:        512 KB L2
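
Sketches of the techniques above (not lmbench's actual code)

  Memory-copy bandwidth.  A minimal sketch for the bandwidth discussion:
  memcpy() a buffer larger than the caches and report MB/s.  Buffer size,
  iteration count, and the use of gettimeofday() are illustrative choices,
  not lmbench's parameters.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/time.h>

    int main(void)
    {
        size_t size = 8 * 1024 * 1024;         /* larger than the caches */
        long i, iters = 100;
        char *src = malloc(size), *dst = malloc(size);
        struct timeval t0, t1;

        memset(src, 1, size);                  /* touch pages so faults aren't timed */
        memset(dst, 1, size);

        gettimeofday(&t0, NULL);
        for (i = 0; i < iters; i++)
            memcpy(dst, src, size);
        gettimeofday(&t1, NULL);

        double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        printf("%.1f MB/s\n", (double)size * iters / s / 1e6);
        free(src); free(dst);
        return 0;
    }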
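
  Memory read latency.  A sketch of the vary-size-and-stride technique:
  build a cyclic pointer chain and time the dependent loads (p = *p), so
  each load must complete before the next can issue.  Size and stride here
  are illustrative; sweep both to see cache sizes and line sizes.  (Caveat:
  modern prefetchers can recognize a constant stride and hide the latency.)

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    int main(void)
    {
        size_t size = 8 * 1024 * 1024;         /* sweep this to find cache sizes */
        size_t stride = 64;                    /* sweep this to find the line size */
        size_t n = size / sizeof(char *);
        size_t step = stride / sizeof(char *);
        char **arr = malloc(size);
        size_t i, iters = 1000000;
        struct timeval t0, t1;

        /* Build a cyclic chain: each element points `stride` bytes ahead. */
        for (i = 0; i < n; i++)
            arr[i] = (char *)&arr[(i + step) % n];

        char **p = arr;
        gettimeofday(&t0, NULL);
        for (i = 0; i < iters; i++)
            p = (char **)*p;                   /* serialized loads: p = *p */
        gettimeofday(&t1, NULL);

        double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
        /* Print p so the compiler can't delete the loop. */
        printf("%.1f ns/load (p=%p)\n", us * 1000.0 / iters, (void *)p);
        free(arr);
        return 0;
    }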
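
  Syscall latency.  A sketch of the one-word-write-to-/dev/null measurement;
  the iteration count and gettimeofday() timing are illustrative choices.

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/time.h>

    int main(void)
    {
        int fd = open("/dev/null", O_WRONLY);
        char c = 0;
        long i, iters = 1000000;
        struct timeval t0, t1;

        gettimeofday(&t0, NULL);
        for (i = 0; i < iters; i++)
            write(fd, &c, 1);     /* kernel must copy the byte, so the work
                                     can't be optimized away */
        gettimeofday(&t1, NULL);

        double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
        printf("%.2f us per write() syscall\n", us / iters);
        close(fd);
        return 0;
    }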
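
  Process creation.  A sketch contrasting a bare fork()+exit()+wait() with
  launching a command through the shell; system() runs sh -c "...", which
  approximates the "realistic" case.  Iteration count is illustrative.

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <sys/time.h>

    static double now_us(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec * 1e6 + tv.tv_usec;
    }

    int main(void)
    {
        long i, iters = 200;                   /* each iteration costs milliseconds */
        double t0;

        t0 = now_us();
        for (i = 0; i < iters; i++) {
            pid_t pid = fork();
            if (pid == 0) _exit(0);            /* child does nothing */
            waitpid(pid, NULL, 0);
        }
        printf("fork+exit:       %.0f us each\n", (now_us() - t0) / iters);

        t0 = now_us();
        for (i = 0; i < iters; i++)
            system("/bin/true");               /* runs sh -c "/bin/true" */
        printf("sh -c /bin/true: %.0f us each\n", (now_us() - t0) / iters);
        return 0;
    }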
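
  Context switching.  A simplified two-process sketch (lmbench uses a ring
  of processes and subtracts the separately measured pipe overhead): bounce
  a one-byte token through two pipes, so each round trip costs two context
  switches plus pipe costs.

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/wait.h>
    #include <sys/time.h>

    int main(void)
    {
        int p1[2], p2[2];                      /* parent -> child, child -> parent */
        char tok = 't';
        long i, rounds = 100000;
        struct timeval t0, t1;

        pipe(p1);
        pipe(p2);
        if (fork() == 0) {
            /* Child: close unused ends, then echo the token forever. */
            close(p1[1]); close(p2[0]);
            for (;;) {
                if (read(p1[0], &tok, 1) != 1) _exit(0);   /* EOF: parent done */
                write(p2[1], &tok, 1);
            }
        }
        close(p1[0]); close(p2[1]);

        gettimeofday(&t0, NULL);
        for (i = 0; i < rounds; i++) {
            write(p1[1], &tok, 1);             /* hand off the token, block ... */
            read(p2[0], &tok, 1);              /* ... until it comes back */
        }
        gettimeofday(&t1, NULL);

        double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
        printf("%.2f us per round trip (2 switches + pipe costs)\n", us / rounds);

        close(p1[1]);                          /* EOF makes the child exit */
        wait(NULL);
        return 0;
    }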
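
  fs latency.  A sketch of the create-then-delete test; the file count and
  the /tmp scratch names are illustrative.  Note that on a file system with
  asynchronous metadata updates this mostly measures in-memory work, which
  is exactly the Linux "cheat" noted above.

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/time.h>

    static double now_us(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec * 1e6 + tv.tv_usec;
    }

    int main(void)
    {
        long i, n = 1000;
        char name[64];
        double t0;

        t0 = now_us();
        for (i = 0; i < n; i++) {
            snprintf(name, sizeof name, "/tmp/lat_fs_%ld", i);
            close(open(name, O_CREAT | O_WRONLY, 0600));
        }
        printf("create: %.1f us/file\n", (now_us() - t0) / n);

        t0 = now_us();
        for (i = 0; i < n; i++) {
            snprintf(name, sizeof name, "/tmp/lat_fs_%ld", i);
            unlink(name);
        }
        printf("delete: %.1f us/file\n", (now_us() - t0) / n);
        return 0;
    }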