cs161 2004 Lecture 16: Benchmarking

Benchmarks vs. microbenchmarks
  microbenchmarks try to measure the speed of a single operation
  benchmarks (generally) try to measure the speed of a "workload"
  lmbench focuses mostly on the OS rather than the architecture;
    although it is a microbenchmark suite, it tries to predict application performance

Difficulties in getting accurate answers:
  cache size variations
  compiler optimization (-O): use the data so the work isn't optimized away
  clock resolution
  variation across runs (also a caching effect?)

Bandwidth
  memory bandwidth: why does it matter? (sets an upper limit for cached file/web servers)
    10s of MB/s means that simple operations on copies are "free"
    (unless special hardware is used)
  ipc bandwidth
    penalty for copying (twice) through the kernel (copies here are smaller, cached?)
    no penalty would mean the kernel uses special ops (or very nice hand-coding)
    the penalty difference between pipe and TCP is interesting
  cached i/o
    models a cached file/web server
    compares mmap to read() of a file
    (mmap is usually faster, but not very close to raw memory read speed)
    page table handling overhead (mapping, and faults)
  (a bandwidth measurement sketch appears at the end of these notes)

Multiprocessor tests?
  cache ping-pong (more of an architecture test)
  atomic instructions (semaphores)

Latency
  memory latency
    chip cycle: 50 ns (also counts precharge)
    pin-to-pin: 200 ns (CPU pins, across the bus, and back)
    load-in-vacuum (might not count stalls)
    back-to-back (consecutive load/use, as in list walking)
  cache line tradeoff: a big line means
    more time to wait on random reads
    a bigger chance of false sharing
    more waste if the extra bytes go unused
  memory read latency technique (sketch at end)
    vary array size and stride:
      for (1M iterations) { p = *p }
    the array size at which latency jumps tells you total cache size
    the stride at which latency jumps tells you cache line size

OS entry
  syscalls: measure a one-word write to /dev/null (avoids optimizations; sketch at end)
    it would be nice to see function-call numbers for comparison
    Linux does well (factor of 2-4 vs. Solaris, Alpha)
  signal handling
    not super interesting, except that it may have found a bug
  process creation (sketch at end)
    simple fork vs. sh -c (for realism)
    a handful of milliseconds (goodbye, hundred connections per second)
    dynamic loading significantly slows down process creation
  context switching (sketch at end)
    pass a token around a ring of processes
    token? (a byte written to a pipe)
    pipe overhead is measured separately and removed from the results
    each process reads an array at the same VA each time it gets the token
      is this cost removed too?
    deliberately stresses the cache (per pid, same VA but different cache lines?)
  ipc latency
    similar to the 2-process context-switch benchmark, but reports total latency
    TCP & RPC/TCP vs. UDP & RPC/UDP is a study in pointless overhead
    good discussion of handshake overhead
  fs latency (sketch at end)
    create many files with small names, then delete them
    Linux wins again! (it cheated)
  disk latency
    measures SCSI command overhead: 1 ms or so
    why such a limited test? (they don't want to measure the disks themselves)
    skips the buffer cache by using raw devices

Today's machines
  2 GHz = 0.5 ns per cycle
  P4:        32 KB L1
  Pentium M: 64 KB L1 + 1 MB L2
  G4:        L2 + 2 MB L3
  G5:        512 KB L2
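
Sketches of the techniques above (not lmbench's actual code)

  Memory-copy bandwidth.  A minimal sketch for the bandwidth discussion:
  memcpy() a buffer larger than the caches and report MB/s.  Buffer size,
  iteration count, and the use of gettimeofday() are illustrative choices,
  not lmbench's parameters.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/time.h>

    int main(void)
    {
        size_t size = 8 * 1024 * 1024;         /* larger than the caches */
        long i, iters = 100;
        char *src = malloc(size), *dst = malloc(size);
        struct timeval t0, t1;

        memset(src, 1, size);                  /* touch pages so faults aren't timed */
        memset(dst, 1, size);

        gettimeofday(&t0, NULL);
        for (i = 0; i < iters; i++)
            memcpy(dst, src, size);
        gettimeofday(&t1, NULL);

        double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        printf("%.1f MB/s\n", (double)size * iters / s / 1e6);
        free(src); free(dst);
        return 0;
    }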
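
  Memory read latency.  A sketch of the vary-size-and-stride technique:
  build a cyclic pointer chain and time the dependent loads (p = *p), so
  each load must complete before the next can issue.  Size and stride here
  are illustrative; sweep both to see cache sizes and line sizes.  (Caveat:
  modern prefetchers can recognize a constant stride and hide the latency.)

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    int main(void)
    {
        size_t size = 8 * 1024 * 1024;         /* sweep this to find cache sizes */
        size_t stride = 64;                    /* sweep this to find the line size */
        size_t n = size / sizeof(char *);
        size_t step = stride / sizeof(char *);
        char **arr = malloc(size);
        size_t i, iters = 1000000;
        struct timeval t0, t1;

        /* Build a cyclic chain: each element points `stride` bytes ahead. */
        for (i = 0; i < n; i++)
            arr[i] = (char *)&arr[(i + step) % n];

        char **p = arr;
        gettimeofday(&t0, NULL);
        for (i = 0; i < iters; i++)
            p = (char **)*p;                   /* serialized loads: p = *p */
        gettimeofday(&t1, NULL);

        double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
        /* Print p so the compiler can't delete the loop. */
        printf("%.1f ns/load (p=%p)\n", us * 1000.0 / iters, (void *)p);
        free(arr);
        return 0;
    }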
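
  Syscall latency.  A sketch of the one-word-write-to-/dev/null measurement;
  the iteration count and gettimeofday() timing are illustrative choices.

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/time.h>

    int main(void)
    {
        int fd = open("/dev/null", O_WRONLY);
        char c = 0;
        long i, iters = 1000000;
        struct timeval t0, t1;

        gettimeofday(&t0, NULL);
        for (i = 0; i < iters; i++)
            write(fd, &c, 1);     /* kernel must copy the byte, so the work
                                     can't be optimized away */
        gettimeofday(&t1, NULL);

        double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
        printf("%.2f us per write() syscall\n", us / iters);
        close(fd);
        return 0;
    }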
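
  Process creation.  A sketch contrasting a bare fork()+exit()+wait() with
  launching a command through the shell; system() runs sh -c "...", which
  approximates the "realistic" case.  Iteration count is illustrative.

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <sys/time.h>

    static double now_us(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec * 1e6 + tv.tv_usec;
    }

    int main(void)
    {
        long i, iters = 200;                   /* each iteration costs milliseconds */
        double t0;

        t0 = now_us();
        for (i = 0; i < iters; i++) {
            pid_t pid = fork();
            if (pid == 0) _exit(0);            /* child does nothing */
            waitpid(pid, NULL, 0);
        }
        printf("fork+exit:       %.0f us each\n", (now_us() - t0) / iters);

        t0 = now_us();
        for (i = 0; i < iters; i++)
            system("/bin/true");               /* runs sh -c "/bin/true" */
        printf("sh -c /bin/true: %.0f us each\n", (now_us() - t0) / iters);
        return 0;
    }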
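
  Context switching.  A simplified two-process sketch (lmbench uses a ring
  of processes and subtracts the separately measured pipe overhead): bounce
  a one-byte token through two pipes, so each round trip costs two context
  switches plus pipe costs.

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/wait.h>
    #include <sys/time.h>

    int main(void)
    {
        int p1[2], p2[2];                      /* parent -> child, child -> parent */
        char tok = 't';
        long i, rounds = 100000;
        struct timeval t0, t1;

        pipe(p1);
        pipe(p2);
        if (fork() == 0) {
            /* Child: close unused ends, then echo the token forever. */
            close(p1[1]); close(p2[0]);
            for (;;) {
                if (read(p1[0], &tok, 1) != 1) _exit(0);   /* EOF: parent done */
                write(p2[1], &tok, 1);
            }
        }
        close(p1[0]); close(p2[1]);

        gettimeofday(&t0, NULL);
        for (i = 0; i < rounds; i++) {
            write(p1[1], &tok, 1);             /* hand off the token, block ... */
            read(p2[0], &tok, 1);              /* ... until it comes back */
        }
        gettimeofday(&t1, NULL);

        double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
        printf("%.2f us per round trip (2 switches + pipe costs)\n", us / rounds);

        close(p1[1]);                          /* EOF makes the child exit */
        wait(NULL);
        return 0;
    }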
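
  fs latency.  A sketch of the create-then-delete test; the file count and
  the /tmp scratch names are illustrative.  Note that on a file system with
  asynchronous metadata updates this mostly measures in-memory work, which
  is exactly the Linux "cheat" noted above.

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/time.h>

    static double now_us(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec * 1e6 + tv.tv_usec;
    }

    int main(void)
    {
        long i, n = 1000;
        char name[64];
        double t0;

        t0 = now_us();
        for (i = 0; i < n; i++) {
            snprintf(name, sizeof name, "/tmp/lat_fs_%ld", i);
            close(open(name, O_CREAT | O_WRONLY, 0600));
        }
        printf("create: %.1f us/file\n", (now_us() - t0) / n);

        t0 = now_us();
        for (i = 0; i < n; i++) {
            snprintf(name, sizeof name, "/tmp/lat_fs_%ld", i);
            unlink(name);
        }
        printf("delete: %.1f us/file\n", (now_us() - t0) / n);
        return 0;
    }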