Linux Perf Tools




Arnaldo Carvalho de Melo
acme@redhat.com
https://fedorapeople.org/~acme/prez/perf-tools-CERN-2023

perf origins and model



  • oprofile
  • Daemonless
  • hosted in the kernel sources
  • but not in lockstep with the kernel
  • Who contributes to perf

Who contributes to perf: me



  • Computer Sciences at UFAL
  • Banco do Brasil: ATM/Internet software
  • Using Linux as a development platform
  • Conectiva: Everything Linux

Conectiva



  • Migration to the Internet
  • All sorts of industries
  • First Brazilian ISPs
  • Linux
  • Conectiva (Red Hat) Linux
  • Training
  • Revista do Linux

Linux contributions at Conectiva



  • VM: Rik van Riel and Marcelo Tosatti
  • 2.4 kernel maintenance
  • XFree86 VESA driver: Paulo Andrade
  • netfilter maintenance: Harald Welte
  • CUPS: Till Kamppeter
  • apt-rpm: Kojima and Niemeyer
  • Network protocol stack: Myself

Starting to observe



  • Optimizing networking structs
  • Reducing their size
  • pahole, as sketched below
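A minimal sketch of the pahole workflow (the struct name is illustrative; any object with DWARF or BTF debug info will do): it prints the struct layout with member offsets, sizes, holes and padding, which is exactly what guided the networking struct shrinking.

$ pahole -C sk_buff vmlinux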

Wearing a Red Hat



  • Real Time effort
  • Determinism
  • RHEL for financial services
  • Latency tracer becomes ftrace
  • perf

Why perf



  • Problem characterization
  • Support
  • Investigating new software/hardware
  • Main client: kernel developers

Problems?



  • Cache accesses to PCIe bus
  • Narrow spinlock protected areas
  • Replace raw spinlocks with sleepable ones
  • New tools to trace/profile all that

perf maintainership



  • Arnaldo Melo
  • Namhyung Kim
  • Jiri Olsa
  • Ian Rogers
  • Adrian Hunter

Some stats



more stats


git log --committer=acme@redhat.com --no-merges v6.5.. .
Changesets: 215
   1   80 Ian Rogers                      Google
   2   32 Arnaldo Carvalho de Melo        Red Hat
   3   20 Kajol Jain                      IBM
   4   15 Namhyung Kim                    Google
   5   15 Athira Rajeev                   IBM
   6    9 Anup Sharma                     GSoC
   7    8 James Clark                     ARM
   8    6 Leo Yan                         Linaro
   9    4 Ilkka Koskinen                  Ampere Computing
  10    3 Yanteng Si                      Loongson
  11    3 Jing Zhang                      Alibaba
  12    3 Adrian Hunter                   Intel
  13    2 Thomas Richter                  IBM
  14    2 Wei Li                          Huawei
  15    2 Alexandre Ghiti                 Rivosinc
  16    1 Kaige Ye
  17    1 Yang Jihong                     Huawei
  18    1 Xiu Jianfeng                    Huawei
  19    1 Masami Hiramatsu                Linaro
  20    1 Lu Hongfei                      Vivo
  21    1 John Garry                      Oracle
  22    1 Ivan Babrou                     Cloudflare
  23    1 Changbin Du                     Huawei
  24    1 Guilherme Amadio                CERN
  25    1 Akanksha J N                    IBM
  26    1 Aditya Gupta                    IBM
					

more stats


git log --committer=acme@redhat.com --no-merges v6.0..
Changesets: 1578
   1  627 Ian Rogers                      Google
   2  238 Namhyung Kim                    Google
   3  172 Arnaldo Carvalho de Melo        Red Hat
   4   77 Adrian Hunter                   Intel
   5   53 James Clark                     ARM
   6   34 Leo Yan                         Linaro
   7   34 Athira Rajeev                   IBM
   8   23 Kajol Jain                      IBM
   9   21 Thomas Richter                  IBM
  10   20 Ravi Bangoria                   AMD
  11   19 Kan Liang                       Intel
  12   14 Tiezhu Yang                     Loongson
  13   13 Carsten Haitzler                ARM
  14   12 Yang Jihong                     Huawei
  15   12 Jing Zhang                      Alibaba
  16   11 Shang XiaoJing                  Huawei
  17   10 Yanteng Si                      Loongson
  18   10 Changbin Du                     Huawei
  19   10 Anup Sharma                     GSoC
  20    9 Sandipan Das                    AMD
  21    8 German Gomez                    ARM
$
					

What is being done?



Testing it


$ export BUILD_TARBALL=http://192.168.86.10/perf/perf-6.5.0-rc5.tar.xz
$ dm
  1 almalinux:8                  : Ok  gcc 8.5.0, clang 15.0.7, flex 2.6.1
  2 almalinux:9                  : Ok  gcc 11.3.1, clang 15.0.7, flex 2.6.4
  3 alpine:3.15                  : Ok  gcc 10.3.1, Alpine clang 12.0.1, flex 2.6.4
  4 alpine:3.16                  : Ok  gcc 11.2.1, Alpine clang 13.0.1, flex 2.6.4
  5 alpine:3.17                  : Ok  gcc 12.2.1, Alpine clang 15.0.7, flex 2.6.4
  6 alpine:3.18                  : Ok  gcc 12.2.1, Alpine clang 16.0.6, flex 2.6.4
  7 alpine:edge                  : Ok  gcc 13.1.1, Alpine clang 16.0.6, flex 2.6.4
  8 amazonlinux:2                : Ok  gcc 7.3.1 20180712, clang 11.1.0, flex 2.5.37
  9 amazonlinux:2023             : Ok  gcc 11.3.1, clang 15.0.6, flex 2.6.4
 10 amazonlinux:devel            : Ok  gcc 11.3.1, clang 15.0.6, flex 2.6.4
 11 archlinux:base               : Ok  gcc 13.2.1, clang 15.0.7, flex 2.6.4
 12 centos:stream                : Ok  gcc 8.5.0, clang 16.0.6, flex 2.6.1
 13 clearlinux:latest            : Ok  gcc 13.2.1, clang 16.0.6, flex 2.6.4
 14 debian:10                    : Ok  gcc 8.3.0, clang 11.0.1-2~deb10u1, flex 2.6.4
 15 debian:11                    : Ok  gcc 10.2.1, clang 13.0.1-6~deb11u1, flex 2.6.4
 16 debian:12                    : Ok  gcc 12.2.0, clang 14.0.6, flex 2.6.4
 17 debian:experimental          : Ok  gcc 13.2.0, clang 14.0.6, flex 2.6.4
 18 debian:experimental-x-arm64  : Ok  aarch64-linux-gnu-gcc 13.2.0, flex 2.6.4
 19 debian:experimental-x-mips   : Ok  mips-linux-gnu-gcc 12.3.0, flex 2.6.4
 20 debian:experimental-x-mips64 : Ok  mips64-linux-gnuabi64-gcc 12.3.0, flex 2.6.4
 21 debian:experimental-x-mipsel : Ok  mipsel-linux-gnu-gcc 12.3.0, flex 2.6.4
 22 fedora:35                    : Ok  gcc 11.3.1, clang 13.0.1, flex 2.6.4
 23 fedora:36                    : Ok  gcc 12.2.1, clang 14.0.5, flex 2.6.4
 24 fedora:37                    : Ok  gcc 12.3.1, clang 15.0.7, flex 2.6.4
 25 fedora:38                    : Ok  gcc 13.2.1, clang 16.0.6, flex 2.6.4
 26 fedora:39                    : Ok  gcc 13.2.1, clang 16.0.6, flex 2.6.4
 27 fedora:rawhide               : Ok  gcc 13.2.1, clang 16.0.6, flex 2.6.4
					

Testing it 2


 28 gentoo:stage3                : Ok  gcc 12.3.1, clang 16.0.6, flex 2.6.4
 29 manjaro:base                 : Ok  gcc 13.2.1, clang 15.0.7, flex 2.6.4
 30 opensuse:15.4                : Ok  gcc 7.5.0, clang 15.0.7, flex 2.6.4
 31 opensuse:15.5                : Ok  gcc 7.5.0, clang 15.0.7, flex 2.6.4
 32 opensuse:tumbleweed          : Ok  gcc 13.1.1, clang 16.0.6, flex 2.6.4
 33 oraclelinux:8                : Ok  gcc 8.5.0, clang 15.0.7, flex 2.6.1
 34 oraclelinux:9                : Ok  gcc 11.3.1, clang 15.0.7, flex 2.6.4
 35 rockylinux:8                 : Ok  gcc 8.5.0, clang 15.0.7, flex 2.6.1
 36 rockylinux:9                 : Ok  gcc 11.3.1, clang 15.0.7, flex 2.6.4
 37 ubuntu:18.04                 : Ok  gcc 7.5.0, flex 2.6.4
 38 ubuntu:18.04-x-arm           : Ok  arm-linux-gnueabihf-gcc 7.5.0, flex 2.6.4
 39 ubuntu:18.04-x-arm64         : Ok  aarch64-linux-gnu-gcc 7.5.0, flex 2.6.4
 40 ubuntu:18.04-x-m68k          : Ok  m68k-linux-gnu-gcc 7.5.0, flex 2.6.4
 41 ubuntu:18.04-x-powerpc       : Ok  powerpc-linux-gnu-gcc 7.5.0, flex 2.6.4
 42 ubuntu:18.04-x-powerpc64     : Ok  powerpc64-linux-gnu-gcc 7.5.0, flex 2.6.4
 43 ubuntu:18.04-x-powerpc64el   : Ok  powerpc64le-linux-gnu-gcc 7.5.0, flex 2.6.4
 44 ubuntu:18.04-x-riscv64       : Ok  riscv64-linux-gnu-gcc 7.5.0, flex 2.6.4
 45 ubuntu:18.04-x-s390          : Ok  s390x-linux-gnu-gcc 7.5.0, flex 2.6.4
 46 ubuntu:18.04-x-sh4           : Ok  sh4-linux-gnu-gcc 7.5.0, flex 2.6.4
 47 ubuntu:18.04-x-sparc64       : Ok  sparc64-linux-gnu-gcc 7.5.0, flex 2.6.4
 48 ubuntu:20.04                 : Ok  gcc 9.4.0, flex 2.6.4
 49 ubuntu:20.04-x-powerpc64el   : Ok  powerpc64le-linux-gnu-gcc 10.5.0, flex 2.6.4
 50 ubuntu:22.04                 : Ok  gcc 11.4.0, clang 14.0.0-1ubuntu1.1, flex 2.6.4
 51 ubuntu:22.10                 : Ok  gcc 12.2.0, clang 15.0.7, flex 2.6.4
 52 ubuntu:23.04                 : Ok  gcc 12.3.0, clang 15.0.7, flex 2.6.4
 53 ubuntu:23.10                 : Ok  gcc 13.2.0, clang 16.0.6, flex 2.6.4
$
					

perf test



  • Regression testing
  • But present in production
  • Test current hardware
  • And libraries
  • Showcases features
  • Shell tests as usage examples, as sketched below
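A few illustrative invocations (the test number and substring are examples): perf test can list the available tests, run a single one by number, or match tests by name substring; -v shows what the shell tests actually run, which doubles as usage documentation.

# perf test list
# perf test 87
# perf test -v vfs_getname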

perf test


# perf test
  1: vmlinux symtab matches kallsyms                                 : Ok
  2: Detect openat syscall event                                     : Ok
  3: Detect openat syscall event on all cpus                         : Ok
  4: mmap interface tests                                            :
  4.1: Read samples using the mmap interface                         : Ok
  4.2: User space counter reading of instructions                    : Ok
  4.3: User space counter reading of cycles                          : Ok
  5: Test data source output                                         : Ok
  6: Parse event definition strings                                  :
  6.1: Test event parsing                                            : Ok
  6.2: Parsing of all PMU events from sysfs                          : Ok
  6.3: Parsing of given PMU events from sysfs                        : Ok
  6.4: Parsing of aliased events from sysfs                          : Skip (no aliases in sysfs)
  6.5: Parsing of aliased events                                     : Ok
  6.6: Parsing of terms (event modifiers)                            : Ok
  7: Simple expression parser                                        : Ok
  8: PERF_RECORD_* events & perf_sample fields                       : Ok
  9: Parse perf pmu format                                           : Ok
 10: PMU events                                                      :
 10.1: PMU event table sanity                                        : Ok
 10.2: PMU event map aliases                                         : Ok
 10.3: Parsing of PMU event table metrics                            : Ok
 10.4: Parsing of PMU event table metrics with fake PMUs             : Ok
 10.5: Parsing of metric thresholds with fake PMUs                   : Ok
 11: DSO data read                                                   : Ok
 12: DSO data cache                                                  : Ok
 13: DSO data reopen                                                 : Ok
 14: Roundtrip evsel->name                                           : Ok
 15: Parse sched tracepoints fields                                  : Ok
 16: syscalls:sys_enter_openat event fields                          : Ok
 17: Setup struct perf_event_attr                                    : Ok
					

perf test 2


 18: Match and link multiple hists                                   : Ok
 19: 'import perf' in python                                         : Ok
 20: Breakpoint overflow signal handler                              : Ok
 21: Breakpoint overflow sampling                                    : Ok
 22: Breakpoint accounting                                           : Ok
 23: Watchpoint                                                      :
 23.1: Read Only Watchpoint                                          : Skip (missing hardware support)
 23.2: Write Only Watchpoint                                         : Ok
 23.3: Read / Write Watchpoint                                       : Ok
 23.4: Modify Watchpoint                                             : Ok
 24: Number of exit events of a simple workload                      : Ok
 25: Software clock events period values                             : Ok
 26: Object code reading                                             : Ok
 27: Sample parsing                                                  : Ok
 28: Use a dummy software event to keep tracking                     : Ok
 29: Parse with no sample_id_all bit set                             : Ok
 30: Filter hist entries                                             : Ok
 31: Lookup mmap thread                                              : Ok
 32: Share thread maps                                               : Ok
 33: Sort output of hist entries                                     : Ok
 34: Cumulate child hist entries                                     : Ok
 35: Track with sched_switch                                         : Ok
 36: Filter fds with revents mask in a fdarray                       : Ok
 37: Add fd to a fdarray, making it autogrow                         : Ok
 38: kmod_path__parse                                                : Ok
 39: Thread map                                                      : Ok
 40: Session topology                                                : Ok
 41: Synthesize thread map                                           : Ok
 42: Remove thread map                                               : Ok
					

perf test 3


 43: CPU map                                                         :
 43.1: Synthesize cpu map                                            : Ok
 43.2: Print cpu map                                                 : Ok
 43.3: Merge cpu map                                                 : Ok
 43.4: Intersect cpu map                                             : Ok
 43.5: Equal cpu map                                                 : Ok
 44: Synthesize stat config                                          : Ok
 45: Synthesize stat                                                 : Ok
 46: Synthesize stat round                                           : Ok
 47: Synthesize attr update                                          : Ok
 48: Event times                                                     : Ok
 49: Read backward ring buffer                                       : Ok
 50: Probe SDT events                                                : Ok
 51: is_printable_array                                              : Ok
 52: Print bitmap                                                    : Ok
 53: perf hooks                                                      : Ok
 54: unit_number__scnprintf                                          : Ok
 55: mem2node                                                        : Ok
 56: time utils                                                      : Ok
 57: Test jit_write_elf                                              : Ok
 58: Test libpfm4 support                                            :
 58.1: test of individual --pfm-events                               : Ok
 58.2: test groups of --pfm-events                                   : Ok
 59: Test api io                                                     : Ok
 60: maps__merge_in                                                  : Ok
 61: Demangle Java                                                   : Ok
 62: Demangle OCaml                                                  : Ok
 63: Parse and process metrics                                       : Ok
 64: PE file support                                                 : Skip
					

perf test 4


 65: Event expansion for cgroups                                     : Ok
 66: Convert perf time to TSC                                        :
 66.1: TSC support                                                   : Ok
 66.2: Perf time to TSC                                              : Ok
 67: dlfilter C API                                                  : Ok
 68: Sigtrap                                                         : Ok
 69: Event groups                                                    : Skip
 70: Symbols                                                         : Ok
 71: Test dwarf unwind                                               : Ok
 72: Intel PT                                                        :
 72.1: Intel PT packet decoder                                       : Ok
 72.2: Intel PT hybrid CPU compatibility                             : Skip
 73: x86 bp modify                                                   : Ok
 74: x86 Sample parsing                                              : Ok
 75: AMD IBS via core pmu                                            : Skip
 76: x86 hybrid                                                      : Skip (not hybrid)
 77: build id cache operations                                       : Ok
 78: CoreSight / ASM Pure Loop                                       : Skip
 79: CoreSight / Memcpy 16k 10 Threads                               : Skip
 80: CoreSight / Thread Loop 10 Threads - Check TID                  : Skip
 81: CoreSight / Thread Loop 2 Threads - Check TID                   : Skip
 82: CoreSight / Unroll Loop Thread 10                               : Skip
 83: daemon operations                                               : Ok
 84: kernel lock contention analysis test                            : Ok
 85: perf pipe recording and injection test                          : Ok
 86: Add vfs_getname probe to get syscall args filenames             : Ok
 87: probe libc's inet_pton & backtrace it with ping                 : Ok
 88: Use vfs_getname probe to get syscall args filenames             : Ok
 89: Zstd perf.data compression/decompression                        : Ok
					

perf test 5


 90: perf record tests                                               : Ok
 91: perf record sample filtering (by BPF) tests                     : Ok
 92: perf record offcpu profiling tests                              : Ok
 93: perf stat CSV output linter                                     : Ok
 94: perf stat csv summary test                                      : Ok
 95: perf stat JSON output linter                                    : Ok
 96: perf stat metrics (shadow stat) test                            : Ok
 97: perf stat STD output linter                                     : Ok
 98: perf stat tests                                                 : Ok
 99: perf all metricgroups test                                      : Ok
100: perf all metrics test                                           : FAILED!
101: perf all libpfm4 events test                                    : Ok
102: perf all PMU test                                               : Ok
103: perf stat --bpf-counters test                                   : Ok
104: perf stat --bpf-counters --for-each-cgroup test                 : Ok
105: perf metrics value validation                                   : Ok
106: Check Arm64 callgraphs are complete in fp mode                  : Skip
107: Check Arm CoreSight trace data recording and synthesized samples: Skip
108: Check Arm SPE trace data recording and synthesized samples      : Skip
109: Check Arm SPE doesn't hang when there are forks                 : Skip
110: Check branch stack sampling                                     : Ok
111: Test data symbol                                                : Ok
112: Miscellaneous Intel PT testing                                  : Ok
113: Test java symbol                                                : Skip
114: 'perf data convert --to-json' command test                      : Ok
115: perf script task-analyzer tests                                 : Ok
116: test perf probe of function from different CU                   : Ok
117: Check open filename arg using perf trace + vfs_getname          : Ok
#
					

Books on perf



Documentation



An example



  • Syscall latency regression
  • Uses perf record and perf stat
  • audit subsystem
  • Mostly unchanged from Oracle's UEK5 to UEK6
  • Allocator changes
  • Modifies placement of affected cachelines
  • How to recoup 100 cycles?
  • Resulted in patches to the Linux Kernel

The other tools


  • sar, top, stap, etc
  • All part of a toolchest
  • perf adds to the toolchest
  • Sometimes does the same
  • With more flexibility

Quick comparison



  • Classic tools
  • perf alternative
  • To see the various perf roles
  • It's not just a profiler

vmstat


# vmstat 1
procs -----------memory----------     -swap- ---io--- -system-- ------cpu-----
 r  b   swpd   free   buff  cache     si so  bi    bo  in    cs us sy id wa st
 8  0 2000272 2128480 483836 15791896 7  9   582  992   42   44 39  6 55  0  0
 8  0 2000272 2021504 483852 15802480 0  0   136 2520 8764 3656 84 16  0  0  0
 8  0 2000272 2113768 483876 15804436 0  0    76 2352 8740 3385 83 17  0  0  0
 8  0 2000272 1978844 483876 15812104 0  0   164  260 8657 3070 85 15  0  0  0
^C
#
					

perf stat


# perf stat --interval-print 1000 -e context-switches
#        time   counts unit events
  1.000157779    2,627      context-switches
  2.000338693    3,067      context-switches
  3.000500966    3,577      context-switches
  4.000655707    2,976      context-switches
  5.000808482    3,199      context-switches
^C 5.922384403   3,333      context-switches

#
					

perf targets


  • system wide
  • CPU
  • cgroups
  • PID
  • TID

perf stat with some threads


# perf stat -I 1000 -e cs -t `pidof -S, bash`
#        time          counts unit events
  1.000217556              12      cs
  2.000423764              11      cs
  3.000606747               1      cs
  4.000817161               1      cs
  5.001014081   <not counted>      cs
^C  5.976337203 <not counted>      cs
#
					

Counting SMIs with perf stat

press the mute key

# perf stat -I 1000 -e msr/smi/
#           time             counts unit events
     1.000532552                  0      msr/smi/
     2.001166107                  0      msr/smi/
     3.001675027                  8      msr/smi/
     4.002176974                  0      msr/smi/
     5.002726317                  8      msr/smi/
^C     5.637141913                  0      msr/smi/

#
					

Before and after

Before...
# echo 1 > /proc/sys/vm/drop_caches
# perf stat find ~acme/git/linux > /dev/null

 Performance counter stats for 'find /home/acme/git/linux':

        318.11 msec task-clock        #    0.420 CPUs utilized
         4,989      context-switches  #    0.016 M/sec
            63      cpu-migrations    #    0.198 K/sec
           280      page-faults       #    0.880 K/sec
   417,896,513      cycles            #    1.314 GHz
   379,620,453      instructions      #    0.91  insn per cycle
    75,490,217      branches          #  237.312 M/sec
     2,036,356      branch-misses     #    2.70% of all branches

   0.758165914 seconds time elapsed

   0.073100000 seconds user
   0.241968000 seconds sys
#
					

Before and after

After...
# perf stat find ~acme/git/linux > /dev/null

 Performance counter stats for 'find /home/acme/git/linux':

         68.28 msec task-clock        #    0.991 CPUs utilized
            12      context-switches  #    0.176 K/sec
             0      cpu-migrations    #    0.000 K/sec
           274      page-faults       #    0.004 M/sec
   265,617,729      cycles            #    3.890 GHz
   310,931,753      instructions      #    1.17  insn per cycle
    61,838,290      branches          #  905.609 M/sec
     1,508,769      branch-misses     #    2.44% of all branches

   0.068907995 seconds time elapsed

   0.018829000 seconds user
   0.049568000 seconds sys
#
					
Shell completion

# tools/perf/perf-completion.sh
# perf stat -e power/energy-
power/energy-cores/ power/energy-gpu/ power/energy-pkg/
power/energy-psys/ power/energy-ram/
					
PMU events

# perf stat -e power/energy-cores/,power/energy-gpu/,power/energy-pkg/,power/energy-psys/,power/energy-ram/ \
       sleep 1

 Performance counter stats for 'system wide':

              0.83 Joules power/energy-cores/
              0.01 Joules power/energy-gpu/
              1.66 Joules power/energy-pkg/
              5.98 Joules power/energy-psys/
              0.54 Joules power/energy-ram/

       1.001084656 seconds time elapsed

#
					

top


Tasks: 299 total,   1 running, 297 sleeping,   1 stopped,   0 zombie
%Cpu(s):  1.1 us,  0.5 sy,  0.0 ni, 98.0 id, 0.1 wa, 0.1 hi, 0.2 si, 0.0 st
MiB Mem :  23980.2 total,   6982.2 free,   4027.3 used,  12970.8 buff/cache
MiB Swap:  12056.0 total,  12056.0 free,      0.0 used.  18803.3 avail Mem

  PID USER  PR  NI    VIRT    RES    SHR S %CPU %MEM    TIME+ COMMAND
 9137 acme  20   0 2515848 839908 120176 S  3.0  3.4 12:14.64 Web Content
 8888 acme  20   0 2261904 562132 147104 S  2.6  2.3  5:24.73 Web Content
 2727 acme  20   0  689068  67888  43064 S  1.3  0.3  2:50.71 gnome-terminal-
 9093 acme  20   0 2221712 350912 121988 S  1.0  1.4  2:11.17 Web Content
  989 avahi 20   0   32776   5116   4496 S  0.7  0.0  0:16.37 avahi-daemon
 7922 root  20   0  228896   5148   4268 R  0.7  0.0  0:00.06 top
 8786 acme  20   0 2661872 425496 187100 S  0.7  1.7 13:37.34 firefox
 8959 acme  20   0 1581656 129824  82692 S  0.7  0.5  0:26.55 WebExtensions
 1996 acme   9 -11 2092664  14264  10804 S  0.3  0.1  0:04.55 pulseaudio
 2086 acme  20   0 3298140 211612 118636 S  0.3  0.9  6:36.83 gnome-shell
 7995 qemu  20   0 5256108 690860  22588 S  0.3  2.8  1:22.68 qemu-system-x86
 9050 acme  20   0 2239376 585160 116440 S  0.3  2.4  1:57.69 Web Content
29175 root  20   0  253012  30208   8636 S  0.3  0.1  0:12.01 sssd_kcm
					

perf top --hierarchy

# perf top --hierarchy
Samples: 4K of event 'cycles:ppp', 4000 Hz, Event count (approx.): 1006575623 lost: 0/0
+  57.16%     cc1
+  19.39%     [kernel]
+  11.64%     libc-2.28.so
+   2.08%     perf
+   1.22%     genksyms
+   1.16%     ccache
+   1.05%     as
+   0.99%     ld-2.28.so
+   0.86%     make
+   0.84%     libpixman-1.so.0.34.0
+   0.65%     bash
+   0.41%     libelf-0.174.so
+   0.32%     objtool
+   0.24%     libglib-2.0.so.0.5800.2
+   0.17%     gcc
+   0.06%     perf-8888.map
+   0.03%     [i915]
+   0.02%     [crc32c_intel]
					

perf top --hierarchy

# perf top --hierarchy
Samples: 4K of event 'cycles:ppp', 4000 Hz, Event count (approx.): 1006575623 lost: 0/0
+  61.70%     cc1
+  16.39%     [kernel]
-  11.66%     libc-2.28.so
       2.54%     [.] _int_malloc
       1.51%     [.] malloc
       1.39%     [.] _int_free
       0.72%     [.] __strcmp_avx2
       0.65%     [.] cfree@GLIBC_2.2.5
       0.51%     [.] __strlen_avx2
       0.45%     [.] __memcmp_avx2_movbe
       0.34%     [.] __memset_avx2_erms
       0.32%     [.] malloc_consolidate
       0.32%     [.] __memchr_avx2
       0.30%     [.] __memmove_avx_unaligned_erms
       0.29%     [.] __strncmp_avx2
       0.21%     [.] getenv
       0.15%     [.] __dcigettext
+   1.24%     ccache
+   0.78%     ld-2.28.so
					

perf top --hierarchy


# perf top --hierarchy
Samples: 4K of event 'cycles:ppp', 4000 Hz, Event count (approx.): 1006575623 lost: 0/0 drop: 0/0
+  61.70%     cc1
-  16.39%     [kernel]
       1.13%     [k] native_flush_tlb_one_user
       0.56%     [k] unmap_page_range
       0.48%     [k] native_irq_return_iret
       0.46%     [k] clear_page_erms
       0.42%     [k] filemap_map_pages
       0.40%     [k] copy_user_enhanced_fast_string
       0.38%     [k] sync_regs
       0.37%     [k] syscall_return_via_sysret
       0.37%     [k] entry_SYSCALL_64
       0.35%     [k] common_interrupt
       0.33%     [k] error_entry
       0.31%     [k] __list_del_entry_valid
       0.30%     [k] __handle_mm_fault
       0.24%     [k] get_page_from_freelist
       0.24%     [k] __d_lookup_rcu
       0.20%     [k] vma_interval_tree_insert
       0.19%     [k] release_pages
       0.19%     [k] handle_mm_fault
       0.19%     [k] update_blocked_averages
       0.18%     [k] avc_has_perm_noaudit
       0.15%     [k] page_remove_rmap
       0.14%     [k] __pagevec_lru_add_fn
       0.14%     [k] _raw_spin_lock
+   1.29%     genksyms
+   1.24%     ccache
+   0.78%     ld-2.28.so
					
Looking for help with substrings
# perf top -h percent
 Usage: perf top [<options>]
    --call-graph <record_mode[,record_size],print_type,threshold[,print_limit],order,sort_key[,branch],value>
        setup and enables call-graph (stack chain/backtrace):

          record_mode: call graph recording mode (fp|dwarf|lbr)
          record_size: if record_mode is 'dwarf', max size of stack recording (<bytes>)
                               default: 8192 (bytes)
          print_type:  call graph printing style (graph|flat|fractal|folded|none)
          threshold:   minimum call graph inclusion threshold (<percent>)
          print_limit: maximum number of call graph entry (<number>)
          order:       call graph order (caller|callee)
          sort_key:    call graph sort key (function|address)
          branch:      include last branch info to call graph (branch)
          value:       call graph value (percent|period|count)

          Default: fp,graph,0.5,caller,function
    --percent-limit <percent>
          Don't show entries under that percent
    --percentage <filter>
          How to display percentage of filtered entries

#
					

gdb



  • Stops arbitrary code
  • kgdb for kernel, userspace with gdb
  • Breakpoints
  • Allows reading variable values
  • Watchpoints
  • Backtraces
  • Debugging interactivity
  • ptrace

perf probe



  • Hooks arbitrary code
  • Kernel or userspace
  • kprobes, uprobes or ftrace hooks
  • No interactivity
  • Backtraces
  • no ptrace
  • Collects variables

perf probe + perf stat


# perf probe -x /lib64/libc-2.28.so malloc
                                        
Added new event:
  probe_libc:malloc    (on malloc in /usr/lib64/libc-2.28.so)
You can now use it in all perf tools, such as:

	perf record -e probe_libc:malloc -aR sleep 1
                                        
# perf stat -e probe_libc:malloc sleep 1
                                        
 Performance counter stats for 'sleep 1':

                31      probe_libc:malloc

       1.001535345 seconds time elapsed

       0.001071000 seconds user
       0.000000000 seconds sys
#
					

perf stat + perf probe



  • Defines a new event
  • Counts it like any other event
  • Event remains available
  • But inactive when not used
  • Until rebooting, or removal with perf probe --del (see below)
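A minimal sketch of cleaning up (the event name matches the one created above): list the dynamic events, then delete the one you no longer need.

# perf probe --list
# perf probe --del probe_libc:malloc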

Where can we put probes?


# perf probe -L icmp_rcv
...
     52         if (rt->rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST)) {
                        /*
                         *      RFC 1122: 3.2.2.6 An ICMP_ECHO to broadcast MAY be
                         *        silently ignored (we let user decide with a sysctl).
                         *      RFC 1122: 3.2.2.8 An ICMP_TIMESTAMP MAY be silently
                         *        discarded if to broadcast/multicast.
                         */
     59                 if ((icmph->type == ICMP_ECHO ||
     60                      icmph->type == ICMP_TIMESTAMP) &&
                            net->ipv4.sysctl_icmp_echo_ignore_broadcasts) {
                                goto error;
                        }
     64                 if (icmph->type != ICMP_ECHO &&
                            icmph->type != ICMP_TIMESTAMP &&
     66                     icmph->type != ICMP_ADDRESS &&
                            icmph->type != ICMP_ADDRESSREPLY) {
                                goto error;
                        }
                }

...
					

New probe: rx of broadcast packet



  • See the source code for the kernel
  • Works for userspace as well: uprobes
  • With offsets in the lines
  • Where probes can be put

Adding the probe


# perf probe icmp_rcv:59
Added new events:
  probe:icmp_rcv       (on icmp_rcv:59)
  probe:icmp_rcv_1     (on icmp_rcv:59)

You can now use it in all perf tools, such as:

	perf record -e probe:icmp_rcv_1 -aR sleep 1

#
					

Tracing those events


# On window 1:
# ping -c 2 -b 127.255.255.255
WARNING: pinging broadcast address
PING 127.255.255.255 (127.255.255.255) 56(84) bytes of data.
--- 127.255.255.255 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 54ms
#

# On window 2:
# perf trace -e probe:icmp_rcv*
     0.000 ping/2126 probe:icmp_rcv_1:(ffffffff9186a9d2)
     0.002 ping/2126 probe:icmp_rcv:(ffffffff9186a9d2)
  1052.532 ping/2126 probe:icmp_rcv_1:(ffffffff9186a9d2)
  1052.534 ping/2126 probe:icmp_rcv:(ffffffff9186a9d2)
^C#
					
# perf trace --call-graph=dwarf -e probe:icmp_rcv*
     0.000 ping/28036 probe:icmp_rcv_1:(ffffffff9186a9d2)
                         ip_local_deliver_finish ([kernel.kallsyms])
                         ip_local_deliver ([kernel.kallsyms])
                         ip_rcv ([kernel.kallsyms])
                         __netif_receive_skb_one_core ([kernel.kallsyms])
                         process_backlog ([kernel.kallsyms])
                         net_rx_action ([kernel.kallsyms])
                         do_softirq ([kernel.kallsyms])
                         __local_bh_enable_ip ([kernel.kallsyms])
                         ip_finish_output2 ([kernel.kallsyms])
                         ip_output ([kernel.kallsyms])
                         ip_send_skb ([kernel.kallsyms])
                         raw_sendmsg ([kernel.kallsyms])
                         sock_sendmsg ([kernel.kallsyms])
                         __sys_sendto ([kernel.kallsyms])
                         do_syscall_64 ([kernel.kallsyms])
                         entry_SYSCALL_64_after_hwframe ([kernel.kallsyms])
                         __libc_sendto (/usr/lib64/libc-2.28.so)
                         ping4_send_probe (/usr/bin/ping)
                         pinger (/usr/bin/ping)
                         main_loop (/usr/bin/ping)
                         ping4_run (/usr/bin/ping)
                         main (/usr/bin/ping)
                         __libc_start_main (/usr/lib64/libc-2.28.so)
                         _start (/usr/bin/ping)
					

perf diff



  • diff-like
  • multiple versions of the same software
  • What changed?
  • buildids match the binaries across versions (sketch below)
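A minimal sketch (the binary names are illustrative): profile the same workload on two builds, then compare the per-symbol overhead. With no arguments, perf diff compares perf.data.old against perf.data.

# perf record -o perf.data.old ./myapp-v1
# perf record -o perf.data ./myapp-v2
# perf diff perf.data.old perf.data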

Where do I start?



  • perf top: live profiling
  • perf top --hierarchy
  • --sort (examples below)
  • Integration with annotate
  • live annotate
  • Annotate navigation
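Some illustrative starting points (the sort keys are just examples): plain perf top for the live profile, --hierarchy to group by DSO and expand into symbols, --sort to change the grouping.

# perf top
# perf top --hierarchy
# perf top --sort comm,dso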

live profiling



  • perf top
  • top-like
  • defaults to symbols
  • glimpse at the system
  • symbol names help a bit

Building the kernel on ARM64

Counting:



  • perf stat
  • lowest cost
  • Many events
  • No symbol resolution

perf stat


$ perf stat sleep 1

 Performance counter stats for 'sleep 1':

         1.42 msec task-clock:u         #    0.001 CPUs utilized
            0      context-switches:u   #    0.000 /sec
            0      cpu-migrations:u     #    0.000 /sec
           78      page-faults:u        #   54.934 K/sec
      708,831      cycles:u             #    0.499 GHz
      476,967      instructions:u       #    0.67  insn per cycle
      106,465      branches:u           #   74.982 M/sec
       11,030      branch-misses:u      #   10.36% of all branches

  1.002618642 seconds time elapsed

  0.000000000 seconds user
  0.002129000 seconds sys
$
					

system wide


$ sudo perf stat -a sleep 1

 Performance counter stats for 'system wide':

     8,017.36 msec cpu-clock            #    7.998 CPUs utilized
        1,379      context-switches     #  172.002 /sec
           19      cpu-migrations       #    2.370 /sec
          603      page-faults          #   75.212 /sec
  221,872,379      cycles               #    0.028 GHz
   90,264,445      instructions         #    0.41  insn per cycle
   18,078,489      branches             #    2.255 M/sec
    1,134,291      branch-misses        #    6.27% of all branches

  1.002463688 seconds time elapsed
$
$ getconf _NPROCESSORS_ONLN
8
$
					

What can be counted?



  • perf list
  • perf list hw, sw, *wildcard*

hardware counters


# grep 'model name' -m1 /proc/cpuinfo
model name	: Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
                                        
# perf list hw

List of pre-defined events (to be used in -e or -M):

  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
  bus-cycles                                         [Hardware event]
  cache-misses                                       [Hardware event]
  cache-references                                   [Hardware event]
  cpu-cycles OR cycles                               [Hardware event]
  instructions                                       [Hardware event]
  ref-cycles                                         [Hardware event]
#
					

Vendor events



  • What is a "cache-misses" event?
  • Depends on the processor
  • May even be different per family
  • Attempt at comparing different arches

Intel skylake


# perf stat -e cache-misses,LONGEST_LAT_CACHE.MISS
^C
 Performance counter stats for 'system wide':

        13,261,973      cache-misses                                                          
        13,262,361      LONGEST_LAT_CACHE.MISS                                                

       3.612974091 seconds time elapsed

# perf list longest_lat_cache.miss

cache:
 longest_lat_cache.miss
   [Core-originated cacheable demand requests missed L3 Spec update: SKL057]
#
					

kernel sources


$ git grep PERF_COUNT_HW_CACHE_MISSES arch/ | tail
arch/x86/events/amd/core.c:	[PERF_COUNT_HW_CACHE_MISSES]		= 0x0964,
arch/x86/events/intel/core.c:	[PERF_COUNT_HW_CACHE_MISSES]		= 0x412e,
arch/x86/events/intel/core.c:	{ PERF_COUNT_HW_CACHE_MISSES, "cache misses" },
arch/x86/events/intel/knc.c:  [PERF_COUNT_HW_CACHE_MISSES]		= 0x0029,
arch/x86/events/intel/p4.c:  [PERF_COUNT_HW_CACHE_MISSES] =
arch/x86/events/intel/p6.c:  [PERF_COUNT_HW_CACHE_MISSES]		= 0x012e,	/* L2_RQSTS:I       */
arch/x86/events/zhaoxin/core.c:	[PERF_COUNT_HW_CACHE_MISSES]      = 0x051a,
arch/x86/events/zhaoxin/core.c:	{ PERF_COUNT_HW_CACHE_MISSES, "cache misses" },
arch/x86/events/zhaoxin/core.c:			zx_pmon_event_map[PERF_COUNT_HW_CACHE_MISSES] = 0;
arch/xtensa/kernel/perf_event.c:	[PERF_COUNT_HW_CACHE_MISSES]		= XTENSA_PMU_MASK(12, 0x1),
$
					

raw events: umask, event


# perf stat -a -e cache-misses,r412e sleep 5s

 Performance counter stats for 'system wide':

        12,961,754      cache-misses                                                          
        12,962,089      r412e                                                                 

       5.003567616 seconds time elapsed

#
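
On x86 the raw descriptor packs the fields from the vendor JSON on the next slide: r412e is umask 0x41 plus event code 0x2E. A sketch of the equivalent explicit syntax (assuming the core PMU is named 'cpu'):

# perf stat -a -e cpu/event=0x2e,umask=0x41/ sleep 5s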
					
Vendor Events/--long-desc
tools/perf/pmu-events/arch/x86/skylake/cache.json
    {
        "BriefDescription": "Core-originated cacheable demand requests missed L3",
        "Errata": "SKL057",
        "EventCode": "0x2E",
        "EventName": "LONGEST_LAT_CACHE.MISS",
        "PublicDescription": "Counts core-originated cacheable requests that miss the L3 cache (Longest Latency cache).
	Requests include data and code reads, Reads-for-Ownership (RFOs), speculative accesses and hardware
	prefetches from L1 and L2. It does not include all misses to the L3.",
        "SampleAfterValue": "100003",
        "UMask": "0x41"
    },
					

3 ways of measuring


# perf stat -a -e cache-misses,r412e,LONGEST_LAT_CACHE.MISS sleep 5s

 Performance counter stats for 'system wide':

        11,621,661      cache-misses                                                          
        11,621,982      r412e                                                                 
        11,622,200      LONGEST_LAT_CACHE.MISS                                                

       5.003476145 seconds time elapsed

#
					

cache events


# perf list cache

List of pre-defined events (to be used in -e or -M):

cpu:
  L1-dcache-loads OR cpu/L1-dcache-loads/
  L1-dcache-load-misses OR cpu/L1-dcache-load-misses/
  L1-dcache-stores OR cpu/L1-dcache-stores/
  L1-icache-load-misses OR cpu/L1-icache-load-misses/
  LLC-loads OR cpu/LLC-loads/
  LLC-load-misses OR cpu/LLC-load-misses/
  LLC-stores OR cpu/LLC-stores/
  LLC-store-misses OR cpu/LLC-store-misses/
  dTLB-loads OR cpu/dTLB-loads/
  dTLB-load-misses OR cpu/dTLB-load-misses/
  dTLB-stores OR cpu/dTLB-stores/
  dTLB-store-misses OR cpu/dTLB-store-misses/
  iTLB-loads OR cpu/iTLB-loads/
  iTLB-load-misses OR cpu/iTLB-load-misses/
  branch-loads OR cpu/branch-loads/
  branch-load-misses OR cpu/branch-load-misses/
  node-loads OR cpu/node-loads/
  node-load-misses OR cpu/node-load-misses/
  node-stores OR cpu/node-stores/
  node-store-misses OR cpu/node-store-misses/
#
					

HW differences



  • Not all CPUs have all events
  • AMD
  • Intel

cache: AMD


# grep -m1 'model name' /proc/cpuinfo
model name	: AMD Ryzen 9 5950X 16-Core Processor
                                        
# perf list cache

List of pre-defined events (to be used in -e or -M):

cpu:
  L1-dcache-loads OR cpu/L1-dcache-loads/
  L1-dcache-load-misses OR cpu/L1-dcache-load-misses/
  L1-dcache-prefetches OR cpu/L1-dcache-prefetches/
  L1-icache-loads OR cpu/L1-icache-loads/
  L1-icache-load-misses OR cpu/L1-icache-load-misses/
  dTLB-loads OR cpu/dTLB-loads/
  dTLB-load-misses OR cpu/dTLB-load-misses/
  iTLB-loads OR cpu/iTLB-loads/
  iTLB-load-misses OR cpu/iTLB-load-misses/
  branch-loads OR cpu/branch-loads/
  branch-load-misses OR cpu/branch-load-misses/
#
					

cache: ARM64


root@roc-rk3399-pc:~# uname -m
aarch64
                                        
root@roc-rk3399-pc:~# perf list cache

List of pre-defined events (to be used in -e or -M):

armv8_cortex_a53:
  L1-dcache-loads OR armv8_cortex_a53/L1-dcache-loads/
  L1-dcache-load-misses OR armv8_cortex_a53/L1-dcache-load-misses/
  L1-dcache-stores OR armv8_cortex_a53/L1-dcache-stores/
  L1-dcache-store-misses OR armv8_cortex_a53/L1-dcache-store-misses/
  L1-dcache-prefetch-misses OR armv8_cortex_a53/L1-dcache-prefetch-misses/
  L1-icache-loads OR armv8_cortex_a53/L1-icache-loads/
  L1-icache-load-misses OR armv8_cortex_a53/L1-icache-load-misses/
  dTLB-load-misses OR armv8_cortex_a53/dTLB-load-misses/
  dTLB-store-misses OR armv8_cortex_a53/dTLB-store-misses/
  iTLB-load-misses OR armv8_cortex_a53/iTLB-load-misses/
  branch-loads OR armv8_cortex_a53/branch-loads/
  branch-load-misses OR armv8_cortex_a53/branch-load-misses/
  node-loads OR armv8_cortex_a53/node-loads/
  node-stores OR armv8_cortex_a53/node-stores/
					

cache: ARM64: Hybrid!


armv8_cortex_a72:
  L1-dcache-loads OR armv8_cortex_a72/L1-dcache-loads/
  L1-dcache-load-misses OR armv8_cortex_a72/L1-dcache-load-misses/
  L1-dcache-stores OR armv8_cortex_a72/L1-dcache-stores/
  L1-dcache-store-misses OR armv8_cortex_a72/L1-dcache-store-misses/
  L1-dcache-prefetch-misses OR armv8_cortex_a72/L1-dcache-prefetch-misses/
  L1-icache-loads OR armv8_cortex_a72/L1-icache-loads/
  L1-icache-load-misses OR armv8_cortex_a72/L1-icache-load-misses/
  dTLB-load-misses OR armv8_cortex_a72/dTLB-load-misses/
  dTLB-store-misses OR armv8_cortex_a72/dTLB-store-misses/
  iTLB-load-misses OR armv8_cortex_a72/iTLB-load-misses/
  branch-loads OR armv8_cortex_a72/branch-loads/
  branch-load-misses OR armv8_cortex_a72/branch-load-misses/
  node-loads OR armv8_cortex_a72/node-loads/
  node-stores OR armv8_cortex_a72/node-stores/
root@roc-rk3399-pc:~#
					

JSON vendor events


$ ls tools/perf/pmu-events/arch/
arm64  nds32  powerpc  riscv  s390  test  x86
                                        
$ ls tools/perf/pmu-events/arch/x86/
alderlake    amdzen1        amdzen3       bonnell        broadwellde cascadelakex
goldmont     grandridge     haswell       icelake        ivybridge   jaketown
mapfile.csv  nehalemep      rocketlake    sapphirerapids silvermont  skylakex
tigerlake    westmereep-sp  alderlaken    amdzen2        amdzen4     broadwell
broadwellx   elkhartlake    goldmontplus  graniterapids haswellx     icelakex
ivytown      knightslanding meteorlake    nehalemex     sandybridge  sierraforest
skylake      snowridgex     westmereep-dp westmereex
                                        
$ ls tools/perf/pmu-events/arch/riscv/
mapfile.csv  riscv-sbi-firmware.json  sifive
$ ls tools/perf/pmu-events/arch/riscv/sifive/
u74
$ ls tools/perf/pmu-events/arch/riscv/sifive/u74/
firmware.json  instructions.json  memory.json  microarch.json
$
					

An example


$ head -6 tools/perf/pmu-events/arch/riscv/sifive/u74/instructions.json
[
  {
    "EventName": "EXCEPTION_TAKEN",
    "EventCode": "0x0000100",
    "BriefDescription": "Exception taken"
  },
$
					

perf list on it


# perf list dtlb | head -20

virtual memory:
  dtlb_load_misses.miss_causes_a_walk
       [Load misses in all DTLB levels that cause page walks]
  dtlb_load_misses.stlb_hit
       [Loads that miss the DTLB and hit the STLB]
  dtlb_load_misses.walk_active
       [Cycles when at least one PMH is busy with a page walk for a load. EPT
        page walk duration are excluded in Skylake]
  dtlb_load_misses.walk_completed
       [Load miss in all TLB levels causes a page walk that completes. (All
        page sizes)]
  dtlb_load_misses.walk_completed_1g
       [Page walk completed due to a demand data load to a 1G page]
  dtlb_load_misses.walk_completed_2m_4m
       [Page walk completed due to a demand data load to a 2M/4M page]
  dtlb_load_misses.walk_completed_4k
       [Page walk completed due to a demand data load to a 4K page]
  dtlb_load_misses.walk_pending
       [Counts 1 per cycle for each PMH that is busy with a page walk for a
#
					

Targets



  • Start workload
  • Existing pid/tid
  • CPUs
  • cgroups
  • user

workload

# perf stat stress-ng --cpu 8 --timeout 5
                                        
stress-ng: info:  [530937] setting to a 5 secs run per stressor
stress-ng: info:  [530937] dispatching hogs: 8 cpu
stress-ng: info:  [530937] cpu:
stress-ng: info:  [530937]       1921 Thermal Event Interrupts
stress-ng: warn:  [530937]          1 System Management Interrupt
stress-ng: info:  [530937] skipped: 0
stress-ng: info:  [530937] passed: 8: cpu (8)
stress-ng: info:  [530937] failed: 0
stress-ng: info:  [530937] metrics untrustworthy: 0
stress-ng: info:  [530937] successful run completed in 5.01 secs
                                        
# perf stat stress-ng --cpu 8 --timeout 5

 Performance counter stats for 'stress-ng --cpu 8 --timeout 5':

         38,758.27 msec task-clock              #    7.726 CPUs utilized
             2,926      context-switches        #   75.494 /sec
                23      cpu-migrations          #    0.593 /sec
             4,602      page-faults             #  118.736 /sec
   146,703,471,275      cycles                  #    3.785 GHz
   180,899,175,476      instructions            #    1.23  insn per cycle
    20,292,093,164      branches                #  523.555 M/sec
       298,795,750      branch-misses           #    1.47% of all branches

       5.016468088 seconds time elapsed

      38.552115000 seconds user
       0.103353000 seconds sys

#
					
# perf stat -d stress-ng --cpu 8 --timeout 5

 Performance counter stats for 'stress-ng --cpu 8 --timeout 5':

         39,764.39 msec task-clock              #    7.926 CPUs utilized
             1,979      context-switches        #   49.768 /sec
                 1      cpu-migrations          #    0.025 /sec
             4,601      page-faults             #  115.707 /sec
   150,499,468,551      cycles                  #    3.785 GHz                         (50.02%)
   183,080,730,725      instructions            #    1.22  insn per cycle              (62.50%)
    20,508,317,865      branches                #  515.746 M/sec                       (62.49%)
       298,939,900      branch-misses           #    1.46% of all branches             (62.50%)
    23,971,178,026      L1-dcache-loads         #  602.830 M/sec                       (62.49%)
     2,853,927,928      L1-dcache-load-misses   #   11.91% of all L1-dcache accesses   (62.50%)
       990,690,295      LLC-loads               #   24.914 M/sec                       (50.02%)
           429,819      LLC-load-misses         #    0.04% of all L1-icache accesses   (50.02%)

       5.017089011 seconds time elapsed

      39.392300000 seconds user
       0.057596000 seconds sys

#
					
# perf stat -dd stress-ng --cpu 8 --timeout 5

 Performance counter stats for 'stress-ng --cpu 8 --timeout 5':

         39,825.52 msec task-clock              #    7.938 CPUs utilized
             1,889      context-switches        #   47.432 /sec
                 4      cpu-migrations          #    0.100 /sec
             4,610      page-faults             #  115.755 /sec
   151,191,351,560      cycles                  #    3.796 GHz                         (30.77%)
   180,391,636,461      instructions            #    1.19  insn per cycle              (38.49%)
    20,290,955,975      branches                #  509.496 M/sec                       (38.50%)
       291,480,930      branch-misses           #    1.44% of all branches             (38.49%)
    23,643,871,036      L1-dcache-loads         #  593.686 M/sec                       (38.49%)
     2,699,885,182      L1-dcache-load-misses   #   11.42% of all L1-dcache accesses   (38.50%)
       987,819,442      LLC-loads               #   24.804 M/sec                       (30.79%)
           440,627      LLC-load-misses         #    0.04% of all L1-icache accesses   (30.78%)
     <not supported>      L1-icache-loads
       203,971,595      L1-icache-load-misses                                          (30.76%)
    23,864,128,198      dTLB-loads              #  599.217 M/sec                       (30.77%)
           200,750      dTLB-load-misses        #    0.00% of all dTLB cache accesses  (30.76%)
         1,452,617      iTLB-loads              #   36.475 K/sec                       (30.76%)
           554,741      iTLB-load-misses        #   38.19% of all iTLB cache accesses  (30.78%)

       5.017276264 seconds time elapsed

      39.447619000 seconds user
       0.066076000 seconds sys

#
					
# perf stat -ddd stress-ng --cpu 8 --timeout 5

 Performance counter stats for 'stress-ng --cpu 8 --timeout 5':

         39,881.34 msec task-clock              #    7.946 CPUs utilized
             1,629      context-switches        #   40.846 /sec
                 1      cpu-migrations          #    0.025 /sec
             4,581      page-faults             #  114.866 /sec
   145,105,355,853      cycles                  #    3.638 GHz                         (30.80%)
   171,764,585,477      instructions            #    1.18  insn per cycle              (38.48%)
    19,247,245,672      branches                #  482.613 M/sec                       (38.48%)
       274,979,345      branch-misses           #    1.43% of all branches             (38.48%)
    22,309,648,482      L1-dcache-loads         #  559.401 M/sec                       (38.49%)
     2,538,566,841      L1-dcache-load-misses   #   11.38% of all L1-dcache accesses   (38.48%)
       990,773,774      LLC-loads               #   24.843 M/sec                       (30.78%)
           442,362      LLC-load-misses         #    0.04% of all L1-icache accesses   (30.77%)
     <not supported>      L1-icache-loads
       178,304,864      L1-icache-load-misses                                          (30.75%)
    22,477,712,511      dTLB-loads              #  563.615 M/sec                       (30.77%)
           210,271      dTLB-load-misses        #    0.00% of all dTLB cache accesses  (30.78%)
         1,827,194      iTLB-loads              #   45.816 K/sec                       (30.80%)
           335,946      iTLB-load-misses        #   18.39% of all iTLB cache accesses  (30.80%)
     <not supported>      L1-dcache-prefetches
     <not supported>      L1-dcache-prefetch-misses

       5.018978942 seconds time elapsed

      39.499595000 seconds user
       0.053139000 seconds sys

#
					

Newer Intel machines


$ perf stat sleep 1

 Performance counter stats for 'sleep 1':

              0.51 msec task-clock:u              #    0.001 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
                68      page-faults:u             #  132.683 K/sec
           545,228      cycles:u                  #    1.064 GHz
           555,509      instructions:u            #    1.02  insn per cycle
           123,574      branches:u                #  241.120 M/sec
             6,957      branch-misses:u           #    5.63% of all branches
                        TopdownL1                 #     17.5 %  tma_backend_bound
                                                  #     22.6 %  tma_bad_speculation
                                                  #     42.7 %  tma_frontend_bound
                                                  #     17.1 %  tma_retiring
                        TopdownL2                 #     21.8 %  tma_branch_mispredicts
                                                  #     11.5 %  tma_core_bound
                                                  #     13.4 %  tma_fetch_bandwidth
                                                  #     29.3 %  tma_fetch_latency
                                                  #      2.7 %  tma_heavy_operations
                                                  #     14.5 %  tma_light_operations
                                                  #      0.8 %  tma_machine_clears
                                                  #      6.1 %  tma_memory_bound
					

CPU target



  • CPU isolation
  • To avoid interference
  • From OS kernel
  • And other apps
  • Real Time

Idle CPU target


# perf stat -e cycles,instructions -C 1 sleep 1

 Performance counter stats for 'CPU(s) 1':

        14,490,458      cycles
         6,429,339      instructions     #    0.44  insn per cycle

       1.002107055 seconds time elapsed

#
					

Busy CPU target


# taskset --cpu-list 1 stress-ng --cpu 1
stress-ng: info:  [531903] defaulting to a 1 day, 0 secs run per stressor
stress-ng: info:  [531903] dispatching hogs: 1 cpu
                                        
# taskset --cpu-list -p 531903
pid 531903's current affinity list: 1
#
					

Busy CPU target


# perf stat -e cycles,instructions --cpu 1 sleep 1

 Performance counter stats for 'CPU(s) 1':

     3,968,302,827      cycles
     6,913,940,181      instructions     #    1.74  insn per cycle

       1.001263107 seconds time elapsed

#
					

PID target


# perf stat -e cycles,instructions -p 531904 sleep 1

 Performance counter stats for process id '531904':

     3,976,708,553      cycles
     6,937,604,548      instructions    #    1.74  insn per cycle

       1.001302101 seconds time elapsed

#
					

Interval printing


# perf stat --interval-print 1000 -e cycles,instructions -p 531904
#       time         counts unit events
 1.001021154  3,964,676,448      cycles
 1.001021154  6,892,761,344      instructions  # 1.74 insn per cycle
 2.002120190  3,967,753,676      cycles
 2.002120190  6,910,658,758      instructions  # 1.74 insn per cycle
 3.002967437  3,971,391,822      cycles
 3.002967437  6,932,051,939      instructions  # 1.75 insn per cycle
 4.004059795  3,974,283,493      cycles
 4.004059795  6,907,156,529      instructions  # 1.74 insn per cycle
 5.004964383  3,968,905,089      cycles
 5.004964383  6,909,361,881      instructions  # 1.74 insn per cycle
 6.006058613  3,975,937,140      cycles
 6.006058613  6,935,675,487      instructions  # 1.74 insn per cycle
 7.006967950  3,979,759,663      cycles
^C 7.006967950  6,938,796,949      instructions  # 1.74 insn per cycle
#
					

cgroup target


# cat /proc/531903/cgroup
0::/user.slice/user-1000.slice/session-38.scope
                                        
# ps -eo pid,args,pcpu,cgroup --sort user | grep session-38
 531503 sshd: acme@pts/7     0.0 0::/user.slice/user-1000.slice/session-38.scope
 531516 -bash                0.0 0::/user.slice/user-1000.slice/session-38.scope
 531496 sshd: acme [priv]    0.0 0::/user.slice/user-1000.slice/session-38.scope
 531636 sudo su -            0.0 0::/user.slice/user-1000.slice/session-38.scope
 531637 su -                 0.0 0::/user.slice/user-1000.slice/session-38.scope
 531640 -bash                0.0 0::/user.slice/user-1000.slice/session-38.scope
 531903 stress-ng --cpu 1    0.0 0::/user.slice/user-1000.slice/session-38.scope
 531904 stress-ng-cpu [run] 99.8 0::/user.slice/user-1000.slice/session-38.scope
#
					

cgroup target


# perf stat -I 1000 -e cycles,instructions -G user.slice/user-1000.slice/session-38.scope
#       time         counts unit events
 1.000102472  3,959,963,051      cycles         user.slice/user-1000.slice/session-38.scope
 1.000102472  6,891,303,376      instructions   user.slice/user-1000.slice/session-38.scope # 1.74  insn per cycle
 2.001681379  3,964,485,454      cycles         user.slice/user-1000.slice/session-38.scope
 2.001681379  6,919,617,821      instructions   user.slice/user-1000.slice/session-38.scope # 1.75  insn per cycle
 3.003042309  3,967,487,922      cycles         user.slice/user-1000.slice/session-38.scope
 3.003042309  6,916,449,901      instructions   user.slice/user-1000.slice/session-38.scope # 1.74  insn per cycle
 4.004647780  3,961,207,761      cycles         user.slice/user-1000.slice/session-38.scope
 4.004647780  6,886,653,488      instructions   user.slice/user-1000.slice/session-38.scope # 1.74  insn per cycle
 5.006216894  3,970,623,638      cycles         user.slice/user-1000.slice/session-38.scope
 5.006216894  6,927,468,227      instructions   user.slice/user-1000.slice/session-38.scope # 1.74  insn per cycle
^C 5.332583270  1,290,202,842      cycles         user.slice/user-1000.slice/session-38.scope
 5.332583270  2,245,390,805      instructions   user.slice/user-1000.slice/session-38.scope # 1.74  insn per cycle
#
					

BPF target



  • BPF popping up everywhere
  • bpftool prog to list
  • Where?
  • Can I see the source?
  • A little digression; we'll see more later
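
perf stat can also take a BPF program as a target. A minimal sketch, assuming a perf built with BPF skeleton support, using program id 47 from the bpftool listing on the next slide:

# perf stat -b 47 -e cycles,instructions -I 1000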

bpftool prog


# bpftool prog | head -8
2: tracing  name hid_tail_call  tag 7cc47bbf07148bfe  gpl
	loaded_at 2023-09-04T13:07:19-0300  uid 0
	xlated 56B  jited 113B  memlock 4096B  map_ids 2
	btf_id 2
47: lsm  name restrict_filesy  tag aae89fa01fe7ee91  gpl
	loaded_at 2023-09-04T13:07:25-0300  uid 0
	xlated 560B  jited 305B  memlock 4096B  map_ids 24
	btf_id 60
#
					

restrict_filesy?



  • LSM: Linux Security Module
  • Has BTF info
  • Look at source

source


# bpftool prog dump xlated id 47 | head -15
int restrict_filesystems(unsigned long long * ctx):
; int BPF_PROG(restrict_filesystems, struct file *file, int ret)
   0: (79) r3 = *(u64 *)(r1 +0)
   1: (79) r0 = *(u64 *)(r1 +8)
   2: (b7) r1 = 0
; uint32_t *value, *magic_map, magic_number, zero = 0, *is_allow;
   3: (63) *(u32 *)(r10 -24) = r1
; int BPF_PROG(restrict_filesystems, struct file *file, int ret)
   4: (bf) r1 = r0
   5: (67) r1 <<= 32
   6: (77) r1 >>= 32
; if (ret != 0)
   7: (55) if r1 != 0x0 goto pc+61
   8: (b7) r1 = 32
   9: (0f) r3 += r1
#
					

Back to perf stat, advanced mode



  • targets: CPU, threads, processes, cgroups, BPF programs, etc.
  • metrics & metricgroups: equations with multiple events
  • simplest one: IPC (Instructions per Cycle)
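
A quick sketch (metric names vary by CPU and perf version; IPC is commonly available):

# perf list metric | grep -i ipc   # which metrics this CPU defines
# perf stat -M IPC sleep 1         # computes the metric from its events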

profiling/tracing workflow



  • perf record on one machine
  • perf report/annotate/script/etc. on another
  • perf archive to collect the needed ELF files (sketched below)
  • cross platform
  • good for IoT, embedded, etc
  • pipe mode
  • perf.data file
  • documentation of perf.data file
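
A minimal cross-machine sketch (the tar invocation is the one perf archive itself suggests):

# perf record -a sleep 5     # on the target machine
# perf archive               # writes perf.data.tar.bz2 with the needed ELF files

# tar xvf perf.data.tar.bz2 -C ~/.debug   # on the analysis machine
# perf report -i perf.data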

record


# mkdir test
# cd test
# ls
# perf record sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.034 MB perf.data (8 samples) ]
# ls
perf.data
#
					

what is in it?


# perf evlist
cycles:P
# perf evlist -v
cycles:P: type: 0 (PERF_TYPE_HARDWARE), size: 136, config: 0 (PERF_COUNT_HW_CPU_CYCLES),
{ sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|PERIOD,
read_format: ID|LOST, disabled: 1, inherit: 1, mmap: 1, comm: 1, freq: 1, enable_on_exec: 1,
task: 1, precise_ip: 3, sample_id_all: 1, mmap2: 1, comm_exec: 1, ksymbol: 1, bpf_event: 1
#
					

defaults



  • event: CPU cycles
  • cycles:P the most precise cycles event
  • sample skew
  • precise_ip: 3
  • frequency: 4kHz
  • freq: 1 - use freq, not period

period



  • period: number of events to trigger a sample
  • A small value may generate too much data
  • Experiment until you get a reasonable value
  • E.g. 'perf record -e cache-misses/period=10000/'
  • When using freq, the kernel adjusts the period
  • To achieve the desired frequency
  • perf_event_open(2)
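
The two sampling modes side by side, as a sketch:

# perf record -F 997 -a sleep 1      # freq mode: the kernel adjusts the period
# perf record -c 100000 -a sleep 1   # period mode: a sample every 100000 events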

header


# perf report --header-only
# captured on    : Fri Sep 15 16:38:23 2023
# header version : 1
# data offset    : 320
# data size      : 35520
# feat offset    : 35840
# hostname : quaco
# os release : 6.4.10-200.fc38.x86_64
# perf version : 6.6.rc1.gf6ff1c760431
# arch : x86_64
# nrcpus online : 8
# nrcpus avail : 8
# cpudesc : Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
# cpuid : GenuineIntel,6,142,10
# total memory : 24471460 kB
# cmdline : /home/acme/bin/perf record sleep 1
# event : name = cycles:P type = 0 (PERF_TYPE_HARDWARE),
	 size = 136, config = 0 (PERF_COUNT_HW_CPU_CYCLES) ...
					

more header info


# CPU_TOPOLOGY info available, use -I to display
# NUMA_TOPOLOGY info available, use -I to display
# pmu mappings: cpu = 4, breakpoint = 5, cstate_core = 17,
                cstate_pkg = 18, i915 = 10, intel_pt = 8,
		kprobe = 6, msr = 9, power = 19 ...
# CACHE info available, use -I to display
# time of first sample : 388144.384852
# time of last sample : 388144.385473
					

even more


# bpf_prog_info 702: bpf_prog_6deef7357e7b4530_sd_fw_egress addr 0xffffffffc0239ff4 size 63
# bpf_prog_info 703: bpf_prog_6deef7357e7b4530_sd_fw_ingress addr 0xffffffffc023a06c size 63
# bpf_prog_info 704: bpf_prog_ccbbf91f3c6979c7_sd_devices addr 0xffffffffc0237a0c size 360
# btf info of id 2
# btf info of id 60
# cpu pmu capabilities: branches=32, max_precise=3, pmu_name=skylake
# intel_pt pmu capabilities: topa_multiple_entries=1, psb_cyc=1, single_range_output=1,
  mtc_periods=249, ip_filtering=1, output_subsys=0 ...
# missing features: TRACING_DATA BRANCH_STACK GROUP_DESC AUXTRACE STAT
  CLOCKID DIR_FORMAT COMPRESSED CLOCK_DATA HYBRID_TOPOLOGY
					

-I CPU info


# sibling threads : 0,4
# sibling threads : 1,5
# sibling threads : 2,6
# sibling threads : 3,7
# CPU 0: Core ID 0, Die ID 0, Socket ID 0
# CPU 1: Core ID 1, Die ID 0, Socket ID 0
# CPU 2: Core ID 2, Die ID 0, Socket ID 0
# CPU 3: Core ID 3, Die ID 0, Socket ID 0
# CPU 4: Core ID 0, Die ID 0, Socket ID 0
# CPU 5: Core ID 1, Die ID 0, Socket ID 0
# CPU 6: Core ID 2, Die ID 0, Socket ID 0
# CPU 7: Core ID 3, Die ID 0, Socket ID 0
					

-I: caches


# CPU cache info:
#  L1 Data                 32K [0,4]
#  L1 Instruction          32K [0,4]
#  L1 Data                 32K [1,5]
#  L1 Instruction          32K [1,5]
#  L1 Data                 32K [2,6]
#  L1 Instruction          32K [2,6]
#  L1 Data                 32K [3,7]
#  L1 Instruction          32K [3,7]
#  L2 Unified             256K [0,4]
#  L2 Unified             256K [1,5]
#  L2 Unified             256K [2,6]
#  L2 Unified             256K [3,7]
#  L3 Unified            8192K [0-7]
					

Consuming it


# perf report --tasks
#      pid      tid     ppid  comm
         0        0       -1 |swapper
    711036   711036       -1 |sleep
#
					

mmaps


# perf report --mmaps
#      pid      tid     ppid  comm
         0        0       -1 |swapper
    711036   711036       -1 |sleep
                                55bb9d3b1000-55bb9d3b5000 r-xs 00002000 1969023 /usr/bin/sleep
                                7f070155e000-7f07016bb000 r-xs 00026000 1985734 /usr/lib64/libc.so.6
                                7f0701733000-7f0701759000 r-xs 00001000 1985730 /usr/lib64/ld-linux-x86-64.so.2
                                7ffd4cdd3000-7ffd4cdd5000 r-xs 00000000 0 [vdso]
#
					

system wide


# perf record -a -e '{cycles,cache-misses}' sleep 2s
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 2.378 MB perf.data (6666 samples) ]
#
					

default report


# perf report --stdio
# Total Lost Samples: 0
#
# Samples: 6K of events 'anon group { cycles, cache-misses }'
# Event count (approx.): 647299225
#       Overhead  Command          Shared Object     Symbol
   9.73%   1.55%  swapper          [kernel.vmlinux]  [k] intel_idle_ibrs
   2.45%   5.96%  Isolated Web Co  libxul.so         [.] mozilla::EventListenerManager::MarkForCC()
   0.98%   0.44%  swapper          [kernel.vmlinux]  [k] psi_group_change
   0.72%   0.56%  swapper          [kernel.vmlinux]  [k] update_load_avg
   0.65%   0.19%  swapper          [kernel.vmlinux]  [k] timerqueue_add
   0.59%   0.50%  swapper          [kernel.vmlinux]  [k] cpuidle_enter_state
   0.58%   0.32%  swapper          [kernel.vmlinux]  [k] menu_select
   0.56%   0.00%  perf-exec        [kernel.vmlinux]  [k] release_pages
   0.55%   0.26%  swapper          [kernel.vmlinux]  [k] switch_mm_irqs_off
					

number of samples


# perf report --show-nr-samples --stdio | head -20
# Total Lost Samples: 0
#
# Samples: 6K of events 'anon group { cycles, cache-misses }'
# Event count (approx.): 647299225
#
#    Overhead  Samples  Command          Shared Object      Symbol
 9.73%  1.55%  493  44  swapper          [kernel.vmlinux]   [k] intel_idle_ibrs
 2.45%  5.96%   70  71  Isolated Web Co  libxul.so          [.] mozilla::EventListenerManager::MarkForCC()
 0.98%  0.44%   53  12  swapper          [kernel.vmlinux]   [k] psi_group_change
 0.72%  0.56%   45  16  swapper          [kernel.vmlinux]   [k] update_load_avg
 0.65%  0.19%   37   7  swapper          [kernel.vmlinux]   [k] timerqueue_add
 0.59%  0.50%   36  13  swapper          [kernel.vmlinux]   [k] cpuidle_enter_state
 0.58%  0.32%   36   7  swapper          [kernel.vmlinux]   [k] menu_select
 0.56%  0.00%    1   0  perf-exec        [kernel.vmlinux]   [k] release_pages
 0.55%  0.26%   22   9  swapper          [kernel.vmlinux]   [k] switch_mm_irqs_off
					

sort order


# perf report --show-nr-samples --sort dso --stdio | head -30
# Total Lost Samples: 0
#
# Samples: 6K of events 'anon group { cycles, cache-misses }'
# Event count (approx.): 647299225
#
#      Overhead    Samples  Shared Object
 52.27%  42.98%  2582 1242  [kernel.vmlinux]
 20.74%  26.64%   671  568  libxul.so
  8.88%   8.96%   210  189  [JIT] tid 7566
  5.46%   5.22%   210  139  libc.so.6
  3.99%   4.49%   146  105  firefox
  2.33%   2.03%    55   48  [JIT] tid 7578
  1.00%   1.04%    37   24  libglib-2.0.so.0.7600.4 (deleted)
  0.56%   0.12%    14    6  libfdk-aac.so.2.0.0
  0.43%   0.22%    16    4  [i915]
  0.41%   1.16%    24   34  libglib-2.0.so.0.7600.5
  0.38%   0.42%    16   10  libsystemd-shared-253.7-1.fc38.so
  0.37%   0.29%    18    8  [vdso]
  0.34%   0.41%    14    8  libnspr4.so (deleted)
  0.28%   0.41%    15   10  ld-linux-x86-64.so.2
  0.18%   0.11%     5    2  libmutter-12.so.0.0.0 (deleted)
  0.15%   0.33%     7    6  [JIT] tid 8358
  0.14%   0.36%    11   12  libwebkit2gtk-4.1.so.0.8.5
  0.13%   0.13%     5    3  libpixman-1.so.0.42.2
  0.13%   0.16%    10    5  libjavascriptcoregtk-4.1.so.0.3.14
#
					

sort orders


# perf report --stdio --sort bla
Error:
Unknown --sort key: `bla'
 Usage: perf report [<options>]

  -s, --sort <key[,key2...]>
      sort by key(s): overhead overhead_sys overhead_us overhead_guest_sys
      overhead_guest_us overhead_children sample period
      pid comm dso symbol parent cpu socket srcline srcfile
      local_weight weight transaction trace symbol_size
      dso_size cgroup cgroup_id ipc_null time code_page_size
      local_ins_lat ins_lat local_p_stage_cyc p_stage_cyc
      addr local_retire_lat retire_lat simd dso_from dso_to
      symbol_from symbol_to mispredict abort in_tx cycles
      srcline_from srcline_to ipc_lbr addr_from addr_to
      symbol_daddr dso_daddr locked tlb mem snoop dcacheline
      symbol_iaddr phys_daddr data_page_size blocked
#
					

dso,sym


# perf report --sort dso,sym --stdio | head -30
# Total Lost Samples: 0
#
# Samples: 6K of events 'anon group { cycles, cache-misses }'
# Event count (approx.): 647299225
#
#     Overhead  Shared Object      Symbol
  9.73%  1.55%  [kernel.vmlinux]   [k] intel_idle_ibrs
  2.45%  5.96%  libxul.so          [.] mozilla::EventListenerManager::MarkForCC()
  1.41%  0.16%  [kernel.vmlinux]   [k] entry_SYSCALL_64
  1.36%  0.05%  [kernel.vmlinux]   [k] syscall_return_via_sysret
  1.35%  0.09%  [kernel.vmlinux]   [k] syscall_exit_to_user_mode
  1.31%  1.19%  libc.so.6          [.] pthread_mutex_lock@@GLIBC_2.2.5
  1.30%  0.47%  [kernel.vmlinux]   [k] psi_group_change
  1.05%  1.11%  firefox            [.] malloc
  1.01%  0.69%  [kernel.vmlinux]   [k] update_load_avg
  0.81%  0.34%  libc.so.6          [.] __GI___pthread_mutex_unlock_usercnt
  0.80%  0.61%  [kernel.vmlinux]   [k] __update_load_avg_se
  0.78%  0.67%  [kernel.vmlinux]   [k] __schedule
  0.77%  0.28%  [kernel.vmlinux]   [k] timerqueue_add
  0.71%  0.84%  [kernel.vmlinux]   [k] __update_load_avg_cfs_rq
  0.67%  0.29%  [kernel.vmlinux]   [k] update_curr
  0.65%  0.39%  firefox            [.] free
  0.59%  0.50%  [kernel.vmlinux]   [k] cpuidle_enter_state
#
					

slicing



  • By time
  • Using start/stop events
  • Snapshot with a ring buffer in perf record
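
For instance, perf report can slice by time after the fact (a sketch; syntax per perf-report(1)):

# perf report --stdio --time 10%/1                  # first 10% of the recorded window
# perf report --stdio --time 388144.384,388144.385  # absolute start,stop timestamps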

Dynamic tracing/counting



  • kprobes
  • uprobes
  • USDT (node.js, for instance)
  • reading variables at probes
  • backtraces
  • examples below using 'perf stat' to count how many times some kprobe/uprobe was called
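
For instance, a hedged sketch counting calls to a kernel function system-wide:

# perf probe vfs_read                      # kprobe at function entry
# perf stat -e probe:vfs_read -a sleep 1   # count how often it fired in 1s
# perf probe -d probe:vfs_read             # remove the probe when done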

Debugging 1


[root@pumpkin c]# ./prometheusnoop
libbpf: prog 'gauge_1arg_xmm': BPF program load failed: Invalid argument
libbpf: prog 'gauge_1arg_xmm': -- BEGIN PROG LOAD LOG --
0: R1=ctx(off=0,imm=0) R10=fp0
; prometheus_metric_uprobe_1arg_xmm(gauge, valBits, true);
0: (bf) r6 = r1                       ; R1=ctx(off=0,imm=0) R6_w=ctx(off=0,imm=0)
; prometheus_metric_uprobe_1arg_xmm(gauge, valBits, true);
1: (79) r7 = *(u64 *)(r6 +80)         ; R6_w=ctx(off=0,imm=0) R7_w=scalar()
2: (18) r1 = 0x1800000000             ; R1_w=103079215104
; struct metric_event_parms args = {
4: (7b) *(u64 *)(r10 -40) = r1        ; R1_w=103079215104 R10=fp0 fp-40_w=103079215104
; .inc	      = bpf_read_64bit_xmm_register(0),
5: (b7) r1 = 0                        ; R1_w=0
6: (85) call bpf_bpf_read_64bit_xmm_register#212
kernel subsystem misconfigured func bpf_bpf_read_64bit_xmm_register#212
processed 6 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0
-- END PROG LOAD LOG --
libbpf: prog 'gauge_1arg_xmm': failed to load: -22
libbpf: failed to load object 'prometheusnoop_bpf'
libbpf: failed to load BPF skeleton 'prometheusnoop_bpf': -22
Failed to load and verify BPF skeleton
[root@pumpkin c]#
					

Where does a message come from?



  • "kernel subsystem misconfigured func bpf_bpf_read_64bit_xmm_register#212"
  • Implementing bpf_bpf_read_64bit_xmm_register
  • Without rebuilding the system component

Where?


$ git grep 'kernel subsystem misconfigured func'
kernel/bpf/verifier.c:   verbose(env, "kernel subsystem misconfigured func %s#%d: r1 != ctx\n",
kernel/bpf/verifier.c:   verbose(env, "kernel subsystem misconfigured func %s#%d\n",
kernel/bpf/verifier.c:         "kernel subsystem misconfigured func %s#%d\n",
$
					

Source


$ perf probe -L check_helper_call
<check_helper_call@/home/acme/git/linux/kernel/bpf/verifier.c:0>
      0  static int check_helper_call(struct bpf_verifier_env *env,
      				      struct bpf_insn *insn,
                                      int *insn_idx_p)
         {
<SNIP>
            err = check_func_proto(fn, func_id);
            if (err) {
     58         verbose(env, "kernel subsystem misconfigured func %s#%d\n",
     59                 func_id_name(func_id), func_id);
                return err;
            }
<SNIP>
$

					

Adding a probe


# perf probe check_helper_call:58
Added new event:
  probe:check_helper_call_L58 (on check_helper_call:58)

You can now use it in all perf tools, such as:

	perf record -e probe:check_helper_call_L58 -aR sleep 1

#
					

Tracing


# perf probe check_helper_call:58
Added new event:
  probe:check_helper_call_L58 (on check_helper_call:58)

You can now use it in all perf tools, such as:

	perf record -e probe:check_helper_call_L58 -aR sleep 1
                                        
# perf trace -e probe:check_helper_call_L58
     0.000 prometheusnoop/5535 probe:check_helper_call_L58(__probe_ip: ffffffff9f2ff270)
     0.078 prometheusnoop/5535 probe:check_helper_call_L58(__probe_ip: ffffffff9f2ff270)
^C#
					

With a backtrace


# perf trace -e probe:check_helper_call_L58/max-stack=9/ --max-events=1
     0.000 prometheusnoop/5659 probe:check_helper_call_L58(__probe_ip: ffffffff9f2ff270)
                                       check_helper_call ([kernel.kallsyms])
                                       do_check_common ([kernel.kallsyms])
                                       bpf_check ([kernel.kallsyms])
                                       bpf_prog_load ([kernel.kallsyms])
                                       __sys_bpf ([kernel.kallsyms])
                                       __x64_sys_bpf ([kernel.kallsyms])
                                       do_syscall_64 ([kernel.kallsyms])
                                       entry_SYSCALL_64 ([kernel.kallsyms])
                                       syscall (/usr/lib64/libc.so.6)
#
					
				

Counting events


# perf probe -x /lib64/libc.so.6 free
Added new event:
  probe_libc:free      (on free in /usr/lib64/libc.so.6)

You can now use it in all perf tools, such as:

	perf record -e probe_libc:free -aR sleep 1 
# perf stat -e probe_libc:* sleep 1

 Performance counter stats for 'sleep 1':

               185      probe_libc:free
               415      probe_libc:__libc_malloc

       1.003725291 seconds time elapsed

       0.000000000 seconds user
       0.003431000 seconds sys
#
					

perf trace



  • strace-like
  • more targets: system wide, CPUs, cgroups
  • less overhead: no ptrace
  • syscalls tracepoints
  • uses BPF to collect the contents of pointer arguments

strace


# perf trace sleep 1
         ? (         ): sleep/3501543  ... [continued]: execve())                                           = 0
     0.020 ( 0.001 ms): sleep/3501543 brk()                                                                 = 0x562e3ce8d000
     0.028 ( 0.001 ms): sleep/3501543 arch_prctl(option: 0x3001, arg2: 0x7ffd5bb13e80)                      = -1 EINVAL (Invalid argument)
     0.044 ( 0.004 ms): sleep/3501543 access(filename: "/etc/ld.so.preload", mode: R)                       = -1 ENOENT (No such file or directory)
     0.051 ( 0.004 ms): sleep/3501543 openat(dfd: CWD, filename: "/etc/ld.so.cache", flags: RDONLY|CLOEXEC) = 3
     0.056 ( 0.002 ms): sleep/3501543 newfstatat(dfd: 3, filename: "", statbuf: 0x7ffd5bb130a0, flag: 4096) = 0
     0.059 ( 0.005 ms): sleep/3501543 mmap(len: 54771, prot: READ, flags: PRIVATE, fd: 3)                   = 0x7f206c107000
     0.066 ( 0.001 ms): sleep/3501543 close(fd: 3)                                                          = 0
     0.072 ( 0.004 ms): sleep/3501543 openat(dfd: CWD, filename: "/lib64/libc.so.6", flags: RDONLY|CLOEXEC) = 3
     0.077 ( 0.002 ms): sleep/3501543 read(fd: 3, buf: 0x7ffd5bb13208, count: 832)                          = 832
     0.081 ( 0.001 ms): sleep/3501543 pread64(fd: 3, buf: 0x7ffd5bb12e00, count: 784, pos: 64)              = 784
     0.084 ( 0.001 ms): sleep/3501543 pread64(fd: 3, buf: 0x7ffd5bb12da0, count: 80, pos: 848)              = 80
     0.086 ( 0.001 ms): sleep/3501543 pread64(fd: 3, buf: 0x7ffd5bb12d50, count: 68, pos: 928)              = 68
     0.089 ( 0.002 ms): sleep/3501543 newfstatat(dfd: 3, filename: "", statbuf: 0x7ffd5bb130a0, flag: 4096) = 0
     0.092 ( 0.004 ms): sleep/3501543 mmap(len: 8192, prot: READ|WRITE, flags: PRIVATE|ANONYMOUS)           = 0x7f206c105000
     0.099 ( 0.001 ms): sleep/3501543 pread64(fd: 3, buf: 0x7ffd5bb12cf0, count: 784, pos: 64)              = 784
     0.101 ( 0.005 ms): sleep/3501543 mmap(len: 2104720, prot: READ, flags: PRIVATE|DENYWRITE, fd: 3)       = 0x7f206be00000
     0.108 ( 0.012 ms): sleep/3501543 mmap(addr: 0x7f206be28000, len: 1523712, prot: READ|EXEC, flags: PRIVATE|FIXED|DENYWRITE, fd: 3, off: 0x28000) = 0x7f206be28000
     0.120 ( 0.007 ms): sleep/3501543 mmap(addr: 0x7f206bf9c000, len: 360448, prot: READ, flags: PRIVATE|FIXED|DENYWRITE, fd: 3, off: 0x19c000) = 0x7f206bf9c000
     0.129 ( 0.010 ms): sleep/3501543 mmap(addr: 0x7f206bff4000, len: 24576, prot: READ|WRITE, flags: PRIVATE|FIXED|DENYWRITE, fd: 3, off: 0x1f3000) = 0x7f206bff4000
     0.143 ( 0.005 ms): sleep/3501543 mmap(addr: 0x7f206bffa000, len: 32144, prot: READ|WRITE, flags: PRIVATE|FIXED|ANONYMOUS) = 0x7f206bffa000
     0.162 ( 0.001 ms): sleep/3501543 close(fd: 3)                                                          = 0
     0.170 ( 0.005 ms): sleep/3501543 mmap(len: 12288, prot: READ|WRITE, flags: PRIVATE|ANONYMOUS)          = 0x7f206c102000
     0.178 ( 0.001 ms): sleep/3501543 arch_prctl(option: SET_FS, arg2: 0x7f206c102740)                      = 0
     0.180 ( 0.001 ms): sleep/3501543 set_tid_address(tidptr: 0x7f206c102a10)                               = 3501543 (sleep)
     0.182 ( 0.001 ms): sleep/3501543 set_robust_list(head: 0x7f206c102a20, len: 24)                        = 0
     0.185 ( 0.001 ms): sleep/3501543 rseq(rseq: 0x7f206c1030e0, rseq_len: 32, sig: 1392848979)             = 0
     0.223 ( 0.005 ms): sleep/3501543 mprotect(start: 0x7f206bff4000, len: 16384, prot: READ)               = 0
     0.239 ( 0.004 ms): sleep/3501543 mprotect(start: 0x562e3bb5b000, len: 4096, prot: READ)                = 0
     0.249 ( 0.005 ms): sleep/3501543 mprotect(start: 0x7f206c14a000, len: 8192, prot: READ)                = 0
     0.261 ( 0.001 ms): sleep/3501543 prlimit64(resource: STACK, old_rlim: 0x7ffd5bb13be0)                  = 0
     0.266 ( 0.006 ms): sleep/3501543 munmap(addr: 0x7f206c107000, len: 54771)                              = 0
     0.285 ( 0.002 ms): sleep/3501543 getrandom(ubuf: 0x7f206bfff4d8, len: 8, flags: NONBLOCK)              = 8
     0.288 ( 0.001 ms): sleep/3501543 brk()                                                                 = 0x562e3ce8d000
     0.290 ( 0.004 ms): sleep/3501543 brk(brk: 0x562e3ceae000)                                              = 0x562e3ceae000
     0.300 ( 0.007 ms): sleep/3501543 openat(dfd: CWD, filename: "", flags: RDONLY|CLOEXEC)                 = 3
     0.309 ( 0.002 ms): sleep/3501543 newfstatat(dfd: 3, filename: "", statbuf: 0x7f206bff9b80, flag: 4096) = 0
     0.312 ( 0.005 ms): sleep/3501543 mmap(len: 224096080, prot: READ, flags: PRIVATE, fd: 3)               = 0x7f205e800000
     0.322 ( 0.001 ms): sleep/3501543 close(fd: 3)                                                          = 0
     0.351 (1000.073 ms): sleep/3501543 clock_nanosleep(rqtp: { .tv_sec: 1, .tv_nsec: 0 }, rmtp: 0x7ffd5bb13e70) = 0
  1000.451 ( 0.003 ms): sleep/3501543 close(fd: 1)                                                          = 0
  1000.458 ( 0.002 ms): sleep/3501543 close(fd: 2)                                                          = 0
  1000.465 (         ): sleep/3501543 exit_group()                                                          = ?
#
					

more than strace


# perf trace --syscalls -e syscalls:sys_enter_getrandom --switch-on syscalls:sys_enter_getrandom -- sleep 1
        ? (         ): sleep/3501438  ... [continued]: getrandom())                            = 8
    0.002 ( 0.001 ms): sleep/3501438 brk()                                                     = 0x561a2fa09000
    0.004 ( 0.004 ms): sleep/3501438 brk(brk: 0x561a2fa2a000)                                  = 0x561a2fa2a000
    0.015 ( 0.007 ms): sleep/3501438 openat(dfd: CWD, filename: "", flags: RDONLY|CLOEXEC)     = 3
    0.024 ( 0.002 ms): sleep/3501438 newfstatat(dfd: 3, filename: "", statbuf: 0x7fee373f9b80, flag: 4096) = 0
    0.027 ( 0.005 ms): sleep/3501438 mmap(len: 224096080, prot: READ, flags: PRIVATE, fd: 3)   = 0x7fee29c00000
    0.037 ( 0.001 ms): sleep/3501438 close(fd: 3)                                              = 0
    0.068 (1000.047 ms): sleep/3501438 clock_nanosleep(rqtp: { .tv_sec: 1, .tv_nsec: 0 }, rmtp: 0x7ffd072477c0) = 0
 1000.137 ( 0.003 ms): sleep/3501438 close(fd: 1)                                              = 0
 1000.143 ( 0.002 ms): sleep/3501438 close(fd: 2)                                              = 0
 1000.151 (         ): sleep/3501438 exit_group()                                              = ?
#
					

page faults


# perf trace -h pagefault

 Usage: perf trace [<options>] [<command>]
    or: perf trace [<options>] -- <command> [<options>]
    or: perf trace record [<options>] [<command>]
    or: perf trace record [<options>] -- <command> [<options>]

    -F, --pf <all|maj|min>
                          Trace pagefaults

[root@five ~]#
					

page faults


# perf trace -F
     0.000 ( 0.000 ms): gnome-shell/10430 majfault [__memmove_avx_unaligned_erms+0x30a] => /dev/dri/card1@0x124aa8000 (d.)
  1631.809 ( 0.000 ms): rpmkeys/3184890 majfault [0x14f0] => /usr/bin/rpmkeys@0x14f0 (d.)
  1632.565 ( 0.000 ms): rpmkeys/3184890 majfault [0x89d0] => /usr/lib64/libpopt.so.0.0.2@0x89d0 (d.)
  1650.404 ( 0.000 ms): rpmkeys/3184890 majfault [0x64c80] => /usr/lib64/librpm_sequoia.so.1@0x64c80 (d.)
  2415.842 ( 0.000 ms): dnf/3183950 majfault [0x812bf] => /usr/lib64/libsqlite3.so.0.8.6@0x812bf (d.)
  3722.995 ( 0.000 ms): alternatives/3184948 majfault [padzero+0x38] => 0x563ba5e34004 (?k)
					

with backtraces


# perf trace -F --call-graph dwarf
     0.000 ( 0.000 ms): gnome-shell/10430 majfault [__memmove_avx_unaligned_erms+0x30a] => /dev/dri/card1@0x124aa8000 (d.)
                                       __memmove_avx_unaligned_erms (/usr/lib64/libc.so.6)
  1631.809 ( 0.000 ms): rpmkeys/3184890 majfault [0x14f0] => /usr/bin/rpmkeys@0x14f0 (d.)
                                       [0x14f0] (/usr/bin/rpmkeys)
  1632.565 ( 0.000 ms): rpmkeys/3184890 majfault [0x89d0] => /usr/lib64/libpopt.so.0.0.2@0x89d0 (d.)
                                       [0x89d0] (/usr/lib64/libpopt.so.0.0.2)
  1650.404 ( 0.000 ms): rpmkeys/3184890 majfault [0x64c80] => /usr/lib64/librpm_sequoia.so.1@0x64c80 (d.)
                                       [0x64c80] (/usr/lib64/librpm_sequoia.so.1)
  2415.842 ( 0.000 ms): dnf/3183950 majfault [0x812bf] => /usr/lib64/libsqlite3.so.0.8.6@0x812bf (d.)
                                       [0x812bf] (/usr/lib64/libsqlite3.so.0.8.6)
  3722.995 ( 0.000 ms): alternatives/3184948 majfault [padzero+0x38] => 0x563ba5e34004 (?k)
                                       padzero ([kernel.kallsyms])
                                       load_elf_binary ([kernel.kallsyms])
                                       bprm_execve ([kernel.kallsyms])
                                       do_execveat_common.isra.0 ([kernel.kallsyms])
                                       __x64_sys_execve ([kernel.kallsyms])
                                       do_syscall_64 ([kernel.kallsyms])
                                       entry_SYSCALL_64_after_hwframe ([kernel.kallsyms])
					

block layer


# perf list block:*

List of pre-defined events (to be used in -e or -M):

  block:block_bio_backmerge           [Tracepoint event]
  block:block_bio_bounce              [Tracepoint event]
  block:block_bio_complete            [Tracepoint event]
  block:block_bio_frontmerge          [Tracepoint event]
  block:block_bio_queue               [Tracepoint event]
  block:block_bio_remap               [Tracepoint event]
  block:block_dirty_buffer            [Tracepoint event]
  block:block_getrq                   [Tracepoint event]
  block:block_plug                    [Tracepoint event]
  block:block_rq_complete             [Tracepoint event]
  block:block_rq_error                [Tracepoint event]
  block:block_rq_insert               [Tracepoint event]
  block:block_rq_issue                [Tracepoint event]
  block:block_rq_merge                [Tracepoint event]
  block:block_rq_remap                [Tracepoint event]
  block:block_rq_requeue              [Tracepoint event]
  block:block_split                   [Tracepoint event]
  block:block_touch_buffer            [Tracepoint event]
  block:block_unplug                  [Tracepoint event]
#
					

problems



  • What is a block:block_plug?
  • Domain knowledge
  • Some documentation in the kernel sources
  • block_plug
  • 'perf list' should give access to that

block_plug


Parameters:

struct request_queue *q

    request queue to plug

Description:

Plug the request queue q. Do not allow block operation requests to be
sent to the device driver. Instead, accumulate requests in the queue
to improve throughput performance of the block device.
					

block layer


# perf trace -e block:* --max-events=20 --call-graph=fp
     0.000 dd/521824 block:block_dirty_buffer(dev: 8388608, sector: 428361, size: 4096)
                                       mark_buffer_dirty ([kernel.kallsyms])
                                       mark_buffer_dirty ([kernel.kallsyms])
                                       __block_commit_write.isra.0 ([kernel.kallsyms])
                                       block_write_end ([kernel.kallsyms])
                                       blkdev_write_end ([kernel.kallsyms])
                                       generic_perform_write ([kernel.kallsyms])
                                       __generic_file_write_iter ([kernel.kallsyms])
                                       blkdev_write_iter ([kernel.kallsyms])
                                       vfs_write ([kernel.kallsyms])
                                       ksys_write ([kernel.kallsyms])
                                       do_syscall_64 ([kernel.kallsyms])
                                       entry_SYSCALL_64 ([kernel.kallsyms])
                                       __GI___libc_write (/usr/lib64/libc.so.6)
                                       write_output (/usr/bin/dd)
                                       main (/usr/bin/dd)
                                       __libc_start_call_main (/usr/lib64/libc.so.6)
                                       __libc_start_main@@GLIBC_2.34 (/usr/lib64/libc.so.6)
                                       _start (/usr/bin/dd)
					

Reached USB


    12.896 usb-storage/467 block:block_rq_complete(dev: 8388608, sector: 3397560, nr_sector: 8, rwbs: "R", cmd: "")
                                       blk_update_request ([kernel.kallsyms])
                                       blk_update_request ([kernel.kallsyms])
                                       scsi_end_request ([kernel.kallsyms])
                                       scsi_io_completion ([kernel.kallsyms])
                                       usb_stor_control_thread ([usb_storage])
                                       kthread ([kernel.kallsyms])
                                       ret_from_fork ([kernel.kallsyms])
					

perf script



  • perl and python
  • Custom processing of events
  • From a perf.data file
  • Or directly from the standard input
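
Pipe mode needs no intermediate perf.data file, e.g.:

# perf record -o - -e cycles sleep 1 | perf script -i -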

perf script 2



  • Early support for customization
  • Still useful
  • firefox profiler script
  • Written by GSoC participant (Anup Sharma)
  • But being replaced by libbpf-based tools
  • Which aggregate near the origin
  • Reducing kernel/userspace traffic

Distributed scripts


# perf script -l
List of available trace scripts:
  rwtop [interval]                     system-wide r/w top
  wakeup-latency                       system-wide min/max/avg wakeup latency
  rw-by-pid                            system-wide r/w activity
  rw-by-file                     r/w activity for a program, by file
  failed-syscalls [comm]               system-wide failed syscalls
  futex-contention                     futex contention measurement
  netdev-times [tx] [rx] [dev=] [debug] display a process of packet and processing time
  export-to-sqlite [database name] [columns] [calls] export perf data to a sqlite3 database
  gecko                                create firefox gecko profile json format from perf.data
  mem-phys-addr                        resolve physical address samples
  sched-migration                      sched migration overview
  sctop [comm] [interval]              syscall top
  failed-syscalls-by-pid [comm]        system-wide failed syscalls, by pid
  intel-pt-events                      print Intel PT Events including Power Events and PTWRITE
  syscall-counts [comm]                system-wide syscall counts
  compaction-times [-h] [-u] [-p|-pv] [-t | [-m] [-fs] [-ms]] [pid|pid-range|comm-regex] display time taken by mm compaction
  task-analyzer                        analyze timings of tasks
  stackcollapse                        produce callgraphs in short form for scripting use
  syscall-counts-by-pid [comm]         system-wide syscall counts, by pid
  powerpc-hcalls
  export-to-postgresql [database name] [columns] [calls] export perf data to a postgresql database
  net_dropmonitor                      display a table of dropped frames
  event_analyzing_sample               analyze all perf samples
  flamegraph                           create flame graphs
#
					

rwtop: reads


# perf script rwtop
read counts by pid:

   pid             comm  # reads  bytes_req  bytes_read
------  ---------------  -------  ---------  ----------
724574  dd                  3936    2015232     2015232
 13277  pool-gsd-smartc       14      23600       23600
670459  sshd                  10     327680        2764
   937  systemd-oomd           4       2064        2064
 43738  MemoryPoller           3       3072        1559
437440  DNS Res~er #136        2       8192         797
437735  DNS Res~er #139        2       8192         797
 12995  pcscd                 14        112         112
 43488  firefox               67         67          67
  1128  thermald               3      24573          30
439339  Isolated Web Co       20         20          20
439899  Isolated Web Co       10         10          10
   938  systemd-resolve        4      16416           8
524177  Isolated Web Co        7          7           7
 43524  Socket Thread          5      10240           5
478958  Isolated Web Co        4          4           4
439563  Isolated Servic        4          4           4
 43749  Isolated Web Co        3          3           3
 43521  IPC I/O Parent         3          3           3
					

rwtop: writes


write counts by pid:

   pid             comm  # writes  bytes_written
------  ---------------  --------  -------------
724574  dd                   3936        2015232
670459  sshd                   10           3208
 43488  firefox                52             52
439339  Isolated Web Co        10             10
 43521  IPC I/O Parent          5              5
439344  Socket Thread           5              5
439899  Isolated Web Co         5              5
524177  Isolated Web Co         4              4
 43525  IPDL Background         4              4
439348  Timer                   4              4
 43734  URL Classifier          4              4
439912  Timer                   3              3
 43524  Socket Thread           3              3
 43522  Timer                   3              3
439907  IPC I/O Child           2              2
478967  Timer                   2              2
724466  Backgro~l #1146         2              2
478958  Isolated Web Co         2              2
 43749  Isolated Web Co         2              2
#
					

perf script workflow



  • perf record events wanted
  • Generate script skeleton from perf.data
  • Write logic
  • Run with perf script

Example



  • malloc requests

List the function


# perf probe -x /lib64/libc.so.6 -V malloc
Available variables at malloc
        @<__libc_malloc+0>
                char*   __PRETTY_FUNCTION__
                size_t  bytes
#
					

Dynamic probe


# perf probe -x /lib64/libc.so.6 __libc_malloc bytes
Added new event:
  probe_libc:__libc_malloc (on __libc_malloc in /usr/lib64/libc.so.6 with bytes)

You can now use it in all perf tools, such as:

	perf record -e probe_libc:__libc_malloc -aR sleep 1

#
					

Record 5 seconds


# perf record -a -e probe_libc:__libc_malloc sleep 5s
[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 2.668 MB perf.data (8587 samples) ]
#
					

Did we find some?


# perf script | head
   perf  691772 [007] 361976.720228: probe_libc:__libc_malloc: (7ff7e3cbeb90) bytes=0x40
   perf  691772 [007] 361976.720265: probe_libc:__libc_malloc: (7ff7e3cbeb90) bytes=0xa8
  sleep  691773 [002] 361976.721292: probe_libc:__libc_malloc: (7fd79aea0b90) bytes=0x5
  sleep  691773 [002] 361976.721319: probe_libc:__libc_malloc: (7fd79aea0b90) bytes=0x1d8
  sleep  691773 [002] 361976.721329: probe_libc:__libc_malloc: (7fd79aea0b90) bytes=0x1000
  sleep  691773 [002] 361976.721336: probe_libc:__libc_malloc: (7fd79aea0b90) bytes=0x640
  sleep  691773 [002] 361976.721339: probe_libc:__libc_malloc: (7fd79aea0b90) bytes=0x400
  sleep  691773 [002] 361976.721352: probe_libc:__libc_malloc: (7fd79aea0b90) bytes=0x5
  sleep  691773 [002] 361976.721354: probe_libc:__libc_malloc: (7fd79aea0b90) bytes=0x33
  sleep  691773 [002] 361976.721357: probe_libc:__libc_malloc: (7fd79aea0b90) bytes=0x33
#
                                        
# perf script | wc -l
8587
#
					

A new script

# perf evlist
probe_libc:__libc_malloc
dummy:HG
# Tip: use 'perf evlist --trace-fields' to show fields for tracepoint events
#
                                        
# perf evlist --trace-fields
probe_libc:__libc_malloc: trace_fields: __probe_ip,bytes
dummy:HG: (not a tracepoint)
#
                                        
# perf script -g python
generated Python script: perf-script.py
#
                                        
# ls -la perf-script.py
-rw-r--r--. 1 root root 2065 Sep 17 11:26 perf-script.py
#
					

Running it


# mv perf-script.py mallocs.py
# perf script -s mallocs.py  | tail
probe_libc____libc_malloc  2 361981.720766854  405742 gnome-control-c
  __probe_ip=139765834955664, bytes=48 
  Sample: {cpu=2, cpumode=2, period=1, tid=405742} 
probe_libc____libc_malloc  4 361981.722412299  463773 weechat
  __probe_ip=140083330030480, bytes=15
  Sample: {cpu=4, cpumode=2, period=1, tid=463773}
in trace_end
#
					

skeleton

def print_header(event_name, cpu, secs, nsecs, pid, comm):
	print("%-20s %5u %05u.%09u %8u %-20s " % \
	(event_name, cpu, secs, nsecs, pid, comm), end="")
                                        
def trace_begin():
	print("in trace_begin")
                                        
def trace_end():
	print("in trace_end")
                                        
def probe_libc____libc_malloc(event_name, context, common_cpu,
			      common_secs, common_nsecs, common_pid,
			      common_comm, common_callchain, __probe_ip, bytes,
			      perf_sample_dict):

		print_header(event_name, common_cpu, common_secs, common_nsecs,
			     common_pid, common_comm)
		print("__probe_ip=%u, bytes=%u" % (__probe_ip, bytes))
		print('Sample: {'+get_dict_as_string(perf_sample_dict['sample'], ', ')+'}')
					

Do something



  • Edit it
  • Populate a dictionary keyed by malloc size
  • Bucketize in the event handler
  • Print in the trace_end handler
  • Only print sizes with at least 64 mallocs

Initialization


sizes = {}    # number of mallocs seen per requested size
					

event handler


def probe_libc____libc_malloc(event_name, context, common_cpu,
                              common_secs, common_nsecs,
			      common_pid, common_comm,
                              common_callchain, __probe_ip,
			      bytes, perf_sample_dict): 
    if bytes in sizes:
        sizes[bytes] += 1
    else:
        sizes[bytes] = 1
					

trace_end handler


def trace_end():
    for size in sorted(sizes.keys()):
        nr_entries = sizes[size]
        if (nr_entries >= 64):
            print("%6d: %4d" % (size, nr_entries))
					

running it

# perf script -s mallocs.py
     5:  342
    16: 1247
    24:  328
    27:   82
    32: 1248
    35:   80
    38:  104
    40:  320
    41:  119
    44:   70
    48:  340
    56:  349
    64:  958
    96:  312
   112:  298
   128:  594
   472:  419
   800:   88
  4096:  309
  8192:   98
#
					

Command line arg


import argparse

min_nr_entries = 64
def trace_end():
    for size in sorted(sizes.keys()):
        nr_entries = sizes[size]
        if (nr_entries >= min_nr_entries):
            print("%6d: %4d" % (size, nr_entries))
					
if __name__ == '__main__':
    parser = argparse.ArgumentParser("Show malloc histogram")
    parser.add_argument('--min_nr_entries', default=64,
    			help='Show sizes with at least min_nr_entries')
                                        
    # Parse the command-line arguments
    args = parser.parse_args()
    min_nr_entries = int(args.min_nr_entries)
					

Help


# perf script -s mallocs.py -- --help
usage: show malloc histogram [-h] [--min_nr_entries MIN_NR_ENTRIES]

options:
  -h, --help            show this help message and exit
  --min_nr_entries MIN_NR_ENTRIES
                        Show sizes with at least min_nr_entries
#
					

Running it


# perf script -s mallocs.py -- --min_nr_entries 512
    16: 1247
    32: 1248
    64:  958
   128:  594
# 
					

BPF



  • Plugs into the existing infrastructure
  • kprobes, uprobes, tracepoints, etc
  • bytecode
  • maps: hash tables, arrays, etc
  • Aggregates early
  • No perf.data
  • Right after the event takes place
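
The early aggregation lives in BPF maps, which userspace can inspect, e.g. with bpftool (map id 24 comes from the earlier bpftool prog listing):

# bpftool map list | head -4   # enumerate loaded maps
# bpftool map dump id 24       # dump one map's key/value pairs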

runqslower



  • Reimplementation of BCC's runqslower
  • Traces high scheduling delays
  • BPF skeleton
  • auto-generated from BPF object file
  • memory-mapped interface to global data
  • Auto-generation of "relocatable" vmlinux.h
  • Necessary for BTF-typed raw tracepoints with direct memory access
Building it:
$ make -C tools/bpf/runqslower/
make: Entering directory '/home/acme/git/bpf/tools/bpf/runqslower'
  LINK     /home/acme/git/bpf/tools/bpf/runqslower/.output//libbpf/libbpf.a
  LINK     /home/acme/git/bpf/tools/bpf/runqslower/.output/bpftool
  INSTALL  bpftool
  GEN      vmlinux.h
  BPF      runqslower.bpf.o
  GEN-SKEL runqslower.skel.h
  CC       runqslower.o
  BINARY   runqslower
make: Leaving directory '/home/acme/git/bpf/tools/bpf/runqslower'
					
Resulting binary:
$ file tools/bpf/runqslower/.output/runqslower
tools/bpf/runqslower/.output/runqslower: ELF 64-bit LSB executable, x86-64,
version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2,
for GNU/Linux 3.2.0,
BuildID[sha1]=e7986148fa9629ef9ca8ea1fd76c3a0bbdf70b06, with debug_info,
not stripped
					
$ ls -la tools/bpf/runqslower/.output/runqslower
-rwxrwxr-x. 1 507016 tools/bpf/runqslower/.output/runqslower
$
					
$ strip tools/bpf/runqslower/.output/runqslower
$ ls -la tools/bpf/runqslower/.output/runqslower
-rwxrwxr-x. 1 249208 tools/bpf/runqslower/.output/runqslower
$
					
Running it
$ sudo tools/bpf/runqslower/.output/runqslower 500
Tracing run queue latency higher than 500 us
TIME     COMM             PID           LAT(us)
19:00:11 swapper/3        22111            1904
19:00:30 weechat          41               1244
19:00:30 weechat          41               1316
^C
$
					
Benefits

  • Boilerplate being reduced
  • Everything needed in one binary
  • Small
  • Runs in any kernel
  • That has the subset of struct fields used
How it works

  • Similar to BCC
  • Userspace part
  • BPF/kernel part
  • Natural struct field deref
  • No bpf_probe_read
Common userspace/BPF header
$ cat tools/bpf/runqslower/runqslower.h
/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
#ifndef __RUNQSLOWER_H
#define __RUNQSLOWER_H

#define TASK_COMM_LEN 16

struct event {
	char task[TASK_COMM_LEN];
	__u64 delta_us;
	pid_t pid;
};

#endif /* __RUNQSLOWER_H */
$
					
Userspace part

  • tools/bpf/runqslower/runqslower.c
  • Uses argp.h for command line options
  • Initializes global data with those options
  • Sends the BPF bytecode to the kernel
  • Sets up perf ring buffer
  • Reads events
  • All using tools/lib/bpf (libbpf)
Userspace part: event handler
#include <stdio.h>
#include <time.h>
#include <bpf/bpf.h>
#include "runqslower.h"
#include "runqslower.skel.h"

void handle_event(void *ctx, int cpu, void *data, __u32 data_sz)
{
	const struct event *e = data;
	struct tm *tm;
	char ts[32];
	time_t t;

	time(&t);
	tm = localtime(&t);
	strftime(ts, sizeof(ts), "%H:%M:%S", tm);
	printf("%-8s %-16s %-6d %14llu\n", ts, e->task, e->pid, e->delta_us);
}
					
Userspace part: main loop
int main(int argc, char **argv)
{
        err = argp_parse(&argp, argc, argv, 0, NULL, NULL);
        obj = runqslower_bpf__open();
        obj->rodata->min_us = env.min_us; // arg, BPF global var
					
        err = runqslower_bpf__load(obj);
        err = runqslower_bpf__attach(obj);
					
        printf("%-8s %-16s %-6s %14s\n", "TIME", "COMM", "PID", "LAT(us)");

        pb_opts.sample_cb = handle_event;
        pb_opts.lost_cb = handle_lost_events;
        pb = perf_buffer__new(bpf_map__fd(obj->maps.events), 64, &pb_opts);
					
        while ((err = perf_buffer__poll(pb, 100)) >= 0);
}
					
Kernel/BPF part

  • tools/bpf/runqslower/runqslower.bpf.c
  • Uses a per-pid hash map for timestamps
  • To calculate the deltas/latencies
  • Connects to BTF tracepoints
  • To use normal pointer dereference
  • Sets up events
  • Pushes to userspace via perf ring buffer
BPF/kernel program
$ cat tools/bpf/runqslower/runqslower.bpf.c
#include "vmlinux.h"
#include <bpf_helpers.h>
#include "runqslower.h"
					
const volatile __u64 min_us = 0;
const volatile pid_t targ_pid = 0;
					
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 10240);
	__type(key, u32);
	__type(value, u64);
} start SEC(".maps");
					
struct {
	__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
	__uint(key_size, sizeof(u32));
	__uint(value_size, sizeof(u32));
} events SEC(".maps");
				
Symbols
$ readelf -sW tools/bpf/runqslower/.output/runqslower.bpf.o | \
	egrep min_us\|targ_pid
   Num:  Size  Type    Bind    Vis     Ndx  Name
    18:   8    OBJECT  GLOBAL  DEFAULT  5   min_us
    20:   4    OBJECT  GLOBAL  DEFAULT  5   targ_pid
$
					
BPF/kernel program
__always_inline static int trace_enqueue(u32 tgid, u32 pid)
{
	if (!pid || (targ_pid && targ_pid != pid))
		return 0;
	u64 ts = bpf_ktime_get_ns();
	bpf_map_update_elem(&start, &pid, &ts, 0);
	return 0;
}
					
SEC("tp_btf/sched_wakeup") int handle__sched_wakeup(u64 *ctx)
{
	// TP_PROTO(struct task_struct *p)

	struct task_struct *p = (void *)ctx[0];
	return trace_enqueue(p->tgid, p->pid);
}
					
BPF/kernel program: validation
SEC("tp_btf/sched_switch") int handle__sched_switch(u64 *ctx)
{
	// TP_PROTO(bool preempt, struct task_struct *prev,
	//			  struct task_struct *next)

	struct task_struct *prev = (struct task_struct *)ctx[1];
	struct task_struct *next = (struct task_struct *)ctx[2];
	struct event event = {};
	u32 pid = next->pid;
					
	u64 *tstamp = bpf_map_lookup_elem(&start, &pid);
	if (!tstamp)
		return 0;   /* missed enqueue */
					
	u64 delta_us = (bpf_ktime_get_ns() - *tstamp) / 1000;
	if (min_us && delta_us <= min_us)
		return 0;
					
BPF/kernel program: push to userspace
	event.pid = pid;
	event.delta_us = delta_us;

	bpf_get_current_comm(&event.task, sizeof(event.task));
					
	bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU,
			      &event, sizeof(event));
					
	bpf_map_delete_elem(&start, &pid);
	return 0;
}
					
BPF object details
$ ls -la tools/bpf/runqslower/.output/runqslower.bpf.o
-rw-rw-r--. 1 31688 tools/bpf/runqslower/.output/runqslower.bpf.o
					
$ file tools/bpf/runqslower/.output/runqslower.bpf.o
tools/bpf/runqslower/.output/runqslower.bpf.o: ELF 64-bit LSB relocatable,
        eBPF, version 1 (SYSV), not stripped
$
					
$ eu-readelf -SW tools/bpf/runqslower/.output/runqslower.bpf.o | grep \.rel
Section Headers:
[Nr] Name                   Type Size
[11] .reltp_btf/sched_wakeup REL   20
[13] .reltp_btf/sched_switch REL   60
[14] .rel.BTF                REL   50
[15] .rel.BTF.ext            REL  460
$
					

BPF_PROG_TYPE_STRUCT_OPS



  • TCP Congestion control algorithms
  • Pointer table
  • Faster turnaround for testing new algos
  • Builds on top of CO-RE infra
  • General mechanism for any kernel struct ops
TCP congestion control modules
$ cd net/ipv4/
$ grep -l 'tcp_register_congestion_control(&' tcp_*.c
tcp_bbr.c
tcp_bic.c
tcp_cdg.c
tcp_cubic.c
tcp_dctcp.c
tcp_highspeed.c
tcp_htcp.c
tcp_hybla.c
tcp_illinois.c
tcp_lp.c
tcp_nv.c
tcp_scalable.c
tcp_vegas.c
tcp_veno.c
tcp_westwood.c
tcp_yeah.c
$
					
Example: DCTCP in BPF

  • Data Center TCP
  • tools/testing/selftests/bpf/progs/bpf_dctcp.c
  • Just for testing
  • Not the same as net/ipv4/tcp_dctcp.c
  • Helpers for tcp_sock/inet_connection_sock structs/logic
  • structs with same name as in kernel
  • But with just the fields needed for DCTCP
Passing arguments
static unsigned int dctcp_shift_g = 4; /* g = 1/2^4 */
static unsigned int dctcp_alpha_on_init = DCTCP_MAX_ALPHA;
					
Subset of struct tcp_sock
// tools/testing/selftests/bpf/bpf_tcp_helpers.h
struct tcp_sock {
       struct inet_connection_sock     inet_conn;
       __u32   rcv_nxt, snd_nxt, snd_una;
       __u8    ecn_flags;
       __u32   delivered;
       __u32   delivered_ce;
       __u32   snd_cwnd, snd_cwnd_cnt, snd_cwnd_clamp;
       __u32   snd_ssthresh;
       __u8    syn_data:1,     /* SYN includes data */
               syn_fastopen:1, /* SYN includes Fast Open option */
               syn_fastopen_exp:1,/* SYN includes Fast Open exp. option */
               syn_fastopen_ch:1, /* Active TFO re-enabling probe */
               syn_data_acked:1,/* data in SYN is acked by SYN-ACK */
               save_syn:1,     /* Save headers of SYN packet */
               is_cwnd_limited:1,/* forward progress limited by snd_cwnd? */
               syn_smc:1;      /* SYN includes SMC */
       __u32   max_packets_out;
       __u32   lsndtime;
       __u32   prior_cwnd;
} __attribute__((preserve_access_index));
					
The full 'struct tcp_sock'
$ pahole tcp_sock | grep size: -A4
	/* size: 2192, cachelines: 35, members: 135 */
	/* sum members: 2181, holes: 3, sum holes: 8 */
	/* sum bitfield members: 24 bits (3 bytes) */
	/* paddings: 3, sum paddings: 10 */
	/* last cacheline: 16 bytes */
$
					
BPF_MAP_TYPE_STRUCT_OPS

  • To register/unregister/introspect struct ops
  • Receives tcp_congestion_ops pointer
  • Populates map
  • bpf map dump
  • Shows how many users (refcnt)
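
A hedged sketch of that flow with bpftool (subcommands per bpftool-struct_ops(8); assumes the struct_ops map in bpf_dctcp.c is named dctcp):

# bpftool struct_ops register bpf_dctcp.o   # load the object and register the ops
# bpftool struct_ops list
# bpftool struct_ops dump name dctcp        # includes the refcnt
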
struct ops init()
SEC("struct_ops/dctcp_init")
void BPF_PROG(dctcp_init, struct sock *sk)
{
	const struct tcp_sock *tp = tcp_sk(sk);
	struct dctcp *ca = inet_csk_ca(sk);

	ca->prior_rcv_nxt = tp->rcv_nxt;
	ca->dctcp_alpha = min(dctcp_alpha_on_init, DCTCP_MAX_ALPHA);
	ca->loss_cwnd = 0;
	ca->ce_state = 0;

	dctcp_reset(tp, ca);
}
					
struct ops ssthresh
SEC("struct_ops/dctcp_ssthresh")
__u32 BPF_PROG(dctcp_ssthresh, struct sock *sk)
{
    struct dctcp *ca = inet_csk_ca(sk);
    struct tcp_sock *tp = tcp_sk(sk);

    ca->loss_cwnd = tp->snd_cwnd;
    return max(tp->snd_cwnd - ((tp->snd_cwnd * ca->dctcp_alpha) >> 11U), 2U);
}
					

pahole



  • Type information
  • Introspection
  • Optimization
  • Better memory hierarchy usage
  • CPU cache

shrinking sockets



  • Linux 2.4
  • struct sock
  • Big union
  • All protocols

shrunk socket



  • Reduce CPU cache utilization
  • Remove alignment holes
  • Reorder struct fields
  • Combine some bitfields
  • Demote others
  • To better pack

socket hierarchy



  • struct sock
  • struct tcp_sock
  • struct tcp6_sock
  • struct udp_sock
  • And all the other protocols

Neverending story


Date: Mon Nov 15 11:02:37 2021 -0800
From: Eric Dumazet 

Move sk_bind_phc next to sk_peer_lock to fill a hole.

@@ -489,5 +489,6 @@  struct sock {
	u16			sk_busy_poll_budget;
 #endif
	spinlock_t		sk_peer_lock;
+	int			sk_bind_phc;
	struct pid		*sk_peer_pid;
	const struct cred	*sk_peer_cred;
@@ -498,5 +499,4 @@  struct sock {
	seqlock_t		sk_stamp_seq;
 #endif
	u16			sk_tsflags;
-	int			sk_bind_phc;
	u8			sk_shutdown;
					

Some more


$ git log -2 --oneline 1ace2b4d2b4e1db8f
1ace2b4d2b4e1db8 net: shrink struct sock by 8 bytes
1b31debca8328448 ipv6: shrink struct ipcm6_cookie
$
					

Tedious



  • Manual
  • Can we automate this?
  • gdb knows about types, how?

DWARF



  • Debugging With Arbitrary Record Formats
  • Executable and Linkable Format's friend
  • More recently we got ORC too, in the Linux kernel

pahole



  • Read DWARF
  • Rebuild source from type info
  • Augmented with alignment info
  • Showing the holes
  • Cacheline boundaries
  • Lots more

Example


$ pahole -C list_head ~/git/build/v5.18-rc6+/vmlinux
struct list_head {
	struct list_head *         next;      /*   0  8 */
	struct list_head *         prev;      /*   8  8 */

	/* size: 16, cachelines: 1, members: 2 */
	/* last cacheline: 16 bytes */
};
$
					

CERN ATLAS Migration



An alignment hole


# pahole -C _IO_FILE ~/bin/perf | head
struct _IO_FILE {
	int                        _flags;         /*   0  4 */

	/* XXX 4 bytes hole, try to pack */

	char *                     _IO_read_ptr;   /*   8  8 */
	char *                     _IO_read_end;   /*  16  8 */
	char *                     _IO_read_base;  /*  24  8 */
	char *                     _IO_write_base; /*  32  8 */
	char *                     _IO_write_ptr;  /*  40  8 */
#
					

pahole --reorganize



  • To eliminate holes
  • Demotes bitfields
  • --show_reorg_steps
  • __aligned__ attribute
  • false sharing

pahole --reorganize 2



  • DWARF now has an alignment attribute (DW_AT_alignment)
  • Use that in the reorg algo

--reorganize example


$ pahole --reorganize task_struct | tail
   /* --- cacheline 142 boundary (9088 bytes) was 56 bytes ago --- */
   struct thread_struct thread __attribute__((__aligned__(64))); /* 9144 4416 */

    /* size: 13560, cachelines: 212, members: 252 */
    /* sum members: 13487, holes: 2, sum holes: 57 */
    /* sum bitfield members: 79 bits, bit holes: 2, sum bit holes: 49 bits */
    /* paddings: 6, sum paddings: 49 */
    /* forced alignments: 2, forced holes: 1, sum forced holes: 56 */
    /* last cacheline: 56 bytes */
};  /* saved 136 bytes and 2 cachelines! */
$
					

Using pahole



  • Show many examples for data structure querying:
  • - What structures have members of some type
  • - Ditto for pointers
  • - BTF for ubiquitous kernel data structure querying

contains


# pahole --contains sock 
inet_sock
unix_sock
packet_sock
alg_sock
bpf_ctx_convert
netlink_sock
mctp_sock
xdp_sock
#
					

contains inet_sock


# pahole --contains inet_sock
inet_connection_sock
raw_sock
udp_sock
sctp_sock
raw6_sock
#
					

contains inet_connection_sock


# pahole --contains inet_connection_sock
tcp_sock
mptcp_sock
#
					

tcp_sock


# pahole tcp_sock | head
struct tcp_sock {
	struct inet_connection_sock inet_conn;       /*     0  1400 */
	/* --- cacheline 21 boundary (1344 bytes) was 56 bytes ago --- */
	u16                        tcp_header_len;   /*  1400  2 */
	u16                        gso_segs;         /*  1402  2 */
	__be32                     pred_flags;       /*  1404  4 */
	/* --- cacheline 22 boundary (1408 bytes) --- */
	u64                        bytes_received;   /*  1408  8 */
	u32                        segs_in;          /*  1416  4 */
	u32                        data_segs_in;     /*  1420  4 */
#
					
--expand_types
# pahole --expand_types tcp_sock | head
struct tcp_sock {
	struct inet_connection_sock {
		struct inet_sock {
			struct sock {
				struct sock_common {
					union {
						/* typedef __addrpair -> __u64 */ long long unsigned int skc_addrpair; /* 0 8 */
						struct {
							/* typedef __be32 -> __u32 */ unsigned int   skc_daddr;     /* 0 4 */
							/* typedef __be32 -> __u32 */ unsigned int   skc_rcv_saddr; /* 4 4 */
						};                                                                  /* 0 8 */
					};                                                                          /* 0 8 */

#
					

perf c2c/Data profiling



  • Example from presentation at Plumbers Dublin/2022
  • Example from presentation at devconf.cz, from Joe Mario: postgresql lock

What is this about?



  • Data structures
  • Optimal field ordering
  • CPU Cache usage
  • False Sharing
  • Avoiding unintended outcomes

Why do we care?



  • Cachelines
  • Cache hierarchy
  • Cache coherency
  • L1, L2, LLC
  • NUMA: Remote memory
  • Load latencies
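
perf c2c automates spotting contended cachelines; a minimal session sketch:

# perf c2c record -a -- sleep 5     # sample loads and stores system-wide
# perf c2c report --stats --stdio   # HITM stats and the hottest cachelines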

Problems?



  • Non cache aligned arrays
  • Read-mostly and write-mostly variables placed close together
  • Lots fixed over the years
  • Learn from it!

Tedious



  • Being done already
  • Manually
  • Tinkering
  • Move, rebuild, perf stat it

Manually doing it



  • Look for "false sharing" in the git log
  • Eric Dumazet does it a lot

First example


commit 91b6d325635617540b6a1646ddb138bb17cbd569
Author: Eric Dumazet 
Date:   Mon Nov 15 11:02:39 2021 -0800

    net: cache align tcp_memory_allocated, tcp_sockets_allocated

    tcp_memory_allocated and tcp_sockets_allocated often share
    a common cache line, source of false sharing.
					

The change


+++ b/net/ipv4/tcp.c
-atomic_long_t tcp_memory_allocated; // Current allocated memory
+atomic_long_t tcp_memory_allocated ____cacheline_aligned_in_smp; // Current allocated memory

-struct percpu_counter tcp_sockets_allocated;
+struct percpu_counter tcp_sockets_allocated ____cacheline_aligned_in_smp;
					

bcache_dev_sectors_dirty_add()



  • Always set_bit(stripe, full_dirty_stripes)
  • Cacheline being invalidated constantly
  • 100G dirty data with 20 threads
  • 50 times slower
  • Change to test_bit() first
  • Mingzhe Zou
  • 7b1002f7cfe581930f63787a0b3de0144e61ed55

The change


+++ b/drivers/md/bcache/writeback.c
                sectors_dirty = atomic_add_return(s,
                                        d->stripe_sectors_dirty + stripe);
-               if (sectors_dirty == d->stripe_size)
-                       set_bit(stripe, d->full_dirty_stripes);
-               else
-                       clear_bit(stripe, d->full_dirty_stripes);
+               if (sectors_dirty == d->stripe_size) {
+                       if (!test_bit(stripe, d->full_dirty_stripes))
+                               set_bit(stripe, d->full_dirty_stripes);
+               } else {
+                       if (test_bit(stripe, d->full_dirty_stripes))
+                               clear_bit(stripe, d->full_dirty_stripes);
+               }
					

page_counter


  • cgroup perf regression
  • 'usage' and 'parent' on the same cacheline
  • 'parent' is mostly read
  • 'usage' is mostly written
  • False sharing
  • Feng Tang used 'perf c2c' to detect it
  • https://lore.kernel.org/lkml/20201102091543.GM31092@shao2-debian
  • 802f1d522d5fdaefc2b935141bc8fe03d43a99ab

The change


+++ b/include/linux/page_counter.h
@@ -12,7 +12,6 @@ struct page_counter {
        unsigned long low;
        unsigned long high;
        unsigned long max;
-       struct page_counter *parent;

        /* effective memory.min and memory.min usage tracking */
        unsigned long emin;
@@ -26,6 +25,12 @@ struct page_counter {
        unsigned long watermark;
        unsigned long failcnt;
+       /*
+        * 'parent' is placed here to be far from 'usage' to reduce cache
+        * false sharing, as 'usage' is written mostly while parent is
+        * frequently read for cgroup's hierarchical counting nature.
+        */
+       struct page_counter *parent;
 };
					

The new struct


$ pahole page_counter
struct page_counter {
	atomic_long_t        usage;                /*   0   8 */
	long unsigned int    min;                  /*   8   8 */
	long unsigned int    low;                  /*  16   8 */
	long unsigned int    high;                 /*  24   8 */
	long unsigned int    max;                  /*  32   8 */
	long unsigned int    emin;                 /*  40   8 */
	atomic_long_t        min_usage;            /*  48   8 */
	atomic_long_t        children_min_usage;   /*  56   8 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	long unsigned int    elow;                 /*  64   8 */
	atomic_long_t        low_usage;            /*  72   8 */
	atomic_long_t        children_low_usage;   /*  80   8 */
	long unsigned int    watermark;            /*  88   8 */
	long unsigned int    failcnt;              /*  96   8 */
	struct page_counter *parent;               /* 104   8 */

	/* size: 112, cachelines: 2, members: 14 */
	/* last cacheline: 48 bytes */
};
$
					

Layout



  • Type info: DWARF, BTF
  • BTF now always available
  • per-cpu variable types

BTF



  • Compact
  • Yeah, BPF stuff
  • Used with BPF's CO-RE
  • And other BPF features
  • /sys/kernel/btf/

Small


$ cd /sys/kernel/btf
$ ls -lh vmlinux
-r--r--r--. 1 root root 5.1M Sep  8 20:38 vmlinux
$
$ ls -lh i915
-r--r--r--. 1 root root 556K Sep 12 09:29 i915
$
$ ls -1 | wc -l
204
$ lsmod | wc -l
204
$ lsmod | head -2
Module                  Size  Used by
sctp                  434176  28
$
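
Consuming that BTF from C is a few libbpf calls (a minimal sketch, assuming
libbpf >= 1.0; link with -lbpf):

#include <stdio.h>
#include <bpf/btf.h>

int main(void)
{
	struct btf *btf = btf__load_vmlinux_btf(); /* parses /sys/kernel/btf/vmlinux */
	int id;

	if (!btf)
		return 1;

	id = btf__find_by_name_kind(btf, "page_counter", BTF_KIND_STRUCT);
	if (id > 0)
		printf("struct page_counter: %lld bytes\n",
		       (long long)btf__resolve_size(btf, id));

	btf__free(btf);
	return 0;
}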

					

pahole


$ pahole page_pool
struct page_pool {
	struct page_pool_params  p;                        /*    0    56 */
	struct delayed_work      release_dw;               /*   56    88 */
	/* XXX last struct has 4 bytes of padding */

	/* --- cacheline 2 boundary (128 bytes) was 16 bytes ago --- */
	void                     (*disconnect)(void *);    /*  144     8 */
	long unsigned int        defer_start;              /*  152     8 */
	long unsigned int        defer_warn;               /*  160     8 */
	u32                      pages_state_hold_cnt;     /*  168     4 */
	unsigned int             frag_offset;              /*  172     4 */
	struct page *            frag_page;                /*  176     8 */
	long int                 frag_users;               /*  184     8 */
	/* --- cacheline 3 boundary (192 bytes) --- */
	u32                      xdp_mem_id;               /*  192     4 */
	/* XXX 60 bytes hole, try to pack */

	/* --- cacheline 4 boundary (256 bytes) --- */
	struct pp_alloc_cache    alloc __attribute__((__aligned__(64))); /*  256  1032 */
	/* XXX 56 bytes hole, try to pack */

	/* --- cacheline 21 boundary (1344 bytes) --- */
	struct ptr_ring          ring __attribute__((__aligned__(64))); /* 1344   192 */
	/* XXX last struct has 48 bytes of padding */
	/* --- cacheline 24 boundary (1536 bytes) --- */
	atomic_t                 pages_state_release_cnt;  /* 1536     4 */
	refcount_t               user_cnt;                 /* 1540     4 */
	u64                      destroy_cnt;              /* 1544     8 */

	/* size: 1600, cachelines: 25, members: 15 */
	/* sum members: 1436, holes: 2, sum holes: 116 */
	/* padding: 48 */
	/* paddings: 2, sum paddings: 52 */
	/* forced alignments: 2, forced holes: 2, sum forced holes: 116 */
} __attribute__((__aligned__(64)));
$
					

CPU helps



  • Intel PEBS
  • AMD IBS
  • ARM CoreSight

Example PEBS doc


18.8.1.2 Load Latency Performance Monitoring Facility

The load latency facility provides software a means to characterize
the average load latency to different levels of cache/memory
hierarchy.  This facility requires processor supporting enhanced
PEBS record format in the PEBS buffer, see Table 18-23.

This field measures the load latency from the load's first dispatch
till final data writeback from the memory subsystem. The latency is
reported for retired demand load operations and in core cycles (it
accounts for re-dispatches).
					

What do we have in perf?



  • perf stat
  • perf mem
  • perf c2c
  • Several articles about using it
  • Improve upon this foundation

perf mem



  • loads, stores
  • load latency
  • Stephane Eranian
  • Jiri Olsa, Kan Liang, others

Global stats


# perf mem record -a sleep 1
#
# perf mem -t load report --sort=mem --stdio
# Total Lost Samples: 0
#
# Samples: 51K of event 'cpu/mem-loads,ldlat=30/P'
# Total weight : 4819902
# Sort order   : mem
#
# Overhead       Samples  Memory access
# ........  ............  ........................
    44.87%         20217  LFB or LFB hit
    27.30%         18618  L3 or L3 hit
    22.53%         11712  L1 or L1 hit
     4.85%           637  Local RAM or RAM hit
     0.25%             1  Uncached or N/A hit
     0.20%           188  L2 or L2 hit
     0.00%            35  L3 miss
					

Workload stats


# perf mem record sleep 1
#
# perf mem -t load report --sort=mem --stdio
# Total Lost Samples: 0
#
# Samples: 16  of event 'cpu/mem-loads,ldlat=30/P'
# Total weight : 1556
# Sort order   : mem
#
# Overhead       Samples  Memory access
# ........  ............  ........................
    64.52%             8  LFB or LFB hit
    14.07%             4  L1 or L1 hit
    11.05%             3  L3 or L3 hit
    10.35%             1  Local RAM or RAM hit

					

A familiar workload


$ make -j8 O=../build/allmodconfig/
make[1]: Entering directory '/home/acme/git/build/allmodconfig'
<SNIP>
# perf mem record sleep 1m
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.037 MB perf.data (20 samples) ]
#
					

perf mem report


# perf mem report --stdio
# Total Lost Samples: 0
#
# Samples: 11  of event 'cpu/mem-loads,ldlat=30/P'
# Total weight : 2155
# Sort order   : local_weight,mem,sym,dso,symbol_daddr,dso_daddr
#
#              Local Mem
#Overhead Weig Access   Symbol              Sh Object   Data Symbol        Data Obj
#........ .... ........ ................... .........   ................   ...........
   23.94% 516  LocalRAM copy_page           [kernel]    0xffff8d42228ea900 [unknown]
   15.31% 330  LFB      flush_signal_handle [kernel]    0xffff8d3f976020a0 [unknown]
   14.66% 316  LFB      strlen              [kernel]    0xffffffff9b5f4cd3 [kernel].ro
   13.36% 288  LFB      _dl_relocate_object ld-linux.so 0x00007f6ccdc23068 libc.so.6
   11.46% 247  LFB      next_uptodate_page  [kernel]    0xffffe401957e4df4 [unknown]
    7.33% 158  LFB      copy_page           [kernel]    0xffff8d41f2dae920 [unknown]
    4.04% 87   LFB      unlock_page_memcg   [kernel]    0xffffe4019333d8b8 [unknown]
    3.06% 66   L1       check_preemption_di [kernel]    0xffffa8e8622ffc80 [unknown]
    2.69% 58   LFB      perf_output_begin   [kernel]    0xffff8d3f52a1b01c [unknown]
    2.13% 46   L3       task_work_run       [kernel]    0xffff8d3f4a9c802c [unknown]
    2.00% 43   L1       kmem_cache_alloc_tr [kernel]    0xffffa8e8622ffbc8 [unknown]
					

perf c2c



  • Cache to Cache
  • False Sharing
  • Cachelines tugged
  • Resolves global variables
  • Should resolve its type and field
  • Using BTF

Origins



  • Dick Fowles
  • HP-UX had something similar
  • Joe Mario
  • Don Zickus
  • Jiri Olsa

c2c output



  • Cachelines where false sharing was detected
  • Readers and writers and cacheline offsets
  • pid, tid, instruction addr, function name, binary object names
  • Source file and line number
  • Average load latency
  • NUMA nodes and CPUs involved

perf c2c


# perf c2c record -a sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 7.787 MB perf.data (2450 samples) ]
# perf evlist
cpu/mem-loads,ldlat=30/P
cpu/mem-stores/P
dummy:HG
#
					

What's in there?


					
# perf script --cpu 4 --pid 0 | head
 swapper 0 [4] 319242.043904:    58 cpu/mem-loads,ldlat=30/P: ffff8d3e49c0e688 11868100242 |OP LOAD|LVL LFB or LFB hit|SNP None|TLB L1 or L2 hit|LCK No|BLK  N/A      290 0 ffffffff9a13eb2c __update_load_avg_cfs_rq+0x9c (vmlinux) 9c0e688
 swapper 0 [4] 319242.142295:    39 cpu/mem-loads,ldlat=30/P: ffff8d44865f2408 10268100142 |OP LOAD|LVL L1 or L1 hit|SNP None|TLB L1 or L2 hit|LCK No|BLK  N/A        335 0 ffffffff9a13eecd update_rt_rq_load_avg+0x17d (vmlinux) 6465f2408
 swapper 0 [4] 319242.143587: 99614         cpu/mem-stores/P: ffff8d4486500028     5080184 |OP STORE|LVL L1 miss|SNP N/A|TLB N/A|LCK N/A|BLK  N/A                       0 0 ffffffff9a001c2f __switch_to_asm+0x1f (vmlinux) 646500028
 swapper 0 [4] 319242.174494:    33 cpu/mem-loads,ldlat=30/P: ffff8d3f595ddc38 11a68201042 |OP LOAD|LVL Local RAM or RAM hit|SNP Hit|TLB L1 or L2 hit|LCK No|BLK  N/A 176 0 ffffffff9a13e78d __update_load_avg_se+0x1d (vmlinux) 1195ddc38
 swapper 0 [4] 319242.178002:    27 cpu/mem-loads,ldlat=30/P: ffff8d44865312c0 10668100842 |OP LOAD|LVL L3 or L3 hit|SNP None|TLB L1 or L2 hit|LCK No|BLK  N/A         56 0 ffffffff9a07d74f switch_mm_irqs_off+0x16f (vmlinux) 6465312c0
 swapper 0 [4] 319242.212148:    23 cpu/mem-loads,ldlat=30/P: ffff8d44865322e8 10668100842 |OP LOAD|LVL L3 or L3 hit|SNP None|TLB L1 or L2 hit|LCK No|BLK  N/A         55 0 ffffffff9a140c22 irqtime_account_process_tick+0xa2 (vmlinux) 6465322e8
 swapper 0 [4] 319242.217357:    18 cpu/mem-loads,ldlat=30/P: ffff8d4486532490 10268100142 |OP LOAD|LVL L1 or L1 hit|SNP None|TLB L1 or L2 hit|LCK No|BLK  N/A        125 0 ffffffff9a140076 update_irq_load_avg+0xf6 (vmlinux) 646532490
 swapper 0 [4] 319242.220573:    15 cpu/mem-loads,ldlat=30/P: ffff8d3f4f35f218 11868100242 |OP LOAD|LVL LFB or LFB hit|SNP None|TLB L1 or L2 hit|LCK No|BLK  N/A      383 0 ffffffff9a73b407 rb_erase+0x7 (vmlinux) 10f35f218
 swapper 0 [4] 319242.240176:    15 cpu/mem-loads,ldlat=30/P: ffff8d3f6b617be0 10650100842 |OP LOAD|LVL L3 or L3 hit|SNP None|TLB L2 miss|LCK No|BLK  N/A             184 0 ffffffff9a129fbb update_blocked_averages+0x1fb (vmlinux) 12b617be0
 swapper 0 [4] 319242.243441:  8849         cpu/mem-stores/P: ffff8d3f40c2b1a4     5080144 |OP STORE|LVL L1 hit|SNP N/A|TLB N/A|LCK N/A|BLK  N/A                        0 0 ffffffff9ad68aed rcu_eqs_exit.constprop.0+0x3d (vmlinux) 100c2b1a4
#
					
					

What was really asked?


# perf evlist -v | head -1
cpu/mem-loads,ldlat=30/P: type: 4, size: 128, config: 0x1cd, \
{ sample_period, sample_freq }: 4000, \
sample_type: IP|TID|TIME|ADDR|ID|CPU|PERIOD|DATA_SRC|PHYS_ADDR| \
             WEIGHT_STRUCT, \
read_format: ID, disabled: 1, inherit: 1, freq: 1, precise_ip: 3, \
sample_id_all: 1, { bp_addr, config1 }: 0x1f
#
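
Roughly what perf asked for, as a hand-rolled perf_event_open() sketch; the
0x1cd event encoding and the ldlat threshold in config1 are Intel-specific
values taken from the dump above, and recent kernel headers are assumed.

#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* sketch: open the mem-loads event above on CPU 0, all threads */
static int open_mem_loads(void)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size        = sizeof(attr);
	attr.type        = PERF_TYPE_RAW;  /* "type: 4" in the dump */
	attr.config      = 0x1cd;          /* MEM_TRANS_RETIRED.LOAD_LATENCY */
	attr.config1     = 30;             /* the ldlat threshold */
	attr.freq        = 1;
	attr.sample_freq = 4000;
	attr.precise_ip  = 3;              /* PEBS, the "P" modifier */
	attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_TIME |
			   PERF_SAMPLE_ADDR | PERF_SAMPLE_CPU | PERF_SAMPLE_DATA_SRC |
			   PERF_SAMPLE_PHYS_ADDR | PERF_SAMPLE_WEIGHT_STRUCT;

	return syscall(SYS_perf_event_open, &attr, -1 /* pid: all */,
		       0 /* cpu 0 */, -1 /* group */, 0 /* flags */);
}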
					

PERF_SAMPLE_DATA_SRC



  • PEBS Load Latency
  • opcode: load, store, prefetch, exec
  • mem level: l1, l2, l3, l4, lfb, ram, pmem, etc
  • tlb: not available, hit, miss, l1, l2, hw walker, OS fault handler
  • snoop: no snoop, hit, miss, hit modified
  • lock: not available, locked instruction
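
These fields arrive packed as bitfields in union perf_mem_data_src from
linux/perf_event.h; a minimal decoder sketch:

#include <stdio.h>
#include <stdint.h>
#include <linux/perf_event.h>

/* sketch: decode a few fields of a PERF_SAMPLE_DATA_SRC value */
static void decode_data_src(uint64_t value)
{
	union perf_mem_data_src ds = { .val = value };

	if (ds.mem_op & PERF_MEM_OP_LOAD)       puts("op:    load");
	if (ds.mem_op & PERF_MEM_OP_STORE)      puts("op:    store");
	if (ds.mem_lvl & PERF_MEM_LVL_L1)       puts("level: L1");
	if (ds.mem_lvl & PERF_MEM_LVL_LFB)      puts("level: LFB");
	if (ds.mem_lvl & PERF_MEM_LVL_L3)       puts("level: L3");
	if (ds.mem_snoop & PERF_MEM_SNOOP_HITM) puts("snoop: HITM");
	if (ds.mem_lock & PERF_MEM_LOCK_LOCKED) puts("lock:  locked");
}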

PERF_SAMPLE_WEIGHT_STRUCT



  • hardware provided number
  • How expensive the sampled action was
  • For profiler to scale samples
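
Delivered as union perf_sample_weight from linux/perf_event.h (a sketch; on
recent Intel parts var1_dw carries the load latency in core cycles, the number
'perf mem' shows as the sample weight):

#include <linux/perf_event.h>

static unsigned int sample_load_latency(union perf_sample_weight w)
{
	return w.var1_dw;
}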

c2c perf.data header


# perf report --header-only
# captured on    : Fri Sep  9 11:10:04 2022
# hostname : quaco
# os release : 5.18.17-200.fc36.x86_64
# perf version : 6.0.rc3.gfaf59ec8c3c3
# arch : x86_64
# nrcpus online : 8
# nrcpus avail : 8
# cpudesc : Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
# total memory : 24487388 kB
# cmdline : /home/acme/bin/perf c2c record -a sleep 1m
# event : name = cpu/mem-loads,ldlat=30/P, freq = 4000,
  sample_type = IP|TID|TIME|ADDR|ID|CPU|PERIOD|DATA_SRC|\
                PHYS_ADDR|WEIGHT_STRUCT
# event : name = cpu/mem-stores/P, freq = 4000,
  sample_type = IP|TID|TIME|ADDR|ID|CPU|PERIOD|DATA_SRC|\
                PHYS_ADDR|WEIGHT_STRUCT
					

Key metrics


# perf c2c report --stats
  Total records                     :    3223429
  Locked Load/Store Operations      :     112673
  Load Operations                   :    1387118
  Loads - uncacheable               :          1
  Loads - IO                        :          4
  Loads - Miss                      :        142
  Loads - no mapping                :       2350
  Load Fill Buffer Hit              :     455747
  Load L1D hit                      :     264355
  Load L2D hit                      :      29304
  Load LLC hit                      :     534642
  Load Local HITM                   :        629
  Load Remote HITM                  :          0
  Load Remote HIT                   :          0
					

perf c2c contended cachelines view


# perf c2c report --stdio
=================================================
           Shared Data Cache Line Table
=================================================
#
#      -- Cacheline --    -- Load Hitm  ----   Tot  Total   Total  -- Stores --
# Idx           Address    Hitm  Tot LclHitm   rec  Loads  Stores  L1Hit L1Miss
# ...  ................   .....  ... .......   ...  .....  ......  ............
#
    0  ffff8d449e7d6380   8.43%   53      53   510    499      11     11      0
    1  ffff8d4058209340   6.20%   39      39   371    135     236    223     13
    2  ffff8d449e7ff400   5.88%   37      37   501    479      22     22      0
    3  ffffffff9bf53980   4.93%   31      31   233    208      25     24      1
    4  ffff8d3f49ebd280   3.18%   20      20   162    153       9      9      0
    5  ffff8d3f420d4880   2.86%   18      18   126    121       5      5      0
					

tugged cacheline


Cacheline 0xffff8d449e7ff400
-HITM-     CL                    --- cycles ---  Tot  cpu
LclHitm   Off      Code address  lcl hitm  load  rec  cnt   Symbol Object      Source:Line
 97.30%   0x0 0xffffffff9a2d293b      113    44  454   8 __mod_node_page_state [kernel] vmstat.c:379
  0.00%   0x8 0xffffffff9a2d29bb        0   112  40    8 __mod_node_page_state [kernel] atomic64_64.h:46
  2.70%  0x18 0xffffffff9a2d2be5      959   103   2    2 refresh_cpu_vm_stats  [kernel] atomic64_64.h:46
					

Next steps



  • Look at the CL Off (Cacheline Offset)
  • Three fields being accessed
  • Two show local HITM
  • Find out what the data structure is
  • By looking at the functions, source:line

vmstat.c:379


$ perf probe -L vmstat.c:379 | head

    379  	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
    380  	s8 __percpu *p = pcp->vm_node_stat_diff + item;
    381  	long x;
         	long t;

         	if (vmstat_item_in_bytes(item)) {
         		/*
         		 * Only cgroups use subpage accounting right now; at
         		 * the global level, these items still change in
$
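
In miniature, what this code is doing (a userspace sketch of the per-CPU delta
pattern, not the kernel code): accumulate into a CPU-local diff, fold into the
shared counter only past a threshold.

#include <stdatomic.h>

static atomic_long node_stat;         /* shared: its cacheline bounces when written */
static _Thread_local long stat_diff;  /* thread local: no cacheline traffic */

#define STAT_THRESHOLD 125            /* hypothetical fold threshold */

static void mod_node_stat(long delta)
{
	stat_diff += delta;
	if (stat_diff > STAT_THRESHOLD || stat_diff < -STAT_THRESHOLD) {
		atomic_fetch_add_explicit(&node_stat, stat_diff,
					  memory_order_relaxed);
		stat_diff = 0;
	}
}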
					

Looking at pgdat


$ pfunct __mod_node_page_state
void __mod_node_page_state(struct pglist_data * pgdat,
                           enum node_stat_item item, long int delta);
$
$ pahole pglist_data | grep -B2 -A6 per_cpu_nodestats
    /* --- cacheline 2704 boundary (173056 bytes) --- */
    struct zone_padding      _pad2_;              /* 173056   0 */
    struct per_cpu_nodestat *per_cpu_nodestats
    		__attribute__((__aligned__(64))); /* 173056   8 */
    atomic_long_t            vm_stat[41];         /* 173064 328 */

    /* size: 173440, cachelines: 2710, members: 32 */
    /* sum members: 173309, holes: 6, sum holes: 83 */
    /* padding: 48 */
    /* forced alignments: 2 */
$
					

Code inspection



  • Manual steps
  • Streamlining the current UI:
  • Should go from the c2c TUI
  • To the 'perf annotate' browser
  • Positioning at that source:line
  • ENTER on type: go to a pahole browser

refresh_cpu_vm_stats


# perf annotate --stdio2 refresh_cpu_vm_stats
<SNIP>
refresh_cpu_vm_stats() /usr/lib/debug/lib/modules/5.18.17-200.fc36.x86_64/vmlinux
             ffffffff812d29e0 <refresh_cpu_vm_stats>:
             static int refresh_cpu_vm_stats(bool do_pagesets)
             struct pglist_data *pgdat;
             struct zone *zone;
<SNIP>
             for_each_online_pgdat(pgdat) {
<SNIP>
        1f3:   cmpxchg %r8b,%gs:(%rdx)
             ↑ jne     1f3
               movsbl  %al,%edi
             if (v) {
               test    %edi,%edi
             ↓ je      20b
             atomic_long_add(v, &pgdat->vm_stat[i])


					

Being mindful


 * Update the zone counters for the current cpu.

 * Note that refresh_cpu_vm_stats strives to only access node local
 * memory.  The per cpu pagesets on remote zones are placed in the
 * memory local to the processor using that pageset. So the loop over
 * all zones will access a series of cachelines local to the processor.

 * The call to zone_page_state_add updates the cachelines with the stats
 * in the remote zone struct as well as the global cachelines with the
 * global counters. These could cause remote node cache line bouncing
 * and will have to be only done when necessary.

 * The function returns the number of global counters updated.
					

pahole TODO



  • We have --hex, offsets from the struct start
  • Implement per-cacheline offsets in pahole output
  • For multi cacheline structs
  • Like pgdat in the previous example

What do we see?



  • Code
  • Source code/line
  • All focused on where things happened
  • Not on what data was accessed
  • But looking at where it happens
  • Helps figure out the data structure being accessed

No rmap for data symbols?



  • Global vars: ok
  • SLAB: can we do it?
  • From code?
  • Going back to args
  • DWARF location expressions
  • LBR?

perf lock contention -b



  • Gets some backtraces
  • Identifies locks by backtrace sigs
  • Good enough?

A detour



  • perf record + report is too costly
  • In-kernel aggregation should be supported
  • perf kmem, kwork, bcounters do it
  • BPF way of doing things
  • perf using it

perf BPF skels


$ wc -l tools/perf/util/bpf_skel/*.bpf.c
  191 tools/perf/util/bpf_skel/bperf_cgroup.bpf.c
   78 tools/perf/util/bpf_skel/bperf_follower.bpf.c
   55 tools/perf/util/bpf_skel/bperf_leader.bpf.c
   92 tools/perf/util/bpf_skel/bpf_prog_profiler.bpf.c
  116 tools/perf/util/bpf_skel/func_latency.bpf.c
  383 tools/perf/util/bpf_skel/kwork_trace.bpf.c
  175 tools/perf/util/bpf_skel/lock_contention.bpf.c
  273 tools/perf/util/bpf_skel/off_cpu.bpf.c
 1363 total
$
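
The shape of such a program, sketched (hypothetical lock_count.bpf.c, not
perf's actual lock_contention.bpf.c): aggregate contention counts per lock
address in a BPF hash map, reading the lock:contention_begin tracepoint.

// lock_count.bpf.c: build with clang -target bpf, needs a generated vmlinux.h
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 16384);
	__type(key, __u64);    /* lock address */
	__type(value, __u64);  /* contention count */
} counts SEC(".maps");

SEC("tp_btf/contention_begin")
int BPF_PROG(on_contention, void *lock, unsigned int flags)
{
	__u64 key = (__u64)lock, one = 1, *val;

	val = bpf_map_lookup_elem(&counts, &key);
	if (val)
		__sync_fetch_and_add(val, 1);
	else
		bpf_map_update_elem(&counts, &key, &one, BPF_ANY);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";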
					

perf lock contention -b


 $ sudo perf lock contention -b
^C
contended  total wait   max wait  avg wait      type  caller

        42   192.67 us  13.64 us   4.59 us  spinlock  queue_work_on+0x20
        23    85.54 us  10.28 us   3.72 us  spinlock  worker_thread+0x14a
         6    13.92 us   6.51 us   2.32 us     mutex  kernfs_iop_permission+0x30
         3    11.59 us  10.04 us   3.86 us     mutex  kernfs_dop_revalidate+0x3c
         1     7.52 us   7.52 us   7.52 us  spinlock  kthread+0x115
         1     7.24 us   7.24 us   7.24 us  rwlock:W  sys_epoll_wait+0x148
         2     7.08 us   3.99 us   3.54 us  spinlock  delayed_work_timer_fn+0x1b
         1     6.41 us   6.41 us   6.41 us  spinlock  idle_balance+0xa06
         2     2.50 us   1.83 us   1.25 us     mutex  kernfs_iop_lookup+0x2f
         1     1.71 us   1.71 us   1.71 us     mutex  kernfs_iop_getattr+0x2c
...
					

perf sampling in BPF



  • We have BPF programs counting events
  • That perf then treats as a counter (evsel)
  • Now for sampling
  • Do the c2c aggregation in a BPF program
  • Until control+C
  • Or some other existing perf stop method
  • End of a control workload (sleep), etc
  • Recent discussion: Namhyung Kim and Song Liu

BTF



  • per-cpu variables
  • patch to record all variables
  • And have it in a kernel module
  • Load it and get access to type info
  • For all variables

Code Annotation



  • perf annotate
  • Source code
  • Machine code
  • Where events take place

perf annotate



  • nospectre_v1 + nospectre_v2

perf + pahole



  • Data structure annotation
  • Show HITM causing fields
  • Mapping data memory to a variable
  • Finding its type
  • BTF info for variables: types
  • Function signatures: arg types

Data profiling links


New tools



  • perf lock contention
  • BPF becoming the way to write new tools
  • BPF skels
  • A more recent example: prometheusnoop, tracing Go code (talk at the Tracing Summit the week before)

THE END