acme@redhat.com
https://fedorapeople.org/~acme/prez/perf-tools-CERN-2023
git log --committer=acme@redhat.com --no-merges v6.5.. .
Changesets: 215

  1   80  Ian Rogers                Google
  2   32  Arnaldo Carvalho de Melo  Red Hat
  3   20  Kajol Jain                IBM
  4   15  Namhyung Kim              Google
  5   15  Athira Rajeev             IBM
  6    9  Anup Sharma               GSoC
  7    8  James Clark               ARM
  8    6  Leo Yan                   Linaro
  9    4  Ilkka Koskinen            Ampere Computing
 10    3  Yanteng Si                Loongson
 11    3  Jing Zhang                Alibaba
 12    3  Adrian Hunter             Intel
 13    2  Thomas Richter            IBM
 14    2  Wei Li                    Huawei
 15    2  Alexandre Ghiti           Rivosinc
 16    1  Kaige Ye
 17    1  Yang Jihong               Huawei
 18    1  Xiu Jianfeng              Huawei
 19    1  Masami Hiramatsu          Linaro
 20    1  Lu Hongfei                Vivo
 21    1  John Garry                Oracle
 22    1  Ivan Babrou               Cloudflare
 23    1  Changbin Du               Huawei
 24    1  Guilherme Amadio          CERN
 25    1  Akanksha J N              IBM
 26    1  Aditya Gupta              IBM
git log --committer=acme@redhat.com --no-merges v6.0..
Changesets: 1578

  1  627  Ian Rogers                Google
  2  238  Namhyung Kim              Google
  3  172  Arnaldo Carvalho de Melo  Red Hat
  4   77  Adrian Hunter             Intel
  5   53  James Clark               ARM
  6   34  Leo Yan                   Linaro
  7   34  Athira Rajeev             IBM
  8   23  Kajol Jain                IBM
  9   21  Thomas Richter            IBM
 10   20  Ravi Bangoria             AMD
 11   19  Kan Liang                 Intel
 12   14  Tiezhu Yang               Loongson
 13   13  Carsten Haitzler          ARM
 14   12  Yang Jihong               Huawei
 15   12  Jing Zhang                Alibaba
 16   11  Shang XiaoJing            Huawei
 17   10  Yanteng Si                Loongson
 18   10  Changbin Du               Huawei
 19   10  Anup Sharma               GSoC
 20    9  Sandipan Das              AMD
 21    8  German Gomez              ARM
$
$ export BUILD_TARBALL=http://192.168.86.10/perf/perf-6.5.0-rc5.tar.xz
$ dm
 1 almalinux:8                  : Ok  gcc 8.5.0, clang 15.0.7, flex 2.6.1
 2 almalinux:9                  : Ok  gcc 11.3.1, clang 15.0.7, flex 2.6.4
 3 alpine:3.15                  : Ok  gcc 10.3.1, Alpine clang 12.0.1, flex 2.6.4
 4 alpine:3.16                  : Ok  gcc 11.2.1, Alpine clang 13.0.1, flex 2.6.4
 5 alpine:3.17                  : Ok  gcc 12.2.1, Alpine clang 15.0.7, flex 2.6.4
 6 alpine:3.18                  : Ok  gcc 12.2.1, Alpine clang 16.0.6, flex 2.6.4
 7 alpine:edge                  : Ok  gcc 13.1.1, Alpine clang 16.0.6, flex 2.6.4
 8 amazonlinux:2                : Ok  gcc 7.3.1 20180712, clang 11.1.0, flex 2.5.37
 9 amazonlinux:2023             : Ok  gcc 11.3.1, clang 15.0.6, flex 2.6.4
10 amazonlinux:devel            : Ok  gcc 11.3.1, clang 15.0.6, flex 2.6.4
11 archlinux:base               : Ok  gcc 13.2.1, clang 15.0.7, flex 2.6.4
12 centos:stream                : Ok  gcc 8.5.0, clang 16.0.6, flex 2.6.1
13 clearlinux:latest            : Ok  gcc 13.2.1, clang 16.0.6, flex 2.6.4
14 debian:10                    : Ok  gcc 8.3.0, clang 11.0.1-2~deb10u1, flex 2.6.4
15 debian:11                    : Ok  gcc 10.2.1, clang 13.0.1-6~deb11u1, flex 2.6.4
16 debian:12                    : Ok  gcc 12.2.0, clang 14.0.6, flex 2.6.4
17 debian:experimental          : Ok  gcc 13.2.0, clang 14.0.6, flex 2.6.4
18 debian:experimental-x-arm64  : Ok  aarch64-linux-gnu-gcc 13.2.0, flex 2.6.4
19 debian:experimental-x-mips   : Ok  mips-linux-gnu-gcc 12.3.0, flex 2.6.4
20 debian:experimental-x-mips64 : Ok  mips64-linux-gnuabi64-gcc 12.3.0, flex 2.6.4
21 debian:experimental-x-mipsel : Ok  mipsel-linux-gnu-gcc 12.3.0, flex 2.6.4
22 fedora:35                    : Ok  gcc 11.3.1, clang 13.0.1, flex 2.6.4
23 fedora:36                    : Ok  gcc 12.2.1, clang 14.0.5, flex 2.6.4
24 fedora:37                    : Ok  gcc 12.3.1, clang 15.0.7, flex 2.6.4
25 fedora:38                    : Ok  gcc 13.2.1, clang 16.0.6, flex 2.6.4
26 fedora:39                    : Ok  gcc 13.2.1, clang 16.0.6, flex 2.6.4
27 fedora:rawhide               : Ok  gcc 13.2.1, clang 16.0.6, flex 2.6.4
28 gentoo:stage3                : Ok  gcc 12.3.1, clang 16.0.6, flex 2.6.4
29 manjaro:base                 : Ok  gcc 13.2.1, clang 15.0.7, flex 2.6.4
30 opensuse:15.4                : Ok  gcc 7.5.0, clang 15.0.7, flex 2.6.4
31 opensuse:15.5                : Ok  gcc 7.5.0, clang 15.0.7, flex 2.6.4
32 opensuse:tumbleweed          : Ok  gcc 13.1.1, clang 16.0.6, flex 2.6.4
33 oraclelinux:8                : Ok  gcc 8.5.0, clang 15.0.7, flex 2.6.1
34 oraclelinux:9                : Ok  gcc 11.3.1, clang 15.0.7, flex 2.6.4
35 rockylinux:8                 : Ok  gcc 8.5.0, clang 15.0.7, flex 2.6.1
36 rockylinux:9                 : Ok  gcc 11.3.1, clang 15.0.7, flex 2.6.4
37 ubuntu:18.04                 : Ok  gcc 7.5.0, flex 2.6.4
38 ubuntu:18.04-x-arm           : Ok  arm-linux-gnueabihf-gcc 7.5.0, flex 2.6.4
39 ubuntu:18.04-x-arm64         : Ok  aarch64-linux-gnu-gcc 7.5.0, flex 2.6.4
40 ubuntu:18.04-x-m68k          : Ok  m68k-linux-gnu-gcc 7.5.0, flex 2.6.4
41 ubuntu:18.04-x-powerpc       : Ok  powerpc-linux-gnu-gcc 7.5.0, flex 2.6.4
42 ubuntu:18.04-x-powerpc64     : Ok  powerpc64-linux-gnu-gcc 7.5.0, flex 2.6.4
43 ubuntu:18.04-x-powerpc64el   : Ok  powerpc64le-linux-gnu-gcc 7.5.0, flex 2.6.4
44 ubuntu:18.04-x-riscv64       : Ok  riscv64-linux-gnu-gcc 7.5.0, flex 2.6.4
45 ubuntu:18.04-x-s390          : Ok  s390x-linux-gnu-gcc 7.5.0, flex 2.6.4
46 ubuntu:18.04-x-sh4           : Ok  sh4-linux-gnu-gcc 7.5.0, flex 2.6.4
47 ubuntu:18.04-x-sparc64       : Ok  sparc64-linux-gnu-gcc 7.5.0, flex 2.6.4
48 ubuntu:20.04                 : Ok  gcc 9.4.0, flex 2.6.4
49 ubuntu:20.04-x-powerpc64el   : Ok  powerpc64le-linux-gnu-gcc 10.5.0, flex 2.6.4
50 ubuntu:22.04                 : Ok  gcc 11.4.0, clang 14.0.0-1ubuntu1.1, flex 2.6.4
51 ubuntu:22.10                 : Ok  gcc 12.2.0, clang 15.0.7, flex 2.6.4
52 ubuntu:23.04                 : Ok  gcc 12.3.0, clang 15.0.7, flex 2.6.4
53 ubuntu:23.10                 : Ok  gcc 13.2.0, clang 16.0.6, flex 2.6.4
$
# perf test
  1: vmlinux symtab matches kallsyms : Ok
  2: Detect openat syscall event : Ok
  3: Detect openat syscall event on all cpus : Ok
  4: mmap interface tests :
  4.1: Read samples using the mmap interface : Ok
  4.2: User space counter reading of instructions : Ok
  4.3: User space counter reading of cycles : Ok
  5: Test data source output : Ok
  6: Parse event definition strings :
  6.1: Test event parsing : Ok
  6.2: Parsing of all PMU events from sysfs : Ok
  6.3: Parsing of given PMU events from sysfs : Ok
  6.4: Parsing of aliased events from sysfs : Skip (no aliases in sysfs)
  6.5: Parsing of aliased events : Ok
  6.6: Parsing of terms (event modifiers) : Ok
  7: Simple expression parser : Ok
  8: PERF_RECORD_* events & perf_sample fields : Ok
  9: Parse perf pmu format : Ok
 10: PMU events :
 10.1: PMU event table sanity : Ok
 10.2: PMU event map aliases : Ok
 10.3: Parsing of PMU event table metrics : Ok
 10.4: Parsing of PMU event table metrics with fake PMUs : Ok
 10.5: Parsing of metric thresholds with fake PMUs : Ok
 11: DSO data read : Ok
 12: DSO data cache : Ok
 13: DSO data reopen : Ok
 14: Roundtrip evsel->name : Ok
 15: Parse sched tracepoints fields : Ok
 16: syscalls:sys_enter_openat event fields : Ok
 17: Setup struct perf_event_attr : Ok
 18: Match and link multiple hists : Ok
 19: 'import perf' in python : Ok
 20: Breakpoint overflow signal handler : Ok
 21: Breakpoint overflow sampling : Ok
 22: Breakpoint accounting : Ok
 23: Watchpoint :
 23.1: Read Only Watchpoint : Skip (missing hardware support)
 23.2: Write Only Watchpoint : Ok
 23.3: Read / Write Watchpoint : Ok
 23.4: Modify Watchpoint : Ok
 24: Number of exit events of a simple workload : Ok
 25: Software clock events period values : Ok
 26: Object code reading : Ok
 27: Sample parsing : Ok
 28: Use a dummy software event to keep tracking : Ok
 29: Parse with no sample_id_all bit set : Ok
 30: Filter hist entries : Ok
 31: Lookup mmap thread : Ok
 32: Share thread maps : Ok
 33: Sort output of hist entries : Ok
 34: Cumulate child hist entries : Ok
 35: Track with sched_switch : Ok
 36: Filter fds with revents mask in a fdarray : Ok
 37: Add fd to a fdarray, making it autogrow : Ok
 38: kmod_path__parse : Ok
 39: Thread map : Ok
 40: Session topology : Ok
 41: Synthesize thread map : Ok
 42: Remove thread map : Ok
 43: CPU map :
 43.1: Synthesize cpu map : Ok
 43.2: Print cpu map : Ok
 43.3: Merge cpu map : Ok
 43.4: Intersect cpu map : Ok
 43.5: Equal cpu map : Ok
 44: Synthesize stat config : Ok
 45: Synthesize stat : Ok
 46: Synthesize stat round : Ok
 47: Synthesize attr update : Ok
 48: Event times : Ok
 49: Read backward ring buffer : Ok
 50: Probe SDT events : Ok
 51: is_printable_array : Ok
 52: Print bitmap : Ok
 53: perf hooks : Ok
 54: unit_number__scnprintf : Ok
 55: mem2node : Ok
 56: time utils : Ok
 57: Test jit_write_elf : Ok
 58: Test libpfm4 support :
 58.1: test of individual --pfm-events : Ok
 58.2: test groups of --pfm-events : Ok
 59: Test api io : Ok
 60: maps__merge_in : Ok
 61: Demangle Java : Ok
 62: Demangle OCaml : Ok
 63: Parse and process metrics : Ok
 64: PE file support : Skip
 65: Event expansion for cgroups : Ok
 66: Convert perf time to TSC :
 66.1: TSC support : Ok
 66.2: Perf time to TSC : Ok
 67: dlfilter C API : Ok
 68: Sigtrap : Ok
 69: Event groups : Skip
 70: Symbols : Ok
 71: Test dwarf unwind : Ok
 72: Intel PT :
 72.1: Intel PT packet decoder : Ok
 72.2: Intel PT hybrid CPU compatibility : Skip
 73: x86 bp modify : Ok
 74: x86 Sample parsing : Ok
 75: AMD IBS via core pmu : Skip
 76: x86 hybrid : Skip (not hybrid)
 77: build id cache operations : Ok
 78: CoreSight / ASM Pure Loop : Skip
 79: CoreSight / Memcpy 16k 10 Threads : Skip
 80: CoreSight / Thread Loop 10 Threads - Check TID : Skip
 81: CoreSight / Thread Loop 2 Threads - Check TID : Skip
 82: CoreSight / Unroll Loop Thread 10 : Skip
 83: daemon operations : Ok
 84: kernel lock contention analysis test : Ok
 85: perf pipe recording and injection test : Ok
 86: Add vfs_getname probe to get syscall args filenames : Ok
 87: probe libc's inet_pton & backtrace it with ping : Ok
 88: Use vfs_getname probe to get syscall args filenames : Ok
 89: Zstd perf.data compression/decompression : Ok
 90: perf record tests : Ok
 91: perf record sample filtering (by BPF) tests : Ok
 92: perf record offcpu profiling tests : Ok
 93: perf stat CSV output linter : Ok
 94: perf stat csv summary test : Ok
 95: perf stat JSON output linter : Ok
 96: perf stat metrics (shadow stat) test : Ok
 97: perf stat STD output linter : Ok
 98: perf stat tests : Ok
 99: perf all metricgroups test : Ok
100: perf all metrics test : FAILED!
101: perf all libpfm4 events test : Ok
102: perf all PMU test : Ok
103: perf stat --bpf-counters test : Ok
104: perf stat --bpf-counters --for-each-cgroup test : Ok
105: perf metrics value validation : Ok
106: Check Arm64 callgraphs are complete in fp mode : Skip
107: Check Arm CoreSight trace data recording and synthesized samples : Skip
108: Check Arm SPE trace data recording and synthesized samples : Skip
109: Check Arm SPE doesn't hang when there are forks : Skip
110: Check branch stack sampling : Ok
111: Test data symbol : Ok
112: Miscellaneous Intel PT testing : Ok
113: Test java symbol : Skip
114: 'perf data convert --to-json' command test : Ok
115: perf script task-analyzer tests : Ok
116: test perf probe of function from different CU : Ok
117: Check open filename arg using perf trace + vfs_getname : Ok
#
# vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd    free    buff    cache     si  so   bi    bo    in   cs us sy id wa st
 8  0 2000272 2128480 483836 15791896     7   9  582   992    42   44 39  6 55  0  0
 8  0 2000272 2021504 483852 15802480     0   0  136  2520  8764 3656 84 16  0  0  0
 8  0 2000272 2113768 483876 15804436     0   0   76  2352  8740 3385 83 17  0  0  0
 8  0 2000272 1978844 483876 15812104     0   0  164   260  8657 3070 85 15  0  0  0
^C
#
# perf stat --interval-print 1000 -e context-switches
#           time             counts unit events
     1.000157779              2,627      context-switches
     2.000338693              3,067      context-switches
     3.000500966              3,577      context-switches
     4.000655707              2,976      context-switches
     5.000808482              3,199      context-switches
^C
     5.922384403              3,333      context-switches
#
# perf stat -I 1000 -e cs -t `pidof -S, bash`
#           time             counts unit events
     1.000217556                 12      cs
     2.000423764                 11      cs
     3.000606747                  1      cs
     4.000817161                  1      cs
     5.001014081      <not counted>      cs
^C
     5.976337203      <not counted>      cs
#
# perf stat -I 1000 -e msr/smi/
#           time             counts unit events
     1.000532552                  0      msr/smi/
     2.001166107                  0      msr/smi/
     3.001675027                  8      msr/smi/
     4.002176974                  0      msr/smi/
     5.002726317                  8      msr/smi/
^C
     5.637141913                  0      msr/smi/
#
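The SMI counts above are aggregated over all CPUs; since SMIs may not hit every core, a sketch that keeps per-CPU counts separate with -A (--no-aggr):

# perf stat -a -A -I 1000 -e msr/smi/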
# echo 1 > /proc/sys/vm/drop_caches
# perf stat find ~acme/git/linux > /dev/null

 Performance counter stats for 'find /home/acme/git/linux':

            318.11 msec task-clock                #    0.420 CPUs utilized
             4,989      context-switches          #    0.016 M/sec
                63      cpu-migrations            #    0.198 K/sec
               280      page-faults               #    0.880 K/sec
       417,896,513      cycles                    #    1.314 GHz
       379,620,453      instructions              #    0.91  insn per cycle
        75,490,217      branches                  #  237.312 M/sec
         2,036,356      branch-misses             #    2.70% of all branches

       0.758165914 seconds time elapsed

       0.073100000 seconds user
       0.241968000 seconds sys

#
# perf stat find ~acme/git/linux > /dev/null

 Performance counter stats for 'find /home/acme/git/linux':

             68.28 msec task-clock                #    0.991 CPUs utilized
                12      context-switches          #    0.176 K/sec
                 0      cpu-migrations            #    0.000 K/sec
               274      page-faults               #    0.004 M/sec
       265,617,729      cycles                    #    3.890 GHz
       310,931,753      instructions              #    1.17  insn per cycle
        61,838,290      branches                  #  905.609 M/sec
         1,508,769      branch-misses             #    2.44% of all branches

       0.068907995 seconds time elapsed

       0.018829000 seconds user
       0.049568000 seconds sys

#
# . tools/perf/perf-completion.sh
# perf stat -e power/energy-<TAB>
power/energy-cores/  power/energy-gpu/  power/energy-pkg/  power/energy-psys/  power/energy-ram/
# perf stat -e power/energy-cores/,power/energy-gpu/,power/energy-pkg/,power/energy-psys/,power/energy-ram/ sleep 1

 Performance counter stats for 'system wide':

              0.83 Joules power/energy-cores/
              0.01 Joules power/energy-gpu/
              1.66 Joules power/energy-pkg/
              5.98 Joules power/energy-psys/
              0.54 Joules power/energy-ram/

       1.001084656 seconds time elapsed

#
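These RAPL counters report Joules over the measurement window, so at 1-second intervals the counts read directly as average Watts; a minimal sketch:

# perf stat -a -I 1000 -e power/energy-pkg/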
Tasks: 299 total,   1 running, 297 sleeping,   1 stopped,   0 zombie
%Cpu(s):  1.1 us,  0.5 sy,  0.0 ni, 98.0 id,  0.1 wa,  0.1 hi,  0.2 si,  0.0 st
MiB Mem :  23980.2 total,   6982.2 free,   4027.3 used,  12970.8 buff/cache
MiB Swap:  12056.0 total,  12056.0 free,      0.0 used.  18803.3 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   9137 acme      20   0 2515848 839908 120176 S   3.0   3.4  12:14.64 Web Content
   8888 acme      20   0 2261904 562132 147104 S   2.6   2.3   5:24.73 Web Content
   2727 acme      20   0  689068  67888  43064 S   1.3   0.3   2:50.71 gnome-terminal-
   9093 acme      20   0 2221712 350912 121988 S   1.0   1.4   2:11.17 Web Content
    989 avahi     20   0   32776   5116   4496 S   0.7   0.0   0:16.37 avahi-daemon
   7922 root      20   0  228896   5148   4268 R   0.7   0.0   0:00.06 top
   8786 acme      20   0 2661872 425496 187100 S   0.7   1.7  13:37.34 firefox
   8959 acme      20   0 1581656 129824  82692 S   0.7   0.5   0:26.55 WebExtensions
   1996 acme       9 -11 2092664  14264  10804 S   0.3   0.1   0:04.55 pulseaudio
   2086 acme      20   0 3298140 211612 118636 S   0.3   0.9   6:36.83 gnome-shell
   7995 qemu      20   0 5256108 690860  22588 S   0.3   2.8   1:22.68 qemu-system-x86
   9050 acme      20   0 2239376 585160 116440 S   0.3   2.4   1:57.69 Web Content
  29175 root      20   0  253012  30208   8636 S   0.3   0.1   0:12.01 sssd_kcm
# perf top --hierarchy
Samples: 4K of event 'cycles:ppp', 4000 Hz, Event count (approx.): 1006575623 lost: 0/0
+   57.16% cc1
+   19.39% [kernel]
+   11.64% libc-2.28.so
+    2.08% perf
+    1.22% genksyms
+    1.16% ccache
+    1.05% as
+    0.99% ld-2.28.so
+    0.86% make
+    0.84% libpixman-1.so.0.34.0
+    0.65% bash
+    0.41% libelf-0.174.so
+    0.32% objtool
+    0.24% libglib-2.0.so.0.5800.2
+    0.17% gcc
+    0.06% perf-8888.map
+    0.03% [i915]
+    0.02% [crc32c_intel]
# perf top --hierarchy
Samples: 4K of event 'cycles:ppp', 4000 Hz, Event count (approx.): 1006575623 lost: 0/0
+   61.70% cc1
+   16.39% [kernel]
-   11.66% libc-2.28.so
      2.54% [.] _int_malloc
      1.51% [.] malloc
      1.39% [.] _int_free
      0.72% [.] __strcmp_avx2
      0.65% [.] cfree@GLIBC_2.2.5
      0.51% [.] __strlen_avx2
      0.45% [.] __memcmp_avx2_movbe
      0.34% [.] __memset_avx2_erms
      0.32% [.] malloc_consolidate
      0.32% [.] __memchr_avx2
      0.30% [.] __memmove_avx_unaligned_erms
      0.29% [.] __strncmp_avx2
      0.21% [.] getenv
      0.15% [.] __dcigettext
+    1.24% ccache
+    0.78% ld-2.28.so
# perf top --hierarchy
Samples: 4K of event 'cycles:ppp', 4000 Hz, Event count (approx.): 1006575623 lost: 0/0 drop: 0/0
+   61.70% cc1
-   16.39% [kernel]
      1.13% [k] native_flush_tlb_one_user
      0.56% [k] unmap_page_range
      0.48% [k] native_irq_return_iret
      0.46% [k] clear_page_erms
      0.42% [k] filemap_map_pages
      0.40% [k] copy_user_enhanced_fast_string
      0.38% [k] sync_regs
      0.37% [k] syscall_return_via_sysret
      0.37% [k] entry_SYSCALL_64
      0.35% [k] common_interrupt
      0.33% [k] error_entry
      0.31% [k] __list_del_entry_valid
      0.30% [k] __handle_mm_fault
      0.24% [k] get_page_from_freelist
      0.24% [k] __d_lookup_rcu
      0.20% [k] vma_interval_tree_insert
      0.19% [k] release_pages
      0.19% [k] handle_mm_fault
      0.19% [k] update_blocked_averages
      0.18% [k] avc_has_perm_noaudit
      0.15% [k] page_remove_rmap
      0.14% [k] __pagevec_lru_add_fn
      0.14% [k] _raw_spin_lock
+    1.29% genksyms
+    1.24% ccache
+    0.78% ld-2.28.so
# perf top -h percent

 Usage: perf top [<options>]

        --call-graph <record_mode[,record_size],print_type,threshold[,print_limit],order,sort_key[,branch],value>
                          setup and enables call-graph (stack chain/backtrace):

                                record_mode:    call graph recording mode (fp|dwarf|lbr)
                                record_size:    if record_mode is 'dwarf', max size of stack recording (<bytes>)
                                                default: 8192 (bytes)
                                print_type:     call graph printing style (graph|flat|fractal|folded|none)
                                threshold:      minimum call graph inclusion threshold (<percent>)
                                print_limit:    maximum number of call graph entry (<number>)
                                order:          call graph order (caller|callee)
                                sort_key:       call graph sort key (function|address)
                                branch:         include last branch info to call graph (branch)
                                value:          call graph value (percent|period|count)
                                Default: fp,graph,0.5,caller,function
        --percent-limit <percent>
                          Don't show entries under that percent
        --percentage <relative|absolute>
                          How to display percentage of filtered entries

#
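Putting those fields together, a sketch that keeps the default fp mode but raises the call-graph inclusion threshold to 2% and hides entries under 1%:

# perf top --call-graph fp,graph,2,caller,function --percent-limit 1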
# perf probe -x /lib64/libc-2.28.so malloc
Added new event:
  probe_libc:malloc    (on malloc in /usr/lib64/libc-2.28.so)

You can now use it in all perf tools, such as:

	perf record -e probe_libc:malloc -aR sleep 1
# perf stat -e probe_libc:malloc sleep 1
 Performance counter stats for 'sleep 1':

                31      probe_libc:malloc

       1.001535345 seconds time elapsed

       0.001071000 seconds user
       0.000000000 seconds sys

#
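Probes stay installed until explicitly removed; they can be listed and deleted with:

# perf probe -l
# perf probe -d probe_libc:malloc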
# perf probe -L icmp_rcv
...
     52         if (rt->rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST)) {
                        /*
                         *      RFC 1122: 3.2.2.6 An ICMP_ECHO to broadcast MAY be
                         *      silently ignored (we let user decide with a sysctl).
                         *      RFC 1122: 3.2.2.8 An ICMP_TIMESTAMP MAY be silently
                         *      discarded if to broadcast/multicast.
                         */
     59                 if ((icmph->type == ICMP_ECHO ||
     60                      icmph->type == ICMP_TIMESTAMP) &&
                            net->ipv4.sysctl_icmp_echo_ignore_broadcasts) {
                                goto error;
                        }
     64                 if (icmph->type != ICMP_ECHO &&
                            icmph->type != ICMP_TIMESTAMP &&
     66                     icmph->type != ICMP_ADDRESS &&
                            icmph->type != ICMP_ADDRESSREPLY) {
                                goto error;
                        }
                }
...
# perf probe icmp_rcv:59
Added new events:
  probe:icmp_rcv       (on icmp_rcv:59)
  probe:icmp_rcv_1     (on icmp_rcv:59)

You can now use it in all perf tools, such as:

	perf record -e probe:icmp_rcv_1 -aR sleep 1
#
# On window 1:

# ping -c 2 -b 127.255.255.255
WARNING: pinging broadcast address
PING 127.255.255.255 (127.255.255.255) 56(84) bytes of data.

--- 127.255.255.255 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 54ms
#

# On window 2:

# perf trace -e probe:icmp_rcv*
     0.000 ping/2126 probe:icmp_rcv_1:(ffffffff9186a9d2)
     0.002 ping/2126 probe:icmp_rcv:(ffffffff9186a9d2)
  1052.532 ping/2126 probe:icmp_rcv_1:(ffffffff9186a9d2)
  1052.534 ping/2126 probe:icmp_rcv:(ffffffff9186a9d2)
^C#
# perf trace --call-graph=dwarf -e probe:icmp_rcv*
     0.000 ping/28036 probe:icmp_rcv_1:(ffffffff9186a9d2)
        ip_local_deliver_finish ([kernel.kallsyms])
        ip_local_deliver ([kernel.kallsyms])
        ip_rcv ([kernel.kallsyms])
        __netif_receive_skb_one_core ([kernel.kallsyms])
        process_backlog ([kernel.kallsyms])
        net_rx_action ([kernel.kallsyms])
        do_softirq ([kernel.kallsyms])
        __local_bh_enable_ip ([kernel.kallsyms])
        ip_finish_output2 ([kernel.kallsyms])
        ip_output ([kernel.kallsyms])
        ip_send_skb ([kernel.kallsyms])
        raw_sendmsg ([kernel.kallsyms])
        sock_sendmsg ([kernel.kallsyms])
        __sys_sendto ([kernel.kallsyms])
        do_syscall_64 ([kernel.kallsyms])
        entry_SYSCALL_64_after_hwframe ([kernel.kallsyms])
        __libc_sendto (/usr/lib64/libc-2.28.so)
        ping4_send_probe (/usr/bin/ping)
        pinger (/usr/bin/ping)
        main_loop (/usr/bin/ping)
        ping4_run (/usr/bin/ping)
        main (/usr/bin/ping)
        __libc_start_main (/usr/lib64/libc-2.28.so)
        _start (/usr/bin/ping)
$ perf stat sleep 1

 Performance counter stats for 'sleep 1':

              1.42 msec task-clock:u              #    0.001 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
                78      page-faults:u             #   54.934 K/sec
           708,831      cycles:u                  #    0.499 GHz
           476,967      instructions:u            #    0.67  insn per cycle
           106,465      branches:u                #   74.982 M/sec
            11,030      branch-misses:u           #   10.36% of all branches

       1.002618642 seconds time elapsed

       0.000000000 seconds user
       0.002129000 seconds sys

$
$ sudo perf stat -a sleep 1

 Performance counter stats for 'system wide':

          8,017.36 msec cpu-clock                 #    7.998 CPUs utilized
             1,379      context-switches          #  172.002 /sec
                19      cpu-migrations            #    2.370 /sec
               603      page-faults               #   75.212 /sec
       221,872,379      cycles                    #    0.028 GHz
        90,264,445      instructions              #    0.41  insn per cycle
        18,078,489      branches                  #    2.255 M/sec
         1,134,291      branch-misses             #    6.27% of all branches

       1.002463688 seconds time elapsed

$
$ getconf _NPROCESSORS_ONLN
8
$
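The :u suffixes in the unprivileged run mark user-space-only counting; the same modifiers can also be requested explicitly, e.g. to split instruction counts between user and kernel mode:

$ sudo perf stat -a -e instructions:u,instructions:k sleep 1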
# grep 'model name' -m1 /proc/cpuinfo
model name	: Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
# perf list hw

List of pre-defined events (to be used in -e or -M):

  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
  bus-cycles                                         [Hardware event]
  cache-misses                                       [Hardware event]
  cache-references                                   [Hardware event]
  cpu-cycles OR cycles                               [Hardware event]
  instructions                                       [Hardware event]
  ref-cycles                                         [Hardware event]
#
# perf stat -e cache-misses,LONGEST_LAT_CACHE.MISS
^C
 Performance counter stats for 'system wide':

        13,261,973      cache-misses
        13,262,361      LONGEST_LAT_CACHE.MISS

       3.612974091 seconds time elapsed

# perf list longest_lat_cache.miss

cache:
  longest_lat_cache.miss
       [Core-originated cacheable demand requests missed L3 Spec update: SKL057]
#
$ git grep PERF_COUNT_HW_CACHE_MISSES arch/ | tail
arch/x86/events/amd/core.c:      [PERF_COUNT_HW_CACHE_MISSES]  = 0x0964,
arch/x86/events/intel/core.c:    [PERF_COUNT_HW_CACHE_MISSES]  = 0x412e,
arch/x86/events/intel/core.c:    { PERF_COUNT_HW_CACHE_MISSES, "cache misses" },
arch/x86/events/intel/knc.c:     [PERF_COUNT_HW_CACHE_MISSES]  = 0x0029,
arch/x86/events/intel/p4.c:      [PERF_COUNT_HW_CACHE_MISSES]  =
arch/x86/events/intel/p6.c:      [PERF_COUNT_HW_CACHE_MISSES]  = 0x012e,   /* L2_RQSTS:I */
arch/x86/events/zhaoxin/core.c:  [PERF_COUNT_HW_CACHE_MISSES]  = 0x051a,
arch/x86/events/zhaoxin/core.c:  { PERF_COUNT_HW_CACHE_MISSES, "cache misses" },
arch/x86/events/zhaoxin/core.c:  zx_pmon_event_map[PERF_COUNT_HW_CACHE_MISSES] = 0;
arch/xtensa/kernel/perf_event.c: [PERF_COUNT_HW_CACHE_MISSES]  = XTENSA_PMU_MASK(12, 0x1),
$
# perf stat -a -e cache-misses,r412e sleep 5s

 Performance counter stats for 'system wide':

        12,961,754      cache-misses
        12,962,089      r412e

       5.003567616 seconds time elapsed

#
tools/perf/pmu-events/arch/x86/skylake/cache.json

    {
        "BriefDescription": "Core-originated cacheable demand requests missed L3",
        "Errata": "SKL057",
        "EventCode": "0x2E",
        "EventName": "LONGEST_LAT_CACHE.MISS",
        "PublicDescription": "Counts core-originated cacheable requests that miss the L3 cache (Longest Latency cache). Requests include data and code reads, Reads-for-Ownership (RFOs), speculative accesses and hardware prefetches from L1 and L2. It does not include all misses to the L3.",
        "SampleAfterValue": "100003",
        "UMask": "0x41"
    },
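So r412e is just UMask 0x41 in the second byte and EventCode 0x2E in the first; the same event can be spelled out via the cpu// syntax, where the name= term below is only an illustrative label:

# perf stat -a -e cpu/event=0x2e,umask=0x41,name=llc_demand_misses/ sleep 5s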
# perf stat -a -e cache-misses,r412e,LONGEST_LAT_CACHE.MISS sleep 5s

 Performance counter stats for 'system wide':

        11,621,661      cache-misses
        11,621,982      r412e
        11,622,200      LONGEST_LAT_CACHE.MISS

       5.003476145 seconds time elapsed

#
# perf list cache

List of pre-defined events (to be used in -e or -M):

cpu:
  L1-dcache-loads OR cpu/L1-dcache-loads/
  L1-dcache-load-misses OR cpu/L1-dcache-load-misses/
  L1-dcache-stores OR cpu/L1-dcache-stores/
  L1-icache-load-misses OR cpu/L1-icache-load-misses/
  LLC-loads OR cpu/LLC-loads/
  LLC-load-misses OR cpu/LLC-load-misses/
  LLC-stores OR cpu/LLC-stores/
  LLC-store-misses OR cpu/LLC-store-misses/
  dTLB-loads OR cpu/dTLB-loads/
  dTLB-load-misses OR cpu/dTLB-load-misses/
  dTLB-stores OR cpu/dTLB-stores/
  dTLB-store-misses OR cpu/dTLB-store-misses/
  iTLB-loads OR cpu/iTLB-loads/
  iTLB-load-misses OR cpu/iTLB-load-misses/
  branch-loads OR cpu/branch-loads/
  branch-load-misses OR cpu/branch-load-misses/
  node-loads OR cpu/node-loads/
  node-load-misses OR cpu/node-load-misses/
  node-stores OR cpu/node-stores/
  node-store-misses OR cpu/node-store-misses/
#
# grep -m1 'model name' /proc/cpuinfo
model name	: AMD Ryzen 9 5950X 16-Core Processor
# perf list cache

List of pre-defined events (to be used in -e or -M):

cpu:
  L1-dcache-loads OR cpu/L1-dcache-loads/
  L1-dcache-load-misses OR cpu/L1-dcache-load-misses/
  L1-dcache-prefetches OR cpu/L1-dcache-prefetches/
  L1-icache-loads OR cpu/L1-icache-loads/
  L1-icache-load-misses OR cpu/L1-icache-load-misses/
  dTLB-loads OR cpu/dTLB-loads/
  dTLB-load-misses OR cpu/dTLB-load-misses/
  iTLB-loads OR cpu/iTLB-loads/
  iTLB-load-misses OR cpu/iTLB-load-misses/
  branch-loads OR cpu/branch-loads/
  branch-load-misses OR cpu/branch-load-misses/
#
root@roc-rk3399-pc:~# uname -m
aarch64
root@roc-rk3399-pc:~# perf list cache

List of pre-defined events (to be used in -e or -M):

armv8_cortex_a53:
  L1-dcache-loads OR armv8_cortex_a53/L1-dcache-loads/
  L1-dcache-load-misses OR armv8_cortex_a53/L1-dcache-load-misses/
  L1-dcache-stores OR armv8_cortex_a53/L1-dcache-stores/
  L1-dcache-store-misses OR armv8_cortex_a53/L1-dcache-store-misses/
  L1-dcache-prefetch-misses OR armv8_cortex_a53/L1-dcache-prefetch-misses/
  L1-icache-loads OR armv8_cortex_a53/L1-icache-loads/
  L1-icache-load-misses OR armv8_cortex_a53/L1-icache-load-misses/
  dTLB-load-misses OR armv8_cortex_a53/dTLB-load-misses/
  dTLB-store-misses OR armv8_cortex_a53/dTLB-store-misses/
  iTLB-load-misses OR armv8_cortex_a53/iTLB-load-misses/
  branch-loads OR armv8_cortex_a53/branch-loads/
  branch-load-misses OR armv8_cortex_a53/branch-load-misses/
  node-loads OR armv8_cortex_a53/node-loads/
  node-stores OR armv8_cortex_a53/node-stores/
armv8_cortex_a72:
  L1-dcache-loads OR armv8_cortex_a72/L1-dcache-loads/
  L1-dcache-load-misses OR armv8_cortex_a72/L1-dcache-load-misses/
  L1-dcache-stores OR armv8_cortex_a72/L1-dcache-stores/
  L1-dcache-store-misses OR armv8_cortex_a72/L1-dcache-store-misses/
  L1-dcache-prefetch-misses OR armv8_cortex_a72/L1-dcache-prefetch-misses/
  L1-icache-loads OR armv8_cortex_a72/L1-icache-loads/
  L1-icache-load-misses OR armv8_cortex_a72/L1-icache-load-misses/
  dTLB-load-misses OR armv8_cortex_a72/dTLB-load-misses/
  dTLB-store-misses OR armv8_cortex_a72/dTLB-store-misses/
  iTLB-load-misses OR armv8_cortex_a72/iTLB-load-misses/
  branch-loads OR armv8_cortex_a72/branch-loads/
  branch-load-misses OR armv8_cortex_a72/branch-load-misses/
  node-loads OR armv8_cortex_a72/node-loads/
  node-stores OR armv8_cortex_a72/node-stores/
root@roc-rk3399-pc:~#
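On such big.LITTLE systems an event can be pinned to one core type by prefixing it with the PMU name from the listings above, e.g.:

root@roc-rk3399-pc:~# perf stat -a -e armv8_cortex_a72/L1-dcache-loads/,armv8_cortex_a53/L1-dcache-loads/ sleep 1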
$ ls tools/perf/pmu-events/arch/
arm64  nds32  powerpc  riscv  s390  test  x86
$ ls tools/perf/pmu-events/arch/x86/
alderlake   amdzen1  amdzen3  bonnell    broadwellde  cascadelakex  goldmont      grandridge     haswell   icelake   ivybridge  jaketown        mapfile.csv  nehalemep  rocketlake   sapphirerapids  silvermont  skylakex    tigerlake      westmereep-sp
alderlaken  amdzen2  amdzen4  broadwell  broadwellx   elkhartlake   goldmontplus  graniterapids  haswellx  icelakex  ivytown    knightslanding  meteorlake   nehalemex  sandybridge  sierraforest    skylake     snowridgex  westmereep-dp  westmereex
$ ls tools/perf/pmu-events/arch/riscv/
mapfile.csv  riscv-sbi-firmware.json  sifive
$ ls tools/perf/pmu-events/arch/riscv/sifive/
u74
$ ls tools/perf/pmu-events/arch/riscv/sifive/u74/
firmware.json  instructions.json  memory.json  microarch.json
$
$ head -6 tools/perf/pmu-events/arch/riscv/sifive/u74/instructions.json
[
  {
    "EventName": "EXCEPTION_TAKEN",
    "EventCode": "0x0000100",
    "BriefDescription": "Exception taken"
  },
$
# perf list dtlb | head -20

virtual memory:
  dtlb_load_misses.miss_causes_a_walk
       [Load misses in all DTLB levels that cause page walks]
  dtlb_load_misses.stlb_hit
       [Loads that miss the DTLB and hit the STLB]
  dtlb_load_misses.walk_active
       [Cycles when at least one PMH is busy with a page walk for a load. EPT page walk duration are excluded in Skylake]
  dtlb_load_misses.walk_completed
       [Load miss in all TLB levels causes a page walk that completes. (All page sizes)]
  dtlb_load_misses.walk_completed_1g
       [Page walk completed due to a demand data load to a 1G page]
  dtlb_load_misses.walk_completed_2m_4m
       [Page walk completed due to a demand data load to a 2M/4M page]
  dtlb_load_misses.walk_completed_4k
       [Page walk completed due to a demand data load to a 4K page]
  dtlb_load_misses.walk_pending
       [Counts 1 per cycle for each PMH that is busy with a page walk for a
#
# perf stat stress-ng --cpu 8 --timeout 5
stress-ng: info:  [530937] setting to a 5 secs run per stressor
stress-ng: info:  [530937] dispatching hogs: 8 cpu
stress-ng: info:  [530937] cpu:
stress-ng: info:  [530937]                 1921 Thermal Event Interrupts
stress-ng: warn:  [530937]                    1 System Management Interrupt
stress-ng: info:  [530937] skipped: 0
stress-ng: info:  [530937] passed: 8: cpu (8)
stress-ng: info:  [530937] failed: 0
stress-ng: info:  [530937] metrics untrustworthy: 0
stress-ng: info:  [530937] successful run completed in 5.01 secs
# perf stat stress-ng --cpu 8 --timeout 5

 Performance counter stats for 'stress-ng --cpu 8 --timeout 5':

         38,758.27 msec task-clock                #    7.726 CPUs utilized
             2,926      context-switches          #   75.494 /sec
                23      cpu-migrations            #    0.593 /sec
             4,602      page-faults               #  118.736 /sec
   146,703,471,275      cycles                    #    3.785 GHz
   180,899,175,476      instructions              #    1.23  insn per cycle
    20,292,093,164      branches                  #  523.555 M/sec
       298,795,750      branch-misses             #    1.47% of all branches

       5.016468088 seconds time elapsed

      38.552115000 seconds user
       0.103353000 seconds sys

#
# perf stat -d stress-ng --cpu 8 --timeout 5

 Performance counter stats for 'stress-ng --cpu 8 --timeout 5':

         39,764.39 msec task-clock                #    7.926 CPUs utilized
             1,979      context-switches          #   49.768 /sec
                 1      cpu-migrations            #    0.025 /sec
             4,601      page-faults               #  115.707 /sec
   150,499,468,551      cycles                    #    3.785 GHz                      (50.02%)
   183,080,730,725      instructions              #    1.22  insn per cycle           (62.50%)
    20,508,317,865      branches                  #  515.746 M/sec                    (62.49%)
       298,939,900      branch-misses             #    1.46% of all branches          (62.50%)
    23,971,178,026      L1-dcache-loads           #  602.830 M/sec                    (62.49%)
     2,853,927,928      L1-dcache-load-misses     #   11.91% of all L1-dcache accesses (62.50%)
       990,690,295      LLC-loads                 #   24.914 M/sec                    (50.02%)
           429,819      LLC-load-misses           #    0.04% of all L1-icache accesses (50.02%)

       5.017089011 seconds time elapsed

      39.392300000 seconds user
       0.057596000 seconds sys

#
# perf stat -dd stress-ng --cpu 8 --timeout 5

 Performance counter stats for 'stress-ng --cpu 8 --timeout 5':

         39,825.52 msec task-clock                #    7.938 CPUs utilized
             1,889      context-switches          #   47.432 /sec
                 4      cpu-migrations            #    0.100 /sec
             4,610      page-faults               #  115.755 /sec
   151,191,351,560      cycles                    #    3.796 GHz                      (30.77%)
   180,391,636,461      instructions              #    1.19  insn per cycle           (38.49%)
    20,290,955,975      branches                  #  509.496 M/sec                    (38.50%)
       291,480,930      branch-misses             #    1.44% of all branches          (38.49%)
    23,643,871,036      L1-dcache-loads           #  593.686 M/sec                    (38.49%)
     2,699,885,182      L1-dcache-load-misses     #   11.42% of all L1-dcache accesses (38.50%)
       987,819,442      LLC-loads                 #   24.804 M/sec                    (30.79%)
           440,627      LLC-load-misses           #    0.04% of all L1-icache accesses (30.78%)
   <not supported>      L1-icache-loads
       203,971,595      L1-icache-load-misses                                          (30.76%)
    23,864,128,198      dTLB-loads                #  599.217 M/sec                    (30.77%)
           200,750      dTLB-load-misses          #    0.00% of all dTLB cache accesses (30.76%)
         1,452,617      iTLB-loads                #   36.475 K/sec                    (30.76%)
           554,741      iTLB-load-misses          #   38.19% of all iTLB cache accesses (30.78%)

       5.017276264 seconds time elapsed

      39.447619000 seconds user
       0.066076000 seconds sys

#
# perf stat -ddd stress-ng --cpu 8 --timeout 5

 Performance counter stats for 'stress-ng --cpu 8 --timeout 5':

         39,881.34 msec task-clock                #    7.946 CPUs utilized
             1,629      context-switches          #   40.846 /sec
                 1      cpu-migrations            #    0.025 /sec
             4,581      page-faults               #  114.866 /sec
   145,105,355,853      cycles                    #    3.638 GHz                      (30.80%)
   171,764,585,477      instructions              #    1.18  insn per cycle           (38.48%)
    19,247,245,672      branches                  #  482.613 M/sec                    (38.48%)
       274,979,345      branch-misses             #    1.43% of all branches          (38.48%)
    22,309,648,482      L1-dcache-loads           #  559.401 M/sec                    (38.49%)
     2,538,566,841      L1-dcache-load-misses     #   11.38% of all L1-dcache accesses (38.48%)
       990,773,774      LLC-loads                 #   24.843 M/sec                    (30.78%)
           442,362      LLC-load-misses           #    0.04% of all L1-icache accesses (30.77%)
   <not supported>      L1-icache-loads
       178,304,864      L1-icache-load-misses                                          (30.75%)
    22,477,712,511      dTLB-loads                #  563.615 M/sec                    (30.77%)
           210,271      dTLB-load-misses          #    0.00% of all dTLB cache accesses (30.78%)
         1,827,194      iTLB-loads                #   45.816 K/sec                    (30.80%)
           335,946      iTLB-load-misses          #   18.39% of all iTLB cache accesses (30.80%)
   <not supported>      L1-dcache-prefetches
   <not supported>      L1-dcache-prefetch-misses

       5.018978942 seconds time elapsed

      39.499595000 seconds user
       0.053139000 seconds sys

#
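The percentages in parentheses show how long each counter was actually scheduled on the PMU: with more events than hardware counters, perf time-multiplexes them and scales the counts. Events that must be measured together can be put in a group, which is then scheduled as a unit; a sketch:

# perf stat -e '{cycles,instructions,branches,branch-misses}' stress-ng --cpu 8 --timeout 5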
$ perf stat sleep 1

 Performance counter stats for 'sleep 1':

              0.51 msec task-clock:u              #    0.001 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
                68      page-faults:u             #  132.683 K/sec
           545,228      cycles:u                  #    1.064 GHz
           555,509      instructions:u            #    1.02  insn per cycle
           123,574      branches:u                #  241.120 M/sec
             6,957      branch-misses:u           #    5.63% of all branches
                                                  TopdownL1
                                                  #     17.5 %  tma_backend_bound
                                                  #     22.6 %  tma_bad_speculation
                                                  #     42.7 %  tma_frontend_bound
                                                  #     17.1 %  tma_retiring
                                                  TopdownL2
                                                  #     21.8 %  tma_branch_mispredicts
                                                  #     11.5 %  tma_core_bound
                                                  #     13.4 %  tma_fetch_bandwidth
                                                  #     29.3 %  tma_fetch_latency
                                                  #      2.7 %  tma_heavy_operations
                                                  #     14.5 %  tma_light_operations
                                                  #      0.8 %  tma_machine_clears
                                                  #      6.1 %  tma_memory_bound
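The same tree can be requested explicitly as a metric group, which is what the "-e or -M" hint in the perf list output refers to:

$ perf stat -M TopdownL1 sleep 1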
# perf stat -e cycles,instructions -C 1 sleep 1

 Performance counter stats for 'CPU(s) 1':

        14,490,458      cycles
         6,429,339      instructions              #    0.44  insn per cycle

       1.002107055 seconds time elapsed

#
# taskset --cpu-list 1 stress-ng --cpu 1
stress-ng: info: [531903] defaulting to a 1 day, 0 secs run per stressor
stress-ng: info: [531903] dispatching hogs: 1 cpu
# taskset --cpu-list -p 531903
pid 531903's current affinity list: 1
#
# perf stat -e cycles,instructions --cpu 1 sleep 1

 Performance counter stats for 'CPU(s) 1':

     3,968,302,827      cycles
     6,913,940,181      instructions              #    1.74  insn per cycle

       1.001263107 seconds time elapsed

#
# perf stat -e cycles,instructions -p 531904 sleep 1

 Performance counter stats for process id '531904':

     3,976,708,553      cycles
     6,937,604,548      instructions              #    1.74  insn per cycle

       1.001302101 seconds time elapsed

#
# perf stat --interval-print 1000 -e cycles,instructions -p 531904
#           time             counts unit events
     1.001021154      3,964,676,448      cycles
     1.001021154      6,892,761,344      instructions              #    1.74  insn per cycle
     2.002120190      3,967,753,676      cycles
     2.002120190      6,910,658,758      instructions              #    1.74  insn per cycle
     3.002967437      3,971,391,822      cycles
     3.002967437      6,932,051,939      instructions              #    1.75  insn per cycle
     4.004059795      3,974,283,493      cycles
     4.004059795      6,907,156,529      instructions              #    1.74  insn per cycle
     5.004964383      3,968,905,089      cycles
     5.004964383      6,909,361,881      instructions              #    1.74  insn per cycle
     6.006058613      3,975,937,140      cycles
     6.006058613      6,935,675,487      instructions              #    1.74  insn per cycle
     7.006967950      3,979,759,663      cycles
^C
     7.006967950      6,938,796,949      instructions              #    1.74  insn per cycle
#
# cat /proc/531903/cgroup
0::/user.slice/user-1000.slice/session-38.scope
# ps -eo pid,args,pcpu,cgroup --sort user | grep session-38
 531503 sshd: acme@pts/7             0.0 0::/user.slice/user-1000.slice/session-38.scope
 531516 -bash                        0.0 0::/user.slice/user-1000.slice/session-38.scope
 531496 sshd: acme [priv]            0.0 0::/user.slice/user-1000.slice/session-38.scope
 531636 sudo su -                    0.0 0::/user.slice/user-1000.slice/session-38.scope
 531637 su -                         0.0 0::/user.slice/user-1000.slice/session-38.scope
 531640 -bash                        0.0 0::/user.slice/user-1000.slice/session-38.scope
 531903 stress-ng --cpu 1            0.0 0::/user.slice/user-1000.slice/session-38.scope
 531904 stress-ng-cpu [run]         99.8 0::/user.slice/user-1000.slice/session-38.scope
#
# perf stat -I 1000 -e cycles,instructions -G user.slice/user-1000.slice/session-38.scope
#           time             counts unit events
     1.000102472      3,959,963,051      cycles        user.slice/user-1000.slice/session-38.scope
     1.000102472      6,891,303,376      instructions  user.slice/user-1000.slice/session-38.scope  #  1.74 insn per cycle
     2.001681379      3,964,485,454      cycles        user.slice/user-1000.slice/session-38.scope
     2.001681379      6,919,617,821      instructions  user.slice/user-1000.slice/session-38.scope  #  1.75 insn per cycle
     3.003042309      3,967,487,922      cycles        user.slice/user-1000.slice/session-38.scope
     3.003042309      6,916,449,901      instructions  user.slice/user-1000.slice/session-38.scope  #  1.74 insn per cycle
     4.004647780      3,961,207,761      cycles        user.slice/user-1000.slice/session-38.scope
     4.004647780      6,886,653,488      instructions  user.slice/user-1000.slice/session-38.scope  #  1.74 insn per cycle
     5.006216894      3,970,623,638      cycles        user.slice/user-1000.slice/session-38.scope
     5.006216894      6,927,468,227      instructions  user.slice/user-1000.slice/session-38.scope  #  1.74 insn per cycle
^C
     5.332583270      1,290,202,842      cycles        user.slice/user-1000.slice/session-38.scope
     5.332583270      2,245,390,805      instructions  user.slice/user-1000.slice/session-38.scope  #  1.74 insn per cycle
#
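Instead of naming one cgroup with -G, the events can be expanded for several cgroups at once with --for-each-cgroup, the option exercised by test 104 earlier; a sketch:

# perf stat -a -I 1000 -e cycles,instructions --for-each-cgroup user.slice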
# bpftool prog | head -8
2: tracing  name hid_tail_call  tag 7cc47bbf07148bfe  gpl
	loaded_at 2023-09-04T13:07:19-0300  uid 0
	xlated 56B  jited 113B  memlock 4096B  map_ids 2
	btf_id 2
47: lsm  name restrict_filesy  tag aae89fa01fe7ee91  gpl
	loaded_at 2023-09-04T13:07:25-0300  uid 0
	xlated 560B  jited 305B  memlock 4096B  map_ids 24
	btf_id 60
#
# bpftool prog dump xlated id 47 | head -15
int restrict_filesystems(unsigned long long * ctx):
; int BPF_PROG(restrict_filesystems, struct file *file, int ret)
   0: (79) r3 = *(u64 *)(r1 +0)
   1: (79) r0 = *(u64 *)(r1 +8)
   2: (b7) r1 = 0
; uint32_t *value, *magic_map, magic_number, zero = 0, *is_allow;
   3: (63) *(u32 *)(r10 -24) = r1
; int BPF_PROG(restrict_filesystems, struct file *file, int ret)
   4: (bf) r1 = r0
   5: (67) r1 <<= 32
   6: (77) r1 >>= 32
; if (ret != 0)
   7: (55) if r1 != 0x0 goto pc+61
   8: (b7) r1 = 32
   9: (0f) r3 += r1
#
# mkdir test
# cd test
# ls
# perf record sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.034 MB perf.data (8 samples) ]
# ls
perf.data
#
# perf evlist
cycles:P
# perf evlist -v cycles:P: type: 0 (PERF_TYPE_HARDWARE), size: 136, config: 0 (PERF_COUNT_HW_CPU_CYCLES), { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|PERIOD, read_format: ID|LOST, disabled: 1, inherit: 1, mmap: 1, comm: 1, freq: 1, enable_on_exec: 1, task: 1, precise_ip: 3, sample_id_all: 1, mmap2: 1, comm_exec: 1, ksymbol: 1, bpf_event: 1 #
# perf report --header-only
# captured on    : Fri Sep 15 16:38:23 2023
# header version : 1
# data offset    : 320
# data size      : 35520
# feat offset    : 35840
# hostname : quaco
# os release : 6.4.10-200.fc38.x86_64
# perf version : 6.6.rc1.gf6ff1c760431
# arch : x86_64
# nrcpus online : 8
# nrcpus avail : 8
# cpudesc : Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
# cpuid : GenuineIntel,6,142,10
# total memory : 24471460 kB
# cmdline : /home/acme/bin/perf record sleep 1
# event : name = cycles:P, type = 0 (PERF_TYPE_HARDWARE), size = 136, config = 0 (PERF_COUNT_HW_CPU_CYCLES)
...
# CPU_TOPOLOGY info available, use -I to display
# NUMA_TOPOLOGY info available, use -I to display
# pmu mappings: cpu = 4, breakpoint = 5, cstate_core = 17, cstate_pkg = 18, i915 = 10, intel_pt = 8, kprobe = 6, msr = 9, power = 19
...
# CACHE info available, use -I to display
# time of first sample : 388144.384852
# time of last sample : 388144.385473
# bpf_prog_info 702: bpf_prog_6deef7357e7b4530_sd_fw_egress addr 0xffffffffc0239ff4 size 63
# bpf_prog_info 703: bpf_prog_6deef7357e7b4530_sd_fw_ingress addr 0xffffffffc023a06c size 63
# bpf_prog_info 704: bpf_prog_ccbbf91f3c6979c7_sd_devices addr 0xffffffffc0237a0c size 360
# btf info of id 2
# btf info of id 60
# cpu pmu capabilities: branches=32, max_precise=3, pmu_name=skylake
# intel_pt pmu capabilities: topa_multiple_entries=1, psb_cyc=1, single_range_output=1, mtc_periods=249, ip_filtering=1, output_subsys=0
...
# missing features: TRACING_DATA BRANCH_STACK GROUP_DESC AUXTRACE STAT CLOCKID DIR_FORMAT COMPRESSED CLOCK_DATA HYBRID_TOPOLOGY
# sibling threads : 0,4
# sibling threads : 1,5
# sibling threads : 2,6
# sibling threads : 3,7
# CPU 0: Core ID 0, Die ID 0, Socket ID 0
# CPU 1: Core ID 1, Die ID 0, Socket ID 0
# CPU 2: Core ID 2, Die ID 0, Socket ID 0
# CPU 3: Core ID 3, Die ID 0, Socket ID 0
# CPU 4: Core ID 0, Die ID 0, Socket ID 0
# CPU 5: Core ID 1, Die ID 0, Socket ID 0
# CPU 6: Core ID 2, Die ID 0, Socket ID 0
# CPU 7: Core ID 3, Die ID 0, Socket ID 0
# CPU cache info:
#  L1 Data         32K [0,4]
#  L1 Instruction  32K [0,4]
#  L1 Data         32K [1,5]
#  L1 Instruction  32K [1,5]
#  L1 Data         32K [2,6]
#  L1 Instruction  32K [2,6]
#  L1 Data         32K [3,7]
#  L1 Instruction  32K [3,7]
#  L2 Unified     256K [0,4]
#  L2 Unified     256K [1,5]
#  L2 Unified     256K [2,6]
#  L2 Unified     256K [3,7]
#  L3 Unified    8192K [0-7]
# perf report --tasks
#      pid      tid     ppid  comm
         0        0       -1 |swapper
    711036   711036       -1 |sleep
#
# perf report --mmaps
#      pid      tid     ppid  comm
         0        0       -1 |swapper
    711036   711036       -1 |sleep
                              55bb9d3b1000-55bb9d3b5000 r-xs 00002000 1969023 /usr/bin/sleep
                              7f070155e000-7f07016bb000 r-xs 00026000 1985734 /usr/lib64/libc.so.6
                              7f0701733000-7f0701759000 r-xs 00001000 1985730 /usr/lib64/ld-linux-x86-64.so.2
                              7ffd4cdd3000-7ffd4cdd5000 r-xs 00000000 0 [vdso]
#
# perf record -a -e '{cycles,cache-misses}' sleep 2s
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 2.378 MB perf.data (6666 samples) ]
#
# perf report --stdio
# Total Lost Samples: 0
#
# Samples: 6K of events 'anon group { cycles, cache-misses }'
# Event count (approx.): 647299225
#
# Overhead         Command          Shared Object     Symbol
     9.73%  1.55%  swapper          [kernel.vmlinux]  [k] intel_idle_ibrs
     2.45%  5.96%  Isolated Web Co  libxul.so         [.] mozilla::EventListenerManager::MarkForCC()
     0.98%  0.44%  swapper          [kernel.vmlinux]  [k] psi_group_change
     0.72%  0.56%  swapper          [kernel.vmlinux]  [k] update_load_avg
     0.65%  0.19%  swapper          [kernel.vmlinux]  [k] timerqueue_add
     0.59%  0.50%  swapper          [kernel.vmlinux]  [k] cpuidle_enter_state
     0.58%  0.32%  swapper          [kernel.vmlinux]  [k] menu_select
     0.56%  0.00%  perf-exec        [kernel.vmlinux]  [k] release_pages
     0.55%  0.26%  swapper          [kernel.vmlinux]  [k] switch_mm_irqs_off
# perf report --show-nr-samples --stdio | head -20
# Total Lost Samples: 0
#
# Samples: 6K of events 'anon group { cycles, cache-misses }'
# Event count (approx.): 647299225
#
# Overhead        Samples   Command          Shared Object     Symbol
     9.73%  1.55%  493  44  swapper          [kernel.vmlinux]  [k] intel_idle_ibrs
     2.45%  5.96%   70  71  Isolated Web Co  libxul.so         [.] mozilla::EventListenerManager::MarkForCC()
     0.98%  0.44%   53  12  swapper          [kernel.vmlinux]  [k] psi_group_change
     0.72%  0.56%   45  16  swapper          [kernel.vmlinux]  [k] update_load_avg
     0.65%  0.19%   37   7  swapper          [kernel.vmlinux]  [k] timerqueue_add
     0.59%  0.50%   36  13  swapper          [kernel.vmlinux]  [k] cpuidle_enter_state
     0.58%  0.32%   36   7  swapper          [kernel.vmlinux]  [k] menu_select
     0.56%  0.00%    1   0  perf-exec        [kernel.vmlinux]  [k] release_pages
     0.55%  0.26%   22   9  swapper          [kernel.vmlinux]  [k] switch_mm_irqs_off
# perf report --show-nr-samples --sort dso --stdio | head -30
# Total Lost Samples: 0
#
# Samples: 6K of events 'anon group { cycles, cache-misses }'
# Event count (approx.): 647299225
#
# Overhead          Samples    Shared Object
    52.27% 42.98%  2582 1242  [kernel.vmlinux]
    20.74% 26.64%   671  568  libxul.so
     8.88%  8.96%   210  189  [JIT] tid 7566
     5.46%  5.22%   210  139  libc.so.6
     3.99%  4.49%   146  105  firefox
     2.33%  2.03%    55   48  [JIT] tid 7578
     1.00%  1.04%    37   24  libglib-2.0.so.0.7600.4 (deleted)
     0.56%  0.12%    14    6  libfdk-aac.so.2.0.0
     0.43%  0.22%    16    4  [i915]
     0.41%  1.16%    24   34  libglib-2.0.so.0.7600.5
     0.38%  0.42%    16   10  libsystemd-shared-253.7-1.fc38.so
     0.37%  0.29%    18    8  [vdso]
     0.34%  0.41%    14    8  libnspr4.so (deleted)
     0.28%  0.41%    15   10  ld-linux-x86-64.so.2
     0.18%  0.11%     5    2  libmutter-12.so.0.0.0 (deleted)
     0.15%  0.33%     7    6  [JIT] tid 8358
     0.14%  0.36%    11   12  libwebkit2gtk-4.1.so.0.8.5
     0.13%  0.13%     5    3  libpixman-1.so.0.42.2
     0.13%  0.16%    10    5  libjavascriptcoregtk-4.1.so.0.3.14
#
# perf report --stdio --sort bla
Error: Unknown --sort key: `bla'

 Usage: perf report [<options>]

    -s, --sort <key[,key2...]>
                          sort by key(s): overhead overhead_sys overhead_us overhead_guest_sys
                          overhead_guest_us overhead_children sample period pid comm dso symbol
                          parent cpu socket srcline srcfile local_weight weight transaction trace
                          symbol_size dso_size cgroup cgroup_id ipc_null time code_page_size
                          local_ins_lat ins_lat local_p_stage_cyc p_stage_cyc addr local_retire_lat
                          retire_lat simd dso_from dso_to symbol_from symbol_to mispredict abort
                          in_tx cycles srcline_from srcline_to ipc_lbr addr_from addr_to
                          symbol_daddr dso_daddr locked tlb mem snoop dcacheline symbol_iaddr
                          phys_daddr data_page_size blocked
#
# perf report --sort dso,sym --stdio | head -30
# Total Lost Samples: 0
#
# Samples: 6K of events 'anon group { cycles, cache-misses }'
# Event count (approx.): 647299225
#
# Overhead         Shared Object     Symbol
     9.73%  1.55%  [kernel.vmlinux]  [k] intel_idle_ibrs
     2.45%  5.96%  libxul.so         [.] mozilla::EventListenerManager::MarkForCC()
     1.41%  0.16%  [kernel.vmlinux]  [k] entry_SYSCALL_64
     1.36%  0.05%  [kernel.vmlinux]  [k] syscall_return_via_sysret
     1.35%  0.09%  [kernel.vmlinux]  [k] syscall_exit_to_user_mode
     1.31%  1.19%  libc.so.6         [.] pthread_mutex_lock@@GLIBC_2.2.5
     1.30%  0.47%  [kernel.vmlinux]  [k] psi_group_change
     1.05%  1.11%  firefox           [.] malloc
     1.01%  0.69%  [kernel.vmlinux]  [k] update_load_avg
     0.81%  0.34%  libc.so.6         [.] __GI___pthread_mutex_unlock_usercnt
     0.80%  0.61%  [kernel.vmlinux]  [k] __update_load_avg_se
     0.78%  0.67%  [kernel.vmlinux]  [k] __schedule
     0.77%  0.28%  [kernel.vmlinux]  [k] timerqueue_add
     0.71%  0.84%  [kernel.vmlinux]  [k] __update_load_avg_cfs_rq
     0.67%  0.29%  [kernel.vmlinux]  [k] update_curr
     0.65%  0.39%  firefox           [.] free
     0.59%  0.50%  [kernel.vmlinux]  [k] cpuidle_enter_state
#
[root@pumpkin c]# ./prometheusnoop
libbpf: prog 'gauge_1arg_xmm': BPF program load failed: Invalid argument
libbpf: prog 'gauge_1arg_xmm': -- BEGIN PROG LOAD LOG --
0: R1=ctx(off=0,imm=0) R10=fp0
; prometheus_metric_uprobe_1arg_xmm(gauge, valBits, true);
0: (bf) r6 = r1                       ; R1=ctx(off=0,imm=0) R6_w=ctx(off=0,imm=0)
; prometheus_metric_uprobe_1arg_xmm(gauge, valBits, true);
1: (79) r7 = *(u64 *)(r6 +80)         ; R6_w=ctx(off=0,imm=0) R7_w=scalar()
2: (18) r1 = 0x1800000000             ; R1_w=103079215104
; struct metric_event_parms args = {
4: (7b) *(u64 *)(r10 -40) = r1        ; R1_w=103079215104 R10=fp0 fp-40_w=103079215104
; .inc = bpf_read_64bit_xmm_register(0),
5: (b7) r1 = 0                        ; R1_w=0
6: (85) call bpf_bpf_read_64bit_xmm_register#212
kernel subsystem misconfigured func bpf_bpf_read_64bit_xmm_register#212
processed 6 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0
-- END PROG LOAD LOG --
libbpf: prog 'gauge_1arg_xmm': failed to load: -22
libbpf: failed to load object 'prometheusnoop_bpf'
libbpf: failed to load BPF skeleton 'prometheusnoop_bpf': -22
Failed to load and verify BPF skeleton
[root@pumpkin c]#
$ git grep 'kernel subsystem misconfigured func'
kernel/bpf/verifier.c:  verbose(env, "kernel subsystem misconfigured func %s#%d: r1 != ctx\n",
kernel/bpf/verifier.c:  verbose(env, "kernel subsystem misconfigured func %s#%d\n",
kernel/bpf/verifier.c:  "kernel subsystem misconfigured func %s#%d\n",
$
$ perf probe -L check_helper_call
<check_helper_call@/home/acme/git/linux/kernel/bpf/verifier.c:0>
      0  static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
                                      int *insn_idx_p)
         {
<SNIP>
                 err = check_func_proto(fn, func_id);
                 if (err) {
     58                  verbose(env, "kernel subsystem misconfigured func %s#%d\n",
     59                          func_id_name(func_id), func_id);
                         return err;
                 }
<SNIP>
$
# perf probe check_helper_call:58
Added new event:
  probe:check_helper_call_L58 (on check_helper_call:58)

You can now use it in all perf tools, such as:

	perf record -e probe:check_helper_call_L58 -aR sleep 1
#
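Since func_id is in scope at that line (see the -L listing above), the probe could also capture its value; a sketch, assuming kernel debuginfo is installed:

# perf probe 'check_helper_call:58 func_id'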
# perf trace -e probe:check_helper_call_L58
     0.000 prometheusnoop/5535 probe:check_helper_call_L58(__probe_ip: ffffffff9f2ff270)
     0.078 prometheusnoop/5535 probe:check_helper_call_L58(__probe_ip: ffffffff9f2ff270)
^C#
# perf trace -e probe:check_helper_call_L58/max-stack=9/ --max-events=1
     0.000 prometheusnoop/5659 probe:check_helper_call_L58(__probe_ip: ffffffff9f2ff270)
        check_helper_call ([kernel.kallsyms])
        do_check_common ([kernel.kallsyms])
        bpf_check ([kernel.kallsyms])
        bpf_prog_load ([kernel.kallsyms])
        __sys_bpf ([kernel.kallsyms])
        __x64_sys_bpf ([kernel.kallsyms])
        do_syscall_64 ([kernel.kallsyms])
        entry_SYSCALL_64 ([kernel.kallsyms])
        syscall (/usr/lib64/libc.so.6)
#
# perf probe -x /lib64/libc.so.6 free
Added new event:
  probe_libc:free      (on free in /usr/lib64/libc.so.6)

You can now use it in all perf tools, such as:

	perf record -e probe_libc:free -aR sleep 1
# perf stat -e probe_libc:* sleep 1

 Performance counter stats for 'sleep 1':

               185      probe_libc:free
               415      probe_libc:__libc_malloc

       1.003725291 seconds time elapsed

       0.000000000 seconds user
       0.003431000 seconds sys

#
# perf trace sleep 1
         ? (         ): sleep/3501543  ... [continued]: execve()) = 0
     0.020 ( 0.001 ms): sleep/3501543 brk() = 0x562e3ce8d000
     0.028 ( 0.001 ms): sleep/3501543 arch_prctl(option: 0x3001, arg2: 0x7ffd5bb13e80) = -1 EINVAL (Invalid argument)
     0.044 ( 0.004 ms): sleep/3501543 access(filename: "/etc/ld.so.preload", mode: R) = -1 ENOENT (No such file or directory)
     0.051 ( 0.004 ms): sleep/3501543 openat(dfd: CWD, filename: "/etc/ld.so.cache", flags: RDONLY|CLOEXEC) = 3
     0.056 ( 0.002 ms): sleep/3501543 newfstatat(dfd: 3, filename: "", statbuf: 0x7ffd5bb130a0, flag: 4096) = 0
     0.059 ( 0.005 ms): sleep/3501543 mmap(len: 54771, prot: READ, flags: PRIVATE, fd: 3) = 0x7f206c107000
     0.066 ( 0.001 ms): sleep/3501543 close(fd: 3) = 0
     0.072 ( 0.004 ms): sleep/3501543 openat(dfd: CWD, filename: "/lib64/libc.so.6", flags: RDONLY|CLOEXEC) = 3
     0.077 ( 0.002 ms): sleep/3501543 read(fd: 3, buf: 0x7ffd5bb13208, count: 832) = 832
     0.081 ( 0.001 ms): sleep/3501543 pread64(fd: 3, buf: 0x7ffd5bb12e00, count: 784, pos: 64) = 784
     0.084 ( 0.001 ms): sleep/3501543 pread64(fd: 3, buf: 0x7ffd5bb12da0, count: 80, pos: 848) = 80
     0.086 ( 0.001 ms): sleep/3501543 pread64(fd: 3, buf: 0x7ffd5bb12d50, count: 68, pos: 928) = 68
     0.089 ( 0.002 ms): sleep/3501543 newfstatat(dfd: 3, filename: "", statbuf: 0x7ffd5bb130a0, flag: 4096) = 0
     0.092 ( 0.004 ms): sleep/3501543 mmap(len: 8192, prot: READ|WRITE, flags: PRIVATE|ANONYMOUS) = 0x7f206c105000
     0.099 ( 0.001 ms): sleep/3501543 pread64(fd: 3, buf: 0x7ffd5bb12cf0, count: 784, pos: 64) = 784
     0.101 ( 0.005 ms): sleep/3501543 mmap(len: 2104720, prot: READ, flags: PRIVATE|DENYWRITE, fd: 3) = 0x7f206be00000
     0.108 ( 0.012 ms): sleep/3501543 mmap(addr: 0x7f206be28000, len: 1523712, prot: READ|EXEC, flags: PRIVATE|FIXED|DENYWRITE, fd: 3, off: 0x28000) = 0x7f206be28000
     0.120 ( 0.007 ms): sleep/3501543 mmap(addr: 0x7f206bf9c000, len: 360448, prot: READ, flags: PRIVATE|FIXED|DENYWRITE, fd: 3, off: 0x19c000) = 0x7f206bf9c000
     0.129 ( 0.010 ms): sleep/3501543 mmap(addr: 0x7f206bff4000, len: 24576, prot: READ|WRITE, flags: PRIVATE|FIXED|DENYWRITE, fd: 3, off: 0x1f3000) = 0x7f206bff4000
     0.143 ( 0.005 ms): sleep/3501543 mmap(addr: 0x7f206bffa000, len: 32144, prot: READ|WRITE, flags: PRIVATE|FIXED|ANONYMOUS) = 0x7f206bffa000
     0.162 ( 0.001 ms): sleep/3501543 close(fd: 3) = 0
     0.170 ( 0.005 ms): sleep/3501543 mmap(len: 12288, prot: READ|WRITE, flags: PRIVATE|ANONYMOUS) = 0x7f206c102000
     0.178 ( 0.001 ms): sleep/3501543 arch_prctl(option: SET_FS, arg2: 0x7f206c102740) = 0
     0.180 ( 0.001 ms): sleep/3501543 set_tid_address(tidptr: 0x7f206c102a10) = 3501543 (sleep)
     0.182 ( 0.001 ms): sleep/3501543 set_robust_list(head: 0x7f206c102a20, len: 24) = 0
     0.185 ( 0.001 ms): sleep/3501543 rseq(rseq: 0x7f206c1030e0, rseq_len: 32, sig: 1392848979) = 0
     0.223 ( 0.005 ms): sleep/3501543 mprotect(start: 0x7f206bff4000, len: 16384, prot: READ) = 0
     0.239 ( 0.004 ms): sleep/3501543 mprotect(start: 0x562e3bb5b000, len: 4096, prot: READ) = 0
     0.249 ( 0.005 ms): sleep/3501543 mprotect(start: 0x7f206c14a000, len: 8192, prot: READ) = 0
     0.261 ( 0.001 ms): sleep/3501543 prlimit64(resource: STACK, old_rlim: 0x7ffd5bb13be0) = 0
     0.266 ( 0.006 ms): sleep/3501543 munmap(addr: 0x7f206c107000, len: 54771) = 0
     0.285 ( 0.002 ms): sleep/3501543 getrandom(ubuf: 0x7f206bfff4d8, len: 8, flags: NONBLOCK) = 8
     0.288 ( 0.001 ms): sleep/3501543 brk() = 0x562e3ce8d000
     0.290 ( 0.004 ms): sleep/3501543 brk(brk: 0x562e3ceae000) = 0x562e3ceae000
     0.300 ( 0.007 ms): sleep/3501543 openat(dfd: CWD, filename: "", flags: RDONLY|CLOEXEC) = 3
     0.309 ( 0.002 ms): sleep/3501543 newfstatat(dfd: 3, filename: "", statbuf: 0x7f206bff9b80, flag: 4096) = 0
     0.312 ( 0.005 ms): sleep/3501543 mmap(len: 224096080, prot: READ, flags: PRIVATE, fd: 3) = 0x7f205e800000
     0.322 ( 0.001 ms): sleep/3501543 close(fd: 3) = 0
     0.351 (1000.073 ms): sleep/3501543 clock_nanosleep(rqtp: { .tv_sec: 1, .tv_nsec: 0 }, rmtp: 0x7ffd5bb13e70) = 0
  1000.451 ( 0.003 ms): sleep/3501543 close(fd: 1) = 0
  1000.458 ( 0.002 ms): sleep/3501543 close(fd: 2) = 0
  1000.465 (         ): sleep/3501543 exit_group() = ?
#
# perf trace --syscalls -e syscalls:sys_enter_getrandom --switch-on syscalls:sys_enter_getrandom -- sleep 1
         ? (         ): sleep/3501438  ... [continued]: getrandom()) = 8
     0.002 ( 0.001 ms): sleep/3501438 brk() = 0x561a2fa09000
     0.004 ( 0.004 ms): sleep/3501438 brk(brk: 0x561a2fa2a000) = 0x561a2fa2a000
     0.015 ( 0.007 ms): sleep/3501438 openat(dfd: CWD, filename: "", flags: RDONLY|CLOEXEC) = 3
     0.024 ( 0.002 ms): sleep/3501438 newfstatat(dfd: 3, filename: "", statbuf: 0x7fee373f9b80, flag: 4096) = 0
     0.027 ( 0.005 ms): sleep/3501438 mmap(len: 224096080, prot: READ, flags: PRIVATE, fd: 3) = 0x7fee29c00000
     0.037 ( 0.001 ms): sleep/3501438 close(fd: 3) = 0
     0.068 (1000.047 ms): sleep/3501438 clock_nanosleep(rqtp: { .tv_sec: 1, .tv_nsec: 0 }, rmtp: 0x7ffd072477c0) = 0
  1000.137 ( 0.003 ms): sleep/3501438 close(fd: 1) = 0
  1000.143 ( 0.002 ms): sleep/3501438 close(fd: 2) = 0
  1000.151 (         ): sleep/3501438 exit_group() = ?
#
# perf trace -h pagefault

 Usage: perf trace [<options>] [<command>]
    or: perf trace [<options>] -- <command> [<options>]
    or: perf trace record [<options>] [<command>]
    or: perf trace record [<options>] -- <command> [<options>]

    -F, --pf <all|maj|min>
                          Trace pagefaults

[root@five ~]#
# perf trace -F
     0.000 ( 0.000 ms): gnome-shell/10430 majfault [__memmove_avx_unaligned_erms+0x30a] => /dev/dri/card1@0x124aa8000 (d.)
  1631.809 ( 0.000 ms): rpmkeys/3184890 majfault [0x14f0] => /usr/bin/rpmkeys@0x14f0 (d.)
  1632.565 ( 0.000 ms): rpmkeys/3184890 majfault [0x89d0] => /usr/lib64/libpopt.so.0.0.2@0x89d0 (d.)
  1650.404 ( 0.000 ms): rpmkeys/3184890 majfault [0x64c80] => /usr/lib64/librpm_sequoia.so.1@0x64c80 (d.)
  2415.842 ( 0.000 ms): dnf/3183950 majfault [0x812bf] => /usr/lib64/libsqlite3.so.0.8.6@0x812bf (d.)
  3722.995 ( 0.000 ms): alternatives/3184948 majfault [padzero+0x38] => 0x563ba5e34004 (?k)
# perf trace -F --call-graph dwarf
     0.000 ( 0.000 ms): gnome-shell/10430 majfault [__memmove_avx_unaligned_erms+0x30a] => /dev/dri/card1@0x124aa8000 (d.)
        __memmove_avx_unaligned_erms (/usr/lib64/libc.so.6)
  1631.809 ( 0.000 ms): rpmkeys/3184890 majfault [0x14f0] => /usr/bin/rpmkeys@0x14f0 (d.)
        [0x14f0] (/usr/bin/rpmkeys)
  1632.565 ( 0.000 ms): rpmkeys/3184890 majfault [0x89d0] => /usr/lib64/libpopt.so.0.0.2@0x89d0 (d.)
        [0x89d0] (/usr/lib64/libpopt.so.0.0.2)
  1650.404 ( 0.000 ms): rpmkeys/3184890 majfault [0x64c80] => /usr/lib64/librpm_sequoia.so.1@0x64c80 (d.)
        [0x64c80] (/usr/lib64/librpm_sequoia.so.1)
  2415.842 ( 0.000 ms): dnf/3183950 majfault [0x812bf] => /usr/lib64/libsqlite3.so.0.8.6@0x812bf (d.)
        [0x812bf] (/usr/lib64/libsqlite3.so.0.8.6)
  3722.995 ( 0.000 ms): alternatives/3184948 majfault [padzero+0x38] => 0x563ba5e34004 (?k)
        padzero ([kernel.kallsyms])
        load_elf_binary ([kernel.kallsyms])
        bprm_execve ([kernel.kallsyms])
        do_execveat_common.isra.0 ([kernel.kallsyms])
        __x64_sys_execve ([kernel.kallsyms])
        do_syscall_64 ([kernel.kallsyms])
        entry_SYSCALL_64_after_hwframe ([kernel.kallsyms])
# perf list block:*

List of pre-defined events (to be used in -e or -M):

  block:block_bio_backmerge                          [Tracepoint event]
  block:block_bio_bounce                             [Tracepoint event]
  block:block_bio_complete                           [Tracepoint event]
  block:block_bio_frontmerge                         [Tracepoint event]
  block:block_bio_queue                              [Tracepoint event]
  block:block_bio_remap                              [Tracepoint event]
  block:block_dirty_buffer                           [Tracepoint event]
  block:block_getrq                                  [Tracepoint event]
  block:block_plug                                   [Tracepoint event]
  block:block_rq_complete                            [Tracepoint event]
  block:block_rq_error                               [Tracepoint event]
  block:block_rq_insert                              [Tracepoint event]
  block:block_rq_issue                               [Tracepoint event]
  block:block_rq_merge                               [Tracepoint event]
  block:block_rq_remap                               [Tracepoint event]
  block:block_rq_requeue                             [Tracepoint event]
  block:block_split                                  [Tracepoint event]
  block:block_touch_buffer                           [Tracepoint event]
  block:block_unplug                                 [Tracepoint event]
#
Parameters:

	struct request_queue *q
		request queue to plug

Description:

	Plug the request queue q. Do not allow block operation requests to be
	sent to the device driver. Instead, accumulate requests in the queue
	to improve throughput performance of the block device.
# perf trace -e block:* --max-events=20 --call-graph=fp
     0.000 dd/521824 block:block_dirty_buffer(dev: 8388608, sector: 428361, size: 4096)
        mark_buffer_dirty ([kernel.kallsyms])
        mark_buffer_dirty ([kernel.kallsyms])
        __block_commit_write.isra.0 ([kernel.kallsyms])
        block_write_end ([kernel.kallsyms])
        blkdev_write_end ([kernel.kallsyms])
        generic_perform_write ([kernel.kallsyms])
        __generic_file_write_iter ([kernel.kallsyms])
        blkdev_write_iter ([kernel.kallsyms])
        vfs_write ([kernel.kallsyms])
        ksys_write ([kernel.kallsyms])
        do_syscall_64 ([kernel.kallsyms])
        entry_SYSCALL_64 ([kernel.kallsyms])
        __GI___libc_write (/usr/lib64/libc.so.6)
        write_output (/usr/bin/dd)
        main (/usr/bin/dd)
        __libc_start_call_main (/usr/lib64/libc.so.6)
        __libc_start_main@@GLIBC_2.34 (/usr/lib64/libc.so.6)
        _start (/usr/bin/dd)
    12.896 usb-storage/467 block:block_rq_complete(dev: 8388608, sector: 3397560, nr_sector: 8, rwbs: "R", cmd: "")
        blk_update_request ([kernel.kallsyms])
        blk_update_request ([kernel.kallsyms])
        scsi_end_request ([kernel.kallsyms])
        scsi_io_completion ([kernel.kallsyms])
        usb_stor_control_thread ([usb_storage])
        kthread ([kernel.kallsyms])
        ret_from_fork ([kernel.kallsyms])
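Tracepoint events accept filters on their fields, so the stream can be narrowed, for instance, to completions of larger requests; a sketch:

# perf trace -e block:block_rq_complete --filter 'nr_sector > 8' --max-events=10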
# perf script -l
List of available trace scripts:
  rwtop [interval]                     system-wide r/w top
  wakeup-latency                       system-wide min/max/avg wakeup latency
  rw-by-pid                            system-wide r/w activity
  rw-by-file                           r/w activity for a program, by file
  failed-syscalls [comm]               system-wide failed syscalls
  futex-contention                     futex contention measurement
  netdev-times [tx] [rx] [dev=] [debug]  display a process of packet and processing time
  export-to-sqlite [database name] [columns] [calls]  export perf data to a sqlite3 database
  gecko                                create firefox gecko profile json format from perf.data
  mem-phys-addr                        resolve physical address samples
  sched-migration                      sched migration overview
  sctop [comm] [interval]              syscall top
  failed-syscalls-by-pid [comm]        system-wide failed syscalls, by pid
  intel-pt-events                      print Intel PT Events including Power Events and PTWRITE
  syscall-counts [comm]                system-wide syscall counts
  compaction-times [-h] [-u] [-p|-pv] [-t | [-m] [-fs] [-ms]] [pid|pid-range|comm-regex]  display time taken by mm compaction
  task-analyzer                        analyze timings of tasks
  stackcollapse                        produce callgraphs in short form for scripting use
  syscall-counts-by-pid [comm]         system-wide syscall counts, by pid
  powerpc-hcalls
  export-to-postgresql [database name] [columns] [calls]  export perf data to a postgresql database
  net_dropmonitor                      display a table of dropped frames
  event_analyzing_sample               analyze all perf samples
  flamegraph                           create flame graphs
#
# perf script rwtop

read counts by pid:

   pid    comm              # reads  bytes_req  bytes_read
 ------   ---------------   -------  ---------  ----------
 724574   dd                   3936    2015232     2015232
  13277   pool-gsd-smartc        14      23600       23600
 670459   sshd                   10     327680        2764
    937   systemd-oomd            4       2064        2064
  43738   MemoryPoller            3       3072        1559
 437440   DNS Res~er #136         2       8192         797
 437735   DNS Res~er #139         2       8192         797
  12995   pcscd                  14        112         112
  43488   firefox                67         67          67
   1128   thermald                3      24573          30
 439339   Isolated Web Co        20         20          20
 439899   Isolated Web Co        10         10          10
    938   systemd-resolve         4      16416           8
 524177   Isolated Web Co         7          7           7
  43524   Socket Thread           5      10240           5
 478958   Isolated Web Co         4          4           4
 439563   Isolated Servic         4          4           4
  43749   Isolated Web Co         3          3           3
  43521   IPC I/O Parent          3          3           3
write counts by pid:

   pid    comm              # writes  bytes_written
 ------   ---------------   --------  -------------
 724574   dd                    3936        2015232
 670459   sshd                    10           3208
  43488   firefox                 52             52
 439339   Isolated Web Co         10             10
  43521   IPC I/O Parent           5              5
 439344   Socket Thread            5              5
 439899   Isolated Web Co          5              5
 524177   Isolated Web Co          4              4
  43525   IPDL Background          4              4
 439348   Timer                    4              4
  43734   URL Classifier           4              4
 439912   Timer                    3              3
  43524   Socket Thread            3              3
  43522   Timer                    3              3
 439907   IPC I/O Child            2              2
 478967   Timer                    2              2
 724466   Backgro~l #1146          2              2
 478958   Isolated Web Co          2              2
  43749   Isolated Web Co          2              2
#
# perf probe -x /lib64/libc.so.6 -V malloc
Available variables at malloc
	@<__libc_malloc+0>
		char*	__PRETTY_FUNCTION__
		size_t	bytes
#
# perf probe -x /lib64/libc.so.6 __libc_malloc bytes
Added new event:
  probe_libc:__libc_malloc (on __libc_malloc in /usr/lib64/libc.so.6 with bytes)

You can now use it in all perf tools, such as:

	perf record -e probe_libc:__libc_malloc -aR sleep 1

#
# perf record -a -e probe_libc:__libc_malloc sleep 5s
[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 2.668 MB perf.data (8587 samples) ]
#
# perf script | head
 perf 691772 [007] 361976.720228: probe_libc:__libc_malloc: (7ff7e3cbeb90) bytes=0x40
 perf 691772 [007] 361976.720265: probe_libc:__libc_malloc: (7ff7e3cbeb90) bytes=0xa8
sleep 691773 [002] 361976.721292: probe_libc:__libc_malloc: (7fd79aea0b90) bytes=0x5
sleep 691773 [002] 361976.721319: probe_libc:__libc_malloc: (7fd79aea0b90) bytes=0x1d8
sleep 691773 [002] 361976.721329: probe_libc:__libc_malloc: (7fd79aea0b90) bytes=0x1000
sleep 691773 [002] 361976.721336: probe_libc:__libc_malloc: (7fd79aea0b90) bytes=0x640
sleep 691773 [002] 361976.721339: probe_libc:__libc_malloc: (7fd79aea0b90) bytes=0x400
sleep 691773 [002] 361976.721352: probe_libc:__libc_malloc: (7fd79aea0b90) bytes=0x5
sleep 691773 [002] 361976.721354: probe_libc:__libc_malloc: (7fd79aea0b90) bytes=0x33
sleep 691773 [002] 361976.721357: probe_libc:__libc_malloc: (7fd79aea0b90) bytes=0x33
#
# perf script | wc -l
8587
#
# perf evlist
probe_libc:__libc_malloc
dummy:HG
# Tip: use 'perf evlist --trace-fields' to show fields for tracepoint events
#
# perf evlist --trace-fields
probe_libc:__libc_malloc: trace_fields: __probe_ip,bytes
dummy:HG: (not a tracepoint)
#
# perf script -g python
generated Python script: perf-script.py
#
# ls -la perf-script.py
-rw-r--r--. 1 root root 2065 Sep 17 11:26 perf-script.py
#
# mv perf-script.py mallocs.py
# perf script -s mallocs.py | tail
probe_libc____libc_malloc     2 361981.720766854   405742 gnome-control-c
__probe_ip=139765834955664, bytes=48
Sample: {cpu=2, cpumode=2, period=1, tid=405742}
probe_libc____libc_malloc     4 361981.722412299   463773 weechat
__probe_ip=140083330030480, bytes=15
Sample: {cpu=4, cpumode=2, period=1, tid=463773}
in trace_end
#
def print_header(event_name, cpu, secs, nsecs, pid, comm):
    print("%-20s %5u %05u.%09u %8u %-20s " % \
          (event_name, cpu, secs, nsecs, pid, comm), end="")
def trace_begin():
    print("in trace_begin")

def trace_end():
    print("in trace_end")
def probe_libc____libc_malloc(event_name, context, common_cpu,
                              common_secs, common_nsecs, common_pid,
                              common_comm, common_callchain,
                              __probe_ip, bytes, perf_sample_dict):
    print_header(event_name, common_cpu, common_secs, common_nsecs,
                 common_pid, common_comm)
    print("__probe_ip=%u, bytes=%u" % (__probe_ip, bytes))
    print('Sample: {' + get_dict_as_string(perf_sample_dict['sample'], ', ') + '}')
sizes = {}
def probe_libc____libc_malloc(event_name, context, common_cpu,
                              common_secs, common_nsecs, common_pid,
                              common_comm, common_callchain,
                              __probe_ip, bytes, perf_sample_dict):
    if bytes in sizes:
        sizes[bytes] += 1
    else:
        sizes[bytes] = 1
def trace_end():
    for size in sorted(sizes.keys()):
        nr_entries = sizes[size]
        if (nr_entries >= 64):
            print("%6d: %4d" % (size, nr_entries))
# perf script -s mallocs.py
     5:  342
    16: 1247
    24:  328
    27:   82
    32: 1248
    35:   80
    38:  104
    40:  320
    41:  119
    44:   70
    48:  340
    56:  349
    64:  958
    96:  312
   112:  298
   128:  594
   472:  419
   800:   88
  4096:  309
  8192:   98
#
min_nr_entries = 64
def trace_end():
    for size in sorted(sizes.keys()):
        nr_entries = sizes[size]
        if (nr_entries >= min_nr_entries):
            print("%6d: %4d" % (size, nr_entries))
import argparse

if __name__ == '__main__':
    parser = argparse.ArgumentParser("Show malloc histogram")
    parser.add_argument('--min_nr_entries', default=64,
                        help='Show sizes with at least min_nr_entries')

    # Parse the command-line arguments
    args = parser.parse_args()
    min_nr_entries = int(args.min_nr_entries)
# perf script -s mallocs.py -- --help
usage: show malloc histogram [-h] [--min_nr_entries MIN_NR_ENTRIES]

options:
  -h, --help            show this help message and exit
  --min_nr_entries MIN_NR_ENTRIES
                        Show sizes with at least min_nr_entries
#
# perf script -s mallocs.py -- --min_nr_entries 512
    16: 1247
    32: 1248
    64:  958
   128:  594
#
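For reference, here is the whole evolved script in one piece, a sketch reassembled from the fragments above (the print_header/trace_begin boilerplate from the generated perf-script.py is left out):

# mallocs.py: histogram of malloc sizes recorded via probe_libc:__libc_malloc
import argparse

sizes = {}

if __name__ == '__main__':
    parser = argparse.ArgumentParser("Show malloc histogram")
    parser.add_argument('--min_nr_entries', default=64,
                        help='Show sizes with at least min_nr_entries')
    args = parser.parse_args()
    min_nr_entries = int(args.min_nr_entries)

def probe_libc____libc_malloc(event_name, context, common_cpu,
                              common_secs, common_nsecs, common_pid,
                              common_comm, common_callchain,
                              __probe_ip, bytes, perf_sample_dict):
    if bytes in sizes:
        sizes[bytes] += 1
    else:
        sizes[bytes] = 1

def trace_end():
    for size in sorted(sizes.keys()):
        nr_entries = sizes[size]
        if nr_entries >= min_nr_entries:
            print("%6d: %4d" % (size, nr_entries))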
$ make -C tools/bpf/runqslower/
make: Entering directory '/home/acme/git/bpf/tools/bpf/runqslower'
  LINK     /home/acme/git/bpf/tools/bpf/runqslower/.output//libbpf/libbpf.a
  LINK     /home/acme/git/bpf/tools/bpf/runqslower/.output/bpftool
  INSTALL  bpftool
  GEN      vmlinux.h
  BPF      runqslower.bpf.o
  GEN-SKEL runqslower.skel.h
  CC       runqslower.o
  BINARY   runqslower
make: Leaving directory '/home/acme/git/bpf/tools/bpf/runqslower'
$ file tools/bpf/runqslower/.output/runqslower
tools/bpf/runqslower/.output/runqslower: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, BuildID[sha1]=e7986148fa9629ef9ca8ea1fd76c3a0bbdf70b06, with debug_info, not stripped
$ ls -la tools/bpf/runqslower/.output/runqslower
-rwxrwxr-x. 1 507016 tools/bpf/runqslower/.output/runqslower
$
$ strip tools/bpf/runqslower/.output/runqslower
$ ls -la tools/bpf/runqslower/.output/runqslower
-rwxrwxr-x. 1 249208 tools/bpf/runqslower/.output/runqslower
$
$ sudo tools/bpf/runqslower/.output/runqslower 500
Tracing run queue latency higher than 500 us
TIME     COMM             PID           LAT(us)
19:00:11 swapper/3        22111            1904
19:00:30 weechat          41               1244
19:00:30 weechat          41               1316
^C
$
$ cat tools/bpf/runqslower/runqslower.h
/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
#ifndef __RUNQSLOWER_H
#define __RUNQSLOWER_H

#define TASK_COMM_LEN 16

struct event {
	char task[TASK_COMM_LEN];
	__u64 delta_us;
	pid_t pid;
};

#endif /* __RUNQSLOWER_H */
$
#include <bpf.h>
#include "runqslower.h"
#include "runqslower.skel.h"

void handle_event(void *ctx, int cpu, void *data, __u32 data_sz)
{
	const struct event *e = data;

	/* ts/tm setup elided on the slide */
	strftime(ts, sizeof(ts), "%H:%M:%S", tm);
	printf("%-8s %-16s %-6d %14llu\n", ts, e->task, e->pid, e->delta_us);
}
int main(int argc, char **argv)
{
	err = argp_parse(&argp, argc, argv, 0, NULL, NULL);

	obj = runqslower_bpf__open();
	obj->rodata->min_us = env.min_us;	// arg, BPF global var

	err = runqslower_bpf__load(obj);
	err = runqslower_bpf__attach(obj);

	printf("%-8s %-16s %-6s %14s\n", "TIME", "COMM", "PID", "LAT(us)");

	pb_opts.sample_cb = handle_event;
	pb_opts.lost_cb = handle_lost_events;
	pb = perf_buffer__new(bpf_map__fd(obj->maps.events), 64, &pb_opts);

	while ((err = perf_buffer__poll(pb, 100)) >= 0)
		;
}
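The pb_opts style above matches the libbpf this tool was written against; with libbpf >= 0.7 the callbacks are passed straight to perf_buffer__new() instead (a sketch, same handle_event as above):

	pb = perf_buffer__new(bpf_map__fd(obj->maps.events), 64 /* pages */,
			      handle_event, handle_lost_events,
			      NULL /* ctx */, NULL /* opts */);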
$ cat tools/bpf/runqslower/runqslower.bpf.c
#include "vmlinux.h"
#include <bpf_helpers.h>
#include "runqslower.h"

const volatile __u64 min_us = 0;
const volatile pid_t targ_pid = 0;

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 10240);
	__type(key, u32);
	__type(value, u64);
} start SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
	__uint(key_size, sizeof(u32));
	__uint(value_size, sizeof(u32));
} events SEC(".maps");
$ readelf -sW tools/bpf/runqslower/.output/runqslower.bpf.o | \
  egrep min_us\|targ_pid
   Num:  Size Type    Bind   Vis     Ndx Name
    18:     8 OBJECT  GLOBAL DEFAULT   5 min_us
    20:     4 OBJECT  GLOBAL DEFAULT   5 targ_pid
$
__always_inline static int trace_enqueue(u32 tgid, u32 pid)
{
	if (!pid || (targ_pid && targ_pid != pid))
		return 0;

	u64 ts = bpf_ktime_get_ns();
	bpf_map_update_elem(&start, &pid, &ts, 0);
	return 0;
}
SEC("tp_btf/sched_wakeup") int handle__sched_wakeup(u64 *ctx) { // TP_PROTO(struct task_struct *p) struct task_struct *p = (void *)ctx[0]; return trace_enqueue(p->tgid, p->pid); }
SEC("tp_btf/sched_switch") int handle__sched_switch(u64 *ctx) { // TP_PROTO(bool preempt, struct task_struct *prev, // struct task_struct *next) struct task_struct *prev = (struct task_struct *)ctx[1]; struct task_struct *next = (struct task_struct *)ctx[2]; struct event event = {}; u32 pid = next->pid;
u64 *tstamp = bpf_map_lookup_elem(&start, &pid); if (!tstamp) return 0; /* missed enqueue */
u64 delta_us = (bpf_ktime_get_ns() - *tstamp) / 1000; if (min_us && delta_us <= min_us) return 0;
event.pid = pid; event.delta_us = delta_us; bpf_get_current_comm(&event.task, sizeof(event.task));
bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, &event, sizeof(event));
bpf_map_delete_elem(&start, &pid); return 0; }
$ ls -la tools/bpf/runqslower/.output/runqslower.bpf.o
-rw-rw-r--. 1 31688 tools/bpf/runqslower/.output/runqslower.bpf.o
$ file tools/bpf/runqslower/.output/runqslower.bpf.o
tools/bpf/runqslower/.output/runqslower.bpf.o: ELF 64-bit LSB relocatable, eBPF, version 1 (SYSV), not stripped
$
$ eu-readelf -SW tools/bpf/runqslower/.output/runqslower.bpf.o | grep \.rel
Section Headers:
[Nr] Name                     Type  Size
[11] .reltp_btf/sched_wakeup  REL     20
[13] .reltp_btf/sched_switch  REL     60
[14] .rel.BTF                 REL     50
[15] .rel.BTF.ext             REL    460
$
$ cd net/ipv4/
$ grep -l 'tcp_register_congestion_control(&' tcp_*.c
tcp_bbr.c
tcp_bic.c
tcp_cdg.c
tcp_cubic.c
tcp_dctcp.c
tcp_highspeed.c
tcp_htcp.c
tcp_hybla.c
tcp_illinois.c
tcp_lp.c
tcp_nv.c
tcp_scalable.c
tcp_vegas.c
tcp_veno.c
tcp_westwood.c
tcp_yeah.c
$
static unsigned int dctcp_shift_g = 4;	/* g = 1/2^4 */
static unsigned int dctcp_alpha_on_init = DCTCP_MAX_ALPHA;
// tools/testing/selftests/bpf/bpf_tcp_helpers.h
struct tcp_sock {
	struct inet_connection_sock inet_conn;
	__u32	rcv_nxt, snd_nxt, snd_una;
	__u8	ecn_flags;
	__u32	delivered;
	__u32	delivered_ce;
	__u32	snd_cwnd, snd_cwnd_cnt, snd_cwnd_clamp;
	__u32	snd_ssthresh;
	__u8	syn_data:1,		/* SYN includes data */
		syn_fastopen:1,		/* SYN includes Fast Open option */
		syn_fastopen_exp:1,	/* SYN includes Fast Open exp. option */
		syn_fastopen_ch:1,	/* Active TFO re-enabling probe */
		syn_data_acked:1,	/* data in SYN is acked by SYN-ACK */
		save_syn:1,		/* Save headers of SYN packet */
		is_cwnd_limited:1,	/* forward progress limited by snd_cwnd? */
		syn_smc:1;		/* SYN includes SMC */
	__u32	max_packets_out;
	__u32	lsndtime;
	__u32	prior_cwnd;
} __attribute__((preserve_access_index));
$ pahole tcp_sock | grep size: -A4
	/* size: 2192, cachelines: 35, members: 135 */
	/* sum members: 2181, holes: 3, sum holes: 8 */
	/* sum bitfield members: 24 bits (3 bytes) */
	/* paddings: 3, sum paddings: 10 */
	/* last cacheline: 16 bytes */
$
SEC("struct_ops/dctcp_init") void BPF_PROG(dctcp_init, struct sock *sk) { const struct tcp_sock *tp = tcp_sk(sk); struct dctcp *ca = inet_csk_ca(sk); ca->prior_rcv_nxt = tp->rcv_nxt; ca->dctcp_alpha = min(dctcp_alpha_on_init, DCTCP_MAX_ALPHA); ca->loss_cwnd = 0; ca->ce_state = 0; dctcp_reset(tp, ca); }
SEC("struct_ops/dctcp_ssthresh") __u32 BPF_PROG(dctcp_ssthresh, struct sock *sk) { struct dctcp *ca = inet_csk_ca(sk); struct tcp_sock *tp = tcp_sk(sk); ca->loss_cwnd = tp->snd_cwnd; return max(tp->snd_cwnd - ((tp->snd_cwnd * ca->dctcp_alpha) > 11U), 2U); }
Date: Mon Nov 15 11:02:37 2021 -0800
From: Eric Dumazet

    Move sk_bind_phc next to sk_peer_lock to fill a hole.

@@ -489,5 +489,6 @@ struct sock {
 	u16			sk_busy_poll_budget;
 #endif
 	spinlock_t		sk_peer_lock;
+	int			sk_bind_phc;
 	struct pid		*sk_peer_pid;
 	const struct cred	*sk_peer_cred;
@@ -498,5 +499,4 @@ struct sock {
 	seqlock_t		sk_stamp_seq;
 #endif
 	u16			sk_tsflags;
-	int			sk_bind_phc;
 	u8			sk_shutdown;
$ git log -2 --oneline 1ace2b4d2b4e1db8f
1ace2b4d2b4e1db8 net: shrink struct sock by 8 bytes
1b31debca8328448 ipv6: shrink struct ipcm6_cookie
$
$ pahole -C list_head ~/git/build/v5.18-rc6+/vmlinux
struct list_head {
	struct list_head *         next;                  /*     0     8 */
	struct list_head *         prev;                  /*     8     8 */

	/* size: 16, cachelines: 1, members: 2 */
	/* last cacheline: 16 bytes */
};
$
# pahole -C _IO_FILE ~/bin/perf | head
struct _IO_FILE {
	int                        _flags;                /*     0     4 */

	/* XXX 4 bytes hole, try to pack */

	char *                     _IO_read_ptr;          /*     8     8 */
	char *                     _IO_read_end;          /*    16     8 */
	char *                     _IO_read_base;         /*    24     8 */
	char *                     _IO_write_base;        /*    32     8 */
	char *                     _IO_write_ptr;         /*    40     8 */
#
$ pahole --reorganize task_struct | tail
	/* --- cacheline 142 boundary (9088 bytes) was 56 bytes ago --- */
	struct thread_struct       thread __attribute__((__aligned__(64))); /*  9144  4416 */

	/* size: 13560, cachelines: 212, members: 252 */
	/* sum members: 13487, holes: 2, sum holes: 57 */
	/* sum bitfield members: 79 bits, bit holes: 2, sum bit holes: 49 bits */
	/* paddings: 6, sum paddings: 49 */
	/* forced alignments: 2, forced holes: 1, sum forced holes: 56 */
	/* last cacheline: 56 bytes */
};	/* saved 136 bytes and 2 cachelines! */
$
# pahole --contains sock
inet_sock
unix_sock
packet_sock
alg_sock
bpf_ctx_convert
netlink_sock
mctp_sock
xdp_sock
#
# pahole --contains inet_sock
inet_connection_sock
raw_sock
udp_sock
sctp_sock
raw6_sock
#
# pahole --contains inet_connection_sock
tcp_sock
mptcp_sock
#
# pahole tcp_sock | head
struct tcp_sock {
	struct inet_connection_sock inet_conn;            /*     0  1400 */
	/* --- cacheline 21 boundary (1344 bytes) was 56 bytes ago --- */
	u16                        tcp_header_len;        /*  1400     2 */
	u16                        gso_segs;              /*  1402     2 */
	__be32                     pred_flags;            /*  1404     4 */
	/* --- cacheline 22 boundary (1408 bytes) --- */
	u64                        bytes_received;        /*  1408     8 */
	u32                        segs_in;               /*  1416     4 */
	u32                        data_segs_in;          /*  1420     4 */
#
# pahole --expand_types tcp_sock | head
struct tcp_sock {
	struct inet_connection_sock {
		struct inet_sock {
			struct sock {
				struct sock_common {
					union {
						/* typedef __addrpair -> __u64 */ long long unsigned int skc_addrpair; /*     0     8 */
						struct {
							/* typedef __be32 -> __u32 */ unsigned int skc_daddr;     /*     0     4 */
							/* typedef __be32 -> __u32 */ unsigned int skc_rcv_saddr; /*     4     4 */
						};                                                                  /*     0     8 */
					};                                                                          /*     0     8 */
#
commit 91b6d325635617540b6a1646ddb138bb17cbd569
Author: Eric Dumazet
Date:   Mon Nov 15 11:02:39 2021 -0800

    net: cache align tcp_memory_allocated, tcp_sockets_allocated

    tcp_memory_allocated and tcp_sockets_allocated often share a common
    cache line, source of false sharing.
+++ b/net/ipv4/tcp.c
-atomic_long_t tcp_memory_allocated;	// Current allocated memory
+atomic_long_t tcp_memory_allocated ____cacheline_aligned_in_smp;	// Current allocated memory

-struct percpu_counter tcp_sockets_allocated;
+struct percpu_counter tcp_sockets_allocated ____cacheline_aligned_in_smp;
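The same idea in miniature, as a hypothetical user-space analogue (not from the patch): give each hot counter its own 64-byte cache line so a writer of one does not keep invalidating the other's line on remote CPUs.

/* Toy analogue of ____cacheline_aligned_in_smp. */
#include <stdalign.h>
#include <stdatomic.h>

struct hot_counters {
	alignas(64) atomic_long allocated;	/* written by some CPUs */
	alignas(64) atomic_long sockets;	/* written by others */
};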
+++ b/drivers/md/bcache/writeback.c
 	sectors_dirty = atomic_add_return(s, d->stripe_sectors_dirty + stripe);

-	if (sectors_dirty == d->stripe_size)
-		set_bit(stripe, d->full_dirty_stripes);
-	else
-		clear_bit(stripe, d->full_dirty_stripes);
+	if (sectors_dirty == d->stripe_size) {
+		if (!test_bit(stripe, d->full_dirty_stripes))
+			set_bit(stripe, d->full_dirty_stripes);
+	} else {
+		if (test_bit(stripe, d->full_dirty_stripes))
+			clear_bit(stripe, d->full_dirty_stripes);
+	}
+++ b/include/linux/page_counter.h
@@ -12,7 +12,6 @@ struct page_counter {
 	unsigned long low;
 	unsigned long high;
 	unsigned long max;
-	struct page_counter *parent;

 	/* effective memory.min and memory.min usage tracking */
 	unsigned long emin;
@@ -26,6 +25,12 @@ struct page_counter {
 	unsigned long watermark;
 	unsigned long failcnt;

+	/*
+	 * 'parent' is placed here to be far from 'usage' to reduce cache
+	 * false sharing, as 'usage' is written mostly while parent is
+	 * frequently read for cgroup's hierarchical counting nature.
+	 */
+	struct page_counter *parent;
 };
$ pahole page_counter
struct page_counter {
	atomic_long_t              usage;                 /*     0     8 */
	long unsigned int          min;                   /*     8     8 */
	long unsigned int          low;                   /*    16     8 */
	long unsigned int          high;                  /*    24     8 */
	long unsigned int          max;                   /*    32     8 */
	long unsigned int          emin;                  /*    40     8 */
	atomic_long_t              min_usage;             /*    48     8 */
	atomic_long_t              children_min_usage;    /*    56     8 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	long unsigned int          elow;                  /*    64     8 */
	atomic_long_t              low_usage;             /*    72     8 */
	atomic_long_t              children_low_usage;    /*    80     8 */
	long unsigned int          watermark;             /*    88     8 */
	long unsigned int          failcnt;               /*    96     8 */
	struct page_counter *      parent;                /*   104     8 */

	/* size: 112, cachelines: 2, members: 14 */
	/* last cacheline: 48 bytes */
};
$
$ cd /sys/kernel/btf
$ ls -lh vmlinux
-r--r--r--. 1 root root 5.1M Sep  8 20:38 vmlinux
$
$ ls -lh i915
-r--r--r--. 1 root root 556K Sep 12 09:29 i915
$
$ ls -1 | wc -l
204
$ lsmod | wc -l
204
$ lsmod | head -2
Module                  Size  Used by
sctp                  434176  28
$
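That BTF is consumable from user space as well, which is what pahole and libbpf do. A minimal sketch, assuming a reasonably recent libbpf (the program is hypothetical, the two entry points are real):

/* btf-count.c: load the running kernel's BTF from /sys/kernel/btf/vmlinux
 * and report how many types it describes. Build with -lbpf. */
#include <stdio.h>
#include <bpf/btf.h>

int main(void)
{
	struct btf *btf = btf__load_vmlinux_btf();

	if (!btf) {
		fprintf(stderr, "failed to load vmlinux BTF\n");
		return 1;
	}

	printf("vmlinux BTF: %u types\n", btf__type_cnt(btf));
	btf__free(btf);
	return 0;
}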
$ pahole page_pool
struct page_pool {
	struct page_pool_params    p;                     /*     0    56 */
	struct delayed_work        release_dw;            /*    56    88 */

	/* XXX last struct has 4 bytes of padding */
	/* --- cacheline 2 boundary (128 bytes) was 16 bytes ago --- */

	void (*disconnect)(void *);                       /*   144     8 */
	long unsigned int          defer_start;           /*   152     8 */
	long unsigned int          defer_warn;            /*   160     8 */
	u32                        pages_state_hold_cnt;  /*   168     4 */
	unsigned int               frag_offset;           /*   172     4 */
	struct page *              frag_page;             /*   176     8 */
	long int                   frag_users;            /*   184     8 */
	/* --- cacheline 3 boundary (192 bytes) --- */
	u32                        xdp_mem_id;            /*   192     4 */

	/* XXX 60 bytes hole, try to pack */

	/* --- cacheline 4 boundary (256 bytes) --- */
	struct pp_alloc_cache      alloc __attribute__((__aligned__(64))); /*   256  1032 */

	/* XXX 56 bytes hole, try to pack */

	/* --- cacheline 21 boundary (1344 bytes) --- */
	struct ptr_ring            ring __attribute__((__aligned__(64))); /*  1344   192 */

	/* XXX last struct has 48 bytes of padding */
	/* --- cacheline 24 boundary (1536 bytes) --- */

	atomic_t                   pages_state_release_cnt; /*  1536     4 */
	refcount_t                 user_cnt;              /*  1540     4 */
	u64                        destroy_cnt;           /*  1544     8 */

	/* size: 1600, cachelines: 25, members: 15 */
	/* sum members: 1436, holes: 2, sum holes: 116 */
	/* padding: 48, paddings: 2, sum paddings: 52 */
	/* forced alignments: 2, forced holes: 2, sum forced holes: 116 */
} __attribute__((__aligned__(64)));
$
18.8.1.2 Load Latency Performance Monitoring Facility

The load latency facility provides software a means to characterize the average load latency to the different levels of the cache/memory hierarchy. It requires a processor supporting the enhanced PEBS record format in the PEBS buffer, see Table 18-23. The facility measures the load latency from the load's first dispatch till the final data writeback from the memory subsystem. The latency is reported for retired demand load operations, in core cycles (it accounts for re-dispatches).
# perf mem record -a sleep 1
#
# perf mem -t load report --sort=mem --stdio
# Total Lost Samples: 0
#
# Samples: 51K of event 'cpu/mem-loads,ldlat=30/P'
# Total weight : 4819902
# Sort order   : mem
#
# Overhead       Samples  Memory access
# ........  ............  ........................
    44.87%         20217  LFB or LFB hit
    27.30%         18618  L3 or L3 hit
    22.53%         11712  L1 or L1 hit
     4.85%           637  Local RAM or RAM hit
     0.25%             1  Uncached or N/A hit
     0.20%           188  L2 or L2 hit
     0.00%            35  L3 miss
# perf mem record sleep 1
#
# perf mem -t load report --sort=mem --stdio
# Total Lost Samples: 0
#
# Samples: 16 of event 'cpu/mem-loads,ldlat=30/P'
# Total weight : 1556
# Sort order   : mem
#
# Overhead       Samples  Memory access
# ........  ............  ........................
    64.52%             8  LFB or LFB hit
    14.07%             4  L1 or L1 hit
    11.05%             3  L3 or L3 hit
    10.35%             1  Local RAM or RAM hit
$ make -j8 O=../build/allmodconfig/
make[1]: Entering directory '/home/acme/git/build/allmodconfig'
<SNIP>
# perf mem record sleep 1m
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.037 MB perf.data (20 samples) ]
#
# perf mem report --stdio
# Total Lost Samples: 0
#
# Samples: 11 of event 'cpu/mem-loads,ldlat=30/P'
# Total weight : 2155
# Sort order   : local_weight,mem,sym,dso,symbol_daddr,dso_daddr
#
#                    Local Mem
# Overhead  Weig     Access    Symbol               Sh Object    Data Symbol         Data Obj
# ........  ....     ........  ...................  .........    ................    ...........
    23.94%   516     LocalRAM  copy_page            [kernel]     0xffff8d42228ea900  [unknown]
    15.31%   330     LFB       flush_signal_handle  [kernel]     0xffff8d3f976020a0  [unknown]
    14.66%   316     LFB       strlen               [kernel]     0xffffffff9b5f4cd3  [kernel].ro
    13.36%   288     LFB       _dl_relocate_object  ld-linux.so  0x00007f6ccdc23068  libc.so.6
    11.46%   247     LFB       next_uptodate_page   [kernel]     0xffffe401957e4df4  [unknown]
     7.33%   158     LFB       copy_page            [kernel]     0xffff8d41f2dae920  [unknown]
     4.04%    87     LFB       unlock_page_memcg    [kernel]     0xffffe4019333d8b8  [unknown]
     3.06%    66     L1        check_preemption_di  [kernel]     0xffffa8e8622ffc80  [unknown]
     2.69%    58     LFB       perf_output_begin    [kernel]     0xffff8d3f52a1b01c  [unknown]
     2.13%    46     L3        task_work_run        [kernel]     0xffff8d3f4a9c802c  [unknown]
     2.00%    43     L1        kmem_cache_alloc_tr  [kernel]     0xffffa8e8622ffbc8  [unknown]
# perf c2c record -a sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 7.787 MB perf.data (2450 samples) ]
# perf evlist
cpu/mem-loads,ldlat=30/P
cpu/mem-stores/P
dummy:HG
#
# perf script --cpu 4 --pid 0 | head
swapper 0 [4] 319242.043904: 58 cpu/mem-loads,ldlat=30/P: ffff8d3e49c0e688 11868100242 |OP LOAD|LVL LFB or LFB hit|SNP None|TLB L1 or L2 hit|LCK No|BLK N/A 290 0 ffffffff9a13eb2c __update_load_avg_cfs_rq+0x9c (vmlinux) 9c0e688
swapper 0 [4] 319242.142295: 39 cpu/mem-loads,ldlat=30/P: ffff8d44865f2408 10268100142 |OP LOAD|LVL L1 or L1 hit|SNP None|TLB L1 or L2 hit|LCK No|BLK N/A 335 0 ffffffff9a13eecd update_rt_rq_load_avg+0x17d (vmlinux) 6465f2408
swapper 0 [4] 319242.143587: 99614 cpu/mem-stores/P: ffff8d4486500028 5080184 |OP STORE|LVL L1 miss|SNP N/A|TLB N/A|LCK N/A|BLK N/A 0 0 ffffffff9a001c2f __switch_to_asm+0x1f (vmlinux) 646500028
swapper 0 [4] 319242.174494: 33 cpu/mem-loads,ldlat=30/P: ffff8d3f595ddc38 11a68201042 |OP LOAD|LVL Local RAM or RAM hit|SNP Hit|TLB L1 or L2 hit|LCK No|BLK N/A 176 0 ffffffff9a13e78d __update_load_avg_se+0x1d (vmlinux) 1195ddc38
swapper 0 [4] 319242.178002: 27 cpu/mem-loads,ldlat=30/P: ffff8d44865312c0 10668100842 |OP LOAD|LVL L3 or L3 hit|SNP None|TLB L1 or L2 hit|LCK No|BLK N/A 56 0 ffffffff9a07d74f switch_mm_irqs_off+0x16f (vmlinux) 6465312c0
swapper 0 [4] 319242.212148: 23 cpu/mem-loads,ldlat=30/P: ffff8d44865322e8 10668100842 |OP LOAD|LVL L3 or L3 hit|SNP None|TLB L1 or L2 hit|LCK No|BLK N/A 55 0 ffffffff9a140c22 irqtime_account_process_tick+0xa2 (vmlinux) 6465322e8
swapper 0 [4] 319242.217357: 18 cpu/mem-loads,ldlat=30/P: ffff8d4486532490 10268100142 |OP LOAD|LVL L1 or L1 hit|SNP None|TLB L1 or L2 hit|LCK No|BLK N/A 125 0 ffffffff9a140076 update_irq_load_avg+0xf6 (vmlinux) 646532490
swapper 0 [4] 319242.220573: 15 cpu/mem-loads,ldlat=30/P: ffff8d3f4f35f218 11868100242 |OP LOAD|LVL LFB or LFB hit|SNP None|TLB L1 or L2 hit|LCK No|BLK N/A 383 0 ffffffff9a73b407 rb_erase+0x7 (vmlinux) 10f35f218
swapper 0 [4] 319242.240176: 15 cpu/mem-loads,ldlat=30/P: ffff8d3f6b617be0 10650100842 |OP LOAD|LVL L3 or L3 hit|SNP None|TLB L2 miss|LCK No|BLK N/A 184 0 ffffffff9a129fbb update_blocked_averages+0x1fb (vmlinux) 12b617be0
swapper 0 [4] 319242.243441: 8849 cpu/mem-stores/P: ffff8d3f40c2b1a4 5080144 |OP STORE|LVL L1 hit|SNP N/A|TLB N/A|LCK N/A|BLK N/A 0 0 ffffffff9ad68aed rcu_eqs_exit.constprop.0+0x3d (vmlinux) 100c2b1a4
#
# perf evlist -v | head -1 cpu/mem-loads,ldlat=30/P: type: 4, size: 128, config: 0x1cd, \ { sample_period, sample_freq }: 4000, \ sample_type: IP|TID|TIME|ADDR|ID|CPU|PERIOD|DATA_SRC|PHYS_ADDR| \ WEIGHT_STRUCT, \ read_format: ID, disabled: 1, inherit: 1, freq: 1, precise_ip: 3, \ sample_id_all: 1, { bp_addr, config1 }: 0x1f #
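The same event can be opened by hand with perf_event_open(), plugging in the encoding perf just printed. A sketch (the attr values are model-specific, taken verbatim from the dump above; PERF_SAMPLE_WEIGHT_STRUCT needs recent kernel headers; error handling elided):

/* Open the PEBS load-latency event: type 4 is PERF_TYPE_RAW, config
 * 0x1cd the load-latency event, config1 the ldlat threshold. */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <string.h>
#include <unistd.h>

int open_mem_loads(pid_t pid, int cpu)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_RAW;
	attr.config = 0x1cd;
	attr.config1 = 0x1f;			/* ldlat threshold */
	attr.sample_freq = 4000;
	attr.freq = 1;
	attr.precise_ip = 3;			/* PEBS */
	attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_TIME |
			   PERF_SAMPLE_ADDR | PERF_SAMPLE_ID | PERF_SAMPLE_CPU |
			   PERF_SAMPLE_PERIOD | PERF_SAMPLE_DATA_SRC |
			   PERF_SAMPLE_PHYS_ADDR | PERF_SAMPLE_WEIGHT_STRUCT;

	return syscall(__NR_perf_event_open, &attr, pid, cpu, -1, 0);
}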
# perf report --header-only
# captured on    : Fri Sep 9 11:10:04 2022
# hostname       : quaco
# os release     : 5.18.17-200.fc36.x86_64
# perf version   : 6.0.rc3.gfaf59ec8c3c3
# arch           : x86_64
# nrcpus online  : 8
# nrcpus avail   : 8
# cpudesc        : Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
# total memory   : 24487388 kB
# cmdline        : /home/acme/bin/perf c2c record -a sleep 1m
# event          : name = cpu/mem-loads,ldlat=30/P, freq = 4000, sample_type = IP|TID|TIME|ADDR|ID|CPU|PERIOD|DATA_SRC|PHYS_ADDR|WEIGHT_STRUCT
# event          : name = cpu/mem-stores/P, freq = 4000, sample_type = IP|TID|TIME|ADDR|ID|CPU|PERIOD|DATA_SRC|PHYS_ADDR|WEIGHT_STRUCT
# perf c2c report --stats
  Total records                     :    3223429
  Locked Load/Store Operations      :     112673
  Load Operations                   :    1387118
  Loads - uncacheable               :          1
  Loads - IO                        :          4
  Loads - Miss                      :        142
  Loads - no mapping                :       2350
  Load Fill Buffer Hit              :     455747
  Load L1D hit                      :     264355
  Load L2D hit                      :      29304
  Load LLC hit                      :     534642
  Load Local HITM                   :        629
  Load Remote HITM                  :          0
  Load Remote HIT                   :          0
# perf c2c report --stdio
=================================================
           Shared Data Cache Line Table
=================================================
#
#                              ------ Load Hitm ------    Tot   Total   Total   --- Stores ---
# Idx            Cacheline     Hitm      Tot    LclHitm   rec   Loads   Stores   L1Hit  L1Miss
# ...   ................      .....     ....    .......  ....   .....   ......   .....  ......
    0   ffff8d449e7d6380      8.43%       53         53   510     499       11      11       0
    1   ffff8d4058209340      6.20%       39         39   371     135      236     223      13
    2   ffff8d449e7ff400      5.88%       37         37   501     479       22      22       0
    3   ffffffff9bf53980      4.93%       31         31   233     208       25      24       1
    4   ffff8d3f49ebd280      3.18%       20         20   162     153        9       9       0
    5   ffff8d3f420d4880      2.86%       18         18   126     121        5       5       0
Cacheline 0xffff8d449e7ff400
#  -HITM-                            --- cycles ---   Tot  cpu
#  LclHitm   Off  Code address         hitm    load   rec  cnt  Symbol                 Object    Source:Line
   97.30%    0x0  0xffffffff9a2d293b    113      44   454    8  __mod_node_page_state  [kernel]  vmstat.c:379
    0.00%    0x8  0xffffffff9a2d29bb      0     112    40    8  __mod_node_page_state  [kernel]  atomic64_64.h:46
    2.70%   0x18  0xffffffff9a2d2be5    959     103     2    2  refresh_cpu_vm_stats   [kernel]  atomic64_64.h:46
$ perf probe -L vmstat.c:379 | head
    379  	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
    380  	s8 __percpu *p = pcp->vm_node_stat_diff + item;
    381  	long x;
         	long t;

         	if (vmstat_item_in_bytes(item)) {
         		/*
         		 * Only cgroups use subpage accounting right now; at
         		 * the global level, these items still change in
$
$ pfunct __mod_node_page_state
void __mod_node_page_state(struct pglist_data * pgdat, enum node_stat_item item, long int delta);
$
$ pahole pglist_data | grep -B2 -A6 per_cpu_nodestats
	/* --- cacheline 2704 boundary (173056 bytes) --- */
	struct zone_padding        _pad2_;                /* 173056     0 */
	struct per_cpu_nodestat *  per_cpu_nodestats __attribute__((__aligned__(64))); /* 173056     8 */
	atomic_long_t              vm_stat[41];           /* 173064   328 */

	/* size: 173440, cachelines: 2710, members: 32 */
	/* sum members: 173309, holes: 6, sum holes: 83 */
	/* padding: 48 */
	/* forced alignments: 2 */
$
# perf annotate --stdio2 refresh_cpu_vm_stats
<SNIP>
refresh_cpu_vm_stats() /usr/lib/debug/lib/modules/5.18.17-200.fc36.x86_64/vmlinux
ffffffff812d29e0:
	static int refresh_cpu_vm_stats(bool do_pagesets)
	struct pglist_data *pgdat;
	struct zone *zone;
<SNIP>
	for_each_online_pgdat(pgdat) {
<SNIP>
 1f3:	cmpxchg  %r8b,%gs:(%rdx)
      ↑ jne      1f3
	movsbl   %al,%edi
	if (v) {
	test     %edi,%edi
      ↓ je       20b
	atomic_long_add(v, &pgdat->vm_stat[i])
/*
 * Update the zone counters for the current cpu.
 *
 * Note that refresh_cpu_vm_stats strives to only access node local
 * memory. The per cpu pagesets on remote zones are placed in the
 * memory local to the processor using that pageset. So the loop over
 * all zones will access a series of cachelines local to the processor.
 *
 * The call to zone_page_state_add updates the cachelines with the stats
 * in the remote zone struct as well as the global cachelines with the
 * global counters. These could cause remote node cache line bouncing
 * and will have to be only done when necessary.
 *
 * The function returns the number of global counters updated.
 */
$ wc -l tools/perf/util/bpf_skel/*.bpf.c
   191 tools/perf/util/bpf_skel/bperf_cgroup.bpf.c
    78 tools/perf/util/bpf_skel/bperf_follower.bpf.c
    55 tools/perf/util/bpf_skel/bperf_leader.bpf.c
    92 tools/perf/util/bpf_skel/bpf_prog_profiler.bpf.c
   116 tools/perf/util/bpf_skel/func_latency.bpf.c
   383 tools/perf/util/bpf_skel/kwork_trace.bpf.c
   175 tools/perf/util/bpf_skel/lock_contention.bpf.c
   273 tools/perf/util/bpf_skel/off_cpu.bpf.c
  1363 total
$
$ sudo perf lock contention -b
^C
 contended   total wait   max wait   avg wait       type   caller

        42    192.67 us   13.64 us    4.59 us   spinlock   queue_work_on+0x20
        23     85.54 us   10.28 us    3.72 us   spinlock   worker_thread+0x14a
         6     13.92 us    6.51 us    2.32 us      mutex   kernfs_iop_permission+0x30
         3     11.59 us   10.04 us    3.86 us      mutex   kernfs_dop_revalidate+0x3c
         1      7.52 us    7.52 us    7.52 us   spinlock   kthread+0x115
         1      7.24 us    7.24 us    7.24 us   rwlock:W   sys_epoll_wait+0x148
         2      7.08 us    3.99 us    3.54 us   spinlock   delayed_work_timer_fn+0x1b
         1      6.41 us    6.41 us    6.41 us   spinlock   idle_balance+0xa06
         2      2.50 us    1.83 us    1.25 us      mutex   kernfs_iop_lookup+0x2f
         1      1.71 us    1.71 us    1.71 us      mutex   kernfs_iop_getattr+0x2c
...