#定位系统性能瓶颈# perf

时间：2015-07-14 15:41:21 阅读：123 评论：0 收藏：0 [点我收藏+]

perf是一个基于Linux 2.6+的调优工具，在liunx性能测量抽象出一套适应于各种不同CPU硬件的通用测量方法，其数据来源于比较新的linux内核提供的 perf_event 接口

系统事件:
perf tool 支持一系列可计算的事件类型。该工具和底层内核接口可以监测来自不同来源的事件。
例如,一些事件是来源于纯粹的内核计数器,这些event在这种情况下被称为软件事件。例子包括:context-switch、minor-fault等。

事件的另一个来源是处理器本身及其性能监控装置(PMU)。它提供了一系列事件来监测 micro-architectural 事件，如cycle的数量、指令退出、L1缓存不命中等等。

这些事件被称为PMU硬件事件或硬件事件。每种处理器型号都有各种不同的硬件事件类型。

perf_events接口还提供了一组通用的硬件事件名称。
在每个处理器,这些事件映射到实际的CPU提供的实际存在的事件类型,如果不存在则不能使用。
而且这些也被称为硬件事件（hardware events）和硬件缓存事件（hardware cache events）。

最后,还有各种 tracepoint事件是由内核ftrace框架实现的，而且这些事件都是2.6.3x或更加新的内核才支持。

Tracepoints:
Tracepoint 是散落在内核源代码中的一些 hook，一旦使能，它们便可以在特定的代码被运行到时被触发，这一特性可以被各种 trace/debug 工具所使用。Perf 就是该特性的用户之一。
假如您想知道在应用程序运行期间，内核内存管理模块的行为，便可以利用潜伏在 slab 分配器中的 tracepoint。当内核运行到这些 tracepoint 时，便会通知 perf。
Perf 将 tracepoint 产生的事件记录下来，生成报告，通过分析这些报告，调优人员便可以了解程序运行时期内核的种种细节，对性能症状作出更准确的诊断。

常用perf工具：
perf list：列出所有事件类型
perf top：类似top的全局监控工具
perf stat：对某个程序或者特定进程进行监控

perf record & perf report：先记录再通过详细报告的形式打印事件信息

perf list

[root@localhost test]# perf list

List of pre-defined events (to be used in -e):
  cpu-cycles OR cycles                               [Hardware event]
  instructions                                       [Hardware event]
  cache-references                                   [Hardware event]
  cache-misses                                       [Hardware event]
  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
  bus-cycles                                         [Hardware event]
  stalled-cycles-frontend OR idle-cycles-frontend    [Hardware event]
  stalled-cycles-backend OR idle-cycles-backend      [Hardware event]
  ref-cycles                                         [Hardware event]

  cpu-clock                                          [Software event]
  task-clock                                         [Software event]
  page-faults OR faults                              [Software event]
  context-switches OR cs                             [Software event]
  cpu-migrations OR migrations                       [Software event]
  minor-faults                                       [Software event]
  major-faults                                       [Software event]
  alignment-faults                                   [Software event]
  emulation-faults                                   [Software event]

  L1-dcache-loads                                    [Hardware cache event]
  L1-dcache-load-misses                              [Hardware cache event]
  L1-dcache-stores                                   [Hardware cache event]
  L1-dcache-store-misses                             [Hardware cache event]
  L1-dcache-prefetches                               [Hardware cache event]
  L1-dcache-prefetch-misses                          [Hardware cache event]
  L1-icache-loads                                    [Hardware cache event]
  L1-icache-load-misses                              [Hardware cache event]
  L1-icache-prefetches                               [Hardware cache event]
  L1-icache-prefetch-misses                          [Hardware cache event]
  LLC-loads                                          [Hardware cache event]
  LLC-load-misses                                    [Hardware cache event]
  LLC-stores                                         [Hardware cache event]
  LLC-store-misses                                   [Hardware cache event]
  LLC-prefetches                                     [Hardware cache event]
  LLC-prefetch-misses                                [Hardware cache event]
  dTLB-loads                                         [Hardware cache event]
  dTLB-load-misses                                   [Hardware cache event]
  dTLB-stores                                        [Hardware cache event]
  dTLB-store-misses                                  [Hardware cache event]
  dTLB-prefetches                                    [Hardware cache event]
  dTLB-prefetch-misses                               [Hardware cache event]
  iTLB-loads                                         [Hardware cache event]
  iTLB-load-misses                                   [Hardware cache event]
  branch-loads                                       [Hardware cache event]
  branch-load-misses                                 [Hardware cache event]

  rNNN                                               [Raw hardware event descriptor]
  cpu/t1=v1[,t2=v2,t3 ...]/modifier                  [Raw hardware event descriptor]
  
  sunrpc:rpc_call_status                             [Tracepoint event]
  sunrpc:rpc_bind_status                             [Tracepoint event]
  sunrpc:rpc_connect_status                          [Tracepoint event]
  sunrpc:rpc_task_begin                              [Tracepoint event]
  sunrpc:rpc_task_run_action                         [Tracepoint event]
  sunrpc:rpc_task_complete                           [Tracepoint event]
  sunrpc:rpc_task_sleep                              [Tracepoint event]
  sunrpc:rpc_task_wakeup                             [Tracepoint event]
  ext4:ext4_free_inode                               [Tracepoint event]
  ext4:ext4_request_inode                            [Tracepoint event]
  ext4:ext4_allocate_inode                           [Tracepoint event]
  ext4:ext4_write_begin                              [Tracepoint event]
  ext4:ext4_da_write_begin                           [Tracepoint event]
	省略............

其中通过perf list命令运行在不同的系统会列出不同的结果，在 2.6.35 版本的内核中，该列表已经相当的长，但无论有多少，我们可以将它们划分为三类：

Hardware Event 是由 PMU 硬件产生的事件，比如 cache 命中，当您需要了解程序对硬件特性的使用情况时，便需要对这些事件进行采样；
Software Event 是内核软件产生的事件，比如进程切换，tick 数等 ;
Tracepoint event 是内核中的静态 tracepoint 所触发的事件，这些 tracepoint 用来判断程序运行期间内核的行为细节，比如 slab 分配器的分配次数等。

在操作系统运行过程中，关于系统调用的调度优先级别，从高到低是这样的：
硬中断->软中断->实时进程->内核进程->用户进程

Hard Interrupts – Devices tell the kernel that they are done processing. For example, a NIC delivers a packet or a hard drive provides an IO request（硬中断）
Soft Interrupts – These are kernel software interrupts that have to do with maintenance of the kernel. For example, the kernel clock tick thread is a soft interrupt. It checks to make sure a process has not passed its allotted time on a processor.（软中断）
Real Time Processes – Real time processes have more priority than the kernel itself. A real time process may come on the CPU and preempt (or “kick off”) the kernel. The Linux 2.4 kernel is NOT a fully preemptable kernel, making it not ideal for real time application programming.（实时进程，会抢占其他非实时进程时间片）
Kernel (System) Processes – All kernel processing is handled at this level of priority. （内核进程）
User Processes – This space is often referred to as “userland”. All software applications run in the user space. This space has the lowest priority in the kernel scheduling mechanism.（用户进程）

perf top

Samples: 475  of event 'cycles', Event count (approx.): 165419249
 10.30%  [kernel]          [k] kallsyms_expand_symbol
  7.75%  perf              [.] 0x000000000005f190
  5.68%  libc-2.12.so      [.] memcpy
  5.45%  [kernel]          [k] format_decode
  5.45%  libc-2.12.so      [.] __strcmp_sse42
  5.45%  perf              [.] symbols__insert
  5.24%  [kernel]          [k] memcpy
  4.86%  [kernel]          [k] vsnprintf
  4.62%  [kernel]          [k] string
  4.45%  [kernel]          [k] number
  3.35%  libc-2.12.so      [.] __strstr_sse42
  3.15%  perf              [.] hex2u64
  2.72%  perf              [.] rb_next
  2.10%  [kernel]          [k] pointer
  1.89%  [kernel]          [k] security_real_capable_noaudit
  1.89%  [kernel]          [k] strnlen
  1.88%  perf              [.] rb_insert_color
  1.68%  libc-2.12.so      [.] _int_malloc
  1.68%  libc-2.12.so      [.] _IO_getdelim
  1.28%  [kernel]          [k] update_iter
  1.05%  [kernel]          [k] s_show
  1.05%  libc-2.12.so      [.] memchr
  1.05%  [kernel]          [k] get_task_cred
  0.88%  [kernel]          [k] seq_read
  0.85%  [kernel]          [k] clear_page_c
  0.84%  perf              [.] symbol__new
  0.63%  libc-2.12.so      [.] __libc_calloc
  0.63%  [kernel]          [k] copy_user_generic_string
  0.63%  [kernel]          [k] seq_vprintf
  0.63%  perf              [.] kallsyms__parse
  0.63%  libc-2.12.so      [.] _IO_feof
  0.63%  [kernel]          [k] strcmp
  0.63%  [kernel]          [k] page_fault
  0.63%  perf              [.] dso__load_sym
  0.42%  perf              [.] strstr@plt
  0.42%  libc-2.12.so      [.] __strchr_sse42
  0.42%  [kernel]          [k] seq_printf
  0.42%  libc-2.12.so      [.] __memset_sse2
  0.42%  libelf-0.152.so   [.] gelf_getsym
  0.25%  [kernel]          [k] __mutex_init
  0.21%  [kernel]          [k] _spin_lock
  0.21%  [kernel]          [k] s_next
  0.21%  [kernel]          [k] fsnotify
  0.21%  [kernel]          [k] sys_read
  0.21%  [kernel]          [k] __link_path_walk
  0.21%  [kernel]          [k] intel_pmu_disable_all
  0.21%  [kernel]          [k] mem_cgroup_charge_common
  0.21%  [kernel]          [k] update_shares
  0.21%  [kernel]          [k] native_read_tsc
  0.21%  [kernel]          [k] apic_timer_interrupt
  0.21%  [kernel]          [k] auditsys
  0.21%  [kernel]          [k] do_softirq
  0.21%  [kernel]          [k] update_process_times
  0.21%  [kernel]          [k] native_sched_clock
  0.21%  [kernel]          [k] mutex_lock
  0.21%  [kernel]          [k] module_get_kallsym

perf top 类似top的作用，可以整体查看系统的全部事件并且按数量排序。

perf stat

perf stat可以直接接上命令或者通过pid形式对进程进行性能统计，完成后会打印出这个过程的性能数据统计

[root@localhost test]# perf stat -B dd if=/dev/zero of=/dev/null count=1000000
记录了1000000+0 的读入
记录了1000000+0 的写出
512000000字节(512 MB)已复制，0.790059 秒，648 MB/秒

 Performance counter stats for 'dd if=/dev/zero of=/dev/null count=1000000':

        791.176286 task-clock                #    1.000 CPUs utilized          
                 1 context-switches          #    0.001 K/sec                  
                 0 cpu-migrations            #    0.000 K/sec                  
               247 page-faults               #    0.312 K/sec                  
     1,248,519,452 cycles                    #    1.578 GHz                     [83.29%]
       366,166,452 stalled-cycles-frontend   #   29.33% frontend cycles idle    [83.33%]
       155,120,002 stalled-cycles-backend    #   12.42% backend  cycles idle    [66.62%]
     1,947,919,494 instructions              #    1.56  insns per cycle        
                                             #    0.19  stalled cycles per insn [83.31%]
       355,465,524 branches                  #  449.287 M/sec                   [83.41%]
         2,021,648 branch-misses             #    0.57% of all branches         [83.42%]

       0.791116595 seconds time elapsed

通过pid指定某进程进行性能监测，按ctrl-C结束，然后计算这段时间内的性能统计

[root@localhost test]# perf stat -p 18669
^C
 Performance counter stats for process id '18669':

          1.520699 task-clock                #    0.001 CPUs utilized          
                56 context-switches          #    0.037 M/sec                  
                 0 cpu-migrations            #    0.000 K/sec                  
                 0 page-faults               #    0.000 K/sec                  
         2,178,120 cycles                    #    1.432 GHz                     [63.18%]
         1,410,393 stalled-cycles-frontend   #   64.75% frontend cycles idle    [90.94%]
           942,665 stalled-cycles-backend    #   43.28% backend  cycles idle   
         1,067,824 instructions              #    0.49  insns per cycle        
                                             #    1.32  stalled cycles per insn
           193,104 branches                  #  126.984 M/sec                  
            14,544 branch-misses             #    7.53% of all branches         [61.93%]

       2.061889979 seconds time elapsed

[root@localhost test]# perf stat

 usage: perf stat [<options>] [<command>]

    -e, --event <event>   event selector. use 'perf list' to list available events
        --filter <filter>
                          event filter
    -i, --no-inherit      child tasks do not inherit counters
    -p, --pid <pid>       stat events on existing process id
    -t, --tid <tid>       stat events on existing thread id
    -a, --all-cpus        system-wide collection from all CPUs
    -g, --group           put the counters into a counter group
    -c, --scale           scale/normalize counters
    -v, --verbose         be more verbose (show counter open errors, etc)
    -r, --repeat <n>      repeat command and print average + stddev (max: 100)
    -n, --null            null run - dont start any counters
    -d, --detailed        detailed run - start a lot of events
    -S, --sync            call sync() before starting a run
    -B, --big-num         print large numbers with thousands' separators
    -C, --cpu <cpu>       list of cpus to monitor in system-wide
    -A, --no-aggr         disable CPU count aggregation
    -x, --field-separator <separator>
                          print counts with custom separator
    -G, --cgroup <name>   monitor event in cgroup name only
    -o, --output <file>   output file name
        --append          append to the output file
        --log-fd <n>      log output to fd, instead of stderr
        --pre <command>   command to run prior to the measured command
        --post <command>  command to run after to the measured command

perf stat 还有其他有用的选项如 -e 是匹配指定事件

perf stat -e cycles,instructions,cache-misses [...]

也可以指定监控的级别

perf stat -e cycles:u dd if=/dev/zero of=/dev/null count=100000

具体含义是：

Modifiers	Description	Example
u	monitor at priv level 3, 2, 1 (user)	event:u
k	monitor at priv level 0 (kernel)	event:k
h	monitor hypervisor events on a virtualization environment	event:h
H	monitor host machine on a virtualization environment	event:H
G	monitor guest machine on a virtualization environment	event:G

这些修饰符其实相当于flag标记的布尔类型，u是用户态，k是内核态，后面三个是和虚拟化相关的。

操作系统资源一般分为四大部分：CPU、内存、IO、网络

容易出现瓶颈的一般是CPU和IO，所以我这里写了两个小程序分别模拟这两种情况，然后通过使用perf一系列的工具找到症结所在。

程序1是计算素数，专门用来耗系统CPU的

#include <stdio.h>
int IsPrime(int num)
{
 int i=2;
 for(;i<=num/2;i++)
  if(0==num%i)
   return 0;
 return 1;
}
void main()
{
 int num;
 for(num=2;num<=10000000;num++)
    IsPrime(num);
    
}

程序2是开n个线程，每隔1ms写1K的数据到文件，读1k的数据。用来耗系统IO

#include <unistd.h>
#include <stdio.h>
#include <string>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdlib.h>
#include <vector>
using namespace std;

bool gquit = false;

void* threadfun(void* param)
{
    string filename = *(string*)param;

    unlink(filename.c_str());
    
    int fd = open(filename.c_str(), O_CREAT | O_RDWR, 666);
    if (fd < 0) return 0;

    string a(1024*16*1024, 'a');
    bool flag = true;
    while (!gquit)
    {

        if (flag)
        {
            write(fd, a.c_str(), a.length());
        }
        else
        {
            read(fd, &a[0], a.length());    
            lseek(fd, 0, SEEK_SET);
        }
        flag = !flag;
        usleep(1);
    }
    close(fd);
    return 0;
}


int main(int argc, char* argv[])
{
    if (argc != 2)
    {
        printf("need thread num\n");
        return 0;
    }
        
    int threadnum = atoi(argv[1]);

    vector<string> v(threadnum, "./testio.filex_");
    vector<pthread_t> vpid(threadnum, 0);
    for(int i = 0 ; i < threadnum ; ++ i)
    {
        pthread_attr_t attr;
        pthread_attr_init(&attr);

        static char buf[32];
        snprintf(buf, sizeof(buf), "%d", i);
        v[i] += buf;
        pthread_create(&vpid[i], & attr, threadfun, (void*)&v[i]); 
        
    }
    printf("press q to quit\n");    
    
    while(getchar() != 'q');

    gquit = true;
    for(int i = 0 ; i < threadnum ; ++ i)
    {
        void* ret = NULL;
        pthread_join(vpid[i], &ret);    
        unlink(v[i].c_str());
    }   
    return 0;
}

我先运行程序1后，通过perf top进行全局监控

Samples: 1K of event 'cycles', Event count (approx.): 690036067
 74.14%  cpu               [.] IsPrime
  2.45%  [kernel]          [k] kallsyms_expand_symbol
  2.21%  perf              [.] 0x000000000005f225
  1.65%  perf              [.] symbols__insert
  1.50%  perf              [.] hex2u64
  1.41%  libc-2.12.so      [.] __strcmp_sse42
  1.26%  [kernel]          [k] string
  1.25%  [kernel]          [k] format_decode
  1.10%  [kernel]          [k] vsnprintf
  0.95%  [kernel]          [k] strnlen
  0.85%  [kernel]          [k] memcpy
  0.80%  libc-2.12.so      [.] memcpy
  0.70%  [kernel]          [k] number
  0.60%  perf              [.] rb_next
  0.60%  libc-2.12.so      [.] _int_malloc
  0.55%  libc-2.12.so      [.] __strstr_sse42
  0.55%  [kernel]          [k] pointer
  0.55%  perf              [.] rb_insert_color

可以看出排名第一是cpu这个进程的IsPrime函数占用了最多的CPU性能

然后通过perf stat 指定pid进行监控

[root@localhost test]# perf stat -p 10704
^C
 Performance counter stats for process id '10704':

       4067.488066 task-clock                #    1.001 CPUs utilized           [100.00%]
                 5 context-switches          #    0.001 K/sec                   [100.00%]
                 0 cpu-migrations            #    0.000 K/sec                   [100.00%]
                 0 page-faults               #    0.000 K/sec                  
    10,743,902,971 cycles                    #    2.641 GHz                     [83.32%]
     5,863,649,174 stalled-cycles-frontend   #   54.58% frontend cycles idle    [83.32%]
     1,347,218,352 stalled-cycles-backend    #   12.54% backend  cycles idle    [66.69%]
    14,625,550,503 instructions              #    1.36  insns per cycle        
                                             #    0.40  stalled cycles per insn [83.34%]
     1,950,981,334 branches                  #  479.653 M/sec                   [83.35%]
            46,273 branch-misses             #    0.00% of all branches         [83.32%]

       4.064800952 seconds time elapsed

可以看到各种系统事件的统计信息

然后使用perf record和report进行更加具体的分析

[root@localhost test]# perf record -g -- ./cpu    
^C[ perf record: Woken up 4 times to write data ]
[ perf record: Captured and wrote 0.875 MB perf.data (~38244 samples) ]
./cpu: Interrupt


[root@localhost test]# perf report
Samples: 11K of event 'cycles', Event count (approx.):6815943846
+ 99.70% cpu cpu [.] IsPrime
+ 0.05% cpu [kernel.kallsyms] [k] hrtimer_interrupt
+ 0.03% cpu cpu [.] main
+ 0.02% cpu [kernel.kallsyms] [k] native_write_msr_safe
+ 0.02% cpu [kernel.kallsyms] [k] _spin_unlock_irqrestore
+ 0.02% cpu [kernel.kallsyms] [k] perf_adjust_freq_unthr_context
+ 0.01% cpu [kernel.kallsyms] [k] rb_insert_color
+ 0.01% cpu [kernel.kallsyms] [k] raise_softirq
+ 0.01% cpu [kernel.kallsyms] [k] rcu_process_gp_end
+ 0.01% cpu [kernel.kallsyms] [k] scheduler_tick
+ 0.01% cpu [kernel.kallsyms] [k] do_timer
+ 0.01% cpu [kernel.kallsyms] [k] _spin_lock_irqsave
+ 0.01% cpu [kernel.kallsyms] [k] apic_timer_interrupt
+ 0.01% cpu [kernel.kallsyms] [k] tick_sched_timer
+ 0.01% cpu [kernel.kallsyms] [k] _spin_lock
+ 0.01% cpu [kernel.kallsyms] [k] __remove_hrtimer
+ 0.01% cpu [kernel.kallsyms] [k] idle_cpu

非常清晰可以看到各种函数的调用情况，由于这个程序是个很简单的代码，所以容易看出问题所在，但是如果是大项目的某个功能出现问题，通过perf工具就可以很方便地定位问题，后面再看看一个占用IO的例子。

同样也是先启动程序2，先通过perf top、perf stat -p pid这两个命令先定位是哪个进程导致性能问题

Samples: 394K of event 'cycles', Event count (approx.): 188517759438
 75.39%  [kernel]             [k] _spin_lock
 11.39%  [kernel]             [k] copy_user_generic_string
  1.89%  [jbd2]               [k] jbd2_journal_stop
  1.66%  [kernel]             [k] find_get_page
  1.51%  [jbd2]               [k] start_this_handle
  0.95%  [ext4]               [k] ext4_da_write_end
  0.76%  [kernel]             [k] iov_iter_fault_in_readable
  0.74%  [ext4]               [k] ext4_journal_start_sb
  0.56%  [jbd2]               [k] __jbd2_log_space_left
  0.44%  [ext4]               [k] __ext4_journal_stop
  0.44%  [kernel]             [k] __block_prepare_write
  0.41%  [kernel]             [k] __wake_up_bit
  0.35%  [kernel]             [k] _cond_resched
  0.32%  [kernel]             [k] kmem_cache_free

可以看到很多 spin lock的事件和各种文件操作事件

直接通过perf record和report进行分析，得出结果

[root@localhost test]# perf record -g -- ./io 10
press q to quit
q
[ perf record: Woken up 83 times to write data ]
[ perf record: Captured and wrote 23.361 MB perf.data (~1020643 samples) ]

[root@localhost test]# perf report
Samples: 138K of event 'cycles', Event count (approx.): 64858537028
+  65.87%  io  [kernel.kallsyms]    [k] _spin_lock
+  14.19%  io  [kernel.kallsyms]    [k] copy_user_generic_string
+   2.74%  io  [jbd2]               [k] jbd2_journal_stop
+   2.46%  io  [kernel.kallsyms]    [k] find_get_page
+   2.11%  io  [jbd2]               [k] start_this_handle
+   1.06%  io  [kernel.kallsyms]    [k] iov_iter_fault_in_readable
+   1.05%  io  [ext4]               [k] ext4_journal_start_sb
+   0.85%  io  [ext4]               [k] ext4_da_write_end
+   0.73%  io  [jbd2]               [k] __jbd2_log_space_left
+   0.66%  io  [kernel.kallsyms]    [k] generic_write_end
+   0.61%  io  [ext4]               [k] __ext4_journal_stop
+   0.57%  io  [kernel.kallsyms]    [k] __block_prepare_write
+   0.54%  io  [kernel.kallsyms]    [k] __wake_up_bit

好的，以上就是perf常用的工具，当然perf不止这几个工具，但这几个是最基础最常用的工具

通过查看官方wiki https://perf.wiki.kernel.org/index.php/Tutorial 可以了解更多信息

#定位系统性能瓶颈# perf

标签：linux 性能内核 hardware

原文地址：http://blog.csdn.net/jeffreynicole/article/details/46877033

踩

(0)

评论一句话评论（0）

分享档案

更多>

2021年07月29日 (22)
2021年07月28日 (40)
2021年07月27日 (32)
2021年07月26日 (79)
2021年07月23日 (29)
2021年07月22日 (30)
2021年07月21日 (42)
2021年07月20日 (16)
2021年07月19日 (90)
2021年07月16日 (35)

周排行