
            ------------------------------------------------------
                                    pfmon-2.0
	    A tool to collect monitoring information for Linux/ia64
            ------------------------------------------------------
		   Copyright (c) 2001-2002 Hewlett-Packard Company
		                 Stephane Eranian <eranian@hpl.hp.com>
	

Pfmon is a performance monitoring tool designed exclusively for Linux/ia64.
It does not work with Linux/ia32. It is meant as a sample tool to demonstrate
how to use the perfmon subsystem provided by the Linux/ia64 kernel as of version 2.4.0.
This tool uses the powerful IA-64 Performance Monitoring Unit (PMU) to 
do counting and sampling on unmodified binaries or for the entire system.


This document describes how to use pfmon.
It covers pfmon-2.0 only.

                -----> YOU MUST at least HAVE kernel v2.4.18 <-----

1/ Introduction

   Pfmon can be used to monitor unmodified binaries in its per-process mode and it can 
   also be used to run system wide monitoring sessions. Such sessions are active across 
   all processes executing on a given CPU. Pfmon can launch a system wide session 
   on a dedicated CPU or set of CPUs in parallel. 

   Pfmon can monitor activities happening at the user and/or kernel level for both
   types of sessions.

   Pfmon can be used to collect basic event counts. It can also be used to sample 
   program or system execution.

   In per-process mode, pfmon can only monitor the first process (task). Subsequent processes or
   threads created by that initial process will not be monitored.

   Pfmon can run on any IA-64 CPU model and provides the minimal features mandated
   by the architecture but it also provides model specific extensions. For instance,
   on Itanium pfmon has support for the EAR and BTB features.

   Pfmon is based on a generic helper library called libpfm which is included in this package.
   The library is not specific to pfmon and can be used directly by other programs as is
   demonstrated in the set of examples also included in this package. Both the library and
   pfmon have a modular architecture which makes it easier to support new PMU models as they
   become available.

   In the remainder of this document, we describe the key options and features of pfmon
   which are available on all CPU (PMU) models. Please refer to the model
   specific documentation for advanced features.

2/ pfmon options

   The set of command line options provided by pfmon depends on the host PMU. It is possible
   to compile pfmon for more than one PMU model and then it will auto-detect
   the host PMU and provide the corresponding set of options.

   The options common to all PMU models are as follows:

   -h, --help                           display the list of supported options

   -V, --version                        display pfmon version information

   -l[regex], --show-event-list[=regex] show list of supported events by host PMU

   -i <event>, --event-info=event       get information about a particular event.

   -u, -3, --user-level                 monitor at the user level for all events. 
   					Per-event setting is possible with --priv-levels

   -k, -0, --kernel-level               monitor at the kernel level for all events. 
   					Per-event setting is possible with --priv-levels

   -2					monitor at privilege level 2 

   -1					monitor at privilege level 1 

   -e, --events=ev1,ev2,...             select events to monitor. There should be no space
					between the events. The number of events that you can specify 
					is dependent on the underlying PMU model. Four events is typical.

   -I, --info                           list the compiled in PMU models supported by pfmon and
   					detected host PMU as well as sampling output formats.

   --debug                              enable debug prints.

   --verbose                            print more information during execution.

   --outfile=filename                   print counts in a file

   --append                             when used with --outfile, will open the file in append mode.

   --overflow-block                     block the monitored program on overflow 
   					notifications (per process mode only).

   --system-wide                        create a system wide monitoring session. Default session type 
   					is per process.

   --cpu-mask=0xn                       bitmask indicating on which CPU to start system wide 
   					monitoring. When this option is not specified, pfmon will
					monitor on all CPUs.

   -S format, --smpl-output-info=format display information about a sampling output format.

   -t secs, --session-timeout=secs      duration of the session in seconds. In per process mode, 
   					the process will get killed if the timeout expires.

   --smpl-outfile=filename              save sampling results in a file.

   --smpl-entries=val                   size of the sampling buffer in number of entries 
   					(default=2048).

   --long-smpl-periods=val1,val2,...    set sampling periods for each event after user level 
   					notification.

   --short-smpl-periods=val1,val2,...   set sampling period for each event.

   --with-header                        generate a machine description header with results.

   --aggregate-results                  aggregate counts and sampling buffer outputs when running
   					system wide monitoring on multiple CPUs.

   --trigger-start-address=addr		start monitoring only when execution reaches addr (code) for 
   					the first time. A trigger stop address is not currently supported.

   --priv-levels=lvl1,lvl2,...          set privilege level per event. lvl can be any combination of
   				 	u or 3 (user), k or 0 (kernel), 1 (priv level 1), 2 (priv level 2).
					Unspecified events get the global setting, which is user only by default.

   --show-time                          show real, user, and system time for the executed command.

   --us-counter-format                  print counters using commas (1,024).

   --eu-counter-format                  print counters using points (1.024).

   --hex-counter-format                 print counters in hexadecimal (0x400).

   --smpl-output-format=fmt             select fmt as sampling output format, use -I to list formats. 

   --long-show-events[=regex]		display detailed information about matching events in a single line (easy to grep)

   --symbol-file=filename		use the ELF archive filename to look for symbols

   --sysmap-file=filename		use the System.map filename to look for symbols

   --check-events-only			check that the event combination is valid and exit (no measurement)

   --smpl-periods-random=mask1:seed1,... randomize both the short and long sampling periods. The mask indicates
   					the significant bits to keep in the randomly generated value. The seed is
					used to initialize the pseudo-random number generator. You can use a different
					mask and seed per event.

   --trigger-start-delay=secs		number of seconds before activating monitoring

   --smpl-print-counts			print counter results when sampling (off by default)

   --exclude-idle			exclude idle tasks from system wide monitoring

3/ Getting event information with pfmon

   The list of events supported by pfmon depends on the host PMU. You can get the list
   of supported events using the following pfmon option:

   % pfmon -l
   CPU_CYCLES
   IA64_INST_RETIRED
   IA64_TAGGED_INST_RETIRED_PMC8
   IA64_TAGGED_INST_RETIRED_PMC9
   INST_DISPERSED
   EXPL_STOPBITS
   ALL_STOPS_DISPERSED
   IA32_INST_RETIRED
   ISA_TRANSITIONS
   NOPS_RETIRED
   ....

   If you specify an argument to the -l option (no space between l and the
   argument), it is interpreted as a regular expression and all matching events 
   will be listed:

   % pfmon -ll1d
   L1D_READ_FORCED_MISSES_RETIRED
   L1D_READ_MISSES_RETIRED
   L1D_READS_RETIRED
   PIPELINE_FLUSH_L1D_WAYMP_FLUSH
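
   The matching behavior shown above can be sketched in a few lines of Python.
   This is only an illustration, not pfmon's implementation: it assumes that
   the expression is matched anywhere in the name and that case is ignored,
   which is what the output above suggests. The EVENTS table and the
   match_events() helper are hypothetical names used for this sketch.

```python
import re

# Hypothetical event table: a few names copied from the listings in this document.
EVENTS = [
    "CPU_CYCLES",
    "IA64_INST_RETIRED",
    "NOPS_RETIRED",
    "L1D_READ_FORCED_MISSES_RETIRED",
    "L1D_READ_MISSES_RETIRED",
    "L1D_READS_RETIRED",
    "PIPELINE_FLUSH_L1D_WAYMP_FLUSH",
    "L2_DATA_REFERENCES_WRITES",
]


def match_events(pattern, events=EVENTS):
    """Return the events whose name contains a match for pattern (case ignored)."""
    # Assumption: unanchored, case-insensitive search.
    rx = re.compile(pattern, re.IGNORECASE)
    return [name for name in events if rx.search(name)]


if __name__ == "__main__":
    print(match_events("l1d"))      # the four names containing L1D
    print(match_events("writes$"))  # only names ending in WRITES
```

   Because the search is not anchored, "l1d" also matches an event such as
   PIPELINE_FLUSH_L1D_WAYMP_FLUSH where L1D appears in the middle of the name.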

   You can get more detailed information about each event using the following option:

   % pfmon -i nops_retired
   Name   : NOPS_RETIRED
   VCode  : 0x30
   Code   : 0x30
   PMD/PMC: [ 4 5 ]
   EAR    : No (N/A)
   Umask  : None
   BTB    : No
   Thres  : 6
   Qual   : [Instruction Address Range] [OpCode match]

   Pfmon is case insensitive for event names. Here you see some details about the event.
   The first 4 lines are generic and provided on all PMU models even though the codes may
   vary:
  	- Code is the event code used by the PMU. 

	- VCode is a libpfm internal event code which encapsulates the event code and other
	  information describing the type of the event. For simple events, the two codes are
	  usually identical.

	- PMD/PMC: list the counting monitors on which this event can be programmed. Not 
		   all events can necessarily be programmed on all available counting 
		   monitors. This constraint is taken care of by the libpfm library.

   Here the remaining information is specific to the Itanium 2 PMU.

   Even with the -i option, you can use a regular expression for the event:
   % pfmon -i'writes$'
   Name   : L2_DATA_REFERENCES_WRITES
   VCode  : 0x20069
   Code   : 0x69
   PMD/PMC: [ 4 5 6 7 ]
   Umask  : 0010
   EAR    : No (N/A)
   BTB    : No
   MaxIncr: 2  (Threshold [0-1])
   Qual   : [Instruction Address Range] [OpCode Match] [Data Address Range] 

   On some PMU models (currently Itanium 2), the event information contains a
   text description of the event.

   Events can be specified using their code:
   % pfmon -i 0x45
   Name   : L2_INST_PREFETCHES
   VCode  : 0x45
   Code   : 0x45
   PMD/PMC: [ 4 5 6 7 ]
   Umask  : 0000
   EAR    : No (N/A)
   BTB    : No
   MaxIncr: 1  (Threshold 0)
   Qual   : [Instruction Address Range] 
   Group  : None
   Set    : None
   Desc   : L2 Instruction Prefetch Requests

   Information about what each event measures can be found in the relevant CPU model specific
   micro-architecture documentation.

   The architecture requires only two events to be defined by all PMUs:

   	- CPU_CYCLES        : the number of elapsed CPU cycles.
	- IA64_INST_RETIRED : the number of instructions retired. 

   These two events are guaranteed to exist on all PMUs, but their codes may vary. The PMU specific
   event names may not be exactly the same; however, pfmon, and especially the library it uses
   (libpfm), ensures that those two events can always be referred to by the two names listed
   above. As alluded to earlier, pfmon can support more than one PMU in a single binary. Pfmon
   also incorporates a generic PMU model which provides only the features defined by the
   architecture, including these two events. If pfmon does not have specific support for the
   host PMU, it will default to the so-called 'Generic' PMU support, if compiled in. You can find
   out what PMU support is compiled into pfmon as follows:

   % pfmon -I
   detected host CPUs:  4-way 800MHz Itanium (Merced, C0)
   supported PMU models: [itanium2] [itanium] [generic] 
   detected host PMU: itanium
   supported sampling outputs: [detailed-itanium] [raw] [compact] [btb] [example] 
   pfmlib version: 2.0
   kernel perfmon version: 1.0

   It is possible to force pfmon to operate in generic mode even though it has support for the
   host CPU using the pfmon_gen command:

   % pfmon_gen -I
   forced libpfm to generic support
   detected host CPUs:  4-way 800MHz Itanium (Merced, C0)
   supported PMU models: [itanium2] [itanium] [generic] 
   detected host PMU: generic
   supported sampling outputs: [raw] [compact] [example] 
   pfmlib version: 2.0
   kernel perfmon version: 1.0

   % pfmon_gen -i CPU_CYCLES
   forced libpfm to generic support
   Name   : CPU_CYCLES
   VCode  : 0x12
   Code   : 0x12
   PMD/PMC: [ 4 5 6 7 ]

   pfmon_gen is not a separate command but just a symlink to pfmon. In fact, pfmon always
   checks the name it was invoked with. If this name is equal to 'pfmon_gen' and the generic
   support is compiled in, then pfmon will operate in generic mode. This feature is useful when
   moving pfmon to a PMU for which neither pfmon itself nor libpfm has support yet.
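
   The invocation-name check can be pictured with the following Python sketch.
   It is not taken from the pfmon sources; select_pmu_mode() and its parameters
   are hypothetical. The sketch also assumes that pfmon falls back to the
   detected host PMU when generic support is not compiled in, which this
   document does not state.

```python
import os


def select_pmu_mode(argv0, host_pmu, generic_compiled_in=True):
    """Pick the PMU support to use, based on the name the tool was invoked with."""
    invoked_as = os.path.basename(argv0)   # "/usr/bin/pfmon_gen" -> "pfmon_gen"
    if invoked_as == "pfmon_gen" and generic_compiled_in:
        return "generic"
    # Assumption: any other case keeps the auto-detected host PMU.
    return host_pmu
```

   The same technique is used by many Unix tools that change behavior
   depending on the name of the symlink through which they are started.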

3/ Basic counting with pfmon

   In generic mode, pfmon only supports the two architected events listed
   above. For comparison, the Itanium PMU supports about 230 events and the
   Itanium 2 PMU about 470.

   No instrumentation of the program is required to monitor the system or a
   single process.

   a/ simple examples

   To collect counts on a specific command, you just need to launch it via pfmon, just like
   you would do with the time or strace command:

   % pfmon ls /var/spool
   anacron  at  cron  fax  lpd  mail  mqueue  news  rwho  samba  slrnpull  squid  up2date  uucp  uucppublic  vbox  voice
       2910724 CPU_CYCLES

   When invoked with no particular event, pfmon defaults to CPU_CYCLES. To monitor specific events,
   you can type:

   % pfmon -e cpu_cycles,IA64_inst_Retired ls /var/spool
   anacron  at  cron  fax  lpd  mail  mqueue  news  rwho  samba  slrnpull  squid  up2date  uucp  uucppublic  vbox  voice
   2984546 CPU_CYCLES
   2666884 IA64_INST_RETIRED

   As you can see, pfmon is not case sensitive with regard to event names. More than one event
   can be measured at a time using a comma separated list of events. There MUST be no space
   after the comma.

   If the command you want to run takes options, you can clearly distinguish the options of 
   pfmon from the options of your command using the '--' symbol:

   % pfmon -e ia64_inst_retired -- ls -ial /dev/null
   210135 crw-rw-rw-    1 root     root       1,   3 Mar 24  2001 /dev/null
              2709704 IA64_INST_RETIRED

   Otherwise, pfmon will stop parsing arguments as options at the first
   argument which does not start with a - or --.

   b/ privilege levels

   By default, pfmon monitors only what is going on at the user level
   (application level). This is true for both per-process and system wide
   mode.

   It is possible to monitor at any of the 4 privilege levels provided by IA-64.
   It is also possible to monitor at several levels at the same time by specifying
   more than one level. The levels can be specified for all events or on a per-event
   basis.

   To affect all events, you can use any combinations of -k (-0), -1, -2, -u (or -3).
   To set the level for each event, the --priv-levels option must be used.
 
   By default, pfmon only measures at the user level:

   % pfmon -e nops_retired ls

   counts the number of NOPS_RETIRED when ls is running at the user level only
   (equivalent to specifying -u or -3).

   % pfmon -k -e nops_retired ls

   counts the number of NOPS_RETIRED when ls is running at the kernel level only.

   % pfmon -k -u -e nops_retired ls

   counts the number of NOPS_RETIRED when ls is running at the kernel level
   or user level, i.e. all the time.

   It is possible to refine the settings on a per event basis using the
   --priv-levels option.

   % pfmon -e loads_retired,nops_retired ls

   Both events are measured at the user level only.

   % pfmon --priv-levels=u,k -e loads_retired,nops_retired ls

   LOADS_RETIRED is measured at the user level only, NOPS_RETIRED at the
   kernel level only.

   % pfmon --priv-levels=,uk -e loads_retired,nops_retired ls

   LOADS_RETIRED is measured at the user level only, NOPS_RETIRED at the
   user and kernel levels.

   % pfmon -k --priv-levels=uk -e loads_retired,nops_retired ls

   LOADS_RETIRED is measured at the user and kernel levels, NOPS_RETIRED at the
   kernel level only.
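
   The way the global and per-event settings combine in the examples above can
   be summarized with a short Python sketch. It is only a model of the
   documented behavior, not pfmon code; parse_levels() and resolve_levels()
   are hypothetical helpers. Levels are represented by their numbers
   (3 = user, 0 = kernel).

```python
def parse_levels(spec):
    """Turn a string such as 'uk' or '03' into a set of privilege levels (0-3)."""
    aliases = {"u": 3, "3": 3, "k": 0, "0": 0, "1": 1, "2": 2}
    return {aliases[ch] for ch in spec}


def resolve_levels(events, global_spec="", per_event_spec=""):
    """Return {event: set of levels}.

    global_spec models the -u/-k/-1/-2 switches, per_event_spec the value
    given to the per-event option. Empty or missing per-event fields fall
    back to the global setting, which is user level only by default.
    """
    default = parse_levels(global_spec) or {3}
    fields = per_event_spec.split(",") if per_event_spec else []
    result = {}
    for i, event in enumerate(events):
        field = fields[i] if i < len(fields) else ""
        result[event] = parse_levels(field) or set(default)
    return result
```

   For instance, resolve_levels(["loads_retired", "nops_retired"], "k", "uk")
   gives levels {0, 3} to the first event and {0} to the second, matching the
   last example above.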


   c/ counter formats

   Pfmon can display the final counts in various formats. There are 4 formats
   defined. The default one is shown in the examples above. To make it easier
   to read large numbers or to feed the numbers to other programs, pfmon
   also supports:

   --us-counter-format where the thousands, millions, billions are separated 
   with commas (US and UK style):

   % pfmon --us-counter-format ls -l /dev/null
   crw-rw-rw-    1 root     root       1,   3 Mar 24  2001 /dev/null
                  2,292,689 CPU_CYCLES

   --eu-counter-format  where the thousands, millions, billions are separated 
   with points (European style):

   % pfmon --eu-counter-format ls -l /dev/null
   crw-rw-rw-    1 root     root       1,   3 Mar 24  2001 /dev/null
                  1.703.898 CPU_CYCLES

   --hex-counter-format where the counts are shown in hexadecimal format:

   % pfmon --hex-counter-format ls -l /dev/null
   crw-rw-rw-    1 root     root       1,   3 Mar 24  2001 /dev/null
         0x000000000019c164 CPU_CYCLES
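
   For readers who post-process counts, the three alternative formats can be
   reproduced with standard Python string formatting. This sketch is not part
   of pfmon; the function names are made up, and the 16-digit hexadecimal
   width is inferred from the sample output above.

```python
def us_format(count):
    """Group digits by thousands with commas, e.g. 1,024."""
    return f"{count:,}"


def eu_format(count):
    """Group digits by thousands with points, e.g. 1.024."""
    return f"{count:,}".replace(",", ".")


def hex_format(count):
    """Zero-padded 64-bit hexadecimal, e.g. 0x0000000000000400."""
    return f"0x{count:016x}"
```

   The leading padding that pfmon prints in front of the count is not
   reproduced here.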

   d/ saving counts

   By default, the counts are printed on the controlling tty. However it is
   possible to save them in a file using the --outfile option:

   % pfmon --outfile=b --hex-counter-format ls -l /dev/null
   crw-rw-rw-    1 root     root       1,   3 Mar 24  2001 /dev/null
   % cat b
   0x000000000016a8b1 CPU_CYCLES
   
   It is possible to include a header with the results using the --with-header
   option. It will be printed on the controlling tty or saved in the output
   file. The header contains detailed information about the configuration of
   the host machine and on the monitoring session:

   % pfmon --with-header --outfile=b --hex-counter-format ls -l /dev/null
   crw-rw-rw-    1 root     root       1,   3 Mar 24  2001 /dev/null
   % cat b
   #
   # date: Wed Nov 20 16:03:13 2002
   #
   # hostname: hpljumbo.hpl.hp.com
   #
   # kernel version: Linux 2.4.18 #2 SMP Tue Aug 6 11:54:56 PDT 2002
   #
   # pfmon version: 2.0
   # kernel perfmon version: 1.0
   #
   #
   #
   # page size: 16384 bytes
   # CLK_TCK: 1024 ticks/second
   # CPU configured: 4
   # CPU online: 4
   # physical memory: 6827933696
   # physical memory available: 5606391808
   #
   # host CPUs:  4-way 800MHz Itanium (Merced, C0)
   #	PAL_A: 6.6.23
   #	PAL_B: 7.7.28
   #	Cache levels: 3 Unique caches: 4
   #	L1D:    16384 bytes, line  32 bytes, load_lat   2, store_lat   0
   #	L1I:    16384 bytes, line  32 bytes, load_lat   2, store_lat   0
   #	L2 :    98304 bytes, line  64 bytes, load_lat   6, store_lat   6
   #	L3 :  4194304 bytes, line  64 bytes, load_lat  21, store_lat  21
   #
   #
   # captured events:
   #	PMD4: CPU_CYCLES, user level(s)
   #
   # monitoring mode: per-process
   #
   #
   # instruction sets:
   #	PMD4: CPU_CYCLES, ia32/ia64
   #
   #
   # command: pfmon --with-header --outfile=b --hex-counter-format ls -l /dev/null
   #
   #
   #
            0x00000000001a8956 CPU_CYCLES
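
   A file saved this way is easy to read back from a script. The Python sketch
   below is one possible approach and makes several assumptions: header lines
   start with '#', each result line is "<count> <EVENT>", and counts are in
   the default or hexadecimal format (the comma and point formats are not
   handled). The parse_results() name is hypothetical.

```python
def parse_results(text):
    """Split pfmon-style output into (header, counts) dictionaries."""
    header = {}
    counts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            body = line.lstrip("#").strip()
            key, sep, value = body.partition(":")
            # Keep only "key: value" lines; nested entries are flattened and
            # a later duplicate key overwrites an earlier one.
            if sep and value.strip():
                header[key.strip()] = value.strip()
            continue
        value, event = line.split()
        counts[event] = int(value, 0)   # base 0 accepts 0x... and plain decimal
    return header, counts
```

   With the file shown above, the counts dictionary would contain a single
   CPU_CYCLES entry and the header dictionary entries such as "hostname" and
   "page size".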

   e/ delayed start

   By default, pfmon will start monitoring at the first instruction of the
   program, i.e., the entry point when the privilege level is limited to user
   level. Even when kernel level monitoring is enabled nothing will be measured
   until the process leaves the kernel for the first time, after fork.

   Sometimes, it may be useful to delay the activation of monitoring until
   a certain point in the execution is reached. This is the case when the
   initialization must not be included in the counts. Pfmon provides two different
   ways to delay the point at which monitoring is turned on with the
   --trigger-start-address and --trigger-start-delay options.  
   
   The --trigger-start-address option only applies to per-process sessions and is ignored for
   system-wide sessions. It uses a code address to trigger monitoring. Once execution reaches
   the bundle address specified with the option, monitoring is turned on and
   remains on until the program terminates. The address can be specified in hexadecimal, or
   a code symbol name can be provided. It is not possible to specify a kernel address; pfmon
   will reject any such address. When an address is given explicitly, pfmon will not try to
   validate it except by checking that it is not in the kernel. The delayed start mechanism is
   triggered only the first time the address is reached.

   If main() is at address 0x40000000000004a0, then we can delay monitoring until main() is
   reached using:

   % pfmon --trigger-start-address=0x40000000000004a0 -e loads_retired foo
      74 LOADS_RETIRED

   or using the symbol table:

   % pfmon --trigger-start-address=main -e loads_retired foo
      74 LOADS_RETIRED

   IMPORTANT: Note that pfmon can ONLY lookup symbols in the "main" program and NOT in 
   any dynamically linked libraries. To allow complete coverage, the program MUST be
   linked statically.

   Without the trigger address, the same program produces:
   % pfmon -e loads_retired foo
      1598 LOADS_RETIRED

   This example shows that the libc initialization alone accounts for 1598-74=1524 loads.

   The --trigger-start-delay option uses time to delay monitoring. You simply specify a delay
   in seconds. When the delay expires, monitoring is turned on. This option works
   for both per-process and system-wide monitoring. If the monitored process terminates before
   the delay expires, then nothing gets measured. This applies to both per-process sessions and
   system wide sessions that use a command to delimit the session. Note that the session effectively
   starts when monitoring is turned on. Hence, the --session-timeout timer is only armed when monitoring
   is turned on.

   The following example will start monitoring 5 seconds into the execution of foo:

   % pfmon --trigger-start-delay=5 -e loads_retired foo

   The following example will start monitoring 5 seconds into the execution of foo and
   will monitor for 10 seconds after that point:

   % pfmon --trigger-start-delay=5 --session-timeout=10 -e loads_retired foo


   f/ getting timing information

   It is possible to get a time breakdown of the execution of the monitored command for
   both per-process and system-wide mode using the --show-time option. The output is similar
   to that of the time(1) command. For instance:

   % pfmon --show-time -e nops_retired ls /dev/null
   /dev/null
   real 0h00m00.098s user 0h00m00.000s sys 0h00m00.095s
                     247913 NOPS_RETIRED

   g/ Testing event combinations

   Sometimes it is handy to check if some events can be measured simultaneously without actually
   starting the monitoring session. The --check-events-only option of pfmon allows this mode of
   operation. It will check that the combination is valid and then exit. If the combination is
   invalid, it will print out the reason and return with an exit value of 1; otherwise the exit
   value is 0. On Itanium 2, for instance, you can try:

   % pfmon --check-events-only -e loads_retired,stores_retired
   event LOADS_RETIRED and STORES_RETIRED cannot be measured at the same time
   % echo $?
   1

   Note that in this mode, you do not need to specify a command to execute.
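
   Because the result is reported through the exit value, this mode is easy to
   use from a script. The Python sketch below shows one way to do it; the
   combination_is_valid() helper is hypothetical, and it assumes a pfmon
   binary is reachable through the command given in the pfmon parameter.

```python
import subprocess


def combination_is_valid(events, pfmon=("pfmon",)):
    """Return True if the check-only run exits with status 0.

    events is a list of event names; pfmon is the command prefix used to
    start the tool (an assumption: it must be found in PATH by default).
    """
    cmd = list(pfmon) + ["--check-events-only", "-e", ",".join(events)]
    proc = subprocess.run(cmd, capture_output=True, text=True)
    # Exit value 0 means the combination is valid, 1 means it is not.
    return proc.returncode == 0
```

   A caller could, for example, try candidate event pairs in a loop and keep
   only those for which the function returns True before launching the real
   measurement.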

4/ System wide sessions

   When the --system-wide option is used, pfmon operates in system wide mode. This means that
   it no longer monitors a specific program but instead all the processes that execute
   on a specific set of CPUs. In this mode, you do not need to specify a command. You do not
   need to be root to create a system wide session.

   A system wide session cannot co-exist with any per-process sessions. But a system wide session
   can run concurrently with other system wide sessions as long as they do not monitor the same
   set of CPUs. Of course multiple per-process sessions are possible.

   a/ selecting CPUs to monitor

   The --cpu-mask option can be used to restrict monitoring to a specific set of CPUs. When this
   option is not present, pfmon will automatically launch a system wide session on all available
   CPUs as reported by /proc/cpuinfo.

   So if the system has 2 available CPUs:

   % pfmon --system-wide -u -e cpu_cycles,ia64_inst_retired
   <Press ENTER to stop session>
   CPU0                248793 CPU_CYCLES
   CPU0                 60710 IA64_INST_RETIRED
   CPU1                 26690 CPU_CYCLES
   CPU1                  7706 IA64_INST_RETIRED

   A system wide session can monitor at any privilege level (kernel, user, or both).

   If you want to restrict monitoring to a specific CPU, you can use the --cpu-mask option:

   % pfmon --system-wide --cpu-mask=0x2 -u -e cpu_cycles,ia64_inst_retired
   <Press ENTER to stop session>
   CPU1                 17841 CPU_CYCLES
   CPU1                  7577 IA64_INST_RETIRED

   The CPU mask is a bitmask where each bit represents a CPU. CPUs are numbered starting at 0.
   So bit 0 represents CPU0, bit 1 represents CPU1, and so on. Therefore the above command will only
   monitor events happening on CPU1. More than one bit can be set in the mask. For instance,
   with --cpu-mask=0x3, pfmon will monitor on CPU0 and CPU1 at the same time.
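
   The mask arithmetic is simple enough to do by hand, but a small Python
   sketch may help when many CPUs are involved. The two helper functions are
   hypothetical and only illustrate the bit numbering described above.

```python
def cpus_to_mask(cpus):
    """Build a CPU bitmask from a list of CPU numbers (CPU0 is bit 0)."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask


def mask_to_cpus(mask):
    """Return the sorted list of CPU numbers selected by a bitmask."""
    return [bit for bit in range(mask.bit_length()) if mask & (1 << bit)]
```

   For instance, hex(cpus_to_mask([1])) yields the 0x2 used above, and
   mask_to_cpus(0x3) yields CPU0 and CPU1.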

   b/ delimiting a system wide session

   There are three ways to delimit a system wide session. By default, the
   session terminates when the user presses the <Enter> key.  It is also
   possible to use a timeout expressed in seconds. Finally, the session can
   also be delimited by the execution of a command: it starts when the
   command starts and stops when the command terminates. Here are some examples:


   Monitor cpu_cycles and instructions retired on the first two CPUs at both
   user and kernel levels and wait for a keypress to stop:

   % pfmon --cpu-mask=0x3 --system-wide -u -k -e cpu_cycles,ia64_inst_retired
   <Press ENTER to stop session>
   CPU0                   821818169 CPU_CYCLES
   CPU0                  1338893885 IA64_INST_RETIRED
   CPU1                   821813442 CPU_CYCLES
   CPU1                  1341176908 IA64_INST_RETIRED


   Monitor cpu_cycles and instructions retired on the first two CPUs at both
   user and kernel levels for 10 seconds:

   % pfmon --session-timeout=10 --cpu-mask=0x3 --system-wide -u -k -e cpu_cycles,ia64_inst_retired
   <Session to end in 10 seconds>
   CPU0                  8003156088 CPU_CYCLES
   CPU0                 12800683300 IA64_INST_RETIRED
   CPU1                  8003106584 CPU_CYCLES
   CPU1                 12899764561 IA64_INST_RETIRED

   Monitor cpu_cycles and instructions retired on the first two CPUs at the
   user level only during the execution of the ls command (here obviously run
   on CPU0):

   % pfmon --cpu-mask=0x3 --system-wide -u -e cpu_cycles,ia64_inst_retired -- ls -l /dev/null
   crw-rw-rw-    1 root     root       1,   3 Mar 24  2001 /dev/null
   CPU0                       46560 CPU_CYCLES
   CPU0                       26839 IA64_INST_RETIRED
   CPU1                        7514 CPU_CYCLES
   CPU1                        1184 IA64_INST_RETIRED


   c/ results aggregation

   It is possible to aggregate counts when monitoring more than one CPU:

   % pfmon --aggregate-results --system-wide -k -e cpu_cycles,ia64_inst_retired
   <Press ENTER to stop session>
            852331455 CPU_CYCLES
           1387206797 IA64_INST_RETIRED

   In this case, the per-CPU results are summed. Pfmon does not allow different events to be
   monitored on different CPUs. To do that, run separate instances of pfmon, each with a different
   CPU mask, using command lines similar to:

   % pfmon --session-timeout=10 --cpu-mask=0x1 --system-wide -k -e cpu_cycles &
   % pfmon --session-timeout=10 --cpu-mask=0x2 --system-wide -k -e ia64_inst_retired &
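
   If per-CPU results have already been collected without aggregation, the
   same sums can be computed afterwards. The Python sketch below assumes
   result lines of the form "CPUn <count> <EVENT>" in the default counter
   format, as in the examples above; aggregate() is a hypothetical helper and
   not part of pfmon.

```python
from collections import defaultdict


def aggregate(lines):
    """Sum per-CPU counts by event name; lines that do not match are ignored."""
    totals = defaultdict(int)
    for line in lines:
        fields = line.split()
        if len(fields) != 3 or not fields[0].startswith("CPU"):
            continue                    # e.g. the "<Press ENTER ...>" prompt
        _cpu, count, event = fields
        totals[event] += int(count)
    return dict(totals)
```

   Applied to the two-CPU example of the previous section, this gives
   248793 + 26690 = 275483 CPU_CYCLES and 60710 + 7706 = 68416
   IA64_INST_RETIRED.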


5/ Dealing with symbols

    Whenever an option takes an address (code or data) as argument, it is
    possible to use a symbol name directly rather than its address.
    For instance, this is true for the --trigger-start-address option. The user
    has two ways to indicate where to find the symbol table. Pfmon
    can extract the symbol table directly from an ELF image. This is,
    for instance, what is done implicitly in per-process mode. Pfmon also
    understands the System.map format which is typically used to save the
    symbol table of the kernel.
    
    There are a couple of restrictions concerning symbols. Pfmon cannot
    extract symbol information coming from dynamically linked
    libraries or modules. To avoid this problem, the program must be statically
    linked and should not explicitly use dlopen().

    If the symbol table has been stripped, pfmon will not find any symbol.
    If the option requires a code address, pfmon will only look for matching
    code symbols. Conversely, if the option requires a data address, pfmon will only
    look for matching data symbols.

    By default, the symbols are automatically extracted from the command being
    run. This is true in per process mode but also in system wide mode when
    a command is specified. If symbols must be extracted from an
    alternative ELF archive, the user must use the --symbol-file option.
    The filename specified there must be an ELF/ia64 binary.

    Note that the Linux/ia64 kernel is also an ELF/ia64 archive; however,
    in most distributions the kernel image found in /boot/efi is often
    compressed. The compression scheme used for Linux/ia64 is different
    from the one used on Linux/ia32. The compressed image is simply
    the ELF/ia64 image compressed with gzip. So it is possible to decompress
    it to get the original ELF archive. The main caveat is that most of the
    time the compressed image is stripped. Therefore the user must rely on
    the corresponding System.map file, usually placed in /boot/efi. In this
    case, the user must explicitly specify the location of the System.map
    file via the --sysmap-file option.

    Here are a few examples on Itanium:

    Count the number of times main() is called in the noploop program:

    % file noploop
    noploop: ELF 64-bit LSB executable, IA-64, version 1, statically linked, not stripped
    % pfmon --checkpoint-func=main -e ia64_inst_retired noploop 10000

    Here the symbol information for main() is directly extracted from noploop
    itself.

    Count the number of times main() is called in the noploop-stripped program:

    % file noploop-stripped
    noploop-stripped: ELF 64-bit LSB executable, IA-64, version 1, statically linked, stripped
    % pfmon --symbol-file=noploop --checkpoint-func=main -e ia64_inst_retired noploop-stripped 1000

    Here noploop and noploop-stripped are the same programs except that the latter does not have 
    the symbol table anymore.

    
    Count the number of times sys_getpid() is called during the execution of noploop:
    % pfmon -k --symbol-file=/boot/efi/vmlinux-nostrip --checkpoint-func=sys_getpid -e ia64_inst_retired noploop 1000

    Here we assume that the kernel file vmlinux was not stripped. If the
    kernel has been stripped, then we can use the System.map instead:
    % pfmon -k --sysmap-file=/boot/efi/System.map --checkpoint-func=sys_getpid -e ia64_inst_retired noploop 1000

6/ Basic sampling with pfmon

   Pfmon has support for sampling on any event or combination of events. Samples are collected
   into a buffer which can then be written to a file or simply printed on the screen.
   
   a/ principles

   Each sample is composed of two parts: a fixed size header which contains information about
   the sample, and a variable size body which consists of a set of 64-bit values, each one holding
   the value of a PMD register for one of the other events being monitored. All samples record the
   same set of PMDs; this set is determined by pfmon based on what is being measured.
   
   The sampling buffer is controlled by the kernel but its size is configurable. By default
   pfmon uses a buffer with 2048 entries. This can be changed using the --smpl-entries option.

   The sampling works as follows:
	1- the user specifies what needs to be recorded.
	2- the user specifies the sampling period and optional randomization parameters.
	3- at the end of a period, a sample is recorded into the buffer by the kernel.
	4- if the buffer is not full, a new sampling period is reloaded and execution/monitoring resumes; go back to step 3.
	5- if the sampling buffer becomes full, pfmon is notified.
	6- pfmon processes the buffer, i.e., prints and/or saves the buffer.
	7- pfmon then notifies the kernel that it is done.
	8- the kernel reloads a new sampling period and execution/monitoring resumes; go back to step 3.


   Pfmon (and the kernel) use two sampling periods instead of just one. The first one is called
   short-smpl-period and the second is called long-smpl-period. The short-smpl-period is used 
   in step 4, i.e., when the sampling buffer is not full after the sample is written. The 
   long-smpl-period is used in step 8, when the reload occurs after the buffer became full. 
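
   The reload rule of steps 4 and 8 can be sketched in a few lines of Python. This is only a
   toy model, not pfmon or kernel code: the function name and its parameters are made up for
   illustration, and it only shows which period is reloaded after each recorded sample.

```python
def reload_after_each_sample(n_samples, buffer_entries, short_period, long_period):
    """Return the period reloaded after each of n_samples recorded samples."""
    reloads = []
    used = 0  # entries currently held in the (simulated) sampling buffer
    for _ in range(n_samples):
        used += 1  # step 3: the sample is written into the buffer
        if used == buffer_entries:
            # steps 5-8: buffer full, pfmon drains it, the long period is reloaded
            used = 0
            reloads.append(long_period)
        else:
            # step 4: buffer not full, the short period is reloaded
            reloads.append(short_period)
    return reloads


if __name__ == "__main__":
    # A tiny 4-entry buffer makes the pattern easy to see.
    print(reload_after_each_sample(8, 4, 50000, 200000))
```

   With a 4-entry buffer, every fourth reload uses the long period and all the others use the
   short one.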

   But why do we need 2 periods?

   As you might imagine, there is some overhead in recording a sample. This overhead is
   increased even more when pfmon needs to get involved to drain the buffer. This operation
   can take some time and will inevitably introduce some noise in the measurements in the form
   of TLB and/or cache pollution. To try and hide this noise, it is sometimes beneficial to
   adjust the sampling period, i.e., make it larger to ensure that the next sample will not
   record an event that is the consequence of the overhead generated by the monitoring but rather
   a normal event occurring in the program/system being monitored. So it is expected that the 
   long-smpl-period >= short-smpl-period. Of course if the two are equal, this is equivalent to 
   having only one sampling period. Note that the long-smpl-period is only used to set the 
   distance to the first sample recorded after the buffer is marked as empty again (step 7).


   b/ sampling output formats

   There are many ways in which the samples can be saved or printed on the
   screen. Pfmon has support for custom formats. Note that at this point, the
   kernel sampling buffer format is fixed. Here the customization happens in
   the tool. Pfmon comes with a set of output formats. Some of them can be
   used with any PMU model, others are specific to the Itanium or Itanium 2 
   PMUs. While all PMDs on all PMUs are 64 bits wide, what they contain can vary
   from one PMU to the other. 

   You can figure out which formats are available for the host PMU by typing:
   % pfmon -I
   supported PMU models: [itanium2] [itanium] [generic] 
   detected host PMU: itanium
   supported sampling outputs: [detailed-itanium] [raw] [compact] [btb] [example] 

   You can get a short description of what each format does by using the -S
   option: 
   % pfmon -S detailed-itanium
   Name        : detailed-itanium
   Description : Details each event in clear text
   PMU models  : [itanium] 

   Some formats are supported on all PMU models, in which case they are listed
   as generic:
   % pfmon -S compact
   Name        : compact
   Description : Column-style raw values
   PMU models  : [generic]

   Pfmon does not have a format by default, therefore the user MUST provide a
   format when starting a sampling session.

   % pfmon --smpl-output-format=compact --long-smpl-periods=100000 ls
   0        14130    0  0x2000000000015771 0x0000582a9cf18e79 0x0010 100000 
   1        14130    0  0x2000000000015851 0x0000582a9cf34a40 0x0010 100000 
   2        14130    0  0x2000000000015941 0x0000582a9cf4e5e8 0x0010 100000 
   3        14130    0  0x2000000000023da0 0x0000582a9cf69db7 0x0010 100000 
   ....

   For more information about the various formats please refer to the source
   code :-<
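
   Because the compact format is column oriented, it is easy to post-process. The following
   Python sketch splits one line of the output shown above into named fields. It is not part
   of pfmon: the function name is hypothetical, and it assumes the column order given by the
   compact header (entry, pid, cpu, instruction pointer, timestamp, overflow bitmask, initial
   value, then the recorded PMDs), that bit N of the bitmask stands for PMDN, and that the
   number bases are those seen in the sample output.

```python
def parse_compact_line(line):
    """Split one line of compact output into named fields."""
    fields = line.split()
    if len(fields) < 7:
        raise ValueError("expected at least 7 columns: %r" % line)
    ovfl_mask = int(fields[5], 16)
    return {
        "entry": int(fields[0]),
        "pid": int(fields[1]),
        "cpu": int(fields[2]),
        "ip": int(fields[3], 16),
        "stamp": int(fields[4], 16),
        # assumption: bit N set in the mask means PMDN overflowed
        "ovfl_pmds": [n for n in range(64) if (ovfl_mask >> n) & 1],
        "last_val": int(fields[6]),
        # remaining columns, if any, are the recorded PMD values;
        # base 0 accepts both 0x-prefixed and plain decimal numbers
        "pmds": [int(v, 0) for v in fields[7:]],
    }


if __name__ == "__main__":
    sample = "0        14130    0  0x2000000000015771 0x0000582a9cf18e79 0x0010 100000"
    print(parse_compact_line(sample))
```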

   
   c/ some simple examples

   	Suppose you want to record how many instructions are retired every 50000 cycles, i.e.,
	you want to sample based on CPU_CYCLES and record the value of IA64_INST_RETIRED in 
	each sample. This can be done as follows:

	% pfmon --smpl-output-format=detailed-itanium \
	  --short-smpl-period=50000 --long-smpl-period=50000 -e cpu_cycles,ia64_inst_retired -- ls /dev/null

	The two periods are identical in this example because the number of instructions executed
	by the ls command is not influenced by the fact that we monitor. The syntax is such that
	the 50000 value of short-period applies to the first event specified in the event list.
	The same rule applies for long-period. 

	With pfmon it is possible to use more than one event as the 'sampling event'. You
	can also specify a sampling period for IA64_INST_RETIRED, in which case we take a sample
	whenever the first OR second period expires:


	% pfmon --smpl-output-format=detailed-itanium --short-smpl-period=50000,10000 \
	  --long-smpl-period=50000,10000 -e cpu_cycles,ia64_inst_retired ls

	Here a sample will be recorded every 50000 cpu cycles OR each time 10000 instructions have
	been retired.

   You do not necessarily need to specify both periods. If you specify only one, pfmon uses its value to
   initialize the other one, so that both periods end up with the same value.
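
   The way period lists line up with the event list can be modelled as follows. This is a
   sketch of the rules described in this section, not pfmon's actual option parsing: the
   function names are hypothetical, and the handling of lists of different lengths is an
   assumption made for the sake of the example.

```python
def parse_periods(text):
    """Turn '50000,10000' into [50000, 10000]; None stays None."""
    if text is None:
        return None
    return [int(tok) for tok in text.split(",")]


def assign_periods(events, short_periods=None, long_periods=None):
    """Map each event name to (short, long) periods, or None if it has none."""
    names = events.split(",")
    short_list = parse_periods(short_periods)
    long_list = parse_periods(long_periods)
    if short_list is None and long_list is None:
        raise ValueError("at least one period option must be given")
    # An unspecified option takes its values from the other one.
    if short_list is None:
        short_list = long_list
    if long_list is None:
        long_list = short_list
    result = {}
    for i, name in enumerate(names):
        # The i-th period applies to the i-th event of the event list.
        if i < len(short_list) and i < len(long_list):
            result[name] = (short_list[i], long_list[i])
        else:
            result[name] = None  # no period: not a sampling event
    return result


if __name__ == "__main__":
    print(assign_periods("cpu_cycles,ia64_inst_retired", long_periods="50000"))
```

   Events mapped to None are not used as sampling events.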

   Let us look at the information in the sampling buffer for the detailed-itanium format. For the first 
   example above, we get something like this printed on the screen:

   
   		/dev/null
		Entry 0 PID:1490 CPU:3 STAMP:0x39e28c5cf782 IIP:0x2000000000004c70
	        	OVFL: 4 
			PMD5  : 0x0000000000004708
		Entry 1 PID:1490 CPU:3 STAMP:0x39e28c5f8e0a IIP:0x2000000000026ee0
        		OVFL: 4  LAST_VAL: 5000
        		PMD5  : 0x0000000000007310
		Entry 2 PID:1490 CPU:3 STAMP:0x39e28c6273d2 IIP:0x2000000000025e40
        		OVFL: 4   LAST_VAL: 5000
        		PMD5  : 0x000000000000b5e6
		Entry 3 PID:1490 CPU:3 STAMP:0x39e28c63ef1b IIP:0x2000000000018490
        		OVFL: 4   LAST_VAL: 5000
        		PMD5  : 0x000000000001137f
		Entry 4 PID:1490 CPU:3 STAMP:0x39e28c64c6f5 IIP:0x2000000000024f60
        		OVFL: 4   LAST_VAL: 5000
        		PMD5  : 0x0000000000018a73
		Entry 5 PID:1490 CPU:3 STAMP:0x39e28c6596cb IIP:0x2000000000018490
        		OVFL: 4   LAST_VAL: 5000
        		PMD5  : 0x00000000000222df
		.....
   The first line is the output from the ls command. Next you see the entries extracted from the sampling buffer.
   Entry 0 is the first entry recorded in this monitoring session. The first line of each sample (entry) shows
   the fixed header. The fields are as follows:

   	- PID     : the identity of the process that generated the event
	- CPU     : the CPU number on which the event occurred
	- STAMP   : a time stamp guaranteed to be unique in time per CPU.
	- IIP     : the value of the IP when the event occurred (DANGER, see note below)
	- OVFL    : the counter that triggered the recording of the sample (more than one possible). 
	- LAST_VAL: the last value loaded into the first counter which overflowed
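
   For scripts that need these fields, the two header lines of a detailed-itanium entry can be
   picked apart as sketched below. The code is not part of pfmon and the function names are
   hypothetical. It assumes the layout seen in the sample output of this section, and that
   several overflowed counters, if present, are separated by white space on the OVFL line.

```python
import re

# Matches e.g. "Entry 0 PID:1490 CPU:3 STAMP:0x39e28c5cf782 IIP:0x2000000000004c70"
ENTRY_RE = re.compile(
    r"entry\s+(?P<entry>\d+)\s+PID:(?P<pid>\d+)\s+CPU:(?P<cpu>\d+)\s+"
    r"STAMP:(?P<stamp>0x[0-9a-f]+)\s+IIP:(?P<iip>0x[0-9a-f]+)",
    re.IGNORECASE,
)


def parse_entry_header(line):
    """Parse the first line of an entry into a dict of integers."""
    m = ENTRY_RE.search(line)
    if m is None:
        raise ValueError("not an entry header: %r" % line)
    return {
        "entry": int(m.group("entry")),
        "pid": int(m.group("pid")),
        "cpu": int(m.group("cpu")),
        "stamp": int(m.group("stamp"), 16),
        "iip": int(m.group("iip"), 16),
    }


def parse_ovfl_line(line):
    """Parse 'OVFL: 4  LAST_VAL: 5000' into ([4], 5000); LAST_VAL may be absent."""
    text = line.split("OVFL:", 1)[1]
    ovfl_text, _, last_val_text = text.partition("LAST_VAL:")
    ovfl = [int(tok) for tok in ovfl_text.split()]
    last_val = int(last_val_text) if last_val_text.strip() else None
    return ovfl, last_val


if __name__ == "__main__":
    print(parse_entry_header("Entry 1 PID:1490 CPU:3 STAMP:0x39e28c5f8e0a IIP:0x2000000000026ee0"))
    print(parse_ovfl_line("OVFL: 4  LAST_VAL: 5000"))
```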

   VERY IMPORTANT NOTE:
   Users are advised NOT TO TRUST the value reported in IIP. Samples get recorded by forcing a counter overflow,
   which then triggers an interrupt which causes the kernel to record the information. Because of the
   parallel nature of the architecture and its implementations, it is very likely that by the time the PMU realizes
   that there was a counter overflow and generates the interrupt, the program execution has progressed way beyond
   the instruction that caused the event, leading to a skewed IIP. At best, IIP points to the next bundle given 
   that interrupts can only be delivered at bundle boundaries.

   After the header, you get the value of PMD5. This register contains the number of instructions retired for our
   example. The second event specified by the user DOES NOT necessarily end up in PMD5. To figure out how the
   events were dispatched among the various PMDs, you can use the --with-header option (described earlier). 
   The header contains a detailed machine and session description. In our case it would look as follows:

   #
   # date: Wed Nov 20 17:00:43 2002
   #
   # hostname: hpljumbo.hpl.hp.com
   #
   # kernel version: Linux 2.4.18 #2 SMP Tue Aug 6 11:54:56 PDT 2002
   #
   # pfmon version: 2.0
   # kernel perfmon version: 1.0
   #
   #
   #
   # page size: 16384 bytes
   # CLK_TCK: 1024 ticks/second
   # CPU configured: 4
   # CPU online: 4
   # physical memory: 6827933696
   # physical memory available: 5598134272
   #
   # host CPUs:  4-way 800MHz Itanium (Merced, C0)
   #       PAL_A: 6.6.23
   #       PAL_B: 7.7.28
   #       Cache levels: 3 Unique caches: 4
   #       L1D:    16384 bytes, line  32 bytes, load_lat   2, store_lat   0
   #       L1I:    16384 bytes, line  32 bytes, load_lat   2, store_lat   0
   #       L2 :    98304 bytes, line  64 bytes, load_lat   6, store_lat   6
   #       L3 :  4194304 bytes, line  64 bytes, load_lat  21, store_lat  21
   #
   #
   # captured events:
   #       PMD4: CPU_CYCLES, user level(s)
   #       PMD5: IA64_INST_RETIRED, user level(s)
   #
   # monitoring mode: per-process
   #
   #
   # instruction sets:
   #       PMD4: CPU_CYCLES, ia32/ia64
   #       PMD5: IA64_INST_RETIRED, ia32/ia64
   #
   #
   # command: ./pfmon --with-header --smpl-output-format=detailed-itanium --short-smpl-period=50000 --long-smpl-period=50000 -e cpu_cycles,ia64_inst_retired -- ls /dev/null
   #
   #
   #
   #
   # kernel sampling format: 1.0
   # sampling entry size: 56
   #
   # recorded PMDs: PMD5 
   # sampling buffer entries: 2048
   #
   # short sampling rates (base/mask/seed):
   #       CPU_CYCLES 50000
   #       IA64_INST_RETIRED none
   #
   # long sampling rates (base/mask/seed):
   #       CPU_CYCLES 50000
   #       IA64_INST_RETIRED none
   #
   #
   
   In the "captured events" section of the header, you see PMD5: IA64_INST_RETIRED. Near the end of the
   header, the "recorded PMDs" line shows that PMD5 is the register saved in each sample.
    
    Pfmon will record the value of every PMD whose event has no sampling period defined. For our
    first example, it means that it will record the value of the PMD counting the number of instructions 
    retired. Let us look at a more complicated example using some of the Itanium specific events:

    % pfmon --with-header --short-smpl-periods=50000 --long-smpl-periods=50000 \
      -e cpu_cycles,ia64_inst_retired,l2_misses,cpu_cpl_changes -- ls /dev/null

    Here cpu_cycles is controlling the sampling period and each sample will include the values of the PMDs counting
    the number of instructions retired (IA64_INST_RETIRED), the number of L2 misses (L2_MISSES) and the number of
    CPU privilege level changes (CPU_CPL_CHANGES):


    entry 0 PID:18723 CPU:3 STAMP:0x23b06dc011261 IIP:0x2000000000024d40
        PMD OVFL: 4 
        PMD5 : 0x00000000000017d7
        PMD6 : 0x00000000000001de
        PMD7 : 0x0000000000000008

    Where the assignments were: 

    # captured events:
    #       PMD4: CPU_CYCLES, user level(s)
    #       PMD5: IA64_INST_RETIRED, user level(s)
    #       PMD6: L2_MISSES, user level(s)
    #       PMD7: CPU_CPL_CHANGES, user level(s)


    Using the compact format instead of the detailed one, you get results that are formatted such that they can be 
    easily parsed by other tools. The header contains the description of every
    column:

    # column  1: entry number
    # column  2: process id
    # column  3: cpu number
    # column  4: instruction pointer
    # column  5: unique timestamp
    # column  6: bitmask of PMDs which overflowed
    # column  7: initial value of PMD which overflowed
    # column  8: PMD5
    # column  9: PMD6
    # column 10: PMD7

    and the data is formatted as follows:



   When sampling, the counts printed at the end of the session are not very useful, especially for
   the counters used as sampling periods. Those should be discarded and they are NOT saved in the 
   sampling result file.


   d/ sampling in system wide mode

   Sampling is possible in the same manner for system wide sessions. By default, the buffer is printed on the
   controlling tty. When sampling on more than one CPU at a time, samples for each CPU will be printed. When
   sampling results are redirected into a file, then you get one file per CPU. If the file is called
   'myresults', then 'myresults.cpu0' contains the samples captured on CPU0, 'myresults.cpu1' the ones from CPU1,
   and so on.
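
   A script that post-processes such results has to rebuild the per-CPU file names. The
   following Python sketch does this for the naming scheme described above. The helper names
   are hypothetical, and nothing is assumed about the content of the files beyond it being text.

```python
import os


def per_cpu_path(basename, cpu):
    """Name of the result file for one CPU, e.g. 'myresults.cpu0'."""
    return "%s.cpu%d" % (basename, cpu)


def read_per_cpu_results(basename, cpus):
    """Return {cpu: [lines]} for every per-CPU result file that exists."""
    results = {}
    for cpu in cpus:
        path = per_cpu_path(basename, cpu)
        if os.path.exists(path):
            with open(path) as f:
                results[cpu] = f.read().splitlines()
    return results


if __name__ == "__main__":
    # Look for result files of CPU0 to CPU3 in the current directory.
    for cpu, lines in sorted(read_per_cpu_results("myresults", range(4)).items()):
        print("CPU%d: %d lines" % (cpu, len(lines)))
```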

   The --aggregate-results option also influences the way samples are saved to files. When this option is used,
   samples are merged into a single file. In our example, they would go into 'myresults'. Unless you use
   the --smpl-no-entry-header option, every sample will have the CPU information. 
   
   e/ randomization of sampling periods

   Pfmon supports randomization of both sampling periods. The user must supply a bitmask and a seed value
   using the --smpl-periods-random option. The same mask and seed apply to both the long and short periods
   for each event. Each event can have a different mask and seed. Two separate invocations of pfmon using
   the same seed and mask arguments are guaranteed to generate the same "pseudo-random" series of numbers,
   allowing reproducibility.

   The sampling buffer reports the randomized value of the sampling period used to generate each sample
   in the LAST_VAL field in the detailed output format; in the compact format it appears in one of the columns.

   In the following command, the long (and short) sampling periods are initially set to 100000 and 
   we activate randomization using a seed of 5. The mask indicates that we allow the value to vary 
   between 100000 and 100255 (inclusive):

   % pfmon --smpl-periods-random=0xff:5 --long-smpl-period=100000 -e cpu_cycles -- noploop 1000000000

   entry 0 PID:509 CPU:0 STAMP:0xa9b83faf28 IIP:0x4000000000000400
        OVFL: 4  LAST_VAL: 100000
   entry 1 PID:509 CPU:0 STAMP:0xa9b8413a4d IIP:0x4000000000000400
        OVFL: 4  LAST_VAL: 100005
   entry 2 PID:509 CPU:0 STAMP:0xa9b842c532 IIP:0x4000000000000400
        OVFL: 4  LAST_VAL: 100067
   entry 3 PID:509 CPU:0 STAMP:0xa9b8445077 IIP:0x4000000000000400
        OVFL: 4  LAST_VAL: 100181
   entry 4 PID:509 CPU:0 STAMP:0xa9b845db4e IIP:0x4000000000000400
        OVFL: 4  LAST_VAL: 100064
   entry 5 PID:509 CPU:0 STAMP:0xa9b84766b5 IIP:0x4000000000000400
        OVFL: 4  LAST_VAL: 100212
   entry 6 PID:509 CPU:0 STAMP:0xa9b848f1d5 IIP:0x4000000000000400
        OVFL: 4  LAST_VAL: 100140


   The randomization is shown in the LAST_VAL field which shows the value loaded into PMD4
   (the PMD which overflowed) for each sample. Hence, 100181 is the number of cycles elapsed
   between entry 2 and entry 3.

   Randomization is important when sampling to avoid getting in lockstep with the execution
   and thereby collecting biased results.
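
   The effect of the mask and the seed can be illustrated with the Python sketch below. It is
   only a model: it assumes that the period is the base value plus a pseudo-random number
   ANDed with the mask, which matches the 100000 to 100255 range quoted above, and it uses
   Python's own generator as a stand-in. The function name is hypothetical and the series it
   produces will NOT match the values generated by pfmon or the kernel.

```python
import random


def randomized_periods(base, mask, seed, count):
    """Return count periods in the range [base, base + mask]."""
    rng = random.Random(seed)  # stand-in PRNG, not the one used by pfmon
    return [base + (rng.getrandbits(32) & mask) for _ in range(count)]


if __name__ == "__main__":
    first = randomized_periods(100000, 0xff, 5, 7)
    second = randomized_periods(100000, 0xff, 5, 7)
    print(first)
    # Same base, mask and seed: the series is reproducible.
    print(first == second)
    # The LAST_VAL values from the output above all fit the 0xff mask.
    observed = [100000, 100005, 100067, 100181, 100064, 100212, 100140]
    print(all(100000 <= v <= 100000 + 0xff for v in observed))
```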

6/ Blocking on overflow notifications

   Whenever the sampling buffer becomes full and pfmon is notified, you have
   the option of either letting the monitored program continue or blocking it. In both cases, monitoring
   is off during the processing of the sampling buffer. By default, pfmon lets the program continue
   its execution. It is possible to block the program using the --overflow-block option. Blocking
   the program ensures pfmon sees the entire execution. Keeping the program running ensures that 
   the caches and TLB are kept somewhat warm, i.e., with some state belonging to the running process, 
   especially on SMP systems.

7/ Excluding idle tasks in system wide sessions

   Pfmon now allows the user to exclude the idle tasks from system wide monitoring
   sessions. This only works with a kernel that has perfmon 1.3 or higher. Pfmon
   checks the kernel version and may abort in case the wrong version is detected.

   Linux has one idle task per cpu. This task is run when nothing else can.
   The idle task is a kernel only task with a pid of 0. Pid 0 is used for
   ALL idle tasks. They do not show up in ps or top.

   When running a system wide session, it may be useful to stop monitoring
   when the idle task is running, this way we monitor only the USEFUL execution.
   Of course, whether or not the idle task is monitored only matters when monitoring is active
   at the kernel privilege level, i.e., when using the -k or -0 option of pfmon.
   When monitoring only at the user level, excluding the idle task has no effect.
   Similarly, excluding the idle task for a per-process session has no effect.

   For instance, here is what we get without exclusion:

   % pfmon -k --session-timeout=10 --system-wide
                 8003084826 CPU_CYCLES

   This is run on an 800MHz Itanium CPU, so 10s is 8 billion cycles.
   But if we run with exclusion:

   % pfmon --exclude-idle -k --session-timeout=10 --system-wide
                     259663 CPU_CYCLES

   These are the useful cycles for the 10s period.
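
   The arithmetic behind these two measurements can be checked with a few lines of Python.
   The variable names are made up for this example; the numbers are the ones quoted above.

```python
CLOCK_HZ = 800 * 10**6   # 800MHz Itanium
TIMEOUT_S = 10           # --session-timeout=10

expected_cycles = CLOCK_HZ * TIMEOUT_S   # cycles elapsed in 10s
cycles_with_idle = 8003084826            # measured without --exclude-idle
cycles_without_idle = 259663             # measured with --exclude-idle

# Fraction of the measured cycles spent outside the idle task.
busy_fraction = cycles_without_idle / cycles_with_idle

if __name__ == "__main__":
    print("expected cycles: %d" % expected_cycles)
    print("busy fraction  : %.6f" % busy_fraction)
```

   In other words, the CPU spent nearly all of the 10s session in the idle task.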

8/ Further documentation

   You can find a lot of information about the Linux/ia64 kernel in the book:

   	'IA-64 Linux Kernel: Design and Implementation' 
   	David Mosberger and Stephane Eranian
	Prentice Hall
	ISBN: 0130610143
	Also see http://www.lia64.org for the book's web site.

   This book contains a chapter about the IA-64 PMU, the design of the kernel perfmon subsystem
   and also a small description of pfmon.


   More detailed information about the IA-64 architecture, including the PMU can be found
   on the Intel developers' web site at:

	http://developer.intel.com/design/itanium/family/

9/ Support

   You can subscribe to the official Linux/ia64 mailing list at www.linuxia64.org.

   Alternatively, you can send me an e-mail at eranian@hpl.hp.com.

10/ Bug reports

   You can send bug reports to me at eranian@hpl.hp.com.
   Patches are also welcome.

12/20/2002
S.Eranian <eranian@hpl.hp.com>

