
            ------------------------------------------------------
                                    pfmon-3.0
	    A tool to collect monitoring information for Linux/ia64
            ------------------------------------------------------
		   Copyright (c) 2001-2003 Hewlett-Packard Company
		                 Stephane Eranian <eranian@hpl.hp.com>
	

Pfmon is a performance monitoring tool uniquely designed for Linux/ia64.
It is meant as a sample tool to demonstrate how to use the perfmon subsystem provided by 
the Linux/ia64 as of version 2.4.0. This tool uses the powerful IA-64 Performance Monitoring 
Unit (PMU) to do counting and sampling on unmodified binaries or for the entire system.

This document is an attempt at providing some documentation on how to use pfmon. 
The content covers pfmon-3.0 uniquely.

                -----> YOU MUST at least HAVE kernel v2.6.0-test5 <-----

1/ Introduction

   Pfmon can be used to monitor unmodified binaries in its per-process mode and it can 
   also be used to run system wide monitoring sessions. Such sessions are active across 
   all processes executing on a given CPU. Pfmon can launch a system wide session 
   on a dedicated CPU or set of CPUs in parallel. 

   Pfmon can monitor activities happening at the user and/or kernel level for both
   type of sessions. 

   Pfmon can be used to collect basic event counts. It can also be used to sample 
   program or system execution.

   In per-process mode, pfmon can monitor the first (master) process but it can also follow
   fork, vfork and even pthread-create by dynamically creating additional monitoring sessions
   on the fly.

   Pfmon can run on any IA-64 CPU model and provides the minimal features mandated
   by the IA-64 architecture but it also provides model specific extensions. For instance,
   on Itanium and Itanium2 pfmon provides support for the EAR and BTB features.

   Pfmon is based on a generic helper library called libpfm which is not included in this
   package but can be downloaded and installed separately. This version of pfmon requires
   libpfm-3.0 or higher. 

   In the remainder of this document, we describe the key options and features of pfmon
   which are available on all CPU (PMU) models. Please refer to the model specific 
   documentation for advanced features.

2/ pfmon options

   The set of command line options provided by pfmon depends on the host PMU. It is possible
   to compile pfmon for more than one PMU model and then it will auto-detect
   the host PMU and provide the corresponding set of options.

   The options common to all PMU models are as follows:

   -h, --help                           display the list of supported options

   -V, --version                        display pfmon version information

   -l[regex], --show-event-list[=regex] show list of supported events by host PMU

   -i <event>, --event-info=event       get information about a particular event.

   -u, -3, --user-level                 monitor at the user level for all events. 
   					Per-event setting is possible with --priv-levels

   -k, -0, --kernel-level               monitor at the kernel level for all events. 
   					Per-event setting is possible with --priv-levels

   -2					monitor at privilege level 2 

   -1					monitor at privilege level 1 

   -e, --events=ev1,ev2,...             select events to monitor. There should be no space
					between the events.The number of events that you can specify 
					is dependent on the underlying PMU model. Four events is typical.

   -I,--info                            show pfmon information and detect host CPU.

   --verbose                            print more information during execution.

   --debug                              enable debug prints.

   --outfile=filename                   save counts in a file.

   --append                             when used with --outfile, will open the file in append mode.

   --overflow-block                     block the monitored program on overflow 
   					notifications (per process mode only).

   --system-wide                        create a system wide monitoring session. Default session type 
   					is per process.

   --cpu-list=x,y-z,..                  list the cpu (numbers) to include in the system-wide monitoring
   					session. You can provide specific CPU number or ranges.
   					When this option is not specified, pfmon will monitor on 
					all online CPUs.

   -S format, --smpl-module-info=format display information about a sampling module.

   -t secs, --session-timeout=secs      duration of the session in seconds. 

   --smpl-outfile=filename              save samples in a file.

   --long-smpl-periods=val1,val2,...    set sampling periods for each event after user level 
   					notification.

   --short-smpl-periods=val1,val2,...   set sampling period for each event.

   --with-header                        generate a machine description header with counts and samples.

   --aggregate-results                  aggregate counts and sampling buffer outputs when running
   					system wide monitoring on multiple CPUs.

   --priv-levels=lvl1,lvl2,...          set privilege level per event. lvl can be any combination of
   				 	u or 3 (user), k or 0 ( kernel), 1 (priv level 1), 2 (priv level2 ).
					Unspecified events will get global setting which is user only by default.

   --show-time                          show real,user, and system time for the executed command.

   --us-counter-format                  print counters using commas (1,024).

   --eu-counter-format                  print counters using points (1.024).

   --hex-counter-format                 print counters in hexadecimal (0x400).

   --smpl-module=module_name            specify the sampling module to use (default: autodetect)

   --long-show-events[=regex]		display detailed information about matching events in a single (easy grep)

   --symbol-file=filename		use the ELF archive filename to look for symbols and resolve
   					samples addresses.

   --sysmap-file=filename		use the System.map filename to look for symbols and resolve
   					samples addresses.

   --check-events-only			check that the event combination is valid and exit (no measurement)

   --smpl-periods-random=mask1:seed1,... randomize both the short and long sampling period. The mask indicate
   					the significant bits to keep in the randomly generated value. The seed is
					use to initialize the pseudo-random number generator. You can use a different
					mask and seed per event.

   --resolve-addresses			resolve addresses to symbols in sampling output.

   --smpl-print-counts			print counter results when sampling (default: off).

   --reset-non-smpl-periods		reset counters not use as sampling period for each sample.

   --saturate-smpl-buffer		collect samples only up to the limit of the buffer (default: off).

   --no-cmd-out				redirect executed command output to /dev/null.

   --attach-task=pid			attach monitoring to the process specified by pid.

   --trigger-code-start-address=addr,..	start monitoring when execution reaches addr.

   --trigger-code-stop-address=addr,...	stop monitoring when execution reaches addr.

   --trigger-data-start-address=addr,..	start monitoring when data location addr is accessed.

   --trigger-data-stop-address=addr,...	stop monitoring when data location addr is accessed.

   --trigger-code-repeat		repeat code trigger start/stop (default: off).

   --trigger-data-repeat		repeat data trigger start/stop (default: off).

   --trigger-data-ro			data triggers activated only on read only access (default: read-write).

   --trigger-data-wo			data triggers activated only on write only access (default: read-write).

   --trigger-code-follow		code trigger start/stop continues across fork,vfork,pthread_create (default: off).

   --trigger-data-follow		data trigger start/stop continues across fork,vfork,pthread_create (default: off).

   --follow-fork			monitoring continues across fork, i.e., in the new process (default: off).

   --follow-vfork			monitoring continues across vfork, i.e., in the new process (default: off).

   --follow-pthread			monitoring continues across pthread_create, i.e., in the new thread (default: off).

   --follow-all				union of --follow-fork, --follow-vfork, --follow-pthread, --follow-exec (default: off).

   --follow-exec[=regex]		monitoring will continue across exec, when regular expression is provided, will follow
   					only when the cmdline matches the regular expression (default: off).
   --follow-exec-exclude[=regex]	similar to --follow-exec, except that a match will stop the monitoring (default: off).

   --exec-split-results			differentiate monitoring sessions before and after a call to exec() (default: off).

2/ Getting event information with pfmon

   The list of events supported by pfmon depends on the host PMU. You can get the list
   of supported events using the following pfmon option:

   % pfmon -l
   CPU_CYCLES
   IA64_INST_RETIRED
   IA64_TAGGED_INST_RETIRED_PMC8
   IA64_TAGGED_INST_RETIRED_PMC9
   INST_DISPERSED
   EXPL_STOPBITS
   ALL_STOPS_DISPERSED
   IA32_INST_RETIRED
   ISA_TRANSITIONS
   NOPS_RETIRED
   ....

   If you specify an argument to the -l option (no space between l and the
   argument), it is interpreted as a regular expression and all matching events 
   will be listed:

   % pfmon -ll1d
   L1D_READ_FORCED_MISSES_RETIRED
   L1D_READ_MISSES_RETIRED
   L1D_READS_RETIRED
   PIPELINE_FLUSH_L1D_WAYMP_FLUSH

   You can get more detailed information about each event using the following 
   option (shown for Itanium2):

   % pfmon -i nops_retired
   Name   : NOPS_RETIRED
   VCode  : 0x50
   Code   : 0x50
   PMD/PMC: [ 4 5 6 7 ]
   Umask  : 0000
   EAR    : No (N/A)
   BTB    : No
   MaxIncr: 6  (Threshold [0-5])
   Qual   : [Instruction Address Range] [OpCode Match] 
   Group  : None
   Set    : None
   Desc   : Retired NOP Instructions

   Pfmon is case insensitive for event names. Here you see some details about the event.
   The first 4 lines are generic and provided on all PMU models even though the codes may
   vary:
  	- Code is the event code used by the PMU. 

	- Vcode is a libpfm internal event code which encapsulates the event code and other
	  information describing the type of the event. For simple events, the two codes are
	  usually identical.

	- PMD/PMC: list the counting monitors on which this event can be programmed. Not 
		   all events can necessarily be programmed on all available counting 
		   monitors. This constraint is taken care of by the libpfm library.

   The remaining information is specific to the Itanium 2 PMU.

   Even with the -i option, you can use a regular expression for the event:
   % pfmon -i'writes$'
   Name   : L2_DATA_REFERENCES_L2_DATA_WRITES
   VCode  : 0x200b2
   Code   : 0xb2
   PMD/PMC: [ 4 5 6 7 ]
   Umask  : 0010
   EAR    : No (N/A)
   BTB    : No
   MaxIncr: 4  (Threshold [0-3])
   Qual   : [Instruction Address Range] [OpCode Match] [Data Address Range] 
   Group  : L2 Cache
   Set    : 1
   Desc   : Data Read/Write Access to L2 -- count only data write and semaphore operations

   On some PMU models (currently Itanium2), the events information contains a
   text description of the event.

   Events can be specified using their code:
   % pfmon -i 0x45
   Name   : L2_INST_PREFETCHES
   VCode  : 0x45
   Code   : 0x45
   PMD/PMC: [ 4 5 6 7 ]
   Umask  : 0000
   EAR    : No (N/A)
   BTB    : No
   MaxIncr: 1  (Threshold 0)
   Qual   : [Instruction Address Range] 
   Group  : None
   Set    : None
   Desc   : L2 Instruction Prefetch Requests

   Information about what each event measures can be found in the relevant CPU model specific
   micro-architecture documentation.

   The architecture requires that only two events be defined by all PMUs:

   	- CPU_CYCLES        : the number of elapsed CPU cycles.
	- IA64_INST_RETIRED : the number of instructions retired. 

   Those two events are guaranteed to exist on all PMU but their codes may vary. The PMU specific
   event names may not be exactly the same, however, pfmon and especially the library it uses
   (libpfm) will always ensure that those two events can always be called by the two names listed
   above. As alluded to earlier, pfmon can support more than one PMU in a single binary. Pfmon 
   also incorporates a generic PMU model which provides only the features defined by the 
   architecture, this includes the two events. If pfmon does not have specific support for the 
   host PMU it will default to the so called 'Generic' PMU support, if compiled in. You can find 
   out what PMU support is compiled into pfmon as follows:

   % pfmon -I
   detected host CPUs:  2-way 1000MHz/3.0MB Itanium 2 (McKinley, B3)
   supported PMU models: [itanium2] [itanium] [generic IA-64] 
   detected host PMU: itanium2
   supported sampling modules: [detailed-itanium2] [raw-ia64] [compact-ia64] [btb] [example-ia64] 
   pfmlib version: 3.0
   kernel perfmon version: 2.0

   It is possible to force pfmon to operate in generic mode even though it has support for the
   host CPU using the pfmon_gen command:

   % pfmon_gen -I
   detected host CPUs:  2-way 1000MHz/3.0MB Itanium 2 (McKinley, B3)
   supported PMU models: [itanium2] [itanium] [generic IA-64] 
   detected host PMU: generic IA-64
   supported sampling modules: [raw-ia64] [compact-ia64] [example-ia64] 
   pfmlib version: 3.0
   kernel perfmon version: 2.0

   % pfmon_gen -i CPU_CYCLES
   Name   : CPU_CYCLES
   VCode  : 0x12
   Code   : 0x12
   PMD/PMC: [ 4 5 6 7 ]

   The pfmon_gen is not a separate command but just a symlink to pfmon. In fact, pfmon always
   checks the name it was invoked with. If this name is equal to 'pfmon_gen' and the generic
   support is compiled in, then pfmon will operate in generic mode. Such feature is useful when
   moving pfmon to a PMU for which neither pfmon itself nor libpfm have support yet.

3/ Basic counting with pfmon

   In generic mode, pfmon only supports the two architected events listed
   above. For comparison, the Itanium PMU supports about  230 events and the
   Itanium2 PMU about 500.

   No instrumentation of the program is required to monitor the system or a
   single process.

   a/ simple examples

   To collect counts on a specific command, you just need to launch it via pfmon, just like
   you would do with the time or strace command:

   % pfmon ls /dev/null
   /dev/null
       620458 CPU_CYCLES

   When invoked with no particular event, pfmon default to CPU_CYCLES. To monitor specific events,
   you can type:

   % pfmon -e cpu_cycles,IA64_inst_Retired ls /dev/null
   /dev/null
                    436368 CPU_CYCLES       
                    513437 IA64_INST_RETIRED

   As you can see, pfmon is not case sensitive with regards to event names. More than one event
   can be measured at a time using a comma separated list of events. You MUST not have space
   after the comma. 

   If the command you want to run takes options, you can clearly distinguish the options of 
   pfmon from the options of your command using the '--' separator:

   % pfmon -e ia64_inst_retired -- ls -ial /dev/null
   3679352 crw-rw-rw-    1 root     root       1,   3 Dec  6  2002 /dev/null
                    557392 IA64_INST_RETIRED

   Otherwise, pfmon will interpret your command's options as pfmon options.

   b/ privilege levels

   By default, pfmon monitors only what is going at the user level
   (application level). This is true for both per-process and system wide
   mode.

   It is possible to monitor at any of the 4 privilege levels provided by IA-64.
   It is also possible to monitor at several levels at the same time by specifying
   more than one level. The levels can be specified for all events or on a per-event
   basis.

   To affect all events, you can use any combinations of -k (-0), -1, -2, -u (or -3).
   To set the level for each event, the --priv-levels option must be used.
 
   By default, pfmon only measures at the user level:

   % pfmon -e nops_retired ls /dev/null

   counts the number of NOPS_RETIRED when ls is running at the user level only
   (equivalent to specifying -u or -3).

   % pfmon -k -e nops_retired ls /dev/null

   counts the number of NOPS_RETIRED when ls is running at the kernel level only.

   % pfmon -k -u -e nops_retired ls /dev/null

   counts the number of NOPS_RETIRED when ls is running at the kernel level
   or user level, i.e. all the time.

   It is possible to refine the settings on a per event basis using the
   --priv-levels option.

   % pfmon -e loads_retired,nops_retired ls /dev/null

   Both events are measured at the user level only.

   % pfmon --priv-level=u,k -e loads_retired,nops_retired ls /dev/null

   LOADS_RETIRED is measured at the user level only, NOPS_RETIRED at the
   kernel level only.

   % pfmon --priv-level=,uk -e loads_retired,nops_retired ls /dev/null

   LOADS_RETIRED is measured at the user level only, NOPS_RETIRED at the
   user and kernel levels.

   % pfmon -k --priv-level=uk -e loads_retired,nops_retired ls /dev/null

   LOADS_RETIRED is measured at the user and kernel levels, NOPS_RETIRED at the
   kernel level only.


   c/ counter formats

   Pfmon can display the final counts in various formats. There are 4 formats
   defined. The default one is shown in the example above. To make is easier
   to read large numbers or to feed the number to other programs.

   By default, you the raw number:

   % pfmon ls -l /dev/null
   crw-rw-rw-    1 root     root       1,   3 Dec  6  2002 /dev/null
                   724882 CPU_CYCLES

   --us-counter-format where the thousands, millions, billions are separated 
   with commas (US and UK style):

   % pfmon --us-counter-format ls -l /dev/null
   crw-rw-rw-    1 root     root       1,   3 Dec  6  2002 /dev/null
                   724,882 CPU_CYCLES

   --eu-counter-format  where the thousands, millions, billions are separated 
   with points (European style):

   % pfmon --eu-counter-format ls -l /dev/null
   crw-rw-rw-    1 root     root       1,   3 Dec  6  2002 /dev/null
                   724.822 CPU_CYCLES

   --hex-counter-format where the counts are shown in hexadecimal format:

   % pfmon --hex-counter-format ls -l /dev/null
   crw-rw-rw-    1 root     root       1,   3 Dec  6  2002 /dev/null
                    0x00000000000b0f56 CPU_CYCLES

   d/ saving counts

   By default, the counts are printed on the controlling tty. However it is
   possible to save them in a file using the --outfile option:

   % pfmon --outfile=b --hex-counter-format ls -l /dev/null
   crw-rw-rw-    1 root     root       1,   3 Dec  6  2002 /dev/null
   % cat b
        0x00000000000b7cad CPU_CYCLES
 
   It is possible to include a header with the results using the --with-header
   option. It will be printed on the controlling tty or saved in the output
   file. The header contains detailed information about the configuration of
   the host machine and on the monitoring session:

   % pfmon --with-header --outfile=b --hex-counter-format ls -l /dev/null
   crw-rw-rw-    1 root     root       1,   3 Dec  6  2002 /dev/null
   % cat b
   #
   # date: Thu Oct 30 14:45:16 2003
   #
   # hostname: tuff.hpl.hp.com
   #
   # kernel version: Linux 2.6.0-test7 #200 SMP Fri Oct 24 14:11:06 PDT 2003
   #
   # pfmon version: 3.0
   # kernel perfmon version: 2.0
   #
   #
   # page size: 16384 bytes
   #
   # CLK_TCK: 1024 ticks/second
   #
   # CPU configured: 0
   # CPU online    : 2
   #
   # physical memory          : 4232495104 bytes (4036.4 MB)
   # physical memory available: 3303964672 bytes (3150.9 MB)
   #
   # host CPUs:  2-way 1000MHz/3.0MB Itanium 2 (McKinley, B3)
   #	PAL_A: 0.7.31
   #	PAL_B: 0.7.40
   #	Cache levels: 3 Unique caches: 4
   #	L1D:    16384 bytes, line  64 bytes, load_lat   1, store_lat   3
   #	L1I:    16384 bytes, line  64 bytes, load_lat   1, store_lat   0
   #	L2 :   262144 bytes, line 128 bytes, load_lat   5, store_lat   7
   #	L3 :  3145728 bytes, line 128 bytes, load_lat  12, store_lat   7
   #
   #
   # captured events:
   #	PMD4: CPU_CYCLES
   #
   # privilege levels:
   #	PMD4: CPU_CYCLES = user
   #
   # monitoring mode: per-process
   #
   # instruction sets:
   #	PMD4: CPU_CYCLES = ia32/ia64
   #
   #
   # command: pfmon --with-header --outfile=b --hex-counter-format ls -l /dev/null
   #
   # command line     : ls -l /dev/null 
   #
   # process id       : 10672
   # thread  id       : 10672
   # parent process id: -1
   # parent thread  id: -1
   # exec count       : 1
   #
   #
   #
        0x00000000000b9176 CPU_CYCLES

   e/ hiding program output

   It is possible to redirect the output of the programs you execute under pfmon to /dev/null.
   This is desireable if you are interested in keeping a clean output and are not using an
   output file. By default, programs share the same terminal output as pfmon. You can use the
   --no-cmd-output option to eliminate the output.

   Normal output, we see the result of the "ls" command:

   % pfmon --us-counter-format ls -l /dev/null
   crw-rw-rw-    1 root     root       1,   3 Dec  6  2002 /dev/null
                   724,882 CPU_CYCLES

   Using the option, you get:

   % pfmon --no-cmd-output --us-counter-format ls -l /dev/null
                   724,882 CPU_CYCLES
   
   f/ getting timing information

   It is possible to get a time breakdown of the execution of the monitored command for
   both per-process and system-wide mode using the --show-time option. The output is similar
   to the time(1) command. For instance:

   % pfmon --show-time -e nops_retired ls /dev/null
   /dev/null
                    145158 NOPS_RETIRED
   real 0h00m00.001s user 0h00m00.000s sys 0h00m00.000s

   g/ Testing event combinations

   Sometimes it is handy to check if some events can be measured simultaneously without actually
   starting the monitoring session. The --check-events-only option of pfmon allows this mode of
   operation. It checks that the combination is valid and then exits. If the combination is
   invalid, it prints out the reason and pfmon exits with a return code of 1, otherwise  a value
   of 0 is returned. On Itanium2, for instance, you can try:

   % pfmon --check-events-only -e loads_retired,stores_retired
   event LOADS_RETIRED and STORES_RETIRED cannot be measured at the same time
   % echo $?
   1

   Note that in this mode, you do not need to specify a command to execute.

4/ Monitoring running tasks

   This version of pfmon can "attach" to an already running process, monitor for a while
   and eventually "detach" from the process. This is very useful when monitoring system
   daemons or workload that are very long to start. The process to attach to is specified
   using the --attach-task option with a process id as reported by the ps command. You must
   have permission to access the process for this option to work. It is not possible to attach
   to kernel threads, such as kswapd, migration, khubd, kjournald even if you are root. 

   Let us look at some examples. Suppose we have this long running program called noploop and
   we want to monitor it once it is started:

   % noploop 1000000000000000 &
   % ps ax | fgrep noploop
   22731 pts/7    R      1:33 nueh/noploop 1000000000000000
   % pfmon --attach-task=22731
   session interrupted by user
              120106020603 CPU_CYCLES


   This pfmon command attaches to process 22731 and collects CPU_CYCLES in user level until the process
   terminates. It is possible to interrupt if you want to stop before the process terminates by 
   simply  pressing CTRL-C (control C) on your keyboard. This prints the counts so far and detaches
   from the program which will continue to run.

   It is possible to use a session timeout to delimit the duration of such measurement using the
   --session-timeout option:

   % pfmon --session-timeout=5 --attach-task=22731
   session timed out
                4988897963 CPU_CYCLES

   It is not possible to use the --session-timeout option with any of the -follow-* option is used.
   But the --attach-task option can certainely be combined with any of the --follow-* options making
   it very easy to monitor daemons that fork off dedicated processes such as xinetd.

5/ System wide sessions

   When the --system-wide option is used, pfmon operates in system wide mode. This means that
   it does not monitor a specific program anymore but instead all the processes that execute
   on a specific set of CPUs. In this mode, you do no need to specify a command. You do NOT 
   need to be root to create a system wide session.

   A system wide session cannot co-exist with any per-process sessions. But a system wide session
   can run concurrently with other system wide sessions as long as they do not monitor the same
   set of CPUs. 

   a/ selecting CPUs to monitor

   This version of pfmon supports up to 1024 processors. By default, pfmon monitors on ALL online
   CPUs. It is possible to restrict the set of CPUS using the --cpu-list option. 

   If the system has 2 available CPUS:

   % pfmon --system-wide -u -e cpu_cycles,ia64_inst_retired
   <Press ENTER to stop session>
   CPU0                       55797 CPU_CYCLES       
   CPU0                       61656 IA64_INST_RETIRED
   CPU1                       10577 CPU_CYCLES       
   CPU1                       10776 IA64_INST_RETIRED

   
   A system wide session can monitor at any privilege level (kernel, user, or both).

   If you want to restrict to a specific CPU, you can use the --cpu-list command:

   % pfmon --system-wide --cpu-list=1 -u -e cpu_cycles,ia64_inst_retired
   <Press ENTER to stop session>
   CPU1                        1869 CPU_CYCLES       
   CPU1                         512 IA64_INST_RETIRED

   The CPU are numbered starting at 0. You can specify multiple CPUs using a comma
   separated list withous spaces:

   % pfmon --system-wide --cpu-list=0,1 -u -e cpu_cycles,ia64_inst_retired
   <Press ENTER to stop session>
   CPU0                       48569 CPU_CYCLES       
   CPU0                       61619 IA64_INST_RETIRED
   CPU1                       10444 CPU_CYCLES       
   CPU1                       10776 IA64_INST_RETIRED

   It is also possible to specify ranges of CPUs:

   % pfmon --system-wide --cpu-list=0-1 -u -e cpu_cycles,ia64_inst_retired
   <Press ENTER to stop session>
   CPU0                       48569 CPU_CYCLES       
   CPU0                       61619 IA64_INST_RETIRED
   CPU1                       10444 CPU_CYCLES       
   CPU1                       10776 IA64_INST_RETIRED

   Ranges and individual specification can be combined in a single comma-separated list:

   % pfmon --system-wide --cpu-list=0,2-3 -u -e cpu_cycles,ia64_inst_retired

   This above example monitors on CPU0, CPU2, and CPU3.

   b/ delimiting a system wide session

   There are three ways to delimit a system wide session. By default, the
   session will terminate when the user press the <Enter> key.  It is also
   possible to use a timeout expressed in seconds. Finally, the session can 
   also be delimited by the execution of a command. It will start when the
   command starts and stops when it terminates. Here are some examples:


   Monitor cpu_cycles and instruction retired on the first two CPUs at both 
   user and kernel levels and wait for a keypress to stop:

   % pfmon --cpu-list=0-1 --system-wide -u -k -e cpu_cycles,ia64_inst_retired
   <Press ENTER to stop session>
   CPU0                   325728352 CPU_CYCLES       
   CPU0                   853076156 IA64_INST_RETIRED
   CPU1                   325869036 CPU_CYCLES       
   CPU1                   853618095 IA64_INST_RETIRED


   Monitor cpu_cycles and instruction retired on the first two CPUs at both 
   user and kernel levels for 10 seconds, use US counter format:

   % pfmon --us-c --session-timeout=10 --cpu-list=0-1 --system-wide -u -k -e cpu_cycles,ia64_inst_retired
   <session to end in 10 seconds>
   CPU0              10,000,683,919 CPU_CYCLES       
   CPU0              26,189,007,084 IA64_INST_RETIRED
   CPU1              10,000,799,060 CPU_CYCLES       
   CPU1              26,204,014,060 IA64_INST_RETIRED

   Monitor cpu_cycles and instruction retired on the first two CPUs at the
   user level only during the execution of the ls command (here obviously run
   on CPU0):

   % pfmon --cpu-list=0-1 --system-wide -u -e cpu_cycles,ia64_inst_retired -- ls -l /dev/null
   crw-rw-rw-    1 root     root       1,   3 Dec  6  2002 /dev/null
   CPU0                      802960 CPU_CYCLES       
   CPU0                      581288 IA64_INST_RETIRED
   CPU1                       44363 CPU_CYCLES       
   CPU1                       10666 IA64_INST_RETIRED

   c/ results aggregation

   It is possible to aggregate counts when monitoring more than one CPU:

   % pfmon --aggregate-results --system-wide -k -e cpu_cycles,ia64_inst_retired
   <Press ENTER to stop session>
                4577585811 CPU_CYCLES       
               11991877430 IA64_INST_RETIRED

   In which case, the per CPU results are summed. Pfmon does not allow  different events to be
   monitored on different CPUs. For this you can run separate instances of pfmon with a different
   CPU list, using a command lines similar to:

   % pfmon --session-timeout=10 --cpu-list=0 --system-wide -k -e cpu_cycles &
   % pfmon --session-timeout=10 --cpu-list=1 --system-wide -k -e ia64_inst_retired &

   d/ delayed start

   Sometimes, it may be useful to delay the activation of monitoring until
   a certain point in the execution is reached. This is the case when the
   initialization must not be included in the counts. 
   
   In system-wide mode only, pfmon provides the --trigger-delay-start option
   which can be used to specify a delay in second before monitoring is activated:

   % pfmon --system-wide --trigger-start-delay=5 

   Wait for 5 seconds before acitvating monitoring on all CPUs. You see the
   line:
   	<press ENTER to stop session>

   when it is actually started.

   The --trigger-start-delay option is not support in per-process mode.

   e/ saving results in files

   Results can be saved into file(s) for system-wide session using the same 
   --outfile option as for per-process sessions. When results are not aggregated
   you get one file per cpu. The filename provided via the option is preprended
   to .cpuX where X is the CPU number:

   % pfmon --outfile=b --cpu-list=0-1 --system-wide -u -e cpu_cycles,ia64_inst_retired -- ls -l /dev/null
   crw-rw-rw-    1 root     root       1,   3 Dec  6  2002 /dev/null

   % cat b.cpu0
                       774233 CPU_CYCLES       
                    557904 IA64_INST_RETIRED
   % cat b.cpu1
                        54799 CPU_CYCLES       
                     32728 IA64_INST_RETIRED


   You can add a header in each file by specifying the --with-header option.

6/ Monitoring multiple processes/threads

   This feature of pfmon-3.0 is only meaningful in per-process mode.

   a/ basic monitoring of multiple processes/threads

   When monitoring complicated workloads it is useful to be able to monitor
   more than just the first process or thread which typically only do the initialization
   the bulk of the work being carried out by worker processes/threads. To illustrate the
   various options offered by pfmon we use the example of the GNU C compiler gcc invoked
   to compile a trivial "hello.c" program. Several processes are created as part of the
   compilation: 

   % gcc -v hello.c -o hello -O2 -static
   Reading specs from /usr/lib/gcc-lib/ia64-linux/2.96/specs
   gcc version 2.96 20000731 (Debian GNU/Linux IA64 experimental)
   ...
   /usr/lib/gcc-lib/ia64-linux/2.96/cpp0 -lang-c -v -D__GNUC__=2 -D__GNUC_MINOR__=96 -D__GNUC_PATCHLEVEL__=0 -D__ia64 -D__ia64__ -D__linux -D__linux__ -D_LONGLONG -Dlinux -Dunix -D__LP64__ -D__ELF__ -D__ia64 -D__ia64__ -D__linux -D__linux__ -D_LONGLONG -D__linux__ -D__unix__ -D__LP64__ -D__ELF__ -D__linux -D__unix -Asystem(linux) -Acpu(ia64) -Amachine(ia64) -D__OPTIMIZE__ -D__LONG_MAX__=9223372036854775807L hello.c /tmp/ccM0VFYv.i
   ...
   /usr/lib/gcc-lib/ia64-linux/2.96/cc1 /tmp/ccM0VFYv.i -mb-step -quiet -dumpbase hello.c -O2 -version -o /tmp/ccas6Y5Q.s
   ...
   as -x -o /tmp/ccMo7exq.o /tmp/ccas6Y5Q.s
   ...
   /usr/lib/gcc-lib/ia64-linux/2.96/collect2 -static -o hello /usr/lib/gcc-lib/ia64-linux/2.96/../../../crt1.o /usr/lib/gcc-lib/ia64-linux/2.96/../../../crti.o /usr/lib/gcc-lib/ia64-linux/2.96/crtbegin.o -L/usr/lib/gcc-lib/ia64-linux/2.96 -L/usr/lib/gcc-lib/ia64-linux/2.96/../../.. /tmp/ccMo7exq.o -lgcc -lc -lgcc /usr/lib/gcc-lib/ia64-linux/2.96/crtend.o /usr/lib/gcc-lib/ia64-linux/2.96/../../../crtn.o

   The edited output reveals that gcc uses a total of 5 processes to compile hello.c.

   Now, let's look at how pfmon can monitor all or some of the processes involved here.

   By default only the first process is monitored:

   % pfmon --us-c --no-cmd-out -eia64_inst_retired -- gcc hello.c -o hello -O2 -static
                 1,123,406 IA64_INST_RETIRED

   How do we know that only one process is monitored? You can enable the verbose mode of pfmon:

   % pfmon --verbose --us-c --no-cmd-out -eia64_inst_retired -- gcc hello.c -o hello -O2 -static
   symbol file for gcc not found
   measuring at user privilege level ONLY
   [pmc4=0x000828 thres=0 es=0x08 plm=8 umask=0x0 pm=0 ism=0x0 oi=1] IA64_INST_RETIRED
   [pmc15=0xffffffff ibrp0_pmc8=1 ibrp1_pmc9=1 ibrp2_pmc8=1 ibrp3_pmc9=1]
   [pmd4 ival=0x0 long_rate=0x0 short_rate=0x0 mask=0x0 seed=0 rand=n]
   exec-pattern=*
   [10950] started task: gcc hello.c -o hello -O2 -static 
   follow_exec=n follow_vfork=n follow_fork=n follow_pthread=n
   [10950] monitoring gcc hello.c -o hello -O2 -static ...
   [10950] activating monitoring
   measurements started at Thu Oct 30 17:10:58 2003
   
   [10950] task exited
                    1,123,406 IA64_INST_RETIRED
   [10950] detached
   created tasks        : 1
   maximum tasks        : 1
   maximum active tasks : 1
   measurements completed at Thu Oct 30 17:10:59 2003


   Now if we want to monitor all processes, we can use some of the --follow-* option that pfmon offers:
   
      --follow-fork   : pfmon monitors the child process created via a regular fork(). Pfmon creates
      		        a distinct "session" dedicated to the newly created process. By default pfmon
			does not monitor the child process.

      --follow-vfork  : pfmon monitors the child process created via a vfork(). Pfmon creates
      		        a distinct "session" dedicated to the newly created process. By default pfmon
			does not monitor the child process.

      --follow-pthread: pfmon monitors the a new pthread is created via pthread_create(). This works for both the
      			old LinuxThreads library and the New Posix Thread Library (NPTL). Pfmon creates
      		        a distinct "session" dedicated to the newly created process. By default pfmon
			does not monitor the new thread.

      --follow-exec   : pfmon monitors the process across the execve() system call. No distinct "session" is created.
      			By default, pfmon does not monitoring across exec().

      --follow-all    : combines all of the above options.

   In case you are not sure which variation of the fork to use, it is best to use --follow-all. Now we run our gcc
   test case with the --follow-all option:

   % pfmon --follow-all --us-c --no-cmd-out -eia64_inst_retired -- gcc hello.c -o hello -O2 -static
                   619,470 IA64_INST_RETIRED /usr/lib/gcc-lib/ia64-linux/2.96/cpp0 (10973,10973,10971)
                20,312,608 IA64_INST_RETIRED /usr/lib/gcc-lib/ia64-linux/2.96/cc1 (10974,10974,10971)
                 5,498,296 IA64_INST_RETIRED as (10975,10975,10971)
               228,937,523 IA64_INST_RETIRED /usr/bin/ld (10977,10977,10976)
                   468,328 IA64_INST_RETIRED /usr/lib/gcc-lib/ia64-linux/2.96/collect2 (10976,10976,10971)
                 1,123,415 IA64_INST_RETIRED gcc (10971,10971,-1)

   This time, pfmon reports 6 different results for 6 different processes. But we said there was 5 only?
   The collect2 process forks the linker ld and this is not reported by the -v option of gcc.

   Let's examine a line of the output:

                 5,498,296 IA64_INST_RETIRED as (10975,10975,10971)
		 ^         ^                 ^   ^     ^     ^
                 |         |                 |   |     |     |
		 |         |                 |   |     |     parent process id
		 |         |                 |   |     thread id (tid)
		 |         |                 |   process id (pid)
		 |         |                 command name
		 |         event name
		 total count for the process

   A parent process id of -1 indicates the first process.

   It is important to keep in mind that for child processes, by default, the "total count" does 
   include execution up to the initial fork, i.e., before the exec:

    m   | ->     vfork() ----> parent goes to wait()
    o   |          |
    n   |          |
    i   |          |
    t   |        execve()
    o   |          |
    r   |        execution of the command
    e   |          |
    d   |->       exit()

   The pid vs. tid distinction is only relevant for threads AND when using NPTL. Otherwise both
   numbers should be identical. The pid, tid, parent pid can be used to re-create the process 
   tree involved with the execution of gcc.  Results are reported in order of termination of the 
   process. This is why "gcc" is reported last. The entire command line is not reported, only 
   the first argument, which is the command name.

   You can trace what is going on with the --verbose option:

   % pfmon --verbose --follow-all --us-c --no-cmd-out -eia64_inst_retired -- gcc hello.c -o hello -O2 -static
   symbol file for gcc not found
   measuring at user privilege level ONLY
   [pmc4=0x000828 thres=0 es=0x08 plm=8 umask=0x0 pm=0 ism=0x0 oi=1] IA64_INST_RETIRED
   [pmc15=0xffffffff ibrp0_pmc8=1 ibrp1_pmc9=1 ibrp2_pmc8=1 ibrp3_pmc9=1]
   [pmd4 ival=0x0 long_rate=0x0 short_rate=0x0 mask=0x0 seed=0 rand=n]
   exec-pattern=*
   [11009] started task: gcc hello.c -o hello -O2 -static 
   follow_exec=y follow_vfork=y follow_fork=y follow_pthread=y
   [11009] monitoring gcc hello.c -o hello -O2 -static ...
   [11009] activating monitoring
   measurements started at Thu Oct 30 17:41:49 2003
   
   [11009] vfork [11011]
   [11011] monitoring gcc hello.c -o hello -O2 -static ...
   [11011] activating monitoring
   [11011] exec /usr/lib/gcc-lib/ia64-linux/2.96/cpp0 -lang-c -D__GNUC__=2 -D__G...
   [11011] monitoring /usr/lib/gcc-lib/ia64-linux/2.96/cpp0 -lang-c -D__GNUC__=2...
   [11011] task exited
                   619,469 IA64_INST_RETIRED /usr/lib/gcc-lib/ia64-linux/2.96/cpp0 (11011,11011,11009)
   [11011] detached
   [11009] vfork [11012]
   [11012] monitoring gcc hello.c -o hello -O2 -static ...
   [11012] activating monitoring
   [11012] exec /usr/lib/gcc-lib/ia64-linux/2.96/cc1 /tmp/ccOmxEc2.i -mb-step -q...
   [11012] monitoring /usr/lib/gcc-lib/ia64-linux/2.96/cc1 /tmp/ccOmxEc2.i -mb-s...
   [11012] task exited
                20,312,608 IA64_INST_RETIRED /usr/lib/gcc-lib/ia64-linux/2.96/cc1 (11012,11012,11009)
   [11012] detached
   [11009] vfork [11013]
   [11013] monitoring gcc hello.c -o hello -O2 -static ...
   [11013] activating monitoring
   [11013] exec as -x -o /tmp/ccE7nXIP.o /tmp/cciqWksW.s ...
   [11013] monitoring as -x -o /tmp/ccE7nXIP.o /tmp/cciqWksW.s ...
   [11013] task exited
                 5,498,296 IA64_INST_RETIRED as (11013,11013,11009)
   [11013] detached
   [11009] vfork [11014]
   [11014] monitoring gcc hello.c -o hello -O2 -static ...
   [11014] activating monitoring
   [11014] exec /usr/lib/gcc-lib/ia64-linux/2.96/collect2 -static -o hello /usr/...
   [11014] monitoring /usr/lib/gcc-lib/ia64-linux/2.96/collect2 -static -o hello...
   [11014] vfork [11015]
   [11015] monitoring /usr/lib/gcc-lib/ia64-linux/2.96/collect2 -static -o hello...
   [11015] activating monitoring
   [11015] exec /usr/bin/ld -static -o hello /usr/lib/gcc-lib/ia64-linux/2.96/.....
   [11015] monitoring /usr/bin/ld -static -o hello /usr/lib/gcc-lib/ia64-linux/2...
   [11015] task exited
               228,937,523 IA64_INST_RETIRED /usr/bin/ld (11015,11015,11014)
   [11015] detached
   [11014] task exited
                   468,328 IA64_INST_RETIRED /usr/lib/gcc-lib/ia64-linux/2.96/collect2 (11014,11014,11009)
   [11014] detached
   [11009] task exited
                 1,123,406 IA64_INST_RETIRED gcc (11009,11009,-1)
   [11009] detached
   created tasks        : 6
   maximum tasks        : 3
   maximum active tasks : 3
   measurements completed at Thu Oct 30 17:41:49 2003
   
   The trace reveals that vfork() is used to create the new processes, therfore we could have used
   --follow-vfork and --follow-exec to obtain the same results.

   b/ saving results in files

   It is possible to use the --outfile option when you are using any of the --follow-* options.
   There will be one file per process/thread, really per task in Linux terms unless you decide
   to aggregate the results using the --aggregate-results options.

   Using our gcc compilation example, we can now do:

   % pfmon --outfile=result --follow-all --us-c --no-cmd-out -eia64_inst_retired -- gcc hello.c -o hello -O2 -static
   $ ls -l result.*
   -rw-r--r--    1 eranian  users          45 Nov  4 15:28 result.1931.1931
   -rw-r--r--    1 eranian  users          45 Nov  4 15:28 result.1933.1933
   -rw-r--r--    1 eranian  users          45 Nov  4 15:28 result.1934.1934
   -rw-r--r--    1 eranian  users          45 Nov  4 15:28 result.1935.1935
   -rw-r--r--    1 eranian  users          45 Nov  4 15:28 result.1936.1936
   -rw-r--r--    1 eranian  users          45 Nov  4 15:28 result.1937.1937

    There is one file per process created. The naming convention used by pfmon is as follows:

       result.1937.1937
       ^      ^    ^
       |      |    |
       |      |    thread id
       |      process id
       base filename

   Unless  you are using threads (LinuxThreads or NPTL) both the process id and the thread id should be identical.
   You can use the --with-header option to get detailed header in each output file.

   % pfmon --with-header --outfile=result --follow-all --us-c --no-cmd-out \
     -eia64_inst_retired -- gcc hello.c -o hello -O2 -static
   % ls -l result.*
   -rw-r--r--    1 eranian  users        1416 Nov  4 15:32 result.2791.2791
   -rw-r--r--    1 eranian  users        1816 Nov  4 15:32 result.2793.2793
   -rw-r--r--    1 eranian  users        1497 Nov  4 15:32 result.2794.2794
   -rw-r--r--    1 eranian  users        1428 Nov  4 15:32 result.2795.2795
   -rw-r--r--    1 eranian  users        1790 Nov  4 15:32 result.2796.2796
   -rw-r--r--    1 eranian  users        1760 Nov  4 15:32 result.2797.2797
   % cat result.2797.2797
   ...
   #
   # command line     : /usr/bin/ld -static -o hello /usr/lib/gcc-lib/ia64-linux/2.96/../../../crt1.o ...
   #
   # process id       : 2797
   # thread  id       : 2797
   # parent process id: 2796
   # parent thread  id: 2796
   # exec count       : 1

   The "command line" line shows the program for this task with represents the invocation of the linker, ld. The 
   following set of lines shows the pid/tid and parent information.

   c/ separating execution between fork()/vfork() and exec()

   pfmon can differentiate the execution that happens between thre fork/vfork
   and exec using the --exec-split-result option. 

   If we use our gcc example from the previous section, here is what happens:

   Without the option:

   % pfmon --follow-all --us-c --no-cmd-out -eia64_inst_retired -- gcc hello.c -o hello -O2 -static
                   619,470 IA64_INST_RETIRED /usr/lib/gcc-lib/ia64-linux/2.96/cpp0 (10973,10973,10971)
                20,312,608 IA64_INST_RETIRED /usr/lib/gcc-lib/ia64-linux/2.96/cc1 (10974,10974,10971)
                 5,498,296 IA64_INST_RETIRED as (10975,10975,10971)
               228,937,523 IA64_INST_RETIRED /usr/bin/ld (10977,10977,10976)
                   468,328 IA64_INST_RETIRED /usr/lib/gcc-lib/ia64-linux/2.96/collect2 (10976,10976,10971)
                 1,123,415 IA64_INST_RETIRED gcc (10971,10971,-1)

   With the option:

   % pfmon --exec-split-result --follow-all --us-c --no-cmd-out -eia64_inst_retired -- gcc hello.c -o hello -O2 -static
                        94 IA64_INST_RETIRED gcc (11041,11041,11039)
                   619,376 IA64_INST_RETIRED /usr/lib/gcc-lib/ia64-linux/2.96/cpp0 (11041,11041,11039)
                        94 IA64_INST_RETIRED gcc (11042,11042,11039)
                20,312,514 IA64_INST_RETIRED /usr/lib/gcc-lib/ia64-linux/2.96/cc1 (11042,11042,11039)
                     2,203 IA64_INST_RETIRED gcc (11043,11043,11039)
                 5,496,093 IA64_INST_RETIRED as (11043,11043,11039)
                        94 IA64_INST_RETIRED gcc (11044,11044,11039)
                       154 IA64_INST_RETIRED /usr/lib/gcc-lib/ia64-linux/2.96/collect2 (11045,11045,11044)
               228,937,369 IA64_INST_RETIRED /usr/bin/ld (11045,11045,11044)
                   468,234 IA64_INST_RETIRED /usr/lib/gcc-lib/ia64-linux/2.96/collect2 (11044,11044,11039)
                 1,123,406 IA64_INST_RETIRED gcc (11039,11039,-1)

   This time we have more counts reported, but that does not mean we had more processes created just because of
   our measurement. It simply shows a finer grain breakdown of the same measurement. It is fairly easy to rebuild
   the process tree:
                 gcc (11039,11039,-1)
       vfork ->        gcc (11041,11041,11039)
       exec  ->        /usr/lib/gcc-lib/ia64-linux/2.96/cpp0 (11041,11041,11039)

       vfork ->        gcc (11042,11042,11039)
       exec  ->        /usr/lib/gcc-lib/ia64-linux/2.96/cc1 (11042,11042,11039)

       vfork ->        gcc (11043,11043,11039)
       exec  ->        as (11043,11043,11039)

       vfork ->        gcc (11044,11044,11039)
       exec  ->        /usr/lib/gcc-lib/ia64-linux/2.96/collect2 (11044,11044,11039)
       vfork ->            /usr/lib/gcc-lib/ia64-linux/2.96/collect2 (11045,11045,11044)
       exec ->        	   /usr/bin/ld (11045,11045,11044)

   This measurement reveals that gcc does not do anything between vfork and exec for the pre-processor, and cc1 but
   does more before invoking as.

   When results are saved into a file, pfmon automatically adds an extra field to the file name when the 
   --exec-split-results option is used. This is necessary to distinguish between the various parts of the
   execution of the task because the pid and the tid remain the same across exec:

   % pfmon --exec-split-result --outfile=qq --follow-all --us-c --no-cmd-out \
     -eia64_inst_retired -- gcc hello.c -o hello -O2 -static
   % ls -l qq.*
   -rw-r--r--    1 eranian  users          45 Nov  4 17:10 qq.20602.20602.1
   -rw-r--r--    1 eranian  users          45 Nov  4 17:10 qq.20604.20604.0
   -rw-r--r--    1 eranian  users          45 Nov  4 17:10 qq.20604.20604.1
   -rw-r--r--    1 eranian  users          45 Nov  4 17:10 qq.20605.20605.0
   -rw-r--r--    1 eranian  users          45 Nov  4 17:10 qq.20605.20605.1
   -rw-r--r--    1 eranian  users          45 Nov  4 17:10 qq.20606.20606.0
   -rw-r--r--    1 eranian  users          45 Nov  4 17:10 qq.20606.20606.1
   -rw-r--r--    1 eranian  users          45 Nov  4 17:10 qq.20607.20607.0
   -rw-r--r--    1 eranian  users          45 Nov  4 17:10 qq.20607.20607.1
   -rw-r--r--    1 eranian  users          45 Nov  4 17:10 qq.20608.20608.0
   -rw-r--r--    1 eranian  users          45 Nov  4 17:10 qq.20608.20608.1

   Notice, the third field in the filenames:

       qq.20604.20604.1
       ^  ^     ^     ^
       |  |     |     |
       |  |     |     exec number
       |  |     thread id
       |  process id
       base filename

   The process 20604 did exec once. The qq.20604.20604.0 contains the results for the initial portion of 
   the execution of the process, i.e., before the first exec. The file qq.20604.20604.1 contains the results
   for the portion of execution after the first exec. 

   As for the initial process, gcc in this case, it shows up as qq.20602.20602.1. The exec count is always
   1 for this program because we count the initial exec from the pfmon tool (after the vfork).

   It is very helpful to use the --with-header option to record detailed information about the results.
   For instance, in the example above:

   % cat qq.20604.20604.0
   ...
   # command line     : gcc hello.c -o hello -O2 -static
   #
   # process id       : 20604
   # thread  id       : 20604
   # parent process id: 20602
   # parent thread  id: 20602
   # exec count       : 0
   #
   ...
   
   This represents the counts accumulated between the moment the parent gcc forked off the child 
   and the moment the child called exec.

   Looking at the other file:

   % cat qq.20604.20604.1
   ...
   # command line     : /usr/lib/gcc-lib/ia64-linux/2.96/cpp0 -lang-c -D__GNUC__=2 -D__GNUC_MINOR__=96 ...
   #
   # process id       : 20604
   # thread  id       : 20604
   # parent process id: 20602
   # parent thread  id: 20602
   # exec count       : 1
   ...
   This corresponds to the execution of the pre-processor, cpp0, with is exec'ed by the child of gcc.
   Notice the exec count is now 1.

   d/ selectively monitoring processes

   When using the --follow-exec option it is possible to specify a regular expression pattern to filter out
   some part of execution. This is best explain with an example. If we use our gcc example once again and
   suppose we only care about the results for the assembler, "as", then we can use the following:

   % pfmon --follow-all  --follow-exec="^as" --us-c --no-cmd-out -eia64_inst_retired -- gcc hello.c -o hello -O2 -static
                    5,496,093 IA64_INST_RETIRED as (11065,11065,11061)

   The "^" before as is to ensure that we match the pattern "as" only at the beginning of the command line.

   Pfmon only measures and reports execution of the "as process". Of course, here one has to be careful by what is meant
   by "as process". As we saw in the previous section, the "as process" starts its life as a "gcc process" that eventually
   execs the assembler, as. In this particular example, pfmon did not have monitoring enabled at the time the "gcc process"
   is created, therefore monitoring ONLY starts at the exec.

   If you want to include more than one command and cannot easily express this with a regular expression, you can use
   the opposite command to --follow-exec which is --follow-exec-exclude. Commands that match are not monitored.
   In our gcc example, suppose we want to measure  everything except gcc:
   
   % pfmon --follow-all  --follow-exec-exclude="^gcc" --us-c --no-cmd-out -eia64_inst_retired -- gcc hello.c -o hello -O2 -static
                   619,376 IA64_INST_RETIRED /usr/lib/gcc-lib/ia64-linux/2.96/cpp0 (11081,11081,11079)
                20,312,513 IA64_INST_RETIRED /usr/lib/gcc-lib/ia64-linux/2.96/cc1 (11082,11082,11079)
                 5,496,093 IA64_INST_RETIRED as (11083,11083,11079)
               228,939,036 IA64_INST_RETIRED /usr/bin/ld (11085,11085,11084)
                   468,234 IA64_INST_RETIRED /usr/lib/gcc-lib/ia64-linux/2.96/collect2 (11084,11084,11079)

   These options may seem frivolous at first because a simple grep of the full output would yield the same results but
   they become particularly useful when pfmon is used for sampling. Also by not monitoring when not necessary we limit
   the overhead.

   e/ aggregating results from multiple processes/threads

   By default, pfmon produces output for every actively monitored task followed. You may want to only
   get the aggregate results. In this case, you simply need to use the --aggregate-results and pfmon
   will add up all the counts per event for all tasks.

   Let us take our compilation example and use the LOADS_RETIRED event to gain some more stability in the counts:

   % pfmon --follow-all -eloads_retired -- gcc hello.c -o hello -O2 -static
                     80598 LOADS_RETIRED /usr/lib/gcc-lib/ia64-linux/2.96/cpp0 (17979,17979,17977)
                   1818537 LOADS_RETIRED /usr/lib/gcc-lib/ia64-linux/2.96/cc1 (17980,17980,17977)
                    535503 LOADS_RETIRED as (17981,17981,17977)
                  25453153 LOADS_RETIRED /usr/bin/ld (17983,17983,17982)
                     61333 LOADS_RETIRED /usr/lib/gcc-lib/ia64-linux/2.96/collect2 (17982,17982,17977)
                    115971 LOADS_RETIRED gcc (17977,17977,-1)

  Now with aggregation:

   % pfmon --aggregate-results --follow-all -eloads_retired -- gcc hello.c -o hello -O2 -static
                  28065993 LOADS_RETIRED

7/ Triggering monitoring at specific locations

   Pfmon allows you to control where monitoring is started and stopped using code or data trigger addresses. By
   default pfmon starts monitoring at the first instruction of a program and stop when the programs terminates, i.e.,
   the process exits. This feature exists on all platform and does not use the PMU hardware but the debug registers.
   As such it introduces more overhead and should be use carefully, i.e. the section of the execution where monitoring
   is active must be long enough compared to the overhead.

   This ability to control where monitoring starts/stops is only available in per-process mode.

   a/ basic code triggers

   You can start or stop monitoring when execution reaches a certain instruction in your program.
   The instruction is identified by its address. For code, the address must be aligned on a bundle
   boundary (16-byte aligned). There is no particular checks made on the address, pfmon does not 
   try to validate the address inside the process user level address space. However pfmon rejects
   all addresses refering to kernel code. This feature cannot be used to trigger monitoring for
   kernel level execution. This limitation is due to the fact that triggers are implemented using
   the debug registers and it is illegal to set a breakpoint in the kernel.

   You can set up to four code trigger start and stop. For instance you can have two start triggers
   and two stop triggers, but they do not need to be evenly balanced, i.e., one start and
   three stops is perfectly legal. You specify the start triggers using --trigger-code-start followed
   by a comma separated list of addresses or symbol names. Do not use space between the element of the
   list. Similarly stops are expressed using --trigger-code-stop as a comma-separated list of addresses
   or symbols.

   Supposing we set the start trigger at the beginning of a function, and if this function calls
   others, monitoring will stay active until we come across a stop trigger or the process terminates.

   By default start and stop triggers are "executed" only once.

   Let us look at a simple example with a stupid noploop program:

   % nm noploop | fgrep "T main"
   40000000000004c0 T main

   Count the number of load retired instruction for the entire execution:

   % pfmon --no-cmd-out -eloads_retired noploop 100000
                      1340 LOADS_RETIRED

   Count the number of load retired instruction starting at main():

   % pfmon --trigger-code-start=0x40000000000004c0 --no-cmd-out -eloads_retired noploop 100000
                       552 LOADS_RETIRED

   Monitoring start at the first instruction in main() and stop when noploop executes its last user level
   instruction (we only monitor at the user level here). From the two runs, we can figure out the number
   of loads executed by the startup code: 1340-552 = 788. We can verify this by simply stopping when we
   reach main():

   % pfmon --trigger-code-stop=0x40000000000004c0 --no-cmd-out -eloads_retired noploop 100000
                       788 LOADS_RETIRED

   As you can see here, you do not need to match a start with a stop.

   Instead of using addresses, you can use symbols names given that the symbol table is still
   available in the binary. So our last example can be expressed as follows:

   % pfmon --trigger-code-stop=main --no-cmd-out -eloads_retired noploop 100000
                       788 LOADS_RETIRED

   b/ data triggers

   Data triggers work very much like ode triggers but they apply only to data accesses, i.e, 
   load, store, synchronization instructions. By default the trigger operates on read or
   write access. The start triggers are specified with the --trigger-data-start option 
   and the stop triggers with the --trigger-data-stop option. Both are comma-separated lists
   of data addresses or symbols. The trigger is activated when the byte pointed to by the address
   is accessed as part of any size of access, it could be a ld1, ld2, ld4, ld8 for instance.

   When using symbols, the names refer to global variables in the main program when linked
   dynamically or for the entirety of the program when linked statically, i.e., no dynamic
   symbol lookup is performed.

   It is possible to refine the type of access that we active the triggers. By default
   they are activated on read OR write. You can restrict this with the following options:

   	--trigger-data-ro : activate on read access only
	--trigger-data-wo : activate on write access only

   c/ repeating triggers

   In certain cases it is useful to repeat the start/stop trigger sequence each time
   the program comes across the trigger points. In the case of a code trigger, that is how
   you could accumulate a measurement specific to a function, for instance. By default pfmon
   only uses each trigger once. You can repeat the code triggers using --trigger-code-repeat
   and ALL code start/stop trigger will be repeated. Similarly using --trigger-data-repeat
   will keep the data trigger installed throughout the execution.

   d/ inheritance of triggers

   By default, triggers are not inherited across fork,vfork,pthread_create. Triggers
   are NEVER inherited across exec(). By inheritance, we mean are not reinstalled into
   the newly created task. Trigger inheritance is possible using the --trigger-code-follow
   and --trigger-data-follow options.

   e/ dynamic code stop triggers

   Many times, code trigger are used to measure only one function and the one it calls. 
   However many function can be called from difference places and it is often hard to
   locate the return point where the stop trigger should be placed. Pfmon can dynamically
   figure out the return point of a function. When the code start and stop trigger are
   identical, pfmon interpret the options as a request to monitor when execution comes
   into a function until it leaves the function, i.e, return from the function. Pfmon
   does this using a best-effort approach that is not 100% guaranteed. Pfmon does not
   check that you place the start trigger at the beginning of a function. To determine
   the location where to place the stop trigger, pfmon uses the value store in the
   return pointer (rp) at the time execution reaches the start trigger. Hence, if
   the start trigger is placed at the first instruction of the function, then 
   the rp contains the place when the function will eventually return to. Note that
   this location is not necessarily one bundle after where it was called, especially
   for leave procedures. This feature is not 100% guaranteed and as such is reserved
   to the experts.

   Here is a simple example using our noploop program:
   % pfmon --trigger-code-start=noploop --trigger-code-stop=noploop -e loads_retired noploop 100000000
                            0 LOADS_RETIRED

   The assembly code for the noploop function inside our noploop program is as follows:

   4000000000000390 <noploop>:
   4000000000000390:       [MII]       nop.m 0x0
   4000000000000396:                   mov.i r2=ar.lc
   400000000000039c:                   mov.i ar.lc=r32;;
   40000000000003a0:       [MIB]       adds r32=-1,r32
   40000000000003a6:                   nop.i 0x0
   40000000000003ac:                   br.cloop.sptk.few 40000000000003a0 <noploop+0x10>;;
   40000000000003b0:       [MIB]       mov r8=r32
   40000000000003b6:                   mov.i ar.lc=r2
   40000000000003bc:                   br.ret.sptk.many b0;;

   There is no load instruction in this function and notice that there is no alloc instruction either.
   noploop() is called from main(). To figure out what is going on, you can turn on verbose mode:

   % pfmon --verb --no-cmd-out --trigger-code-start=noploop --trigger-code-stop=noploop -e loads_retired noploop 100000000
   loaded 1353 symbols from ELF file noploop
   measuring at user privilege level ONLY
   start code trigger @0x4000000000000390
   stop  code trigger @0x4000000000000390
   [pmc5=0x00cd28 thres=0 es=0xcd plm=8 umask=0x0 pm=0 ism=0x0 oi=1] LOADS_RETIRED
   [pmc15=0xffffffff ibrp0_pmc8=1 ibrp1_pmc9=1 ibrp2_pmc8=1 ibrp3_pmc9=1]
   [pmd5 ival=0x0 long_rate=0x0 short_rate=0x0 mask=0x0 seed=0 rand=n]
   exec-pattern=*
   [12760] started task: noploop 100000000 
   follow_exec=n follow_vfork=n follow_fork=n follow_pthread=n
   [12760] monitoring noploop 100000000 ...
   [12760] installed start code breakpoint at 0x4000000000000390
   measurements started at Fri Oct 31 17:18:52 2003

   [12760] reached start code breakpoint @0x4000000000000390
   [12760] install br.ret breakpoint @0x40000000000004e0
   [12760] activating monitoring at trigger start
   [12760] clearing code breakpoint @0x4000000000000390
   [12760] reached stop  code breakpoint @0x40000000000004e0
   [12760] stopping monitoring at trigger stop
   [12760] clearing code breakpoint @0x40000000000004e0
   [12760] task exited
                         0 LOADS_RETIRED
   [12760] detached
   created tasks        : 1
   maximum tasks        : 1
   maximum active tasks : 1
   measurements completed at Fri Oct 31 17:18:52 2003

   The key line in the trace is the one that shows the dynamically extracted stop trigger:

      [12760] install br.ret breakpoint @0x40000000000004e0

   When the start trigger is reached at 0x4000000000000390, i.e., the beginning of noploop(),
   pfmon queries the kernel to get the return pointer (rp), and it gets 0x40000000000004e0 which
   is the address in main() where noploop() will return to:

   40000000000004cc:                   mov b6=r15
   40000000000004d0:       [MIB]       ld8 r1=[r14]
   40000000000004d6:                   nop.i 0x0
   40000000000004dc:                   br.call.sptk.many b0=b6;;
   40000000000004e0:       [MII]       mov r1=r38

   In this program noploop() is called via an indirect function call.

   Of course, this feature relies upon having a valid return pointer at the
   time we reach the start trigger. Pfmon does not verify this at all, so be
   careful when using this feature!

8/ Dealing with symbols

    Whenever an option takes an address (code or data) as argument, it is
    possible to directly use a symbol name rather than an address.
    For instance, this is true for the --trigger-code-*, --trigger-data-* 
    options. Other examples include the Itanium2 options, --drange, --irange
    options.
    
    a/ symbol table sources

    Pfmon can extract the symbol table from an ELF image directly. 
    Pfmon also understands the System.map format which is typically used 
    to save the symbol table of the kernel. In addition pfmon now understands 
    the format of the kernel symbols exported via /proc/kallsyms which is 
    standard in the 2.6 kernel series. Note however, that this file does not 
    list the kernel data symbols but it has the advantage of always being 
    available and it is always matching the host kernel.
    
    b/ symbol lookups

    In per-process mode, the symbols are automatically extracted from the 
    command being run. When the command does not have the symbol table it
    is possible to specify an alternate ELF file to look for the symbol, using
    the --symbol-file command. This other ELF file must be the non-stripped version 
    of the command being run, of course!

    There are a few restrictions concerning symbols in per-process mode. Pfmon cannot
    extract symbol information coming from dynamically linked libraries or modules. 
    To work around this problem, the program must be statically linked and should 
    not explicitely use dl_open(). 

    If the symbol table has been stripped, pfmon will not find any symbol and you should
    revert to using a numerical address..

    In case the option requires a code address, pfmon will only look for matching 
    code symbols. Conversly, if the option requires a data address, pfmon will only 
    look for matching data symbols. The kernel symbols exported via /proc/kallsyms
    do no include data symbols, use System.map for this instead.

    For the kernel, pfmon automatically load the content of /proc/kallsyms, unless
    the user specified the --sysmap-file option which supersedes it. The option
    must point to a valid System.map file that matches the host kernel.

    b/ symbols and the --follow-* options

    In per-process mode, symbol lookups are done before the command is actually launched
    therefore it pertains only to the initial program. Hence, it is not possible to
    refer to the name of a symbol from a program that will eventually execute. For instance,
    it is not possible to name a symbol from the assember "as" when monitoring a compilation
    with GNU c (gcc). 

    c/ symbol resolution when sampling

    It is possible to resolve all addresses present in samples when using the  
    --resolve-addresses option AND when the sampling module honors the option.
    The [detailed-itanium] and [detailed-itanium2] builtin sampling modules do
    honor the option but the [compact-ia64] does not.

    With this option, pfmon will try to resolve addresses it finds in each sample.
    This includes the IIP but possibly other addresses contained in model-specific 
    PMDs. For instance, pfmon resolves the address contained in PMD17 for D-EARS 
    on Itanium and Itanium2. Resolution occurs for kernel level symbols and 
    the user application if a symbol table was found. When no matching symbol
    exists, the raw address is printed instead. The location of the symbol,
    either kernel or application (app) is prefixed to the symbol name.

    In system-wide, only kernel symbols are resolved.

    d/ symbol resolution examples

    Here are a few examples on Itanium or Itanium2:

    Count the number of time main() is called in the noploop program:

    % file noploop
    noploop: ELF 64-bit LSB executable, IA-64, version 1, statically linked, not stripped
    % pfmon --no-cmd-out --checkpoint-func=main -e ia64_inst_retired noploop 10000
                         3 IA64_INST_RETIRED

    Here the symbol information for main() is directly extracted from noploop
    itself. You must divide by 3 (becuase of 3 were executed in the first bundle) to 
    get the actual result.

    Count the number of time main() is called in the noploop-stripped program:

    % file noploop-stripped
    noploop-stripped: ELF 64-bit LSB executable, IA-64, version 1, statically linked, stripped

    % pfmon --no-cmd-out --checkpoint-func=main -e ia64_inst_retired noploop-=stripped 10000
    symbol not found: main

    Let us provide the non-stripped version for the symbol table:

    % pfmon --no-cmd-out --symbol-file=noploop --checkpoint-func=main -e ia64_inst_retired noploop-stripped 1000
                             3 IA64_INST_RETIRED

    Here noploop and noploop-stripped are the same programs except that the latter does not have 
    the symbol table anymore.

    Count the number of times sys_getpid() is called during the execution of the noploop program:

    % pfmon -k --checkpoint-func=sys_getpid -e ia64_inst_retired noploop 1000
                             0 IA64_INST_RETIRED

    The -k is required to make sure we monitor at the kernel level where sys_getpid() is located. The
    results shows that noploop never calls getpid(). We can easily verify that this is correct using
    the strace program:

    % strace noploop 1000
    execve("./noploop", ["noploop", "100000"], [/* 24 vars */]) = 0
    uname({sys="Linux", node="tuff.hpl.hp.com", ...}) = 0
    brk(0)                                  = 0x6000000000010000
    brk(0x6000000000034000)                 = 0x6000000000034000
    fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 1), ...}) = 0
    mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2000000000000000
    write(1, "143663 cycles, 1.44 cycles/loop\n", 32143663 cycles, 1.44 cycles/loop) = 32
    munmap(0x2000000000000000, 65536)       = 0
    exit(0)                                 = ?

    But if we use a program that does getpid() in a loop:

    % pfmon --no-cmd-out -k --checkpoint-func=sys_getpid -e ia64_inst_retired getpid 1000
                      3003 IA64_INST_RETIRED
    This one does 1000+1 (3003/3) calls, obviously a 1000 from the loop plus an extra one somewhere else.

9/ Sampling with pfmon

   Pfmon has support for sampling on any events or combination of events. Samples are collected
   into a buffer which can then be written to a file or simply on the screen. Sampling is supported
   for both system-wide mode and per-process mode including when following fork,vfork, and pthread_create.

   Pfmon can sample is many different ways depending on the underlying CPU or what is being measured.
   For a given sampling measurement, pfmon can present the information differently. For instance it
   can be presented in clear text with decoding of certains timings, or it can be dump as a table
   with the values uninterpreted. What control the output is called a pfmon sampling module and we
   detail what they do in the next sections, but first we need to introduce what sampling with
   pfmon and Linux/ia64 perfmon-2 means.
   
   a/ principles

   Sampling is a technique to record certain information at some interval. The information is recorded
   into a sample which is stored into a sampling buffer. The interval can be time-based or more likely
   event-based and the length can be fixed or can vary. A typical example is the case where you want
   to record where you are (IP) each time you have executed one million instructions. 

   Each sample is composed of two parts, a fixed size header which contains general information about 
   the sample and a variable size body which consists of a set of 64-bit values each one corresponding 
   to a PMD register representing the other events being monitored. All samples record the same set
   of PMDs, this set is determined by pfmon based on what is requested by the user.
   
   The sampling buffer is controlled by the kernel but its size is configurable. Depending on
   the sampling module used in pfmon, the default size of the buffer can vary. The common module
   use a size of 2048 entries. Each sampling module may offer options to change that size.

   Sampling works as follows:
   	1- the user specifies what needs to be recorded.
	2- the user specifies the sampling period and optional randomization parameters.
	3- at the end of a period, a sample is recorded into the buffer by the kernel.
	4- a new sampling period is reloaded and execution/monitoring resumes. we go back to step 3.
	5- if the sampling buffer becomes full, pfmon is notified.
	6- pfmon processes the buffer, i.e., prints and/or saves the buffer depending on the sampling module.
	7- pfmon then notifies the kernel that it is done.
	8- the kernel reload new sampling periods and execution/monitoring resumes. we go back to 3.

   Pfmon and the kernel perfmon-2.x subsystem allow monitoring sessions with multiple sampling periods.
   In case where more than one sampling periods expire at the same time, there is one sample per
   expired period recorded, thereby multiple samples recorded, subject to the space left in the buffer.

   b/ sampling periods

   Pfmon (and the perfmon-2.x ) uses two sampling periods instead of just one. The first one is called
   short-smpl-period and the second is called long-smpl-period. The short-smpl-period is used 
   in step 4, this is when the sampling buffer is not full after writing the sample. The 
   long-smpl-period is used in step 8 when the reload occurs after the buffer became full. 

   But why do we need 2 periods?

   As you might imagine there is some overhead when recording a sample. This overhead is
   increased even more when pfmon needs to get involved to drain the buffer. This operation
   can take some time and will inevitably introduce some "noise" in the measurements in the form
   of TLB and/or cache pollution. To try and hide this "noise", it is sometimes beneficial to
   adjust the sampling period, i.e., make it bigger to ensure that the next sample will not
   record an event that is the consequence of the overhead generated by the monitoring but rather
   a normal event for the program/system being monitored. It is expect that the 
   long-smpl-period >= short-smpl-period. Of course if the two are equal, this is equivalent to 
   having only one sampling period. Note that the long-smpl-period is only used for 
   the first sample recorded after the buffer is marked as empty again (step 7). Pfmon
   also use the long period first the initial period, i.e., for the first sample of the
   measurement.

   The long sampling period is specified via the --long-smpl-periods options and similarly
   --short-smpl-period is used for the short sampling period.

   % pfmon --long-smpl-periods=5000 --short-smpl-period=2000 -e l3_misses ....

   In the command line above, we take a sample every 2000 L3 misses at the user level.
   When the buffer becomes full, pfmon processes the data and restart monitoring.
   The next sample will come after 5000 L3 misses, subsequent samples will be 2000 misses
   apart.

   Pfmon supports multiple sampling periods. Each time one period elapse a sample
   is recorded. Hence with multiple sampling periods you can, in effect, create an
   OR condition: after N occurences of event A OR M occurences of event B record
   a sample. The number of sampling period is limited by the number of counting
   monitors supported by the underlying PMU. On Itanium and Itanium2 the limit
   is 4 counters, thus 4 sampling periods.

   When multiple sampling periods are used, they must be specify as a comma-separated
   list (no space). They apply in order tto the events specified with the -e option:

   % pfmon --long-smpl-periods=4000,3000 --short-smpl-periods=2000,1000 -el2_misses,l3_misses ...

   Here, the long and short periods and l2_misses are respectively 4000 and 2000. For l3_misses
   they are 300 and 1000 respectively.

   At a minimum, You must specify either one of the two types of sampling periods. Pfmon
   automatically adjusts the other one to be equal in that case. Hence the following:

   % pfmon --long-smpl-periods=4000 -el2_misses ...

   implies that the short period for l2_misses is 4000. You can see this with the --verbose
   option:

   % pfmon --verbose --long-smpl-periods=4000 -el2_misses ...
   loaded 10806 symbols from /proc/kallsyms
   loaded 1353 symbols from ELF file noploop
   measuring at user privilege level ONLY
   using detailed-itanium2 sampling module
   long  sampling periods(val/mask/seed): 2000/0x0/0
   short sampling periods(val/mask/seed): 2000/0x0/0
   ...

   c/ pfmon sampling modules

   There are many ways in which the samples can be saved or printed on the
   screen. Pfmon offers a choice of sampling module depending on the underlying
   CPU and what is being measured. There a few builtin modules but you are free
   to add you own should you jhave specific needs. Pfmon includes a template
   module call example-ia64.

   To find out the list of builtin modules, simply use the -I option:
   % pfmon -I
   detected host CPUs:  2-way 1000MHz/3.0MB Itanium 2 (McKinley, B3)
   supported PMU models: [itanium2] [itanium] [generic IA-64] 
   detected host PMU: itanium2
   supported sampling modules: [detailed-itanium2] [raw-ia64] [compact-ia64] [btb] [example-ia64] 
   pfmlib version: 3.0
   kernel perfmon version: 2.0

   Because some modules are dependent on the underlying CPU, i.e., PMU model, the list
   of choices may vary from one processor to another. The above example is for Itanium2
   but on Itanium you would get:

   % pfmon -I
   detected host CPUs:  2-way 733MHz/2.0MB Itanium (Merced, B3)
   supported PMU models: [itanium2] [itanium] [generic IA-64] 
   detected host PMU: itanium
   supported sampling modules: [detailed-itanium] [raw-ia64] [compact-ia64] [btb] [example-ia64] 
   pfmlib version: 3.0
   kernel perfmon version: 2.0

   Notice that "detailed-itanium2" has been replaced by "detailed-itanium" which is another
   module specific to Itanium (1).

   A sampling module is called when the sampling buffer becomes full and needs to
   be drained (dumped). Each module is free to process the data in any way it
   wants. Some of them can be specialized in ONLY a certain type of measurements.

   But in general the builtin formats accept any kind of measurements. They simply
   differ in the way they present the information:

   [detailed-itanium2] : prints detailed information about each sample. The header
   		  	 and body are decoded and are directly meaningful to user.
			 This module is Itanium 2 specific.

   [detailed-itanium]  : prints detailed information about each sample. The header
   		  	 and body are decoded and are directly meaningful to user.
			 This module is Itanium  specific.

   [compact-ia64]      : prints the header and body with one line per sample and just
   			 raw values. No interpretation is attempted. This module works
			 for all CPU models.

   [raw-ia64]          : the buffer is dumped as is into a file. This module works
			 for all CPU models.

   [btb]               : prints the Branch Trace Buffer (BTB) with one branch per line.
   			 Value are arranged in column-style with no interpretation.
			 You must use BRANCH_EVENT as the sampling period. This module
			 works for Itanium and Itanium2 only.

   [example-ia64]      : a simple template module for people who want to write their
   			 own sampling module. 
  
   By default, pfmon auto-detect the host CPU and picks the first matching sampling
   module. On Itanium2, the default is [detailed-itanium2]. You can force a specific
   module by using the --smpl-module option:

   % pfmon no-cmd-output --smpl-module=compact-ia64 --long-smpl-periods=1000000 -ecpu_cycles noploop 10000000
   0        18222    0  0x40000000000003a0 0x000312854d7e7f27   4 1000000 
   1        18222    0  0x40000000000003a0 0x000312854d8dff32   4 1000000 
   2        18222    0  0x40000000000003a0 0x000312854d9d5351   4 1000000 
   3        18222    0  0x40000000000003a0 0x000312854daca5bf   4 1000000 
   4        18222    0  0x40000000000003a0 0x000312854dbbf839   4 1000000 
   5        18222    0  0x40000000000003a0 0x000312854dcb4915   4 1000000 
   6        18222    0  0x40000000000003a0 0x000312854dda9e7a   4 1000000 
   7        18222    0  0x40000000000003a0 0x000312854de9ef8b   4 1000000 
   8        18222    0  0x40000000000003a0 0x000312854df94035   4 1000000 
   9        18222    0  0x40000000000003a0 0x000312854e089090   4 1000000 
   10       18222    0  0x40000000000003a0 0x000312854e17e0b8   4 1000000 
   11       18222    0  0x40000000000003a0 0x000312854e273102   4 1000000 
   12       18222    0  0x40000000000003a0 0x000312854e368116   4 1000000 

   You can find out more about each module by using the -S or --smpl-module-info option:

   % pfmon -S btb
   name        : btb
   description : IA-64 column-style BTB raw values. Must use BRANCH_EVENT event only.
   PMU models  : [itanium] [itanium2] 
   options     :
           --smpl-entries=val              set number of entries for sampling buffer

   The name of the module along with a short description is printed. Then the list of supported
   PMU models. Finally, optional module-specific options are listed.

   Each module can have module-specific options. Those options become known to pfmon ONLY once
   the module is known. That means that if you do not choose the default sampling module, then
   you MUST specify the module name, via --smpl-module, BEFORE you can use the module-specific
   options.

   % pfmon --no-cmd-out --smpl-module=btb --smpl-entries=1024 --long-smpl-periods=8192 -ebranch_event noploop 10000

   Here --smpl-entries appears AFTER the --smpl-module selection.

   You can list all the modules with all their options simply by typing:

   % pfmon --help
   ....
   options for "detailed-itanium2" sampling format:
        --smpl-entries=val              set number of entries for sampling buffer (default 2048)
   options for "raw-ia64" sampling format:
        --smpl-entries=val              set number of entries for sampling buffer (default 2048)
   options for "compact-ia64" sampling format:
        --smpl-entries=val              set number of entries for sampling buffer (default 2048)
   options for "btb" sampling format:
        --smpl-entries=val              set number of entries for sampling buffer (default 2048)
   options for "example-ia64" sampling format:
        --smpl-entries=val              set number of entries for sampling buffer (default 2048)


   Notice that the default buffer size is printed.

   c/ basic sampling examples

   Suppose you want to record how many instructions are retired every 50000 cycles, i.e.,
   you want to sample based on CPU_CYCLES and record the value of IA64_INST_RETIRED in 
   each sample. This can be done as follows:

   % pfmon --no-cmd-out --short-smpl-period=50000 --long-smpl-period=50000 -e cpu_cycles,ia64_inst_retired -- ls /dev/null

   The two periods are identical in this example because the number of instructions executed
   by the ls command is not influenced by the fact that we monitor. 

   Next, we take a sample every 50000 cpu cycles OR each time 10000 instructions have been retired:

   % pfmon --no-cmd-out --short-smpl-period=50000,10000 --long-smpl-period=50000,10000 -e cpu_cycles,ia64_inst_retired ls


   Let us look at the information in the sampling buffer for the detailed-itanium2 sampling module. 
   For the first example above, we get something like this printed on the screen:
   
	entry 0 PID:18291 CPU:0 STAMP:0x31444f4ed7297 IIP:0x2000000000023ec0
        OVFL: 4 LAST_VAL: 50000
        PMD5 : 0x0000000000004c96
        entry 1 PID:18291 CPU:0 STAMP:0x31444f4f0aebc IIP:0x2000000000021370
        OVFL: 4 LAST_VAL: 50000
        PMD5 : 0x000000000000f4f0
        entry 2 PID:18291 CPU:0 STAMP:0x31444f4f1a9b2 IIP:0x2000000000014ab0
        OVFL: 4 LAST_VAL: 50000
        PMD5 : 0x0000000000015bc7
        entry 3 PID:18291 CPU:0 STAMP:0x31444f4f27e8e IIP:0x2000000000014a81
        OVFL: 4 LAST_VAL: 50000
        PMD5 : 0x000000000001b799
        entry 4 PID:18291 CPU:0 STAMP:0x31444f4f34f65 IIP:0x2000000000014c11
        OVFL: 4 LAST_VAL: 50000
        PMD5 : 0x0000000000024d63
        entry 5 PID:18291 CPU:0 STAMP:0x31444f4f4201c IIP:0x2000000000021601
        OVFL: 4 LAST_VAL: 50000
        PMD5 : 0x0000000000033113
		.....

   You see the entries extracted from the sampling buffer.  Entry 0 is the first entry recorded in 
   this monitoring session. The first line of each sample (entry) shows the fixed header. The fields 
   are as follows:

   	- PID     : the identity of the process that generated the event
	- CPU     : the CPU number on which the event occurred
	- STAMP   : a time stamp guaranteed to be unique in time per CPU.
	- IIP     : the value of the IP when the PMU interrupt occurred (DANGER, see note below)
	- OVFL    : the counter that triggered the recording of the sample.
	- LAST_VAL: the last value loaded into the counter which overflowed

   VERY IMPORTANT NOTE:
   Users are advised NOT TO TRUST the value reported in IIP.  Samples get recorded by forcing a counter overflow 
   which then triggers an interrupt which will cause the kernel to record the information. Because of the
   parallel nature of the architecture and its implementations, it is very likely that by the time the PMU realizes
   that there was a counter overflow and generates the interrupt, the program execution has progressed beyond
   the instruction that caused the event. This very often leads to the a skewed IIP. At best IIP points to the 
   next bundle given that interrupts can only be delivered at bundle boundaries.

   After the header, you get the value of PMD5. This register contains the number of instructions retired for our
   example. The second event specified by the user DOES NOT necessarily end up in PMD5. To figure out how the
   events were dispatched among the various PMDs, you can use the --with-header option (described earlier). 
   The header contains detailed machine and session descriptions. In our case it would like as follows:

   #
   # date: Mon Nov  3 14:21:18 2003
   #
   # hostname: tuff.hpl.hp.com
   #
   # kernel version: Linux 2.6.0-test7 #200 SMP Fri Oct 24 14:11:06 PDT 2003
   #
   # pfmon version: 3.0
   # kernel perfmon version: 2.0
   #
   #
   # page size: 16384 bytes
   #
   # CLK_TCK: 1024 ticks/second
   #
   # CPU configured: 0
   # CPU online    : 2
   #
   # physical memory          : 4232495104 bytes (4036.4 MB)
   # physical memory available: 3248488448 bytes (3098.0 MB)
   #
   # host CPUs:  2-way 1000MHz/3.0MB Itanium 2 (McKinley, B3)
   #       PAL_A: 0.7.31
   #       PAL_B: 0.7.40
   #       Cache levels: 3 Unique caches: 4
   #       L1D:    16384 bytes, line  64 bytes, load_lat   1, store_lat   3
   #       L1I:    16384 bytes, line  64 bytes, load_lat   1, store_lat   0
   #       L2 :   262144 bytes, line 128 bytes, load_lat   5, store_lat   7
   #       L3 :  3145728 bytes, line 128 bytes, load_lat  12, store_lat   7
   #
   #
   # captured events:
   #       PMD4: CPU_CYCLES
   #       PMD5: IA64_INST_RETIRED
   #
   # privilege levels:
   #       PMD4: CPU_CYCLES        = user
   #       PMD5: IA64_INST_RETIRED = user
   #
   # monitoring mode: per-process
   #
   # instruction sets:
   #       PMD4: CPU_CYCLES        = ia32/ia64
   #       PMD5: IA64_INST_RETIRED = ia32/ia64
   #
   #
   # command: pfmon --no-cmd-out --with-header --short-smpl-period=50000 --long-smpl-period=50000 -e cpu_cycles,ia64_inst_retired -- ls /dev/null
   #
   # command line     : ls /dev/null 
   #
   # process id       : 18315
   # thread  id       : 18315
   # parent process id: -1
   # parent thread  id: -1
   # exec count       : 1
   #
   #
   #
   # recorded PMDs when CPU_CYCLES overflows: PMD5 
   #
   # short sampling rates (base/mask/seed):
   #       CPU_CYCLES 50000
   #       IA64_INST_RETIRED none
   #
   # long sampling rates (base/mask/seed):
   #       CPU_CYCLES 50000
   #       IA64_INST_RETIRED none
   #
   #
   # min sampling buffer entries: 2048
   #
   #
   
   Near the middle of the header, you see in the "captured events" section: PMD5: IA64_INST_RETIRED.
   Later on you see the line "recorded PMDs when..." showing which PMD gets recorded in the body of each sample.
    
   Pfmon records the value of the PMD for which the event has no sampling period defined. For our
   first example, it means that it will record the value of the PMD counting the number of instructions 
   retired. Let us look at a more complicated example using some of the Itanium2  specific events:

   % pfmon --with-header --short-smpl-periods=50000 --long-smpl-periods=50000 \
      -e cpu_cycles,ia64_inst_retired,l2_misses,cpu_cpl_changes -- ls /dev/null

   Here cpu_cycles is controlling the sampling period and each sample will include value of the PMDs counting
   the number of L2 misses (L2_MISSES) and the number of CPU privilege level changes (CPU_CPL_CHANGES):

   ...
   entry 1 PID:18326 CPU:0 STAMP:0x3153ce9a31d98 IIP:0x2000000000014e01
        OVFL: 4 LAST_VAL: 50000
        PMD5 : 0x000000000000f0c5
        PMD6 : 0x00000000000004da
        PMD7 : 0x0000000000000044
   ...

   Where the assignments were: 

   # captured events:
   #       PMD4: CPU_CYCLES
   #       PMD5: IA64_INST_RETIRED
   #       PMD6: L2_MISSES
   #       PMD7: CPU_CPL_CHANGES
   ...
   # recorded PMDs when CPU_CYCLES overflows: PMD5 PMD6 PMD7 


   Using the compact-ia64 module instead of the detailed-itanium2, you get results that are formatted such that they can be 
   easily parsed by other tools. The header contains the description of every
   column:

   % pfmon --smpl-module=compact-ia64 --no-cmd-out --with-header --short-smpl-periods=50000 \
    --long-smpl-periods=50000 -e cpu_cycles,ia64_inst_retired,l2_misses,cpu_cpl_changes -- ls /dev/null

   0        18336    0  0x2000000000022bc0 0x00031564d69054aa   4 50000 0x000000000000551d 0x000000000000027b 0x0000000000000026 
   1        18336    0  0x2000000000013710 0x00031564d6931eb3   4 50000 0x0000000000010e2f 0x0000000000000536 0x0000000000000047 
   2        18336    0  0x2000000000014a90 0x00031564d6941a54   4 50000 0x00000000000180ab 0x00000000000006b7 0x000000000000004e 

   The header contains the information to figure the content of each column:

   # description of columns:
   #       column  1: entry number
   #       column  2: process id
   #       column  3: cpu number
   #       column  4: instruction pointer
   #       column  5: unique timestamp
   #       column  6: overflowed PMD index
   #       column  7: initial value of first overflowed PMD
   #
   #       when PMD4(CPU_CYCLES) overflows: 
   #               column  8: PMD5
   #               column  9: PMD6
   #               column 10: PMD7

   d/ writing samples to a file

   By defualt, all sampling module output to the controlling terminal. This may be overwhelming
   at times. Therefore it is strongly recommended that samples be redirected to a file using the
   --smpl-outfile option. This option is different from the --outfile option. The latter is where
   the final counts are saved, whereas the former is where the samples are saved. By default,
   no counts are printed in sampling mode.

   The filename specified for the --smpl-outfile is in fact just the basename. It is appended
   by a suffix which depends on the monitoring sessions type. For system-wide session, you get
   one file per cpu and the .cpuXX suffix is added with XX represent the CPU number, starting
   at zero. If aggregation is used then nothing is appended.

   In per-process mode, the filename is also the base in case some of the follow-* options are
   used and you get one file per task monitored. See section g/ below for more details.

   e/ resetting non-sampling counters 

   Oftentimes you want to know how many occurences of an event during specific period. We saw
   that sampling gives you this kind of information. For instance, in our first example we
   want to know how many instruction retired for every 50000 cycles. The instructions retired
   count is recorded in every sample but this the accrued count not the count since last sample.
   It is fairly easy to correct this by substracting the count for the previous sample but this
   requires a pass over the entire buffer. Pfmon and the kernel perfmon-2.x subsystem support
   an option to force a reset of the non-sampling event each time a sample is recorded. For
   pfmon this is the --reset-non-smpl-periods options.

   Let us examine the effect of the --reset-non-smpl-periods  options. goping back to our first
   example:

   % pfmon --no-cmd-out --short-smpl-period=50000 -e cpu_cycles,ia64_inst_retired ls

	entry 0 PID:18291 CPU:0 STAMP:0x31444f4ed7297 IIP:0x2000000000023ec0
           OVFL: 4 LAST_VAL: 50000
           PMD5 : 0x0000000000004c96
        entry 1 PID:18291 CPU:0 STAMP:0x31444f4f0aebc IIP:0x2000000000021370
           OVFL: 4 LAST_VAL: 50000
           PMD5 : 0x000000000000f4f0
        entry 2 PID:18291 CPU:0 STAMP:0x31444f4f1a9b2 IIP:0x2000000000014ab0
           OVFL: 4 LAST_VAL: 50000
           PMD5 : 0x0000000000015bc7
        entry 3 PID:18291 CPU:0 STAMP:0x31444f4f27e8e IIP:0x2000000000014a81
           OVFL: 4 LAST_VAL: 50000
           PMD5 : 0x000000000001b799
        entry 4 PID:18291 CPU:0 STAMP:0x31444f4f34f65 IIP:0x2000000000014c11
           OVFL: 4 LAST_VAL: 50000
           PMD5 : 0x0000000000024d63
        entry 5 PID:18291 CPU:0 STAMP:0x31444f4f4201c IIP:0x2000000000021601
           OVFL: 4 LAST_VAL: 50000
           PMD5 : 0x0000000000033113

   Here the value of PMD5 is the accrued value since the beginning of the session.
   Therefore to figure out how many instructions were retired between entry 2 and
   entry 3, one must do 0x000000000001b799 - 0x0000000000015bc7 = 0x5bd2 = 23506

   Now with the option:

   % pfmon --no-cmd-out --short-smpl-period=50000 --reset-non-smpl-periods -e cpu_cycles,ia64_inst_retired ls

   entry 0 PID:18353 CPU:0 STAMP:0x315e7c8493a68 IIP:0x200000000000fd90
        OVFL: 4 LAST_VAL: 50000
        PMD5 : 0x0000000000005b74
   entry 1 PID:18353 CPU:0 STAMP:0x315e7c84c0902 IIP:0x20000000000136e0
        OVFL: 4 LAST_VAL: 50000
        PMD5 : 0x000000000000b3c2
   entry 2 PID:18353 CPU:0 STAMP:0x315e7c84cf6ea IIP:0x2000000000014a72
        OVFL: 4 LAST_VAL: 50000
        PMD5 : 0x0000000000007768
   entry 3 PID:18353 CPU:0 STAMP:0x315e7c84dcbcf IIP:0x2000000000014ba0
        OVFL: 4 LAST_VAL: 50000
        PMD5 : 0x0000000000007b74
   entry 4 PID:18353 CPU:0 STAMP:0x315e7c84e9b57 IIP:0x20000000000137f0
        OVFL: 4 LAST_VAL: 50000
        PMD5 : 0x0000000000010095
   entry 5 PID:18353 CPU:0 STAMP:0x315e7c84f65a9 IIP:0x20000000000231b0
        OVFL: 4 LAST_VAL: 50000
        PMD5 : 0x000000000000b345

   You easily notice that the count for PMD5 is reset between two samples. No further calculation
   is needed to answer our initial question.

   f/ blocking on overflow notifications

   Whenever the sampling buffer becomes full and pfmon is notified you have the option of either 
   letting the monitored program continue or block it. In both cases, monitoring is off during the 
   processing of the sampling buffer. By default, pfmon lets the program continue its execution. It is 
   possible to block the program using the --overflow-block option. Blocking the program ensures pfmon 
   sees the entire execution. Keeping the program running ensures that the caches and TLB are kept 
   somewhat warm, i.e., with some state belonging to the running process, especially on SMP systems.
   The option is support in all types of per-process sessions including the ones which are following
   execution through fork(), vfork(), and pthread_create().

   This options is not supported in system-wide mode.

   g/ sampling and the --follow-* options

   Sampling is supported across fork(),vfork(), and pthread_create(). It is also supported when
   sampling across exec(). The samples are collected per-process, i.e., really per-task in Linux
   term. In a multithreaded application, you get samples for each thread.

   By default all sampling modules write to the controlling terminal, hence it may be difficult
   to distinguish which samples come from which task and output may be interleaved. However, pfmon
   does serialize terminal output. This means that samples output in batched per task,i.e., no
   interleaving possible during a dump of a buffer. 

   The best way to handle multi process/thread sampling sessions is to redirect the output to a file 
   using the --smpl-outfile option. The filename passed to the option serves as a basename very much
   like what is done for the --outfile option that we described in 4/e/. 

   Let us look at our compilation example:

   % pfmon --follow-all --no-cmd-out --smpl-outfile=gcc-smpl --long-smpl-periods=1000000 -- gcc hello.c -o hello -O2 -static
   % ls -l gcc-smpl.*
   2174981 -rw-r--r--    1 eranian  users           0 Nov  4 17:31 gcc-smpl.24410.24410
   2175304 -rw-r--r--    1 eranian  users           0 Nov  4 17:31 gcc-smpl.24411.24411
   2175313 -rw-r--r--    1 eranian  users        1046 Nov  4 17:31 gcc-smpl.24412.24412
   2175323 -rw-r--r--    1 eranian  users         380 Nov  4 17:31 gcc-smpl.24413.24413
   2175327 -rw-r--r--    1 eranian  users           0 Nov  4 17:31 gcc-smpl.24414.24414
   2175337 -rw-r--r--    1 eranian  users       14634 Nov  4 17:31 gcc-smpl.24415.24415

   Just like for --outfile, the process id and the thread id are appended to the provide filename. 
   The thread id differs from the process id only when following threads. For processes, the two numbers are identical:
   
       gcc-smpl.24410.24410
       ^        ^     ^
       |        |     |
       |        |     thread id
       |        process id
       base filename

   You can get detailed information included in each file using the --with-header option, just like for --outfile.

   The --exec-split-results and --follow-exec-* options have the same effects as --outfile on the generation of
   the sampling output files that they do for the counts. When --exec-split-results is used, the exec number if
   appended to the sampling output file:

   % pfmon -exec-split --follow-all --no-cmd-out --smpl-outfile=gcc-smpl \
     --long-smpl-periods=1000000 -- gcc hello.c -o hello -O2 -static
   % ls -l gcc-smpl.*
   2174981 -rw-r--r--    1 eranian  users        1683 Nov  4 17:38 gcc-smpl.25727.25727.1
   2175304 -rw-r--r--    1 eranian  users        1594 Nov  4 17:38 gcc-smpl.25728.25728.0
   2175312 -rw-r--r--    1 eranian  users        1990 Nov  4 17:38 gcc-smpl.25728.25728.1
   2175317 -rw-r--r--    1 eranian  users        1594 Nov  4 17:38 gcc-smpl.25729.25729.0
   2175323 -rw-r--r--    1 eranian  users        2717 Nov  4 17:38 gcc-smpl.25729.25729.1
   2175332 -rw-r--r--    1 eranian  users        1594 Nov  4 17:38 gcc-smpl.25730.25730.0
   2175333 -rw-r--r--    1 eranian  users        1982 Nov  4 17:38 gcc-smpl.25730.25730.1
   2175336 -rw-r--r--    1 eranian  users        1594 Nov  4 17:38 gcc-smpl.25731.25731.0
   2175337 -rw-r--r--    1 eranian  users        1964 Nov  4 17:38 gcc-smpl.25731.25731.1
   2175341 -rw-r--r--    1 eranian  users        1964 Nov  4 17:38 gcc-smpl.25732.25732.0
   2175342 -rw-r--r--    1 eranian  users       16568 Nov  4 17:38 gcc-smpl.25732.25732.1
 
   If we look at the header of the largest file, we find out that this is the linker, which
   is exec'ed from the collect2 process:

   # command line     : /usr/bin/ld -static -o hello /usr/lib/gcc-lib/ia64-linux/2.96/...
   #
   # process id       : 25732
   # thread  id       : 25732
   # parent process id: 25731
   # parent thread  id: 25731
   # exec count       : 1

   Some of the files can be empty (in case --with-header is not specified) or just with a header
   which means that no samples was collected for this portion of the execution.

   h/ sampling in system wide mode

   Sampling is possible in the same manner for system wide sessions. By default, the buffer is printed on the
   controlling tty. When sampling on more than one CPU at a time, samples for each CPU will be printed. When
   sampling results are redirected into a file with the --smpl-outfile option, you get one file per CPU. If 
   the file is called 'myresults', then 'myresults.cpu0' contains the samples captured on CPU0, 
   'myresults.cpu1' the ones from CPU1, and so on:

   % pfmon --system-wide --smpl-outfile=sysout --long-smpl-periods=1000000 --session-timeout=3
   <session to end in 3 seconds>
   % ls -l sysout*
   sysout.cpu0  sysout.cpu1
   % ls -l sysout.cpu*
   -rw-r--r--    1 eranian  users           0 Nov  4 12:22 sysout.cpu0
   -rw-r--r--    1 eranian  users           0 Nov  4 12:22 sysout.cpu1

   The --aggregate-results options influences the way samples are saved to files. When this option is used,
   the samples are merged into a single file. In our example, they would go into 'myresults':
   
   i/ randomization of sampling periods

   Pfmon supports randomization of both sampling periods. The user must supply a bitmask and a seed value
   using the --smpl-periods-random option. The same mask and seed applies to both the long and short period
   for each event. Each event (sampling period) can have a different mask and seed. Two separate invocations 
   of pfmon using the same seed and mask arguments are guaranteed to generate to same "pseudo-random" 
   series of numbers allowing reproducibility.

   Most builtin sampling modules, report the last sampling period value, i.e., the value that was loaded into
   the sampling period to record a sample. 

   In the following example, we use the [detailed-itanium2] module. The long (and short) sampling period are 
   initially set to 100000 and we activate randomization using a seed of 5. The mask indicates that we allow 
   the value to vary between 100000 and 100255 (inclusive):

   % pfmon --smpl-periods-random=0xff:5 --long-smpl-period=100000 -e cpu_cycles -- noploop 1000000000

   entry 0 PID:509 CPU:0 STAMP:0xa9b83faf28 IIP:0x4000000000000400
        OVFL: 4  LAST_VAL: 100000
   entry 1 PID:509 CPU:0 STAMP:0xa9b8413a4d IIP:0x4000000000000400
        OVFL: 4  LAST_VAL: 100005
   entry 2 PID:509 CPU:0 STAMP:0xa9b842c532 IIP:0x4000000000000400
        OVFL: 4  LAST_VAL: 100067
   entry 3 PID:509 CPU:0 STAMP:0xa9b8445077 IIP:0x4000000000000400
        OVFL: 4  LAST_VAL: 100181
   entry 4 PID:509 CPU:0 STAMP:0xa9b845db4e IIP:0x4000000000000400
        OVFL: 4  LAST_VAL: 100064
   entry 5 PID:509 CPU:0 STAMP:0xa9b84766b5 IIP:0x4000000000000400
        OVFL: 4  LAST_VAL: 100212
   entry 6 PID:509 CPU:0 STAMP:0xa9b848f1d5 IIP:0x4000000000000400
        OVFL: 4  LAST_VAL: 100140


   The randomization is shown in the LAST_VAL field. Here is corresponds to the value loaded 
   into PMD4 (the PMD which overflowed) for each sample. Hence, 100181 is the number of elapsed
   cycles between entry 2 and entry 3.

   Randomization is important when sampling to avoid getting in lockstep with the execution
   and thereby collecting biased results.

   j/ Sampling and symbol resolution example

   Let us use an example which stresses the application and the kernel to show
   how both the application and the kernel addresses are resolved. We use a
   stupid getpid() loop program:
   
   % pfmon -u -k --no-cmd-out --resolve-addresses --long-smpl-period=1000000 -- getpid 100000

   We monitor at both the kernel and user level to record samples at both levels.
   The sample we are getting look as follows:

   entry 0 PID:16661 CPU:0 STAMP:0xdc1e3b599f02 IIP:kernel:break_fault+0x120
        OVFL: 4 LAST_VAL: 1000000
   entry 1 PID:16661 CPU:0 STAMP:0xdc1e3b68eff9 IIP:kernel:break_fault+0x120
        OVFL: 4 LAST_VAL: 1000000
   entry 2 PID:16661 CPU:0 STAMP:0xdc1e3b7836ba IIP:app:getpid+0x2
        OVFL: 4 LAST_VAL: 1000000
   entry 3 PID:16661 CPU:0 STAMP:0xdc1e3b877d3b IIP:app:getpid+0x2
        OVFL: 4 LAST_VAL: 1000000
   entry 4 PID:16661 CPU:0 STAMP:0xdc1e3b96c324 IIP:app:getpid+0x2
        OVFL: 4 LAST_VAL: 1000000
   entry 5 PID:16661 CPU:0 STAMP:0xdc1e3ba608df IIP:kernel:break_fault+0x120
        OVFL: 4 LAST_VAL: 1000000
   entry 6 PID:16661 CPU:0 STAMP:0xdc1e3bb54ebd IIP:kernel:break_fault+0x120
        OVFL: 4 LAST_VAL: 1000000
   entry 7 PID:16661 CPU:0 STAMP:0xdc1e3bc4944c IIP:kernel:break_fault+0x120
        OVFL: 4 LAST_VAL: 1000000

   You see that the interrupted IP (IIP), i.e., where we were when the PMU
   interrupted was generated is resolved to a symbol name. Sometimes we are
   in the kernel such as at kernel:break_fault+0x12. The +0x12 is the offset
   from the base address of the symbol. If we look at the code for the symbol
   app:getpid+0x2, we get:
   400000000000df00 <__getpid>:
   400000000000df00:       [MII]       mov r15=1041
   400000000000df06:                   break.i 0x100000;;
   400000000000df0c:                   nop.i 0x0
   400000000000df10:       [MFB]       nop.m 0x0
   400000000000df16:                   nop.f 0x0
   400000000000df1c:                   br.ret.sptk.few b0;;
 
   The +0x2 offset is a byte offset, meaning we are in the first bundle, most
   likely at the break.i instruction.

   k/ miscellaneous sampling options 

   When sampling, the counts printed at the end of the session are not very useful, especially for
   the counters used as sampling periods. By default, pfmon does not print them at the end of a 
   sampling session. You can change this by using the --smpl-print-counts option.

   For certain measurements, it is interesting to put the sampling buffer in "saturation" mode.
   This means that monitoring stops as soon as the buffer fill up. Pfmon is not notified at
   all. In the end, we collect the first N samples, where N is determine by the size of
   the sampling buffer. The processes and session continues until normal termination, but
   no monitoring is active. You can turn on this mode using the --saturate-smpl-buffer.

8/ Further documentation

   You can find a lot of information about the Linux/ia64 kernel in the book:

   	'ia-64 linux kernel design and implementation' 
   	David Mosberger and Stephane Eranian
	Prentice Hall
	ISBN: 0130610143
	Also see http://www.lia64.org for the book's web site.

   This book contains a chapter about the IA-64 PMU, the design of the kernel perfmon subsystem
   and also a small description of pfmon. It is outdated by now as it describes the kernel 
   perfmon-1.x interface but this is a good starting point. The book alsa contains an in-depth
   description of the IA-64 PMU.

   More detailed information about the IA-64 architecture, including the PMU can be found
   on the Intel developers' web site at:

	http://developer.intel.com/design/itanium/family/

10/ Support

   The perfmon project (includes the kernel subsystem and related tools) have a web site
   at:
   	http://www.hpl.hp.com/research/linux/perfmon

   There exists a mailing dedicated to perfmon related discussions. Go to
   the perfmon web site for more details.
   subscribe at:

11/ Bug reports

   Send all bug reports to the perfmon mailing list 

   Patches are also welcomed.

11/05/2003
S.Eranian <eranian@hpl.hp.com>
