CODES.txt - Notes on the format of the bSAM cache file
Copyright (C) 2007 Hewlett-Packard Development Company, L.P.

Tue Jan 31 09:34:32 MST 2006

New bSAM binary data format:
All data entries are in the same format:
  type (2 bytes)
  size (2 bytes, unsigned) -- length of the data in bytes (zero means "no data")
  data (exactly "size" bytes)
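
A minimal sketch of walking such entries, in Python. The notes do not state
a byte order, so big-endian is an ASSUMPTION here, and the name iter_entries
is hypothetical. The early stop on type 0000 follows the EOF rule given in
the type list (size and data are not required).

```python
import struct

def iter_entries(buf):
    """Yield (type, data) pairs from a bytes object holding bSAM entries.

    ASSUMPTION: the 2-byte type and size fields are big-endian; the notes
    do not specify the byte order.
    """
    pos = 0
    while pos + 2 <= len(buf):
        (entry_type,) = struct.unpack_from(">H", buf, pos)
        pos += 2
        if entry_type == 0x0000:
            # EOF: size and data are optional, so stop without reading them.
            yield entry_type, b""
            return
        if pos + 2 > len(buf):
            raise ValueError("truncated size field")
        (size,) = struct.unpack_from(">H", buf, pos)
        pos += 2
        data = buf[pos:pos + size]
        if len(data) != size:
            raise ValueError("entry data shorter than its size field")
        pos += size
        yield entry_type, data
```

Parsing from an in-memory buffer keeps the sketch short; a real reader
could apply the same steps to a file stream.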

Basic Type categories:
Type & 0xFF00 == 0x0000 :: File oriented
Type & 0xFF00 == 0x0100 :: Function oriented
Type & 0xF000 == 0xF000 :: Comment
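
The category test can be sketched in Python as below. The function name
category is hypothetical. The sketch tests the high byte (mask 0xFF00) so
that every code listed in these notes, including 0010 and 0110, lands in
its category; that choice of mask is an assumption drawn from the listed
codes.

```python
def category(entry_type):
    """Classify a 16-bit type code as file, function, comment, or unknown."""
    if (entry_type & 0xF000) == 0xF000:
        return "comment"
    high = entry_type & 0xFF00  # ASSUMPTION: the high byte picks the category
    if high == 0x0000:
        return "file"
    if high == 0x0100:
        return "function"
    return "unknown"
```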

All strings should be null terminated!
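
A small sketch of reading a string field in Python. The helper name
decode_string is hypothetical, and Latin-1 is an ASSUMPTION, since the
notes do not name a character encoding.

```python
def decode_string(data):
    """Return the text that precedes the first NUL byte in a data field."""
    text, _, _ = data.partition(b"\x00")
    # ASSUMPTION: Latin-1, which maps every byte value to a character.
    return text.decode("latin-1")
```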

The different types:
0000 EOF -- size and data are not required
0001 File name -- data is string
0002 File checksum -- data is the checksum (may be a string)
0003 File license -- data is string
0004 File type -- data is string (e.g., "C", "Java", "Class", "Obj")
	File type is used to ensure that comparisons are only done
	between data of the same type
0010 File unique value -- data is string

0101 Function name -- data is string
0103 Function license -- see 0003 File license
0104 Function type -- see 0004 File type
0108 Function tokens -- data contains tokens, 2 bytes each!
0110 Function unique value -- data string
0118 Function tokens OR list -- 2 byte tokens, at least one of which must
	be in the comparison token list.  If none of them are, skip the
	comparison.
0128 Function tokens AND list -- 2 byte tokens, all of which must be in the
	comparison token list.
0131 Byte offset to start in untokenized file
0132 Byte offset to end in untokenized file
	The size of both is always 4 bytes.
	0131 and 0132 values should be reset to "undefined" each time
	0101 is seen.
0138 Byte offsets for tokens.
	Each token in the 0108 tag will be represented by 1 byte here.
	This byte is the number of bytes to skip between tokens.
	This way, matches can be calculated down to specific locations
	in the file rather than general ranges specified by 0131 - 0132.
	There is one extra byte here so the end of the last token is known.
	E.g., if the match starts at token #7, then its offset is found with:
	  for(i=0; i<7; i++) RealOffset += TokenValue0138[i];
0140 Single-sentence licenses (text, tokenized into space-separated tokens)
01FF End of Function (ok to start processing) -- size is always zero.
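
The 0118 (OR) and 0128 (AND) checks described above can be sketched in
Python as follows. The names unpack_tokens and should_compare are
hypothetical. Big-endian tokens are an ASSUMPTION, as is treating a failed
AND list as "skip the comparison" (the notes say this outright only for
the OR list).

```python
import struct

def unpack_tokens(data):
    """Split a data field into 2-byte tokens (ASSUMPTION: big-endian)."""
    count = len(data) // 2
    return list(struct.unpack(">%dH" % count, data[:count * 2]))

def should_compare(other_tokens, or_list=(), and_list=()):
    """Return False when the OR/AND lists rule the comparison out.

    other_tokens is the comparison token list; or_list and and_list hold
    the tokens from the 0118 and 0128 entries.
    """
    present = set(other_tokens)
    # OR list: at least one token must appear in the comparison list.
    if or_list and not any(tok in present for tok in or_list):
        return False
    # AND list: every token must appear in the comparison list.
    if not all(tok in present for tok in and_list):
        return False
    return True
```

Running this check before the full token comparison lets a mismatch be
rejected with a few set lookups.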

F001 File Comment
F101 Function Comment
FFFF General Comment
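
The offset calculation described for 0138 can be sketched in Python as
below. The name token_offset and the base parameter are hypothetical;
whether the sum should start from the 0131 start offset or from zero is
not stated in these notes, so the caller supplies it.

```python
def token_offset(skips, token_index, base=0):
    """Sum the 0138 skip bytes that precede the token at token_index.

    skips is the 0138 data field (one byte per token, plus one extra byte
    marking the end of the last token).
    """
    if not 0 <= token_index <= len(skips):
        raise IndexError("token index outside the 0138 data")
    # Same arithmetic as: for(i=0; i<token_index; i++) offset += skips[i]
    return base + sum(skips[:token_index])
```

For example, token_offset(skips, 7) adds skips[0] through skips[6], which
matches the loop given for a match starting at token #7.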

All unknown types are skipped (treated as comments).
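
Putting the rules together, a reader loop might look like the Python sketch
below. It takes (type, data) pairs that have already been parsed, resets
the 0131/0132 offsets whenever 0101 is seen, emits a record at 01FF, and
ignores every type it does not handle. The name collect_functions, the
record layout, and the big-endian 4-byte offsets are all ASSUMPTIONS made
for illustration.

```python
import struct

def collect_functions(entries):
    """Group (type, data) pairs into one dict per function."""
    functions = []
    current = {"name": None, "start": None, "end": None}
    for entry_type, data in entries:
        if entry_type == 0x0000:      # EOF: stop reading
            break
        if entry_type == 0x0101:      # Function name: offsets become undefined
            current["name"] = data
            current["start"] = None
            current["end"] = None
        elif entry_type == 0x0131:    # start offset (ASSUMPTION: big-endian)
            current["start"] = struct.unpack(">I", data)[0]
        elif entry_type == 0x0132:    # end offset (ASSUMPTION: big-endian)
            current["end"] = struct.unpack(">I", data)[0]
        elif entry_type == 0x01FF:    # End of Function: ok to process
            functions.append(dict(current))
        # Anything else, including comments and unknown types, is skipped.
    return functions
```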


=====================================================================
Wed Feb  1 12:48:49 MST 2006

With the new data format, I don't need to separate .o, .java, .c, and .class
files.  I can store them all in one directory.
BUT: Different file formats need different bsam parameters.
(The number of similar tokens varies based on the language.)
For this reason, I am still keeping them separate.

