$Id: IDEAS 200 2005-02-15 02:24:51Z chasalos $

SDC  1.  Re treatment of missing genotypes or alleles.
	Tentative proposal, needs discussion and confirmation:

	A.  Can a completely unknown genotype be encoded by a missing
	value (NA) in the callCodes slot?

	Scott votes NO.  I don't see much benefit.  Since we will allow
	some missing values to be encoded via levels in the transTables
	slot, I propose that that be the ONLY way we allow encoding of
	missing values.  This has the advantage of consistency, and should
	make life a wee bit easier for programmers.  If we are feeling
	kind, we might still check for missing values in the callCodes
	slot.  But I propose that they result in a call to stop().  This
	also has the advantage of explicitness.  A user, or programmer for
	that matter, must explicitly specify which levels are to be 
	interpreted as "missing" (see missingCodes slot, below).
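
	A minimal sketch of the check proposed in (A).  The function name
	"checkCallCodes" is hypothetical, and callCodes is assumed here to
	be an integer matrix; the real slot layout may differ.

```r
# Hypothetical validity check for the callCodes slot.
# Assumption: callCodes is an integer matrix (samples x markers).
checkCallCodes <- function(callCodes) {
  if (any(is.na(callCodes))) {
    stop("NA found in callCodes; encode missing genotypes as ",
         "explicit levels in transTables instead")
  }
  invisible(TRUE)
}

ok <- matrix(c(1L, 2L, 3L, 1L), nrow = 2)
checkCallCodes(ok)        # passes silently

bad <- ok
bad[1, 1] <- NA
# checkCallCodes(bad)     # would signal an error via stop()
```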

	B.  Proposal:  one or more missing allele values are given
	explicit levels in the translation tables.  Then, they are encoded
	with integers in the callCodes slot in the same way that
	non-missing values are.  For example, you could include levels
	"A/NA", "NA/A", and "NA/NA".  The first two of these levels
	indicate one missing allele in a diploid genotype.  The third
	level indicates two missing alleles, and hence would be equivalent
	to an NA in the callCodes slot.  The extension to polyploids is
	obvious, e.g.  "A/NA/NA/NA" for a tetraploid with one known allele.
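
	To illustrate (B), here is a small sketch in which missing alleles
	get ordinary levels and ordinary integer codes.  The data frame
	layout and the column names "code" and "level" are assumptions made
	for the example, not the agreed structure of a translation table.

```r
# Hypothetical translation table for one diploid locus type.
transTable <- data.frame(
  code  = 1:6,
  level = c("A/A", "A/B", "B/B", "A/NA", "NA/A", "NA/NA"),
  stringsAsFactors = FALSE
)

# Partially and fully missing genotypes are encoded as integers in
# exactly the same way as fully known ones.
genotypes <- c("A/B", "NA/NA", "A/NA", "B/B")
callCodes <- transTable$code[match(genotypes, transTable$level)]
callCodes                      # 2 6 4 3

# Decoding recovers the original strings, missing alleles included.
decoded <- transTable$level[match(callCodes, transTable$code)]
```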

	C.  Some reasons to store partially missing genotypes:
		(1) hemizygotes (e.g. X-chromosome loci in human males)
		(2) dominant loci
 
	D.  Should applications automatically treat the string or
	substring "NA" in a level as missing?  Or should we include a
	slot, or e.g. an attribute of transTables components, to specify
	what string is to be interpreted as a missing value for a
	particular dataset?  If so, we would want it to be able to vary
	among components of the translation table list.
 
	I propose we add a "missingCodes" slot to the data structure for
	this purpose.  This is a list, similar to the transTables slot.
	The names of the list MUST match the names of the transTables list
	exactly.  Each component is a character vector containing allele
	strings to be interpreted as missing for a particular locus type.

	This data representation would make encoding of missing values
	utterly explicit.  And because each component can be a vector of
	length > 1, it would allow faithful storage of e.g. TaqMan data.
	When I have worked with TaqMan data, we distinguished multiple
	types of missingness, e.g.:  NS = no signal, FL = between clusters,
	OL = outlier.  It would be nice to be able to retain such
	distinctions in the data structure, but to be able to easily
	flag all such strings as representing missing values.
	
	For convenience, we could write a function "encodeMissing".  This
	takes the transTables and missingCodes slots and returns some sort
	of structure indicating, for each component, which callCodes
	values (integers) mean one allele missing, which mean two alleles
	missing (and so on if ploidy > 2).
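
	One possible shape for encodeMissing, as a sketch only.  It assumes
	that each transTables component is a character vector of levels
	whose position is the integer code, and that alleles within a level
	are separated by "/".  Neither assumption has been agreed on.

```r
# Sketch of encodeMissing under the assumptions stated above.
# Returns, per component, the number of missing alleles carried by
# each level; the position in the vector is the callCodes integer.
encodeMissing <- function(transTables, missingCodes) {
  # Names of the two lists must match exactly, as proposed.
  stopifnot(identical(names(transTables), names(missingCodes)))
  out <- lapply(names(transTables), function(nm) {
    alleles <- strsplit(transTables[[nm]], "/", fixed = TRUE)
    nMissing <- vapply(alleles,
                       function(a) sum(a %in% missingCodes[[nm]]),
                       integer(1))
    names(nMissing) <- transTables[[nm]]
    nMissing
  })
  names(out) <- names(transTables)
  out
}

# Example with several kinds of missingness (NS, FL, OL).
transTables  <- list(snp = c("A/A", "A/B", "B/B", "A/NS", "FL/FL"))
missingCodes <- list(snp = c("NS", "FL", "OL"))
res <- encodeMissing(transTables, missingCodes)

which(res$snp == 1L)   # codes with one allele missing
which(res$snp == 2L)   # codes with two alleles missing
```

	Returning a count per level keeps the polyploid case simple:  a
	tetraploid level can carry anywhere from 0 to 4 missing alleles
	without any change to the function.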

	E.  In R, as.character(NA) is a missing value of mode character.
	It is NOT the same as the string "NA".  This is a nice feature.
	I think, however, that we should NOT make any use of it.  For
	example, we should NOT allow as.character(NA) in the levels 
	column of a translation table.  Or, if we DO allow it, we should
	also allow missing allele values to be represented by other,
	non-missing strings as well, e.g. "NA".  This is because
	as.character(NA) IS the same as "NA" in S-PLUS, and I want to make
	the port from R to S-PLUS easy.
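
	The distinction described in (E) can be seen directly in R.  This
	only demonstrates R's behaviour; S-PLUS is not run here.

```r
x <- c("A", "NA", as.character(NA))

# Only the third element is a true missing value in R.
naFlags <- is.na(x)

# Only the second element equals the string "NA"; comparing the true
# missing value yields NA rather than TRUE or FALSE.
eqFlags <- x == "NA"

# factor() drops a true NA from the levels by default, so only the
# string "NA" survives as a level.
lev <- levels(factor(x))
```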

	F.  Probably want to add at least an option of creating
	missing-value levels to the makeTransTable function, with an
	argument to specify what strings encode missing alleles.  Likewise,
	all import functions will need to include an argument to specify
	what string(s) in the input signify a missing genotype or allele.

SDC 2.  Re Data Import

	A.  Should look at BioC import functions, e.g. those for affy
	files, for guidance.

	B.  Modularize!  For example, given a ped file:
		(1)  Import (read) file into R with minimal modification,
		creating e.g. a "pedRaw" object
		(2)  Convert the pedRaw object into a geneCodeSet object
		using e.g. gcs method for pedRaw objects:
		as(x, "geneCodeSet")?
		(3)  Write a "load" (for example) function for doing BOTH
		(1) and (2) in a single step.
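
	A rough sketch of the three steps.  The names "readPedRaw",
	"pedRawToGcs" and "loadPed" are hypothetical, the file is assumed to
	be a whitespace-delimited ped file with six leading columns followed
	by pairs of allele columns, and a plain list stands in for the
	geneCodeSet class, which is not defined here.  Once the class
	exists, step (2) could become the as(x, "geneCodeSet") method.

```r
# (1) Read the file with minimal modification into a "pedRaw" object.
readPedRaw <- function(file) {
  raw <- read.table(file, header = FALSE, colClasses = "character")
  structure(list(data = raw), class = "pedRaw")
}

# (2) Convert a pedRaw object.  Stand-in for the geneCodeSet
# conversion: returns a plain list, not a real geneCodeSet.
pedRawToGcs <- function(x) {
  stopifnot(inherits(x, "pedRaw"))
  alleles <- x$data[, -(1:6), drop = FALSE]
  # Pair adjacent allele columns into "a1/a2" genotype strings.
  odd <- seq(1, ncol(alleles), by = 2)
  pairs <- lapply(odd, function(i) {
    paste(alleles[[i]], alleles[[i + 1]], sep = "/")
  })
  list(ids       = x$data[[2]],
       genotypes = matrix(unlist(pairs), nrow = nrow(alleles)))
}

# (3) Convenience wrapper doing both steps at once.
loadPed <- function(file) pedRawToGcs(readPedRaw(file))

# Usage with a tiny two-individual, two-marker file.
tmp <- tempfile()
writeLines(c("F1 1 0 0 1 1 A B A A",
             "F1 2 0 0 2 1 B B 0 0"), tmp)
g <- loadPed(tmp)
unlink(tmp)
```

	Keeping step (1) separate means the raw object can be inspected
	when a conversion fails, without re-reading the file.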

SDC 3.  Re Data Export

	Should write some export functions, e.g. gcsExport(x, "ped") or
	gcsExport(x, "structure"),  fairly early.  This will help in
	testing.  A round of import and export should result in a file
	identical to the original file imported (modulo comment or other
	skipped lines, if any).  (NB:  Giovanni plans to write a gcsImport
	method for "structure" files, which is the format used by Jonathan
	Pritchard's Structure program.)

SDC 4.   We should not wait too long to institute more formal code
	testing.

	We can use available "validation" tools.  At minimum, we should
	store a valid geneCodeSet object, and have a function that
	(a) creates that object programmatically, (b) exports that object
	to a file, and (c) imports that file to create another geneCodeSet
	object.  The two created objects must be identical to the stored
	object.
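
	A sketch of such a round-trip check.  "roundTripOK" is a
	hypothetical helper; the export and import functions are passed in
	as arguments so that gcsExport and gcsImport can be plugged in once
	they exist.  The usage example substitutes a character matrix and
	write.table/read.table as stand-ins.

```r
# Export x to a temporary file, import it again, and compare.
roundTripOK <- function(x, exportFun, importFun) {
  tmp <- tempfile()
  on.exit(unlink(tmp))
  exportFun(x, tmp)
  identical(importFun(tmp), x)
}

# Stand-ins for the real export/import functions (assumptions).
exportStub <- function(x, file) {
  write.table(x, file, quote = FALSE,
              row.names = FALSE, col.names = FALSE)
}
importStub <- function(file) {
  unname(as.matrix(read.table(file, header = FALSE,
                              colClasses = "character")))
}

# (a) Create the object programmatically, then run (b) and (c).
stored <- matrix(c("A/A", "A/B", "B/B", "NA/NA"), nrow = 2)
result <- roundTripOK(stored, exportStub, importStub)
```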

