read.dna {ape} | R Documentation |
This function reads DNA sequences from a file and returns them as a list, with the taxa names read from the file used as the names of the vectors in the list. For consistency with other functions in APE, the sequences are returned in lower case.
read.dna(file, format = "interleaved", skip = 0, nlines = 0, comment.char = "#", seq.names = NULL)
file |
a file name specified by either a variable of mode character, or a double-quoted string. |
format |
a character string specifying the format of the DNA sequences. Three choices are possible: "interleaved", "sequential", or "fasta", or any unambiguous abbreviation of these. |
skip |
the number of lines of the input file to skip before beginning to read data. |
nlines |
the number of lines to be read (by default the file is read until its end). |
comment.char |
a single character; the remainder of the line after this character is ignored. |
seq.names |
the names to give to each sequence; by default the names read in the file are used. |
This function follows the interleaved and sequential formats defined in PHYLIP (Felsenstein, 1993), with the additional feature that there is no restriction on the lengths of the taxa names (though a data file with 10-character-long taxa names is fine as well). For these two formats, the first line of the file must contain the dimensions of the data (the number of taxa and the number of nucleotides); the sequences are considered aligned and thus must be of the same length for all taxa. For the FASTA format, the conventions defined in the URL below (see References) are followed; the sequences are taken as non-aligned. For all formats, the nucleotides can be arranged in any way, with blanks and line breaks inside (with the restriction that the first ten nucleotides must be contiguous for the interleaved and sequential formats, see below). The names of the sequences are read from the file unless the `seq.names' option is used. Particularities for each format are detailed below.
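The header and equal-length requirements for the interleaved and sequential formats can be checked by hand before calling read.dna. The following is a minimal base-R sketch, not part of APE: the helper `read_phylip_dims` is a hypothetical name, and the body parsing assumes the simplest sequential layout (one taxon per line, with the name as the first whitespace-delimited field).

```r
# Hypothetical helper (not part of ape): read the dimension line of a
# PHYLIP-style file and return the declared numbers of taxa and nucleotides.
read_phylip_dims <- function(path) {
  header <- readLines(path, n = 1)
  dims <- as.integer(strsplit(trimws(header), "[[:space:]]+")[[1]])
  if (length(dims) != 2 || anyNA(dims)) {
    stop("first line must hold two integers: number of taxa, number of nucleotides")
  }
  c(ntaxa = dims[1], nsites = dims[2])
}

# Write a small sequential-format file to a temporary location.
tmp <- tempfile(fileext = ".txt")
writeLines(c("3 40",
             "No305 NTTCGAAAAACACACCCACTACTAAAANTTATCAGTCACT",
             "No304 ATTCGAAAAACACACCCACTACTAAAAATTATCAACCACT",
             "No306 ATTCGAAAAACACACCCACTACTAAAAATTATCAATCACT"),
           tmp)

dims <- read_phylip_dims(tmp)

# Assumption: one taxon per line, name first, sequence after it.
body <- readLines(tmp)[-1]
seqs <- sub("^[^[:space:]]+[[:space:]]+", "", body)
seq_lengths <- nchar(gsub("[[:space:]]", "", seqs))

# Aligned data: every sequence must match the declared length.
length(body) == dims[["ntaxa"]]
all(seq_lengths == dims[["nsites"]])

unlink(tmp) # clean-up
```

A check like this only covers the one-line-per-taxon case; interleaved files or sequences broken over several lines would need the blocks to be concatenated per taxon first.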
A list of DNA sequences, each made of a single vector of mode character where each element is a nucleotide.
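To make the shape of the returned value concrete, the sketch below builds an object of the same structure by hand in base R. It does not call read.dna; the two short sequences are truncated from the example data purely for illustration.

```r
# Hand-built illustration of the described structure; these objects are
# NOT produced by read.dna().
raw <- c(No305 = "NTTCGAAAAA", No304 = "ATTCGAAAAA")

# One character vector per taxon, one nucleotide per element, lower case.
dna <- lapply(raw, function(s) strsplit(tolower(s), "")[[1]])

str(dna)
names(dna)      # taxa names
dna$No305[1:3]  # first three nucleotides of one sequence
```

With this layout, ordinary list and vector indexing is enough to pick a taxon or a site.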
Emmanuel Paradis paradis@isem.univ-montp2.fr
Anonymous. FASTA format description. http://www.ncbi.nlm.nih.gov/BLAST/fasta.html
Anonymous. IUPAC ambiguity codes. http://www.ncbi.nlm.nih.gov/SNP/iupac.html
Felsenstein, J. (1993) Phylip (Phylogeny Inference Package) version 3.5c. Department of Genetics, University of Washington. http://evolution.genetics.washington.edu/phylip/phylip.html
read.GenBank, write.dna, dist.dna, woodmouse
### a small extract from `data(woodmouse)'
cat("3 40",
"No305 NTTCGAAAAACACACCCACTACTAAAANTTATCAGTCACT",
"No304 ATTCGAAAAACACACCCACTACTAAAAATTATCAACCACT",
"No306 ATTCGAAAAACACACCCACTACTAAAAATTATCAATCACT",
file = "exdna.txt", sep = "\n")
ex.dna <- read.dna("exdna.txt", format = "sequential")
str(ex.dna)
ex.dna
### the same data in interleaved format...
cat("3 40",
"No305 NTTCGAAAAA CACACCCACT",
"No304 ATTCGAAAAA CACACCCACT",
"No306 ATTCGAAAAA CACACCCACT",
" ACTAAAANTT ATCAGTCACT",
" ACTAAAAATT ATCAACCACT",
" ACTAAAAATT ATCAATCACT",
file = "exdna.txt", sep = "\n")
ex.dna2 <- read.dna("exdna.txt")
### ... and in FASTA format
cat("> No305",
"NTTCGAAAAACACACCCACTACTAAAANTTATCAGTCACT",
"> No304",
"ATTCGAAAAACACACCCACTACTAAAAATTATCAACCACT",
"> No306",
"ATTCGAAAAACACACCCACTACTAAAAATTATCAATCACT",
file = "exdna.txt", sep = "\n")
ex.dna3 <- read.dna("exdna.txt", format = "fasta")
### These are the same!
identical(ex.dna, ex.dna2)
identical(ex.dna, ex.dna3)
unlink("exdna.txt") # clean-up