parseDoc {ssrch} | R Documentation |
parse a document and place content in a DocSet
parseDoc(csv, DocSetInstance = new("DocSet"), doctitle = NA_character_, rec_id_field = "experiment.accession", exclude_fields = c("study.accession"), substrings_to_omit = c("http://purl.obolibrary.org/obo/"), patterns_to_kill = "....-..-..|.*...,...", token_fixups = list(c("t''", "t'"), c(":$", "")), max_tok_nchar = 25, min_tok_nchar = 4, cleanFields = list("..*id$", ".name$", "_name$", "checksum", "isolate", "filename", "^ID$", "barcode", "Sample.Name"))
csv |
a character(1) CSV file path |
DocSetInstance |
a DocSet instance; if NULL, a DocSet is initialized in this function, otherwise the supplied instance is updated with new content |
doctitle |
character(1) document title |
rec_id_field |
character(1) field in CSV identifying records |
exclude_fields |
character vector of fields to ignore while parsing |
substrings_to_omit |
character vector of strings to remove from candidate keywords via gsub |
patterns_to_kill |
character(1) regexp that identifies tokens to be omitted from keyword set |
token_fixups |
a list of character(2) vectors that will be used with gsub() to repair irregularities. For example, c("t''", "t'") will transform "Burkitt''s" to "Burkitt's" |
max_tok_nchar |
numeric(1), defaults to 25; tokens with more characters will be truncated to this length and suffixed with an ellipsis |
min_tok_nchar |
numeric(1), defaults to 4; tokens shorter than this are not included in the index |
cleanFields |
list of regular expressions identifying fields to ignore |
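The token-related arguments above can be pictured with a small base-R sketch. It is only an approximation for illustration, not the ssrch implementation: the sample tokens are made up, and the order of the steps, the literal (fixed) matching of substrings_to_omit, and the exact form of the ellipsis are assumptions.

```r
# Illustrative only: a base-R approximation of the token cleanup described
# by the arguments above. The sample tokens are hypothetical.
tokens <- c("http://purl.obolibrary.org/obo/UBERON_0002107",
            "Burkitt''s",
            "2014-05-01",
            "stage:",
            "RNA",
            "adenocarcinoma_of_the_gastroesophageal_junction")

# Default values copied from the parseDoc usage line
substrings_to_omit <- c("http://purl.obolibrary.org/obo/")
patterns_to_kill <- "....-..-..|.*...,..."
token_fixups <- list(c("t''", "t'"), c(":$", ""))
max_tok_nchar <- 25
min_tok_nchar <- 4

# 1. Remove unwanted substrings (literal matching is an assumption)
for (s in substrings_to_omit) {
  tokens <- gsub(s, "", tokens, fixed = TRUE)
}

# 2. Drop tokens that match the kill pattern (here, the date-like token)
tokens <- tokens[!grepl(patterns_to_kill, tokens)]

# 3. Apply each (pattern, replacement) pair with gsub()
for (fx in token_fixups) {
  tokens <- gsub(fx[1], fx[2], tokens)
}

# 4. Drop tokens shorter than min_tok_nchar
tokens <- tokens[nchar(tokens) >= min_tok_nchar]

# 5. Truncate tokens longer than max_tok_nchar and mark them with an ellipsis
too_long <- nchar(tokens) > max_tok_nchar
tokens[too_long] <- paste0(substr(tokens[too_long], 1, max_tok_nchar), "...")

print(tokens)
```

In this sketch the ontology URL prefix is stripped, the date-like token and the three-character token are dropped, the doubled apostrophe and trailing colon are repaired, and the long token is cut to 25 characters plus an ellipsis.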
instance of DocSet
myob = ssrch::docset_cancer68
td = tempdir()
alld = ls(docs2kw(myob))
r1 = retrieve_doc(alld[1], myob)
expo = write.csv(r1, paste0(td, "/expo.csv"))
parseDoc(paste0(td, "/expo.csv"), doctitle=ssrch::titles68[alld[1]])
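Before parsing a CSV, it can help to see which column names the default cleanFields patterns would match. The following is a minimal sketch using base R only; the field names are hypothetical, and case-sensitive, unanchored grepl() matching is an assumption about how the patterns are applied.

```r
# Illustrative only: check hypothetical CSV field names against the
# default cleanFields patterns from the parseDoc usage line.
cleanFields <- list("..*id$", ".name$", "_name$", "checksum", "isolate",
                    "filename", "^ID$", "barcode", "Sample.Name")

# Hypothetical column names, not taken from any ssrch data set
fields <- c("experiment.accession", "sample_id", "tissue",
            "library_name", "ID", "disease", "md5_checksum")

# A field is flagged when at least one pattern matches its name
ignored <- vapply(
  fields,
  function(f) any(vapply(cleanFields, grepl, logical(1), x = f)),
  logical(1)
)

print(fields[ignored])   # fields that would be flagged under these assumptions
print(fields[!ignored])  # fields that would be kept
```

Running a check like this on names(read.csv(path)) before calling parseDoc may make it easier to decide whether cleanFields or exclude_fields needs adjusting for a particular file.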