parseDoc {ssrch}    R Documentation

parse a document and place content in a DocSet

Description

Parse a CSV document and place its content in a DocSet, either a newly created instance or an existing one supplied by the caller.

Usage

parseDoc(csv, DocSetInstance = new("DocSet"), doctitle = NA_character_,
  rec_id_field = "experiment.accession",
  exclude_fields = c("study.accession"),
  substrings_to_omit = c("http://purl.obolibrary.org/obo/"),
  patterns_to_kill = "....-..-..|.*...,...",
  token_fixups = list(c("t''", "t'"), c(":$", "")), max_tok_nchar = 25,
  min_tok_nchar = 4, cleanFields = list("..*id$", ".name$", "_name$",
  "checksum", "isolate", "filename", "^ID$", "barcode", "Sample.Name"))

Arguments

csv

a character(1) CSV file path

DocSetInstance

a DocSet instance to update with new content; if NULL, a DocSet is initialized within this function

doctitle

character(1) document title

rec_id_field

character(1) field in CSV identifying records

exclude_fields

character vector of fields to ignore while parsing

substrings_to_omit

character vector of strings to remove from candidate keywords via gsub

patterns_to_kill

character(1) regexp that identifies tokens to be omitted from keyword set

token_fixups

a list of character(2) vectors that will be used with gsub() to repair irregularities. For example, c("t''", "t'") will transform "Burkitt''s" to "Burkitt's"

max_tok_nchar

numeric(1), defaults to 25; tokens with more characters will be truncated to this length and suffixed with an ellipsis

min_tok_nchar

numeric(1), defaults to 4; tokens shorter than this are omitted from the index

cleanFields

list of regular expressions identifying fields to ignore
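
The filtering arguments above can be pictured with a small base-R sketch. It is not taken from the ssrch source: the field names and tokens are invented, the order of the steps is an assumption, and so are the use of fixed = TRUE for substrings_to_omit and of "..." as the ellipsis. It only shows one plausible reading of what each argument describes.

```r
# Illustrative only: base-R stand-ins for the steps the arguments describe.
# Nothing here is copied from ssrch; inputs below are hypothetical.
fields <- c("experiment.accession", "study.accession", "sample_name",
            "disease", "ID")
tokens <- c("http://purl.obolibrary.org/obo/UBERON_0002107", "2015-06-01",
            "Burkitt''s", "lymphoma:", "RNA",
            "averyveryverylongtokenthatkeepsgoing")

# Argument values (a subset of cleanFields is used for brevity)
exclude_fields <- c("study.accession")
cleanFields <- list("_name$", "^ID$")
substrings_to_omit <- c("http://purl.obolibrary.org/obo/")
patterns_to_kill <- "....-..-..|.*...,..."
token_fixups <- list(c("t''", "t'"), c(":$", ""))
max_tok_nchar <- 25
min_tok_nchar <- 4

# Fields: drop exact matches of exclude_fields and regex matches of cleanFields
drop <- fields %in% exclude_fields
for (rx in cleanFields) drop <- drop | grepl(rx, fields)
kept_fields <- fields[!drop]

# Tokens, step 1: strip unwanted substrings (literal match is an assumption)
for (s in substrings_to_omit) tokens <- gsub(s, "", tokens, fixed = TRUE)

# Step 2: discard tokens matching the kill pattern (here, the date)
tokens <- tokens[!grepl(patterns_to_kill, tokens)]

# Step 3: apply each (pattern, replacement) pair with gsub()
for (fx in token_fixups) tokens <- gsub(fx[1], fx[2], tokens)

# Step 4: drop tokens shorter than min_tok_nchar
tokens <- tokens[nchar(tokens) >= min_tok_nchar]

# Step 5: truncate long tokens and mark the truncation
long <- nchar(tokens) > max_tok_nchar
tokens[long] <- paste0(substr(tokens[long], 1, max_tok_nchar), "...")

print(kept_fields)
print(tokens)
```

In this sketch the date-like token and the three-character token are removed, the ontology URL prefix is stripped, and the doubled quote and trailing colon are repaired. The real function may apply these steps in a different order or with different matching rules.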

Value

instance of DocSet

Examples

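library(ssrch)
myob = ssrch::docset_cancer68
td = tempdir()
# identifiers of the documents held in the DocSet
alld = ls(docs2kw(myob))
# retrieve the first document and write it to a temporary CSV file
r1 = retrieve_doc(alld[1], myob)
expo = file.path(td, "expo.csv")
write.csv(r1, expo)
# parse the CSV into a new DocSet
parseDoc(expo, doctitle=ssrch::titles68[alld[1]])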

[Package ssrch version 1.0.0 Index]