dictionary - R: constructing a document-term matrix that matches dictionaries whose values consist of white-space-separated phrases
When text mining using R, after preprocessing the text data, I need to create a document-term matrix for further exploration. Similar to Chinese, English has phrases such as "semantic distance" and "machine learning"; if we segment them word by word, they take on totally different meanings. I want to know how to match pre-defined dictionaries whose values consist of white-space-separated terms, for example a dictionary containing "semantic distance" and "machine learning". If a document is "we use the machine learning method to calculate the words semantic distance", then applying this document to the dictionary ["semantic distance", "machine learning"] should return a 1x2 matrix: [semantic distance, 1; machine learning, 1]
It's possible in quanteda, although it requires the construction of a dictionary for each phrase, and then pre-processing the text to convert the phrases into tokens. To become a "token", the phrases need to be joined by something other than whitespace -- here, the "_" character.
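The underscore-joining idea can be sketched in base R, independently of quanteda (the phrase vector below is illustrative, not part of the original answer):

```r
# Join each known multi-word phrase with "_" so it survives tokenization
# as a single token. fixed = TRUE treats the phrase as a literal string.
phrases <- c("machine learning", "semantic distance")
txt <- "we use the machine learning method to calculate the words semantic distance"
for (p in phrases) {
  txt <- gsub(p, gsub(" ", "_", p, fixed = TRUE), txt, fixed = TRUE)
}
txt
## "we use the machine_learning method to calculate the words semantic_distance"
```

This is essentially what quanteda's phrase-to-token pre-processing does, with pattern matching handled more robustly.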
require(quanteda)
packageVersion("quanteda")
## [1] '0.9.5.19'
Here are some example texts, including the phrase in the OP. I have added two additional texts for illustration -- below, the first row of the document-feature matrix produces the requested answer.
txt <- c("We use the machine learning method to calculate the words semantic distance.",
         "Machine learning is the best sort of learning.",
         "The distance between semantic distance and machine learning is machine driven.")
The current signature for phrasetotoken() requires the phrases argument to be a dictionary or collocations object. Here we make the dictionary:
mydict <- dictionary(list(machine_learning = "machine learning",
                          semantic_distance = "semantic distance"))
Then we pre-process the text, converting the dictionary phrases into their keys:
txtphrases <- phrasetotoken(txt, mydict)
txtphrases
## [1] "We use the machine_learning method to calculate the words semantic_distance."
## [2] "Machine_learning is the best sort of learning."
## [3] "The distance between semantic_distance and machine_learning is machine driven."
Finally, we can construct the document-feature matrix, keeping only the phrases by using the default "glob" pattern match for any feature that includes the underscore character:
mydfm <- dfm(txtphrases, keptFeatures = "*_*")
## Creating a dfm from a character vector ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 3 documents
## ... indexing features: 20 feature types
## ... kept 2 features, from 1 supplied (glob) feature types
## ... created a 3 x 2 sparse dfm
## ... complete.
## Elapsed time: 0.012 seconds.

mydfm
## Document-feature matrix of: 3 documents, 2 features.
## 3 x 2 sparse Matrix of class "dfmSparse"
##        features
## docs    machine_learning semantic_distance
##   text1                1                 1
##   text2                1                 0
##   text3                1                 1
This is a bit clunky as of quanteda 0.9.5.19, but it's the simplest way to do it. Once we have added multi-word phrase entries to dictionary matching (soon!), this will become easier.
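For readers on a modern quanteda (v1 and later), multi-word dictionary values are matched directly by tokens_lookup(), with no underscore pre-processing step. A minimal sketch, assuming quanteda >= 1.0 and using the same example dictionary:

```r
library(quanteda)

txt <- c("We use the machine learning method to calculate the words semantic distance.")

mydict <- dictionary(list(machine_learning = "machine learning",
                          semantic_distance = "semantic distance"))

# tokens_lookup() matches multi-word dictionary values against token
# sequences directly, replacing each match with its dictionary key
toks <- tokens_lookup(tokens(txt), mydict)

# build the document-feature matrix over the dictionary keys
dfm(toks)
```

This yields one column per dictionary key, counting each phrase occurrence, which is exactly the 1x2 result requested in the question.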