dictionary - R: construct a document-term matrix: how to match dictionaries whose values consist of white-space separated phrases


When text mining using R, after preprocessing the text data, we need to create a document-term matrix for further exploration. As in Chinese, English also has phrases, such as "semantic distance" and "machine learning"; if you segment them into single words, they have totally different meanings. I want to know how to match pre-defined dictionaries whose values consist of white-space separated terms, such as one that contains "semantic distance" and "machine learning". If a document is "we use machine learning method calculate words semantic distance", then applying this document to the dictionary ["semantic distance", "machine learning"] should return a 1x2 matrix: [semantic distance, 1; machine learning, 1].

It's possible with quanteda, although it requires the construction of a dictionary entry for each phrase, and then pre-processing the text to convert the phrases into tokens. To become a "token", the words of a phrase need to be joined by something other than whitespace; here, the "_" character.

    require(quanteda)
    packageVersion("quanteda")
    ## [1] '0.9.5.19'

Here are some example texts, including the phrase in the OP. I added two additional texts for illustration; below, the first row of the document-feature matrix produces the requested answer.

    txt <- c("we use machine learning method calculate words semantic distance.",
             "machine learning best sort of learning.",
             "the distance between semantic distance , machine learning machine driven.")

The current signature of phrasetotoken() requires the phrases argument to be a dictionary or a collocations object. Here we make it a dictionary:

    mydict <- dictionary(list(machine_learning = "machine learning",
                              semantic_distance = "semantic distance"))

Then we pre-process the text to convert the dictionary phrases to their keys:

    txtphrases <- phrasetotoken(txt, mydict)
    txtphrases
    ## [1] "we use machine_learning method calculate words semantic_distance."
    ## [2] "machine_learning best sort of learning."
    ## [3] "the distance between semantic_distance , machine_learning machine driven."
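
For readers who do not have this quanteda version at hand, the phrase-joining step can be approximated in base R. This is only a sketch of the idea, not how phrasetotoken() is implemented. The helper name join_phrases is hypothetical, and it assumes literal matching is good enough: it does not check word boundaries, so a phrase embedded in a longer word would also be joined.

```r
# Hypothetical helper (not part of quanteda): replace each
# whitespace-separated phrase with an underscore-joined version so that
# a tokenizer later treats the phrase as a single token.
join_phrases <- function(texts, phrases) {
  for (p in phrases) {
    joined <- gsub(" ", "_", p, fixed = TRUE)      # "machine learning" -> "machine_learning"
    texts <- gsub(p, joined, texts, fixed = TRUE)  # literal replacement, no regex
  }
  texts
}

txt <- c("we use machine learning method calculate words semantic distance.",
         "machine learning best sort of learning.")
joined_txt <- join_phrases(txt, c("machine learning", "semantic distance"))
joined_txt
```

Literal matching (fixed = TRUE) keeps the sketch safe for phrases that happen to contain regex metacharacters, at the cost of ignoring word boundaries.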

Finally, we can construct the document-feature matrix, keeping only the phrases by using the default "glob" pattern match for any feature that includes the underscore character:

    mydfm <- dfm(txtphrases, keptFeatures = "*_*")
    ## creating dfm character vector ...
    ##   ... lowercasing
    ##   ... tokenizing
    ##   ... indexing documents: 3 documents
    ##   ... indexing features: 20 feature types
    ##   ... kept 2 features, 1 supplied (glob) feature types
    ##   ... created 3 x 2 sparse dfm
    ##   ... complete.
    ## elapsed time: 0.012 seconds.

    mydfm
    ## document-feature matrix of: 3 documents, 2 features.
    ## 3 x 2 sparse matrix of class "dfmsparse"
    ##        features
    ## docs    machine_learning semantic_distance
    ##   text1                1                 1
    ##   text2                1                 0
    ##   text3                1                 1
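
The glob selection and counting can also be illustrated in base R, as a rough sketch rather than a description of what dfm() does internally. It assumes the phrases have already been joined with "_", uses a simple split on non-alphanumeric characters (which will not match quanteda's tokenizer in every case), and the names kept and count_one are hypothetical.

```r
# Assumed input: texts whose phrases were already joined with "_".
txtphrases <- c(text1 = "we use machine_learning method calculate words semantic_distance.",
                text2 = "machine_learning best sort of learning.")

# Lowercase, then split on anything that is not a letter, digit, or underscore.
tokens <- strsplit(tolower(txtphrases), "[^[:alnum:]_]+")
names(tokens) <- names(txtphrases)

# Keep only tokens matching the glob "*_*"; glob2rx() turns the glob into a regex.
pattern <- glob2rx("*_*")
kept <- lapply(tokens, function(tok) tok[grepl(pattern, tok)])

# Count each kept feature per document and bind the rows into a matrix.
features <- sort(unique(unlist(kept)))
count_one <- function(tok) vapply(features, function(f) sum(tok == f), integer(1))
counts <- do.call(rbind, lapply(kept, count_one))
counts
```

The result is a plain dense integer matrix with documents as rows and phrase features as columns, which is enough for small examples; a sparse representation such as quanteda's is preferable for real corpora.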

This is clunky, but as of quanteda 0.9.5.19 that's the simplest way. Once we have added multiple-token phrase entries to dictionary matching (soon!), this will become easier.

