python - Computing separate tfidf scores for two different columns using sklearn

- March 15, 2014

I'm trying to compute the similarity between a set of queries and a set of results for each query. I'd like to do this using tfidf scores and cosine similarity. The issue I'm having is that I can't figure out how to generate a tfidf matrix using two columns (in a pandas dataframe). I have concatenated the two columns and it works fine, but it's awkward to use since it needs to keep track of which query belongs to which result. How would I go about calculating a tfidf matrix for two columns at once? I'm using pandas and sklearn.

Here's the relevant code:

    tf = TfidfVectorizer(analyzer='word', min_df=0)
    tfidf_matrix = tf.fit_transform(df_all['search_term'] + df_all['product_title'])  # this line is the issue
    feature_names = tf.get_feature_names()

I'm trying to pass df_all['search_term'] and df_all['product_title'] as arguments to tf.fit_transform. This does not work, since it just concatenates the strings, which does not allow me to compare the search_term to the product_title. Also, is there maybe a better way of going about this?

You've made a start by putting the words together; a simple pipeline such as this is often enough to produce results. You can build more complex feature processing pipelines using sklearn.pipeline and sklearn.preprocessing. Here's how it would work for your data:

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.preprocessing import FunctionTransformer
    from sklearn.pipeline import FeatureUnion, Pipeline

    df_all = pd.DataFrame({'search_term': ['hat', 'cat'],
                           'product_title': ['hat stand', 'cat in hat']})

    # One sub-pipeline per column: pull the column out of the dataframe,
    # then fit a separate TfidfVectorizer on it.
    transformer = FeatureUnion([
        ('search_term_tfidf',
         Pipeline([('extract_field',
                    FunctionTransformer(lambda x: x['search_term'],
                                        validate=False)),
                   ('tfidf',
                    TfidfVectorizer())])),
        ('product_title_tfidf',
         Pipeline([('extract_field',
                    FunctionTransformer(lambda x: x['product_title'],
                                        validate=False)),
                   ('tfidf',
                    TfidfVectorizer())]))])

    transformer.fit(df_all)

    # Note: scikit-learn 1.2 and later removed get_feature_names();
    # use list(...get_feature_names_out()) there instead.
    search_vocab = transformer.transformer_list[0][1].steps[1][1].get_feature_names()
    product_vocab = transformer.transformer_list[1][1].steps[1][1].get_feature_names()
    vocab = search_vocab + product_vocab

    print(vocab)
    print(transformer.transform(df_all).toarray())

Output:

    ['cat', 'hat', 'cat', 'hat', 'in', 'stand']

    [[ 0.          1.          0.          0.57973867  0.          0.81480247]
     [ 1.          0.          0.6316672   0.44943642  0.6316672   0.        ]]
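The FeatureUnion gives each column its own vocabulary, so the first two output columns and the last four live in different feature spaces and cannot be compared directly. If the goal is the per-row cosine similarity mentioned in the question, one possible variation (not part of the original answer, and assuming a single shared vocabulary is acceptable) is to fit one vectorizer on the text of both columns and transform each column separately. The `similarity` column name is just an illustrative choice:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df_all = pd.DataFrame({'search_term': ['hat', 'cat'],
                       'product_title': ['hat stand', 'cat in hat']})

# Assumption: fit ONE vectorizer on the text of both columns so that
# both matrices share the same vocabulary (same column order).
tf = TfidfVectorizer(analyzer='word')
tf.fit(pd.concat([df_all['search_term'], df_all['product_title']]))

search_tfidf = tf.transform(df_all['search_term'])
title_tfidf = tf.transform(df_all['product_title'])

# TfidfVectorizer L2-normalizes each row by default (norm='l2'), so the
# row-wise dot product of the two matrices is the cosine similarity.
row_similarity = np.asarray(
    search_tfidf.multiply(title_tfidf).sum(axis=1)).ravel()

# 'similarity' is a hypothetical column name used for illustration.
df_all['similarity'] = row_similarity
print(df_all)
```

Because row i of both matrices comes from row i of the dataframe, each query stays paired with its own result without any extra bookkeeping.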
