python - Computing separate tfidf scores for two different columns using sklearn
I'm trying to compute the similarity between a set of queries and a set of results for each query, using TF-IDF scores and cosine similarity. The issue I'm having is that I can't figure out how to generate a TF-IDF matrix using two columns (in a pandas DataFrame). I have concatenated the two columns and that works fine, but it's awkward to use since I need to keep track of which query belongs to which result. How would I go about calculating a TF-IDF matrix for two columns at once? I'm using pandas and sklearn.
Here's the relevant code:
    tf = TfidfVectorizer(analyzer='word', min_df=0)
    tfidf_matrix = tf.fit_transform(df_all['search_term'] + df_all['product_title'])  # this line is the issue
    feature_names = tf.get_feature_names()
I'm trying to pass df_all['search_term'] and df_all['product_title'] as arguments to tf.fit_transform. This does not work, since it just concatenates the strings, which does not allow me to compare the search_term to the product_title. Also, is there maybe a better way of going about this?
You've made a good start by just putting all the words together; often a simple pipeline such as this will be enough to produce good results. You can build more complex feature processing pipelines using sklearn's pipeline and preprocessing modules. Here's how it would work for your data:
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.preprocessing import FunctionTransformer
    from sklearn.pipeline import FeatureUnion, Pipeline

    df_all = pd.DataFrame({'search_term': ['hat', 'cat'],
                           'product_title': ['hat stand', 'cat in hat']})

    # One sub-pipeline per column: pull the column out, then vectorize it
    # with its own TfidfVectorizer (and therefore its own vocabulary).
    transformer = FeatureUnion([
        ('search_term_tfidf',
         Pipeline([('extract_field',
                    FunctionTransformer(lambda x: x['search_term'],
                                        validate=False)),
                   ('tfidf', TfidfVectorizer())])),
        ('product_title_tfidf',
         Pipeline([('extract_field',
                    FunctionTransformer(lambda x: x['product_title'],
                                        validate=False)),
                   ('tfidf', TfidfVectorizer())]))])

    transformer.fit(df_all)

    # Note: get_feature_names() was removed in scikit-learn 1.2;
    # newer versions provide get_feature_names_out() instead.
    search_vocab = transformer.transformer_list[0][1].steps[1][1].get_feature_names()
    product_vocab = transformer.transformer_list[1][1].steps[1][1].get_feature_names()
    vocab = search_vocab + product_vocab

    print(vocab)
    print(transformer.transform(df_all).toarray())

This prints:

    ['cat', 'hat', 'cat', 'hat', 'in', 'stand']
    [[ 0.          1.          0.          0.57973867  0.          0.81480247]
     [ 1.          0.          0.6316672   0.44943642  0.6316672   0.        ]]
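Since the question asks about comparing each search_term with its own product_title, it may help to see one way of getting a per-row cosine similarity. The FeatureUnion above gives each column its own vocabulary, so its two blocks of columns are not directly comparable. The sketch below uses a different approach and is an assumption, not part of the answer above: it fits a single TfidfVectorizer on the text of both columns so they share one vocabulary, then transforms each column separately. The names query_tfidf, title_tfidf and row_similarity are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df_all = pd.DataFrame({'search_term': ['hat', 'cat'],
                       'product_title': ['hat stand', 'cat in hat']})

# Fit ONE vectorizer on the text of both columns so that queries and
# titles are represented in the same feature space (shared vocabulary).
tf = TfidfVectorizer(analyzer='word')
tf.fit(pd.concat([df_all['search_term'], df_all['product_title']]))

# Transform each column separately; row i of each matrix belongs to row i
# of the DataFrame, so no extra bookkeeping is needed.
query_tfidf = tf.transform(df_all['search_term'])
title_tfidf = tf.transform(df_all['product_title'])

# TfidfVectorizer L2-normalizes each row by default (norm='l2'), so the
# row-wise dot product of the two matrices is the cosine similarity.
row_similarity = np.asarray(
    query_tfidf.multiply(title_tfidf).sum(axis=1)
).ravel()

df_all['similarity'] = row_similarity
print(df_all)
```

Because both matrices keep the DataFrame's row order, the similarity for each query/result pair can be stored straight back into df_all. If every query needs to be compared against every title instead, sklearn.metrics.pairwise.cosine_similarity(query_tfidf, title_tfidf) returns the full matrix.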