
Mercari is Japan’s biggest community-powered shopping app, with offices in Japan and the United States. We can therefore expect an e-commerce dataset based on Japanese and US products and customer data.
Problem Statement:
Mercari wants to offer pricing suggestions to sellers based solely on product descriptions and a few additional categorical attributes. Sellers offer a variety of new and used products of different brands. The files consist of a list of product listings and are tab-delimited (.tsv files):
train_id or test_id – ID of the listing
name – Title of the listing. Note that the data has been cleaned to remove text that looks like prices (e.g. $20) to avoid leakage; these removed prices are represented as [rm]
item_condition_id – Condition of the item, as provided by the seller
category_name – Category of the listing
brand_name – Name of the brand
price – The price the item was sold for, in USD. This is the target variable to predict; this column doesn’t exist in test.tsv since that is what you will predict.
shipping – 1 if the shipping fee is paid by the seller, 0 if paid by the buyer
item_description – Full description of the item. Note that the data has been cleaned to remove text that looks like prices (e.g. $20) to avoid leakage; these removed prices are represented as [rm]
Evaluation Metric: Root Mean Squared Logarithmic Error (RMSLE)

The metric is defined as

ϵ = sqrt( (1/n) * Σ_{i=1..n} ( log(p_i + 1) − log(a_i + 1) )^2 )

Where:
ϵ is the RMSLE value (score)
n is the total number of observations in the (public/private) data set,
p_i is your prediction of price, and
a_i is the actual sale price for i.
log(x) is the natural logarithm of x
Why Log?
As the figure below shows, the price distribution is skewed to the right, so to maintain a good variance/spread among price values it is advisable to predict the logarithm of the price. Since log(0) is undefined, we add 1 to all values.

Overview:
- Importing Data
- Exploratory Data Analysis
- Exploring Categorical Data
- Exploring Numerical Data
- Feature Extraction, Preprocessing and Imputation
- Checking for Duplicate data
- Checking for Null values
- Feature Extraction
- Text preprocessing
- Imputation
- Vectorization and Training Models
- Removing Outliers
- Vectorizing Categorical data
- Vectorizing Numerical data
- Training Models
- Summary
Importing Data
Data Source : https://www.kaggle.com/c/mercari-price-suggestion-challenge/data
There is no duplicate data.
Basic information about the dataset:
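The listings can be loaded with pandas; a minimal sketch, assuming train.tsv has been downloaded from the Kaggle link above:

```python
import pandas as pd

# Load the tab-delimited training listings (path assumed from the Kaggle download).
train = pd.read_csv("train.tsv", sep="\t")

print(train.shape)   # number of listings and columns
train.info()         # dtypes and non-null counts per feature
```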

From the above information, null values are present in 3 features:
category_name has 6,327 null values (0.4% of the data is missing)
brand_name has 632,682 null values (42% of the data is missing)
item_description has only 4 null values
Exploratory Data Analysis
Exploring Categorical Data
category_name:
Observation: Users are most interested in women’s apparel, beauty products and electronics on this site.
item_condition_id:

Observation: item_condition_id 5 has the fewest records, covering only 0.1% of the total data.
shipping:

Observation: The shipping fee is mostly paid by the buyer.
brand_name:
There are 4,809 brands in total; below are the top 10 brands.

Since PINK, Nike and Victoria’s Secret are the top brands in the dataset, let’s check what kinds of products they offer.

PINK mostly offers women’s apparel.

Nike offers women’s and kids’ sports apparel, mostly shoes.

Victoria’s Secret offers women’s beauty products and apparel.
Exploring Numerical Data
Price:


Observation:
- Considering the variety of products, more than 80% of products cost less than $50.
- The logarithm of price has a better distribution and spread than the raw price.
- A logarithm-based metric such as Root Mean Squared Logarithmic Error can therefore be used to evaluate a model.
Correlations among features



- From the above correlations, it is observed that item condition and shipping are both inversely correlated with price.
- This can be interpreted as: as item_condition_id increases from 1 to 5, the price tends to decrease, i.e. 1 is the best condition and 5 is the worst.
- Since these observations are based on a variety of products, it is advisable to compute correlations per variety, i.e. per category name or brand name.
Correlation based on Brands

Correlation based on Categories

- From the above correlations based on brand names and categories, it is again observed that item condition and shipping are both inversely correlated with price.
- It can be interpreted as: as item_condition_id increases from 1 to 5, the price decreases, i.e. 1 is the best condition and 5 is the worst.
- Similarly for shipping: when the shipping fee is paid by the buyer (shipping = 0), the price tends to be higher.
Feature Extraction, Text Preprocessing and Imputation
Checking for Duplicate Data:



Observation:
- The dataset contains no fully duplicated records.
- When only name and item_description are considered, 0.5% of the data is duplicated.
- When name, item_description and price are considered, 0.16% of the data is duplicated.
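A minimal sketch of these checks, assuming the DataFrame `train` loaded earlier:

```python
# Fraction of duplicated records for different column subsets.
print(train.duplicated().mean())                                        # full-row duplicates
print(train.duplicated(subset=["name", "item_description"]).mean())     # name + description
print(train.duplicated(subset=["name", "item_description", "price"]).mean())
```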
Checking for Null Data:


Observation:
- 5% of records have no item description
- 42% of records have no brand name
- 0.4% of records have no category
Feature Extraction – summary:
name and item_description are concatenated
Feature Extraction – polarity, num_sentence, num_variety_count, num_value_sum, quantity_value, date_ind:
Here the TextBlob and spaCy libraries are used to extract these features:
- polarity = polarity score of ‘summary’ using the TextBlob library
- num_sentence = number of sentences in summary
- num_variety_count = no. of varieties of numerical data (cardinals, ordinals, money etc.)
- num_value_sum = sum of the values of numerical data
- quantity_value = quantity of product in that summary
- date_ind = check if date is present then 1 else 0 (Date indicator)

The polarity of the item_description can indicate how good the condition of the product is, and it should be correlated with item_condition_id.

The polarity of an item description is computed with the TextBlob library before any text preprocessing, because the text may contain stopwords and special symbols that express the seller’s viewpoint more strongly, so we get a truer polarity of the text.
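A rough sketch of how these text features could be computed with TextBlob and spaCy (the helper name and the exact entity labels used are assumptions, not the exact original code):

```python
import spacy
from textblob import TextBlob

nlp = spacy.load("en_core_web_sm")
NUMERIC_LABELS = {"CARDINAL", "ORDINAL", "MONEY", "QUANTITY", "PERCENT"}

def text_features(summary):
    """Polarity, sentence count and numeric-entity features for one summary."""
    doc = nlp(summary)
    numeric_ents = [ent for ent in doc.ents if ent.label_ in NUMERIC_LABELS]
    return {
        "polarity": TextBlob(summary).sentiment.polarity,            # in [-1, 1]
        "num_sentence": len(list(doc.sents)),
        "num_variety_count": len({ent.label_ for ent in numeric_ents}),
        "date_ind": int(any(ent.label_ == "DATE" for ent in doc.ents)),
    }

print(text_features("Brand new Nike shoes, size 8. Bought 2 pairs in 2019 for $60."))
```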
Observation: Polarity values are high when item_condition_id is 1 and low when it is 5, i.e. based on the descriptions provided by users, products with item_condition_id 5 are in poor condition.
Feature Extraction : cos_sim
Cosine similarity of the title and item description, computed using the spaCy library.
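A possible implementation using spaCy’s built-in vector similarity (a model with word vectors, e.g. en_core_web_md, is assumed):

```python
import spacy

nlp = spacy.load("en_core_web_md")  # medium model ships with word vectors

def cos_sim(title, description):
    # Doc.similarity returns the cosine similarity of the averaged word vectors.
    return nlp(title).similarity(nlp(description))

print(cos_sim("Nike running shoes", "Lightly used Nike running shoes, size 9"))
```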
Feature Engineering : num_avg
Average of the numerical values in the summary text: num_avg = num_value_sum / num_variety_count

Feature Engineering : log_num_avg
Logarithmic value of the ‘num_avg’ feature: log_num_avg = log(num_avg + 1)

Imputation – brand_names:
Since brand_name has the most null values, and the brand can define the price of a product, it is essential to find ways to fill the empty brand names.
- One approach is to look for brand names in the ‘item_description’ and ‘name’ features. The values of name and item_description are joined and stored in a new feature named ‘summary’.
- All the brand names are stored in a set, and each ‘summary’ is searched for them. A brand-change indicator is also created to record whether the brand was provided (0) or extracted (1).
- Sometimes the seller may not put much effort into writing the brand name correctly.


So we search for brand names by looking deeper into the ‘summary’ feature.
Here we remove stopwords and special characters from the brand names and the summary feature, and add the cleaned brand names to a set that is used to find brand names in the summary.
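A simplified sketch of this imputation step (it assumes the ‘summary’ column already exists and that brand names have been cleaned; the original version also searches at the word level):

```python
import pandas as pd

# Set of known brand names taken from the listings that already have one.
known_brands = set(train["brand_name"].dropna().unique())

def impute_brand(row):
    """Return (brand_name, brnd_change): keep the given brand, or extract one
    from the summary text when the brand is missing."""
    if pd.notnull(row["brand_name"]):
        return row["brand_name"], 0
    summary = str(row["summary"])
    for brand in known_brands:
        if brand and brand in summary:
            return brand, 1           # brand found inside the summary text
    return row["brand_name"], 0       # still unknown

train[["brand_name", "brnd_change"]] = train.apply(
    impute_brand, axis=1, result_type="expand")
```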
Feature Extraction : brnd_change
An indicator of whether ‘brand_name’ was added during imputation.
Text Preprocessing : brand_name
removing special characters in ‘brand_name’ feature
Text Preprocessing : summary
removing special characters and stopwords from the ‘summary’ feature
Imputation – brand_name (deep) :
Users may write brand names in the item description without any punctuation or symbols, so here we search for and impute brand names after preprocessing of the brand_name and summary data.
Imputation : brnd_change
An indicator of whether ‘brand_name’ was added by the imputation process.

15,000 more brand names were found and added to brand_name, i.e. we filled almost 1% of the brand_name data.
Text Preprocessing : summary
lemmatization of preprocessed ‘summary’ feature
Imputation – category_name :
Filling category names based on the number of words that match the category_name. Each category is filled based on the brand_name.
e.g. for the ‘adidas’ brand:
- step 1 : Select records whose brand name is ‘adidas’ and that have no category.
- step 2 : Create a list of sets of category_name words from the records obtained in step 1.
- step 3 : Create a set of summary words and iterate over each element obtained in step 2, intersecting each element with the summary words to find the number of common words between the category name and the summary.
- step 4 : The category with the highest number of common words in step 3 becomes the category name for that record.
- step 5 : If there are no common words, check whether the brand name was manually changed using ‘brnd_change’. If it was, remove the brand name and check the record’s category again across all products and categories.
- step 6 : In the case of ‘nan’ brand names, all categories are checked and matched against the summary to find a suitable category for it.

- Checking null categories brand-wise: if a brand has any null category_names, we collect the kinds of products (category_names) that this particular brand offers as a list of sets.
- Split the summary of the record into a set of words and intersect it with each set of category_name words, i.e. find the number of common words between category and summary. The summary falls under the category of the selected brand with the highest number of common words.
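A condensed sketch of this matching logic for a single brand (variable names are illustrative):

```python
def best_category(summary_words, candidate_categories):
    """Pick the candidate category sharing the most words with the summary."""
    best, best_overlap = None, 0
    for cat in candidate_categories:
        cat_words = set(cat.replace("/", " ").lower().split())
        overlap = len(cat_words & summary_words)
        if overlap > best_overlap:
            best, best_overlap = cat, overlap
    return best                        # None when there are no common words

brand = "adidas"
candidates = train.loc[train["brand_name"] == brand, "category_name"].dropna().unique()
missing = (train["brand_name"] == brand) & train["category_name"].isnull()
for idx in train[missing].index:
    words = set(str(train.at[idx, "summary"]).lower().split())
    cat = best_category(words, candidates)
    if cat is not None:
        train.at[idx, "category_name"] = cat
        train.at[idx, "category_change"] = 1
```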
Feature Extraction : category_change
An indicator of whether ‘category_name’ was added during imputation.
Results after imputing category_name :

Feature Engineering : item_condition_id
converting numerals to strings, since it has only 5 categories in numerical form.
Feature Extraction Summary
- summary = concatenated name and item_description
- polarity = polarity score of ‘summary’ using ‘textblob’ library
- num_sentence = number of sentences in summary
- num_variety_count = no. of varieties of numerical data
- num_value_sum = sum of the values of numerical data
- quantity_value = quantity of product
- date_ind = if date is present then 1 else 0 (Date indicator)
- category_change = if the category is manually added after preprocessing then 1 else 0 (genuine category indicator )
- brnd_change = if the brand is manually added after preprocessing then 1 else 0 (genuine brand indicator )
- cos_sim = cosine similarity of Title and Item description using spacy Library
Feature Engineering / Imputation Summary
- Imputation : brand_name – finding brand names in summary text
- Text preprocessing : summary – removing stopwords, special characters and lemmatization of the preprocessed summary
- Text preprocessing : brand_name – removing stopwords, special characters in brand names
- Imputation : brand_name (deep) – finding brand names after preprocessing of brand names and summary text
- Feature Engineering : brnd_name, brand_change – regularization of modified brand names, removing irrelevant brand names and resetting the change indicator of brand names while imputing ‘category_name’
- Feature Engineering : num_avg – average of numerical values in summary text; num_avg = num_value_sum / num_variety_count
- Feature Engineering : log_num_avg – logarithmic value of the ‘num_avg’ feature
- Feature Engineering : item_condition_id – converting numerals to strings, since it has only 5 categories in numerical form
Vectorization and Training Models
Removing Outliers
Creating Final dataframe with necessary features
Outliers fall into two cases:
- If the price is 0, the seller is giving the item away for free.
- If the price is too high, the probability of selling the item is low.


Observation:
- There are 874 records whose price is 0, i.e. the item is free.
- 5% of records are eliminated based on the 2nd standard deviation of the price data. This helps deal with extreme outliers such as price = 0 or very high prices.
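A sketch of the filtering (whether the exact cut-off is the mean plus two standard deviations of the raw price is an assumption based on the description above):

```python
# Keep listings priced above 0 and within two standard deviations of the mean.
upper = train["price"].mean() + 2 * train["price"].std()

before = len(train)
train = train[(train["price"] > 0) & (train["price"] <= upper)]
print(f"Removed {before - len(train)} records ({(before - len(train)) / before:.1%})")
```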
Train-Test Split:
75% of data is dedicated to training.
Vectorizing Categorical data
category_name:
Here a custom tokenizer is created, which builds tokens by splitting the text on ‘/’, for situations like the one below:
e.g. 1a. Handmade/Patterns/Accessories has 1 item in the dataset
1b. Handmade/Patterns/Baby has 1 item in the dataset
There could be a record with ‘Handmade/Patterns/{any other non-frequent category}’ in its category field; because each level becomes its own token, we can still predict prices for ‘Handmade/Patterns’ products even when the last level is a non-frequent category.
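A sketch of such a tokenizer plugged into scikit-learn’s CountVectorizer (whether CountVectorizer or TfidfVectorizer was used here is an assumption):

```python
from sklearn.feature_extraction.text import CountVectorizer

def category_tokenizer(text):
    # 'Handmade/Patterns/Accessories' -> ['handmade', 'patterns', 'accessories']
    return [level.strip() for level in text.split("/") if level.strip()]

cat_vectorizer = CountVectorizer(tokenizer=category_tokenizer, binary=True)
cat_train_bow = cat_vectorizer.fit_transform(x_train["category_name"].fillna(""))
cat_test_bow = cat_vectorizer.transform(x_test["category_name"].fillna(""))
```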

brand_name:

Item_condition_id:
Since there are only 5 categories with single-character values between 1 and 5, it is vectorized at the character level.

summary:
Vectorizing summary text using TfidfVectorizer with min_df = 10.

shipping:
Binary category with numerical values [0, 1]
x_train['shipping'].values.reshape(-1, 1)
x_test['shipping'].values.reshape(-1, 1)
brnd_change:
Binary category with numerical values [0, 1]
x_train['brnd_change'].values.reshape(-1, 1)
x_test['brnd_change'].values.reshape(-1, 1)
category_change:
Binary category with numerical values [0, 1]
x_train['category_change'].values.reshape(-1, 1)
x_test['category_change'].values.reshape(-1, 1)
date_ind:
Binary category with numerical values [0, 1]
x_train['date_ind'].values.reshape(-1, 1)
x_test['date_ind'].values.reshape(-1, 1)
Vectorizing Numerical data
All the numerical data is standardized; below are the details.
polarity_scalar : Standardized ‘polarity’ feature
num_sent_scalar : Standardized ‘num_sentence’ feature
num_var_scalar : Standardized ‘num_variety_count’ feature
qty_scalar : Standardized ‘quantity_value’ feature
log_scalar : Standardized ‘log_num_avg’ feature
cos_scalar : Standardized ‘cos_sim’ feature
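A sketch of the standardization and of stacking all the vectorized pieces into one sparse feature matrix (column and variable names are illustrative and follow the ones used above):

```python
from scipy.sparse import hstack, csr_matrix
from sklearn.preprocessing import StandardScaler

num_cols = ["polarity", "num_sentence", "num_variety_count",
            "quantity_value", "log_num_avg", "cos_sim"]
scaler = StandardScaler()
num_train_std = scaler.fit_transform(x_train[num_cols])
num_test_std = scaler.transform(x_test[num_cols])

binary_cols = ["shipping", "brnd_change", "category_change", "date_ind"]

# Final sparse matrices: categorical BoW blocks + standardized numericals + binaries.
X_train = hstack([cat_train_bow, brand_train_bow, summ_train_bow,
                  csr_matrix(num_train_std),
                  csr_matrix(x_train[binary_cols].values)]).tocsr()
X_test = hstack([cat_test_bow, brand_test_bow, summ_test_bow,
                 csr_matrix(num_test_std),
                 csr_matrix(x_test[binary_cols].values)]).tocsr()
```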
Evaluation Metric
A custom evaluation metric, Root Mean Squared Logarithmic Error (RMSLE), is created since it is not available among scikit-learn’s built-in scoring parameters.
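A minimal implementation of RMSLE wrapped as a scikit-learn scorer (a sketch; the clipping of negative predictions is an added safeguard):

```python
import numpy as np
from sklearn.metrics import make_scorer

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error."""
    y_pred = np.clip(y_pred, 0, None)   # guard against negative predictions
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# greater_is_better=False: grid/random search will maximize the negated RMSLE.
rmsle_scorer = make_scorer(rmsle, greater_is_better=False)
```

Note that if a model is trained on log1p(price), plain RMSE on the transformed target gives the same number.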
Training Models
Hyperparameter Tuning of models
Function definitions for GridSearchCV and RandomizedSearchCV were made, which take the training data, a model and its parameters as input and return a DataFrame of cross-validation scores.
Function definition for Grid search cross-validation:

Function definition for Random search cross-validation:
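A sketch of what such helper functions might look like, reusing the rmsle_scorer defined above (parameter choices such as cv=3 are assumptions):

```python
import pandas as pd
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

def grid_search_cv(model, params, X, y, cv=3):
    """Grid search over `params`, scored by RMSLE; returns the CV results."""
    gs = GridSearchCV(model, params, scoring=rmsle_scorer, cv=cv,
                      n_jobs=-1, return_train_score=True)
    gs.fit(X, y)
    return pd.DataFrame(gs.cv_results_)

def random_search_cv(model, params, X, y, cv=3, n_iter=10):
    """Randomized search over `params`, scored by RMSLE; returns the CV results."""
    rs = RandomizedSearchCV(model, params, n_iter=n_iter, scoring=rmsle_scorer,
                            cv=cv, n_jobs=-1, return_train_score=True,
                            random_state=42)
    rs.fit(X, y)
    return pd.DataFrame(rs.cv_results_)
```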

The models below are used for training the data and finding the best hyperparameters to obtain the best score, i.e. the lowest RMSLE:
- Linear Regression
- Ridge Regression
- XGBoost Regressor
- XGBoost Random Forest Regressor
- RandomForest
- Wordbatch NN_ReLU_H1
Feature Set 1 : cat_train_bow, brand_train_bow, summ_train_bow, item_cond_train_std, shipp_train_std, polarity_train_std, brnd_change_train_std, cat_change_train_std, num_sent_train_std, num_val_train_std, dti_train_std
Feature Set 2 : cat_train_bow, brand_train_bow, summ_train_bow, item_cond_train_std, cos_train_std, shipp_train_std, polarity_train_std, brnd_change_train_std, cat_change_train_std, num_sent_train_std, num_var_train_std, num_val_train_std, qty_train_std, dti_train_std, log_train_std
Feature Set 3 : cat_train_bow, brand_train_bow, summ_train_bow, item_cond_train_bow, cos_train_std, x_train['shipping'].values.reshape(-1, 1), polarity_train_std, x_train['brnd_change'].values.reshape(-1, 1), x_train['category_change'].values.reshape(-1, 1), num_sent_train_std, num_var_train_std, qty_train_std, x_train['date_ind'].values.reshape(-1, 1), log_train_std
Below are the performances of each model.
Linear Regression
Training Description | Best Parameters | RMSLE
---|---|---
Feature Set 1 | – | 0.75897
Feature Set 2 | – | 0.68443
Feature Set 3 | – | 0.60311
Feature Set 3 (95% data) | – | 0.43815
Observation: The WordBatch model has the best Root Mean Squared Logarithmic Error of 0.39593.
Future work
- Add Word2Vec features; these can be obtained from the gensim and spaCy libraries and the training data.
- Train the data using an attention model and check the results.
- Use the coefficient of variation to get more detail and better understand the spread of the data.
- Use the coefficient of determination to check in depth the dependency of the dependent variable on the independent variables.
Conclusion
Mercari provided seller data with the aim of finding a solution that suggests an appropriate price to sellers based on the product. My aim was to focus on retrieving information from the data and using various feature engineering techniques based on that information, which helps to achieve a good RMSLE score. The lesson I learnt is to take more time to look into the raw data and put yourself into the problem, so that it becomes easier to make assumptions, analogies and scenarios.
Sources
https://www.appliedaicourse.com/
https://towardsdatascience.com/predict-product-success-using-nlp-models-b3e87295d97
https://towardsdatascience.com/build-and-compare-3-models-nlp-sentiment-prediction-67320979de61
https://www.youtube.com/watch?v=QFR0IHbzA30
https://viblo.asia/p/predict-independent-values-with-text-data-using-linear-regression-aWj5314eZ6m