Predicting the returns of orders for a retail shoe seller
By jiwei • 10 June 2018 • Thesis • 2,352 words (10 pages)
TSIA-SD 210 Challenge 2018
Predicting the returns of orders for a retail shoe seller
Group members: XIA Jin
YI Muyang
ZHANG Jiwei
Data preprocessing
- Raw training data
At first, we used only the data from train.csv, without products.csv and customers.csv. A random forest with default hyperparameters gave a classifier with a training accuracy of around 65% and a test accuracy of 60%. The features LineItem, UnitPMPEUR and TotalLineItems had high importance in this random forest.
- Use products.csv and customers.csv
To get more features, we also used products.csv and customers.csv. We joined X_train and X_test with products.csv on the key VariantId to add information about the products, and with customers.csv on the key CustomerId to add information about the customers.
Some features, such as ProductColorId and CustomerId, take a very large number of distinct values, so one-hot encoding them would add a large number of unnecessary features. We therefore removed them at first and come back to them later. These features are OrderCreationDate, OrderNumber, VariantId, CustomerId, OrderShipDate, BillingPostalCode, BrandId, ProductId, ProductColorId and SupplierColor.
Some data are missing. The provided function “funk_mask” creates a new column for these NaN values, in the style of one-hot encoding: the new column is named “featureName_nan”, and its value is 1 if the data is missing.
For columns that hold a floating-point number, such as MaxSize and MinSize, we think it is more meaningful to replace the missing data with the mean of the column than to create a new NaN column. So for the columns PurchasePriceHT, CalfTurn, UpperHeight, MinSize, MaxSize, HeelHeight, BirthDate and FirstOrderDate, we calculate the mean and use it to replace the missing values.
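The mean imputation described above can be sketched in plain Python as follows. This is only an illustration under assumptions: the helper names `is_missing` and `fill_with_mean` are hypothetical, rows are represented as dictionaries, and the values are made up rather than taken from the challenge data.

```python
import math

def is_missing(value):
    """Treat None and float NaN as missing (an assumption about the raw data)."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def fill_with_mean(rows, column):
    """Return new rows where missing values of `column` are replaced by
    the mean of the observed values of that column."""
    observed = [row[column] for row in rows if not is_missing(row[column])]
    mean = sum(observed) / len(observed)
    return [
        {**row, column: mean if is_missing(row[column]) else row[column]}
        for row in rows
    ]

# Toy rows with made-up values.
rows = [
    {"MinSize": 36.0},
    {"MinSize": float("nan")},
    {"MinSize": 40.0},
]
filled = fill_with_mean(rows, "MinSize")
```

The original rows are left untouched, so the same helper can be applied column by column to the list of columns given above.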
After joining all the tables, removing some columns and filling in the missing data, we get a test accuracy of 68% with a default random forest.
Useful methods
- Method to compare feature importance
- The random forest’s “feature_importances” gives us the importance of each feature.
- We implemented a faster method to evaluate a single feature. The average return rate in the training set is 0.2077. We select all the samples that satisfy some condition on one feature and calculate the return rate among them; this gives a rough idea of how important the feature is. For example, if the rate is far from 0.2077, the feature could be important.
[pic 1]
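A minimal sketch of the quick feature check described above, assuming the rows are dictionaries and the label is stored under a hypothetical key `returned` (1 if returned, 0 otherwise). The function name `conditional_return_rate` and the toy values are illustrative only; 0.2077 is the average return rate quoted in the text.

```python
GLOBAL_RATE = 0.2077  # average return rate in the training set (from the text)

def conditional_return_rate(rows, condition, label="returned"):
    """Return (return rate, sample count) among the rows satisfying `condition`."""
    selected = [row for row in rows if condition(row)]
    if not selected:
        return None, 0
    rate = sum(row[label] for row in selected) / len(selected)
    return rate, len(selected)

# Toy rows with made-up values.
toy_rows = [
    {"TotalLineItems": 1, "returned": 0},
    {"TotalLineItems": 1, "returned": 0},
    {"TotalLineItems": 1, "returned": 0},
    {"TotalLineItems": 3, "returned": 1},
    {"TotalLineItems": 2, "returned": 1},
    {"TotalLineItems": 4, "returned": 0},
]
rate, count = conditional_return_rate(toy_rows, lambda r: r["TotalLineItems"] > 1)
gap = rate - GLOBAL_RATE  # a large absolute gap hints at an informative feature
```

Returning the sample count alongside the rate makes it easier to ignore gaps that are based on very few samples.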
- Method to evaluate a classifier
- We can use a cross-validation function to evaluate a classifier.
- We use a simpler method to evaluate a classifier. Each time, we select 100,000 samples and split them into 60,000 training samples and 40,000 test samples. We train the classifier on the training set and score it on the test set, which gives a rough idea of the classifier’s quality after training on a small amount of data.
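The subsampling step of this quick evaluation could look like the sketch below. The function name `subsample_split`, the seed and the total row count of 500,000 are assumptions made for illustration; training and scoring the classifier on the two index lists is left to the caller.

```python
import random

def subsample_split(n_rows, n_sample=100_000, n_train=60_000, seed=0):
    """Draw `n_sample` distinct row indices at random and split them into
    a training part (`n_train` indices) and a test part (the rest)."""
    rng = random.Random(seed)
    chosen = rng.sample(range(n_rows), n_sample)  # already in random order
    return chosen[:n_train], chosen[n_train:]

# 500_000 is an arbitrary stand-in for the size of the training table.
train_idx, test_idx = subsample_split(500_000)
```

Changing the seed between runs gives a different subsample each time, which matches the "each time we select" wording above.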
Feature engineering
- Remove some features
The CountryISOCode in customers.csv is almost the same as ISOCode in the order table, and the SeasonLabel in products.csv is almost the same as the one in the order table. We remove them from the table.
We wrote a function to remove the columns that have only one value.
In addition, according to the random forest, DeviceTypeLabel has zero importance, so we remove it from the table.
- Is the same gender
If a customer buys a gift for someone else, it is probably more likely to be returned. In particular, if a man buys a gift for a woman or vice versa, the gift is more likely to be unsuitable. For this reason, we create a new feature is_same_gender that indicates whether the buyer bought a product intended for their own gender. If we don’t know the gender label of the product, e.g. it is marked “Sac”, we fill this column with “unknown”.
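As an illustration, the is_same_gender feature could be derived as sketched below. The label values are assumptions: only “Sac” appears in the text, while "Homme", "Femme", "M" and "F" are hypothetical stand-ins for whatever the product and customer tables actually contain, and the output values "same" and "different" are also chosen for this sketch.

```python
# Hypothetical mapping from product gender labels to customer gender codes.
PRODUCT_GENDER = {"Homme": "M", "Femme": "F"}

def is_same_gender(customer_gender, product_label):
    """Return "same", "different" or "unknown" for one order line."""
    product_gender = PRODUCT_GENDER.get(product_label)
    if product_gender is None or customer_gender not in ("M", "F"):
        return "unknown"  # e.g. product marked "Sac", or customer gender missing
    return "same" if customer_gender == product_gender else "different"

examples = [
    is_same_gender("M", "Homme"),
    is_same_gender("F", "Homme"),
    is_same_gender("F", "Sac"),
]
```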
- Size description
Each product has a size description that indicates whether the product runs true to size. There are in general 3 types of size preference: the real size is smaller, bigger or normal. However, there are 15 distinct descriptions. We group them into 4 classes: “big”, “small”, “normal” and “None”.
- Ship time
We calculate the ship time as OrderShipDate minus OrderCreationDate.
- Max-Min Size
For a given type of shoe, if the sizes vary widely, there is a bigger chance of errors. For example, a client orders a size 42 shoe but receives a size 41. We create a new feature “max_min”, calculated as MaxSize minus MinSize.
- Age
The age of a client could also be an important feature. A young client may be less inclined to return products. We calculate it as OrderCreationDate minus BirthDate.
- Shopping age
An experienced client might make fewer mistakes when shopping online. We calculate the shopping age as OrderCreationDate minus FirstOrderDate.
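The three date-based features above (ship time, age and shopping age) are all date differences. A minimal sketch with the standard library follows; the helper `days_between`, the choice of units (days, and years for the age) and the example dates are assumptions, since the text does not specify them.

```python
from datetime import date

def days_between(later, earlier):
    """Number of days from `earlier` to `later`, or None if either is missing."""
    if later is None or earlier is None:
        return None
    return (later - earlier).days

# One made-up order line joined with its customer information.
order = {
    "OrderCreationDate": date(2016, 3, 10),
    "OrderShipDate": date(2016, 3, 12),
    "BirthDate": date(1986, 3, 10),
    "FirstOrderDate": date(2014, 3, 10),
}

ship_time = days_between(order["OrderShipDate"], order["OrderCreationDate"])
age_years = days_between(order["OrderCreationDate"], order["BirthDate"]) / 365.25
shopping_age_days = days_between(order["OrderCreationDate"], order["FirstOrderDate"])
```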
- Compared price
We have the real transaction price in the train table and the official price in the product table. We observed that if a client buys a product for less than the official price, the product is less likely to be returned, with a return rate of about 15%. We add the compared price as PurchasePriceHT minus UnitPMPEUR. This is a very important feature: it increases the accuracy to 70%.
The values of the following four features (BrandID, ProductColorID, SupplierColor, ProductID) are sequences of numbers. At first we thought they were meaningless, so we simply removed them. After a while, however, we realised from prior knowledge of the problem that they should be used, since the color or brand of a product does influence its return rate. So we looked for a reasonable way to use these four features.
These four features are missing in 377620 of the 1067290 samples, so 689670 samples carry useful information if we use these features. This is a substantial amount of data.
- BrandID
There are 665 distinct brand ids. This is not a large number, which makes our implementation practical.
We calculate the return rate for every brand. Then we check whether more than 200 products of this brand were bought; this threshold is empirical. If so, the return rate is meaningful because there are enough samples. If not, the return rate may be unreliable, so we discard it and use the average return rate instead.
Finally, for the feature brand id, we either map the id to its return rate or, otherwise, use the average return rate.
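The brand id encoding described above can be sketched as follows. The function names `build_rate_mapping` and `encode_rate` are hypothetical, the count is taken over rows (an assumption about what "number of products bought" means), and the toy data is made up.

```python
from collections import defaultdict

def build_rate_mapping(ids, labels, min_count=200):
    """Map each id seen more than `min_count` times to its return rate.
    Also return the overall average return rate used as the fallback."""
    counts = defaultdict(int)
    returns = defaultdict(int)
    for key, label in zip(ids, labels):
        counts[key] += 1
        returns[key] += label
    average = sum(labels) / len(labels)
    mapping = {k: returns[k] / counts[k] for k in counts if counts[k] > min_count}
    return mapping, average

def encode_rate(key, mapping, average):
    """Return the id's own return rate if it is reliable, else the average."""
    return mapping.get(key, average)

# Toy data: brand "A" has 300 rows (90 returned), brand "B" only 50 (25 returned).
ids = ["A"] * 300 + ["B"] * 50
labels = [1] * 90 + [0] * 210 + [1] * 25 + [0] * 25
mapping, average = build_rate_mapping(ids, labels)
encoded = [encode_rate(k, mapping, average) for k in ("A", "B", "unseen")]
```

Because the mapping is built from labels, it would normally be computed on the training rows only and then applied unchanged to the test rows.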
- ProductColorID
There are 52386 distinct product color ids, far more than the number of brand ids. Only 289 of them have more than 100 products bought (lowering the threshold from 200 to 100 does not help much). This is a problem for our implementation: it is not reasonable to reuse the same idea as for brand id.
We calculate the return rate for every product color id. Then we check whether more than 100 products with this product color id were bought. If so, we continue to the next step; if not, we simply skip this product color id.
If the count passes the threshold, the calculated return rate is considered meaningful. We assign 1, 0 or -1 according to three situations: 1 if the return rate is higher than the average return rate by at least 5%, -1 if it is lower than the average return rate by at least 5%, and 0 otherwise.
Finally, for the feature product color id, the value is 1, 0 or -1.
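A sketch of this three-valued encoding is given below, under two stated assumptions: "at least 5%" is read as 5 percentage points (an absolute margin of 0.05), and ids that are skipped because they have too few samples are encoded as 0. The function name `ternary_encoding` and the toy data are illustrative only.

```python
from collections import Counter

def ternary_encoding(ids, labels, min_count=100, margin=0.05):
    """Map each sufficiently frequent id to 1, 0 or -1 by comparing its
    return rate with the average return rate."""
    counts = Counter(ids)
    returns = Counter()
    for key, label in zip(ids, labels):
        returns[key] += label
    average = sum(labels) / len(labels)
    mapping = {}
    for key, n in counts.items():
        if n <= min_count:
            continue  # too few samples: skipped, treated as 0 at lookup time
        rate = returns[key] / n
        if rate >= average + margin:
            mapping[key] = 1
        elif rate <= average - margin:
            mapping[key] = -1
        else:
            mapping[key] = 0
    return mapping

# Toy data: three frequent ids with different return rates and one rare id.
ids = ["hi"] * 150 + ["lo"] * 150 + ["mid"] * 150 + ["rare"] * 10
labels = ([1] * 90 + [0] * 60) + ([1] * 15 + [0] * 135) + ([1] * 50 + [0] * 100) + [1] * 10
mapping = ternary_encoding(ids, labels)
values = [mapping.get(k, 0) for k in ("hi", "lo", "mid", "rare")]
```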
- SupplierColor
Supplier color is handled in the same way as product color id.
Finally, for the feature supplier color, the value is 1, 0 or -1.
- ProductID
Product id is handled in the same way as product color id.
Finally, for the feature product id, the value is 1, 0 or -1.
Conclusion of the feature engineering: we end up with 299 features.
Hyperparameters
- Sample weight
We observed that the labels are not balanced: the ratio of true labels to false labels is about 1:3.8. To avoid a bias towards the false label, we add a sample weight {1:3.8} to the classifier. However, in practice we did not observe any improvement in the classifier’s performance.
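The 1:3.8 ratio is consistent with the average return rate of 0.2077 quoted earlier, and the weighting can be sketched as below. This assumes that {1:3.8} means a weight of 3.8 for the returned class (label 1) and a weight of 1 for the other class; the helper `sample_weights` is hypothetical and the labels are made up.

```python
RETURN_RATE = 0.2077  # average return rate in the training set (from the text)

# Ratio of non-returned to returned samples implied by the return rate.
imbalance_ratio = (1 - RETURN_RATE) / RETURN_RATE

# Assumed reading of {1:3.8}: label 1 weighs 3.8, label 0 weighs 1.
CLASS_WEIGHT = {0: 1.0, 1: 3.8}

def sample_weights(labels, class_weight):
    """Expand a per-class weight into one weight per sample."""
    return [class_weight[label] for label in labels]

weights = sample_weights([0, 0, 1, 0], CLASS_WEIGHT)
```

A per-sample list like `weights` is the kind of input that many classifier implementations accept when fitting.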
...