Predicting the returns of orders for a retail shoe seller


By jiwei • 10 June 2018


TSIA-SD 210 Challenge 2018

Group members: XIA Jin, YI Muyang, ZHANG Jiwei

Data preprocessing

Raw training data

At first, we use only the data from train.csv, without products.csv and customers.csv. With a random forest using default hyperparameters, we get a classifier with a training accuracy of around 65% and a test accuracy of 60%. In addition, features such as LineItem, UnitPMPEUR and TotalLineItems have high importance in the random forest.

Use products.csv and customers.csv

To get more features, we also use products.csv and customers.csv. We join X_train and X_test with products.csv on the key VariantId to add information about the products, and with customers.csv on the key CustomerId to add information about the customers.
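The two joins described above can be sketched with pandas as follows. This is only an illustration: the helper name `add_product_and_customer_info`, the toy tables and the `Gender` column are assumptions, and only the keys VariantId and CustomerId come from the report.

```python
import pandas as pd

def add_product_and_customer_info(orders, products, customers):
    """Left-join product and customer attributes onto an order table."""
    # how="left" keeps every order row, even when no product or customer matches.
    merged = orders.merge(products, on="VariantId", how="left")
    merged = merged.merge(customers, on="CustomerId", how="left")
    return merged

# Tiny made-up tables, only to show the shape of the result.
orders = pd.DataFrame({"VariantId": [1, 2, 1], "CustomerId": [10, 11, 12]})
products = pd.DataFrame({"VariantId": [1, 2], "BrandId": [100, 200]})
customers = pd.DataFrame({"CustomerId": [10, 11], "Gender": ["F", "M"]})
merged = add_product_and_customer_info(orders, products, customers)
```

Orders without a matching customer (CustomerId 12 here) keep their row and get NaN in the customer columns, which is one source of the missing values handled next.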

Some features, such as ProductColorId and CustomerId, take a very large number of distinct values, so one-hot encoding them would add a large number of unnecessary features. We therefore remove them at first and use them later. These features are OrderCreationDate, OrderNumber, VariantId, CustomerId, OrderShipDate, BillingPostalCode, BrandId, ProductId, ProductColorId and SupplierColor.

Some data are missing. The provided function “funk_mask” creates a new indicator column for these NaN values: the new column is named “featureName_nan”, and its value is 1 if the data is missing.

For columns that hold a numeric value, such as MaxSize and MinSize, we think it is more meaningful to replace missing values by the column mean than to create a new NaN column. So for the columns PurchasePriceHT, CalfTurn, UpperHeight, MinSize, MaxSize, HeelHeight, BirthDate and FirstOrderDate, we compute the mean and use it to fill the missing values.
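The two ways of handling missing values described above could look like the sketch below. It is not the provided “funk_mask” function, whose code is not shown in the report; `add_nan_indicator` and `fill_with_mean` are hypothetical helpers, and the toy data is made up. The mean filling is shown for numeric columns only.

```python
import numpy as np
import pandas as pd

def add_nan_indicator(df, column):
    """Add a 0/1 column '<column>_nan' that flags missing values."""
    out = df.copy()
    out[f"{column}_nan"] = out[column].isna().astype(int)
    return out

def fill_with_mean(df, columns):
    """Replace missing values in numeric columns by the column mean."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].fillna(out[col].mean())
    return out

# Made-up example: one numeric column and one categorical column.
df = pd.DataFrame({"MinSize": [36.0, np.nan, 40.0],
                   "SeasonLabel": ["A", None, "B"]})
df = add_nan_indicator(df, "SeasonLabel")
df = fill_with_mean(df, ["MinSize"])
```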

After combining all the tables, removing some columns and filling in missing data, we get a test accuracy of 68% with the default random forest.

Useful methods

Method to compare feature importance

The random forest attribute “feature_importances_” gives us the importance of each feature.

We also implement a faster method to evaluate a single feature. The average return rate in the training set is 0.2077. We select all the samples that satisfy some condition on one feature and compute the return rate among them; this gives a rough idea of how important the feature is. For example, if the rate is far from 0.2077, the feature could be important.

[pic 1]
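A minimal sketch of this quick check, assuming the label is stored in a 0/1 column. The column name `returned`, the helper `return_rate_gap` and the toy data are assumptions made for illustration.

```python
import pandas as pd

def return_rate_gap(df, condition, label="returned"):
    """Return (rate in the subset, gap to the overall rate, subset size)."""
    overall = df[label].mean()
    subset = df[condition]
    rate = subset[label].mean()
    return rate, rate - overall, len(subset)

# Made-up data: does having 3 or more line items change the return rate?
df = pd.DataFrame({
    "TotalLineItems": [1, 1, 2, 3, 3, 3],
    "returned":       [0, 0, 0, 1, 1, 0],
})
rate, gap, n = return_rate_gap(df, df["TotalLineItems"] >= 3)
```

The subset size is returned alongside the rate because a large gap computed on very few samples is not reliable evidence of importance.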

Method to evaluate a classifier

We could use a cross-validation function to evaluate a classifier.

Instead, we use a simpler method. Each time, we select 100,000 samples: 60,000 for training and 40,000 for testing. We train the classifier on the training part and score it on the test part, which gives a rough idea of its quality after training on a small amount of data.
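The subsample evaluation above can be sketched as below, assuming X and y are NumPy arrays and the classifier exposes `fit` and `score` methods, as scikit-learn classifiers do. The function name `quick_evaluate` and its parameters are hypothetical.

```python
import numpy as np

def quick_evaluate(clf, X, y, n_train=60_000, n_test=40_000, seed=0):
    """Train on a random subset and return accuracy on a disjoint subset."""
    rng = np.random.default_rng(seed)
    n_total = min(n_train + n_test, len(X))
    # Sampling without replacement keeps train and test rows disjoint.
    idx = rng.choice(len(X), size=n_total, replace=False)
    # Keep the train/test proportion if fewer rows are available than requested.
    cut = n_total * n_train // (n_train + n_test)
    train_idx, test_idx = idx[:cut], idx[cut:]
    clf.fit(X[train_idx], y[train_idx])
    return clf.score(X[test_idx], y[test_idx])
```

With scikit-learn installed, a call such as `quick_evaluate(RandomForestClassifier(), X, y)` would return the test accuracy on the 40,000 held-out samples. Changing `seed` gives a different random subset, which helps judge how noisy the estimate is.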

Feature engineering

Remove some features

The CountryISOCode column in customers.csv is almost the same as ISOCode in the order table. The SeasonLabel column in products.csv is almost the same as the one in the order table. We remove these duplicates from the table.

We write a function to remove the columns that have only one value.

In addition, as observed from the random forest, DeviceTypeLabel has zero importance. We remove it from the table.
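The report does not show the function that removes single-valued columns; one possible version with pandas is sketched below. The name `drop_constant_columns` and the choice to count NaN as a value are assumptions.

```python
import pandas as pd

def drop_constant_columns(df):
    """Drop columns that contain a single distinct value."""
    # dropna=False counts NaN as a value, so a column mixing NaN
    # with one constant is kept.
    n_unique = df.nunique(dropna=False)
    return df.loc[:, n_unique > 1]

# Made-up table: "a" is constant, "b" varies, "c" mixes a constant with NaN.
df = pd.DataFrame({"a": [1, 1, 1], "b": [1, 2, 3], "c": ["x", None, "x"]})
reduced = drop_constant_columns(df)
```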

Is the same gender

If a customer buys a gift for someone else, it is probably more likely to be returned. In particular, if a man buys a gift for a woman or vice versa, the gift is more likely to be unsuitable. For this reason, we create a new feature is_same_gender that indicates whether the buyer purchases a product intended for their own gender. If we don’t know the gender label of the product, e.g. it is marked “Sac”, we fill this column with “unknown”.
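As an illustration, the feature could be computed per row as below. The label values “Femme” and “Homme” and the output values “same” and “different” are assumptions, since the report only mentions “Sac” and “unknown”.

```python
def is_same_gender(customer_gender, product_gender):
    """Return 'same', 'different' or 'unknown' for one order line."""
    # Assumed label values; the real data may use other strings.
    known = {"Femme", "Homme"}
    if customer_gender not in known or product_gender not in known:
        return "unknown"
    return "same" if customer_gender == product_gender else "different"

# A product marked "Sac" has no gender label, so the result is "unknown".
examples = [
    is_same_gender("Femme", "Femme"),
    is_same_gender("Homme", "Femme"),
    is_same_gender("Femme", "Sac"),
]
```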

Size description

Each product has a size description indicating whether it runs small or large. In general there are 3 types of advice: the real size is smaller, bigger or normal. However, there are 15 distinct descriptions. We group them into 4 classes: “big”, “small”, “normal” and “None”.
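The grouping could be written as a small mapping function like the one below. The 15 real descriptions are not listed in the report, so the keywords and example strings here are invented placeholders, and the function name is hypothetical.

```python
def cluster_size_description(description):
    """Map a free-text size description to 'small', 'big', 'normal' or 'None'."""
    # NaN is the only value that is not equal to itself.
    if description is None or description != description:
        return "None"
    text = str(description).strip().lower()
    if not text:
        return "None"
    # Placeholder keywords: each real description would need to be checked.
    if "small" in text:
        return "small"
    if "large" in text or "big" in text:
        return "big"
    return "normal"

classes = [cluster_size_description(d) for d in
           ["Runs small", "Runs large", "True to size", None]]
```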

Ship time

We compute the ship time as OrderShipDate minus OrderCreationDate.

Max-Min Size

For one type of shoe, if the available sizes span a wide range, there is a bigger chance of errors. For example, a client orders a size 42 but receives a size 41. We create a new feature “max_min” by computing MaxSize minus MinSize.

Client age

The age of a client could also be an important feature. A young client may be less inclined to return products. We compute it as OrderCreationDate minus BirthDate.

Shopping age

An experienced client might make fewer mistakes when shopping online. We compute the shopping age as OrderCreationDate minus FirstOrderDate.
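The three date-based features of this section (ship time, client age, shopping age) can be sketched with pandas as follows. The output column names and the units (days for the ship time, years for the two ages) are assumptions; the report only gives the subtractions.

```python
import pandas as pd

def add_date_features(df):
    """Add ship_time (days), age and shopping_age (years) columns."""
    out = df.copy()
    date_cols = ["OrderCreationDate", "OrderShipDate", "BirthDate", "FirstOrderDate"]
    # errors="coerce" turns unparseable dates into NaT instead of raising.
    dates = {c: pd.to_datetime(out[c], errors="coerce") for c in date_cols}
    created = dates["OrderCreationDate"]
    out["ship_time"] = (dates["OrderShipDate"] - created).dt.days
    out["age"] = (created - dates["BirthDate"]).dt.days / 365.25
    out["shopping_age"] = (created - dates["FirstOrderDate"]).dt.days / 365.25
    return out

# Made-up order: shipped after 3 days, client aged 30, first order 2 years ago.
df = pd.DataFrame({
    "OrderCreationDate": ["2017-03-01"],
    "OrderShipDate": ["2017-03-04"],
    "BirthDate": ["1987-03-01"],
    "FirstOrderDate": ["2015-03-01"],
})
features = add_date_features(df)
```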

Compared price

We have the real transaction price in the train table and the official price in the product table. We observe that if a client buys a product for less than the official price, the product is less likely to be returned, with a return rate of about 15%. We add the compared price as PurchasePriceHT minus UnitPMPEUR. This is a very important feature; it increases the accuracy to 70%.
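A short sketch of the compared price and of the kind of check that leads to the observation above. The column names `compared_price` and `returned` and the toy numbers are assumptions; the subtraction follows the formula in the text.

```python
import numpy as np
import pandas as pd

# Made-up prices and labels, only to show the computation.
df = pd.DataFrame({
    "PurchasePriceHT": [50.0, 80.0, 40.0, 90.0],
    "UnitPMPEUR":      [60.0, 70.0, 45.0, 90.0],
    "returned":        [1, 0, 0, 1],
})
df["compared_price"] = df["PurchasePriceHT"] - df["UnitPMPEUR"]

# Return rate for each sign of the difference (-1, 0, +1).
rate_by_sign = df.groupby(np.sign(df["compared_price"]))["returned"].mean()
```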

The four features BrandID, ProductColorID, SupplierColor and ProductID take numeric identifiers as values. At first, we thought they were meaningless, so we simply removed them. Later, we realized that prior knowledge of the problem suggests they should be used, since the color or brand of a product does influence its return rate. So we look for a reasonable way to use these four features.

These four features are missing in 377,620 of the 1,067,290 rows, so 689,670 rows carry usable values if we use these features. This is a substantial amount of data.

BrandID

There are 665 distinct brand ids. This is not a large number, which makes the implementation easier.

We compute the return rate for every brand. Then we check whether more than 200 products of this brand were bought; this threshold is empirical. If so, the return rate is meaningful because there are enough samples. If not, the return rate may be unreliable, so we discard it and use the average return rate instead.

Finally, for the feature brand id, we either map the id to its return rate or use the average return rate.
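The brand encoding described above could be implemented along these lines. The helper names, the `returned` label column and the toy data are assumptions; the rule (more than `min_count` purchases, otherwise the average rate) follows the text, with the threshold lowered to 3 so the toy example shows both cases.

```python
import pandas as pd

def fit_rate_encoding(train, column, label="returned", min_count=200):
    """Return (per-id return rates for frequent ids, overall return rate)."""
    global_rate = train[label].mean()
    stats = train.groupby(column)[label].agg(["mean", "count"])
    # Keep only ids bought more than min_count times.
    reliable = stats.loc[stats["count"] > min_count, "mean"]
    return reliable, global_rate

def apply_rate_encoding(ids, reliable, global_rate):
    """Map ids to their return rate; rare, unseen or missing ids get the average."""
    return ids.map(reliable).fillna(global_rate)

train = pd.DataFrame({
    "BrandId":  [1, 1, 1, 1, 2, 2],
    "returned": [1, 0, 0, 0, 1, 1],
})
reliable, global_rate = fit_rate_encoding(train, "BrandId", min_count=3)
encoded = apply_rate_encoding(train["BrandId"], reliable, global_rate)
```

Splitting the fit and apply steps lets the rates computed on the training set be reused unchanged on the test set.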

ProductColorID

There are 52,386 distinct product color ids, far more than the number of brand ids. Only 289 of them have more than 100 products bought (lowering the threshold from 200 to 100 does not help much). This is a problem for our implementation: it is not reasonable to use the same idea as for brand id.

We compute the return rate for every product color id. Then we check whether more than 100 products with this product color id were bought. If so, we continue to the next step; if not, we simply skip this product color id.

If the count passes the threshold, the computed return rate is considered meaningful. We assign 1, 0 or -1 according to three situations: 1 if the return rate is higher than the average return rate by at least 5%, -1 if it is lower than the average return rate by at least 5%, and 0 otherwise.

Finally, for the feature product color id, the value is 1, 0 or -1.
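The three-valued encoding can be sketched as a small function. Two points are assumptions rather than statements from the report: “5%” is read as 5 percentage points (0.05), and ids that are skipped because they are too rare receive the neutral value 0. The function name is hypothetical.

```python
def ternary_code(rate, count, average_rate, min_count=100, margin=0.05):
    """Encode an id as 1, 0 or -1 from its return rate and purchase count."""
    if count <= min_count:
        return 0  # assumption: skipped (rare) ids get the neutral value
    if rate >= average_rate + margin:
        return 1
    if rate <= average_rate - margin:
        return -1
    return 0

average = 0.2077  # average return rate of the training set, from the report
codes = [
    ternary_code(0.30, 150, average),  # clearly above average
    ternary_code(0.10, 150, average),  # clearly below average
    ternary_code(0.22, 150, average),  # close to average
    ternary_code(0.90, 50, average),   # too few purchases
]
```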

SupplierColor

The procedure for supplier color is the same as for product color id.

Finally, for the feature supplier color, the value is 1, 0 or -1.

ProductID

The procedure for product id is the same as for product color id.

Finally, for the feature product id, the value is 1, 0 or -1.

Conclusion of the feature engineering: we end up with 299 features.

Hyperparameters

Sample weight

We observe that the labels are not balanced: the ratio of true labels to false labels is about 1:3.8. To avoid a bias towards the false label, we add a sample weight {1:3.8} to the classifier. However, in practice we do not see any improvement in the classifier’s performance.
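One way to build such weights is sketched below, assuming the labels are 0/1 and that the false label keeps a weight of 1. The function name `make_sample_weight` is hypothetical.

```python
import numpy as np

def make_sample_weight(y, positive_weight=3.8):
    """Give weight 3.8 to samples with label 1 and weight 1 to the others."""
    y = np.asarray(y)
    return np.where(y == 1, positive_weight, 1.0)

y = np.array([0, 0, 1, 0, 1])
weights = make_sample_weight(y)
```

With scikit-learn, the resulting array can be passed as `clf.fit(X, y, sample_weight=weights)`; the random forest constructor also accepts `class_weight={0: 1, 1: 3.8}` to the same effect.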

...
