Determining a sell price for secondhand luxury bags
An end-to-end project: from collecting data and building a data pipeline to a business intelligence web application.
This project gives people a recommendation on how to price their used luxury bags. I started by creating a data pipeline: extracting data with web scraping, transforming and cleaning it with pandas, and loading it into a BI web application. From this project, I can understand the price range for each bag, and I hope it gives users a reference point so they can quickly decide how to price a secondhand luxury bag.
The secondhand market has been booming for years. As more customers become aware that the fast-fashion industry has a huge impact on the environment, people have started to be more conscious about their purchasing behavior. With the convenience of online secondhand marketplaces such as Poshmark, customers can buy or sell secondhand items without leaving home.
For my family and myself, every time I want to declutter my stuff, I try to post items on an online marketplace first. However, the process of selling used items is daunting: taking pictures, setting the price, writing descriptions, and posting online all take time. Setting the price is especially tedious; I need to browse different websites and search for similar listings, trying to find a reasonable price for my products.
One day, an idea came to my mind: why not build a tool that gives me an overview of the market and a reference price? I decided to start with secondhand luxury bags.
The goal: give users more information about a product's price range and recommend a price they can use when listing their used items on online platforms.
I will briefly share the tools and steps I used for the project. Feel free to reach out to me if you have any questions.
Tools I used
Web Scraping: Beautiful Soup
Data Pipeline, Data Cleaning, and Preprocessing: pandas, NumPy
Exploratory Data Analysis: Matplotlib, Seaborn, Plotly
Machine Learning Model: Scikit-learn
Web application: AWS, Tableau
I will break the walkthrough into three sections: creating the data pipeline, exploratory data analysis, and training the model.
The Data Pipeline
Web scraping is a useful technique for extracting data from different sources. Often, it is hard to access the specific data you are interested in unless the website provides an API (as Twitter does) or publishes open data.
Before jumping into web scraping, it is important to check a site's robots.txt file. The file tells users which pages they can or can't request from the site.
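Python's standard library can check robots.txt rules directly. A small sketch, using illustrative rules rather than any real site's file:

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content (illustrative; always check the actual site's file)
robots_txt = """
User-agent: *
Disallow: /private/
Allow: /browse/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# With these rules, browse pages are allowed and /private/ pages are not
print(parser.can_fetch("*", "https://example.com/browse/bags"))
print(parser.can_fetch("*", "https://example.com/private/data"))
```

In practice you would point the parser at the live file with `set_url(...)` and `read()` instead of a hard-coded string.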
The pages I decided to scrape are the "browse" pages.
Below is the code I used to scrape the data.
Since I needed to grab a lot of information from many pages, both for the same style and across different styles, creating a function for each step was important to speed up the process.
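The function-per-step structure can be sketched as below with requests and Beautiful Soup. The URL pattern and CSS class names (`listing-card`, `.title`, and so on) are hypothetical placeholders; the real site's markup differs:

```python
import requests
from bs4 import BeautifulSoup

def get_page(url):
    """Download one browse page and return its parsed HTML."""
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")

def parse_listings(soup):
    """Extract name, price, size, and brand from each listing card."""
    rows = []
    for card in soup.select("div.listing-card"):  # hypothetical class name
        rows.append({
            "name": card.select_one(".title").get_text(strip=True),
            "price": card.select_one(".price").get_text(strip=True),
            "size": card.select_one(".size").get_text(strip=True),
            "brand": card.select_one(".brand").get_text(strip=True),
        })
    return rows

def scrape_style(base_url, pages):
    """Loop over the paginated browse pages for one bag style."""
    listings = []
    for page in range(1, pages + 1):
        soup = get_page(f"{base_url}?page={page}")  # hypothetical URL scheme
        listings.extend(parse_listings(soup))
    return listings
```

Splitting download, parse, and pagination into separate functions makes it easy to rerun one step per style without repeating the others.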
Data cleaning and preprocessing
This is the step I spent most of my time on. Since user input varies, it is hard to categorize the raw data into groups; it takes research and some business context to clean it up.
Currently, I have four columns in my dataset: name, price, size, and brand. I started with the price column. When I first explored the data, I noticed many listings with a price of 0 or a very low price. I went back to the website to gain more context. After more research, and based on the descriptions I saw, I realized that price-0 items are placeholders for sellers, and that most of the low-priced items are not authentic. As a result, I decided to first remove the listings with price < 500.
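A minimal pandas sketch of that filter, on toy rows (the values are illustrative, not from the real dataset):

```python
import pandas as pd

# Toy data mirroring the four scraped columns
df = pd.DataFrame({
    "name": ["Neverfull MM", "placeholder", "Speedy 30", "inspired bag"],
    "price": [1450, 0, 980, 120],
    "size": ["MM", "", "30", "medium"],
    "brand": ["Louis Vuitton"] * 4,
})

# Price 0 is a seller placeholder and very low prices are mostly
# inauthentic listings, so keep only rows priced at 500 or above
df = df[df["price"] >= 500].reset_index(drop=True)
print(df["name"].tolist())  # only the two plausible listings remain
```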
As for size, although some listings use the same sizes as the official website's descriptions, others give only a number or a vague description such as "medium". I decided to set keywords for each size and categorize the listings into groups.
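One way to sketch that keyword grouping in pandas; the keyword map below is a hypothetical example, not the full list I used:

```python
import pandas as pd

# Hypothetical keyword map; real groups come from the brand's official
# size names plus common phrases sellers type into the size field
SIZE_KEYWORDS = {
    "PM": ["pm", "small"],
    "MM": ["mm", "medium"],
    "GM": ["gm", "large"],
}

def categorize_size(raw):
    """Map a free-text size description to a standard size group."""
    text = str(raw).lower()
    for group, keywords in SIZE_KEYWORDS.items():
        if any(kw in text for kw in keywords):
            return group
    return "unknown"

sizes = pd.Series(["MM", "medium", "Size PM", "large tote", "??"])
print(sizes.map(categorize_size).tolist())
```

Unmatched inputs fall into an "unknown" bucket so they can be reviewed by hand instead of silently mislabeled.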
I have shared more detail about how to clean up the data in my notebook. My main takeaway from this step is that EDA is really important! By exploring the data, I can spot outliers and other interesting points.
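As an example of what that exploration surfaces, the 1.5 × IQR rule that a box plot visualizes can flag price outliers numerically (the prices below are illustrative):

```python
import pandas as pd

# Illustrative prices for one bag style; real values come from the cleaned data
prices = pd.Series([650, 700, 720, 780, 800, 850, 900, 2500])

# Flag outliers with the 1.5 * IQR rule, the same rule a box plot draws
q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
outliers = prices[(prices < q1 - 1.5 * iqr) | (prices > q3 + 1.5 * iqr)]
print(outliers.tolist())  # the 2500 listing is flagged
```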
Business Intelligence Dashboard
For the recommendation dashboard, I broke it into three sections: first, some big numbers for summary statistics; second, a histogram of the price distribution; and last but not least, a box plot to display a variable's spread graphically at a glance. Users can filter by bag style and size from the drop-down menu. I also added options to download the result as an image or PDF for future reference.
For this project, I decided to try AWS S3 for deployment. I created an S3 bucket to host the website, following along with this tutorial.
In the project, I faced a lot of challenges.
- Originally, I wanted to train a model to predict the price. However, since I didn't have enough data and columns to train on, the result didn't reflect reality: the model's R² was 62%. Given the data limitations, I decided to use descriptive analysis to give price recommendations instead. It would help to have the condition, color, and more style details of each bag.
- The data cleaning process requires business context. Since user input is not standardized, it is hard to label; I would recommend that platforms limit user input for size.
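For context on the modeling challenge: with only categorical features like size, a regression model can only predict the group average, so the remaining variance (condition, color, year) goes unexplained. A minimal scikit-learn sketch on toy data; my actual model and its 62% R² are not reproduced here:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Toy dataset: size is the only predictive column available
df = pd.DataFrame({
    "size": ["PM", "PM", "MM", "MM", "GM", "GM", "PM", "GM"],
    "price": [900, 950, 1200, 1300, 1500, 1650, 880, 1580],
})

# One-hot encode the categorical feature and fit a linear model;
# its predictions collapse to the per-size average price
X = pd.get_dummies(df[["size"]])
model = LinearRegression().fit(X, df["price"])
r2 = r2_score(df["price"], model.predict(X))
print(round(r2, 2))
```

On this tiny set the fit looks strong only because the toy prices cluster tightly by size; with real listings, attributes like condition drive much of the price, which is why the descriptive approach was more honest.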
While working on this project, I noticed some interesting points that secondhand online platforms might want to consider in the future.
Fake listing detection — During data cleaning, I found a lot of low-priced listings, most of which stated in the description that they were not authentic. Some of these listings come from users with few reviews, or their listings look alike, with picture characteristics such as very low resolution.
I am researching more platforms' data. My next step is to combine the different data sources and add more attributes to my dataset.