Human or Robot?

Kaggle, Facebook Recruiting, Machine Learning, Data Wrangling, R, Python


Intro (from Kaggle)

In this competition, I was chasing down robots for an online auction site. Human bidders on the site are becoming increasingly frustrated with their inability to win auctions vs. their software-controlled counterparts. As a result, usage from the site's core customer base is plummeting.

In order to rebuild customer happiness, the site owners need to eliminate computer-generated bidding from their auctions. They attempted to build a model that identifies these bids using behavioral data.

The goal of this competition is to identify online auction bids that are placed by "robots", helping the site owners easily flag these users for removal from their site to prevent unfair auction activity.

To build a better model, I investigated how an online bidding robot service (BidderRobot) works. You can get a feel for how such robots help users win auctions by watching their tutorials.

Data

The training dataset contains the bidder_id, payment_account, and address, all of which are hashed to protect customer privacy. In addition, each bidder in the training dataset has been labelled (human or robot) using other data we could not see.

We also have a log file, about 1 GB in size, recording for each bid: the bidder_id, the auction id, the merchandise category, the device used by the bidder, the time the bid was placed, the country of the bidder, the IP of the bidder, and the URL from which the bidder got to the auction.
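Below is a minimal sketch of how this data could be loaded for exploration. The file names (train.csv, bids.csv) and column names follow the Kaggle data description and are assumptions here, not my exact competition code.

```python
import pandas as pd

# Assumed file and column names from the competition's data description.
train = pd.read_csv("train.csv")  # bidder_id, payment_account, address, outcome
bids = pd.read_csv("bids.csv")    # bid_id, bidder_id, auction, merchandise,
                                  # device, time, country, ip, url

# Attach the human/robot label to every bid for exploration.
bids_labeled = bids.merge(train[["bidder_id", "outcome"]],
                          on="bidder_id", how="inner")
print(bids_labeled.head())
```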

Defects of the Data

Feature Engineering

The most important part of this problem is feature engineering. I first checked that each bidder actually had a unique payment account and address, so these two features in the training dataset were of no use.

I engineered about 40 features.

Surprisingly, the minimum time difference between bids made by a user is almost useless. Many users, whether human or robot, have 0 as their minimum time difference. Other Kagglers found this as well; I suspect some errors were introduced when the time was hashed. However, the median time between a user's bid and that user's previous bid was found to be useful by some Kagglers (Discussions Here).
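As an illustration, the median time between a bidder's consecutive bids can be computed from the log with a groupby. This is a sketch under the assumption that `bids` is the DataFrame loaded above, not my exact competition code.

```python
import pandas as pd

def median_time_between_bids(bids: pd.DataFrame) -> pd.Series:
    """Median gap between consecutive bids, per bidder (in hashed time units)."""
    bids_sorted = bids.sort_values(["bidder_id", "time"])
    # Gap between each bid and the same bidder's previous bid.
    gaps = bids_sorted.groupby("bidder_id")["time"].diff()
    return gaps.groupby(bids_sorted["bidder_id"]).median().rename("median_time_diff")
```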

The most useful features I found were: the total number of bids, the number of devices, the number of IPs, the median number of bids per auction, the maximum number of bids per auction, the ratio of the number of devices to the number of auctions, the ratio of the median number of bids to the number of IPs, and the ratio of the total number of bids to the number of auctions participated in.
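The sketch below shows how these count and ratio features can be derived from the bid log. The feature names are illustrative rather than the exact ones from my competition code, and the column names are assumed from the data description.

```python
import pandas as pd

def make_features(bids: pd.DataFrame) -> pd.DataFrame:
    """Per-bidder count and ratio features from the bid log."""
    grouped = bids.groupby("bidder_id")
    feats = pd.DataFrame({
        "n_bids":     grouped["bid_id"].count(),
        "n_devices":  grouped["device"].nunique(),
        "n_ips":      grouped["ip"].nunique(),
        "n_auctions": grouped["auction"].nunique(),
    })

    # Median and maximum number of bids a bidder places within one auction.
    per_auction = bids.groupby(["bidder_id", "auction"]).size()
    feats["median_bids_per_auction"] = per_auction.groupby("bidder_id").median()
    feats["max_bids_per_auction"] = per_auction.groupby("bidder_id").max()

    # Ratio features described above.
    feats["devices_per_auction"] = feats["n_devices"] / feats["n_auctions"]
    feats["median_bids_per_ip"] = feats["median_bids_per_auction"] / feats["n_ips"]
    feats["bids_per_auction"] = feats["n_bids"] / feats["n_auctions"]
    return feats
```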

Results

My final model was a Random Forest, with a public score of 0.86815 and a private score of 0.91158. I also tried stacking several of my models, which resulted in a public score of 0.87242 and a private score of 0.91263.
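A simplified sketch of the Random Forest setup is below. It assumes the `feats` table from the feature-engineering sketch and the labelled train.csv; the hyperparameters are placeholders rather than the ones I actually tuned, and the evaluation uses AUC to match the leaderboard scores quoted above.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Align features with the labelled bidders; bidders with no bids in the log get zeros.
X = feats.reindex(train["bidder_id"]).fillna(0).values
y = train["outcome"].values

clf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
print(cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean())
clf.fit(X, y)
```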

Final Thoughts

  1. During the contest, I didn't make good use of the time data. If I had explored the time information more deeply, I might have ended up near the top.

  2. Here are some offline solutions to the robot bidding problem.

    • We could record the robots' IPs, bidder_ids, payment accounts, and addresses. If we encounter them again in the future, we could ban them from bidding.
    • We should rerun the machine learning model on a regular basis to adapt to new robot behaviors.
  3. From the most important features of the model, we could derive some online solutions. Here are a few.

    • We can set a limit on the maximum number of bids a user can place in a single auction.
    • We can ban users who log in from multiple devices.
    • We can set a limit on the maximum number of bids a user can place in a day.
    • We can record the median time between a user's bid and that user's previous bid; if it falls below a threshold, flag the user as a robot (a toy sketch follows this list).
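As a toy illustration of the last rule, a per-bidder median gap (as computed in the feature-engineering sketch) could be compared against a tuned threshold. The threshold value here is purely a placeholder.

```python
import pandas as pd

def flag_fast_bidders(median_gap: pd.Series, threshold: float) -> pd.Series:
    """Flag bidders whose median time between consecutive bids is suspiciously small."""
    return (median_gap < threshold).rename("flag_as_robot")

# Example usage (threshold is a placeholder, to be tuned on real data):
# flags = flag_fast_bidders(median_time_between_bids(bids), threshold=1e6)
```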