
Enronfraud

Python, Machine Learning, Scikit-Learn


Intro (From Wikipedia)

The Enron scandal, revealed in October 2001, eventually led to the bankruptcy of the Enron Corporation, an American energy company based in Houston, Texas, and the de facto dissolution of Arthur Andersen, which was one of the five largest audit and accountancy partnerships in the world. In addition to being the largest bankruptcy reorganization in American history at that time, Enron was cited as the biggest audit failure.[1]

Enron was formed in 1985 by Kenneth Lay after merging Houston Natural Gas and InterNorth. Several years later, when Jeffrey Skilling was hired, he developed a staff of executives that, by the use of accounting loopholes, special purpose entities, and poor financial reporting, were able to hide billions of dollars in debt from failed deals and projects. Chief Financial Officer Andrew Fastow and other executives not only misled Enron's board of directors and audit committee on high-risk accounting practices, but also pressured Andersen to ignore the issues.

Project Goal

The goal of this project is to identify the Enron Corporation employees who committed fraud. I label these employees as persons of interest (POIs). The dataset contains two kinds of information about each employee. The first is financial information, including salary, bonus, stock holdings, etc. The second is email information, including the full text of the emails, the total number of emails sent and received, the number of emails sent to POIs, and the number of emails received from POIs.

Machine learning algorithms use this information to identify possible POIs. The dataset contained two outliers, which I found by inspecting the lengths of the employee names. One outlier is a summary row that aggregates all employees, and the other is an entry that is difficult to interpret, so I removed both.
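The name-length inspection described above can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes the dataset is a dictionary keyed by employee name, and the helper names and toy records are hypothetical. In practice the entries at both ends of the ranking would be inspected by hand before anything is dropped.

```python
def rank_names_by_length(data_dict):
    """Return the dataset keys ordered from shortest to longest name."""
    return sorted(data_dict, key=lambda name: (len(name), name))


def drop_records(data_dict, keys_to_drop):
    """Return a copy of the dataset without the given keys."""
    return {name: record for name, record in data_dict.items()
            if name not in keys_to_drop}


# Hypothetical toy data: two person-like keys and two suspicious ones.
toy_data = {
    "DOE JANE A": {"salary": 100},
    "ROE RICHARD B": {"salary": 120},
    "ALL": {"salary": 220},  # stands in for a summary row
    "A VERY LONG ENTRY THAT IS NOT A PERSON": {"salary": 0},
}

ranked = rank_names_by_length(toy_data)
# For the toy data, the shortest and longest keys are the suspects.
suspects = {ranked[0], ranked[-1]}
cleaned = drop_records(toy_data, suspects)
```

Sorting by length puts unusually short or long keys at the ends of the list, which is where entries that are not real person names tend to show up.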

Feature Engineering

I created four new features: ‘hasEmail’, ‘fromPoiRatio’, ‘toPoiRatio’, and ‘total_money’. ‘hasEmail’ indicates whether an employee’s email address is available. I added it because I suspect that POIs hold higher positions in the company, where an email address is necessary to do business. ‘fromPoiRatio’ is the share of the emails an employee receives that come from POIs, and ‘toPoiRatio’ is the share of the emails an employee sends that go to POIs. I added these two features because I think an employee with a high proportion of emails from or to POIs is probably a POI. ‘total_money’ is the total amount an employee receives from salary, bonus, all kinds of stock, etc. I think this feature could be useful because I believe money is the reason POIs commit fraud.
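A rough sketch of how these four features could be computed for one employee record is shown below. The output feature names come from the text above; everything else is an assumption made for illustration: the input field names ('email_address', 'from_poi_to_this_person', 'to_messages', 'from_this_person_to_poi', 'from_messages', 'salary', 'bonus', 'total_stock_value'), the use of the string "NaN" for missing values, and the choice of which fields are summed into ‘total_money’.

```python
import math


def to_number(value):
    """Convert a raw field to float, treating missing values as 0."""
    if value is None or value == "NaN":  # assumed missing-value markers
        return 0.0
    number = float(value)
    return 0.0 if math.isnan(number) else number


def ratio(part, whole):
    """Return part / whole, or 0 when the denominator is missing or zero."""
    part, whole = to_number(part), to_number(whole)
    return part / whole if whole > 0 else 0.0


def add_features(record, money_fields=("salary", "bonus", "total_stock_value")):
    """Return a copy of the record with the four engineered features added."""
    out = dict(record)
    email = record.get("email_address")
    out["hasEmail"] = 0 if email in (None, "NaN", "") else 1
    # Emails received from POIs divided by all emails received.
    out["fromPoiRatio"] = ratio(record.get("from_poi_to_this_person"),
                                record.get("to_messages"))
    # Emails sent to POIs divided by all emails sent.
    out["toPoiRatio"] = ratio(record.get("from_this_person_to_poi"),
                              record.get("from_messages"))
    out["total_money"] = sum(to_number(record.get(f)) for f in money_fields)
    return out


# Hypothetical record for illustration only.
example = add_features({
    "email_address": "jane.doe@example.com",
    "from_poi_to_this_person": 20, "to_messages": 200,
    "from_this_person_to_poi": 5, "from_messages": 50,
    "salary": 100000, "bonus": "NaN", "total_stock_value": 250000,
})
```

Guarding the ratios against a zero or missing denominator matters here, since an employee with no recorded email would otherwise cause a division error.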

Performance Evaluation

I used two evaluation metrics, precision and recall. Averaged over the cross-validation folds, the performance is Precision: 0.425, Recall: 0.447.
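For reference, the sketch below shows how precision and recall are computed from true and predicted labels and then averaged across folds. It is a plain-Python illustration under assumed conventions (label 1 means POI, 0 means non-POI); the function names and the toy fold data are hypothetical and unrelated to the scores reported above.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 marks a POI."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # POIs correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-POIs wrongly flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # POIs that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_over_folds(folds):
    """Average precision and recall over (y_true, y_pred) pairs, one per fold."""
    scores = [precision_recall(y_true, y_pred) for y_true, y_pred in folds]
    n = len(scores)
    return (sum(s[0] for s in scores) / n, sum(s[1] for s in scores) / n)


# Hypothetical predictions for two folds.
toy_folds = [
    ([1, 0, 0, 1, 0], [1, 0, 1, 0, 0]),
    ([0, 1, 0, 0, 1], [0, 1, 0, 0, 1]),
]
avg_precision, avg_recall = average_over_folds(toy_folds)
```

Precision answers "of the employees flagged as POIs, how many really are?", while recall answers "of the real POIs, how many were flagged?". Both are more informative than accuracy when POIs are a small minority of the dataset.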