Updated: Jan 13
Every once in a while I like to review some of my core principles, and one that has helped me tremendously is the concept of celebrating failures. I've made a habit of revisiting Astro Teller's TED Talk on a regular basis and thinking about the projects I've worked on that had to be shelved for one reason or another.
The Data Science Intensive course I'm in the middle of has several projects that are wide open, and I want to spend a little time examining the project directions that I've REJECTED, and why. This post is the first in an ongoing segment where I look at projects I've dug into, found fascinating, and rejected or shelved anyway. Today's examination? A nonprofit ethics rating system.
The IRS maintains a public library of Form 990 filings from a wide variety of nonprofit corporations: SOI Tax Stats. As a former Business Tax Auditor, I found this fascinating. 300,000 to 600,000+ tax filings PER YEAR in freely downloadable .csv format? Bring it on! You're in my world, now!
I got some ideas from ProPublica's article on nonprofit giving. I've seen firsthand how nonprofit entities simply spend more on their donors and employees than on the people they claim to help. This got me thinking: what if I could build an ethics rater? I was off. Principal component analysis showed me that 95% of the variation in this gigantic dataset was explained by a mere 50 out of more than 300 variables. I found a fascinating divide in the scale of average officer compensation based on whether or not the nonprofit was identified as 'politically active.' I fired up Benford's Law to look for anomalous numbers.
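To make those two checks concrete, here's a minimal sketch of both: counting how many principal components reach a 95% explained-variance threshold, and comparing leading-digit frequencies against Benford's expected distribution. The function names and the lognormal sample standing in for dollar amounts are my own illustration, not the actual 990 data or code.

```python
import numpy as np

# Benford's Law: in many naturally occurring datasets, the leading
# digit d (1-9) appears with probability log10(1 + 1/d).
def benford_expected():
    digits = np.arange(1, 10)
    return np.log10(1 + 1 / digits)

def first_digit_frequencies(values):
    vals = np.abs(np.asarray(values, dtype=float))
    vals = vals[vals > 0]
    # scale each value into [1, 10) and truncate to its leading digit
    digits = (vals / 10 ** np.floor(np.log10(vals))).astype(int)
    counts = np.bincount(digits, minlength=10)[1:10]
    return counts / counts.sum()

# PCA-style check via SVD on centered data: how many components
# are needed to explain a given fraction of the total variance?
def components_for_variance(X, threshold=0.95):
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    explained = s ** 2 / (s ** 2).sum()
    return int(np.searchsorted(np.cumsum(explained), threshold)) + 1

# Synthetic stand-in for a column of dollar amounts
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=10, sigma=2, size=5000)
observed = first_digit_frequencies(sample)
expected = benford_expected()
```

Large deviations between `observed` and `expected` (e.g. via a chi-squared statistic) would flag a filing's numbers as worth a closer look.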
And I started realizing that I would not be able to create an effective ethics rating system using machine learning, because I would need to create the rating system myself. I could build a rating that used the ratio of officer compensation to money spent on programs, or the ratio of fundraising expense to fundraising income (occasionally you'll see as much as 95% of the money taken in by fundraising go directly back into the cost of fundraising). The issue is that when you build a model, you're typically testing for something, getting the machine to predict trends that you are not aware of yourself. If I engineered the target ethics rating myself, 1-10 stars for example, then I would already know exactly which features fed into the rating. What would be the point of the modeling, then?
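The circularity is easy to see in code. This sketch computes the two ratios mentioned above and a hypothetical 1-10 score built from them; the line-item names, penalty weights, and scoring formula are all my own invention for illustration. Any model trained on this target could only rediscover the formula I already wrote.

```python
# Two of the hand-picked ratios, computed from hypothetical 990 line items.
def ethics_ratios(officer_comp, program_spend,
                  fundraising_expense, fundraising_income):
    return {
        "comp_to_programs": officer_comp / program_spend,
        "fundraising_cost_ratio": fundraising_expense / fundraising_income,
    }

# The engineered target: a made-up 1-10 score that starts at 10
# and penalizes each ratio. If this IS the label, the model has
# nothing left to discover.
def engineered_rating(ratios):
    score = 10 - 5 * ratios["comp_to_programs"] \
               - 5 * ratios["fundraising_cost_ratio"]
    return max(1, min(10, round(score)))

r = ethics_ratios(officer_comp=200_000, program_spend=1_000_000,
                  fundraising_expense=95_000, fundraising_income=100_000)
stars = engineered_rating(r)
```

The 0.95 fundraising-cost ratio here mirrors the worst-case figure above: nearly every fundraising dollar consumed by the fundraising itself.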
I'm glad I spent the time on this, I'm glad I found out about this large dataset, and I may return to it at a later date with a different question. For now though, I'll mark it on my celebrated failures board and move on.