Data science is becoming more important for socially relevant studies, but the social aspect makes the pitfalls that develop along the way that much more important to catch early on. Using The Cost of Public School, a project done by Microsoft's data science school, as a case study, we can examine the processes involved in a good study and the traps we may land in.
In a social-science-focused research project, identifying the problem we want to solve is the first and sometimes the hardest step. Before worrying about practical constraints, we tried to nail down exactly what we wanted to see: our ideal experiment.
High-quality data is limited, computational time is limited, and frankly, so are we as human beings. To define the realistic question for The Cost of Public School, we had to pare our ideal down to the data we could get within our time limit, while still remaining true to the core social issue: the belief that housing costs in a particular school zone skew the reality of NYC's supposedly free and open public school system.
Important topics of discussion included: What data do we need on housing to make sure results are normalized across the various types? What types of housing (apartments, rentals, sales) can we acquire, and how will the data we can't acquire affect the impact of the experiment? What factors other than school zone could affect the cost of housing, and how can we gather accurate data for them and quantify them?
At this point, you've decided on the form of the data you'll need for your experiment. In a social-science-focused setting, however, good-quality data can be much harder to come by.
NYC OpenData can be a godsend for social science projects focused on the city, but as we discovered on our journey, a lot of it can be hard to decipher, poorly organized, or incomplete, or it may simply lack the data necessary to answer the question. In that case, third-party sites, their APIs, and web crawling become essential.
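For instance, many NYC OpenData tables are served through the Socrata (SODA) API and can be paged through with plain HTTP requests. The sketch below is only an illustration: the dataset ID "abcd-1234" is a placeholder, and the app token is optional but helps avoid rate limits.

```python
# A minimal sketch of pulling a table from NYC OpenData's Socrata (SODA) API.
# The dataset ID "abcd-1234" is a placeholder; substitute the ID of the
# dataset your question actually needs.
import requests
import pandas as pd

BASE = "https://data.cityofnewyork.us/resource/abcd-1234.json"

def fetch_dataset(limit=50_000, app_token=None):
    """Page through a SODA endpoint and return the rows as a DataFrame."""
    headers = {"X-App-Token": app_token} if app_token else {}
    rows, offset = [], 0
    while True:
        resp = requests.get(
            BASE,
            headers=headers,
            params={"$limit": limit, "$offset": offset},
            timeout=60,
        )
        resp.raise_for_status()
        batch = resp.json()
        if not batch:          # an empty page means we've read everything
            break
        rows.extend(batch)
        offset += limit
    return pd.DataFrame(rows)

if __name__ == "__main__":
    df = fetch_dataset()
    print(df.shape)
    print(df.head())
```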
As important as third-party data can be, due diligence is required to clean it up. Preliminary data exploration is essential to make sure what you have makes sense. "Sense" is a very expansive term here; in our case, checking it involved its own research. This will come up numerous times over the course of a project as well.
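As one illustration, a quick sanity pass over a scraped housing table might look like the sketch below; the column names ("price", "zip_code", "listing_date") are hypothetical stand-ins for whatever your source actually provides.

```python
# A minimal sanity-check pass over a scraped housing table, assuming
# hypothetical columns "price", "zip_code", and "listing_date".
import pandas as pd

def sanity_report(df: pd.DataFrame) -> None:
    """Print the quick checks worth running before trusting third-party data."""
    print(df.dtypes)                       # are numbers actually numeric?
    print(df.isna().mean().sort_values())  # share of missing values per column
    print(df.duplicated().sum(), "duplicate rows")
    # Domain checks: a $0 or $90M "sale" in a residential dataset usually
    # signals a deed transfer or a data-entry error, not a real price.
    print(df["price"].describe())
    print(df["zip_code"].nunique(), "distinct ZIP codes")
    print(df["listing_date"].min(), "to", df["listing_date"].max())
```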
You have as much data as you can get, and your preliminary digging and excluding has given you what all data scientists love: a clean data set.
Unfortunately, most of the time this means we've whittled the original data down to a shadow of what we pulled from various sites. In many cases, a model then becomes necessary to project the data you would have liked to be able to collect.
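A minimal sketch of that kind of projection, assuming a hypothetical housing table with "sqft", "bedrooms", and a partially missing "price" column: a regression fit on the complete rows fills in the gaps, and a flag keeps the modeled values distinguishable from the observed ones.

```python
# A sketch of projecting missing values with a simple model; the column
# names and the choice of a linear regression are illustrative assumptions.
import pandas as pd
from sklearn.linear_model import LinearRegression

def project_missing_prices(df: pd.DataFrame) -> pd.DataFrame:
    features = ["sqft", "bedrooms"]
    known = df.dropna(subset=["price"] + features)
    missing = df[df["price"].isna() & df[features].notna().all(axis=1)]

    model = LinearRegression().fit(known[features], known["price"])

    df = df.copy()
    df["price_projected"] = False
    df.loc[missing.index, "price"] = model.predict(missing[features])
    df.loc[missing.index, "price_projected"] = True  # flag modeled values
    return df
```

Keeping that flag matters: downstream analysis should always be able to separate what was observed from what was modeled.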
The analysis portion can be one of the most intense parts of a social science project. It's more than just getting averages and crunching numbers; you have to know not only what the numbers mean, but also what they are describing SOCIALLY. This is where a diverse team comes in handy. This is where personal experience may be an indication of where to go next and what you've missed.
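To make the contrast concrete, the "crunching" half of this step can be as small as the hypothetical groupby below (assuming "school_zone" and "price" columns); the real work is deciding what a gap between zones actually means for the families living in them.

```python
# The "easy" part of the analysis: a hypothetical table with "school_zone"
# and "price" columns reduced to one row of summary numbers per zone.
# Interpreting the gap between zones does not fit in a groupby.
import pandas as pd

def price_by_zone(df: pd.DataFrame) -> pd.DataFrame:
    return (
        df.groupby("school_zone")["price"]
          .agg(["median", "mean", "count"])
          .sort_values("median", ascending=False)
    )
```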
But don't fret. This is just the beginning. Social science projects can take that much longer because of this aspect, and in our case, because socially driven research is at the mercy of the population, about whom there will simply never be enough data for our tastes. Still, by touching upon just the edge of a social issue, we open the door for much more research to be done.