Wednesday 2:05 PM–2:45 PM in Radio City (#6604)

The Value of Null Results

Angel D'az

Audience level:
Novice

Description

While I was in college, I collected 1301 rows of data on pizza deliveries my partner and I made. I used this dataset to learn first Excel, then Python, so I could be a more valuable candidate in internship applications. During interviews, I would get asked, "What are the best predictors of better tips?" Great question, but it had a disappointing answer: not much predicted good tips. Null results can be exciting!

Abstract

Introduction Slide

I am Angel D'az, a data analyst who fell into data for mostly practical reasons. In much the same way, while I was delivering pizzas in college, I started collecting data on those deliveries for practical reasons: I wanted to build Excel, then Python, and finally statistics skills that I could sell to potential employers.

Describe the dataset

My then-partner, who delivered pizzas out of the same shop, obliged me and collected data with me so that I could input it all manually into Excel. Leaving out personally identifiable information, I collected all sorts of fields: type of home, tip amount, order amount, cash/credit, neighborhood, order time, etc. At the end of a few months, I had 1301 rows of data in my hands.

In the beginning... there was Excel

I created about 8 charts in Excel at first, and a few small patterns came out: tips got slightly better over the course of the day, about 5 cents every 15 minutes from 10:30 am to 12:00 am. It would have been nice to collect data on blood alcohol content and see whether there was any correlation with the better late-night tips, but that would probably have been unethical.
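That time-of-day trend is easy to check in pandas. Below is a minimal sketch with made-up rows (the column names are assumptions mirroring the fields described above, not the talk's actual data):

```python
import pandas as pd

# Hypothetical mini-dataset; the real 1301-row log isn't public.
data = pd.DataFrame({
    "OrderTime": ["10:30", "12:00", "15:45", "20:30", "23:45"],
    "Tip":       [2.00,    2.25,    3.00,    3.75,    4.50],
})

# Minutes elapsed since the first order of the day.
t = pd.to_datetime(data["OrderTime"], format="%H:%M")
minutes = (t - t.min()).dt.total_seconds() / 60

# Least-squares slope of tip vs. time: cov(x, y) / var(x).
slope = data["Tip"].cov(minutes) / minutes.var()
print(f"tip change per 15 minutes: ${slope * 15:.2f}")  # prints: tip change per 15 minutes: $0.05
```

With real data you would plot the scatter as well; a slope this small is easy to over-read without the chart.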

Weak Correlations Everywhere

When I moved on to statistics and Python, I found that the results weren't as ground-breaking as I had hoped. There was not much of a difference between my then-partner's tips and mine, and a difference of means t-test failed to reject the null hypothesis that our average tips were the same. To repeat:

> The null hypothesis (H0) often represents either a skeptical perspective or a claim to be tested. The alternative hypothesis (HA) represents an alternative claim under consideration and is often represented by a range of possible parameter values.

H0: mu_a = mu_s

HA: mu_a != mu_s

A recipe for a difference of means t-test

I am pretty happy with how I used pandas functions to grab all the ingredients I needed for the t-test.

First, all the ingredients:

```python
import math

import pandas as pd
from scipy import stats

aDels = data.loc[data['PersonWhoDelivered'] == 'Angel', 'Tip']
sDels = data.loc[data['PersonWhoDelivered'] == 'Sammie', 'Tip']
mu_a = aDels.mean()    # Angel delivery mean = 3.376236
mu_s = sDels.mean()    # Sammie delivery mean = 3.412869
sigma_a = aDels.std()  # Angel stdev = 2.184814
sigma_s = sDels.std()  # Sammie stdev = 2.040795

variance_a = sigma_a * sigma_a  # Angel variance = 4.773414
variance_s = sigma_s * sigma_s  # Sammie variance = 4.164844
n_a = aDels.count()  # Angel sample size = 712
n_s = sDels.count()  # Sammie sample size = 589
df = n_a + n_s - 2 - 1  # n - k - 1: degrees of freedom = 1298
```

Then we start mixing the ingredients:

1. Make the first denominator chunk by dividing my tip variance by my sample count.
2. Make the second denominator chunk by dividing Sam's tip variance by her sample count.
3. Add the two chunks together.
4. Take the square root to get the denominator.
5. Make the numerator, which is simply one mean subtracted from the other.

```python
den1 = variance_a / n_a  # formula chunk: 0.006704
den2 = variance_s / n_s  # formula chunk: 0.007071
den3 = den1 + den2       # formula chunk: 0.013775
den = math.sqrt(den3)    # denominator value: 0.11736812016136079
num = mu_a - mu_s        # numerator value: -0.036633
```

Finally (math-wise), our t-statistic and p-value.

Here's something I'm realizing now: the den variable above is the standard error of the difference between our two means. We divide the difference between the two means by that standard error to get the t_stat value, or t-score.

```python
t_crit = 1.9673  # two-sided critical value at alpha = 0.05 for df = 1298
t_stat = num / den
pval = 2 * stats.t.sf(abs(t_stat), df)  # two-sided p-value; well above 0.05, so we fail to reject H0
```
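As a cross-check on the hand-rolled recipe: the denominator above is the unpooled (Welch-style) standard error, and SciPy packages the same computation as `scipy.stats.ttest_ind_from_stats`, which works directly from the summary numbers quoted in the comments:

```python
from scipy import stats

# Summary statistics quoted above for the two drivers' tips.
t_stat, pval = stats.ttest_ind_from_stats(
    mean1=3.376236, std1=2.184814, nobs1=712,
    mean2=3.412869, std2=2.040795, nobs2=589,
    equal_var=False,  # Welch: same unpooled standard error as the recipe above
)
print(round(t_stat, 3), round(pval, 3))  # t is about -0.31; p is far above 0.05
```

The Welch degrees of freedom differ slightly from the n - k - 1 figure used above, but with samples this large the p-value barely moves.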

So someone as nice and attractive as my then-partner got the same average tip as I did? That tells me that people tip pretty much the same across the board, barring extraordinary circumstances.

Another weak correlation

Creating a simple linear regression model of tip amount in dollars on distance from the store gave me an R squared of 0.011 and the following model:

Tip Amount in dollars = 3.0295 + 0.1588(distance from store in miles)

When looking into the data, I found that of the deliveries greater than 5 miles away, 33 out of 38 were from the same well-to-do neighborhood in the foothills. Removing these from the dataset, because they were skewing the model, dropped the R squared to 0.00:

Tip Amount in dollars = 3.3 + 0.0194(distance from store in miles)
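A fit like this can be reproduced with `scipy.stats.linregress` (a stand-in here, not necessarily the tool used in the talk). The sketch below uses synthetic data built so that distance barely matters, which is the shape of the null result above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Synthetic stand-in data: 1301 deliveries where distance has almost no effect.
distance = rng.uniform(0.5, 5.0, size=1301)             # miles from the store
tip = 3.3 + 0.02 * distance + rng.normal(0, 2.1, 1301)  # dollars, mostly noise

fit = stats.linregress(distance, tip)
print(f"Tip = {fit.intercept:.2f} + {fit.slope:.4f} * distance")
print(f"R squared = {fit.rvalue**2:.3f}")  # essentially zero
```

An R squared near zero means the regression line explains almost none of the tip-to-tip variation, exactly the kind of null result the talk is about.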

Virtually everything was out of the driver's control. People tip what they're going to tip.

Or was it out of our control?

The first value of these two null results

I now knew I would get tipped pretty much the same as my then-partner. I had some emotional backing for the idea that when I had a bad night, it was not a reflection of me but rather just bad luck. When you're delivering pizzas late at night, have to get up early for school, and overwork yourself like I tend to do, emotional backing for bad nights is worth its weight in gold.

The second value of these two null results

Despite getting an R squared of 0.00 in the second model, there was an implicit variable I hadn't discussed yet: time. For statistically the same tip, drivers spent more time on deliveries that were farther away. More time spent means fewer deliveries, which translates to a lower total tip amount for the night. I learned to optimize for distance when I had wiggle room to decide which deliveries to take, that is, if I wanted to make more money rather than an easy ride with the windows down and sunglasses on.
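The back-of-envelope arithmetic behind that lesson: if the tip per delivery is flat, tips per hour are driven entirely by deliveries per hour. The round-trip times below are illustrative assumptions, not measurements from the dataset:

```python
avg_tip = 3.38   # dollars; roughly flat regardless of distance, per the t-test above
short_run = 15   # assumed minutes round trip for a nearby delivery
long_run = 35    # assumed minutes round trip for a foothills run

tips_per_hour_short = avg_tip * (60 / short_run)
tips_per_hour_long = avg_tip * (60 / long_run)
print(f"${tips_per_hour_short:.2f}/hr vs ${tips_per_hour_long:.2f}/hr")  # prints: $13.52/hr vs $5.79/hr
```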

Null results are harder to communicate

It takes a whole 30-minute talk to explain two valuable null results, whereas "do this one thing and increase your tips!" is much more enticing to sell. Internship interviewers always seemed to disregard my null results, but in a way, an interviewer getting excited about these findings was a signal to me that maybe a data nerd was sitting across from me.

Thank you!
