In today's increasingly data-driven world, business people are constantly talking about how they want more powerful and flexible analytical tools, but are usually intimidated by the programming knowledge these tools require and the learning curve they must overcome just to be able to reproduce what they already know how to do in the programs they've become accustomed to using. For most business people, the go-to tool for doing anything analytical is Microsoft Excel.
If you're an Excel user and you're scared of diving into R, you're in luck. I'm here to slay those fears! With this post, I'll provide you with the resources and examples you need to get up to speed doing some of the basic things you're used to doing in Excel in R. I'm going to spare you the countless hours I spent researching how to do this stuff when I first started so that you feel comfortable enough to continue using R and learning about its more sophisticated capabilities.
Excited? Let's jump in!
Quick note before we do: There are usually multiple ways to do everything in R. I'm going to show you the way I do it, but if you know of a better/shorter/faster/easier way to do something, please leave it in the comments!
Let's start with the basics. You'll want to make sure you have downloaded and installed R. I'm also using RStudio as my IDE, so you should install that as well. You'll be glad you did; it's awesome.
You'll also want to install and load the
ggplot2 library, which not only contains the data set we want to use but will also come in handy when we get to creating charts and graphs later. We will also install and load the
dplyr library to help with manipulating the data.
We are going to use the diamonds data set that comes with
ggplot2. The data set contains prices and other attributes of over 50,000 diamonds.
OK, so let's take an initial look at the data. You can type
diamonds into the R console and it will print out the data set in the console screen, but I advise against doing this. If you're an Excel user, you're used to viewing data in a tabular format. You can do that in one line of code.
This should have created a data frame object in RStudio's upper right hand pane (the one called "Environment") and it should say "53940 obs. of 10 variables" right next to it. This means the table has 53,940 rows and 10 columns. Click on diamonds in that pane, and RStudio will show you the table.
It should look like this:
The first 7 columns are pretty well labeled, so we won't mess with those, but the last 3 aren't labeled very well. So let's rename columns 8, 9, and 10. We'll call them length, width, and depth respectively.
You'll notice that now we have two columns named depth. Let's rename the first one (column 5) to "depthperc."
Now the data frame should look like this:
One of the most common things people do in Excel is perform calculations. For example, if we wanted to multiply length, width, and depth together; we would type
=H2*I2*J2 in cell K2 in Excel and then copy that formula all the way down to the last row of the data set. Then you would probably name the column "cubic" or something like that.
In R, you can perform all 3 of these actions with a single function.
diamonds <- mutate(diamonds, cubic=length*width*depth)
Here we are using the
mutate function on the diamonds data set to multiply length, width, and depth. It's assigning the outcome of that to a new column called cubic.
Another way of doing this, which may be more approachable to someone coming from Excel, is as follows.
diamonds$cubic <- diamonds$length * diamonds$width * diamonds$depth
The data frame should now look like this:
Challenge: Try creating another column called "total" where you add the columns together instead of multiplying them.
The next most common task that Excel is used for is summarizing data. These tasks range from simply calculating column totals to the more intermediate pivot tables. I'll show you how to do both in R.
First, let's say that we want to summarize our data set and calculate the overall averages for all the numeric fields (carat, depthperc, table, price, length, width, depth, and cubic). This would be the equivalent of going to the bottom of a column in Excel, typing in
=AVERAGE(A2:A53940), and then copying that formula over to the bottom of all the other columns you wanted to average.
In R, you would use the
Here, we are telling R to calculate the column-wise means for column 1 and then columns 5 through 11. This will print the averages for those column numbers in the R console.
carat depthperc table price length width
0.7979397 61.7494049 57.4571839 3932.7997219 5.7311572 5.7345260
Let's say you wanted to add carat to the non-numeric fields and then calculate the averages for each combination of the new group of non-numeric fields. This would take a bit of work in Excel (maybe even some pivot-tabling), but is pretty easy in R.
First, let's round the carat values to the nearest 0.25 carat so that our numbers are not all over the place.
diamonds$carat2 <- round(diamonds$carat/.25)*.25
Now, let's create our summary.
Summary <- aggregate(cbind(depthperc, table, price, length, width, depth, cubic)~cut+color+clarity+carat2, data=diamonds, mean)
You're basically telling R to aggregate the diamonds data frame, take the mean of all the numeric fields, and group by the non-numeric fields. The result is a Summary data frame that looks like this:
Next, we'll replicate Excel's useful pivot table functionality. Now that we have the averages for each of our numerical fields, let's choose one and see how it tends to change based on some of the non-numerical attributes. Say we wanted to analyze the difference in average prices of diamonds of different color & clarity combinations. In Excel, you might create a pivot table with color as a Row Label, clarity as a Column Label, and average price in the Values section.
In R, you can use the
reshape2 package to do the exact same thing. First, install and load the package.
Then, we'll use the
dcast function to get our data into the same pivot table format.
pivot_table <- dcast(diamonds[,c('color','clarity','price')], color~clarity, mean)
Here, we're taking the color, clarity, and price columns from the diamonds data frame, casting (pivoting) them out by color (rows) and clarity (columns), and calculating the average price for each combination.
Challenge: Try casting/pivot-tabling using a different combination of non-numerical fields and the averages of one of the other numerical fields.
Another very common thing people do in Excel are vlookups. The scenario arises where you have two related data sets and you want to pull some values from data set B over to their appropriate place in data set A. So you type something like
=VLOOKUP(A2,K2:K50,2,0) and Excel looks up the value in A2 in column K and returns the value in the column next to the matching value.
In R, we can do this using the
merge function. So let's say we wanted to calculate how far above or below a diamond's price was compared to the average for their cut,color, clarity, and carat. In this case, our data set A will be the diamonds data frame and data set B will be the Summary data frame.
First, let's change the name of the price column in the Summary data frame to avgprice. This way, we won't have two price fields when we bring it over.
Next, let's merge the data sets and bring over the average price.
diamonds <- merge(diamonds, Summary[,c(1:4,7)], by.x=c("cut","color","clarity","carat2"), by.y=c("cut","color","clarity","carat2"))
We merged the diamonds data frame with just the columns that we needed from the Summary data frame and the result was that it added the avgprice field to our diamonds data frame.
Challenge: Merge the averages of some of the other numerical fields in the Summary table over to the diamonds data frame.
Bonus Challenge: Calculate what percentage over/under priced each diamond is compared to the average. Hint: Remember how to do column calculations?
Excel users also periodically use conditional (IF) statements for filling in values according to whether certain conditions are met. R is also very good for doing this.
Let's say we wanted to categorize diamonds into size categories such as small, medium, and large based on their carat weight.
diamonds$size[diamonds$carat < 0.5] <- "Small"
diamonds$size[diamonds$carat >=0.5 & diamonds$carat < 1] <- "Medium"
diamonds$size[diamonds$carat >= 1] <- "Large"
Here we've set anything less than 0.5 carat to small, anything between 0.5 and 1 carat to medium, and anything 1 carat and above to large.
Charts and Graphs
The last group of Excel common tasks we'll cover here are the creation of charts and graphs. Excel has a very "drag-and-drop" method of graph creation, whereas R has a very "type out what you want" method. This may be a little daunting at first, but once you get the hang of it, you'll start to find it easier to customize charts and graphs in R than having to fish around in Excel for the right menu option to make the change you want.
Based on my experience in a business environment, I'm going to cover the 3 most common graphs I've seen people create in Excel - the bar/column chart, the line chart, and the scatterplot.
Taking a look at our diamonds data set, let's say we want to create a chart that shows how many diamonds of each size (small/medium/large) are in our data. Here's how you would do that in R.
barplot(table(diamonds$size), main="Diamond Size Distribution", xlab="Size Category", ylab="Number of Diamonds", col="blue")
This produces a bar chart that looks like this:
The second type of chart we're going to create is a line chart. These are usually used when you have data that changes over some period of time and you want to see the magnitude and velocity of those changes. Since our diamonds data set doesn't have any time series data in it, we'll do something a little different. We will create a line for each color and see how the number of diamonds of that color change across clarity categories.
Here's how to do that with the
ggplot(diamonds, aes(clarity)) + geom_freqpoly(aes(group = color, colour = color)) +
labs(x="Clarity", y="Number of Diamonds", title="Clarity by Color")
And here's what it looks like.
It looks like most diamonds fall into the middle clarity categories. Also, pretty interesting that there are more G color diamonds in the higher clarity categories than any other color.
Now let's do a fairly simple scatter plot so you can get a sense of how to do one in R. For this, we are going to use the
ggplot command again.
ggplot(diamonds, aes(carat, price, color=clarity)) + geom_point() +
labs(x="Carat Weight", y="Price", title="Price by Carat Weight")
Alternatively, you can produce the same thing with the
qplot function as well.
qplot(carat, price, data=diamonds, color=clarity, xlab="Carat Weight", ylab="Price", main="Price by Carat Weight")
The resulting plot shows the relationship between the carat weight and the price of the diamonds in our data set, and we've also set the points to be different colors according to the clarity of the diamond. The graph below shows us that the larger the diamond and the better the clarity, the more expensive it tends to be.
To create other types of charts and graphs, the ggplot2 index site is a wonderful resource that has code and visuals for different types of graphs.
Well, there you have it - a guide to get almost any Excel user started in R. Two things I want to mention before I leave you to explore some more on your own.
First, I've found that if you want to get good at R (or anything really), the trick is to find a reason to use it every day. It doesn't matter if it's something small, just open up R Studio and go through some of the exercises you've learned or try to stretch your knowledge and do one new thing in R each day. You'll find that after a month or so you'll be much more comfortable with it and hopefully you'll keep getting better at it and never look back!
Second, the Internet is your friend. Part of R's appeal is the community of people that use it, write about it, and ask/answer questions about it online. There are a vast number of websites just a click away that can guide you in the right direction when you're stuck on something (this blog included). All it takes is a little searching.
Finally, if you found this post useful, go to the blog's home page and click the Subscribe button. We'll be cranking out a lot more great, educational data science content in the near future and I'd hate for you to miss any of it.
May the learning be with you!