Lesson 13: RStudio Basics

Objective

You will learn the RStudio interface, as well as a few basic commands to discover the structure behind a data set.

Vocabulary

pane, preview, console, plot, environment

RStudio Commands

data( ), View( ), names( ), help( ), dim( ), tally( ), load_labs( )

Essential Concepts

The computer has a syntax, and it can only understand you if you speak its language.

Lesson

  1. The Dashboard and PlotApp are data visualization tools that are coded in R, the statistical programming software that academics and professional statisticians use. The Introduction to Data Science course uses RStudio, which also runs on R. You will learn R, the programming language used in RStudio, for data analysis. Watch the following video to learn RStudio Basics before moving on.

  2. To access RStudio, go to your server at
    https://idsucla.org/ids-servers, then click on the RStudio icon on the page.

  3. Your RStudio login is the same as your IDS App and IDS Homepage login.

  4. Once logged in, notice each of the following panes, or rectangular areas, of the RStudio interface:

    1. preview (spreadsheet) - where you will be able to see the variables and observations (index), arranged as rows and columns of data

    2. console - where you will be entering your code

    3. plot - where your plots/graphs/visualizations will be generated

    4. environment - where you will see values and objects

  5. You will be looking at a data set from the Centers for Disease Control and Prevention (CDC), a government agency that collects data on a broad range of topics, including issues concerning teenagers.

  6. You can load the CDC data file into the workspace and view it by typing the following commands into the console, pressing the Enter or Return key on your keyboard after each one:

    >data(cdc)

    >View(cdc)

    Note: If you wish, you can take notes in an R script. As a reminder, to create an R script, find the Menu tab, go to File and click on New File, then click on R Script. Type the commands below in your R script and Run your commands. Refer back to the video to learn how to use an R script.

  7. Examine the preview pane. How are the data displayed?

  8. Where on the spreadsheet can you find the variables?

    1. Type the following command in the console: >names(cdc)

    2. What do you notice? What is one variable of this data set? How many variables are there? How does this output compare to the information in the preview pane?

  9. The previous command lists the names of each variable in this data set, but there is a command that gives you more detailed information about the data set.

    1. Type the following command in the console: >help(cdc)

      A document should appear on the bottom right-hand pane under the Help tab with more details about the CDC dataset.

    2. What unit of measurement is height reported in?

  10. You can also find the number of rows and columns in the data set.

    1. Type the following command in the console: >dim(cdc)

    2. Which number do you think represents the rows? Which one represents the columns? How does this output compare to the information in the preview and environment panes? How many observations are there in the data set? How many variables does this data set contain?

      There are 15,624 rows, or 15,624 observations, and there are 33 columns, or 33 variables. This information is also visible in the environment pane.

  11. You can also obtain the number of observations in each category of a specific variable.

    1. Type the following command to get the number of observations for seat belt wearers: >tally(~seat_belt, data = cdc)

      Notice that six categories are displayed. Each category shows the number of observations contained in it. For example, “Never” has 326 observations, meaning 326 teens reported never wearing their seat belt as a passenger in a motor vehicle. <NA> stands for Not Available and represents teens who did not provide information about their seat belt habits.

    2. What else do you notice?

  12. Now change the variable to height:

    1. Type the following command: >tally(~height, data = cdc)

      Notice that categories are missing. This happened because the variable height contains numbers, not categories.

    2. What else do you notice?

  13. Let’s take a closer look at the variables seat belt and height. Brainstorm the following question:

    What is the difference between the data from the variables seat belt and height?

  14. To summarize: In data science, the variable seat belt is what we call a categorical variable, and the variable height is what we call a numerical variable.

  15. Let’s look at the other variables in this dataset.

    Categorize each variable as categorical or numerical:

    1. eat_fruit

    2. weight

    3. grade

    4. gender

  16. Throughout the IDS course you will be completing RStudio labs and learning RStudio code to work with data.

  17. You can load the menu of labs by typing the following code: >load_labs( )

  18. The load_labs( ) command displays a list of available labs and a selection prompt. To select Lab 1A, type the number "1" after the selection prompt.

  19. Next, direct your attention to the plot pane, and notice the location of Lab 1A’s presentation. If you do not see it, click on Viewer or refresh the page.

  20. Click on the arrows at the bottom right-hand side of the presentation to view each slide. Pause on the slide titled “Syntax in action”. You should see 3 boxes, each containing a line of code.

  21. Every time you see a grey box with a line of code, you must type the code into the console. The output will appear either on the console itself or on the plot pane.

  22. Type in one of the lines of code. In this particular case, the output will be a plot. Notice that the plot appears in the same area as the slides, but under the Plots tab. You can toggle between the Plots and Viewer tabs by clicking each tab.

  23. You are now ready to complete Lab 1A.
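
The console commands from steps 8 through 15 can be recapped in a short sketch. The cdc data set is only available on the IDS server, so this example assumes a tiny made-up data frame called teens (a hypothetical name with invented values, not CDC data). It also uses base R's table( ) as a stand-in for tally( ), so the output layout may differ slightly from what you see in the lesson.

```r
# Hypothetical stand-in for the cdc data set: a tiny hand-made data frame.
# Every value below is invented for illustration and is NOT CDC data.
teens <- data.frame(
  seat_belt = factor(c("Always", "Never", "Always", NA, "Sometimes", "Always")),
  height    = c(1.60, 1.75, 1.68, 1.80, 1.55, 1.70),  # invented numbers
  gender    = factor(c("Female", "Male", "Female", "Male", "Female", "Male"))
)

# names() lists the variables, which are the column headers of the spreadsheet.
names(teens)

# dim() reports the number of rows first, then the number of columns.
dim(teens)  # 6 rows, 3 columns

# table() counts the observations in each category, much like
# tally(~seat_belt, data = cdc). useNA = "ifany" also counts missing answers.
seat_belt_counts <- table(teens$seat_belt, useNA = "ifany")
seat_belt_counts

# Rough check only: it looks at how R stores each column, labeling
# columns of numbers "numerical" and everything else "categorical".
variable_type <- sapply(teens, function(column) {
  if (is.numeric(column)) "numerical" else "categorical"
})
variable_type
```

You can paste these lines into the console one at a time, or save them in an R script and Run them. Once you are comfortable with the output, try the same commands on cdc in your own RStudio session.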

Reflection

What are the essential learnings you are taking away from this lesson?

Homework

Continue to collect nutritional facts data using the Food Habits Participatory Sensing campaign on your smart device or via web browser.

Lab Time

It's time to begin learning how to do data analysis in RStudio! Before going on to the next lesson, you must complete Lab 1A, Lab 1B, and Lab 1C using RStudio. The following video will show you how to log in to your RStudio account and complete Lab 1A.

Lab 1A: Data, Code & RStudio

Lab 1B: Get the Picture?

Lab 1C: Export, Upload, Import

Complete Labs 1A, 1B and 1C prior to Lesson 14.