Class 6
Go to cocalc.com
Sign up with your SVSU email address.
You will have 2 weeks to pay the $18.85 fee.
R: an environment for statistical calculation
Descendant of the “S” system, created at Bell Laboratories in 1975
R was created in 1993 at the University of Auckland, New Zealand
De facto standard in data science and statistical calculation
Used by researchers in academia as well as large companies in the pharmaceutical industry (Pfizer, …), the financial industry (Mastercard, many banks), insurance (Aetna, …), consulting (Deloitte, …), Google, Amazon, Meta, newspapers such as NYT, and many others.
Libraries, also known as packages, are collections of R code created to add some functionality to the base R system.
More than 20,000 of them on CRAN (official R package repository), plus countless more at other places.
Tidyverse: a collection of packages designed to make data manipulation easier.
Mosaic: a library originally designed to help teach and learn statistics, but now often used in “production” as well.
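Before any of the functions from these libraries can be used, the libraries have to be loaded. A minimal sketch, assuming the packages are already installed in your project; the palmerpenguins package is an assumption here, named only because it is a common source of the penguins data set used in the examples:

```r
# Install once, if needed (uncomment the next line):
# install.packages(c("tidyverse", "mosaic", "palmerpenguins"))

library(tidyverse)       # data manipulation tools (read_csv, mutate, glimpse, ...)
library(mosaic)          # formula-based statistics functions (tally, mean(~x), ...)
library(palmerpenguins)  # assumed source of the penguins data set
```

The library() calls need to be repeated at the start of every new R session.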
action(parameters): perform some action with some parameters. Examples:
glimpse(penguins): print some information about the penguins data set structure.
read_csv("frogs.csv"): read a data set from a comma-separated file called frogs.csv.
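The two examples above can be tried out as follows. This is only a sketch: the real frogs.csv is not shown in these notes, so the code first writes a tiny made-up file (the column names id and length_mm are hypothetical), and the penguins data set is assumed to come from the palmerpenguins package.

```r
library(tidyverse)
library(palmerpenguins)  # assumed source of the penguins data set

# action(parameters): one action, one parameter
glimpse(penguins)

# Hypothetical stand-in for frogs.csv, written to a temporary folder
frogs_path <- file.path(tempdir(), "frogs.csv")
writeLines(c("id,length_mm", "1,42", "2,38", "3,45"), frogs_path)

# Read the file and store the result so it can be used later
frogs <- read_csv(frogs_path)
glimpse(frogs)
```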
action(formula with variables, data = dataset, additional parameters): do some sort of analysis described by the action on the variables mentioned in the formula, from the given dataset. Examples:
tally(~species + island, data = penguins, format = "percent"): creates a contingency table for the two variables in the data set.
gf_histogram(~Height, data = classdata, binwidth = 1): creates a histogram for the variable Height in the classdata data set.
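A runnable sketch of the formula pattern above. The classdata data set belongs to the course and is not reproduced here, so a small made-up version with a single Height column is created first; the penguins data set is assumed to come from the palmerpenguins package.

```r
library(mosaic)          # provides tally(); also loads gf_histogram()
library(palmerpenguins)  # assumed source of the penguins data set

# Contingency table of species by island, in percent
species_by_island <- tally(~species + island, data = penguins, format = "percent")
species_by_island

# Hypothetical stand-in for the course's classdata (heights in inches)
classdata <- data.frame(Height = c(62, 64, 65, 66, 66, 67, 68, 70, 71, 73))

# Histogram of Height with bins one unit wide
gf_histogram(~Height, data = classdata, binwidth = 1)
```

In both calls the formula says which variables to use, data = says where to find them, and the last parameter adjusts how the result is presented.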
dataset |> action1(parameters) |> action2(more parameters) |> action3(...): Perform a series of actions on the given data set. Example:
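A pipeline of this shape might look like the sketch below. The particular steps (filter, drop_na, summarize) are chosen only for illustration, the penguins data set is assumed to come from the palmerpenguins package, and the |> operator requires R 4.1 or newer.

```r
library(tidyverse)
library(palmerpenguins)  # assumed source of the penguins data set

adelie_summary <- penguins |>
  filter(species == "Adelie") |>        # action1: keep only Adelie penguins
  drop_na(bill_length_mm) |>            # action2: drop rows with a missing bill length
  summarize(mean_bill = mean(bill_length_mm),  # action3: one summary row
            n = n())

adelie_summary
```

Each step receives the data set produced by the step before it, so the pipeline reads top to bottom in the order the actions happen.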
variable_name <- some calculation: Store the result of some calculation in memory under the name variable_name. Examples:
m <- mean(~Height, data = classdata): calculate the mean of the variable Height in the classdata data set, and store it under the name m for future use.
classdata |> mutate(deviation = Height - m) -> classdata_modified: Add a deviation column to the classdata, and store the resulting data set as classdata_modified.
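The two assignment examples can be run together as below. The classdata used here is a small made-up stand-in (five hypothetical heights), not the actual class data set.

```r
library(tidyverse)
library(mosaic)

# Hypothetical stand-in for the course's classdata
classdata <- data.frame(Height = c(62, 64, 66, 68, 70))

# Left assignment: compute the mean and store it as m
m <- mean(~Height, data = classdata)
m

# Right assignment at the end of a pipeline: store the modified data set
classdata |> mutate(deviation = Height - m) -> classdata_modified
classdata_modified
```

Both <- and -> store a result; the arrow always points toward the name. The original classdata is left unchanged, and the new column exists only in classdata_modified.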
Formulas are used in R to specify which variables in a given data set to analyze, and what kind of relationship between the variables we are looking for.
A formula will always contain the ~ character (tilde).
Single variable case: ~varname, for example ~Height or ~species or ~bill_length_mm.
Analyze one variable, but group the values by another variable first:
~var1 | var2, for example ~species | island.
Analyze how values of one variable depend on the values of another:
var1 ~ var2, for example bill_length_mm ~ species
Analyze some sort of combination of two variables:
~var1 + var2, for example ~species + island
and many more.
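The formula shapes listed above can be compared side by side. A sketch, assuming the mosaic package and the penguins data set from the palmerpenguins package; the choice of tally() and mean() as the actions is only for illustration.

```r
library(mosaic)
library(palmerpenguins)  # assumed source of the penguins data set

# ~var: one variable
tally(~species, data = penguins)

# ~var1 | var2: one variable, grouped by another
tally(~species | island, data = penguins)

# var1 ~ var2: how one variable depends on another
# (na.rm = TRUE skips penguins with a missing bill length)
bill_means <- mean(bill_length_mm ~ species, data = penguins, na.rm = TRUE)
bill_means

# ~var1 + var2: a combination of two variables
tally(~species + island, data = penguins)
```

The action and the data set stay the same from line to line; only the formula changes what question is being asked.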