R (programming language)

BASIC INFORMATION

  • R (an open-source language) is the successor of S (a proprietary language)
  • Like Python, R uses # for comments
  • Use Ctrl + Enter to run the current line
    • OR select multiple lines and press Ctrl + Enter
  • Ctrl + L clears the console in RStudio

  • How to print a variable?
    • use print(variable)   → can print only 1 object at a time
    • OR select the variable and hit Ctrl + Enter
    • Wrap an assignment in () to print the value while assigning
      • (x <- 5)   → makes x = 5 and prints x
        x <- 5     → only makes x = 5
    • Use View() for better visualisation (for a matrix or data frame)

  • A variable name can contain {letters, numbers, . , _ }
    • But it can start only with a letter or . (and a leading . must not be followed by a digit)
  • To write multiple statements on the same line
    • USE ; (semicolon)
    • Eg.- a <- 10; b <- 20

  • Write TRUE and FALSE (or the shorthand T and F) → True and true are both wrong
  • Strings can be stored in " " or ' '
  • A single-quoted string can’t contain an unescaped single quote
    Similarly, a double-quoted string can’t contain an unescaped double quote
    • 'I am 'T' the don' → WRONG
      "I am 'T' the don" → Correct
    • But we can write single quotes in a double-quoted string and vice-versa

  • Built-in functions ==> can be applied to both a vector and a single number
    • element-wise: ceiling, abs, sqrt, floor, sin, log, log2, log10, exp, round
    • summarising (return one value for the whole vector): sum, prod, max

  • s <- "HE28llo"     => assigns the string to variable 's'
    substr(s, 2, 5) → "E28l"  [NOTE - indexing starts from 1]
    nchar(s) → 7, the number of characters in the string
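
The quoting and substring notes above can be tried out with a short sketch like the following. The variable names q1 and q2 are only illustrative, and the backslash escape is an addition that the notes do not mention.

```r
s <- "HE28llo"
substr(s, 2, 5)    # "E28l" (characters 2 to 5, counting from 1)
nchar(s)           # 7

# Mixing quote types avoids escaping; a backslash escape also works
q1 <- "I am 'T' the don"
q2 <- 'I am \'T\' the don'
identical(q1, q2)  # TRUE
```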

  • Coercion order when mixing types (check with class(....)) ==> list > character > numeric > logical
  • Integer Division (%/%) → c(2,3,5,7) %/% 2   ==> 1 1 2 3
    • Modulo Division (%%) → c(2,3,5,7) %% c(2,3)  ==> 0 0 1 1
      • equivalent to (2 %% 2) (3 %% 3) (5 %% 2) (7 %% 3) → the shorter vector is recycled
      • Also, length of (2,3,5,7) = 4 should be a multiple of length of (2,3) = 2, otherwise R gives a warning
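
A small sketch of the coercion order and the two division operators described above. The variable names are made up for illustration.

```r
v <- c(2, 3, 5, 7)
int_div <- v %/% 2         # 1 1 2 3
mod_rec <- v %% c(2, 3)    # 0 0 1 1 (c(2, 3) is recycled to 2 3 2 3)

# Mixing types in c() coerces everything to the "strongest" type
class(c(TRUE, 1))          # "numeric"
class(c(1, "a"))           # "character"
class(c(1, list(2)))       # "list"
```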

  • Logical Operators -
    • xor(a, b) ✅            a ^ b ❌  (^ is exponentiation in R)
    • &, |  → element-wise comparisons (if vector) → return a vector of results
      &&, || → evaluate a single condition (length-1 operands), not a vector like & or |
      • Let x = 1:6
        (x > 2) & (x < 5) →  F  F  T  T  F  F
        x[(x > 2) & (x < 5)]  → 3 4 → logical indexing => get the values where the condition is T

      • (x > 2) && (x < 5) ❌   because && expects length-1 operands (an error since R 4.3.0)
        (x[1] > 2) && (x[1] < 5)  ✅

    • Vectorized if statement
      Let x = 1:6
      ifelse(x < 3, x^2, x + 1)   → for each element: x^2 if x < 3, else x + 1 (returns a new vector, x itself is unchanged)
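
The logical operators and ifelse() from the notes above, put together in one runnable sketch. The names mask and res are illustrative only.

```r
x <- 1:6

mask <- (x > 2) & (x < 5)    # FALSE FALSE TRUE TRUE FALSE FALSE
x[mask]                      # 3 4 (logical indexing)

(x[1] > 2) && (x[1] < 5)     # FALSE (single values, so && is fine)
xor(TRUE, FALSE)             # TRUE

# Element-wise choice: square values below 3, add 1 to the rest
res <- ifelse(x < 3, x^2, x + 1)   # 1 4 4 5 6 7
```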

  • Functions →  name <- function(x, y) { x + y }
                 name(3, 4)   → 7
    for (i in 1:5) { print(i + 1) }
    repeat { .... } is the same as while (TRUE) { ... }
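
A minimal sketch of a function, a for loop and a repeat loop. The function name add and the counters are hypothetical. Note that repeat needs an explicit break, otherwise it never stops.

```r
add <- function(x, y) { x + y }
add(3, 4)                       # 7

total <- 0
for (i in 1:5) {
  total <- total + i            # running sum of 1..5
}
total                           # 15

n <- 0
repeat {                        # same as while (TRUE)
  n <- n + 1
  if (n >= 3) break             # leave the loop explicitly
}
n                               # 3
```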

  • Factors in R - help to categorize data and store it as levels
    • factor(V) → creates a factor where V is a vector + levels are the unique values from V
    • Labels - Human readable names associated with each level
    • Ordering - Factors can be ordered (ordinal) or unordered (nominal)
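
The three ideas above (levels, labels, ordering) could look like this in practice. The sizes vector is made-up sample data.

```r
sizes <- c("M", "S", "L", "M", "S")   # made-up data

f <- factor(sizes)                    # unordered (nominal) factor
levels(f)                             # "L" "M" "S" (unique values, sorted)

# Ordered (ordinal) factor with explicit level order and readable labels
fo <- factor(sizes,
             levels  = c("S", "M", "L"),
             labels  = c("Small", "Medium", "Large"),
             ordered = TRUE)
fo[1] < fo[3]                         # TRUE: Medium < Large
table(fo)                             # counts per level
```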

  • How to import dataset
    • read.csv('Full path of dataset')
    • read.csv(file.choose()) → pop-up appears to select file
Data Structures
  • Vector → 1-D array that can hold elements of the same data type
    • created using functions like c() or seq()  [c means combine]
      • c(1,2,5.3,6,-2,4.78) 
        0:7 ===> [0,1,2,3,4,5,6,7]
        7:0 ===> [7,6,5,4,3,2,1,0]
        seq(1, 10, length = 5)   → 1 to 10 with equal spacing, 5 values in total
        seq(from=0, to=6, by = 0.2)  → 0 to 6 with spacing of 0.2
        rep(1:5, times=2) → 1 2 3 4 5 1 2 3 4 5
        rep(1:5, each=2) → 1 1 2 2 3 3 4 4 5 5
      • scan()- after running go to console and enter numbers
        => press 'enter' twice to stop taking input
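
The vector constructors listed above, as a quick sketch. length.out is the full name of the argument that the notes abbreviate as length.

```r
c(1, 2, 5.3, 6, -2, 4.78)            # combine values into a vector
0:7                                  # 0 1 2 3 4 5 6 7
7:0                                  # 7 6 5 4 3 2 1 0

s1 <- seq(1, 10, length.out = 5)     # 1.00 3.25 5.50 7.75 10.00
s2 <- seq(from = 0, to = 6, by = 0.2)
length(s2)                           # 31

rep(1:5, times = 2)                  # 1 2 3 4 5 1 2 3 4 5
rep(1:5, each = 2)                   # 1 1 2 2 3 3 4 4 5 5
```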


    • To access
      • a <- c("ram" = 12, "by" = -2)
        a["ram"] => Correct {named vector => an element can be accessed by its name}
        a[12] => NA {12 is treated as a position, and there is no 12th element}

      • cnt <- c("one","two","three","four")
        cnt[3] → "three"
        cnt[9] → NA
        cnt[-2] → "one" "three" "four"  → all except index 2
        cnt[2:4] → "two" "three" "four"  → all from 2 till 4

        cnt[2, 3, 2] => Wrong
        cnt[c(2, 3, 2)] => "two" "three" "two" [Correct]
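
The access patterns above can be checked with a sketch like this, using the same vectors as the notes.

```r
a <- c("ram" = 12, "by" = -2)
a["ram"]            # the element named "ram" (value 12)
a[12]               # NA: there is no 12th element

cnt <- c("one", "two", "three", "four")
cnt[3]              # "three"
cnt[9]              # NA
cnt[-2]             # "one" "three" "four"
cnt[2:4]            # "two" "three" "four"
cnt[c(2, 3, 2)]     # "two" "three" "two"
```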


    • Built-in Functions -> length(a), sort(a), sort(a, decreasing = T)
    • To delete -> assign NULL to the vector (a <- NULL)
    • If 2 vectors are of the same size => can do (a+b) or (a/b) or other operations element by element
  • List → 1-D ordered collection that can hold elements of different data types
    • a<- list(1,5.3,-2,c("one","two","three")) 
    • To convert list to vector
      a<- list(1,5.3,-2)
      v<- unlist(a)
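
A short sketch of building a list and flattening it. The double-bracket indexing ([[ ]]) is not covered in the notes above and is shown here only as a common way to pull one element out of a list.

```r
a <- list(1, 5.3, -2, c("one", "two", "three"))

a[[4]]          # the character vector stored as 4th element
a[[4]][2]       # "two"
class(a[4])     # "list" (single brackets keep the list wrapper)

v <- unlist(list(1, 5.3, -2))   # numeric vector: 1.0 5.3 -2.0
u <- unlist(a)                  # mixed types are coerced to character
class(u)                        # "character"
```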

  • Matrix → 2-D array with all elements of the same data type
    • x <- matrix(nrow=3, ncol=2, data=c(1,2,3,4,5,6))
             [,1]  [,2]
       [1,]    1     4
       [2,]    2     5
       [3,]    3     6
      • to access => x[3,2] = 6
      • to enter data row-wise => set byrow = T
      • if data = 1:2 => it is recycled, same as {1, 2, 1, 2, 1, 2}
        NOTE - (r*c) must be a multiple of the length of data
      
    • x[3,] = {3 6}  → row
      x[,2] = {4 5 6}  → column
      x[2:3, 2] → for submatrix
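
Matrix creation and indexing as described above, in runnable form. The names by_row and recycled are illustrative.

```r
x <- matrix(data = 1:6, nrow = 3, ncol = 2)   # filled column by column
x[3, 2]        # 6
x[3, ]         # 3 6   (third row)
x[, 2]         # 4 5 6 (second column)
x[2:3, 2]      # 5 6   (part of the second column)

by_row <- matrix(1:6, nrow = 3, ncol = 2, byrow = TRUE)
by_row[1, ]    # 1 2   (filled row by row)

recycled <- matrix(1:2, nrow = 3, ncol = 2)   # 1:2 is repeated to fill 6 cells
recycled[, 1]  # 1 2 1
```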
      
      
    • cbind - combines vector, matrix or data-frame by column
      
      t(matrix) → transpose of matrix
      dim(matrix)  → 3 2  {rows, columns of matrix => dimension}
      nrow(matrix) or ncol(matrix) → tells number of rows or columns of matrix

    • diag(1:10, nrow=5, ncol=7) → creates a 5 x 7 matrix with the values from 1:10 on the diagonal
      {NOTE - the diagonal stops at 5 because no more rows are available}

    • matrix * 5 or any arithmetic operation with a constant or with another matrix of
      the same dimension → the operation is applied element by element
      For matrix multiplication → use %*%
      crossprod(x)   ≈   t(x) %*% x

    • solve(matrix)  → inverse of a (square, non-singular) matrix
      eigen(matrix)  → eigenvalues and eigenvectors of a matrix
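
A sketch of the matrix operations above. The small diagonal matrix s is made up so that the inverse and eigenvalues are easy to check by hand.

```r
x <- matrix(1:6, nrow = 3, ncol = 2)

dim(t(x))                   # 2 3 (transpose swaps rows and columns)
scaled <- x * 5             # element-wise: every entry times 5
g <- t(x) %*% x             # true matrix product, 2 x 2
all.equal(g, crossprod(x))  # TRUE

s <- matrix(c(2, 0, 0, 4), nrow = 2)   # made-up invertible matrix
inv <- solve(s)             # inverse: 0.5 and 0.25 on the diagonal
ev <- eigen(s)$values       # 4 2 (sorted in decreasing order)
```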

  • Dataframe → 2-D table-like structure in which different columns can have different data types, but every value within one column has the same data type
    • emp <- data.frame(
        id = c(1:3), name = c("A","B","C"), sal = c(523.4,98,452.89),
        stringsAsFactors = FALSE)
      • stringsAsFactors = FALSE  → tells R to keep character columns as characters instead of converting them to factors automatically (before R 4.0.0 the default was TRUE; since 4.0.0 it is FALSE)
        This gives us control over the data, as we can convert specific columns to factors when needed.

    • str(emp) => gives the structure of the whole data frame
    • To access
      • column →   f1 <- emp["name"] (a one-column data frame) or emp$name (a vector)
      • row →
        f1 <- emp[2,] => only the 2nd row
        f1 <- emp[2:3,] => 2nd and 3rd rows
      • emp[2,3] => element at (2, 3) position

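    • To Add
      • list(4, "D", 61) to emp as a new row
        • x <- list(4, "D", 61)
        • emp <- rbind(emp, x)   → assign the result back, rbind does not modify emp in place
      • the values (12, 4, 5, 9) to emp as a column
        • y <- c(12, 4, 5, 9)   → needs one value per row (emp has 4 rows after the rbind above)
        • emp <- cbind(emp, Age = y) → NAME of the column is Age with the values of y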
    • To Remove
      • Row => emp<-emp[-2,]
      • Column => emp$id<- NULL
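
The data frame steps above, run end to end as one sketch. It follows the emp example from the notes; the Age values are the same illustrative numbers.

```r
emp <- data.frame(
  id   = 1:3,
  name = c("A", "B", "C"),
  sal  = c(523.4, 98, 452.89),
  stringsAsFactors = FALSE
)
str(emp)                 # structure: 3 obs. of 3 variables

emp$name                 # column as a character vector
emp[2, ]                 # second row
emp[2, 3]                # 98

emp <- rbind(emp, list(4, "D", 61))         # add a row (now 4 rows)
emp <- cbind(emp, Age = c(12, 4, 5, 9))     # add a column, one value per row

emp <- emp[-2, ]         # remove the second row
emp$id <- NULL           # remove the id column
```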
Detailed Information
  • paste - converts its arguments to character, concatenates them and returns a character vector => can be assigned to a variable
    • paste(10, 20) → "10 20" (use space as the default separator)
      paste(10, 20, sep="-")    → "10-20"
    • x <- c("a", "b", "c")
      paste(x, 1:5) → "a 1" "b 2" "c 3" "a 4" "b 5"

    • collapse - joins all elements of the resulting vector into a single string, separated by the value of collapse
      • x <- c("a", "b", "c")
        paste(x, 1:5, collapse = '-') → "a 1-b 2-c 3-a 4-b 5"
      • 'sep' separates elements within each position of the concatenation
        'collapse' joins the final results into a single string


    • paste0 - concatenates strings without any separator (all other properties are the same)
      • slightly more efficient than paste
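
A quick sketch contrasting sep, collapse and paste0 as described above. The "id" prefix in the last line is just an illustrative string.

```r
paste(10, 20)                    # "10 20"
paste(10, 20, sep = "-")         # "10-20"

x <- c("a", "b", "c")
p1 <- paste(x, 1:5)              # "a 1" "b 2" "c 3" "a 4" "b 5" (x is recycled)
p2 <- paste(x, 1:5, collapse = "-")   # "a 1-b 2-c 3-a 4-b 5"

p3 <- paste0("id", 1:3)          # "id1" "id2" "id3"
```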

  • cat - concatenates and prints directly to the console => useful for debugging
    • its output cannot be stored in a variable (cat returns NULL)
    • '\n' and '\t' act as newline and tab only when printed with cat → print shows them as the literal escapes \n and \t (NOTE - '\n' or '\t' can still be used as sep with paste)
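
The difference between cat and print can be seen in a sketch like this. The strings are made up.

```r
msg <- paste("line1", "line2", sep = "\n")   # "\n" used as separator

print(msg)             # shows the escape: [1] "line1\nline2"
cat(msg, "\n")         # prints line1 and line2 on separate lines

ret <- cat("done\n")   # cat prints, but returns NULL
is.null(ret)           # TRUE
```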

  • letters → built-in constant (a character vector) containing the 26 lowercase English letters
    LETTERS → contains the 26 uppercase English letters
    letters[ c(2,4,6) ]  → "b" "d" "f" and with LETTERS[ c(2,4,6) ] → "B" "D" "F"

  • Use readline for user input (like cin>>x)
    • Use prompt for better user experience
      • Eg.=> a<- readline(prompt = "Enter your name: ")

  • Naming of Vector or List
    • cnt <- c("one","two","three","four")
      names(cnt) <- c("a","b","c","d")
    • Now, b represents two
      cnt["b"] → gives "two"
      cnt$b → works only if cnt is a list; $ gives an error on an atomic vector
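
A sketch of name-based access for a vector and for a list. Converting with as.list is only a convenient way to get a list with the same names.

```r
cnt <- c("one", "two", "three", "four")
names(cnt) <- c("a", "b", "c", "d")

cnt["b"]          # element named b, printed with its name
cnt[["b"]]        # "two" without the name

lst <- as.list(cnt)   # same names, but now a list
lst$b             # "two" ($ works on lists)
```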

  • ls() - stands for "list objects" & lists the names of all objects in the workspace
    rm() - stands for "remove" & removes objects from the current workspace
    summary() - gives summary statistics such as min, quartiles, median, mean, max

  • = VS <- → use <- as it is always correct, while = can fail in rare cases
    • x <- y <- 5 is correct and so is x = y = 5
      x <- y = 5 is wrong because
       = has lower precedence than <- → it is parsed as (x <- y) = 5, which is not a valid assignment

    • Difference in scope when they are used to set an argument value in a function call
      • f(x = 5)  → x is matched to the function's argument => x doesn't exist in the user workspace
        f(x <- 5)  → x is created in the user workspace and its value is passed => x is still there after the function call completes

    • Press Alt and '-' together to type '<-' in RStudio
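
The scope difference above can be observed with a small sketch. double_it is a hypothetical helper written for this example.

```r
double_it <- function(x) x * 2     # hypothetical helper

if (exists("x", inherits = FALSE)) rm(x)   # start without an x

r1 <- double_it(x = 5)             # 10; x is only the argument inside the function
exists("x", inherits = FALSE)      # FALSE: nothing was created in the workspace

r2 <- double_it(x <- 5)            # 10; x <- 5 is evaluated in the calling environment
x                                  # 5: x now exists after the call
```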
Graphics
  • plot(a, b, main="TASK-7", xlab="Salary", ylab="Age", pch=19, col="red", cex=1.5)
    • main creates heading => TASK-7
      x-axis label => Salary
      y-axis label => Age
      col is for colours
      pch = 19 for solid circles
      cex = 1.5 → draws the plotting symbols at 150% of the default size

    • plot is a generic function: it chooses a suitable plot type from the class and number of the arguments it is given (e.g. two numeric vectors give a scatter plot)

  • Categorical variables - use bar chart (use barplot)
    • Don't call barplot(..) directly on the raw data → it draws one bar per raw value → Messy
      First make table(..), then use barplot → table summarizes the data => better visualization
      • table - used for creating frequency tables and cross-tabulations
        => summarizing categorical data
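
A minimal sketch of the table-then-barplot workflow described above. The variety vector and the axis labels are made up.

```r
variety <- c("A", "B", "A", "C", "B", "A")   # made-up categorical data

counts <- table(variety)      # frequency table: A = 3, B = 2, C = 1
counts

# One bar per category, height = frequency
barplot(counts, main = "Count per variety", xlab = "Variety", ylab = "Frequency")
```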

  • Quantitative variables - use histogram (use hist)
    • Observe these aspects 
      • Shape - whether skewed, symmetrical
      • Gaps in histogram and Outliers

    • Histogram in groups -
      par(mfrow = c(3,1)) → arranges the plots in 3 rows and 1 column
      make a histogram for each of the 3 varieties
      par(mfrow = c(1,1)) → restores the graphic parameter

    • breaks parameter in hist - determines the number and width of bins (intervals)
      • Fewer bins - oversimplify the data and hide important details
        More bins - overcomplicate the data and show noise instead of trends
      • breaks = 5 → suggests about 5 bins of equal width (R treats a single number as a hint and picks "pretty" break points)
        breaks = c(0, 10, 20, 30) → specifies exactly where each bin starts and ends (must cover the whole data range)

    • xlim parameter in hist - limits the range of x-axis
      • Data outside range will be included in calculations but not shown on plot
    • freq parameter in hist - if set to FALSE => plots density (not frequency)
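
The hist parameters above can be explored with a sketch like this. vals is made-up sample data, and plot = FALSE is used first so the bin counts can be inspected without drawing.

```r
vals <- c(2, 5, 7, 11, 13, 18, 21, 24, 29)   # made-up sample data

# Explicit break points: bins (0,10], (10,20], (20,30]
h <- hist(vals, breaks = c(0, 10, 20, 30), plot = FALSE)
h$counts                                     # 3 3 3

par(mfrow = c(2, 1))                         # two plots stacked vertically
hist(vals, breaks = c(0, 10, 20, 30), main = "Frequency")
hist(vals, breaks = c(0, 10, 20, 30), freq = FALSE, main = "Density")
par(mfrow = c(1, 1))                         # restore the layout
```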

  • For two quantitative variables - use scatter plot (use plot)
    • Observe these aspects 
      • Linear - whether they follow linearity or not
      • Consistent spread across the plane
      • Outliers and correlation
