R Programming

Tidyverse

April 27, 2022

The Tidyverse world

In day-to-day data analysis, you will probably use one of the packages below (or all of them!): ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and forcats. But these are not the only packages in the Tidyverse world. Many others are installed along with them, mainly those related to reading data, handling dates, and more (e.g., DBI, httr, googledrive, lubridate). You will certainly find packages for every stage of the data journey listed below:

Supervised Learning in R: Classification

By Salerno in R Programming

August 19, 2020

Chapter 1 - k-Nearest Neighbors (kNN)

1.1 - Recognizing a road sign with kNN

After several trips with a human behind the wheel, it is time for the self-driving car to attempt the test course alone. As it begins to drive away, its camera captures the following image:

[Figure 1]

Can you apply a kNN classifier to help the car recognize this sign? The signs dataset must be loaded in your workspace, along with the data frame next_sign, which holds the observation you want to classify.
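The classification step described above could look roughly like the sketch below. The signs and next_sign objects are not shown in this excerpt, so the built-in iris data stands in for them; the held-out row, the choice of k = 3, and all variable names are assumptions for illustration only. The knn() function comes from the class package.

```r
# kNN sketch on iris (a stand-in for the signs data, which is not shown here)
library(class)  # provides knn()

# Hold out one observation to play the role of next_sign (arbitrary choice)
held_out_row <- 100
train_x <- iris[-held_out_row, 1:4]     # numeric features of the known cases
train_y <- iris$Species[-held_out_row]  # labels of the known cases
new_obs <- iris[held_out_row, 1:4]      # the observation to classify

# Classify the new observation by majority vote among its 3 nearest neighbours
predicted <- knn(train = train_x, test = new_obs, cl = train_y, k = 3)
print(predicted)
```

With the real data, the training features would be the columns of signs other than the label, and next_sign would take the place of new_obs.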

Credit Card Fraud Detection

By Bruno Ferrari in Classification

March 26, 2020

Objective

Our goal is to train a neural network to detect fraudulent credit card transactions in a dataset covering two days of transactions by European cardholders.

Source: https://www.kaggle.com/mlg-ulb/creditcardfraud/data

Data

credit <- read.csv(path)

The dataset contains transactions made by credit cards in September 2013 by European cardholders, all of which occurred over two days. As we can see, the dataset consists of thirty explanatory variables and a response variable that indicates whether a transaction was fraudulent.

German Credit and Regression Tree

By Bruno Ferrari in R Programming

February 7, 2020

Objective

Train a model and use it to make predictions for the German Credit dataset.

Data

german <- read.csv(path)
str(german)
## 'data.frame': 1000 obs. of 21 variables:
## $ default : int 0 1 0 0 1 0 0 0 0 1 ...
## $ account_check_status : Factor w/ 4 levels "< 0 DM",">= 200 DM / salary assignments for at least 1 year",..: 1 3 4 1 1 4 4 3 4 3 .
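The train-then-predict workflow named in the objective can be sketched as follows. The German Credit file path is not given in this excerpt, so the built-in mtcars data is used as a stand-in; the split size, the seed, and the use of the rpart package are assumptions, not necessarily what the post itself does.

```r
# Regression tree sketch on mtcars (a stand-in for the German Credit data)
library(rpart)  # recursive partitioning trees

set.seed(42)  # make the random split reproducible
train_idx <- sample(nrow(mtcars), size = 24)  # 24 of 32 rows for training
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

# Fit a regression tree (method = "anova") predicting mpg from all other columns
fit <- rpart(mpg ~ ., data = train, method = "anova")

# Predict on the held-out rows and summarise the error
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$mpg - pred)^2))
print(rmse)
```

For a 0/1 response such as default, one option would be to convert the response to a factor and fit with method = "class" instead.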

Correlation and Regression

By Salerno in 82

January 19, 2020

path <- "C:/Users/andre/OneDrive/Área de Trabalho/salerno/blogdown/datasets/ncbirths"
path <- paste0(path, "/ncbirths.csv")
data <- read.csv(path, stringsAsFactors = FALSE)
dim(data)
## [1] 1450 15
names(data)
## [1] "ID" "Plural" "Sex" "MomAge"
## [5] "Weeks" "Marital" "RaceMom" "HispMom"
## [9] "Gained" "Smoke" "BirthWeightOz" "BirthWeightGm"
## [13] "Low" "Premie" "MomRace"
library(ggplot2)
ggplot(data = data, aes(y = BirthWeightOz, x = Weeks)) +
  geom_point()
## Warning: Removed 1 rows containing missing values (geom_point).
# Boxplot of weight vs.

Classifying using Logistic Regression

By Salerno in R Programming

January 13, 2020

1 - Objective

The objective of this example is to classify each of a number of samples as benign or malignant.

2 - Data

Let’s get the data.

BCData <- read.table(url("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data"), sep = ",")
# setting column names
names(BCData) <- c('Id', 'ClumpThickness', 'CellSize', 'CellShape', 'MarginalAdhesion', 'SECellSize', 'BareNuclei', 'BlandChromatin', 'NormalNucleoli', 'Mitoses', 'Class')

3 - EDA - Exploratory Data Analysis

It’s important to extract preliminary knowledge from the dataset.

dim(BCData)
## [1] 699 11
str(BCData)
## 'data.frame': 699 obs.
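A two-class logistic regression of the kind named in the title might be sketched as below. To keep the example self-contained it uses the built-in mtcars data rather than BCData; the predictor, the 0.5 cutoff, and the variable names are assumptions for illustration.

```r
# Logistic regression sketch on mtcars (a stand-in for BCData)
# am is a 0/1 response, playing the role of the benign/malignant class
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Predicted probabilities of class 1, then a 0.5 cutoff to get class labels
prob <- predict(fit, type = "response")
pred_class <- ifelse(prob > 0.5, 1, 0)

# Confusion table of predicted versus actual classes
print(table(predicted = pred_class, actual = mtcars$am))
```

With BCData, the Class column would first need recoding to 0/1 (or a two-level factor) before being passed to glm().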

Diagnosing breast cancer with the kNN algorithm

By Salerno in 82

January 5, 2020

1 - Introduction

Could machine learning algorithms detect an abnormal cell process in advance? We know that this clinical battle is not easy, and that many people are involved in trying to identify a clear path to a cure. As a complement to human decision-making, could technology reduce the subjective bias inherent in the process and improve our decisions? We know that human processing capacity is limited compared with the high capacity of computers.