Title: | Data related to the book "R Statistical Application Development by Example" |
---|---|
Description: | The package contains all the data sets related to the book written by the maintainer of the package. |
Authors: | Prabhanjan Tattar |
Maintainer: | Prabhanjan Tattar <[email protected]> |
License: | GPL-2 |
Version: | 1.0 |
Built: | 2025-02-26 04:29:51 UTC |
Source: | https://github.com/cran/RSADBE |
The RSADBE package contains all the data sets used in the book "R Statistical Application Development by Example". The data sets have been collected from various sources, and an attempt has been made to give all the right credits. If any credit has been omitted, kindly accept the current work as a compliment to yours.
Package: | RSADBE |
Type: | Package |
Version: | 1.0 |
Date: | 2013-05-13 |
License: | GPL-2 |
This package is intended to complement the book. Any data set required in the book may be loaded with data(), for example data(GC).
Prabhanjan
Maintainer: Prabhanjan Tattar <[email protected]>
Tattar, P.N. (2013). R Statistical Application Development by Example. Packt Publishing.
data(GC)
A data set reporting five different types of bugs for five software systems. The bug counts are available both before and after release.
data(Bug_Metrics_Software)
A three-dimensional array of bug counts for 5 software systems at 5 severity levels.
http://www.eclipse.org/jdt/core/index.php
data(Bug_Metrics_Software)
Partitions play a very important role in the CART methodology. This data set has been constructed to translate the intuition into partitions.
data(CART_Dummy)
A data frame with 54 observations on the following 3 variables.
X1
Input variable 1
X2
Input variable 2
Y
category of the output
Berk, R. A. (2008). Statistical Learning from a Regression Perspective. Springer.
data(CART_Dummy)
CART_Dummy$Y = as.factor(CART_Dummy$Y)
par(mfrow=c(1,2))
plot(c(0,12),c(0,10),type="n",xlab="X1",ylab="X2")
points(CART_Dummy$X1[CART_Dummy$Y==0],CART_Dummy$X2[CART_Dummy$Y==0],pch=15,col="red")
points(CART_Dummy$X1[CART_Dummy$Y==1],CART_Dummy$X2[CART_Dummy$Y==1],pch=19,col="green")
title(main="A Difficult Classification Problem")
plot(c(0,12),c(0,10),type="n",xlab="X1",ylab="X2")
points(CART_Dummy$X1[CART_Dummy$Y==0],CART_Dummy$X2[CART_Dummy$Y==0],pch=15,col="red")
points(CART_Dummy$X1[CART_Dummy$Y==1],CART_Dummy$X2[CART_Dummy$Y==1],pch=19,col="green")
segments(x0=c(0,0,6,6),y0=c(3.75,6.25,2.25,5),x1=c(6,6,12,12),y1=c(3.75,6.25,2.25,5),lwd=2)
abline(v=6,lwd=2)
title(main="Looks a Solvable Problem Under Partitions")
The data set is adapted from Velleman and Hoaglin (1984). The body temperature of a cow is measured at 6:30am on 75 consecutive days. We use this data set to explain the concept of "data smoothing". The data appear on page 165, where body temperatures for 30 days are given.
data(CT)
A data frame with 30 observations on the following 2 variables.
Day
day number
Temperature
temperature at 6:30am
The entire classic book of Velleman and Hoaglin is available at http://dspace.library.cornell.edu/bitstream/1813/78/2/A-B-C_of_EDA_040127.pdf
Velleman, P.F., and Hoaglin, D. (1984). Applications, Basics, and Computing of Exploratory Data Analysis.
data(CT)
plot.ts(CT$Temperature,col="red",pch=1)
The data pertain to an experiment in which the drain current is measured against the ground-to-source voltage. We use this data set to understand a simple scatterplot.
data(DCD)
A data frame with 10 observations on the following 2 variables.
GTS_Voltage
The voltage
Drain_Current
The drain current
Montgomery, D. C., and Runger, G. C. (2007). Applied Statistics and Probability for Engineers, (With CD). J.Wiley.
data(DCD)
plot(DCD)
This data set is used simply to understand the working of the R functions read.table, View, class, and sapply.
data(employ)
A data frame with 60 observations on the following 3 variables.
Trade
a numeric vector
Food
a numeric vector
Metals
a numeric vector
data(employ)
Sir Francis Galton used this data set to understand the (linear) relationship between the height of a parent and the height of the child.
data(galton)
A data frame with 928 observations on the following 2 variables.
child
children's height
parent
parent's height
A scatter plot may be used for a preliminary investigation of the kind of relationship between the parents' height and their children's height. A simple linear regression model may also be built to quantify the relationship.
http://en.wikipedia.org/wiki/Francis_Galton
data(galton)
plot(galton)
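The simple linear regression mentioned in the description could be sketched as follows. This is only an illustration, not the book's own code: it assumes the RSADBE package is installed, and when it is not, it falls back to a small simulated stand-in with the same column names (the simulated values are not Galton's data).

```r
# Sketch: regress child height on parent height.
# Assumption: RSADBE is installed; otherwise a simulated stand-in
# (NOT Galton's data) with the same column names is used.
if (requireNamespace("RSADBE", quietly = TRUE)) {
  data(galton, package = "RSADBE")
} else {
  set.seed(1)
  parent <- rnorm(100, mean = 68, sd = 2)
  galton <- data.frame(child = 24 + 0.65 * parent + rnorm(100, sd = 2),
                       parent = parent)
}

# Fit the simple linear regression model
fit <- lm(child ~ parent, data = galton)
coef(fit)  # intercept and slope

# Scatter plot with the fitted line overlaid
plot(child ~ parent, data = galton)
abline(fit, col = "red")
```

The slope returned by coef(fit) quantifies how much the expected child height changes per unit change in parent height.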
This data set is used primarily to understand multivariate data. The multiple regression model is also introduced and discussed in full through this example.
data(Gasoline)
A data frame with 25 observations on the following 12 variables.
y
Miles per gallon
x1
Displacement (cubic inches)
x2
Horsepower (foot-pounds)
x3
Torque (foot-pounds)
x4
Compression ratio
x5
Rear axle ratio
x6
Carburetor (barrels)
x7
Number of transmission speeds
x8
Overall length (inches)
x9
Width (inches)
x10
Weight (pounds)
x11
Type of transmission (A-automatic, M-manual)
Montgomery, D. C., Peck, E.A., and Vining, G.G. (2012). Introduction to linear regression analysis. Wiley.
data(Gasoline)
Loans are an asset for banks. However, not all loans are promptly repaid, so it is important for a bank to build a classification model that can distinguish loan defaulters from those who complete the loan tenure.
data(GC)
A data frame with 1000 observations on the following 21 variables.
checking
Status of existing checking account
duration
Duration in months
history
Credit history
purpose
Purpose of loan
amount
Credit amount
savings
Savings account or bonds
employed
Present employment since
installp
Installment rate in percentage of disposable income
marital
Personal status and sex
coapp
Other debtors or guarantors
resident
Present residence since
property
Property
age
Age in years
other
Other installment plans
housing
Housing
existcr
Number of existing credits at this bank
job
Job
depends
Number of people being liable to provide maintenance for
telephon
Telephone
foreign
Foreign worker
good_bad
Loan Defaulter
http://www.stat.auckland.ac.nz/~reilly/credit-g.arff and http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)
cran.r-project.org/doc/contrib/Sharma-CreditScoring.pdf
data(GC)
CPU time is known to depend on the number of active IO processes. This data set is used to understand scatterplots, resistant lines, and the simple linear regression model.
data(IO_Time)
A data frame with 10 observations on the following 2 variables.
No_of_IO
Number of IO Processes
CPU_Time
The CPU time
http://www.cs.gmu.edu/~menasce/cs700/files/SimpleRegression.pdf
data(IO_Time)
plot(IO_Time)
The concepts learnt in the latter half of the book are consolidated and worked through using this example.
data(lowbwt)
A data frame with 189 observations on the following 10 variables.
LOW
indicator of birth weight less than 2.5kg
AGE
mother's age in years
LWT
mother's weight in pounds at last menstrual period
RACE
mother's race ("white", "black", "other")
SMOKE
smoking status during pregnancy
PTL
number of previous premature labours
HT
history of hypertension
UI
presence of uterine irritability
FTV
number of physician visits during the first trimester
BWT
birth weight in grams
http://www.statlab.uni-heidelberg.de/data/linmod/birthweight.html
Hosmer, D.W. and Lemeshow, S. (2001). Applied Logistic Regression. New York: Wiley.
data(lowbwt)
plot(lowbwt)
The problem is to understand the effect of the average amount of tobacco smoked and the cause of death on the male death rates per 1000.
data(MDR)
A data frame with 15 observations on the following 5 variables.
X
Death Causes
G0
No smoking
G14
Between 1-14 grams
G24
Between 15-24 grams
G25
More than 25 grams
http://dspace.library.cornell.edu/bitstream/1813/78/2/A-B-C_of_EDA_040127.pdf
Velleman, Paul F., and David C. Hoaglin. Applications, basics, and computing of exploratory data analysis. Vol. 142. Boston: Duxbury Press, 1981.
data(MDR)
boxplot(MDR)
An experiment is conducted where the octane rating of gasoline blends can be obtained using two methods. Two samples are available for testing each type of blend, and Snee (1981) obtains 32 different blends over an appropriate spectrum of the target octane ratings.
data(octane)
A data frame with 32 observations on the following 2 variables.
Method_1
Ratings under Method 1
Method_2
Ratings under Method 2
Vining, G.G., and Kowalski, S.M. (2011). Statistical Methods for Engineers, 3e. Brooks/Cole.
data(octane)
par(mfrow=c(1,2))
hist(octane$Method_1)
hist(octane$Method_2)
## maybe str(octane) ; plot(octane) ...
This is a data set cooked up by the author to highlight the problem of overfitting. The variables have no physical meaning.
data(OF)
A data frame with 10 observations on the following 2 variables.
X
Just another covariate
Y
Just another output
data(OF)
plot(OF)
As with the "OF" data set, this data set has been created by the author to build up the ideas leading to the piecewise linear regression model.
data(PW_Illus)
A data frame with 100 observations on the following 2 variables.
X
an input vector
Y
an output vector
data(PW_Illus)
plot(PW_Illus)
The resistivity of wires is known to depend on their manufacturing process. The data set is used primarily to understand the boxplot.
data(resistivity)
A data frame with 8 observations on the following 2 variables.
Process.1
Resistivity of wires under process 1
Process.2
Resistivity of wires under process 2
Gunst, R. F. (2002). Finding confidence in statistical significance. Quality Progress, 35 (10), 107-108.
data(resistivity)
boxplot(resistivity)
This data set shows that data may also have skewness inherent in them!
data(Samplez)
A data frame with 2000 observations on the following 2 variables.
Sample_1
a numeric vector
Sample_2
a numeric vector
data(Samplez)
hist(Samplez$Sample_1)
hist(Samplez$Sample_2)
The final completion of a statistics course is believed to depend on the student's qualifying SAT-M marks. This data set is used to set up the motivation for binary regression models such as the probit and logistic regression models.
data(sat)
A data frame with 30 observations on the following 5 variables.
Student.No
Student number
Grade
Grade of the student
Pass
Pass-Fail indicator in the final exam
Sat
The SAT-M marks
GPP
The GPP group
Johnson, Valen E., and James H. Albert. Ordinal data modeling. Springer, 1999.
data(sat)
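The binary regression models that this data set motivates might be fitted along the following lines. This is a hedged sketch rather than the book's code: it assumes RSADBE is installed and that Pass is a 0/1 indicator (or a two-level factor). When the package is unavailable, a simulated stand-in with the same two column names is used, which is not the real data.

```r
# Sketch: logistic and probit regression of the pass indicator on SAT-M marks.
# Assumptions: RSADBE is installed and `Pass` is a 0/1 indicator; otherwise a
# simulated stand-in (NOT the book's data) with the same column names is used.
if (requireNamespace("RSADBE", quietly = TRUE)) {
  data(sat, package = "RSADBE")
} else {
  set.seed(42)
  Sat <- round(runif(30, min = 450, max = 750))
  sat <- data.frame(Sat = Sat,
                    Pass = rbinom(30, size = 1,
                                  prob = plogis((Sat - 600) / 40)))
}

# Same linear predictor, two different link functions
logit_fit  <- glm(Pass ~ Sat, data = sat, family = binomial(link = "logit"))
probit_fit <- glm(Pass ~ Sat, data = sat, family = binomial(link = "probit"))

# Compare the fitted pass probabilities under both links
head(cbind(Sat = sat$Sat,
           logit = fitted(logit_fit),
           probit = fitted(probit_fit)))
```

Both models return fitted probabilities between 0 and 1, which is the main reason they are preferred over ordinary linear regression for a pass/fail outcome.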
This data set is primarily used to illustrate some basic R functions.
data(SCV)
A data frame with 16 observations on the following 6 variables.
Response
an output vector
A
variable A
B
variable B
C
Variable C
D
variable D
E
a factor with two levels: Modified, Usual
data(SCV)
This data set is a part of the SCV dataset.
data(SCV_Modified)
A data frame with 8 observations on the following 6 variables.
Response
an output vector
A
variable A
B
variable B
C
Variable C
D
variable D
E
a factor with two levels Modified
data(SCV_Modified)
This data set is part of the SCV data set.
data(SCV_Usual)
A data frame with 8 observations on the following 6 variables.
Response
an output vector
A
variable A
B
variable B
C
Variable C
D
variable D
E
a factor with two levels Usual
data(SCV_Usual)
The software system Eclipse JDT Core has 997 different class environments related to its development. Each bug identified is classified by its severity as Bugs, NonTrivial, Major, Critical, or High. We need to understand the bug counts before and after the software release.
data(Severity_Counts)
Before and after release bug counts at five severity levels for the JDT software.
http://www.eclipse.org/jdt/core/index.php
data(Severity_Counts)
barplot(Severity_Counts,xlab="Bug Count",xlim=c(0,12000), col=rep(c(2,3),5))
The ROC curve is an important tool for comparing different models for the same classification problem. This bare-bones data set is provided simply to support a clear understanding of how an ROC curve is constructed.
data(simpledata)
A data frame with 200 observations on the following 2 variables.
Predictions
Predicted probabilities
Label
True class of the observations
data(simpledata)
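The ROC construction that this data set supports can be sketched by sweeping a threshold over the predicted probabilities. The helper names roc_points and auc below are hypothetical, and the toy vectors are made-up values rather than the simpledata set. To apply the sketch to simpledata, pass simpledata$Predictions as the scores and a logical vector derived from simpledata$Label; how Label encodes the positive class is an assumption you would need to check first.

```r
# Sketch: ROC points from predicted probabilities and true classes.
# `scores`: predicted probabilities; `is_pos`: TRUE for the positive class.
# (roc_points and auc are hypothetical helper names, not part of RSADBE.)
roc_points <- function(scores, is_pos) {
  # Classify as positive when score >= threshold; Inf gives the (0, 0) point
  thresholds <- c(Inf, sort(unique(scores), decreasing = TRUE))
  tpr <- sapply(thresholds, function(t) sum(scores >= t & is_pos) / sum(is_pos))
  fpr <- sapply(thresholds, function(t) sum(scores >= t & !is_pos) / sum(!is_pos))
  data.frame(threshold = thresholds, fpr = fpr, tpr = tpr)
}

# Area under the ROC curve by the trapezoidal rule
auc <- function(roc) {
  sum(diff(roc$fpr) * (head(roc$tpr, -1) + tail(roc$tpr, -1)) / 2)
}

# Toy illustration with made-up values (NOT the simpledata set)
scores <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3)
is_pos <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)
roc <- roc_points(scores, is_pos)

plot(roc$fpr, roc$tpr, type = "s",
     xlab = "False positive rate", ylab = "True positive rate")
abline(0, 1, lty = 2)  # reference line for a random classifier
auc(roc)
```

A curve that hugs the top-left corner, and hence has an area closer to 1, indicates a model that separates the two classes better.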
This data is used to check your understanding of the multiple linear regression model.
data(SPD)
A data frame with 30 observations on the following 7 variables.
Y
Supervisor's performance
X1
Aspect 1
X2
Aspect 2
X3
Aspect 3
X4
Aspect 4
X5
Aspect 5
X6
Aspect 6
"Regression analysis by example" by Samprit Chatterjee and Ali S. Hadi, Wiley
data(SPD)
pairs(SPD)
The sample questionnaire data is used simply to familiarize the reader with data and statistical terminology.
data(SQ)
A data frame with 20 observations on the following 12 variables.
Customer_ID
Customer ID
Questionnaire_ID
Questionnaire ID
Name
Customer's name
Gender
Customer's gender: Female, Male
Age
Age of the customer
Car_Model
Car Model's name
Car_Manufacture_Year
Month and year of car's manufacturing
Minor_Problems
Indicator of whether minor problems were fixed by the workshop center: No, Yes
Major_Problems
Indicator of whether major problems were fixed by the workshop center: No, Yes
Mileage
The overall mileage of the car (kms/litre)
Odometer
The overall kilometers travelled by the car
Satisfaction_Rating
How satisfied was the customer: Very Poor < Poor < Average < Good < Very Good
data(SQ)
Rahul Dravid has been a modern architect of the Indian Test cricket team. His resilient centuries and his ability to hold the wicket at one end of the pitch have earned him the name "The Wall". We analyze his centuries in "Home" and "Away" Test matches.
data(TheWALL)
A data frame with 36 observations on the following 11 variables.
Sl_No
An indicator
Score
The century scores
Not_Out_Indicator
Indicates whether Dravid remained unbeaten at the end of the team innings
Against
The teams against whom Dravid scored the century
Position
Dravid's batting position, out of 11
Innings
An indicator of the first to fourth innings
Test
Test number
Venue
Venue of the test match
HA_Ind
Match was in home country or away
Date
Date on which the test began
Result
Did India win the match?
data(TheWALL)
The voltage in a guided missile is known to drop after a certain time. The data has been used to illustrate certain cubic spline models.
data(VD)
A data frame with 41 observations on the following 2 variables.
Time
Time of missile
Voltage_Drop
Drop in the voltage
Montgomery, Douglas C., Elizabeth A. Peck, and G. Geoffrey Vining. Introduction to linear regression analysis. Wiley, 2012.
data(VD)