| Title: | A Companion Package for the Book "A Course in Statistics with R" |
|---|---|
| Description: | A book designed to meet the requirements of masters students. Tattar, P.N., Suresh, R., and Manjunath, B.G. "A Course in Statistics with R", J. Wiley, ISBN 978-1-119-15272-9. |
| Authors: | Prabhanjan Tattar |
| Maintainer: | Prabhanjan Tattar <[email protected]> |
| License: | GPL-2 |
| Version: | 1.0 |
| Built: | 2026-06-03 10:44:41 UTC |
| Source: | https://github.com/cran/ACSWR |
"A Course in Statistics with R" has been designed to meet the requirements of masters students.
| Package: | ACSWR |
| Type: | Package |
| Version: | 1.0 |
| Date: | 2015-08-19 |
| License: | GPL-2 |
Prabhanjan Tattar
Maintainer: Prabhanjan Tattar <[email protected]>
Tattar, P. N., Suresh, R., and Manjunath, B. G. (2016). A Course in Statistics with R. J. Wiley.
hist(rnorm(100))hist(rnorm(100))
To understand the relationship between the abrasion index for the tire tread, the output y, as a linear function of the hydrated silica level x1, silane coupling agent level x2 and the sulfur level x3, Derringer and Suich (1980) collected data on 14 observation points.
data("abrasion_index")data("abrasion_index")
A data frame with 14 observations on the following 4 variables.
x1hydrated silica level
x2silane coupling agent level
x3sulfur level
yabrasion index for the tire tread
Derringer, G., and Suich, R. (1980). Simultaneous Optimization of Several Response Variables. Journal of Quality Technology, 12, 214-219.
data(abrasion_index) ailm <- lm(y~x1+x2+x3,data=abrasion_index) pairs(abrasion_index)data(abrasion_index) ailm <- lm(y~x1+x2+x3,data=abrasion_index) pairs(abrasion_index)
The data set is obtained from Rencher (2002). Here, a 12-year old girl rates 7 of her acquaintances on a differential grade of 1-9 for five adjectives kind, intelligent, happy, likable, and just.
data(adjectives)data(adjectives)
A data frame with 7 observations on the following 6 variables.
Peoplea factor with levels FATHER FSM1a FSM2 FSM3 MSMb SISTER TEACHER
Kinda numeric vector
Intelligenta numeric vector
Happya numeric vector
Likeablea numeric vector
Justa numeric vector
Rencher, A.C. (2002). Methods of Multivariate Analysis, 2e. J. Wiley.
data(adjectives) adjectivescor <- cor(adjectives[,-1]) round(adjectivescor,3) adj_eig <- eigen(adjectivescor) cumsum(adj_eig$values)/sum(adj_eig$values) adj_eig$vectors[,1:2] loadings1 <- adj_eig$vectors[,1]*sqrt(adj_eig$values[1]) loadings2 <- adj_eig$vectors[,2]*sqrt(adj_eig$values[2]) cbind(loadings1,loadings2) communalities <- (adj_eig$vectors[,1]*sqrt(adj_eig$values[1]))^2+ (adj_eig$vectors[,2]*sqrt(adj_eig$values[2]))^2 round(communalities,3) specific_variances <- 1-communalities round(specific_variances,3) var_acc_factors <- adj_eig$values round(var_acc_factors,3) prop_var <- adj_eig$values/sum(adj_eig$values) round(prop_var,3) cum_prop <- cumsum(adj_eig$values)/sum(adj_eig$values) round(cum_prop,3)data(adjectives) adjectivescor <- cor(adjectives[,-1]) round(adjectivescor,3) adj_eig <- eigen(adjectivescor) cumsum(adj_eig$values)/sum(adj_eig$values) adj_eig$vectors[,1:2] loadings1 <- adj_eig$vectors[,1]*sqrt(adj_eig$values[1]) loadings2 <- adj_eig$vectors[,2]*sqrt(adj_eig$values[2]) cbind(loadings1,loadings2) communalities <- (adj_eig$vectors[,1]*sqrt(adj_eig$values[1]))^2+ (adj_eig$vectors[,2]*sqrt(adj_eig$values[2]))^2 round(communalities,3) specific_variances <- 1-communalities round(specific_variances,3) var_acc_factors <- adj_eig$values round(var_acc_factors,3) prop_var <- adj_eig$values/sum(adj_eig$values) round(prop_var,3) cum_prop <- cumsum(adj_eig$values)/sum(adj_eig$values) round(cum_prop,3)
Gore, et al. (2006) consider the frequencies of cancer deaths of Japanese atomic bomb survivors by extent of exposure, years after exposure, etc. This data set has appeared in the journal "Statistical Sleuth".
data("atombomb")data("atombomb")
A data frame with 84 observations on the following 4 variables.
RadiansExtent of exposure to the radian levels
Count.Typethe type of count At Risk Death Count
Count.Age.Groupage group with levels '0-7' '12-15' '16-19' '20-23' '24-27' '28-41' '8-11'
Frequencythe count of deaths
Gore, A.P., Paranjape, S. A., and Kulkarni, M.B. (2006). 100 Data Sets for Statistics Education. Department of Statistics, University of Pune.
data(atombomb) atombombxtabs <- xtabs(Frequency~Radians+Count.Type+Count.Age.Group,data=atombomb) atombombxtabsdata(atombomb) atombombxtabs <- xtabs(Frequency~Radians+Count.Type+Count.Age.Group,data=atombomb) atombombxtabs
An experiment where the life of a battery is modeled as a function of the extreme variations in temperature of three levels 15, 70, and 1250 Fahrenheit and three type of plate material. Here, the engineer has no control on the temperature variations once the device leaves the factory. Thus, the task of the engineer is to investigate two major problems: (i) The effect of material type and temperature on the life of the device, and (ii) Finding the type of material which has least variation among the varying temperature levels. For each combination of the temperature and material, 4 replications of the life of battery are tested.
data(battery)data(battery)
A data frame with 36 observations on the following 3 variables.
Lifebattery life
Materialthe type of plate material
Temperaturethree extreme variations of temperature
Montgomery, D. C. (1976-2012). Design and Analysis of Experiments, 8e. J.Wiley.
data(battery) names(battery) <- c("L","M","T") battery$M <- as.factor(battery$M) battery$T <- as.factor(battery$T) battery.aov <- aov(L~M*T,data=battery) model.matrix(battery.aov) summary(battery.aov)data(battery) names(battery) <- c("L","M","T") battery$M <- as.factor(battery$M) battery$T <- as.factor(battery$T) battery.aov <- aov(L~M*T,data=battery) model.matrix(battery.aov) summary(battery.aov)
A simple function to understand the algorithm to simulate psuedo-observations from binomial distribution. It is an implementation of the algorithm given in Section 11.3.1. This function is not an alternative to the rbinom function.
Binom_Sim(size, p, N)Binom_Sim(size, p, N)
size |
Size of the binomial distribution |
p |
Denotes the probability of success |
N |
The number of observations required from b(n,p) |
This function is to simply explain the algorithm described in the text. For efficient results, the user should use the rbinom function.
Prabhanjan N. Tattar
rbinom
Binom_Sim(10,0.5,100)Binom_Sim(10,0.5,100)
The height of the fills in the soft drink bottle is required to be as consistent as possible and it is controlled through three factors: (i) the percent carbonation of the drink, (ii) the operating pressure in the filler, and (iii) the line speed which is the number of bottles filled per minute. The first factor variable of the percent of carbonation is available at three levels of 10, 12, and 14, the operating pressure is at 25 and 30 psi units, while the line speed are at 200 and 250 bottles per minute. Two complete replicates are available for each combination of the three factor levels, that is, 24 total number of observations. In this experiment, the deviation from the required height level is measured.
data(bottling)data(bottling)
A data frame with 24 observations on the following 4 variables.
Deviationdeviation from required height level
Carbonationthe percent carbonation of the drink
Pressurethe operating pressure in the filler
Speedthe number of bottles filled per minute
Montgomery, D. C. (1976-2012). Design and Analysis of Experiments, 8e. J.Wiley.
data(bottling) summary(bottling.aov <- aov(Deviation~.^3,bottling)) # Equivalent way summary(aov(Deviation~ Carbonation + Pressure + Speed+ (Carbonation*Pressure)+ (Carbonation*Speed)+(Pressure*Speed)+(Carbonation*Speed*Pressure),data=bottling))data(bottling) summary(bottling.aov <- aov(Deviation~.^3,bottling)) # Equivalent way summary(aov(Deviation~ Carbonation + Pressure + Speed+ (Carbonation*Pressure)+ (Carbonation*Speed)+(Pressure*Speed)+(Carbonation*Speed*Pressure),data=bottling))
The data set is used to understand the sampling variation of the score function. The simulated data is available in Pawitan (2001).
data(bs)data(bs)
A data frame with 10 observations on the following 20 variables.
Sample.1a numeric vector
Sample.2a numeric vector
Sample.3a numeric vector
Sample.4a numeric vector
Sample.5a numeric vector
Sample.6a numeric vector
Sample.7a numeric vector
Sample.8a numeric vector
Sample.9a numeric vector
Sample.10a numeric vector
Sample.11a numeric vector
Sample.12a numeric vector
Sample.13a numeric vector
Sample.14a numeric vector
Sample.15a numeric vector
Sample.16a numeric vector
Sample.17a numeric vector
Sample.18a numeric vector
Sample.19a numeric vector
Sample.20a numeric vector
Pawitan, Y. (2001). In All Likelihood. Oxford Science Publications.
Pawitan, Y. (2001). In All Likelihood. Oxford Science Publications.
data(bs) n <- 10 sample_means <- colMeans(bs) binomial_score_fn <- function(p,xbar) n*(xbar-10*p)/(p*(1-p)) p <- seq(from=0,to=1,by=0.02) plot(p,sapply(p,binomial_score_fn,xbar=sample_means[1]),"l",xlab=expression(p), ylab=expression(S(p))) title(main="C: Score Function Plot of Binomial Model") for(i in 2:20) lines(p,sapply(p, binomial_score_fn,xbar=sample_means[i]),"l") abline(v=4) abline(h=0)data(bs) n <- 10 sample_means <- colMeans(bs) binomial_score_fn <- function(p,xbar) n*(xbar-10*p)/(p*(1-p)) p <- seq(from=0,to=1,by=0.02) plot(p,sapply(p,binomial_score_fn,xbar=sample_means[1]),"l",xlab=expression(p), ylab=expression(S(p))) title(main="C: Score Function Plot of Binomial Model") for(i in 2:20) lines(p,sapply(p, binomial_score_fn,xbar=sample_means[i]),"l") abline(v=4) abline(h=0)
The problem is to investigate the impact of smoking tobacco among British doctors, refer Example 9.2.1 of Dobson. In the year 1951, a survey was sent across among all the British doctors asking them whether they smoked tobacco and their age group Age_Group. The data also collects the person-years Person_Years of the doctors in the respective age group. A follow-up after ten years reveals the number of deaths Deaths, the smoking group indicator Smoker_Cat.
data(bs1)data(bs1)
A data frame with 10 observations on the following 9 variables.
Age_Groupa factor variable of age group with levels 35-44 45-54 55-64 65-74 75-84
Age_Catslightly re-coded to extract variables with Age_Cat taking values 1-5 respectively for the age groups 35-44, 45-54, 55-64, 65-74, and 75-84
Age_Squaresquare of the variable Age_Cat
Smoker_Catthe smoking group indicator NO YES
Smoke_Inda numeric vector
Smoke_Agetakes the Age_Cat values for the smokers group and 0 for the non-smokers
Deathsa follow-up after ten years revealing the number of deaths
Person_Yearsthe number of deaths standardized to 100000
Deaths_Per_Lakh_Yearsa numeric vector
Dobson (2002)
Dobson, A. J. (1990-2002). An Introduction to Generalized Linear Models, 2e. Chapman & Hall/CRC.
library(MASS) data(bs1) BS_Pois <- glm(Deaths~Age_Cat+Age_Square+Smoke_Ind+Smoke_Age,offset= log(Person_Years),data=bs1,family='poisson') logLik(BS_Pois) summary(BS_Pois) with(BS_Pois, pchisq(null.deviance - deviance,df.null - df.residual,lower.tail = FALSE)) confint(BS_Pois)library(MASS) data(bs1) BS_Pois <- glm(Deaths~Age_Cat+Age_Square+Smoke_Ind+Smoke_Age,offset= log(Person_Years),data=bs1,family='poisson') logLik(BS_Pois) summary(BS_Pois) with(BS_Pois, pchisq(null.deviance - deviance,df.null - df.residual,lower.tail = FALSE)) confint(BS_Pois)
An increasing concern has been the number of cesarean deliveries, especially in the private hospitals. Here, we know the number of births, the type of hospital (private or Government hospital), and the number of cesareans. We would like to model the number of cesareans as a function of the number of births and the type of hospital. A Poisson regression model is fitted for this data set.
data(caesareans)data(caesareans)
A data frame with 20 observations on the following 3 variables.
Birthstotal number of births
Hospital_Typetype of hospital, private or government
Caesareansnumber of caesareans
http://www.oxfordjournals.org/our_journals/tropej/online/ma_chap13.pdf
data(caesareans) names(caesareans) cae_pois <- glm(Caesareans~Hospital_Type+Births,data=caesareans,family='poisson') summary(cae_pois)data(caesareans) names(caesareans) cae_pois <- glm(Caesareans~Hospital_Type+Births,data=caesareans,family='poisson') summary(cae_pois)
Kramer and Jensen (1969) collected data on three variables at 10 different locations. The variables of interest are available calcium in the soil, y1, exchangeable soil calcium, y2, and turnip green calcium, y3. The hypothesis of interest is whether the mean vector is [15.0 6.0 2.85].
data(calcium)data(calcium)
A data frame with 10 observations on the following 4 variables.
Location.Numbera numeric vector
y1a numeric vector
y2a numeric vector
y3a numeric vector
Kramer, C. Y., and Jensen, D. R. (1969). Fundamentals of Multivariate Analysis, Part I. Inference about Means. Journal of Quality Technology, 1 (2), 120-133.
Rencher, A.C. (1990-2002). Methods of Multivariate Analysis, 2e. J. Wiley.
data(calcium) n <- nrow(calcium) meanx <- colMeans(calcium[,-1]) varx <- var(calcium[,-1]) mu0 <- c(15,6,2.85) t2 <- n*t(meanx-mu0) t2data(calcium) n <- nrow(calcium) meanx <- colMeans(calcium[,-1]) varx <- var(calcium[,-1]) mu0 <- c(15,6,2.85) t2 <- n*t(meanx-mu0) t2
The data is used to show the effectiveness of Chernoff faces.
data(cardata)data(cardata)
A data frame with 74 observations on the following 14 variables.
Modelvarious car models
PPrice
MMileage (in miles per gallon)
R78Repair record 1978
R77Repair record 1977
HHeadroom (in inches)
RRear seat clearance
TrTrunk space
WWeight (in pound)
LLength (in inches)
TTurning diameter
DDisplacement (in cubic inches)
GGear ratio for high gear
CCompany headquarter
data(cardata) pairs(cardata)data(cardata) pairs(cardata)
A well known explanation of the heart disease is that as the age increases, the risk of coronary heart disease also increase. The current data set and the example may be found in Chapter 1 of Hosmer and Lemeshow (1990-2013).
data(chdage)data(chdage)
A data frame with 100 observations on the following 3 variables.
IDpatient ID
AGEage of the patient
CHDCoronary Heart Disease indicator
Hosmer and Lemeshow (1990-2013).
Hosmer, D.W., and Lemeshow, S. (1990-20013). Applied Logistic Regression, 3e. Wiley.
data(chdage) plot(chdage$AGE,chdage$CHD,xlab="AGE",ylab="CHD Indicator", main="Scatter plot for CHD Data") agegrp <- cut(chdage$AGE,c(19,29,34,39,44,49,54,59,69),include.lowest=TRUE, labels=c(25,seq(31.5,56.5,5),64.5)) mp <- c(25,seq(31.5,56.5,5),64.5) # mid-points chd_percent <- prop.table(table(agegrp,chdage$CHD),1)[,2] points(mp,chd_percent,"l",col="red")data(chdage) plot(chdage$AGE,chdage$CHD,xlab="AGE",ylab="CHD Indicator", main="Scatter plot for CHD Data") agegrp <- cut(chdage$AGE,c(19,29,34,39,44,49,54,59,69),include.lowest=TRUE, labels=c(25,seq(31.5,56.5,5),64.5)) mp <- c(25,seq(31.5,56.5,5),64.5) # mid-points chd_percent <- prop.table(table(agegrp,chdage$CHD),1)[,2] points(mp,chd_percent,"l",col="red")
This data set is used to illustrate the concept of canonical correlations. Here, temperature, concentration, and time have influence on three yield variables, namely outputs, while the percentage of unchanged starting material, the percentage converted to the desired product, and the percentage of unwanted by-product form another set of related variables.
data(chemicaldata)data(chemicaldata)
A data frame with 19 observations on the following 6 variables.
y1the percentage of unchanged starting material
y2the percentage converted to the desired product
y3the percentage of unwanted by-product
x1temperature
x2concentration
x3time
Box, G. E. P., and Youle, P. V. (1955). The Exploration of Response Surfaces: An Example of the Link between the Fitted Surface and the Basic Mechanism of the System. Biometrics, 11, 287-323.
Rencher, A.C. (2002). Methods of Multivariate Analysis, 2e. J. Wiley.
data(chemicaldata) names(chemicaldata) chemicaldata$x12 <- chemicaldata$x1*chemicaldata$x2; chemicaldata$x13 <- chemicaldata$x1*chemicaldata$x3; chemicaldata$x23 <- chemicaldata$x2*chemicaldata$x3 chemicaldata$x1sq <- chemicaldata$x1^{2} chemicaldata$x2sq <- chemicaldata$x2^{2} chemicaldata$x3sq <- chemicaldata$x3^{2} S_Total <- cov(chemicaldata) cancor_xy <- sqrt(eigen(solve(S_Total[1:3,1:3])%*%S_Total[1:3, 4:12]%*%solve(S_Total[4:12,4:12])%*%S_Total[4:12,1:3])$values) cancor_xy cancor(chemicaldata[,1:3],chemicaldata[,4:12])data(chemicaldata) names(chemicaldata) chemicaldata$x12 <- chemicaldata$x1*chemicaldata$x2; chemicaldata$x13 <- chemicaldata$x1*chemicaldata$x3; chemicaldata$x23 <- chemicaldata$x2*chemicaldata$x3 chemicaldata$x1sq <- chemicaldata$x1^{2} chemicaldata$x2sq <- chemicaldata$x2^{2} chemicaldata$x3sq <- chemicaldata$x3^{2} S_Total <- cov(chemicaldata) cancor_xy <- sqrt(eigen(solve(S_Total[1:3,1:3])%*%S_Total[1:3, 4:12]%*%solve(S_Total[4:12,4:12])%*%S_Total[4:12,1:3])$values) cancor_xy cancor(chemicaldata[,1:3],chemicaldata[,4:12])
Militia means an army composed of ordinary citizens and not of professional soldiers. This data set is available in an 1846 book published by the Belgian statistician Adolphe Quetelet, and the data is believed to have been collected some thirty years before that.
data(chest)data(chest)
A data frame with 16 observations on the following 2 variables.
Chestchest width measured in inches
Countthe corresponding frequencies
Velleman, P.F., and Hoaglin, D.C. (2004). ABC of Exploratory Data Analysis. Duxbury Press, Boston.
data(chest) attach(chest) names(chest) militiamen <- rep(Chest,Count) length(militiamen) bins <- seq(33,48) bins bin.mids <- (bins[-1]+bins[-length(bins)])/2 par(mfrow=c(1,2)) h <- hist(militiamen, breaks = bins, xlab= "Chest Measurements (Inches)", main= "A: Histogram for the Militiamen") h$counts <- sqrt(h$counts) plot(h,xlab= "Chest Measurements (Inches)",ylab= "ROOT FREQUENCY", main= "B: Rootogram for the Militiamen")data(chest) attach(chest) names(chest) militiamen <- rep(Chest,Count) length(militiamen) bins <- seq(33,48) bins bin.mids <- (bins[-1]+bins[-length(bins)])/2 par(mfrow=c(1,2)) h <- hist(militiamen, breaks = bins, xlab= "Chest Measurements (Inches)", main= "A: Histogram for the Militiamen") h$counts <- sqrt(h$counts) plot(h,xlab= "Chest Measurements (Inches)",ylab= "ROOT FREQUENCY", main= "B: Rootogram for the Militiamen")
Chambers, et al. (1983), page 381, contains the cloud seeding data set. Rainfall in acre-feet for 52 clouds are measured, 50% of which have natural rain (control group) whereas the others are seeded. We need to visually compare whether seeding the clouds lead to increase in rainfall in acre-feet.
data(cloud)data(cloud)
A data frame with 26 observations on the following 2 variables.
ControlRainfall in acre-feet for 26 clouds are measured which had natural rain, that is, control group
SeededRainfall in acre-feet for 26 clouds are measured which had seeded rain
Chambers, J.M., Cleveland, W.S., Kleiner, B., and Tukey, P.A. (1983). Graphical Methods for Data Analysis. Wadsworth and Brooks/Cole.
data(cloud) stem(log(cloud$Seeded),scale=1) stem(log(cloud$Control),scale=1)data(cloud) stem(log(cloud$Seeded),scale=1) stem(log(cloud$Control),scale=1)
Thickness of cork borings in four directions of North, South, East, and West are measured for 28 trees. The problem here is to examine if the bark deposit is same in all the directions.
data(cork)data(cork)
A data frame with 28 observations on the following 4 variables.
Norththickness of cork boring in the North direction
Eastthickness of cork boring in the East direction
Souththickness of cork boring in the South direction
Westthickness of cork boring in the West direction
Rao, C. R. (1973). Linear Statistical Inference and Its Applications, 2e. J. Wiley.
data(cork) corkcent <- cork*0 corkcent[,1] <- cork[,1]-mean(cork[,1]) corkcent[,2] <- cork[,2]-mean(cork[,2]) corkcent[,3] <- cork[,3]-mean(cork[,3]) corkcent[,4] <- cork[,4]-mean(cork[,4]) corkcentsvd <- svd(corkcent) t(corkcentsvd$u)%*%corkcentsvd$u t(corkcentsvd$v)%*%corkcentsvd$v round(corkcentsvd$u %*% diag(corkcentsvd$d) %*% t(corkcentsvd$v),2) round(corkcent,2) corkcentsvd$ddata(cork) corkcent <- cork*0 corkcent[,1] <- cork[,1]-mean(cork[,1]) corkcent[,2] <- cork[,2]-mean(cork[,2]) corkcent[,3] <- cork[,3]-mean(cork[,3]) corkcent[,4] <- cork[,4]-mean(cork[,4]) corkcentsvd <- svd(corkcent) t(corkcentsvd$u)%*%corkcentsvd$u t(corkcentsvd$v)%*%corkcentsvd$v round(corkcentsvd$u %*% diag(corkcentsvd$d) %*% t(corkcentsvd$v),2) round(corkcent,2) corkcentsvd$d
The data set is used to understand the sampling variation of the score function. The simulated data is available in Pawitan (2001).
data(cs)data(cs)
A data frame with 10 observations on the following 20 variables.
Sample.1a numeric vector
Sample.2a numeric vector
Sample.3a numeric vector
Sample.4a numeric vector
Sample.5a numeric vector
Sample.6a numeric vector
Sample.7a numeric vector
Sample.8a numeric vector
Sample.9a numeric vector
Sample.10a numeric vector
Sample.11a numeric vector
Sample.12a numeric vector
Sample.13a numeric vector
Sample.14a numeric vector
Sample.15a numeric vector
Sample.16a numeric vector
Sample.17a numeric vector
Sample.18a numeric vector
Sample.19a numeric vector
Sample.20a numeric vector
Pawitan, Y. (2001). In All Likelihood. Oxford Science Publications.
data(cs) n <- 10 cauchy_score_fn <- function(mu,x) sum(2*(x-mu)/(1+(x-mu)^{2})) mu <- seq(from=-15,to=20,by=0.5) plot(mu,sapply(mu,cauchy_score_fn,x=cs[,1]),"l",xlab=expression(mu), ylab=expression(S(mu)),ylim=c(-10,10)) title(main="D: Score Function Plot of Cauchy Model") for(i in 2:20) lines(mu,sapply(mu, cauchy_score_fn,x=cs[,i]),"l") abline(v=4) abline(h=0)data(cs) n <- 10 cauchy_score_fn <- function(mu,x) sum(2*(x-mu)/(1+(x-mu)^{2})) mu <- seq(from=-15,to=20,by=0.5) plot(mu,sapply(mu,cauchy_score_fn,x=cs[,1]),"l",xlab=expression(mu), ylab=expression(S(mu)),ylim=c(-10,10)) title(main="D: Score Function Plot of Cauchy Model") for(i in 2:20) lines(mu,sapply(mu, cauchy_score_fn,x=cs[,i]),"l") abline(v=4) abline(h=0)
Hamilton depression scale factor IV is a measurement of mixed anxiety and depression and it is named after its inventor. In a double-blind experiment, this scale factor is obtained for 9 patients on their entry in a study, denoted by X. Post a tranquilizer T, the scale factor IV is again obtained for the same set of patients, which is denoted by Y. Here, an improvement due to tranquilizer T corresponds to a reduction in factor IV values.
data(depression)data(depression)
A data frame with 9 observations on the following 3 variables.
Patient_NoPatient ID
Xmeasurement of depression at entry in a study
Ymeasurement of depression post a tranquilizer
Sheshkin, D. J. (1997-2011). Handbook of Parametric and Nonparametric Statistical Procedures, 5e. Chapman and Hall/CRC.
data(depression) attach(depression) names(depression) wilcox.test(Y-X, alternative = "less") wilcox.test(Y-X, alternative = "less",exact=FALSE,correct=FALSE)data(depression) attach(depression) names(depression) wilcox.test(Y-X, alternative = "less") wilcox.test(Y-X, alternative = "less",exact=FALSE,correct=FALSE)
The purpose of this health study is investigation of an epidemic outbreak due to mosquitoes. A random sample from two sectors of the city among the individuals has been tested to determine if the individual had contracted the disease forming the binary outcome.
data(Disease)data(Disease)
A data frame with 98 observations on the following 5 variables.
x1age
x2socioeconomic status of three categories between x2 and x3
x3socioeconomic status of three categories between x2 and x3
x4sector of the city
yif the individual had contracted the disease forming the binary outcome
Kutner, M. H., Nachtsheim, C. J., Neter, J., and Li, W. (1974-2005). Applied Linear Statistical Models, 5e. McGraw-Hill.
data(Disease) DO_LR <- glm(y~.,data=Disease,family='binomial') LR_Residuals <- data.frame(Y = Disease$y,Fitted = fitted(DO_LR), Hatvalues = hatvalues(DO_LR),Response = residuals(DO_LR,"response"), Deviance = residuals(DO_LR,"deviance"), Pearson = residuals(DO_LR,"pearson"), Pearson_Standardized = residuals(DO_LR,"pearson")/sqrt(1-hatvalues(DO_LR))) LR_Residualsdata(Disease) DO_LR <- glm(y~.,data=Disease,family='binomial') LR_Residuals <- data.frame(Y = Disease$y,Fitted = fitted(DO_LR), Hatvalues = hatvalues(DO_LR),Response = residuals(DO_LR,"response"), Deviance = residuals(DO_LR,"deviance"), Pearson = residuals(DO_LR,"pearson"), Pearson_Standardized = residuals(DO_LR,"pearson")/sqrt(1-hatvalues(DO_LR))) LR_Residuals
The Ehrenfest model is an interesting example of a Markov chain. Though the probabilities in decimals are not as interesting as expressed in fractions, the function will help the reader generate the transition probability matrices of 2n balls among two urns.
Ehrenfest(n)Ehrenfest(n)
n |
2n will be the number of balls in the urns. |
In this experiment there are i balls in Urn I, and remaining 2n-i balls in Urn II. Then at any instance, the probability of selecting a ball from Urn I and placing it in Urn II is i/2n, and the other way of placing a ball from Urn II to Urn I is (2n-i)/2n. At each instant we let the number i of balls in the Urn I to be the state of the system. Thus, the state space is S = 0, 1, 2, ..., 2n . Then we can pass from state i only to either of the states i-1 or i+1. Here, S = 0, 1, ..., 2n.
Prabhanjan N. Tattar
Ehrenfest(2) Ehrenfest(3)Ehrenfest(2) Ehrenfest(3)
Injuries in airflights, road accidents, etc, are instances of rare occurrences which are appropriately modeled by a Poisson distribution. Two models, before and after transformation, are fit and it is checked if the transformation led to a reduction to the variance.
data(flight)data(flight)
A data frame with 9 observations on the following 2 variables.
Injury_IncidentsCount of injury incidents
Total_FlightsTotal number of flights
Chatterjee, S., and Hadi, A. S. (1977-2006). Regression Analysis by Examples, 4e. J. Wiley.
data(flight) names(flight) injurylm <- lm(Injury_Incidents~Total_Flights,data=flight) injurysqrtlm <- lm(sqrt(Injury_Incidents)~Total_Flights,data=flight) summary(injurylm) summary(injurysqrtlm)data(flight) names(flight) injurylm <- lm(Injury_Incidents~Total_Flights,data=flight) injurysqrtlm <- lm(sqrt(Injury_Incidents)~Total_Flights,data=flight) summary(injurylm) summary(injurysqrtlm)
A simple function to understand the algorithm to simulate (psuedo-)observations from binomial distribution. It is an implementation of the algorithm given in Section 11.3.1 "Simulation from Discrete Distributions", and as such the function is not an alternative to the "rgeom" function.
Geom_Sim(p, n)Geom_Sim(p, n)
p |
probability of success |
n |
number of pseudo-observations required |
To simulate a random number from geometric RV, we make use of the algorithm described in the book.
Prabhanjan N. Tattar
rgeom
mean(Geom_Sim(0.01,10))mean(Geom_Sim(0.01,10))
The shear strength of steel plate girders need to be modeled as a function of the four methods and nine girders.
data(girder)data(girder)
A data frame with 9 observations on the following 5 variables.
GirderThe row names, varying from S1.1 to S4.2, represent the nine type of girders, S1.1 S1.2 S2.1 S2.2 S3.1 S3.2 S4.1 S4.2 S5.1
Aarauone of the four methods of preparation of the steel plates
Karisruheone of the four methods of preparation of the steel plates
Lehighone of the four methods of preparation of the steel plates
Cardiffone of the four methods of preparation of the steel plates
Wu, C.F.J. and M. Hamada (2000-9). Experiments: Planning, Analysis, and Parameter Design Optimization, 2e. J. Wiley.
data(girder) girder boxplot(girder[,2:5])data(girder) girder boxplot(girder[,2:5])
Four types of tip are used which form the blocks in this experiment. The variable of interest is the hardness which further depends on the type of metal coupon. For each type of the tip, the hardness is observed for 4 different types the metal coupon.
data(hardness)data(hardness)
A data frame with 16 observations on the following 3 variables.
Tip_TypeFour types of tip which form the blocks
Test_CouponFour different types of metal coupons
HardnessHardness of the coupon
Montgomery, D. C. (1976-2012). Design and Analysis of Experiments, 8e. J.Wiley.
data(hardness) hardness$Tip_Type <- as.factor(hardness$Tip_Type) hardness$Test_Coupon <- as.factor(hardness$Test_Coupon) hardness_aov <- aov(Hardness~Tip_Type+Test_Coupon,data=hardness) summary(hardness_aov)data(hardness) hardness$Tip_Type <- as.factor(hardness$Tip_Type) hardness$Test_Coupon <- as.factor(hardness$Test_Coupon) hardness_aov <- aov(Hardness~Tip_Type+Test_Coupon,data=hardness) summary(hardness_aov)
A study was carried in the Eastman Kodak Company which involved the measurement of hearing loss. Such studies are called as audiometric study. This data set contains 100 males, each aged 39, who had no indication of noise exposure or hearing disorders. Here, the individual is exposed to a signal of a given frequency with an increasing intensity till the signal is perceived.
data(hearing)data(hearing)
A data frame with 100 observations on the following 9 variables.
Sl_NoSerial Number
L500Observation for 500Hz in the left ear
L1000Observation for 1000Hz in the left ear
L2000Observation for 2000Hz in the left ear
L4000Observation for 4000Hz in the left ear
R500Observation for 500Hz in the right ear
R1000Observation for 1000Hz in the right ear
R2000Observation for 2000Hz in the right ear
R4000Observation for 4000Hz in the right ear
Jackson, J.E. (1991). A User's Guide to Principal Components. New York: Wiley.
data(hearing) round(cor(hearing[,-1]),2) round(cov(hearing[,-1]),2) hearing.pc <- princomp(hearing[,-1]) screeplot(hearing.pc,main="B: Scree Plot for Hearing Loss Data")data(hearing) round(cor(hearing[,-1]),2) round(cov(hearing[,-1]),2) hearing.pc <- princomp(hearing[,-1]) screeplot(hearing.pc,main="B: Scree Plot for Hearing Loss Data")
The data set highlights the importance of handling covariance when such information is available. If the covariance is not incorporated, hypothesis testing may lead to entirely difference conclusion.
data(hw)data(hw)
A data frame with 20 observations on the following 2 variables.
Heightthe height of an individual
Weightthe weight of an individual
Rencher, A.C. (2002). Methods of Multivariate Analysis, 2e. J. Wiley.
data(hw) sigma0 <- matrix(c(20, 100, 100, 1000),nrow=2) sigma <- var(hw) v <- nrow(hw)-1 p <- ncol(hw) u <- v*(log(det(sigma0))-log(det(sigma)) + sum(diag(sigma%*%solve(sigma0)))-p) u1 <- (1- (1/(6*v-1))*(2*p+1 - 2/(p+1)))*u u;u1;qchisq(1-0.05,p*(p+1)/2)data(hw) sigma0 <- matrix(c(20, 100, 100, 1000),nrow=2) sigma <- var(hw) v <- nrow(hw)-1 p <- ncol(hw) u <- v*(log(det(sigma0))-log(det(sigma)) + sum(diag(sigma%*%solve(sigma0)))-p) u1 <- (1- (1/(6*v-1))*(2*p+1 - 2/(p+1)))*u u;u1;qchisq(1-0.05,p*(p+1)/2)
Montgomery (2005), page 42, describes this data set in which the number of days taken by the company to process and settle the claims of employee health insurance customers. The data is recorded for the number of days for settlement from the first to fortieth claim.
data(insurance)data(insurance)
A data frame with 40 observations on the following 2 variables.
ClaimClaim number
DaysDays to settle the claim amount
Montgomery, D.C. (1985-2012). Introduction to Statistical Quality Control, 7e. J. Wiley.
data(insurance) plot(insurance$Claim,insurance$Days,"l",xlab="Claim Sequence", ylab="Time to Settle the Claim") title("B: Run Chart for Insurance Claim Settlement")data(insurance) plot(insurance$Claim,insurance$Days,"l",xlab="Claim Sequence", ylab="Time to Settle the Claim") title("B: Run Chart for Insurance Claim Settlement")
The intent of this experiment is to help the engineer in improving the ability of detecting targets on a radar system. The two variables chosen which are believed to have the most impact on the detecting abilities of the radar system are marked as the amount of the background noise and the type of filter on the screen.
data(intensity)data(intensity)
A data frame with 24 observations on the following 4 variables.
Intensityintensity of targets
Operatordifferent operators who form the blocks 1 2 3 4
Filtertwo types of filter 1 2
Groundthe type of background noise high low medium
Montgomery, D. C. (1976-2012). Design and Analysis of Experiments, 8e. J.Wiley.
data(intensity) intensity.aov <- aov(Intensity~Ground*Filter+Error(Operator),intensity) summary(intensity.aov) intensity.aovdata(intensity) intensity.aov <- aov(Intensity~Ground*Filter+Error(Operator),intensity) summary(intensity.aov) intensity.aov
A simple function to obtain the coefficient of kurtosis on numeric variables.
kurtcoeff(x)kurtcoeff(x)
x |
the numeric vector for which the coefficient of kurtosis is required |
A straight-forward implementation of the formula is give here. A complete function "kurtosis" is available in the "e1071" package.
Prabhanjan N. Tattar
e1071::kurtosis
This data set consists of life expectancy in years by country, age, and sex.
data(life)data(life)
A data frame with 31 observations on the following 8 variables.
m0life expectancy for males at age 0
m25life expectancy for males at age 25
m50life expectancy for males at age 50
m75life expectancy for males at age 75
w0life expectancy for females at age 0
w25life expectancy for females at age 25
w50life expectancy for females at age 50
w75life expectancy for females at age 75
Everitt, B. S., and Hothorn, T. (2011). An Introduction to Applied Multivariate Analysis with R. Springer.
data(life) factanal(life,factors=1)$PVAL factanal(life,factors=2)$PVAL factanal(life,factors=3)data(life) factanal(life,factors=1)$PVAL factanal(life,factors=2)$PVAL factanal(life,factors=3)
Low birth weight of new-born infants is a serious concern. If the weight of the new-born is less than 2500 grams, we consider that instance as a low-birth weight case. A study was carried out at Baystate Medical Center in Springfield, Massachusetts.
data(lowbwt)data(lowbwt)
A data frame with 189 observations on the following 10 variables.
LOWLow Birth Weight
AGEAge of Mother
LWTWeight of Mother at Last Menstrual Period
RACERace 1 2 3
SMOKESmoking Status During Pregnancy
PTLHistory of Premature Labor
HTHistory of Hypertension
UIPresence of Uterine Irritability
FTVNumber of Physician Visits During the First Trimester
BWTBirth Weight
Hosmer, D.W., and Lemeshow, S. (1989-2000). Applied Logistic Regression, 2e. J. Wiley.
data(lowbwt) lowglm <- glm(LOW~AGE+LWT+RACE+FTV,data=lowbwt,family='binomial') lowglm$coefficientsdata(lowbwt) lowglm <- glm(LOW~AGE+LWT+RACE+FTV,data=lowbwt,family='binomial') lowglm$coefficients
This function sets up the likelihood ratio test for equality of means when the variance term is unknown. Refer Chapter 7 for more details.
LRNormal2Mean(x, y, alpha)LRNormal2Mean(x, y, alpha)
x |
Observations from Population 1 |
y |
Observations from Population 2 |
alpha |
Size alpha test |
Likelihood ratio test is setup through this function. For more details, refer Chapter 7 of the book.
Prabhanjan N. Tattar
t.test
lisa <- c(234.26, 237.18, 238.16, 259.53, 242.76, 237.81, 250.95, 277.83) mike <- c(187.73, 206.08, 176.71, 213.69, 224.34, 235.24) LRNormal2Mean(mike,lisa,0.05)lisa <- c(234.26, 237.18, 238.16, 259.53, 242.76, 237.81, 250.95, 277.83) mike <- c(187.73, 206.08, 176.71, 213.69, 224.34, 235.24) LRNormal2Mean(mike,lisa,0.05)
Likelihood ratio test for equality of mean when the variance is known for a sample from normal distribution is setup here. For details, refer Chapter 7 of the book.
LRNormalMean_KV(x, mu0, alpha, sigma)LRNormalMean_KV(x, mu0, alpha, sigma)
x |
the variable of interest |
mu0 |
the mean of interest |
alpha |
size of the LR test |
sigma |
value of the known standard deviation |
Prabhanjan N. Tattar
t.test
data(hw) LRNormalMean_KV(hw$Height,mu0=70, alpha=0.05, sigma=sqrt(20))data(hw) LRNormalMean_KV(hw$Height,mu0=70, alpha=0.05, sigma=sqrt(20))
Likelihood ratio test for mean when variance is unknown for a sample from normal distribution is setup here.
LRNormalMean_UV(x, mu0, alpha)LRNormalMean_UV(x, mu0, alpha)
x |
the variable of interest |
mu0 |
the mean value of interest |
alpha |
size of the LR test |
Prabhanjan N. Tattar
LRNormalMean_KV
This function returns the LR test for the variance of normal distribution with the mean being unknown. Refer Chapter 7 for more details.
LRNormalVariance_UM(x, sigma0, alpha)LRNormalVariance_UM(x, sigma0, alpha)
x |
the vector of sample values |
sigma0 |
the standard deviation of interest under the hypothesis |
alpha |
the required level of significance |
Prabhanjan Tattar
LRNormalVariance_UM(rnorm(20),1,0.05)LRNormalVariance_UM(rnorm(20),1,0.05)
This function is adapted from Prof. Jim Albert's "LearnEDA" package. It returns the letter values as discussed in Chapter 4.
lval(x, na.rm = TRUE)lval(x, na.rm = TRUE)
x |
the variable of interest |
na.rm |
the default setting removes the missing values |
Prabhanjan Tattar
LearnEDA
A test had been conducted with the purpose of investigating if people recollect pleasant memories associated with a word earlier than some unpleasant memory related with the same word. The word is flashed on the screen and the time an individual takes to respond via keyboard is recorded for both type of the memories.
data(memory)data(memory)
A data frame with 20 observations on the following 2 variables.
Pleasant.memorytime to recollect pleasant memory
Unpleasant.memorytime to recollect unpleasant memory
Dunn, and Master. (1982). Obtained from
http://openlearn.open.ac.uk/mod/resource/view.php?id=165509
data(memory) lapply(memory,fivenum) lapply(memory,mad) lapply(memory,IQR)data(memory) lapply(memory,fivenum) lapply(memory,mad) lapply(memory,IQR)
A psychological study consisting of four tests was conducted on males and females group and the results were noted. Since the four tests are correlated and each one is noted for all the individuals, one is interested to know if the mean vector of the test scores is same across the gender group.
data(mfp)data(mfp)
A data frame with 32 observations on the following 8 variables.
M_y1pictorial inconsistencies for males
M_y2paper form board test for males
M_y3tool recognition test for males
M_y4vocabulary test for males
F_y1pictorial inconsistencies for females
F_y2paper form board test for females
F_y3tool recognition test for females
F_y4vocabulary test for females
data(mfp) males <- mfp[,1:4]; females <- mfp[,5:8] nm <- nrow(males);nf <- nrow(females) p <- 4; k <- 2 vm <- nm-1; vf <- nf-1 meanm <- colMeans(males); meanf <- colMeans(females) sigmam <- var(males); sigmaf <- var(females) sigmapl <- (1/(nm+nf-2))*((nm-1)*sigmam+(nf-1)*sigmaf) ln_M <- .5*(vm*log(det(sigmam))+vf*log(det(sigmaf))) -.5*(vm+vf)*log(det(sigmapl)) exact_test <- -2*ln_M # the Exact Test exact_testdata(mfp) males <- mfp[,1:4]; females <- mfp[,5:8] nm <- nrow(males);nf <- nrow(females) p <- 4; k <- 2 vm <- nm-1; vf <- nf-1 meanm <- colMeans(males); meanf <- colMeans(females) sigmam <- var(males); sigmaf <- var(females) sigmapl <- (1/(nm+nf-2))*((nm-1)*sigmam+(nf-1)*sigmaf) ln_M <- .5*(vm*log(det(sigmam))+vf*log(det(sigmaf))) -.5*(vm+vf)*log(det(sigmapl)) exact_test <- -2*ln_M # the Exact Test exact_test
The function returns the level alpha MP test for the testing the hypothesis H:p=p0 against K:p=p_1 as ensured by the application of Neyman-Pearson lemma.
MPbinomial(Hp, Kp, alpha, n)MPbinomial(Hp, Kp, alpha, n)
Hp |
the value of p under hypothesis H |
Kp |
the value of p under hypothesis K |
alpha |
size of the test |
n |
sample size |
Prabhanjan N. Tattar
binom.test
The most powerful test for a sample from normal distribution is given here. The test is obtained by an application of the Neyman-Pearson lemma.
MPNormal(mu0, mu1, sigma, n, alpha)MPNormal(mu0, mu1, sigma, n, alpha)
mu0 |
mean under hypothesis H |
mu1 |
mean under hypothesis K |
sigma |
standard deviation |
n |
sample size |
alpha |
size of the test |
Prabhanjan N. Tattar
t.test
The most powerful test for a sample from Poisson distribution is given here. The test is obtained by an application of the Neyman-Pearson lemma.
MPPoisson(Hlambda, Klambda, alpha, n)MPPoisson(Hlambda, Klambda, alpha, n)
Hlambda |
parameter under hypothesis H |
Klambda |
parameter under hypothesis K |
alpha |
size of the MP test |
n |
sample size |
Prabhanjan N. Tattar
The m-step transition probability matrix computation is provided in this function. The equation is based on the well-known "Chapman-Kolmogorov equation".
msteptpm(TPM, m)msteptpm(TPM, m)
TPM |
a transition probability matrix |
m |
the m step required |
Prabhanjan N. Tattar
EF2 <- Ehrenfest(2) msteptpm(as.matrix(EF2),4)EF2 <- Ehrenfest(2) msteptpm(as.matrix(EF2),4)
Table 6.1 of Hollander and Wolfe (1999) lists the data for Half-Time of Mucociliary Clearance. We need to test if the time across various treatments is equal or not.
data(Mucociliary)data(Mucociliary)
A data frame with 14 observations on the following 2 variables.
Treatmenttreatment levels Asbestosis Normal Subjects Obstructive Airways Disease
Timehalf-time of mucociliary clearance
Hollander, M., and Wolfe, D. A. (1973-99). Nonparametric Statistical Methods, 2e. J. Wiley.
data(Mucociliary) Mucociliary$Rank <- rank(Mucociliary$Time) aggregate(Mucociliary$Rank,by=list(Mucociliary$Treatment),sum) kruskal.test(Time~Treatment,data=Mucociliary)data(Mucociliary) Mucociliary$Rank <- rank(Mucociliary$Time) aggregate(Mucociliary$Rank,by=list(Mucociliary$Treatment),sum) kruskal.test(Time~Treatment,data=Mucociliary)
The Nerve data set has been popularized by Cox and Lewis (1966). In this experiment 799 waiting times are recorded for successive pulses along a nerve fiber.
data(nerve)data(nerve)
The format is: num [1:799] 0.21 0.03 0.05 0.11 0.59 0.06 0.18 0.55 0.37 0.09 ...
Cox, D. and Lewis, P. (1966). The Statistical Analysis of Series of Events. Chapman & Hall.
data(nerve) nerve_ecdf <- ecdf(nerve) knots(nerve_ecdf) # Returns the jump points of the edf summary(nerve_ecdf) # the usual R summaries nerve_ecdf(nerve) # returns the percentiles at the data pointsdata(nerve) nerve_ecdf <- ecdf(nerve) knots(nerve_ecdf) # Returns the jump points of the edf summary(nerve_ecdf) # the usual R summaries nerve_ecdf(nerve) # returns the percentiles at the data points
The data set is used to understand the sampling variation of the score function. The simulated data is available in Pawitan (2001).
data(ns)data(ns)
A data frame with 10 observations on the following 20 variables.
Sample.1a numeric vector
Sample.2a numeric vector
Sample.3a numeric vector
Sample.4a numeric vector
Sample.5a numeric vector
Sample.6a numeric vector
Sample.7a numeric vector
Sample.8a numeric vector
Sample.9a numeric vector
Sample.10a numeric vector
Sample.11a numeric vector
Sample.12a numeric vector
Sample.13a numeric vector
Sample.14a numeric vector
Sample.15a numeric vector
Sample.16a numeric vector
Sample.17a numeric vector
Sample.18a numeric vector
Sample.19a numeric vector
Sample.20a numeric vector
Pawitan, Y. (2001). In All Likelihood. Oxford Science Publications.
Pawitan, Y. (2001). In All Likelihood. Oxford Science Publications.
library(stats4) data(ns) x <- ns[,1] nlogl <- function(mean,sd) { -sum(dnorm(x,mean=mean,sd=sd,log=TRUE)) } norm_mle <- mle(nlogl,start=list(mean=median(x),sd=IQR(x)),nobs=length(x)) summary(norm_mle)library(stats4) data(ns) x <- ns[,1] nlogl <- function(mean,sd) { -sum(dnorm(x,mean=mean,sd=sd,log=TRUE)) } norm_mle <- mle(nlogl,start=list(mean=median(x),sd=IQR(x)),nobs=length(x)) summary(norm_mle)
We need to determine the effect of the number of revolutions per minute (rpm) of the rotary pump head of an Olson heart-lung pump on the fluid flow rate Liters_minute. The rpm's are replicated at 50, 75, 100, 125, and 150 levels with respective frequencies 5, 3, 5, 2, and 5. The fluid flow rate is measured in litters per minute.
data(olson)data(olson)
A data frame with 20 observations on the following 4 variables.
Observationobservation number
rpmrmp levels at 50, 75, 100, 125, and 150
Levelthe rpm levels
Liters_minutelitters per minute
Dean, A., and Voss, D. (1999). Design and Analysis of Experiments. Springer.
data(olson) par(mfrow=c(2,2)) plot(olson$rpm,olson$Liters_minute,xlim=c(25,175),xlab="RPM", ylab="Flow Rate",main="Scatter Plot") boxplot(Liters_minute~rpm,data=olson,main="Box Plots") aggregate(olson$Liters_minute,by=list(olson$rpm),mean) olson_crd <- aov(Liters_minute ~ as.factor(rpm), data=olson)data(olson) par(mfrow=c(2,2)) plot(olson$rpm,olson$Liters_minute,xlim=c(25,175),xlab="RPM", ylab="Flow Rate",main="Scatter Plot") boxplot(Liters_minute~rpm,data=olson,main="Box Plots") aggregate(olson$Liters_minute,by=list(olson$rpm),mean) olson_crd <- aov(Liters_minute ~ as.factor(rpm), data=olson)
A simple function is given here which returns the density function values for a Pareto RV. A more efficient implementation is obtainable in the function "dpareto" from the "VGAM" package.
pareto_density(x, scale, shape)pareto_density(x, scale, shape)
x |
the x value |
scale |
the scale parameter of Pareto RV |
shape |
the shape parameter of Pareto RV |
Prabhanjan N. Tattar
VGAM::dpareto
m <- 9184 n <- 103 b <- 10000 K <- 10 theta <- seq(1000,20000,500) plot(theta,as.numeric(sapply(theta,pareto_density,scale=b,shape=K)),"l", xlab=expression(theta),ylab="The Posterior Density") (n+1)*m/nm <- 9184 n <- 103 b <- 10000 K <- 10 theta <- seq(1000,20000,500) plot(theta,as.numeric(sapply(theta,pareto_density,scale=b,shape=K)),"l", xlab=expression(theta),ylab="The Posterior Density") (n+1)*m/n
A simple function is given here which returns the quantiles for a Pareto RV. A more efficient implementation is obtainable in the function "qpareto" from the "VGAM" package.
pareto_quantile(p, scale, shape)pareto_quantile(p, scale, shape)
p |
the percentiles required |
scale |
scale of Pareto RV |
shape |
shape of Pareto RV |
Prabhanjan N. Tattar
VGAM::qpareto
pareto_quantile(c(0.05,0.95),scale=10000,shape=10)pareto_quantile(c(0.05,0.95),scale=10000,shape=10)
A simple function to understand the algorithm to simulate (psuedo-)observations from binomial distribution. It is an implementation of the algorithm given in Section 11.3.1 "Simulation from Discrete Distributions". This function is not an alternative to the "rpois" function.
Poisson_Sim(lambda, n)Poisson_Sim(lambda, n)
lambda |
rate of the Poisson RV |
n |
required number of pseudo-observations |
Prabhanjan N. Tattar
rpois
set.seed(123) mean(Poisson_Sim(4,1000))set.seed(123) mean(Poisson_Sim(4,1000))
A simple function for obtaining the plot of power of UMP test.
powertestplot(mu0, sigma, n, alpha)powertestplot(mu0, sigma, n, alpha)
mu0 |
the value of mean |
sigma |
standard deviation |
n |
sample size |
alpha |
size of the test |
Prabhanjan N. Tattar
t.test
UMPNormal <- function(mu0, sigma, n,alpha) { qnorm(alpha)*sigma/sqrt(n)+mu0 } UMPNormal(mu0=0, sigma=1,n=1,alpha=0.5) powertestplot <- function(mu0,sigma,n,alpha) { mu0seq <- seq(mu0-3*sigma, mu0+3*sigma,(6*sigma/100)) betamu <- pnorm(sqrt(n)*(mu0seq-mu0)/sigma-qnorm(1-alpha)) plot(mu0seq,betamu,"l",xlab=expression(mu),ylab="Power of UMP Test", main = expression(paste("H:",mu <= mu[0]," vs K:",mu>mu[0]))) abline(h=alpha) abline(v=mu0) } powertestplot(mu0=0,sigma=1,n=10,alpha=0.05) # H:mu > mu_0 vs K: mu <= mu_0 UMPNormal <- function(mu0, sigma, n,alpha) { mu0-qnorm(alpha)*sigma/sqrt(n) } UMPNormal(mu0=0, sigma=1,n=1,alpha=0.5) powertestplot <- function(mu0,sigma,n,alpha) { mu0seq <- seq(mu0-3*sigma, mu0+3*sigma,(6*sigma/100)) betamu <- pnorm(sqrt(n)*(mu0-mu0seq)/sigma-qnorm(1-alpha)) plot(mu0seq,betamu,"l",xlab=expression(mu),ylab="Power of UMP Test", main=expression(paste("H:",mu >= mu[0]," vs K:",mu<mu[0]))) abline(h=alpha) abline(v=mu0) } powertestplot(mu0=0,sigma=1,n=10,alpha=0.05)UMPNormal <- function(mu0, sigma, n,alpha) { qnorm(alpha)*sigma/sqrt(n)+mu0 } UMPNormal(mu0=0, sigma=1,n=1,alpha=0.5) powertestplot <- function(mu0,sigma,n,alpha) { mu0seq <- seq(mu0-3*sigma, mu0+3*sigma,(6*sigma/100)) betamu <- pnorm(sqrt(n)*(mu0seq-mu0)/sigma-qnorm(1-alpha)) plot(mu0seq,betamu,"l",xlab=expression(mu),ylab="Power of UMP Test", main = expression(paste("H:",mu <= mu[0]," vs K:",mu>mu[0]))) abline(h=alpha) abline(v=mu0) } powertestplot(mu0=0,sigma=1,n=10,alpha=0.05) # H:mu > mu_0 vs K: mu <= mu_0 UMPNormal <- function(mu0, sigma, n,alpha) { mu0-qnorm(alpha)*sigma/sqrt(n) } UMPNormal(mu0=0, sigma=1,n=1,alpha=0.5) powertestplot <- function(mu0,sigma,n,alpha) { mu0seq <- seq(mu0-3*sigma, mu0+3*sigma,(6*sigma/100)) betamu <- pnorm(sqrt(n)*(mu0-mu0seq)/sigma-qnorm(1-alpha)) plot(mu0seq,betamu,"l",xlab=expression(mu),ylab="Power of UMP Test", main=expression(paste("H:",mu >= mu[0]," vs K:",mu<mu[0]))) abline(h=alpha) abline(v=mu0) } powertestplot(mu0=0,sigma=1,n=10,alpha=0.05)
The data set is used to understand the sampling variation of the score function. The simulated data is available in Pawitan (2001).
data(ps)data(ps)
A data frame with 10 observations on the following 20 variables.
Sample.1a numeric vector
Sample.2a numeric vector
Sample.3a numeric vector
Sample.4a numeric vector
Sample.5a numeric vector
Sample.6a numeric vector
Sample.7a numeric vector
Sample.8a numeric vector
Sample.9a numeric vector
Sample.10a numeric vector
Sample.11a numeric vector
Sample.12a numeric vector
Sample.13a numeric vector
Sample.14a numeric vector
Sample.15a numeric vector
Sample.16a numeric vector
Sample.17a numeric vector
Sample.18a numeric vector
Sample.19a numeric vector
Sample.20a numeric vector
Pawitan, Y. (2001). In All Likelihood. Oxford Science Publications.
Pawitan, Y. (2001). In All Likelihood. Oxford Science Publications.
data(ps) n <- 10 sample_means <- colMeans(ps) poisson_score_fn <- function(theta,xbar) n*(xbar-theta)/theta theta <- seq(from=2,to=8,by=0.2) plot(theta,sapply(theta,poisson_score_fn,xbar=sample_means[1]),"l",xlab= expression(lambda),ylab=expression(S(lambda)),ylim=c(-5,15)) title(main="B: Score Function Plot of the Poisson Model") for(i in 2:20) lines(theta,sapply(theta,poisson_score_fn,xbar=sample_means[i]),"l") abline(v=4) abline(h=0)data(ps) n <- 10 sample_means <- colMeans(ps) poisson_score_fn <- function(theta,xbar) n*(xbar-theta)/theta theta <- seq(from=2,to=8,by=0.2) plot(theta,sapply(theta,poisson_score_fn,xbar=sample_means[1]),"l",xlab= expression(lambda),ylab=expression(S(lambda)),ylim=c(-5,15)) title(main="B: Score Function Plot of the Poisson Model") for(i in 2:20) lines(theta,sapply(theta,poisson_score_fn,xbar=sample_means[i]),"l") abline(v=4) abline(h=0)
Probe words are used to test the recall ability of words in various linguistic contexts. In this experiment the response time to five different probe words are recorded for 11 individuals. The interest in the experiment is to examine if the response times to the different words are independent or not.
data(pw)data(pw)
A data frame with 11 observations on the following 6 variables.
Subject.Numbera numeric vector
y1a numeric vector
y2a numeric vector
y3a numeric vector
y4a numeric vector
y5a numeric vector
Rencher, A.C. (2002). Methods of Multivariate Analysis, 2e. J. Wiley.
data(pw) sigma <- var(pw[2:6]) p <- ncol(pw)-1; v <- nrow(pw)-1 u <- p^p*(det(sigma))/(sum(diag(sigma))^p) u1 <- -(v-(2*p^2+p+2)/(6*p))*log(u) u;u1data(pw) sigma <- var(pw[2:6]) p <- ncol(pw)-1; v <- nrow(pw)-1 u <- p^p*(det(sigma))/(sum(diag(sigma))^p) u1 <- -(v-(2*p^2+p+2)/(6*p))*log(u) u;u1
Quesenberry and Hurst (1964) have obtained the "simultaneous confidence intervals" for the vector of success in a multinomial distribution.
QH_CI(x, alpha)QH_CI(x, alpha)
x |
a numeric vector |
alpha |
as in 100 (1-alpha) |
Prabhanjan N. Tattar
prop.test
For a chemical reaction experiment, the blocks arise due to the Batch number, Catalyst of different types form the treatments, and the reaction time is the output. Due to a restriction, all the catalysts cannot be analysed within each batch and hence we need to look at the BIBD model.
data("reaction")data("reaction")
A data frame with 16 observations on the following 3 variables.
Catalystdifferent types forming the treatments
Batchbatch number
Reactionreaction time
data(reaction)data(reaction)
"Resistant Line" is an important EDA way of fitting a regression model. The function here develops the discussion in Section 4.5.1 Resistant Line. An alternative for this function is available in "rline" function of the "LearnEDA" package.
resistant_line(x, y, iterations)resistant_line(x, y, iterations)
x |
the covariate or independent vector |
y |
the dependent variate |
iterations |
the required number of iterations |
Prabhanjan N. Tattar
Velleman, P.F., and Hoaglin, D.C. (2004). ABC of Exploratory Data Analysis. Duxbury Press, Boston. Republished in 2004 by The Internet-First University Press.
LearnEDA::rline
Five different formulations of a rocket propellant x1 may be used in an aircrew escape systems on the observed burning rate Y. Here, each of the formulation is prepared by mixing from a batch of raw materials x2 which can support only five formulations required for the purpose of testing.
data(rocket)data(rocket)
A data frame with 25 observations on the following 4 variables.
yburning rate
batchraw materials batch
opexperience of the operator
treatformulation type of the propellant A B C D E
Montgomery, D. C. (1976-2012). Design and Analysis of Experiments, 8e. J.Wiley.
data(rocket) matrix(rocket$treat,nrow=5) par(mfrow=c(1,3)) plot(y~factor(op)+factor(batch)+treat,rocket) rocket_aov <- aov(y~factor(op)+factor(batch)+treat,rocket)data(rocket) matrix(rocket$treat,nrow=5) par(mfrow=c(1,3)) plot(y~factor(op)+factor(batch)+treat,rocket) rocket_aov <- aov(y~factor(op)+factor(batch)+treat,rocket)
In continuation of Example 13.4.7 of the Rocket Propellant data, we now have the added blocking factor in test assemblies.
data(rocket_Graeco)data(rocket_Graeco)
A data frame with 25 observations on the following 5 variables.
yburning rate
batchraw materials batch
opexperience of the operator
treatformulation type of the propellant A B C D E
assemblytest assemblies a b c d e
Montgomery, D. C. (1976-2012). Design and Analysis of Experiments, 8e. J.Wiley.
data(rocket_Graeco) plot(y~op+batch+treat+assembly,rocket_Graeco) rocket.glsd.aov <- aov(y~factor(op)+factor(batch)+treat+assembly,rocket_Graeco) summary(rocket.glsd.aov)data(rocket_Graeco) plot(y~op+batch+treat+assembly,rocket_Graeco) rocket.glsd.aov <- aov(y~factor(op)+factor(batch)+treat+assembly,rocket_Graeco) summary(rocket.glsd.aov)
The goal is to test if the mean vector of the four variables is same across 6 stratas of the experiment.
data(rootstock)data(rootstock)
A data frame with 48 observations on the following 5 variables.
rootstockSix different rootstocks
y1trunk girth at 4 years
y2extension growth at 4 years
y3trunk girth at 15 years
y4weight of tree above ground at 15 years
Rencher, A.C. (2002). Methods of Multivariate Analysis, 2e. J. Wiley.
data(rootstock) attach(rootstock) rs <- rootstock[,1] rs <- factor(rs,ordered=is.ordered(rs)) # Too important a step root.manova <- manova(cbind(y1,y2,y3,y4)~rs) summary(root.manova, test = "Wilks")data(rootstock) attach(rootstock) rs <- rootstock[,1] rs <- factor(rs,ordered=is.ordered(rs)) # Too important a step root.manova <- manova(cbind(y1,y2,y3,y4)~rs) summary(root.manova, test = "Wilks")
In the data set sample, we have data from five different probability distributions. Histograms are used to intuitively understand the underlying probability model.
data(sample)data(sample)
A data frame with 100 observations on the following 5 variables.
Sample_1A sample 1
Sample_2A sample 2
Sample_3A sample 3
Sample_4A sample 4
Sample_5A sample 5
data(sample) layout(matrix(c(1,1,2,2,3,3,0,4,4,5,5,0), 2, 6, byrow=TRUE),respect=FALSE) matrix(c(1,1,2,2,3,3,0,4,4,5,5,0), 2, 6, byrow=TRUE) hist(sample[,1],main="Histogram of Sample 1",xlab="sample1", ylab="frequency") hist(sample[,2],main="Histogram of Sample 2",xlab="sample2", ylab="frequency") hist(sample[,3],main="Histogram of Sample 3",xlab="sample3", ylab="frequency") hist(sample[,4],main="Histogram of Sample 4",xlab="sample4", ylab="frequency") hist(sample[,5],main="Histogram of Sample 5",xlab="sample5", ylab="frequency")data(sample) layout(matrix(c(1,1,2,2,3,3,0,4,4,5,5,0), 2, 6, byrow=TRUE),respect=FALSE) matrix(c(1,1,2,2,3,3,0,4,4,5,5,0), 2, 6, byrow=TRUE) hist(sample[,1],main="Histogram of Sample 1",xlab="sample1", ylab="frequency") hist(sample[,2],main="Histogram of Sample 2",xlab="sample2", ylab="frequency") hist(sample[,3],main="Histogram of Sample 3",xlab="sample3", ylab="frequency") hist(sample[,4],main="Histogram of Sample 4",xlab="sample4", ylab="frequency") hist(sample[,5],main="Histogram of Sample 5",xlab="sample5", ylab="frequency")
The odor and taste of wines are recorded in a study. It is believed that the variables such as the pH concentration, alcohol content, total sugar, etc, explain the odor and taste of the wine.
data(sheishu)data(sheishu)
A data frame with 30 observations on the following 10 variables.
Tastetaste
Odorodor
pHpH concentration
Acidity_1Acidity 1
Acidity_2Acidity 2
Sake_meterSake meter
Direct_reducing_sugarDirect reducing sugar
Total_sugarTotal sugar
Alcoholtype of alcohol
Formyl_nitrogenFormyl nitrogen
Rencher, A.C. (2002). Methods of Multivariate Analysis, 2e. J. Wiley.
data(sheishu) noc <- c(2,3,3,2) nov <- 10 v <- nrow(sheishu)-1 varsheishu <- var(sheishu) s11 <- varsheishu[1:2,1:2] s22 <- varsheishu[3:5,3:5] s33 <- varsheishu[6:8,6:8] s44 <- varsheishu[9:10,9:10] u <- det(varsheishu)/(det(s11)*det(s22)*det(s33)*det(s44)) a2 <- nov^2 - sum(noc^2) a3 <- nov^3 - sum(noc^3) f <- a2/2 cc <- 1 - (2*a3 + 3*a2)/(12*f*v) u1 <- -v*cc*log(u) u; a2; a3; f; cc; u1 qchisq(1-0.001,37)data(sheishu) noc <- c(2,3,3,2) nov <- 10 v <- nrow(sheishu)-1 varsheishu <- var(sheishu) s11 <- varsheishu[1:2,1:2] s22 <- varsheishu[3:5,3:5] s33 <- varsheishu[6:8,6:8] s44 <- varsheishu[9:10,9:10] u <- det(varsheishu)/(det(s11)*det(s22)*det(s33)*det(s44)) a2 <- nov^2 - sum(noc^2) a3 <- nov^3 - sum(noc^3) f <- a2/2 cc <- 1 - (2*a3 + 3*a2)/(12*f*v) u1 <- -v*cc*log(u) u; a2; a3; f; cc; u1 qchisq(1-0.001,37)
A merchandiser stocks soft-drink on a shelf as a multiple number of the number of cases. The time required to put the cases in the shelves is recorded as a response. Clearly, if there are no cases to be stocked, it is natural that the time to put them on the shelf will be 0.
data("shelf_stock")data("shelf_stock")
A data frame with 15 observations on the following 2 variables.
Timetime required to put the cases in the shelves
Cases_Stockednumber of cases
data(shelf_stock) sslm <- lm(Time ~ Cases_Stocked -1, data=shelf_stock)data(shelf_stock) sslm <- lm(Time ~ Cases_Stocked -1, data=shelf_stock)
This function provided an implementation of the nonparametric discussed in "Section 8.5.3 The Siegel-Tukey Test".
siegel.tukey(x, y)siegel.tukey(x, y)
x |
Values from Sample 1 |
y |
Values from Sample 2 |
For more details, refer Section 8.5.3 The Siegel-Tukey Test.
Prabhanjan N. Tattar
x <- c(0.028, 0.029, 0.011, -0.030, 0.017, -0.012, -0.027,-0.018, 0.022, -0.023) y <- c(-0.002, 0.016, 0.005, -0.001, 0.000, 0.008, -0.005,-0.009, 0.001, -0.019) siegel.tukey(x,y)x <- c(0.028, 0.029, 0.011, -0.030, 0.017, -0.012, -0.027,-0.018, 0.022, -0.023) y <- c(-0.002, 0.016, 0.005, -0.001, 0.000, 0.008, -0.005,-0.009, 0.001, -0.019) siegel.tukey(x,y)
The function is fairly easy to follow.
skewcoeff(x)skewcoeff(x)
x |
variable of interest |
Prabhanjan N. Tattar
e1071::skewness
A cooked data tailor made for the use of scatter plots towards understanding correlations.
data(somesamples)data(somesamples)
A data frame with 200 observations on the following 12 variables.
x1x of Sample 1
y1y of Sample 1
x2x of Sample 2
y2y of Sample 2
x3x of Sample 3
y3y of Sample 3
x4x of Sample 4
y4y of Sample 4
x5x of Sample 5
y5y of Sample 5
x6x of Sample 6
y6y of Sample 6
data(somesamples) attach(somesamples) par(mfrow=c(2,3)) plot(x1,y1,main="Sample I",xlim=c(-4,4),ylim=c(-4,4)) plot(x2,y2,main="Sample II",xlim=c(-4,4),ylim=c(-4,4)) plot(x3,y3,main="Sample III",xlim=c(-4,4),ylim=c(-4,4)) plot(x4,y4,main="Sample IV",xlim=c(-4,4),ylim=c(-4,4)) plot(x5,y5,main="Sample V",xlim=c(-4,4),ylim=c(-4,4)) plot(x6,y6,main="Sample VI",xlim=c(-4,4),ylim=c(-4,4))data(somesamples) attach(somesamples) par(mfrow=c(2,3)) plot(x1,y1,main="Sample I",xlim=c(-4,4),ylim=c(-4,4)) plot(x2,y2,main="Sample II",xlim=c(-4,4),ylim=c(-4,4)) plot(x3,y3,main="Sample III",xlim=c(-4,4),ylim=c(-4,4)) plot(x4,y4,main="Sample IV",xlim=c(-4,4),ylim=c(-4,4)) plot(x5,y5,main="Sample V",xlim=c(-4,4),ylim=c(-4,4)) plot(x6,y6,main="Sample VI",xlim=c(-4,4),ylim=c(-4,4))
The strength of a paper depends on three variables: (i) the percentage of hardwood concentration in the raw pulp, (ii) the vat pressure, and (iii) the cooking time of the pulp. The hardwood concentration is tested at three levels of 2, 4, and 8 percentage, the vat pressure at 400, 500, and 650, while the cooking time is at 3 and 4 hours. For each combination of the these three factor variables, 2 observations are available, and thus a total of 3.3.2.2 = 36 observations. The goal of the study is investigation of the impact of the three factor variables on the strength of the paper, and the presence of interaction effect, if any.
data(SP)data(SP)
A data frame with 36 observations on the following 4 variables.
Hardwooda factor with levels 2 4 8
Pressurea factor with levels 400 500 650
Cooking_Timea factor with levels 3 4
Strengtha numeric vector
Montgomery, D. C. (1976-2012). Design and Analysis of Experiments, 8e. J.Wiley.
data(SP) summary(SP.aov <- aov(Strength~.^3,SP))data(SP) summary(SP.aov <- aov(Strength~.^3,SP))
An implementation of the algorithm for simulation of observations from an arbitrary discrete distribution is provided here.
ST_Ordered(N, x, p_x)ST_Ordered(N, x, p_x)
N |
number of required random observations |
x |
the possible values of the RV |
p_x |
the probability vector associated with x |
Prabhanjan N. Tattar
sample
N <- 1e4 x <- 1:10 p_x <- c(0.05,0.17,0.02,0.14,0.11,0.06,0.05,0.04,0.17,0.19) table(ST_Ordered(N, x, p_x))N <- 1e4 x <- 1:10 p_x <- c(0.05,0.17,0.02,0.14,0.11,0.06,0.05,0.04,0.17,0.19) table(ST_Ordered(N, x, p_x))
Simulation observations from an arbitrary discrete distribution with probabilities arranged in desending/ascending order.
ST_Unordered(N, x, p_x)ST_Unordered(N, x, p_x)
N |
number of required random observations |
x |
the possible values of the RV |
p_x |
the probability vector associated with x |
Prabhanjan N. Tattar
sample
N <- 1e2 x <- 1:10 p_x <- c(0.05,0.17,0.02,0.14,0.11,0.06,0.05,0.04,0.17,0.19) ST_Unordered(N,x,p_x)N <- 1e2 x <- 1:10 p_x <- c(0.05,0.17,0.02,0.14,0.11,0.06,0.05,0.04,0.17,0.19) ST_Unordered(N,x,p_x)
This function returns the stationary distribution of an ergodic Markov chain. For details, refer Chapter 11 of the book.
stationdistTPM(M)stationdistTPM(M)
M |
a transition probability matrix (TPM) |
Prabhanjan N. Tattar
eigen
P <- matrix(nrow=3,ncol=3) # An example P[1,] <- c(1/3,1/3,1/3) P[2,] <- c(1/4,1/2,1/4) P[3,] <- c(1/6,1/3,1/2) stationdistTPM(P)P <- matrix(nrow=3,ncol=3) # An example P[1,] <- c(1/3,1/3,1/3) P[2,] <- c(1/4,1/2,1/4) P[3,] <- c(1/6,1/3,1/2) stationdistTPM(P)
Four measures of stiffness of 30 boards are available. The first measure of stiffness is obtained by sending a shock wave down the board, the second measure is obtained by vibrating the board, and remaining are obtained from static tests.
data(stiff)data(stiff)
A data frame with 30 observations on the following 4 variables.
x1first measure of stiffness is obtained by sending a shock wave down the board
x2second measure is obtained by vibrating the board
x3third measure is obtained by a static test
x4fourth measure is obtained by a static test
Johnson, R.A., and Wichern, D.W. (1982-2007). Applied Multivariate Statistical Analysis, 6e. Pearson Education.
data(stiff) colMeans(stiff) var(stiff) pairs(stiff)data(stiff) colMeans(stiff) var(stiff) pairs(stiff)
The swiss data set consists of measurements on the width of bottom margin and image diagonal length for forged and real notes. Histogram smoothing method is applied to understand the width of bottom margins for the forged notes.
data(swiss)data(swiss)
A data frame with 100 observations on the following 4 variables.
Bottforgbottom margin of forged notes
Diagforgdiagonal margin of forged notes
Bottrealbottom margin of real notes
Diagrealdiagonal margin of real notes
Simonoff, J.S. (1996). Smoothing Methods in Statistics. Springer.
data(swiss) par(mfrow=c(1,3)) hist(swiss$Bottforg,breaks=28,probability=TRUE,col=0,ylim=c(0,.5), xlab="Margin width (mm)",ylab="Density") hist(swiss$Bottforg,breaks=12,probability=TRUE,col=0,ylim=c(0,.5), xlab="Margin width (mm)",ylab="Density") hist(swiss$Bottforg,breaks=6,probability=TRUE,col=0,ylim=c(0,.5), xlab="Margin width (mm)",ylab="Density")data(swiss) par(mfrow=c(1,3)) hist(swiss$Bottforg,breaks=28,probability=TRUE,col=0,ylim=c(0,.5), xlab="Margin width (mm)",ylab="Density") hist(swiss$Bottforg,breaks=12,probability=TRUE,col=0,ylim=c(0,.5), xlab="Margin width (mm)",ylab="Density") hist(swiss$Bottforg,breaks=6,probability=TRUE,col=0,ylim=c(0,.5), xlab="Margin width (mm)",ylab="Density")
The Toluca Company manufactures equipment related to refrigerator. The company, in respect of a particular component of a refrigerator, has data on the labor hours required for the component in various lot sizes. Using this data, the officials wanted to find the optimum lot size for producing this part.
data("tc")data("tc")
A data frame with 25 observations on the following 2 variables.
Lot_Sizesize of the lot
Labour_Hoursthe labor hours required
Kutner, M. H., Nachtsheim, C. J., Neter, J., and Li, W. (1974-2005). Applied Linear Statistical Models, 5e. McGraw-Hill.
data(tc) tclm <- lm(Labour_Hours~Lot_Size,data=tc) tclm$coefficientsdata(tc) tclm <- lm(Labour_Hours~Lot_Size,data=tc) tclm$coefficients
An engineer wants to find out if the cotton weight percentage in a synthetic fiber effects the tensile strength. Towards this, the cotton weight percentage is fixed at 5 different levels of 15, 20, 25, 30, and 35. Each level of the percentage is assigned 5 experimental units and the tensile strength is measured on each of them. The randomization is specified in the Run_Number column. The goal of the engineer is to investigate if the tensile strength is same across the cotton weight percentage.
data(tensile)data(tensile)
A data frame with 25 observations on the following 4 variables.
Test_Sequencethe test sequence
Run_Numberthe run number
CWPcotton weight percentage
Tensile_Strengththe tensile strength
Montgomery, D. C. (1976-2012). Design and Analysis of Experiments, 8e. J.Wiley.
data(tensile) tensile$CWP <- as.factor(tensile$CWP) tensile_aov <- aov(Tensile_Strength~CWP, data=tensile) summary(tensile_aov) model.matrix(tensile_aov)data(tensile) tensile$CWP <- as.factor(tensile$CWP) tensile_aov <- aov(Tensile_Strength~CWP, data=tensile) summary(tensile_aov) model.matrix(tensile_aov)
A transition probaility matrix for understanding Markov chains.
data(testtpm)data(testtpm)
A matrix of transition probability matrix
Atransitions probabilities from State A
Btransitions probabilities from State B
Ctransitions probabilities from State C
Dtransitions probabilities from State D
Etransitions probabilities from State E
Ftransitions probabilities from State F
data(testtpm)data(testtpm)
A transition probaility matrix for understanding Markov chains.
data(testtpm2)data(testtpm2)
A matrix of transition probability matrix.
Atransitions probabilities from State A
Btransitions probabilities from State B
Ctransitions probabilities from State C
Dtransitions probabilities from State D
Etransitions probabilities from State E
Ftransitions probabilities from State F
data(testtpm2)data(testtpm2)
A transition probaility matrix for understanding Markov chains
data(testtpm3)data(testtpm3)
A data frame with 7 observations on the following 7 variables.
Atransitions probabilities from State A
Btransitions probabilities from State B
Ctransitions probabilities from State C
Dtransitions probabilities from State D
Etransitions probabilities from State E
Ftransitions probabilities from State F
Gtransitions probabilities from State G
data(testtpm3)data(testtpm3)
The trimean can be viewed as the average of median and average of the lower and upper quartiles. A fairly simply function is defined here.
TM(x)TM(x)
x |
variable of interest |
Prabhanjan N. Tattar
TMH, mean, median
The trimean is modified and defined based on hinges instead of the quartiles.
TMH(x)TMH(x)
x |
variable of interest |
Prabhanjan N. Tattar
TM
A function is defined here which will return the uniformly most powerful test for exponential distribution. The function needs a simple use of the "qgamma" function.
UMPExponential(theta0, n, alpha)UMPExponential(theta0, n, alpha)
theta0 |
the parameter of interest |
n |
the sample size |
alpha |
level of the UMP test |
Prabhanjan N. Tattar
The "UMPNormal" function returns the critical points required for the UMP test for a sample from normal distribution. The standard deviation is assumed to be known.
UMPNormal(mu0, sigma, n, alpha)UMPNormal(mu0, sigma, n, alpha)
mu0 |
the value of mean of interest |
sigma |
standard deviation |
n |
sample size |
alpha |
size of the UMP test |
Prabhanjan N. Tattar
A simple and straightforward function for obtaining the UMP test for a random sample from uniform distribution.
UMPUniform(theta0, n, alpha)UMPUniform(theta0, n, alpha)
theta0 |
the parameter value of interest |
n |
the sample size |
alpha |
the size of the required UMP test |
Prabhanjan N. Tattar
UMPUniform(0.6,10,0.05)UMPUniform(0.6,10,0.05)
Data is available on the crime rates across 47 states in USA, and we have additional information on 13 more explanatory variables.
data(usc)data(usc)
A data frame with 47 observations on the following 14 variables.
RCrime rate - the number of offenses known to the police per 1,000,000 population
AgeAge distribution - the number of males aged 14 to 24 years per 1000 of total state population
SBinary variable distinguishing southern states (S = 1) from the rest
EdEducational level - mean number of years of schooling times 10 of the population 25 years old and over
Ex0Police expenditure - per capita expenditure on police protection by state and local governments in 1960
Ex1Police expenditure - as Ex0, but for 1959
LFLabour force participation rate per 1000 civilian urban males in the age group 14 to 24 years
MNumber of males per 1000 females
NState population size in hundred thousands
NWNumber of non-whites per 1000
U1Unemployment rate of urban males per 1000 in the age group 14 to 24 years
U2Unemployment rate of urban males per 1000 in the age group 35 to 39 years
WWealth, as measured by the median value of transferable goods and assets. or family income (unit 10 dollars)
XIncome inequality: the number of families per 1000 earning below one half of the median income
Der, G., and Everitt, B.S. (2002). A Handbook of Statistical Analysis using SAS, 2e. CRC.
data(usc) pairs(usc) round(cor(usc),2)data(usc) pairs(usc) round(cor(usc),2)
The goal of this study is to find the impact of temperature on the viscosity of toluence-tetralin blends.
data(viscos)data(viscos)
A data frame with 8 observations on the following 2 variables.
Temperaturetemperature
Viscosityviscosity of toluence-tetralin blends
Montgomery, D.C., Peck, E.A., and Vining, G.G. (1983-2012). Introduction to Linear Regression Analysis, 5e. J. Wiley.
data(viscos) names(viscos) viscoslm <- lm(Viscosity~Temperature,data=viscos)data(viscos) names(viscos) viscoslm <- lm(Viscosity~Temperature,data=viscos)
The "vonNeumann" function implements the von Neumann random generator as detailed in Section 11.2.
vonNeumann(x, n)vonNeumann(x, n)
x |
the initial seed |
n |
number of required observations |
Prabhanjan N. Tattar
vonNeumann(x=11,n=10) vonNeumann(x=675248,n=10) vonNeumann(x=8653,n=100)vonNeumann(x=11,n=10) vonNeumann(x=675248,n=10) vonNeumann(x=8653,n=100)
Water
samples from four cities are collected and their physico-chemical properties for ten variables, such as pH, Conductivity, Dissolution, etc., are measured. We would then like to test if the properties are same across the four cities and in which case a same water treatment approach can be adopted across the cities.
data(waterquality)data(waterquality)
A data frame with 63 observations on the following 10 variables.
Cityfour cities City1 City2 City3 City4
pHthe pH concentration
Conductivitywater conductivity
Dissolutionwater dissolution
Alkalinityalkalinity of the water sample
Hardnesswater hardness
Calcium.Hardnesscalcium hardness of the water
Magnesium.Hardnessmagnesium hardness of the water
Chlorideschloride content
Sulphatessulphate content
Gore, A.P., Paranjape, S. A., and Kulkarni, M.B. (2006). 100 Data Sets for Statistics Education. Department of Statistics, University of Pune.
data(waterquality)data(waterquality)
The Wilson confidence interval for a sample from binomial distribution is a complex formula. This function helps the reader in easily obtaining the required confidence interval as discussed and detailed in Section 16.5.
WilsonCI(x, n, alpha)WilsonCI(x, n, alpha)
x |
the number of successes |
n |
the number of trials |
alpha |
the confidence interval size |
Prabhanjan N. Tattar
WilsonCI(x=10658,n=15000,alpha=0.05) prop.test(x=10658,n=15000)$conf.intWilsonCI(x=10658,n=15000,alpha=0.05) prop.test(x=10658,n=15000)$conf.int
The "ww.test" function is an implementation of the famous Wald-Wolfowitz nonparametric test as discussed in Section 8.5.
ww.test(x, y)ww.test(x, y)
x |
values from sample 1 |
y |
values from sample 2 |
Prabhanjan N. Tattar
This is a simulated dataset with two modes at -2 and 2 and we have 400 observations.
data(x_bimodal)data(x_bimodal)
The format is: num [1:400] -4.68 -4.19 -4.05 -4.04 -4.02 ...
data(x_bimodal) h <- 0.5; n <- length(x_bimodal) dens_unif <- NULL; dens_triangle <- NULL; dens_epanechnikov <- NULL dens_biweight <- NULL; dens_triweight <- NULL; dens_gaussian <- NULL for(i in 1:n) { u <- (x_bimodal[i]-x_bimodal)/h xlogical <- (u>-1 & u <= 1) dens_unif[i] <- (1/(n*h))*(sum(xlogical)/2) dens_triangle[i] <- (1/(n*h))*(sum(xlogical*(1-abs(u)))) dens_epanechnikov[i] <- (1/(n*h))*(sum(3*xlogical*(1-u^2)/4)) dens_biweight[i] <- (1/(n*h))*(15*sum(xlogical*(1-u^2)^2/16)) dens_triweight[i] <- (1/(n*h))*(35*sum(xlogical*(1-u^2)^3/32)) dens_gaussian[i] <- (1/(n*h))*(sum(exp(-u^2/2)/sqrt(2*pi))) } plot(x_bimodal,dens_unif,"l",ylim=c(0,.25),xlim=c(-5,7),xlab="x", ylab="Density",main="B: Applying Kernel Smoothing") points(x_bimodal,dens_triangle,"l",col="red") points(x_bimodal,dens_epanechnikov,"l",col="green") points(x_bimodal,dens_biweight,"l",col="blue") points(x_bimodal,dens_triweight,"l",col="yellow") points(x_bimodal,dens_gaussian,"l",col="orange") legend(4,.23,legend=c("rectangular","triangular","epanechnikov","biweight", "gaussian"),col=c("black","red","green","blue","orange"),lty=1)data(x_bimodal) h <- 0.5; n <- length(x_bimodal) dens_unif <- NULL; dens_triangle <- NULL; dens_epanechnikov <- NULL dens_biweight <- NULL; dens_triweight <- NULL; dens_gaussian <- NULL for(i in 1:n) { u <- (x_bimodal[i]-x_bimodal)/h xlogical <- (u>-1 & u <= 1) dens_unif[i] <- (1/(n*h))*(sum(xlogical)/2) dens_triangle[i] <- (1/(n*h))*(sum(xlogical*(1-abs(u)))) dens_epanechnikov[i] <- (1/(n*h))*(sum(3*xlogical*(1-u^2)/4)) dens_biweight[i] <- (1/(n*h))*(15*sum(xlogical*(1-u^2)^2/16)) dens_triweight[i] <- (1/(n*h))*(35*sum(xlogical*(1-u^2)^3/32)) dens_gaussian[i] <- (1/(n*h))*(sum(exp(-u^2/2)/sqrt(2*pi))) } plot(x_bimodal,dens_unif,"l",ylim=c(0,.25),xlim=c(-5,7),xlab="x", ylab="Density",main="B: Applying Kernel Smoothing") points(x_bimodal,dens_triangle,"l",col="red") points(x_bimodal,dens_epanechnikov,"l",col="green") points(x_bimodal,dens_biweight,"l",col="blue") points(x_bimodal,dens_triweight,"l",col="yellow") points(x_bimodal,dens_gaussian,"l",col="orange") legend(4,.23,legend=c("rectangular","triangular","epanechnikov","biweight", "gaussian"),col=c("black","red","green","blue","orange"),lty=1)
A simple and innovative design is often priceless. Youden and Beale (1934) sought to find the effect of two preparations of virus on tobacco plants. One half of a tobacco leaf was rubbed with cheesecloth soaked in one preparation of the virus extract and the second half was rubbed with the other virus extract. This experiment was replicated on just eight leaves, and the number of lesions on each half leaf was recorded.
data(yb)data(yb)
A data frame with 8 observations on the following 2 variables.
Preparation_1a numeric vector
Preparation_2a numeric vector
Youden, W. J., and Beale, H. P. (1934). A Statistical Study of the Local Lesion Method for Estimating Tobacco Mosaic Virus. Contrib. Boyce Thompson Inst, 6, 437-454.
data(yb) summary(yb) quantile(yb$Preparation_1,seq(0,1,.1)) # here seq gives 0, .1, .2, ...,1 quantile(yb$Preparation_2,seq(0,1,.1)) fivenum(yb$Preparation_1) fivenum(yb$Preparation_2)data(yb) summary(yb) quantile(yb$Preparation_1,seq(0,1,.1)) # here seq gives 0, .1, .2, ...,1 quantile(yb$Preparation_2,seq(0,1,.1)) fivenum(yb$Preparation_1) fivenum(yb$Preparation_2)