Building an empirical cumulative distribution function and data interpolation in R -
here's example data frame i'm working with
level income cumpop 1 17995.50 0.028405 2 20994.75 0.065550 3 29992.50 0.876185 4 41989.50 2.364170 5 53986.50 4.267305 6 65983.50 6.323390 7 77980.51 8.357625 8 89977.50 10.238910 9 101974.50 11.923545 10 113971.51 13.389680 11 125968.49 14.659165 12 137965.50 15.753850 13 149962.52 16.673735 14 161959.50 17.438485 15 173956.50 18.093985 16 185953.52 18.640235 17 197950.52 19.099085 18 209947.52 19.514235 19 221944.50 19.863835 20 233941.50 20.169735 21 251936.98 20.628585 22 275931.00 20.936670 23 383904.00 21.850000
the entire population of particular country has been sorted income , grouped 23 corresponding 'levels'. income
variable average income of members of level (this importantly different saying, example, 10th percentile income 17995.50).
but population size of each level inconsistent (you'll notice if @ difference in cumpop
i.e. cumulative population). ultimately, want build 10-row data frame gives interpolated decile values variable income
, that, example, we'd able "the poorest 10% of population on average make 28,000" or "those in 20th 30th percentile of population on average make 41,000" or on. want reduce these 23 levels 10 levels of equal population size (taking cumpop[23] total population), requires interpolation.
i've looked around library sort of empirical cumulative distribution function generation/interpolation , seems ecdf
quite useful, i'm not sure how apply income
subject cumpop
described above.
would appreciate direction here.
a quick , dirty solution using loess interploation. span set short ensure perfect fit, sadly makes error terms meaningless. worth trying proper regression.
incdist <- read.table("inc.txt", header=true) fit <- loess(incdist$income~incdist$cumpop, span=0.2) v2 <- predict(fit, seq(0, max(incdist$cumpop)*9/10, max(incdist$cumpop)/10)) v1 <- seq(0, max(incdist$cumpop)*9/10, max(incdist$cumpop)/10) pred <- data.frame(v1, v2) par(mar=c(5, 5.5, 4, 2) + 0.1) plot(incdist$income~incdist$cumpop, type="n", xaxt="n", yaxt="n", xlab="percentile", ylab=expression(frac("average income",1000)), main="income distribution") abline(h=v2, v=v1[-1], col="grey") points(incdist$income~incdist$cumpop, col="grey") lines(loess(incdist$income~incdist$cumpop, span=0.2), col="red") points(pred, col="blue", cex=1.5, pch=9) axis(side=1, at=v1[-1], labels=c(1:9)*10) axis(side=2, at=v2, labels=round(v2/1000), las=1)
Comments
Post a Comment