Building an empirical cumulative distribution function and data interpolation in R -


here's example data frame i'm working with

 level    income    cumpop  1      17995.50  0.028405  2      20994.75  0.065550  3      29992.50  0.876185  4      41989.50  2.364170  5      53986.50  4.267305  6      65983.50  6.323390  7      77980.51  8.357625  8      89977.50 10.238910  9     101974.50 11.923545 10     113971.51 13.389680 11     125968.49 14.659165 12     137965.50 15.753850 13     149962.52 16.673735 14     161959.50 17.438485 15     173956.50 18.093985 16     185953.52 18.640235 17     197950.52 19.099085 18     209947.52 19.514235 19     221944.50 19.863835 20     233941.50 20.169735 21     251936.98 20.628585 22     275931.00 20.936670 23     383904.00 21.850000 

the entire population of particular country has been sorted income , grouped 23 corresponding 'levels'. income variable average income of members of level (this importantly different saying, example, 10th percentile income 17995.50).

but population size of each level inconsistent (you'll notice if @ difference in cumpop i.e. cumulative population). ultimately, want build 10-row data frame gives interpolated decile values variable income, that, example, we'd able "the poorest 10% of population on average make 28,000" or "those in 20th 30th percentile of population on average make 41,000" or on. want reduce these 23 levels 10 levels of equal population size (taking cumpop[23] total population), requires interpolation.

i've looked around library sort of empirical cumulative distribution function generation/interpolation , seems ecdf quite useful, i'm not sure how apply income subject cumpop described above.

would appreciate direction here.

a quick , dirty solution using loess interploation. span set short ensure perfect fit, sadly makes error terms meaningless. worth trying proper regression.

incdist <- read.table("inc.txt", header=true)  fit <- loess(incdist$income~incdist$cumpop, span=0.2) v2 <- predict(fit, seq(0, max(incdist$cumpop)*9/10, max(incdist$cumpop)/10)) v1 <- seq(0, max(incdist$cumpop)*9/10, max(incdist$cumpop)/10) pred <- data.frame(v1, v2)  par(mar=c(5, 5.5, 4, 2) + 0.1)  plot(incdist$income~incdist$cumpop, type="n", xaxt="n", yaxt="n",     xlab="percentile", ylab=expression(frac("average income",1000)),     main="income distribution")  abline(h=v2, v=v1[-1], col="grey") points(incdist$income~incdist$cumpop, col="grey") lines(loess(incdist$income~incdist$cumpop, span=0.2), col="red") points(pred, col="blue", cex=1.5, pch=9) axis(side=1, at=v1[-1], labels=c(1:9)*10) axis(side=2, at=v2, labels=round(v2/1000), las=1) 

enter image description here


Comments

Popular posts from this blog

matlab - error with cyclic autocorrelation function -

django - (fields.E300) Field defines a relation with model 'AbstractEmailUser' which is either not installed, or is abstract -

c# - What is a good .Net RefEdit control to use with ExcelDna? -