PSPP offers logistic regression, a great tool to build classifiers with. It does not (yet?) offer the possibility to save and apply models, but as you will see, this can be overcome with a bit of effort.

I am going to use 2 data sets referenced in this blog post by Anup Kumar Jana. The same files are on Kaggle with a Public Domain license, so I guess it is OK for you and me to download them. The data sets have a column Loan_Status (values "Y" for ok and "N" for default) representing the dependent variable.

The training data

As always when importing text data, it is a good idea to turn off warnings. PSPPIRE's import wizard works like a charm in this case.

SET MXWARNS = 0.

First some univariate statistics:

FREQUENCIES Loan_Status Gender Married
  CoapplicantIncome LoanAmount.

The character variables have, of course, to be translated into numeric ones. Furthermore, we have to deal with the missing data.

We recode the variables with 2 levels into boolean 0/1, assigning empty values to the most frequent category.

COMPUTE statusOK = Loan_Status EQ "Y".
COMPUTE isGraduate = Education NE "Not Graduate".
COMPUTE isSelfEmployed = Self_Employed EQ "Yes".

The Property_Area variable is a categorical variable with 3 levels. While for many other systems (like R in the blog post referenced at the top) you would create 3 boolean variables, one for each category, PSPP's LOGISTIC REGRESSION wants to do the recoding itself (through the /CATEGORICAL subcommand) and will fail to deal with this kind of mutually exclusive variables.

RECODE Property_Area ("Rural" = 1) ("Semiurban" = 2) ("Urban" = 3)
  INTO urbanization.
VALUE LABELS urbanization 1 "Rural" 2 "Semiurban" 3 "Urban".

Of course, urbanization could be treated as an ordinal variable, in which case we could feed it to the regression as if it were numeric. For didactic purposes we will treat it as categorical.

As said before, we will substitute missing values with the mode for discrete variables. For LoanAmount, the only scale variable with missing values, we substitute with the mean.

RECODE LoanAmount (SYSMIS = 146.41) /* meansub */.

The two income variables ApplicantIncome and CoapplicantIncome, having high maximum values (unlike LoanAmount), deserve an outlier limit. We don't know what values we will have to face in the validation data, or worse, in the data we will apply the model (coefficient * value) to. I will adopt the limit that the whiskers of a boxplot often represent: Q3 + 1.5 * (Q3 - Q1), where Q1 and Q3 are the first and third quartiles of the distribution respectively.

To find the quartiles, we can take the median twice. First, the overall medians:

COMPUTE constant = 1.
AGGREGATE OUTFILE = * MODE = ADDVARIABLES
  /BREAK = constant
  /medApplIncome medCoAppIncome = MEDIAN(ApplicantIncome CoapplicantIncome).

Now we take the medians of the lower and upper halves.
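One way the medians of the two halves, and from them the limit Q3 + 1.5 * (Q3 - Q1), could be computed is sketched here for ApplicantIncome only. It is a sketch under assumptions, not necessarily the syntax used in this post: it assumes that constant and medApplIncome already exist on every case, and the names lowIncome, highIncome, q1ApplIncome, q3ApplIncome and limApplIncome are made up for illustration. Capping the values above the limit is also my assumption about how such a limit would be applied.

```spss
* Assumes constant = 1 and medApplIncome (overall median) are on every case.
* Split the incomes into a lower and an upper half (ties go to the lower half).
COMPUTE lowIncome = ApplicantIncome.
COMPUTE highIncome = ApplicantIncome.
IF (ApplicantIncome GT medApplIncome) lowIncome = $SYSMIS.
IF (ApplicantIncome LE medApplIncome) highIncome = $SYSMIS.

* Median of each half. MEDIAN skips the system-missing values.
AGGREGATE OUTFILE = * MODE = ADDVARIABLES
  /BREAK = constant
  /q1ApplIncome = MEDIAN(lowIncome)
  /q3ApplIncome = MEDIAN(highIncome).

* Upper whisker limit, then cap incomes above it (capping is an assumption).
COMPUTE limApplIncome = q3ApplIncome + 1.5 * (q3ApplIncome - q1ApplIncome).
IF (ApplicantIncome GT limApplIncome) ApplicantIncome = limApplIncome.
EXECUTE.
```

The same steps would be repeated for CoapplicantIncome with medCoAppIncome. Writing the limit into a variable rather than typing it in as a number keeps the syntax reusable if the training data change, although the resulting value still has to be noted down in order to apply it to other data sets.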