Programmability in SPSS 15
The Revolution Continues
Jon Peck
Technical Advisor
SPSS
Copyright (c) SPSS Inc, 2006
Agenda

Recap of SPSS 14 Python programmability

Developer Central

New features in SPSS 15 programmability
  Writing first-class procedures
  Updating the data

The Bonus Pack modules

Interacting with the user

Q&A

Conclusion
Quotations from SPSS Users

"Because of programmability, SPSS 14 is the most
important release since I started using SPSS fifteen
years ago."

"I think I am going to like using Python."

"Python, here I come!"

"I now think Python is an amazing language."

"Python and SPSS 14 and later are, IMHO, GREAT!"

"By the way, Python is a great addition to SPSS."
The Combination of SPSS and Python

SPSS provides a powerful engine for statistical and
graphical methods and for data management.

Python® provides a powerful, elegant, and easy-to-learn
language for controlling and responding to this engine.

Together they provide a comprehensive system for
serious applications of analytical methods to data.
Programmability Features in SPSS 14 and 15

SPSS 14.0 provided
  Programmability
  Multiple datasets
  Variable and File Attributes
  Programmability read-access to case data
  Ability to control SPSS from a Python program

SPSS 15 adds
  Read and write case data
  Create new variables directly rather than generating syntax
  Create pivot tables and text blocks via backend APIs
  Easier setup
Programmability Advantages

Makes possible jobs that respond to datasets, output,
environment

Allows greater generality, more automation

Makes jobs more robust

Allows extending the capabilities of SPSS

Enables better organized and more maintainable code

Facilitates staff specialization

Increases productivity

More fun
Programmability Overview

Python extends SPSS via
  General programming language
  Access to variable dictionary, case data, and output
  Access to standard and third-party modules
  SPSS Developer Central modules
  Module structure for building libraries of code

Runs in "back-end" syntax context (like macro)
  SaxBasic scripting runs in "front-end" context

Two modes
  Traditional SPSS syntax window
  Drive SPSS from Python (external mode)

Optional install
Legal Notice

SPSS is not the owner or licensor of the Python
software. Any user of Python must agree to the
terms of the Python license agreement located on
the Python web site. SPSS is not making any
statement about the quality of the Python program.
SPSS fully disclaims all liability associated with
your use of the Python program.
The SPSS Programmability SDK

Supports implementing various programming languages
  Requires a programmer to implement a new language

VB.NET Plug-In available on Developer Central
  Works only in external mode
How Programmability Works

Python interpreter embedded within SPSS

SPSS runs in traditional way until BEGIN PROGRAM
command is found

Python collects commands until END PROGRAM
command is found; then runs the program

Python can communicate with SPSS through APIs (calls to functions)
  Includes running SPSS syntax inside Python program
  Includes creating macro values for later use in syntax

Python can access SPSS output and data
  OMS is a key tool
Example:
Summarize Categorical Variables
BEGIN PROGRAM.
import spss, spssaux
spssaux.GetSPSSInstallDir("SPSSDIR")
spssaux.OpenDataFile("SPSSDIR/employee data.sav")
# find categorical variables
catVars = spssaux.VariableDict(variableLevel=['nominal', 'ordinal'])
if catVars:
    spss.Submit("FREQ " + " ".join(catVars.variables))
# create a macro listing categorical variables
spss.SetMacroValue("!catVars", " ".join(catVars.variables))
END PROGRAM.
DESC !catVars.
Programmability Inside or Outside SPSS

Two modes of operation

SPSS Drives mode (inside): traditional syntax context
  BEGIN PROGRAM …program… END PROGRAM

X Drives mode (outside): eXternal program drives SPSS
  Python interpreter (or VB.NET)
  import spss
  No SPSS Viewer, Data Editor, or SPSS user interface
    Output sent as text to the application – can be suppressed
    Has performance advantages

Build programs with an IDE
  Even if to be run in traditional mode
PythonWin IDE Controlling SPSS
Python Resources

Python.org

Python Tutorial

Global (standard) Module Index

Python help system and help command

Cheeseshop

1627 packages as of Sept 21, 2006

SPSS Developer Central

SPSS Programming and Data Management, 3rd ed, 2006.

Many books

Look for books at the Python 2.4 level
Python Books

Dive Into Python book or PDF

Practical Python by Magnus Lie Hetland
  Extensive examples and discussion of Python

Python Cookbook, 2nd ed by Martelli, Ravenscroft, & Ascher

Second edition (July, 2006) of Martelli, Python in a Nutshell, O'Reilly
  Very clear, comprehensive reference material

wxPython in Action by Rappin and Dunn
  Explains user interface building with wxPython
Cheeseshop: scipy

scipy 0.5.0 Scientific Algorithms Library for Python

scipy is an open source library of scientific tools for
Python. scipy gathers a variety of high level science and
engineering modules together as a single package. scipy
provides modules for statistics, optimization, integration,
linear algebra, Fourier transforms, signal and image
processing, genetic algorithms, ODE solvers, special
functions, and more. scipy requires and supplements
NumPy, which provides a multidimensional array object and
other basic functionality.

scipy rework currently beta

Visit Scipy.org
SPSS Developer Central

New Web home for developing SPSS applications

SPSS Developer Central

old url: forums.spss.com/code_center

Python Integration Plug-Ins

Useful supplementary modules by SPSS and others

Updated for SPSS 15

Articles on programmability and graphics

Place to ask questions and exchange information

Programmability Extension SDK

Get Python itself from Python.org
  SPSS uses Python 2.4 (2.4.3)

Not limited to programmability
  GPL graphics
  User-contributed code

Went live 21-May-2006

Key Supplementary Modules
  spssaux
  spssdata
  New for SPSS 15
    trans
    extendedTransforms
    rake
    pls
Approaches to Creating New Procedures

You can extend SPSS capabilities by building new procedures
  Or use ones that others have built

Combine SPSS procedures and transformations with Python logic
  Poisson regression (SPSS 14) example using iterated CNLR
  New raking procedure built over GENLOG

Calculate data aggregates in SPSS and pass to algorithm coded in Python
  Raking procedure starts with AGGREGATE

Acquire case data and compute in Python
  Use Python standard modules and third-party additions
  Partial Least Squares Regression (pls module)
Adapt Existing Code Libraries

Common to adapt existing libraries or code for use
as Python extension modules
  C, C++, VB, Fortran,...

Extension modules are normal Python modules
  Python itself written in C
  Many standard modules are C code

Python tools and APIs to assist
  Chap 25 in Python in a Nutshell
  Tutorial on extending and embedding the Python interpreter
Partial Least Squares Regression

Regression with large number of predictors (even k > N)

Similar to Principal Components but considers dependent
variable simultaneously

Calculates principal components of (y, X), then uses regression
on the scores instead of the original data

User chooses number of factors

Equivalent to ordinary regression when number of factors
equals number of predictors and one y variable

For more information see An Optimization Perspective on
Kernel Partial Least Squares Regression.pdf.
The pls Module

Strategy
  Fetches data from SPSS
  Uses scipy matrix operations to compute results
    Third-party module from Cheeseshop
  Writes pivot tables to SPSS Viewer
    Subject to OMS
    SPSS 14 viewer module created pivot table using OLE automation
  Saves predicted values to active dataset
pls Example: REGRESSION vs PLS
GET FILE="c:/spss15/tutorial/sample_files/car_sales.sav".
REGRESSION /STATISTICS COEFF R /DEPENDENT sales
/METHOD=ENTER curb_wgt engine_s fuel_cap horsepow
length mpg price resale type wheelbas width .
begin program.
import spss, pls
pls.plsproc("sales", """curb_wgt engine_s fuel_cap horsepow
length mpg price resale type wheelbas width""",
yhat="predsales")
end program.

plsproc defaults to five factors
Results

PLS with 5 factors almost equals regression with 11 variables
Raking Sample Weights

"Raking" adjusts sample weights to control totals in n
dimensions

Example: data classified by age and sex with known
population totals or proportions

Calculated by fitting a main effects loglinear model



Various adjustments required
Not a complete solution to reweighting
Not directly available in SPSS
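
The idea of adjusting to control totals in two dimensions can be sketched in plain Python. This is only an illustration: it uses iterative proportional fitting, a common way to compute raking adjustments, rather than a loglinear procedure, and the function name, the counts, and the control totals are all made up for the example. It assumes every row and column has a nonzero count and that the row and column totals sum to the same grand total. It is written for a current Python rather than Python 2.4.

```python
def rake_two_way(counts, row_totals, col_totals, tol=1e-9, max_iter=500):
    """Scale a two-way table of counts until its margins match the
    control totals (iterative proportional fitting).

    counts maps (row, col) -> sample count; returns fitted counts."""
    fitted = {cell: float(n) for cell, n in counts.items()}
    for _ in range(max_iter):
        # Step 1: scale each row to its control total.
        for r, target in row_totals.items():
            total = sum(n for (i, j), n in fitted.items() if i == r)
            for (i, j) in fitted:
                if i == r:
                    fitted[(i, j)] *= target / total
        # Step 2: scale each column to its control total.
        for c, target in col_totals.items():
            total = sum(n for (i, j), n in fitted.items() if j == c)
            for (i, j) in fitted:
                if j == c:
                    fitted[(i, j)] *= target / total
        # Columns now match exactly; stop once the rows match too.
        worst = max(
            abs(sum(n for (i, j), n in fitted.items() if i == r) - target)
            for r, target in row_totals.items()
        )
        if worst < tol:
            break
    return fitted


# Hypothetical sample classified by age group (rows) and sex (columns).
sample = {(0, 0): 30, (0, 1): 20, (1, 0): 10, (1, 1): 40}
fitted = rake_two_way(sample, {0: 60, 1: 40}, {0: 50, 1: 50})

# The raked weight for a case is the fitted count over the sample count.
weights = {cell: fitted[cell] / sample[cell] for cell in sample}
```

Each case in a cell would receive that cell's weight, so the weighted sample reproduces both sets of control totals.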
Raking Module

Strategy: combine SPSS procedures with Python logic

rake.py (part of SPSS 15 Bonus Pack)
  Aggregates data via AGGREGATE to new dataset
  Creates new variable with control totals
  Applies GENLOG, saving predicted counts
  Adjusts predicted counts
  Matches back into original dataset
    Does not use MATCH FILES or require a SORT command

Written in one (long) day

rake.rake("age sex",
    [{0: 1140, 1: 1140}, {0: 104.6, 1: 2175.4}],
    finalweight="finalwt")
Extending SPSS Transformations

SPSS 14 programmability can wrap SPSS syntax in Python logic
  Useful when definitions can be expressed in SPSS syntax

SPSS 15 programmability can generate new variables directly
  Cursor can have accessType='w'

SPSS 15 programmability can create new datasets from scratch
  Cursor can have accessType='n'

SPSS 15 programmability can add cases directly
  Cursor can have accessType='a'

spssdata module on Developer Central updated to support these modes
trans and extendedTransforms Modules

trans module facilitates plugging in Python code to
iterate over cases

Runs as an SPSS procedure
  Passes the data
  Adds variables to the SPSS variable dictionary
  Can apply any calculation casewise

Use with
  Standard Python functions (e.g., math module)
  Any user-written functions or appropriate classes
  Functions in extendedTransforms module
trans and extendedTransforms Modules

trans strategy
  Pass case data through Python code, writing the
  result back to SPSS in new variables

extendedTransforms: a collection of ten functions to
apply to SPSS variables
  Regular expression search/replace
  Template-based substitution
  soundex and nysiis functions for phonetic equivalence
  Levenshtein distance function for string similarity
  Date/time conversions based on patterns
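
For readers unfamiliar with Levenshtein distance, the following is a minimal textbook sketch of the calculation in plain Python. It is not the extendedTransforms code, whose signature and details may differ; the function name here is chosen only for illustration, and it is written for a current Python rather than Python 2.4.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn string a into b."""
    if len(a) < len(b):
        a, b = b, a                       # keep the inner row short
    previous = list(range(len(b) + 1))    # distances from "" to b[:j]
    for i, ca in enumerate(a, 1):
        current = [i]                     # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


# Spelling variants of one surname are each one edit apart.
d_pech = levenshtein("Peck", "Pech")
d_pek = levenshtein("Peck", "Pek")
```

A small distance relative to the string length suggests two values are spelling variants of the same name.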
Python Regular Expressions

Pattern matching in text strings

If you use SPSS index or replace, you need these

Standardize string data (Mr, Mr., Herr, Senor,...)

Patterns can be simple strings (as with SPSS
index) or complex patterns

Pick out variable names with common parts
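
Two of the uses listed above can be sketched with Python's standard re module alone. The list of titles, the sample names, and the variable names are invented for the illustration and do not come from any SPSS data file.

```python
import re

# Hypothetical set of titles to standardize; extend the alternation as needed.
TITLE = re.compile(r"^(mr|herr|senor)\.?\s+", re.IGNORECASE)


def standardize_title(name):
    """Replace a leading Mr, Mr., Herr, or Senor with the single form 'Mr.'."""
    return TITLE.sub("Mr. ", name)


def pick_names(varnames, pattern):
    """Return the variable names that contain a match for pattern."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [v for v in varnames if rx.search(v)]


names = ["Herr Schmidt", "mr Jones", "Senor Garcia", "Dr. Who"]
standardized = [standardize_title(n) for n in names]

# Pick out variables named q<digits>_something.
varnames = ["age", "wage", "q1_score", "q2_score", "region"]
score_vars = pick_names(varnames, r"^q\d+_")
```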
Regular Expressions: A Few Examples

"age" – string containing the letters age

"\bage\b" – string containing the word age

"abc|xyz|pqrst" – string containing any of abc, xyz, or pqrst

"\d+" – a string of one or more digits

"x.*y" – an x followed, after any number of characters, by a y

Can be case sensitive or not

Can greatly simplify code currently using SPSS index and
replace functions
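
These patterns can be tried directly with Python's re module. In this small sketch the sample strings and the helper name found are invented for the illustration, and the word-boundary form \bage\b is used to match age as a whole word.

```python
import re


def found(pattern, text, flags=0):
    """True if pattern matches anywhere in text."""
    return re.search(pattern, text, flags) is not None


letters_age = found(r"age", "wages")              # letters appear inside a word
word_age_in_wages = found(r"\bage\b", "wages")    # not a separate word here
word_age = found(r"\bage\b", "my age is 42")      # a separate word here
any_of = found(r"abc|xyz|pqrst", "uvwxyz")        # one alternative is present

digits = re.findall(r"\d+", "PO 80mg, 2 tabs")    # every run of digits

# .* is greedy, so the match runs from the first x to the last y.
span = re.search(r"x.*y", "axbycyd").group()

case_sensitive = found(r"peck", "Jon Peck")
case_insensitive = found(r"peck", "Jon Peck", re.IGNORECASE)
```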
Using trans and extendedTransforms search Function
import spss, trans, spssaux, extendedTransforms
spssaux.OpenDataFile("c:/data/names.sav")
tproc = trans.Tfunction(listwiseDeletion=True)
# match any of three spellings
tproc.append(extendedTransforms.search, 'match','a8',
    ['names', trans.const('Peck|Pech|Pek')])
# match 'peck', ignoring case
tproc.append(extendedTransforms.search, 'matchignorecase','a8',
    ['names', trans.const('peck'), trans.const(True)])
# also save the start position and length of the match
tproc.append(extendedTransforms.search, ('match2','startpos','length'),
    ('a12','f4.0','f4.0'), ['names', trans.const('Peck')])
tproc.execute()
# keep only the cases where the last search found a match
spss.Submit("SELECT IF length > 0")
spssaux.SaveDataFile("c:/temp/namesplus.sav")
Using trans: Writing Your Own Function
begin program.
import trans, re
def splitAndExtract(s):
    """split a string on "--" and return the left part and the number
    in the right part. Ex: "simvastatin-- PO 80mg TAB" -> "simvastatin", 80"""
    parts = s.split("--")
    try:
        # first run of digits after the "--"
        number = int(re.search(r"\d+", parts[1]).group())
    except (IndexError, AttributeError):
        # no "--" in the string, or no digits after it
        number = None
    return parts[0], number
tproc = trans.Tfunction()
tproc.append(splitAndExtract, ("name", "number"), ("a30", "f5.0"), ["medicine"])
tproc.execute()
end program.
extendedTransforms
soundex and nysiis

Algorithms for approximating phonetic equivalence of
names

soundexallwords can be used on unstructured text

Applied to database of 20,000+ surnames
import spss, trans, spssaux, extendedTransforms
spssaux.OpenDataFile("c:/data/names.sav")
tproc = trans.Tfunction()
tproc.append(extendedTransforms.soundex, 'soundex','a5', ['names'])
tproc.append(extendedTransforms.nysiis, 'nysiis', 'a20', ['names'])
tproc.execute()
spssaux.SaveDataFile("c:/temp/namesplusplus.sav")
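
To show what a phonetic code looks like, here is a minimal sketch of the classic American Soundex rules in plain Python. It is an illustration only: the extendedTransforms.soundex function may differ in details such as code length or the handling of non-letters, and the length parameter below is an assumption of this sketch.

```python
# Consonant groups that share a Soundex digit.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name, length=4):
    """Return the first letter plus digits for the following consonants,
    padded with zeros or truncated to the requested length."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    last = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue              # h and w do not break a run of equal codes
        code = CODES.get(ch, "")  # vowels get no code
        if code and code != last:
            result += code
        last = code               # a vowel resets the run
    return (result + "0" * length)[:length]


# Spelling variants of one surname collapse to the same code.
codes = [soundex(n) for n in ("Peck", "Pech", "Pek")]
```

Because the three spellings share one code, grouping or matching on the code treats them as the same name.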
Results
soundex on Unstructured Text

(Overly) simple processing of unstructured text

Use soundex word by word to abstract spelling

No stemming, linguistic analysis etc
  Use STAFS for serious work

Very simple to use

begin program.
import spss, trans, extendedTransforms
t = trans.Tfunction()
t.append(extendedTransforms.soundexallwords, 'allsoundexn66',
    'a108', ['n_66'])
t.execute()
end program.
Creating a Graphical User Interface

Python comes with Tkinter, a GUI toolkit

There are better ones freely downloadable
  E.g., wxPython
  Visit wxpython.org

Very easy to do small user interactions

Examples
  Message box
  File chooser
  Variable picker
Simple Message Box Using wxPython

Simple File Chooser Using wxPython

Variable Picker Using wxPython

Other New spss Module APIs

User-missing values
  GetVarMissingValues
  GetSPSSLowHigh

Pivot table APIs
  BasePivotTable
  CellText
  Dimension

Output Text block support
  Good for writing comments to the Viewer

Miscellaneous
  GetWeightVar
  HasCursor
  SplitChange
Recap

SPSS 14 introduced major programmability features

SPSS 15 adds
  Reading and writing case data: new variables; new cases
  Creating pivot tables and text blocks
  Writing first-class SPSS procedures

Bonus Pack and Partial Least Squares modules illustrate
these features

Developer Central improves ability to provide modules and
information
  Will soon have four new SPSS 15 modules
Questions?
SPSS 15:
The Revolution Continues

SPSS 15 programmability makes it easy to add
capabilities beyond what is already built in to SPSS

SPSS 15 makes it easier to build complete
applications on top of SPSS

SPSS 15 programmability makes you more
productive

SPSS 15 has lots of other great features, too

Try it out
Write to Me!