CPSC 170 A

Lecture 1 - CPSC170 Setup and CPSC120 Review

Solution for this lab's assignment has been posted here.

Let's get started by setting up your directory structure for CPSC170. You will need to create a couple of directories:

cs170/
cs170/assignments
cs170/labs
cs170/exams

Remember, you can use the mkdir command to make directories from the command line. And you should be using the command line, as opposed to the GUI.

All materials you write for a given class will be stored in your labs directory. For those of you from CPSC120B, this will be a slight change for you. For those of you from CSPC120A, this is likely what you did all semester.

While we are discussing it, you might as well create a directory for today's lab:

$ cd ~/cs170/labs
$ mkdir lab1
$ cd lab1

CPSC120 Review

I have learned (from experience) that programming skills atrophy over the short Winter break. So, we are going to spend today simply as a review day. Nothing new is being taught today, we are just trying to knock the rust off of your skills.

Lab Assignment 1

K-Means Clustering

We are constantly being bombarded with data. It ranges from your grades, credit scores, blood pressure, etc. Making sense of data in an automated fashion is known as data mining. Given enough data, data mining tries to find a pattern to deduce meaningful results.

One of the more simple forms of data mining is known as cluster analysis. This attempts to divide the data into groups known as clusters, where each group shares some property. One of the easiest to understand clustering algorithms is known as K-means, and it happens to be one of the oldest clustering techniques.

The algorithm for K-means is simple. First, choose some number k. This number will represent the number of clusters you will use. Next, you randomly choose k of your data-points to represent the initial centers of the clusters.

You will then assign each data-point to a cluster, based on its distance to the center of the cluster. Once you have assigned all data-points to a cluster, you recompute the central points, and re-assign all of the data-points to their new clusters. You continue this process until the clusters no longer change.

For this assignment, you will read in a set of 2-dimensional Cartesian coordinates, and generate clusters based on that data.

Details

Create a program in a file called k_means_clustering.py in Emacs. This program should read a set of Cartesian coordinates from the user, and print the lists of the clusters.

Your input will be of the following format:

number of points
x-coord,y-coord
x-coord,y-coord
x-coord,y-coord

Where the number of points, and the x- and y-coordinates are all integers. Note that the x- and y-coordinates are separated by a single comma.

Your output should be of the following format:

Cluster 0: [(x, y), (x, y), (x, y), ...]
Cluster 1: [(x, y)]
.
.
.
Cluster K - 1: ...

Example

$ cat test_1.in
17
100,59
85,90
95,58
86,62
80,79
96,74
75,47
97,72
67,77
99,100
100,100
55,55
100,56
55,65
80,69
57,54
95,91


$ python3 k_means_clustering.py < test_2.in
Cluster 0:  [(75, 47), (57, 54)]
Cluster 1:  [(67, 77), (55, 55), (55, 65)]
Cluster 2:  [(100, 59), (95, 58), (86, 62), (96, 74), (97, 72), (100, 56), (80, 69)]
Cluster 3:  [(85, 90), (80, 79), (99, 100), (100, 100), (95, 91)]
$

The above example is sample grade data using 4 clusters. Keep in mind that your output might be slighly different than mine because of how the initial centroids are randomly computed. What do you suspect the clusters represent?

The file input_1.in contains sample data that you can use for your clustering application. This file represents a persons age to the amount of money in their bank account. Try this file using 4 clusters. What do you suspect the clusters represent? The data is correlated with the data in input_2. Any ideas what the second column in this file represents?

Hint

Define two constants at the top of your file: K and EPSILON. K Will represent the number of clusters your program will generate. EPSILON will represent maximum amount of change necessary to re-run the clustering algorithm: If the amount the center of the clusters change after a particular iterating of the clustering algorithm is less than EPSILON, your program can stop.
Create a function called read_initial_data(). This function will read the input from the user, and returns a list of tuples, the Cartesian coordinates. Recall that you can use the int function to convert a string to an integer, and you can use the split method of strings to get a list of the tokens of the string split on a given character.
Remember, you need to import the random module in order to "randomly" generate values.
You should create a list of size k to represent the clusters. Each element in this list will be a list of tuples. So this is a 2-dimensional, jagged list.
Create a function called euclidean_distance(p1, p2), which takes as parameters two tuples representing Cartesian coordinates. This function should return a floating point value, the euclidean distance between the two parameter points.

You will use this function not only to determine which cluster a particular point belongs, but also to determine how much the cluster centers actually change.
Create a function called compute_centroid(points). This function will take as a parameter a list of tuples representing the points in a cluster. This function will return a tuple, the central point of the cluster.

You can compute the central point by computing the mean of each dimension of your Cartesian coordinate.

Challenge

We have 2-dimensional Cartesian coordinates, and you also know how you can plot those points in a graphical window. Wouldn't it make more sense to show the clusters graphically, than by just printing them out?

Add a function to your program that graphically displays the clusters. You should plot the data-points in one color, then draw a circle (in a different color) around the clusters. Make sure all of the data-points exist in the circle for a given cluster!

Submission

When you have finished, create a tar file of your lab1 directory. To create a tar file, execute the following commands:

cd ~/cs170/labs
tar czvf lab1.tgz lab1/

To submit your activity, go to cseval.roanoke.edu. You should see an available assignment called Lab Assignment 1. Make sure you include a header listing the authors of the file.

Last modified: Mon Jan 13 20:06:37 EST 2014