< Back

Lecture 0- CPSC170 Setup and CPSC120 Review


Let's get started by setting up your directory structure for CPSC170. You will need to create a couple of directories:

cs170/
cs170/assignments
cs170/labs
cs170/exams

Remember, you can use the mkdir command to make directories from the command line. And you should be using the command line, as opposed to the GUI.

All materials you write for a given class will be stored in your labs directory.

While we are discussing it, you might as well create a directory for today's lab:

$ cd ~/cs170/labs
$ mkdir lab0
$ cd lab0

CPSC120 Review

I have learned (from experience) that programming skills atrophy over the short Winter break. So, we are going to spend today simply as a review day. Nothing new is being taught today, we are just trying to knock the rust off of your skills.


Lab Assignment 1
K-Means Clustering

We are constantly being bombarded with data. It ranges from your grades, credit scores, blood pressure, etc. Making sense of data in an automated fashion is known as data mining. Given enough data, data mining tries to find a pattern to deduce meaningful results.

One of the more simple forms of data mining is known as cluster analysis. This attempts to divide the data into groups known as clusters, where each group shares some property. One of the easiest to understand clustering algorithms is known as K-means, and it happens to be one of the oldest clustering techniques.

The algorithm for K-means is simple. First, choose some number k. This number will represent the number of clusters you will use. Next, you randomly choose k of your data-points to represent the initial centers of the clusters.

You will then assign each data-point to a cluster, based on its distance to the center of the cluster. Once you have assigned all data-points to a cluster, you recompute the central points, and re-assign all of the data-points to their new clusters. You continue this process until the clusters no longer change.

For this assignment, you will read in a set of 2-dimensional Cartesian coordinates, and generate clusters based on that data.

Details

Create a program in a file called k_means_clustering.py in Emacs. This program should read a set of Cartesian coordinates from the user, and print the lists of the clusters.

Your input will be of the following format:

number of points
x-coord y-coord
x-coord y-coord
x-coord y-coord

Where the number of points, and the x- and y-coordinates are all integers. Note that the x- and y-coordinates are separated by a single space.

Your output should be of the following format:

Cluster 0: [(x, y), (x, y), (x, y), ...]
Cluster 1: [(x, y)]
.
.
.
Cluster K - 1: ...

Example

$ cat test_1.in
17
100 59
85 90
95 58
86 62
80 79
96 74
75 47
97 72
67 77
99 100
100 100
55 55
100 56
55 65
80 69
57 54
95 91


$ python3 k_means_clustering.py < test_2.in
Cluster 0:  [(75, 47), (57, 54)]
Cluster 1:  [(67, 77), (55, 55), (55, 65)]
Cluster 2:  [(100, 59), (95, 58), (86, 62), (96, 74), (97, 72), (100, 56), (80, 69)]
Cluster 3:  [(85, 90), (80, 79), (99, 100), (100, 100), (95, 91)]
$

The above example is sample grade data using 4 clusters. Keep in mind that your output might be slighly different than mine because of how the initial centroids are randomly computed. What do you suspect the clusters represent?

The file input_1.in contains sample data that you can use for your clustering application. This file represents a persons age to the amount of money in their bank account. Try this file using 4 clusters. What do you suspect the clusters represent? The data is correlated with the data in input_2. Any ideas what the second column in this file represents?

 

Challenge

We have 2-dimensional Cartesian coordinates, and you also know how you can plot those points in a graphical window. Wouldn't it make more sense to show the clusters graphically, than by just printing them out?

Add a function to your program that graphically displays the clusters. You should plot the data-points in one color, then draw a circle (in a different color) around the clusters. Make sure all of the data-points exist in the circle for a given cluster!


Submission

When you have finished, create a tar file of your lab0 directory. To create a tar file, execute the following commands:

cd ~/cs170/labs
tar czvf lab0.tgz lab0/

To submit your activity, go to inquire.roanoke.edu. You should see an available assignment called Lab Assignment 0. Make sure you include a header listing the authors of the file.