Solution for this lab's assignment has been posted here.
Let's get started by setting up your directory structure for CPSC170. You will need to create a couple of directories:
cs170/ cs170/assignments cs170/labs cs170/exams
Remember, you can use the mkdir
command to make
directories from the command line. And you should be using the
command line, as opposed to the GUI.
All materials you write for a given class will be stored in your labs directory. For those of you from CPSC120B, this will be a slight change for you. For those of you from CSPC120A, this is likely what you did all semester.
While we are discussing it, you might as well create a directory for today's lab:$ cd ~/cs170/labs $ mkdir lab1 $ cd lab1
I have learned (from experience) that programming skills atrophy over the short Winter break. So, we are going to spend today simply as a review day. Nothing new is being taught today, we are just trying to knock the rust off of your skills.
We are constantly being bombarded with data. It ranges from your grades, credit scores, blood pressure, etc. Making sense of data in an automated fashion is known as data mining. Given enough data, data mining tries to find a pattern to deduce meaningful results.
One of the more simple forms of data mining is known as cluster analysis. This attempts to divide the data into groups known as clusters, where each group shares some property. One of the easiest to understand clustering algorithms is known as K-means, and it happens to be one of the oldest clustering techniques.
The algorithm for K-means is simple. First, choose some number k. This number will represent the number of clusters you will use. Next, you randomly choose k of your data-points to represent the initial centers of the clusters.
You will then assign each data-point to a cluster, based on its distance to the center of the cluster. Once you have assigned all data-points to a cluster, you recompute the central points, and re-assign all of the data-points to their new clusters. You continue this process until the clusters no longer change.
For this assignment, you will read in a set of 2-dimensional Cartesian coordinates, and generate clusters based on that data.
Create a program in a file called k_means_clustering.py in Emacs. This program should read a set of Cartesian coordinates from the user, and print the lists of the clusters.
Your input will be of the following format:
number of points x-coord,y-coord x-coord,y-coord x-coord,y-coord
Where the number of points, and the x- and y-coordinates are all integers. Note that the x- and y-coordinates are separated by a single comma.
Your output should be of the following format:
Cluster 0: [(x, y), (x, y), (x, y), ...] Cluster 1: [(x, y)] . . . Cluster K - 1: ...
$ cat test_1.in 17 100,59 85,90 95,58 86,62 80,79 96,74 75,47 97,72 67,77 99,100 100,100 55,55 100,56 55,65 80,69 57,54 95,91 $ python3 k_means_clustering.py < test_2.in Cluster 0: [(75, 47), (57, 54)] Cluster 1: [(67, 77), (55, 55), (55, 65)] Cluster 2: [(100, 59), (95, 58), (86, 62), (96, 74), (97, 72), (100, 56), (80, 69)] Cluster 3: [(85, 90), (80, 79), (99, 100), (100, 100), (95, 91)] $
The above example is sample grade data using 4 clusters. Keep in mind that your output might be slighly different than mine because of how the initial centroids are randomly computed. What do you suspect the clusters represent?
The file input_1.in contains sample data that you can use for your clustering application. This file represents a persons age to the amount of money in their bank account. Try this file using 4 clusters. What do you suspect the clusters represent? The data is correlated with the data in input_2. Any ideas what the second column in this file represents?
Define two constants at the top of your file: K
and
EPSILON
. K
Will represent the number
of clusters your program will generate. EPSILON
will represent maximum amount of change necessary to re-run the
clustering algorithm: If the amount the center of the clusters
change after a particular iterating of the clustering algorithm
is less than EPSILON
, your program can stop.
Create a function called read_initial_data()
. This
function will read the input from the user, and returns a list of
tuples, the Cartesian coordinates. Recall that you can use the
int
function to convert a string to an
integer, and you can use the
split
method of strings to get a
list of the tokens of the string split on a given character.
Remember, you need to import the random
module in order to "randomly" generate values.
You should create a list of size k to represent the clusters. Each element in this list will be a list of tuples. So this is a 2-dimensional, jagged list.
Create a function called euclidean_distance(p1,
p2)
, which takes as parameters two tuples representing
Cartesian coordinates. This function should return a floating
point value, the euclidean distance between the two parameter points.
You will use this function not only to determine which cluster a particular point belongs, but also to determine how much the cluster centers actually change.
Create a function called compute_centroid(points)
.
This function will take as a parameter a list of tuples
representing the points in a cluster. This function will return
a tuple, the central point of the cluster.
You can compute the central point by computing the mean of each dimension of your Cartesian coordinate.
We have 2-dimensional Cartesian coordinates, and you also know how you can plot those points in a graphical window.
Add a function to your program that graphically displays the clusters. You should plot the data-points in one color, then draw a circle (in a different color) around the clusters. Make sure all of the data-points exist in the circle for a given cluster!
When you have finished, create a tar file of your lab1
directory. To create a tar file, execute the following commands:
cd ~/cs170/labs tar czvf lab1.tgz lab1/
To submit your activity, go to cseval.roanoke.edu. You should
see an available assignment called Lab Assignment 1
.
Make sure you include a header listing the authors of the file.