Click here for the Jupyter notebook. Note: I adapted it from a notebook created by Alex Aklson and Polong Lin for the IBM Applied Data Science Capstone Project course on Coursera.
The detailed report can be found in my GitHub repo here.
Johannesburg, informally known as Jozi, Joburg, or “The City of Gold”, is the largest city in South Africa and one of the 50 largest urban areas in the world¹. It is the provincial capital and largest city of Gauteng, which is the wealthiest province in South Africa. Johannesburg is the seat of the Constitutional Court, the highest court in South Africa. The city is located in the mineral-rich Witwatersrand range of hills and is the centre of large-scale gold and diamond trade. It was one of the host cities of the official tournament of the 2010 FIFA World Cup.
In this project, I used k-means clustering to analyze different kinds of venues and to look for hidden patterns in the most visited venues in each suburb of the City of Johannesburg municipality.
1.1 Business Problem
Suppose a contractor wants to open a restaurant within the Johannesburg municipality. How can we use current machine learning techniques to identify suitable locations?
To begin answering this question, it is reasonable to ask what makes a location good for a restaurant. These are some of the key features to consider when opening a new restaurant:
- Visibility: Urban areas tend to have high car and foot traffic, so locating a restaurant in or near a town would be a good choice. However, some places that offer high visibility also have high crime rates. Such areas are not well suited to family-style restaurants.
- Parking: Places near parking lots would be another good choice. Ideally, the restaurant would have its own parking lot.
- Accessibility: It could be beneficial to build the restaurant along a road with a relatively low speed limit and high car traffic, for example near freeway/highway exits. As for foot traffic, a location near urbanized areas would be ideal. Inside shopping malls, the ideal place for a restaurant would be within or near the food court.
There are many other factors to consider, such as the average income and the population of the area of interest. However, the goal of this project is to find out how urbanized an area is by identifying the most popular venues within that area (within 500 meters, to be exact) and to seek out hidden patterns that may reveal additional information about a location.
1.2 The Data Set
For this project, the location data that I used was from https://adi45.carto.com/tables/metropolitan_suburbs_region/public/map
After downloading the GeoJSON file from this website and loading it in a Jupyter notebook, I found that the data set contains the information we need: the name of the province, suburb (also known as a neighborhood in Commonwealth countries), main place, local municipality (also known as a borough in some English-speaking countries), and the latitude and longitude coordinates of each location. There is also other information, such as the population of black people, colored people (a term referring to people of mixed race in South Africa) and white people. This information may be relevant for picking out locations with the highest average income or with high foot traffic, but it can be misleading because a high population density doesn’t necessarily mean more customers. So I did feature selection and dropped the population data. The table below shows the relevant features after feature selection.
With this location data, I then used the Foursquare API. Foursquare is ‘a local search-and-discovery mobile app developed by Foursquare Labs Inc. The app provides personalized recommendations of places to go near a user’s current location based on users’ previous browsing history and check-in history.’ So, basically, this app can be used for location detection. As explained on the Foursquare Wikipedia page, ‘When users opt in to always-on location sharing, Pilgrim determines a user’s current location by comparing historical check-in data with the user’s current GPS signal, cell tower triangulation, cellular signal strength and surrounding WiFi signals.’
So Foursquare uses one’s location information and visit frequency to “learn” what the user likes, which aims to improve user-facing recommendations and gauge the popularity of a venue.
With the location data, I made calls to the Foursquare API to find the most common venues per suburb in Johannesburg, constructing a URL that sends a request to the API to search for specific types of venues and to explore a geographical location. Before this, I used the visualization library Folium to visualize the suburbs in Johannesburg and to see how large the data set was. The City of Johannesburg municipality has 659 suburbs, as shown on the map below (each blue point is a suburb within the local municipality).
Since I was analyzing the most common venues within 500 meters of each suburb, I decided to limit the number of venues to just 100.
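To make the request construction concrete, here is a minimal sketch of how such a URL could be assembled. It assumes the legacy Foursquare v2 "venues/explore" endpoint; the credentials, the version date and the example coordinates are placeholders of my own, and the current Foursquare API may differ.

```python
from urllib.parse import urlencode

# Placeholder credentials (hypothetical): substitute your own Foursquare keys.
CLIENT_ID = "YOUR_CLIENT_ID"
CLIENT_SECRET = "YOUR_CLIENT_SECRET"
API_VERSION = "20180605"  # assumed version date in YYYYMMDD format


def build_explore_url(lat, lng, radius=500, limit=100):
    """Build an 'explore' request URL for the venues around one suburb."""
    params = {
        "client_id": CLIENT_ID,
        "client_secret": CLIENT_SECRET,
        "v": API_VERSION,
        "ll": f"{lat},{lng}",  # latitude,longitude of the suburb
        "radius": radius,      # search radius in meters
        "limit": limit,        # maximum number of venues to return
    }
    # safe="," keeps the comma in "ll" readable instead of percent-encoding it.
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params, safe=",")


# Illustrative coordinates roughly in central Johannesburg (an assumption).
url = build_explore_url(-26.2041, 28.0473)
print(url)
```

The URL would then be sent with an HTTP client once per suburb, and the venue categories in each response collected into a table.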
Since there are no labels in the data set for this particular problem, unsupervised learning is best suited to solving it. In particular, clustering algorithms such as k-means clustering and DBSCAN are good candidates for dealing with location data. The k-means clustering algorithm creates clusters automatically and takes the mean values of the instances to determine a cluster center. In general, the k-means clustering algorithm requires us to choose the number of clusters k in advance; let’s say that we choose k=3. It proceeds by selecting k initial cluster centers and then iteratively refining them as follows:
1. Each instance Di is assigned to its closest cluster. (Cluster Assignment)
2. Each cluster center Cj is updated to be the mean of its constituent instances.²
The k-means clustering algorithm stops updating once the cluster centers (three in this example) stop moving.
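The two steps and the stopping rule can be sketched in plain Python as follows. This is an illustrative toy version, not the code from the notebook: the function names and the sample points are my own, and the initial centers are picked by hand so that the run is reproducible.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, initial_centers, max_iter=100):
    """Refine the given initial centers until they stop moving."""
    centers = [tuple(c) for c in initial_centers]
    k = len(centers)
    labels = []
    for _ in range(max_iter):
        # Step 1 (cluster assignment): each point goes to its closest center.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Step 2 (update): each center becomes the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centers.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centers.append(centers[j])  # leave an empty cluster's center in place
        if new_centers == centers:  # centers stopped moving
            break
        centers = new_centers
    return centers, labels


# Toy data: three small, well-separated groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]
centers, labels = kmeans(points, initial_centers=[(0, 0), (10, 10), (20, 0)])
print(labels)  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

In practice a library implementation such as scikit-learn's KMeans would normally be used, since it also handles choosing the initial centers.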
For this project, data pre-processing was necessary because k-means clustering is sensitive to the features used. In fact, without pre-processing, I found that the optimal k values occurred for k>300, but with pre-processing the optimal values were k=2, k=5 and k=7. Data cleaning was also necessary: I deleted the suburbs that lacked the required location information. As mentioned before, feature selection was done to keep relevant features and drop redundant ones. Feature scaling, in particular normalization, was done as well so that good clusters were generated and redundant data was ignored, which improves the effectiveness of the clustering algorithm.
Before running the k-means clustering algorithm, we need to choose the number of cluster centers, either by inspecting the data or by using what is called the elbow method.
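As an illustration of the scaling step, the sketch below applies min-max normalization, which rescales every feature column to the range 0 to 1. Treating "normalization" as min-max scaling is my assumption, and the feature values are made up for the example.

```python
def min_max_scale(rows):
    """Rescale each column of a list of rows to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [
            (value - lo) / (hi - lo) if hi > lo else 0.0  # constant column -> 0.0
            for value, lo, hi in zip(row, lows, highs)
        ]
        for row in rows
    ]


# Made-up features on very different scales.
features = [[2, 100], [4, 300], [6, 500]]
scaled = min_max_scale(features)
print(scaled)  # [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
```

Without a step like this, a feature measured in large units would dominate the distance calculations that k-means relies on.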
Since there’s no clear elbow, we can turn to comparing the silhouette scores for every value of k. The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from -1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters. Also note that the range of k in the graph below is from 2 to 50, because at least 2 clusters are needed to compute the silhouette score.
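To show what the silhouette score measures, here is a small plain-Python sketch of the standard definition: for each point, a is the mean distance to the other points in its own cluster, b is the lowest mean distance to the points of any other cluster, and the point's silhouette is (b - a) / max(a, b). The toy points and labelings are my own and are not taken from the notebook.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette value over all points (requires at least 2 clusters)."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the silhouette score needs at least 2 clusters")

    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for single-point clusters
            continue
        # a: mean distance to the other members (the distance to itself is 0).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = silhouette_score(points, [0, 0, 1, 1])  # nearby points grouped together
bad = silhouette_score(points, [0, 1, 0, 1])   # distant points grouped together
print(good, bad)
```

The sensible grouping scores close to +1 and the mixed-up grouping scores below 0. Computing this score for the k-means result at each candidate k gives the kind of comparison described above; scikit-learn provides the same measure as sklearn.metrics.silhouette_score.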
As we can see, the optimal values for k are k=2, k=6, k=11 and k=14. However, choosing a low value for k such as k=2 would lead to a loss of information (under-segmenting). For instance, cluster one may have pubs, parks, hotels and shopping malls dominating the cluster equally, while cluster two may have restaurants, stadiums and parks dominating the cluster equally. The goal is to seek hidden patterns within clusters, in other words to have clusters that reveal clear patterns. On the other hand, choosing k=11 or k=14 could lead to a loss of interpretability (over-segmenting), and we can also see that the silhouette scores for k=11 and k=14 are far worse than that for k=2. Hence k=6 is a better choice.
After running the k-means algorithm with k=6, I obtained the map of clusters shown below.
Table 1 below lists the most frequently appearing venues and main places for each cluster; the numbers in the brackets are the modes of the respective venue and main place. The Jupyter notebook contains a list of all results. However, the most significant results are in the ‘1st Most Common Venue’ and ‘2nd Most Common Venue’ columns. Main places contain the suburbs that we may want to look at. In other words, main places are a compact way of grouping suburbs. Note also that Johannesburg is both a city and a main place, and that the City of Johannesburg municipality is sometimes referred to simply as Johannesburg.
As we can see, cluster 0 is dominated by fast food restaurants. This seems plausible because there are a lot of such restaurants in the Johannesburg main place. This cluster suggests that there is a lot of foot traffic in the Johannesburg locations belonging to it. Although there is competition, this cluster still offers good locations for opening a restaurant. As an example, food courts in shopping malls have competing restaurants next to each other, but thanks to the foot traffic the restaurants are able to survive. In fact, since people dislike long queues, there is a good chance that they will go to the less busy restaurant next door once the well-known one becomes too full or its queues become too long.
Cluster 1 is dominated by grocery store and yoga studio venues. Also, the third most common venue is a shopping mall. This would be a good cluster for someone who wants to open a restaurant within a shopping mall in the City of Johannesburg municipality. Yoga studios generally do not accommodate as many people as grocery stores and shopping malls, but since yoga classes in South Africa are somewhat expensive, the presence of yoga studios suggests that those locations have high incomes.
Cluster 2 is largely dominated by construction and landscaping venues. This seems plausible since there are still some underdeveloped areas called “townships” in parts of Johannesburg. Some of these tend to be close to wealthy suburbs; one example is the township of Alexandra, which is very close to the Sandton main place suburbs. This possibly explains why the third most common venue in this cluster was a yoga studio.
Cluster 3 is dominated by gas stations. This is, in general, not the best cluster for opening a restaurant, although it suggests a lot of car traffic. In locations such as Soweto, most people rely on public transport: minibuses, which tend to carry 15 people, and buses, which can carry 60+ people. Still, because of the high car traffic in this cluster, it may be great for advertising the restaurant on billboards. If a restaurant were built here, it could be placed closer to the yoga studios than to the gas stations.
Clusters 4 and 5 are quite similar to each other. Unlike in clusters 1 and 3, the most common venues are restaurants, not grocery stores or gas stations. Again, having a restaurant next to competitors is not necessarily a bad idea.
Thus clusters 4 and 5 still offer good locations for opening a restaurant. However, in cluster 4 people like going to restaurants in general, while in cluster 5 people like African restaurants.
Clusters 0, 4 and 5 offer great locations for opening a restaurant. On the map, these are the suburbs highlighted in red, green and orange respectively. These locations would then need to be checked for the visibility and accessibility the restaurant would have once built. Furthermore, since locations near highway exits can be good for a new restaurant, such locations can now be found on the map.
Also, although there is no cluster clearly dominated by football stadiums, the maximum capacity of the stadiums and the fact that they can also host concerts suggest that it is a good idea to have a restaurant either advertised or built not far from a major stadium, because of the foot and car traffic that may occur on weekdays and special occasions.
It would be interesting to see how many and what kind of clusters the DBSCAN algorithm would produce, given the right values of the hyper-parameters. The DBSCAN algorithm is also well suited to location data, and it does not require us to guess the number of cluster centers k beforehand. This project may also be adapted to solve other types of business problems, such as where in Johannesburg to open an office. However, I believe that a different data set may have to be used to determine a suitable office location. Specifically, using the speed profiles, traffic density and statistics data sets for the City of Johannesburg municipality may help solve this business problem.
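For readers curious about how DBSCAN differs, below is a compact plain-Python sketch of the algorithm on toy 2-D points. It is illustrative only and was not run on the Johannesburg data; the parameter names eps (neighborhood radius) and min_samples (points needed to form a dense region, counting the point itself) follow the convention used by scikit-learn's DBSCAN, and the values chosen here are arbitrary.

```python
import math

NOISE = -1  # label for points that belong to no cluster


def dbscan(points, eps, min_samples):
    """Label each point with a cluster index, or NOISE for outliers."""
    labels = [None] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already processed
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_samples:  # core point: keep expanding
                queue.extend(j_neighbors)
        cluster += 1
    return labels


points = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (50, 50)]
dbscan_labels = dbscan(points, eps=1.5, min_samples=3)
print(dbscan_labels)  # [0, 0, 0, 1, 1, 1, -1]
```

Note that the number of clusters (two here) is found by the algorithm rather than supplied, and the isolated point is marked as noise instead of being forced into a cluster.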
Please share this post if you liked it. Thank you. :)
² Wagstaff, K., Cardie, C., Rogers, S. and Schrödl, S., 2001. Constrained k-means clustering with background knowledge. In ICML (Vol. 1, pp. 577–584).