Capstone Project Blog Post

IBM Data Science Capstone Project

Introduction

A.1. Description & Discussion of the Background

Mexico City is one of the largest metropolises in the world, with a metropolitan-area population of more than 20 million. Having lived there in the past, I decided to base my capstone project on the city. (In my current home of Oaxaca, there is not enough data on the Foursquare app to allow for an interesting analysis.)

For this project, I will answer the question, "In a city of your choice, if someone is looking to open a restaurant, where would you recommend that they open it?"

In addition to being one of the biggest cities in the world, Mexico City also has an enormous number of restaurants. By analyzing the results from the Foursquare app and comparing them with data about the city’s boroughs (scraped from Wikipedia), I will try to find the optimum location for a new restaurant.

The analysis is aimed at investors who want to open a restaurant in Mexico City and need to choose a location that gives them the highest chance of returning a profit on the investment.

To solve this problem, I will create a map of the concentrations of different restaurants across the city and try to find the optimum location.

A.2. Data Description

This project requires the following data:

Location data for Mexico City. I will pull the coordinates from Geolocator and use them to orient the first map in Folium.
The names of the boroughs that make up the city. These will be scraped from Wikipedia.
Location data for each borough. Once I have all the names of the boroughs, their coordinates will be pulled from Geolocator.
A list of results from the Foursquare app. The term “restaurants” will be searched in each borough, and the results will then be mapped onto Mexico City.

For example, if I search “restaurants” in the center of Mexico City, I will get a list of up to 50 of the most popular places close to that location. Each result includes the name, restaurant type, and exact location of the restaurant. Foursquare provides this information as JSON, which I will convert to a pandas data frame.
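The JSON-to-data-frame step could look roughly like the sketch below. The sample payload is hypothetical, and its field names ("response", "venues", "categories", "location") are assumptions about the response shape rather than a confirmed schema; the helper name venues_to_frame is also made up for illustration.

```python
import pandas as pd

# Hypothetical payload; the nesting and field names are assumptions.
sample_response = {
    "response": {
        "venues": [
            {
                "name": "Example Taqueria",
                "categories": [{"name": "Mexican Restaurant"}],
                "location": {"lat": 19.43, "lng": -99.13},
            },
            {
                "name": "Example Bistro",
                "categories": [],  # some venues may carry no category
                "location": {"lat": 19.44, "lng": -99.15},
            },
        ]
    }
}


def venues_to_frame(payload):
    """Flatten the venue list into one row per venue."""
    rows = []
    for venue in payload["response"]["venues"]:
        categories = venue.get("categories", [])
        rows.append({
            "name": venue["name"],
            "category": categories[0]["name"] if categories else None,
            "lat": venue["location"]["lat"],
            "lng": venue["location"]["lng"],
        })
    return pd.DataFrame(rows, columns=["name", "category", "lat", "lng"])


venues = venues_to_frame(sample_response)
print(venues)
```

Keeping only the first category per venue is a simplification; a venue with several categories would need a different layout.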

Finally, I will apply KMeans clustering to the restaurant locations to find similar areas to invest in.

One problem I have come across is that Foursquare’s API limits each request to a maximum of 50 results. As such, I will make multiple requests from different locations across the city and combine the results in a single data frame.
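Combining several requests might be sketched as follows. The two small frames stand in for the results of two overlapping requests and contain made-up venues; treating name plus coordinates as the identity of a venue is an assumption.

```python
import pandas as pd

# Hypothetical results from two requests whose search areas overlap.
batch_a = pd.DataFrame({
    "name": ["Venue 1", "Venue 2"],
    "lat": [19.40, 19.41],
    "lng": [-99.10, -99.11],
})
batch_b = pd.DataFrame({
    "name": ["Venue 2", "Venue 3"],
    "lat": [19.41, 19.42],
    "lng": [-99.11, -99.12],
})

# Stack the batches, then drop venues that appear in more than one request.
combined = pd.concat([batch_a, batch_b], ignore_index=True)
unique_venues = (
    combined.drop_duplicates(subset=["name", "lat", "lng"])
    .reset_index(drop=True)
)
print(unique_venues)
```

If the API returns a stable venue identifier, deduplicating on that field would be more robust than matching on name and coordinates.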

B. Methodology

My data relating to Mexico City’s boroughs was scraped from Wikipedia and contains the names, populations and population densities of each borough. The names were then passed to Geolocator to extract the exact latitude and longitude.

The coordinates of each of these boroughs were then put into a Foursquare API request with the search term “restaurant”. As Foursquare only allows a maximum of 50 results per request, I tested different search radii to see which gave me the most comprehensive set of results without too much overlap. I found the optimum to be 5000m. Although this may seem large, it was just enough to make sure all areas of the city were covered, as several boroughs are themselves very large.
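A request with these settings could be assembled along the lines below. This is a sketch only: it assumes the legacy v2 venue-search endpoint and its parameter names (client_id, client_secret, v, ll, query, radius, limit), the credentials and version date are placeholders, and no request is actually sent.

```python
from urllib.parse import urlencode

# Assumed endpoint; the API version used in the project is not stated.
SEARCH_URL = "https://api.foursquare.com/v2/venues/search"


def build_search_url(lat, lng, client_id, client_secret,
                     radius_m=5000, limit=50, query="restaurant",
                     version="20180605"):
    """Build the search URL for one borough's coordinates."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,              # placeholder version date
        "ll": f"{lat},{lng}",      # centre of the search
        "query": query,
        "radius": radius_m,        # metres, 5000 as described in the text
        "limit": limit,            # at most 50 results per request
    }
    return f"{SEARCH_URL}?{urlencode(params)}"


url = build_search_url(19.4326, -99.1332, "YOUR_ID", "YOUR_SECRET")
print(url)
```

Calling this once per borough gives one URL per search centre, which can then be fetched with any HTTP client.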

By exporting all results to a single dataframe, cleaning the results, and removing duplicates, I was left with 328 results, which mapped onto the city as shown below:

In addition to the coordinates, Foursquare provides the following information about the venues:

I explored the result set by making a pivot table to count the number of different types of restaurants. Unfortunately, Foursquare data in Mexico City is limited, and almost all the restaurants are in the categories “Mexican Restaurants” or “Restaurant”, which doesn’t tell you much about the style of food available. This means I was unable to make any significant inference as to what type of restaurant I’d recommend putting in any location.
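A count of restaurant types of this kind can be sketched with a pandas pivot table. The venues and the column names below are illustrative and do not come from the project data.

```python
import pandas as pd

# Illustrative venues; names and categories are made up.
venues = pd.DataFrame({
    "name": ["Venue 1", "Venue 2", "Venue 3", "Venue 4"],
    "category": ["Mexican Restaurant", "Restaurant",
                 "Mexican Restaurant", "Taco Place"],
})

# One row per category with the number of venues in it, largest first.
category_counts = (
    venues.pivot_table(index="category", values="name", aggfunc="count")
    .rename(columns={"name": "count"})
    .sort_values("count", ascending=False)
)
print(category_counts)
```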

To explore the pattern of where restaurants are located, I clustered the restaurants with K-means, one of the most common unsupervised learning methods for clustering.
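Clustering coordinates with K-means might look like the sketch below, assuming scikit-learn's KMeans. The points are synthetic, the helper name cluster_coordinates is hypothetical, and treating latitude and longitude as plain Euclidean coordinates is a simplification that is only reasonable over a city-sized area.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic points scattered around two made-up centres.
rng = np.random.default_rng(0)
centres = np.array([[19.30, -99.20], [19.45, -99.10]])
coords = np.vstack(
    [c + rng.normal(scale=0.005, size=(20, 2)) for c in centres]
)


def cluster_coordinates(points, k):
    """Fit K-means and return labels, centroids and inertia."""
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = model.fit_predict(points)
    return labels, model.cluster_centers_, model.inertia_


labels, centroids, inertia = cluster_coordinates(coords, k=2)

# Inertia for a range of k: the quantity an elbow-style plot inspects.
inertia_by_k = {k: cluster_coordinates(coords, k)[2] for k in range(1, 5)}
print(centroids)
print(inertia_by_k)
```

The inertia drops sharply up to the natural number of groups and flattens afterwards, which is the "elbow" that tools for choosing k look for.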

First, I used the KElbowVisualizer tool to find the optimum number of clusters with which to separate the restaurants. In the first iteration, it found that 5 clusters was optimum. However, when I applied the method and mapped the results in Folium, I noticed that Foursquare’s API had returned 7 results that were in completely different cities. I therefore removed these from the dataset and re-ran the elbow method. The new optimum was k=4, and the results are shown below.
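One way to drop results that fall outside the city is to filter on distance from the city centre, as sketched below. The text does not say how the stray results were removed, so this is an assumed approach; the centre coordinates are approximate, the 50 km threshold is a hypothetical choice, and the venues are made up.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance between two points in kilometres."""
    earth_radius_km = 6371.0
    dlat = radians(lat2 - lat1)
    dlng = radians(lng2 - lng1)
    a = (sin(dlat / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlng / 2) ** 2)
    return 2 * earth_radius_km * asin(sqrt(a))


CITY_CENTRE = (19.4326, -99.1332)  # approximate centre of Mexico City
MAX_DISTANCE_KM = 50               # hypothetical cut-off

# Made-up venues: (name, latitude, longitude).
venues = [
    ("Venue in the city", 19.40, -99.15),
    ("Venue in another city", 20.67, -103.35),
]

kept = [
    v for v in venues
    if haversine_km(CITY_CENTRE[0], CITY_CENTRE[1], v[1], v[2])
    <= MAX_DISTANCE_KM
]
print(kept)
```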

Applying this value of k=4 to the KMeans algorithm, and then assigning a color value to each restaurant, generates the following map in Folium.

Once the restaurants were clustered, I explored each cluster to see if any had a significant outlier in terms of restaurant types. Unfortunately, due to the data limitation mentioned above, no single area stood out. So, I concluded that the best place to put a restaurant would be wherever there was already the greatest concentration of restaurants. While this may seem counter-intuitive, I took the existence of a Foursquare listing as a sign that a restaurant has been reasonably successful, which suggests the location is suitable for that kind of business.

The cluster with the most restaurants has a centroid at (19.44016526, -99.1532429), so I wrote a function to find the nearest borough to the centroid.
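A nearest-borough lookup of this kind could be sketched as below. The centroid is the one quoted above, but the borough names and coordinates are placeholders, not the real boroughs, and the function names are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance between two points in kilometres."""
    earth_radius_km = 6371.0
    dlat = radians(lat2 - lat1)
    dlng = radians(lng2 - lng1)
    a = (sin(dlat / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlng / 2) ** 2)
    return 2 * earth_radius_km * asin(sqrt(a))


def nearest_borough(point, boroughs):
    """Return the name of the borough closest to the given point."""
    return min(
        boroughs,
        key=lambda name: haversine_km(*point, *boroughs[name]),
    )


centroid = (19.44016526, -99.1532429)  # centroid quoted in the text

# Placeholder boroughs with illustrative coordinates.
borough_coords = {
    "Borough A": (19.30, -99.20),
    "Borough B": (19.45, -99.14),
    "Borough C": (19.50, -99.05),
}

nearest = nearest_borough(centroid, borough_coords)
print(nearest)
```

This compares the centroid with a single reference point per borough, so for large or irregularly shaped boroughs the nearest point is not necessarily the borough that contains the centroid.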

C. Results

After processing, the borough nearest the centroid of the cluster with the most restaurants was Venustiano Carranza.

D. Discussion

Due to the lack of restaurant data on Foursquare, and the limitations on data pulls from the API, this can only be considered a provisional result. In addition, data on average rental prices would make a valuable addition to this analysis. A better approach may be to use Google Maps’ API, but that was beyond the scope of this exercise.

E. Conclusion

From this analysis, the best borough to start a new restaurant in Mexico City is Venustiano Carranza. This is a highly populated and prosperous borough in the centre of the capital, so it would be highly appealing to stakeholders and is likely to represent a long-term success.

Gracias!!!
