By clustering neighborhoods based on popular venues surrounding them
ml
sklearn
Published
September 10, 2021
I am a foodie looking for a place to stay in Bahrain. I want to study certain areas in Bahrain and the kinds of restaurants that surround them.
I think that a lot of people, not just the youth, could benefit from this, since the issue isn’t just finding a decent place to stay in Bahrain, but finding one that best serves their culinary interests.
Of course, there are far more important factors to look at besides food. For this problem, however, I want to stick to what I can get from Foursquare with a free license. By neatly categorizing areas based on attributes such as the frequency of coffee shops, closeness to malls, etc., I can make a better guesstimate of where someone might want to stay.
Foursquare lets us grab information on the venues surrounding a given location, so we will look at the most frequent kinds of venues around each area and cluster the areas on that basis.
So let’s get started!
Scrape Bahrain City/Town Data from Wikipedia
I need to scrape data from Wikipedia to look up towns and cities in Bahrain. We’re going to use the popular web-scraping library Beautiful Soup to do that.
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/Category:Populated_places_in_Bahrain'
html_doc = requests.get(url).text  # Get HTML doc
soup = BeautifulSoup(html_doc, 'html.parser')  # Parse using bs4

# Collect the category groups, skipping the first one
blocks = soup.find_all("div", {"class": "mw-category-group"})[1:]

bh_data = []
for block in blocks:
    places = block.find('ul').find_all('li')
    for place in places:
        # Keep only the part of the link text before the first comma
        bh_data.append(place.a.text.split(',')[0])
bh_data = pd.DataFrame(bh_data, columns=['Area'])

remove_places = ['Rifa and Southern Region', 'Northern City']  # Exclude these places
bh_data = bh_data[~bh_data['Area'].isin(remove_places)].reset_index(drop=True)
bh_data.head(5)
| | Area |
|---|---|
| 0 | A'ali |
| 1 | Abu Baham |
| 2 | Abu Saiba |
| 3 | Al Garrya |
| 4 | Al Hajar |
So there are about 82 areas in Bahrain to study.
Retrieving Coordinates via a Geocoder
After that, we need to geocode them, that is, convert each one from a simple address to latitude & longitude values.
We’ll use two popular geocoders: OpenStreetMap first, with MapQuest as a fallback.
import geocoder

apikey = "API-KEY-XXXXXXXXXXX"

lats = []
lngs = []
for city in bh_data['Area']:
    geocoder_type = 'osm'
    try:
        # Try OpenStreetMap first
        g = geocoder.osm(f"{city}, Bahrain", key=apikey)
        geodata = g.json
        lat, lng = geodata['lat'], geodata['lng']
    except Exception:
        # Fall back to MapQuest if the OSM lookup fails
        geocoder_type = 'MAPQUEST'
        g = geocoder.mapquest(f"{city}, Bahrain", key=apikey)
        geodata = g.json
        lat, lng = geodata['lat'], geodata['lng']
    lats.append(lat)
    lngs.append(lng)
    print(city, "|", geocoder_type)

bh_data['Latitude'] = lats
bh_data['Longitude'] = lngs
These are the first few of them that were geocoded!
| | Area | Latitude | Longitude |
|---|---|---|---|
| 0 | A'ali | 26.154454 | 50.527364 |
| 1 | Abu Baham | 26.205737 | 50.541668 |
| 2 | Abu Saiba | 30.325299 | 48.266157 |
| 3 | Al Garrya | 26.232690 | 50.578110 |
| 4 | Al Hajar | 26.225405 | 50.590138 |
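One thing worth checking at this point: Abu Saiba’s coordinates (30.33, 48.27) sit well outside Bahrain, so the geocoder seems to have matched a different place. Here is a minimal sketch of a sanity check, assuming a rough bounding box for Bahrain. The box limits and the function name `outside_bahrain` are my own assumptions for illustration, not values returned by any geocoding service.

```python
# Rough bounding box for Bahrain (assumed limits, not authoritative)
LAT_MIN, LAT_MAX = 25.5, 26.4
LNG_MIN, LNG_MAX = 50.3, 50.9

def outside_bahrain(lat, lng):
    """Return True if the point falls outside the assumed bounding box."""
    return not (LAT_MIN <= lat <= LAT_MAX and LNG_MIN <= lng <= LNG_MAX)

# First few geocoded areas, copied from the output above
geocoded = [
    ("A'ali", 26.154454, 50.527364),
    ("Abu Baham", 26.205737, 50.541668),
    ("Abu Saiba", 30.325299, 48.266157),
    ("Al Garrya", 26.232690, 50.578110),
    ("Al Hajar", 26.225405, 50.590138),
]

suspicious = [name for name, lat, lng in geocoded if outside_bahrain(lat, lng)]
print(suspicious)  # ['Abu Saiba']
```

Areas flagged this way could be geocoded again with a more specific query, or corrected by hand, before they are used any further.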
Visualization on a Map
We will now use Folium to draw a map of Bahrain with each area plotted as a point.
import folium

# create map of Bahrain using latitude and longitude values
latitude, longitude = 26.0766404, 50.334118
map_bahrain = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, city in zip(bh_data['Latitude'], bh_data['Longitude'], bh_data['Area']):
    label = folium.Popup(city, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=True).add_to(map_bahrain)

map_bahrain
Foursquare: Exploring Areas for Food Places
Next, we’ll leverage the Foursquare API to gather the most common types of restaurants within 500m of each area’s center. We’ll then look at the various food places and restaurants and extract their types for further analysis.
Note: To filter for restaurants & food places only, we will use the specific “Food” category ID: 4d4b7105d754a06374d81259
food_categoryId = "4d4b7105d754a06374d81259"
Alright, let’s look at all the food places within a 500m radius of the first area.
The first venue is Costa Coffee, and it has the category: Coffee Shop
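The request behind that result isn’t shown in this post, so here is a sketch of how the search URL could be put together, based on the URL format used in the getNearbyFoods function further down. The helper name `build_search_url`, the credential placeholders, and the `VERSION` and `LIMIT` values are assumptions for illustration; substitute your own. The actual network call is left commented out.

```python
from urllib.parse import urlencode

# Placeholder credentials and settings (hypothetical values, replace with your own)
CLIENT_ID = "YOUR-CLIENT-ID"
CLIENT_SECRET = "YOUR-CLIENT-SECRET"
VERSION = "20210910"  # assumed API version date
LIMIT = 50            # assumed maximum number of venues per request

food_categoryId = "4d4b7105d754a06374d81259"

def build_search_url(lat, lng, radius=500):
    """Build a Foursquare v2 venue-search URL for food places around a point."""
    params = {
        "client_id": CLIENT_ID,
        "client_secret": CLIENT_SECRET,
        "v": VERSION,
        "ll": f"{lat},{lng}",
        "categoryId": food_categoryId,
        "radius": radius,
        "limit": LIMIT,
    }
    return "https://api.foursquare.com/v2/venues/search?" + urlencode(params)

url = build_search_url(26.154454, 50.527364)  # coordinates of A'ali
print(url)
# results = requests.get(url).json()  # uncomment to actually call the API
```

With the last line uncommented, `results` would hold the JSON response that the next snippet flattens.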
So now let’s build a helpful function to extract the category of each food place. We’ll use the same area as an example.
# function that extracts the category of the restaurant
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

venues = results['response']['venues']
nearby_food = pd.json_normalize(venues)  # flatten JSON

# filter columns
filtered_columns = ['name', 'categories', 'location.lat', 'location.lng']
nearby_food = nearby_food.loc[:, filtered_columns]

# filter the category for each row
nearby_food['categories'] = nearby_food.apply(get_category_type, axis=1)

# clean columns
nearby_food.columns = [col.split(".")[-1] for col in nearby_food.columns]
nearby_food.head()
| | name | categories | lat | lng |
|---|---|---|---|---|
| 0 | Costa Coffee | Coffee Shop | 26.157464 | 50.525873 |
| 1 | Chilis Aali | Diner | 26.152996 | 50.526268 |
| 2 | Loop Cafe | Café | 26.156017 | 50.531527 |
| 3 | Hospital Resturant (كافيتيريا المستشفى) | Restaurant | 26.153012 | 50.526232 |
| 4 | كفتيريا المستشفى | Restaurant | 26.153455 | 50.528375 |
These are just the first few; in total, the query returns 19 food places around A’ali.
Exploring All Areas
We’ve still got 82 places to explore, so let’s create a function to do this task much faster.
def getNearbyFoods(names, latitudes, longitudes, radius=500):
    food_categoryId = "4d4b7105d754a06374d81259"
    foods_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)

        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/search?&client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}'.format(
            CLIENT_ID, CLIENT_SECRET, VERSION, lat, lng, food_categoryId, radius, LIMIT)

        # make the GET request
        response = requests.get(url).json()
        try:
            results = response["response"]['venues']
        except KeyError:
            # show the raw response (e.g. an API error) before re-raising
            print(response)
            raise

        venue_list = []
        # return only relevant information for each nearby food place
        for v in results:
            vname, vlat, vlng = v['name'], v['location']['lat'], v['location']['lng']
            try:
                vcategory = v['categories'][0]['name']
                venue_list.append((name, lat, lng, vname, vlat, vlng, vcategory))
            except (IndexError, KeyError):
                continue  # skip venues without a category
        foods_list.append(venue_list)

    nearby_foods = pd.DataFrame([item for venue_list in foods_list for item in venue_list])
    nearby_foods.columns = ['Area', 'Area Latitude', 'Area Longitude',
                            'Venue', 'Venue Latitude', 'Venue Longitude', 'Venue Category']
    return nearby_foods
We run the above function on each area and create a new dataframe called bh_food.
We’ve trimmed out the remaining areas for brevity’s sake.
What’s interesting to notice in this data is that there are 88 unique food categories.
Some of them include: Coffee Shop, Diner, Café, Restaurant, Breakfast Spot and so on.
Most Common Food Places
Our solution relies on segmenting areas based on the most common types of food places within each area. This gives us an idea of what kind of area it is from a culinary point of view, and allows us to judge whether the food suits our taste. We also want to factor in the total number of food places within an area, since some places in Bahrain may not be ideal to live in if they don’t have enough places to eat.
Using the dataframe bh_food, we form a one-hot encoding of the Venue Category field, which produces a new column for each category. Each record in this table corresponds to a single venue, with a 1 placed in the column for that venue’s category. The only other field that is retained is the area name. We will call this bh_onehot.
# one hot encoding
bh_onehot = pd.get_dummies(bh_food[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
bh_onehot = pd.concat([bh_food[['Area']], bh_onehot], axis=1)
bh_onehot.head()
Columns: Area, followed by one column for each of the 88 food categories: Afghan Restaurant, African Restaurant, American Restaurant, Arepa Restaurant, Asian Restaurant, BBQ Joint, Bagel Shop, Bakery, Bistro, Breakfast Spot, Bubble Tea Shop, Buffet, Burger Joint, Burrito Place, Cafeteria, Café, Chaat Place, Chinese Restaurant, Coffee Shop, College Lab, Comfort Food Restaurant, Creperie, Cuban Restaurant, Cupcake Shop, Deli / Bodega, Dessert Shop, Diner, Doner Restaurant, Donut Shop, Dumpling Restaurant, Eastern European Restaurant, Egyptian Restaurant, Falafel Restaurant, Farmers Market, Fast Food Restaurant, Filipino Restaurant, Fish & Chips Shop, Food, Food Court, Food Truck, French Restaurant, Fried Chicken Joint, Frozen Yogurt Shop, Gas Station, Gastropub, Greek Restaurant, Halal Restaurant, Hookah Bar, Hot Dog Joint, Ice Cream Shop, Indian Restaurant, Iraqi Restaurant, Italian Restaurant, Japanese Restaurant, Juice Bar, Kebab Restaurant, Korean Restaurant, Lebanese Restaurant, Mediterranean Restaurant, Mexican Restaurant, Middle Eastern Restaurant, Moroccan Restaurant, Movie Theater, New American Restaurant, Noodle House, Pastry Shop, Persian Restaurant, Pie Shop, Pizza Place, Portuguese Restaurant, Restaurant, Salad Place, Sandwich Place, Seafood Restaurant, Shawarma Place, Snack Place, South Indian Restaurant, Steakhouse, Supermarket, Sushi Restaurant, Tea Room, Thai Restaurant, Theme Restaurant, Tibetan Restaurant, Turkish Restaurant, Vegetarian / Vegan Restaurant, Vietnamese Restaurant, Wings Joint.

The first five rows are all venues in A'ali. Only the columns that contain a 1 are shown here; every other category column is 0 for these rows.

| | Area | Café | Coffee Shop | Diner | Restaurant |
|---|---|---|---|---|---|
| 0 | A'ali | 0 | 1 | 0 | 0 |
| 1 | A'ali | 0 | 0 | 1 | 0 |
| 2 | A'ali | 1 | 0 | 0 | 0 |
| 3 | A'ali | 0 | 0 | 0 | 1 |
| 4 | A'ali | 0 | 0 | 0 | 1 |
Now, let’s group the rows by area, taking the mean of each category’s frequency of occurrence, and add the number of food places surrounding each area (NumberOfFoodPlaces).
The number of food places matters because some areas have far fewer restaurants than others, which could be a valid factor for a “foodie” looking for a place to stay.
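The grouping code itself isn’t shown in this post. In pandas it would presumably be a `groupby('Area')` with a mean over the one-hot columns plus a per-area row count. To make the arithmetic concrete, here is a plain-Python sketch on a made-up area with four venues; the function name `category_frequencies` and the sample data are illustrative assumptions.

```python
from collections import Counter

def category_frequencies(venue_categories):
    """Mean of the one-hot columns = share of venues in each category."""
    total = len(venue_categories)
    counts = Counter(venue_categories)
    freqs = {category: count / total for category, count in counts.items()}
    return freqs, total

# Made-up example: one area with four food places
sample_area = ["Café", "Café", "Restaurant", "Coffee Shop"]
freqs, number_of_food_places = category_frequencies(sample_area)

print(freqs)                  # {'Café': 0.5, 'Restaurant': 0.25, 'Coffee Shop': 0.25}
print(number_of_food_places)  # 4
```

In the real table, every category that doesn’t occur in an area simply gets a frequency of 0, so each area ends up as a row of 88 frequencies that sum to 1.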
Let’s call this bh_grouped. Now that we have this processed information, we can analyze it more clearly by reordering it so that only the 10 most common types of food places for each area are retained.
# Function to sort venues by most common ones
def return_most_common_venues(row, num_top_venues):
    # assumes Area is the first column and NumberOfFoodPlaces the last
    row_categories = row.iloc[1:-1]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 10
indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Area', 'NumberOfFoodPlaces']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Food Place'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Food Place'.format(ind+1))

# create a new dataframe
foods_sorted = pd.DataFrame(columns=columns)
foods_sorted[['Area', 'NumberOfFoodPlaces']] = bh_grouped[['Area', 'NumberOfFoodPlaces']]
for ind in np.arange(bh_grouped.shape[0]):
    foods_sorted.iloc[ind, 2:] = return_most_common_venues(bh_grouped.iloc[ind, :], num_top_venues)

foods_sorted.head()
| | Area | NumberOfFoodPlaces | 1st Most Common Food Place | 2nd Most Common Food Place | 3rd Most Common Food Place | 4th Most Common Food Place | 5th Most Common Food Place | 6th Most Common Food Place | 7th Most Common Food Place | 8th Most Common Food Place | 9th Most Common Food Place | 10th Most Common Food Place |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A'ali | 19 | Café | Restaurant | Coffee Shop | Cupcake Shop | Breakfast Spot | Food | Sandwich Place | Falafel Restaurant | Middle Eastern Restaurant | Bakery |
| 1 | Abu Baham | 9 | Middle Eastern Restaurant | Cafeteria | Ice Cream Shop | Donut Shop | BBQ Joint | Restaurant | Fish & Chips Shop | Mediterranean Restaurant | Afghan Restaurant | New American Restaurant |
| 2 | Al Daih | 50 | Middle Eastern Restaurant | Bakery | Breakfast Spot | Dessert Shop | Café | Restaurant | Turkish Restaurant | Burger Joint | Diner | Steakhouse |
| 3 | Al Dair | 9 | Bakery | Restaurant | BBQ Joint | Italian Restaurant | Fast Food Restaurant | Afghan Restaurant | Lebanese Restaurant | Noodle House | New American Restaurant | Movie Theater |
| 4 | Al Garrya | 50 | Restaurant | Breakfast Spot | Indian Restaurant | Coffee Shop | Filipino Restaurant | Café | Fried Chicken Joint | Fast Food Restaurant | Diner | Middle Eastern Restaurant |
Let’s call this table foods_sorted.
Cluster Areas
Now we are ready for further analysis and clustering. We will use the bh_grouped dataframe since it contains the necessary numerical values for machine learning.
Our feature set comprises the mean frequency of every food category (88 features).
We are excluding NumberOfFoodPlaces as an input to the ML model, since our problem requires segmenting areas by the type of food available. This quantity only matters at the end, when deciding whether to live in an area or not.
A more concrete reason to exclude it is that we’re already neglecting all sorts of other factors due to lack of data, such as living costs, access to public transport, etc.
This is a foodie’s guide to finding a place, and this venture shouldn’t be bogged down by the fact that some areas have fewer restaurants than one would expect.
The output will be a cluster label for each area.
For our machine learning analysis, we will separate the areas with K-Means clustering, one of the simplest clustering algorithms and an unsupervised approach that suits our purpose. We’ll use the popular machine learning library scikit-learn to do that in Python.
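To make the algorithm a little less of a black box, here is a toy version of k-means (Lloyd’s algorithm) in plain Python. This is only a sketch for intuition, not what scikit-learn runs internally: its KMeans uses a smarter initialization (k-means++ by default) and is far more efficient. The two-feature frequency vectors and the hand-picked starting centroids are made up for illustration.

```python
import math

def kmeans(points, centroids, n_iter=10):
    """Toy k-means (Lloyd's algorithm) on tuples of floats."""
    labels = []
    for _ in range(n_iter):
        # Assignment step: label each point with its nearest centroid
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return labels, centroids

# Made-up frequency vectors: (share of cafés, share of restaurants)
areas = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8)]
labels, centroids = kmeans(areas, centroids=[areas[0], areas[2]])
print(labels)  # [0, 0, 1, 1]
```

The café-heavy areas end up in one cluster and the restaurant-heavy areas in the other, which is the same idea we apply below with 88 features instead of two.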
We’ll run k-means to group the areas into 5 clusters. We pick this number for the sake of examination. We’ll fit the model on the entire data to learn these clusters.
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 5

bh_grouped_clustering = bh_grouped.drop(['Area', 'NumberOfFoodPlaces'], axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(bh_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]
Let’s create a new dataframe bh_merged that includes the cluster as well as the top 10 food places for each area.
# add clustering labels
try:
    foods_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
except ValueError:
    # Allows me to retry if the Cluster Labels column already exists
    foods_sorted['Cluster Labels'] = kmeans.labels_

bh_merged = bh_data

# merge foods_sorted with bh_data to add latitude/longitude for each neighborhood
bh_merged = bh_merged.join(foods_sorted.set_index('Area'), on='Area')
bh_merged.dropna(how='any', axis=0, inplace=True)
bh_merged['Cluster Labels'] = bh_merged['Cluster Labels'].astype(np.int32)
bh_merged.head()  # check the last columns!
| | Area | Latitude | Longitude | Cluster Labels | NumberOfFoodPlaces | 1st Most Common Food Place | 2nd Most Common Food Place | 3rd Most Common Food Place | 4th Most Common Food Place | 5th Most Common Food Place | 6th Most Common Food Place | 7th Most Common Food Place | 8th Most Common Food Place | 9th Most Common Food Place | 10th Most Common Food Place |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A'ali | 26.154454 | 50.527364 | 1 | 19.0 | Café | Restaurant | Coffee Shop | Cupcake Shop | Breakfast Spot | Food | Sandwich Place | Falafel Restaurant | Middle Eastern Restaurant | Bakery |
| 1 | Abu Baham | 26.205737 | 50.541668 | 0 | 9.0 | Middle Eastern Restaurant | Cafeteria | Ice Cream Shop | Donut Shop | BBQ Joint | Restaurant | Fish & Chips Shop | Mediterranean Restaurant | Afghan Restaurant | New American Restaurant |
| 3 | Al Garrya | 26.232690 | 50.578110 | 0 | 50.0 | Restaurant | Breakfast Spot | Indian Restaurant | Coffee Shop | Filipino Restaurant | Café | Fried Chicken Joint | Fast Food Restaurant | Diner | Middle Eastern Restaurant |
| 4 | Al Hajar | 26.225405 | 50.590138 | 0 | 49.0 | Café | Filipino Restaurant | Middle Eastern Restaurant | Fast Food Restaurant | Coffee Shop | Asian Restaurant | Indian Restaurant | Pizza Place | BBQ Joint | Restaurant |
| 5 | Al Kharijiya | 26.160230 | 50.609140 | 0 | 16.0 | Cafeteria | Asian Restaurant | Fast Food Restaurant | Bakery | Wings Joint | Pizza Place | Falafel Restaurant | Middle Eastern Restaurant | Café | Food Court |
Finally, let’s visualize the resulting clusters on the map.
Examine Clusters & Final Conclusion
Now, we can examine & determine the discriminating characteristics of each cluster.
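One simple way to start is to count, for each cluster, which category most often shows up as an area’s top food place. Here is a small sketch using only the five rows from the bh_merged preview above, so the counts say very little on their own; on the full table, a pandas groupby on Cluster Labels would do the same job. The function name `top_categories_by_cluster` is my own.

```python
from collections import Counter, defaultdict

# (area, cluster label, 1st most common food place) from the bh_merged preview
rows = [
    ("A'ali", 1, "Café"),
    ("Abu Baham", 0, "Middle Eastern Restaurant"),
    ("Al Garrya", 0, "Restaurant"),
    ("Al Hajar", 0, "Café"),
    ("Al Kharijiya", 0, "Cafeteria"),
]

def top_categories_by_cluster(rows):
    """Count how often each category is an area's top food place, per cluster."""
    summary = defaultdict(Counter)
    for _area, cluster, top_category in rows:
        summary[cluster][top_category] += 1
    return summary

summary = top_categories_by_cluster(rows)
for cluster in sorted(summary):
    print(cluster, summary[cluster].most_common(2))
```

Categories that dominate one cluster but rarely appear in the others are good candidates for that cluster’s defining characteristic.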
Phew! We’re done with finding our clusters and working out which areas fall into each one. To understand the constraints and the discussion that concludes this solution, please refer to my report on my GitHub repo, where you will also find the datasets I’ve used :D
I hope you’ve enjoyed reading & learning something new from this post. Doing this was part of my data-science course, and I hope you can do the same with your hobby projects.