Statistics | Pretty Quant

Does a r/WallStreetBets Portfolio Significantly Outperform?

Wed, 05 Oct 2022 00:00:00 +0000

Thanks to the prevalence of COVID-19 in our everyday lives, it’s getting increasingly difficult to return to normal for most people. While I’ve used this additional flexibility to pick up on old hobbies (gaming, music, etc.), others have used theirs to learn about financial markets. Knowing that I work in finance, some of my friends have reached out to me for financial pointers, while others have opted for the convenience of reading r/WallStreetBets.

You may be asking why I, someone who would be considered a “sophisticated investor,” would even be interested in a platform such as r/WallStreetBets. For those of you who don’t know, I wrote a piece about the subreddit last year. However, that piece never clearly explained the buzz around r/WallStreetBets.

Due to the pandemic (the financial insecurity and flexibility it brought to millions of people), as well as the stimulus checks provided by the government and the rise of free trading platforms such as Robinhood, a lot of people who would typically not dabble with stocks are having fun with the stock market. They’re also investing in all sorts of zany things like Dogecoin and GameStop. Institutional investors (the “smart money”) and the veterans in the financial media fail to understand this, and they’re generally condescending and negative towards this new brand of retail investor.

They call them idiots for taking risks in cryptocurrencies; they call them fools for believing companies in dying industries with falling revenues are great investments [2]. From this, the term “Dumb Money” was used to describe this new breed of investors; “DOGE/GME to the moon,” they frequently chant, much to the disdain and confusion of legacy investors and their friends in the legacy media [3].

Recognizing the condescension, these retail investors decided to take their agency back using the self-described label known as Retards. Retards, if you don’t know, is a rearrangement of the word tradeRS. Since they’re not considered legitimate tradeRS in the eyes of the investment community, they’ll just call themselves Retards, which is an anagram designed to reclaim the agency taken from retail investors. Sure, they may be considered “Dumb Money,” but they’re going to make the investment decisions they want to without the influence and manipulation of institutional investors and their friends in the financial media.

This is largely the energy behind the drama involving r/WallStreetBets and the rest of the investment community.

Are r/WallStreetBets Stock Picks Even Any Good?

It is generally assumed – rightly or wrongly – that if you have a background in finance, you know what you’re talking about. The barriers required to work within the industry seem to justify the claim. The most well-known front-end finance jobs require a bachelor’s degree at an accredited four-year university, along with passing, at minimum, the Securities Industry Essentials Exam (or SIE) and either a FINRA Series 3 or 7 Exam.

Jobs that are more analytically driven, such as actuarial roles, may require candidates to have a statistical or mathematical background and pass several SOA (Society of Actuaries) exams. Quant roles (which is my domain) typically don’t require examinations; however, some positions encourage or require candidates to have, at minimum, a Master of Science in a STEM field.

So yes, it may be easy to see why people on Wall Street are considered “Smart Money.”

However, this doesn’t mean you need education and fancy certificates to make good investments. Warren Buffett, one of the greatest investors alive, began to invest on his own when he was only 11 years old. While the man would never even look at any of the companies r/WallStreetBets are investing in, he established a system that allowed him to make sound investments using the resources available at the time; namely, a book written by Benjamin Graham called The Intelligent Investor. Speaking from personal experience, I started learning about finance and economics on my own time before I enrolled in university to pursue it as a career.

Today, the resources available to help retail investors are potentially endless. Most of what you’ll find on the internet is bunk; however, you can find invaluable information if you know where to look.

This project aims to see if the TradeRS at r/WallStreetBets know where to look. Are they seeing things we aren’t seeing or just larping as Wall Street speculators?

What Are “Meme Stocks”?

A meme stock is a stock that has seen an increase in volume not because of how well the company performs, but rather because of hype on social media and online forums. For this reason, these stocks often become overvalued, seeing drastic price increases in just a short amount of time.

Many of these stocks have not performed well in recent years. Some of these stocks may exist in struggling retail or brick-and-mortar industries (GME). Other stocks may have once been considered leaders in their respective industries but have rebranded and shifted their focus to maintain viability (BBY). However, one thing these meme stocks have in common is that they all follow the same life cycle.

“To the moon” was a popular rallying cry for many holders of these stocks. It was used as a reminder: regardless of how much the price may drop, buy and never sell, because we (retail investors) control the value of the stock, not institutional investors. There is no doubt that it can be exciting to make money day trading and to be part of something bigger than yourself.

Unfortunately, there is still a large body of research suggesting that even the most experienced day traders lose money [4].

Building a “Meme Stock” Portfolio

We will construct a portfolio using popular stocks from the WallStreetBets community. In July 2021, I decided to find the most talked-about stocks in the r/WallStreetBets subreddit and narrowed the list down to 20 of the most popular equities on the platform. Six of these assets have recently gone public, and they will be excluded from the experiment.

We’re also going to include popular cryptocurrencies, such as Bitcoin and Dogecoin.

Seeking Alpha

Alpha, or Jensen’s Alpha, quantifies the excess returns obtained by a portfolio of investments above the returns implied by the Capital Asset Pricing Model (CAPM).

The formula is denoted as follows:

$$r_{a} = r_{f} + \beta_{a}(r_{m}-r_{f})+\epsilon$$

where:

$r_{a}$ = Return of the asset
$r_{f}$ = Risk-free rate
$\beta_{a}$ = Beta of the asset
$r_{m}$ = Expected market return
$\epsilon$ = Tracking error

This formula can be better understood if we rearrange it as seen below:

$$(r_{a}-r_{f})=\beta_{a}(r_{m}-r_{f})+\epsilon$$

The left side of the equation gives us the difference between the asset return and the risk-free rate, the “excess return.” If we regress the asset excess return on the market excess return, the slope represents the asset’s beta. Therefore, beta can also be calculated by the equation:

$$\beta=\frac{\operatorname{Cov}(r_{a},r_{b})}{\operatorname{Var}(r_{b})}$$

where $r_{a}$ is the asset return and $r_{b}$ is the return of the benchmark (here, the market). So beta can be described as:

$$\beta=\rho_{a,b}\frac{\sigma_{a}}{\sigma_{b}}$$

The formula above shows that beta is the asset’s correlation with the benchmark scaled by their relative volatility. To make this simpler, beta can be estimated with a simple linear regression in which the market’s excess return is the single factor explaining the asset’s return: the slope is beta, and the part of the return the factor does not explain, which the regression captures as its intercept, represents alpha.
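To make the three routes to beta concrete, here is a minimal sketch on simulated data. The return series, the "true" beta of 1.7, and the variable names are all assumptions made up for illustration; nothing here uses the actual portfolio. It computes beta from the covariance formula, from the correlation formula, and as the slope of a simple linear regression, whose intercept is what is usually reported as alpha.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical daily excess returns (return minus risk-free rate), not real data.
market_excess = rng.normal(0.0005, 0.01, size=500)
true_alpha, true_beta = 0.0002, 1.7
asset_excess = true_alpha + true_beta * market_excess + rng.normal(0, 0.005, size=500)

# Beta as covariance over variance.
cov = np.cov(asset_excess, market_excess, ddof=1)
beta_cov = cov[0, 1] / cov[1, 1]

# Beta as correlation times relative volatility.
rho = np.corrcoef(asset_excess, market_excess)[0, 1]
beta_rho = rho * asset_excess.std(ddof=1) / market_excess.std(ddof=1)

# Simple linear regression: the slope is beta, the intercept is alpha.
beta_ols, alpha_ols = np.polyfit(market_excess, asset_excess, deg=1)

print(f"beta (cov/var):     {beta_cov:.4f}")
print(f"beta (rho * ratio): {beta_rho:.4f}")
print(f"beta (regression):  {beta_ols:.4f}")
print(f"alpha (intercept):  {alpha_ols:.6f}")
```

All three beta estimates agree up to floating-point error, which is the point of the two formulas: they are the same quantity written differently.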

The value of alpha, the excess return, can vary. A positive number signals overperformance relative to the benchmark, while a negative number signals underperformance. Zero (or a number close to zero) shows a neutral performance; the fund tracks the benchmark.

The CAPM formula utilizes the risk-free rate to account for risk. Therefore, if a given security is fairly priced, the expected returns should be the same as the returns estimated by CAPM. However, if the security were to earn more than the risk-adjusted returns, the alpha should be positive.

Descriptive Statistics (Risk & Return)

So how does our meme stock portfolio perform?

As we can see, most of the stocks in our portfolio outperform the S&P 500 by a modest margin. Out of the 20 assets, 8 of them outperform our benchmark by at least 1%, and 3 have underperformed our benchmark. If we average the $\alpha$, it comes out to 0.9386, which means that our meme stock portfolio only outperforms the S&P by 0.93% (I got this figure by simply averaging the $\alpha$ for each asset). So, should you invest in a portfolio like this? I suppose it “depends” on your income goals and overall suitability (after all, this isn’t investment advice).

If I’m a college student, probably majoring in economics/finance with an interest in quant finance, and I’m just experimenting with investment strategies, I might be grateful that my strategy is at least slightly better than the S&P. However, if I’m a grown adult looking to generate wealth, I don’t think I would be satisfied with 0.93%, which barely accounts for management fees.

Granted, those who consider meme stocks a sound investment are usually self-taught and were first introduced to investing during the AMC/GME/DOGE craze. Needless to say, they are probably managing their own portfolios (no management fees for them). It may be difficult to quantitatively comprehend the idea of an $\alpha$ of 0.93, especially when so many assets can perform much better.

As mentioned before, most assets (especially stocks) are positively correlated with the broader market. Most of the assets in our portfolio have $\beta$ greater than 1, which means they are more volatile than the overall market. As we see, CLF has a $\beta$ of 1.7, which means we can expect this stock to increase by 1.7% for every 1% increase in the broader market.

Portfolio Optimization

We have shown that our meme stock portfolio outperforms the S&P 500 by 0.93%, but that doesn’t mean it will typically outperform our benchmark by this amount. This amount can change based on the size of our portfolio, as well as how we weight each asset. We can figure out how best to do this by using Modern Portfolio Theory (MPT). MPT is a method for selecting investments in order to maximize their overall returns within an acceptable level of risk.

Essentially, we are trying to find the most efficient portfolio possible.

How do we measure the efficiency? We measure it with another formula known as the Sharpe ratio. The Sharpe ratio measures the performance of an investment compared to a risk-free asset, after adjusting for its risk. The Sharpe ratio can be calculated by the following:

$$\text{Sharpe Ratio} = \frac{R_{p}-R_{f}}{\sigma_{p}}$$

where $R_{p}$ is the portfolio return, $R_{f}$ is the risk-free rate, and $\sigma_{p}$ is the standard deviation of the portfolio’s excess return. The greater the Sharpe ratio, the better, as it indicates that an instrument’s returns are large relative to its risk. A higher Sharpe ratio also means the investment earns more, on average, above the risk-free rate for each unit of volatility it takes on.
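As a quick illustration of the ratio, here is a small helper that annualizes it from periodic returns. The function name, the 252-trading-day convention, and the simulated returns are my own assumptions for the sketch, not part of the original analysis.

```python
import numpy as np

def sharpe_ratio(returns, risk_free_rate=0.0, periods_per_year=252):
    """Annualized Sharpe ratio from periodic returns.

    risk_free_rate is an annual rate; 252 periods per year assumes daily data.
    """
    returns = np.asarray(returns, dtype=float)
    # Excess return per period over the (per-period) risk-free rate.
    excess = returns - risk_free_rate / periods_per_year
    return np.sqrt(periods_per_year) * excess.mean() / excess.std(ddof=1)

# Hypothetical daily returns for one year.
rng = np.random.default_rng(7)
daily = rng.normal(0.0008, 0.02, size=252)
print(f"Annualized Sharpe: {sharpe_ratio(daily, risk_free_rate=0.02):.2f}")
```

With `periods_per_year=1` and a zero risk-free rate, the function reduces to the plain mean-over-standard-deviation form of the formula.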

Now we’ve allocated all 20 of our assets based on our risk tolerance. As you can see, our program has plotted 100,000 unique portfolios, with annualized returns ($\alpha$) on our y-axis and annualized volatility ($\sigma$) on our x-axis. It’s possible to select a random portfolio inside the curve, but there will always be some portfolio out there, with the same number of assets, that will outperform in terms of returns and risk. The optimal or efficient portfolio will always exist somewhere on the edge of the curve, hence, the efficient frontier.
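One common way to produce a cloud of portfolios like this is to draw random weights and compute each portfolio's annualized return and volatility. The sketch below does that on simulated returns for five hypothetical assets; the asset count, the Dirichlet draw for the weights, and the zero risk-free rate are assumptions for illustration, and it may differ from the program used for the plot.

```python
import numpy as np

rng = np.random.default_rng(42)

n_assets, n_days, n_portfolios = 5, 750, 10_000
# Hypothetical daily returns standing in for real price history.
daily_returns = rng.normal(0.0006, 0.02, size=(n_days, n_assets))

mean_annual = daily_returns.mean(axis=0) * 252
cov_annual = np.cov(daily_returns, rowvar=False) * 252

# Random long-only weights, one row per portfolio, each row summing to 1.
weights = rng.dirichlet(np.ones(n_assets), size=n_portfolios)

port_return = weights @ mean_annual
# Portfolio variance w' C w, computed for every row of weights at once.
port_vol = np.sqrt(np.einsum("ij,jk,ik->i", weights, cov_annual, weights))

risk_free = 0.0  # assumed for simplicity
sharpe = (port_return - risk_free) / port_vol

min_vol = port_vol.argmin()      # portfolio with the least risk
max_sharpe = sharpe.argmax()     # portfolio with the best risk-adjusted return

print("Min-volatility weights:", weights[min_vol].round(3))
print("Max-Sharpe weights:    ", weights[max_sharpe].round(3))
```

Plotting `port_vol` against `port_return` gives the familiar bullet shape, and the two selected portfolios sit on its edge.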

If we want the least amount of risk possible, using our selected 20 instruments, we should allocate most of our funds towards BBBY, RNG, TWNK, TX, USA, X, HCA, VRTX, and BTC. These 9 instruments will comprise 77.4% of the portfolio. By allocating the portfolio in this way and prioritizing risk, we will achieve an excess annualized return (our $\alpha$) and volatility of 0.3.

What if we only care about maximizing return? If we want the greatest return possible, using the same 20 instruments, we should allocate most of our funds towards AMC, CLF, GME, BB, TWNK, SAVA, HCA, BTC, and DOGE. These assets will comprise ~75% of our portfolio. Using this optimal portfolio allocation which prioritizes returns, our portfolio achieves an annualized return of 0.84, with a volatility of 0.51.

So compared to the descriptive statistical analysis we used earlier, we’ve actually OVERSTATED our return for this portfolio. Using the most optimal allocation method possible, with thousands of different possibilities, we find that our meme stock portfolio still barely outperforms our benchmark.

In practical terms, it would be difficult for a serious investor to justify building a portfolio with these 20 assets when there are so many different investments out there that could do significantly better for the least amount of risk. This is especially true if you’re trying to avoid investing in assets that appear to be overvalued, such as Bitcoin and GameStop.

Lessons From The Meme-Stock Craze

I doubt this analysis will convince anyone who is a part of r/WallStreetBets, TradeRS, or anyone sympathetic to the meme-stock trading “revolution.” Much has been written about the heroic campaign by individual investors to slay giant institutional investors. It’s easy to understand why this narrative is compelling, so I doubt anyone would want to listen to someone with institutional experience, such as myself. Regardless, there are still valuable lessons that can be learned from the meme-stock craze, and the effort to democratize financial markets misses the mark on many of these lessons.

First, it’s dangerous for investors to follow crowds in the stock market or any other market. r/WallStreetBets initiated campaigns that inflated asset prices past their intrinsic value or their long-term fundamentals, making stocks such as GameStop, AMC, and Bed Bath & Beyond appear to be attractive investment opportunities. But those who bought these stocks close to their peaks are already nursing losses as the shares have come down. Overpaying for stocks whose prices don’t reflect business fundamentals isn’t courageous. Many who bought into the hype are already learning this painful lesson on the risk of market fads.

Second, the Federal Reserve has played an unwitting role in creating an environment where the meme-stock challenge can happen. In the Fed’s efforts to stabilize the economy, money has become virtually free. Ultralow interest rates encourage people to borrow and to take bigger risks to seek better returns. As a result, we’ve seen record sums of money being pumped into SPACs (special purpose acquisition companies) and private equity funds. As more money chases more opportunity, there are more instances of companies coming to market without being fully vetted.

Third, the cheap money also fuels record retail trading activity. Those of us who remember the dot-com frenzy in the late 1990s and the mortgage bubble that led to the 2008 financial crisis have seen this movie before, and it rarely ends well. With the Fed increasing interest rates to combat persistently high inflation, it will have to reluctantly deflate any bubbles that have emerged. We’re venturing into uncharted territory, where our rookie investors now have to learn to generate alpha in an environment without cheap money and a zero lower bound.

As such, quality is the best recipe for returns. Focusing on high-quality companies is a good defence against irrational market moves. And companies that enjoy strong organic growth drivers aren’t beholden to the hypercompetitive M&A market for growth. Building an equity portfolio based on businesses with sustainable earnings growth is a recipe for consistent outperformance and reduced volatility, even in a world where smaller investors can mount powerful campaigns to shock market leaders.

[2] The New York Times | ‘Dumb Money’ is on GameStop, and It’s Beating Wall Street at Its Own Game

[3] Quartz | Reddit and Robinhood gamified the stock market, and it’s going to end badly

[4] Haas School of Business, University of California, Berkeley | Do Day Traders Rationally Learn About Their Ability?

What Are Statistical Moments?

Wed, 16 Feb 2022 00:00:00 +0000

Sometimes mean and variance are not enough to describe a distribution. When we calculate variance, we square the deviations around the mean. In the case of large deviations, we do not know whether they are likely to be positive or negative. This is where the skewness and symmetry of a distribution come in. A distribution is symmetric if the parts on either side of the mean are mirror images. For example, the normal distribution is symmetric. The normal distribution with mean $\mu$ and standard deviation $\sigma$ is defined as

$$f(x)=\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}$$

Plotting the distribution gives us the following:

import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats

# Plot a standard normal distribution (mean = 0, standard deviation = 1)

plt.figure(figsize=(12,9))
xs = np.linspace(-6,6, 300)
normal = stats.norm.pdf(xs)
plt.plot(xs, normal);
plt.grid()
plt.xlim(-6,6)
plt.ylim(0,.42)

A distribution which is not symmetric is called skewed. For instance, a distribution can have many small positive and a few large negative values (negatively skewed) or vice versa (positively skewed), and still have a mean of 0. A symmetric distribution has skewness 0. Positively skewed unimodal (one mode) distributions have the property that mean > median > mode. Negatively skewed unimodal distributions are the reverse, with mean < median < mode. All three are equal for a symmetric unimodal distribution.

The explicit formula for skewness is:

$$S_{K}=\frac{n}{(n-1)(n-2)}\frac{\sum_{i=1}^{n}(X_{i}-\mu)^{3}}{\sigma^{3}}$$

where $n$ is the number of observations, $\mu$ is the arithmetic mean, and $\sigma$ is the standard deviation. The sign of this quantity describes the direction of the skew as described above. We can plot a positively skewed and a negatively skewed distribution to see what they look like. For unimodal distributions, a negative skew typically indicates that the tail is fatter on the left, while a positive skew indicates that the tail is fatter on the right.
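The skewness formula can be checked by hand on a tiny data set. This is a minimal sketch, assuming $\sigma$ is the sample standard deviation (`ddof=1`); the function name and the example arrays are made up for illustration. Under that assumption the result should agree with `scipy.stats.skew(x, bias=False)`.

```python
import numpy as np

def sample_skewness(x):
    """Skewness following the formula above, with sigma as the sample std."""
    x = np.asarray(x, dtype=float)
    n = x.size
    mu = x.mean()
    sigma = x.std(ddof=1)  # sample standard deviation (assumed reading of sigma)
    return n / ((n - 1) * (n - 2)) * np.sum((x - mu) ** 3) / sigma ** 3

right_tailed = [1, 2, 3, 4, 10]    # one large positive value
symmetric = [1, 2, 3, 4, 5]        # mirror image around the mean
left_tailed = [-10, -4, -3, -2, -1]

print(sample_skewness(right_tailed))  # positive: long right tail
print(sample_skewness(symmetric))     # zero: symmetric
print(sample_skewness(left_tailed))   # negative: long left tail
```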

plt.figure(figsize=(12, 9))
xs2 = np.linspace(stats.lognorm.ppf(0.01, .7, loc=-.1), stats.lognorm.ppf(0.99, .7, loc=-.1), 150)

# Positively skewed distribution (long right tail)
lognormal = stats.lognorm.pdf(xs2, .7)
plt.plot(xs2, lognormal, label='Skew > 0')

# Negatively skewed distribution (mirror image of the curve above)
plt.plot(xs2, lognormal[::-1], label='Skew < 0')
plt.legend(framealpha=0.1, loc='lower right');
plt.ylim(.004, .75)
plt.xlim(0, 5)
plt.grid()

Although skew is less obvious when graphing discrete data sets, we can still compute it. For example, below are the skew, mean, and median for S&P 500 returns 2015-2017. Note that the skew is negative, and so the mean is less than the median.

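import datetime

import pandas as pd
import yfinance as yf
from dateutil.relativedelta import relativedelta

# Three years of SPY prices ending on January 1, 2018
end = datetime.date(2018,1,1)
beg = end - relativedelta(years = 3)
# auto_adjust=False keeps the 'Adj Close' column in recent yfinance releases
pricing = pd.DataFrame(yf.download('SPY', start=beg, end=end, auto_adjust=False))['Adj Close']

# Daily percentage returns, dropping the first (undefined) value
returns = pricing.pct_change()[1:]

print('Skew:', stats.skew(returns))
print('Mean:', np.mean(returns))
print('Median:', np.median(returns))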


fig, ax = plt.subplots(figsize=(12,9))
plt.hist(returns, 30, edgecolor='black');
ax.grid()
ax.set_axisbelow(True)

Kurtosis

Kurtosis attempts to measure the shape of the deviation from the mean. Generally, it describes how peaked a distribution is compared to the normal distribution, which is called mesokurtic. All normal distributions, regardless of mean and variance, have a kurtosis of 3. A leptokurtic distribution (kurtosis > 3) is highly peaked and has fat tails, while a platykurtic distribution (kurtosis < 3) is broad. Sometimes, however, kurtosis in excess of the normal distribution (kurtosis - 3) is used, and this is the default in scipy. A leptokurtic distribution has more frequent large jumps away from the mean than a normal distribution does, while a platykurtic distribution has fewer.

plt.figure(figsize=(12,9))

# Plot some example distributions
plt.plot(xs,stats.laplace.pdf(xs), label='Leptokurtic')
plt.plot(xs, normal, label='Mesokurtic (normal)')
plt.plot(xs,stats.cosine.pdf(xs), label='Platykurtic')

plt.xlim(-6,6)
plt.ylim(0,.52)
plt.legend(framealpha=0.1, loc='upper right');
plt.grid()

The formula for kurtosis is

$$K=\bigg(\frac{n(n+1)}{(n-1)(n-2)(n-3)}\frac{\sum_{i=1}^{n}(X_{i}-\mu)^{4}}{\sigma^{4}}\bigg)$$

while excess kurtosis is given by

$$K_{E}=\bigg(\frac{n(n+1)}{(n-1)(n-2)(n-3)}\frac{\sum_{i=1}^{n}(X_{i}-\mu)^{4}}{\sigma^{4}}\bigg)-\frac{3(n-1)^{2}}{(n-2)(n-3)}$$

For a large number of samples, the excess kurtosis becomes approximately

$$K_{E}\approx\frac{1}{n}\frac{\sum_{i=1}^{n}(X_{i}-\mu)^{4}}{\sigma^{4}}-3$$

Since above we were considering perfect, continuous distributions, this was the form that kurtosis took. However, for a set of samples drawn from the normal distribution, we would use the first definition, and (excess) kurtosis would only be approximately 0.
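To see how the small-sample and large-sample forms of excess kurtosis compare, here is a sketch on simulated draws. The function names, seed, and sample size are assumptions for illustration. I take $\sigma$ as the sample standard deviation in the small-sample formula (which should then agree with `scipy.stats.kurtosis(x, fisher=True, bias=False)`) and as the population standard deviation in the approximation.

```python
import numpy as np

def excess_kurtosis(x):
    """Sample excess kurtosis using the small-sample formula."""
    x = np.asarray(x, dtype=float)
    n = x.size
    z4 = np.sum((x - x.mean()) ** 4) / x.std(ddof=1) ** 4
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * z4 - correction

def excess_kurtosis_large_n(x):
    """Large-sample approximation: mean fourth standardized moment minus 3."""
    x = np.asarray(x, dtype=float)
    z4 = np.sum((x - x.mean()) ** 4) / x.std(ddof=0) ** 4
    return z4 / x.size - 3

rng = np.random.default_rng(5)
normal_sample = rng.normal(size=200_000)    # mesokurtic: excess kurtosis near 0
laplace_sample = rng.laplace(size=200_000)  # leptokurtic: theoretical excess kurtosis of 3

for name, sample in [("normal", normal_sample), ("laplace", laplace_sample)]:
    print(name, excess_kurtosis(sample), excess_kurtosis_large_n(sample))
```

For samples this large the two versions are nearly identical; the correction terms only matter when $n$ is small.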

Analysis of Airbnb Listings in Brooklyn

Thu, 27 Apr 2017 00:00:00 +0000

The purpose of this project is to use Airbnb listing data and Zillow value data to provide marketing and business recommendations to Airbnb. This will be achieved through exploratory data analysis and predictive models for Airbnb listing price.

The first part of the project is dedicated to taking a visual look into Airbnb listings in the Greater New York area to discover trends in listing price, occupancy, rating, value, etc. The region we will primarily focus on is Brooklyn, because I personally have an Airbnb listing in the Crown Heights area. We’re looking for insights into listing behavior that could help Airbnb make strategic business decisions. The second part of the project introduces the Zillow data and the process of creating a predictive model for listing behavior that uses both data sets.

The Data

Throughout this project, we will be using data from Airbnb and Zillow. The Airbnb data was obtained through the third-party platform Inside Airbnb, which scrapes listing data regularly. All the listing data used in this project is from July 2019. Zillow has its home value index available on its website. We will be using home value data from 2019.

Airbnb Data

We begin by importing and cleaning the raw Airbnb listing data. The columns, or features, selected for the analysis are:

ID: This is a unique identifier for each listing.
Room Type: Defines the layout type of the listing as either: ‘Single Room’, ‘Shared Room’ or ‘Entire Home/Apartments’.
Accommodates: The number of people that each listing can accommodate.
Bathrooms: The number of bathrooms in each listing.
Bedrooms: The number of bedrooms in each listing.
Availability 365: The number of days that a listing is available to be booked in the next 365 days.
Number of Reviews: The number of reviews for each listing.
Review Score, Accuracy: This is the average accuracy rating for each listing. The user rates the listing based on how accurate the listing was as compared to the posting online.
Review Score, Value: This is the average value rating for each listing. The user rates the listing based on its value.

Exploratory Data Analysis

We begin by analyzing the listing count in the area. While it is true that Williamsburg and Bedford-Stuyvesant have a large number of listings, the raw listing count is very dependent on population. Once we normalize by dividing by the population, there are some more interesting points to extract.

Where are the most popular places to stay?

Williamsburg and Kensington have the highest normalized listing count. These areas outrank the front-runners of the un-normalized list, Bedford-Stuyvesant and Crown Heights, which follow.

But what does this normalized listing count actually indicate? This count tells us the number of listings each region would have disregarding the population. In other words, it tells us the number of listings there would be in each region if they all had the same population.
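As a sketch of the normalization, here is one way to express listings per 1,000 residents with pandas. The area names, counts, populations, and the per-1,000 scaling are all hypothetical; the exact scaling used in the analysis is not stated, so treat this as one reasonable reading.

```python
import pandas as pd

# Hypothetical counts and populations, for illustration only.
areas = pd.DataFrame(
    {
        "neighbourhood": ["Area A", "Area B", "Area C"],
        "listings": [3000, 400, 150],
        "population": [150_000, 10_000, 30_000],
    }
).set_index("neighbourhood")

# Listings per 1,000 residents puts every area on the same population footing.
areas["listings_per_1000"] = areas["listings"] / areas["population"] * 1000

ranked = areas.sort_values("listings_per_1000", ascending=False)
print(ranked)
```

Area A has by far the most listings, but Area B ranks first once population is taken into account, which is the same kind of reshuffling described above.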

The higher a region ranks in normalized listing count, the more hosts are listing their homes on the Airbnb website compared to other regions. Thus, a higher normalized listing count indicates that there is a higher supply of listings in those areas. This high supply may also suggest that these are areas where the Airbnb market is strong in terms of hosting listings.

This would also mean that the areas with a lower normalized listing count are ones where Airbnb is not as popular from the supply, or host, side of the service. In areas such as DUMBO (Down Under the Manhattan Bridge Overpass) and Boerum Hill, there may be some regulation restricting the number of listings available.

Let’s look at what happens when we pair this ‘supply’ side data with a parameter that estimates demand. ‘Demand’ for a listing in the context of this data can be estimated by how popular the region is to visit. This is indicated by the availability of a listing in the next calendar year. Plotting the mean availability (demand indicator) against the normalized listing count (supply indicator), we can explore the relationship. I have removed Williamsburg and Kensington, as they are clearly outliers.

We take our mean availability metric (which measures the average number of days each listing is available to book in the next calendar year) and subtract it from 365 to get the number of days the average listing is booked in each area. A higher number indicates that the area is more popular, and therefore, has more ‘demand.’
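The booked-days calculation is short enough to sketch directly. The column names below (`neighbourhood`, `availability_365`) and the four listings are hypothetical stand-ins for the cleaned Airbnb table.

```python
import pandas as pd

# Hypothetical listings; availability_365 mirrors the "Availability 365" feature.
listings = pd.DataFrame(
    {
        "neighbourhood": ["Area A", "Area A", "Area B", "Area B"],
        "availability_365": [100, 200, 300, 330],
    }
)

# Average days available per area, then days booked = 365 - days available.
mean_availability = listings.groupby("neighbourhood")["availability_365"].mean()
mean_booked_days = 365 - mean_availability

print(mean_booked_days.sort_values(ascending=False))
```

Here Area A averages 215 booked days against 50 for Area B, so Area A would count as the higher-demand area under this proxy.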

There seems to be a general trend of supply increasing with demand: as an area becomes more popular, we can expect more listings to appear there. This implies that Airbnb is doing a relatively good job encouraging individuals to host their homes as listings, given the demand for travel in those areas.

Where are the most expensive/cheapest places to stay?

Mill Basin, DUMBO, and Brooklyn Heights are the most expensive areas to stay in Brooklyn, while Borough Park, Brownsville, and Gravesend are the least expensive. These results are interesting, but it is difficult to explain why the areas fall in the order they do. Is it because of supply and demand? This average price metric is more interesting when we pair it with other data sets and explore relationships.

Price and listing count are clearly correlated. As the normalized number of listings increases, price increases as well. The correlation coefficient between these two variables is 0.6154, which indicates a fairly strong positive relationship. So an increase in the supply of listings does not necessarily mean that the added competition drives down price.
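For readers who want to see what the correlation coefficient is doing, here is a minimal sketch that computes Pearson's r by hand and checks it against NumPy. The six data points are invented for illustration and are not the Airbnb figures.

```python
import numpy as np

# Hypothetical per-area values, not the Airbnb data.
normalized_listings = np.array([5.0, 8.0, 12.0, 15.0, 20.0, 22.0])
average_price = np.array([80.0, 95.0, 90.0, 120.0, 135.0, 128.0])

def pearson_r(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_manual = pearson_r(normalized_listings, average_price)
r_numpy = np.corrcoef(normalized_listings, average_price)[0, 1]

print(f"manual: {r_manual:.4f}, numpy: {r_numpy:.4f}")
```

Because r is sensitive to extreme points, dropping outliers before computing it can change the value noticeably.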

There are places where there is wide dispersion in the price for a similar listing count, and the relationship is not as strong. Fort Greene, Clinton Hill, Crown Heights, and Cypress Hills all have similar normalized listing counts but fall into two distinct price groups: Crown Heights and Cypress Hills have lower average prices. This may be explained by the difference in average size or quality of listings in these two areas compared to the others.

For example, since Fort Greene and Clinton Hill are larger, hosts there may live in smaller residences. This limits the number of guests they can accommodate, as well as the bedrooms and bathrooms of their listings, thus driving down the average listing price in these areas. Clinton Hill/Fort Greene is also more commercial than Crown Heights/Cypress Hills, which are more residential.

It’s also worth noting that Williamsburg and Kensington are outliers again, along with Mill Basin. As such, we have removed these three areas from the analysis.

Which areas are the most popular to visit and stay?

There are a few metrics I will use to estimate the average popularity of listings per area. The first metric I will analyze is the average number of days the listings in an area are booked in the next year. The higher the average number of booked days, the more popular the area’s listings are. The most popular travel destinations in Brooklyn are in the Brooklyn Heights/Downtown Brooklyn area. This result is not surprising, because these are highly developed areas of the city (Fulton Street Mall, Barclays Center, etc.).

The second metric we use to determine the popularity of listings in a particular area is the average number of reviews per area. We use the number of reviews as a proxy for estimating how frequently the listings are booked. The reasoning is that the more reviews a listing has, the more times it has been booked, and therefore, the more popular it is. Here, we are using the average number of reviews for each area’s listings.

There are some similarities in how the areas rank in both metrics, with one notable exception: Bedford-Stuyvesant jumps from the bottom of the list on availability to the top on average number of reviews. So although Bed-Stuy is not the most popular place for travelers relative to other areas included in the data set, its listings are, on average, heavily reviewed. This may be an indication of the superior value of Bed-Stuy Airbnb listings, or it may be an indication of their relative affordability.

Next, we plot average availability per area versus its average price to get an idea of how popularity and price are related. This time, only Mill Basin is a major outlier with a high price. It is also the least popular area to stay, with only 70 days booked in the next calendar year on average. Mill Basin only has 5 listings, according to the most recent analysis by Inside Airbnb. As such, Mill Basin is not meaningful for our analysis.

Since Mill Basin doesn’t fit the trend of the rest of the data, we drop it from our analysis. After doing so, there is a positive correlation between price and popularity, with a correlation coefficient of 0.582. This coefficient means that, on average, the more popular an area’s listings (the higher the number of days booked), the higher the price of those listings. This seems to suggest that areas where it is more popular to travel have higher demand, which drives up price, on average.

The Zillow Data

Unlike commodities and consumer goods, for which we can observe prices in all time periods, we cannot observe prices on the same set of homes in all time periods. This is because not all homes are sold in every time period. Zillow has developed a way of approximating the ideal home price index. Instead of actual sale prices on every home, the index is created from estimated sale prices on every home.

Because of this, the distribution of actual sale prices for homes sold in a given time period looks very similar to the distribution of estimated sale prices for this same set of homes. But, importantly, Zillow has estimated sale prices, not just for the homes that sold, but for all homes even if they did not sell in that time period.

The Data

From the Zillow website we have important data on the ZHVI or median home price of every area. This data includes the growth rate of the index over various periods of time including: Month-over-Month, Quarter-over-Quarter, Year-over-Year, the 5-year change and the 10-year change.

Airbnb listings and ZHVI Data Relationship

The average price of a listing seems to increase as the median home value increases, as measured by the Zillow Home Value Index. However, Mill Basin and South Slope are two outliers, with low median home values and high average listing prices. By dropping these two outliers, we increase our correlation coefficient between ZHVI and average list price from 0.341 to 0.494. This suggests a moderate positive relationship between the median home value and average list price.

Machine Learning: Regression Model for Price Prediction

Feature Selection

We begin by appending all the listing data from these areas into one table. The next step is adding on the appropriate Zillow home value data for each listing based on the zip code for the listing. We will merge the two data sets using the zip code information contained in both data sets. That way, more granular and accurate home value data will be paired to each listing, as opposed to just using the general region name.
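A zip-code join of this kind can be sketched with `DataFrame.merge`. The column names (`zipcode`, `zhvi`), the listing rows, and the home values below are hypothetical; the real tables will have more columns and may name them differently.

```python
import pandas as pd

# Hypothetical listing rows and home values, for illustration only.
listings = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "zipcode": ["11216", "11201", "11216"],
        "price": [90, 210, 120],
    }
)
zhvi = pd.DataFrame(
    {
        "zipcode": ["11201", "11216"],
        "zhvi": [1_200_000, 900_000],
    }
)

# Left join keeps every listing; validate guards against duplicate zip codes
# in the Zillow table silently multiplying listing rows.
merged = listings.merge(zhvi, on="zipcode", how="left", validate="many_to_one")
print(merged)
```

Keeping zip codes as strings on both sides avoids mismatches from leading zeros or integer/str type differences.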

Setting the index to the region name for each data frame and merging on this index, we create a new table such that each listing has its relevant ZHVI, 5-Year ZHVI, and 10-Year ZHVI appended to its Airbnb data. This data frame has all of the independent data and dependent data (price) necessary to fit the model. If a linear relationship is not present, some features may require a transformation.

We evaluate linearity by plotting the price of all listings against each feature to ensure it has a linear relationship. The only two variables that needed transforming are the number of reviews and availability. Price exponentially decreases as number of reviews increases.

After taking the log of the number of reviews, a linear relationship emerges between these variables. With availability, there was an exponential increase in price as the availability increases. Again, after taking the log of availability, a linear relationship emerges between these variables. Also, the sum of all ratings was taken and used as a feature, as this variable had a stronger relationship to listing price.

Another requirement in creating a linear model is that the independent variables, or features, do not have strong relationships with one another, as this violates an assumption of the linear model. To check for this, we use a correlation matrix.
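The effect of the log transform can be sketched with synthetic data. The review counts and prices below are invented so that price falls with the log of the review count. They are not the listing data used here, and `pearson` is a hypothetical helper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Synthetic review counts and prices (assumption: price drops with log(reviews)).
reviews = [1, 2, 5, 10, 20, 50, 100, 200, 400]
prices = [200 - 30 * math.log(r) for r in reviews]

# Transform the feature, then compare how linear each version is against price.
log_reviews = [math.log(r) for r in reviews]
r_raw = pearson(reviews, prices)
r_log = pearson(log_reviews, prices)
print(f"raw reviews: {r_raw:.3f}, log reviews: {r_log:.3f}")
```

The correlation with the transformed feature is much closer to -1, which is what a straight-line relationship looks like. Listings with zero reviews would need separate handling (for example `math.log1p`), since the log of zero is undefined.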

Relationships between variables are measured between -1 and 1. The higher the absolute value, the stronger the relationship, with the sign signifying the direction of the relationship. As a result of this check, the number of bedrooms was removed as a feature, as it had a very strong correlation with bathrooms, occupancy, and ZHVI.
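A correlation-matrix check of this kind can be sketched as follows. The feature values and the 0.8 cutoff are assumptions chosen for illustration, not the values used in this analysis.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical feature columns for eight listings.
features = {
    "bedrooms":     [1, 1, 2, 2, 3, 3, 4, 5],
    "accommodates": [2, 3, 4, 5, 6, 6, 8, 10],
    "bathrooms":    [1, 1, 1, 2, 2, 2, 3, 3],
    "log_reviews":  [3.2, 0.7, 2.5, 1.1, 3.9, 0.2, 2.0, 1.6],
}

THRESHOLD = 0.8  # assumed cutoff for a "very strong" relationship

# Upper triangle of the correlation matrix: one entry per pair of features.
corr = {
    (a, b): pearson(features[a], features[b])
    for a, b in combinations(features, 2)
}

# Pairs that are candidates for dropping one of the two features.
flagged = {pair: r for pair, r in corr.items() if abs(r) > THRESHOLD}
for (a, b), r in flagged.items():
    print(f"{a} vs {b}: {r:.2f}")
```

Any feature that shows up in several flagged pairs, as bedrooms does here, is a natural candidate for removal.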

Linear Model

Using the feature DataFrame with all of the chosen independent variables and our dependent variable, price, we split the data into training and testing sets, with 80% of the data used to train the model and 20% used to evaluate its performance.

The evaluation uses the model fitted on the training data to predict price from the listing features of the testing data. It then compares these fitted values (the predicted prices) to the actual price of each listing to determine the strength of the model. If the model is strong, then the predicted price for a listing in the test data will be very close to, or exactly the same as, the actual price of the listing.

The R-squared is the coefficient of determination and measures, as a percentage, how well our linear model fits the data, or how close the training and test data are to the fitted regression line. In the context of Airbnb data, it measures how well the model will do at predicting the price of a listing based on our selected features.

The R-squared for this initial model was 0.538 for the testing data. This coefficient of determination tells us that the features can explain about 53.8% of the total variance in the price of a listing. In other words, roughly half of the variability can be attributed to our model. This is reasonable, but we may be able to achieve better results using other algorithms for linear regression, or by changing the number of features or type of areas we include in our model. The performance on the test data was similar, which is another good sign for the data. The Mean Squared Error for the testing set is 3676, which may be a cause for concern. However, this is something that can be corrected by refitting the model.

We will use mean squared error along with the R-squared to compare performance and linear fit across models as we try to improve the model. There is definitely room for improvement in both MSE and R-squared.
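The split-fit-evaluate workflow and the two metrics can be sketched in plain Python. This is a simplified illustration with a single feature and synthetic prices. The data, the seed, and the helper names (`fit_line`, `r_squared`, `mean_squared_error`) are assumptions, and the real model uses many more features.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    return mean_y - slope * mean_x, slope

def r_squared(actual, predicted):
    """Share of the variance in `actual` explained by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def mean_squared_error(actual, predicted):
    """Average squared difference between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

rng = random.Random(42)

# Synthetic listings: price depends on the number of guests, plus noise.
guests = [rng.randint(1, 8) for _ in range(200)]
prices = [40 + 25 * g + rng.gauss(0, 20) for g in guests]

# Shuffle, then hold out the last 20% of rows for testing.
rows = list(zip(guests, prices))
rng.shuffle(rows)
cut = int(0.8 * len(rows))
train, test = rows[:cut], rows[cut:]

# Fit on the training rows only, then score on the held-out rows.
intercept, slope = fit_line(*zip(*train))
test_x, test_y = zip(*test)
predicted = [intercept + slope * x for x in test_x]

r2 = r_squared(test_y, predicted)
mse = mean_squared_error(test_y, predicted)
print(f"test R-squared: {r2:.3f}, test MSE: {mse:.1f}")
```

Because MSE is in squared price units, it is most useful for comparing models on the same data rather than as a standalone number.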

The residuals vs. fitted plot above describes the difference between the actual listing price and the predicted listing price using the model. The residuals should be distributed randomly above and below the axis, tending to cluster towards the middle without any clear patterns. These qualities are indicative of a good linear model, with a normal distribution. Normality in the residuals is important, as it is an assumption in creating a linear model. The model seems to be heteroskedastic, meaning that the spread of the residuals is not constant: the residuals get larger as the prediction moves from large to small.

The model is better at predicting prices for average priced listings at around 3

Improving the Model: Predicting Price for Highly Popular Regions

We may be able to achieve a better prediction model and higher R-squared if we segment the cities by a strong variable such as Availability. The idea is that we will have a higher fit by putting similar listings together, resulting in a stronger relationship. Recall from the exploratory analysis above that the most popular areas are those that have the smallest percent average availability in the next year.

We will segment the “highly popular” areas (more than 250 days booked in the next 365 days) from the rest of the sample. We might be able to predict price better for more popular areas. The highly popular areas we used to fit the new model are: Downtown Brooklyn, Navy Yard, Cobble Hill, Sea Gate, Brooklyn Heights, Columbia St, Carroll Gardens, Williamsburg, Boerum Hill, Prospect Heights, Greenpoint, Vinegar Hill, Windsor Terrace, South Slope, Kensington, Park Slope, Fort Greene, Flatbush, Clinton Hill, Prospect-Lefferts Gardens, Bushwick, Crown Heights, Red Hook, Midwood, Sunset Park, DUMBO, and Bedford-Stuyvesant.

This model’s coefficient of determination tells us that the linear regression model can explain around 65.1% of the total variance in the price of a listing from a popular area. At 65%, almost two-thirds of the variability can be attributed to our features. This is much higher than our original linear model using all areas. This result suggests that there is stronger linearity for popular areas in Brooklyn.
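The segmentation rule can be sketched as below. The records are invented for illustration, and the field name `availability_365` (days still open for booking in the next year) is an assumption about how the listing data is laid out.

```python
from collections import defaultdict

DAYS_IN_YEAR = 365
POPULAR_BOOKED_DAYS = 250  # threshold described in the text

# Hypothetical listing records.
listings = [
    {"neighbourhood": "Williamsburg", "availability_365": 60, "price": 150},
    {"neighbourhood": "Williamsburg", "availability_365": 100, "price": 130},
    {"neighbourhood": "Park Slope", "availability_365": 90, "price": 140},
    {"neighbourhood": "Park Slope", "availability_365": 110, "price": 120},
    {"neighbourhood": "Mill Basin", "availability_365": 300, "price": 180},
    {"neighbourhood": "Mill Basin", "availability_365": 200, "price": 160},
]

# Collect availability per area so we can average it.
availability_by_area = defaultdict(list)
for listing in listings:
    availability_by_area[listing["neighbourhood"]].append(listing["availability_365"])

# An area is "highly popular" when its average booked days exceed the threshold.
popular_areas = {
    area
    for area, days in availability_by_area.items()
    if DAYS_IN_YEAR - sum(days) / len(days) > POPULAR_BOOKED_DAYS
}

# Split the listings into the two segments used for separate models.
popular_listings = [l for l in listings if l["neighbourhood"] in popular_areas]
other_listings = [l for l in listings if l["neighbourhood"] not in popular_areas]
print(sorted(popular_areas))
```

Each segment can then be fitted separately, which is what makes the comparison between the "all areas" model and the "popular areas" model possible.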

Random Forest Regression for Popular Areas Model

Next, we will be using the Random Forest Regression model on our popular areas data to see if it results in a better fitting model. The Random Forest model uses an algorithm that ‘bootstraps’ (takes many random samples, with replacement, from the training data) to build the ‘nodes’ of many decision or regression trees, then averages the trees’ predictions to model the behavior of the data. Unlike linear regression, it does not assume a linear relationship between the features and price.

After implementing the Random Forest Regression, the fit does not improve on the linear model for the highly popular sample set. The model has an R-squared of 55.6%; however, we did manage to reduce the MSE.
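The bootstrapping idea can be sketched with a heavily simplified stand-in for a random forest: many one-split regression trees ("stumps"), each fitted to a bootstrap sample, with their predictions averaged. This is an illustration under stated assumptions, not the model fitted in this analysis. A real random forest grows deeper trees and also samples features at each split, and the data and helper names here are hypothetical.

```python
import random

def fit_stump(xs, ys):
    """Fit a one-split regression tree: (threshold, left_mean, right_mean)."""
    overall = sum(ys) / len(ys)
    best = (None, overall, overall)  # fallback when no split is possible
    best_sse = sum((y - overall) ** 2 for y in ys)
    # Try every distinct value except the largest, so both sides are non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((y - left_mean) ** 2 for y in left)
        sse += sum((y - right_mean) ** 2 for y in right)
        if sse < best_sse:
            best_sse = sse
            best = (threshold, left_mean, right_mean)
    return best

def predict_stump(stump, x):
    """Predict with a single stump."""
    threshold, left_mean, right_mean = stump
    if threshold is None or x <= threshold:
        return left_mean
    return right_mean

def fit_forest(xs, ys, n_trees=50, seed=0):
    """Fit one stump per bootstrap sample of the training data."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        # Bootstrap: draw n row indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest

def predict_forest(forest, x):
    """Average the predictions of all stumps."""
    return sum(predict_stump(stump, x) for stump in forest) / len(forest)

# Hypothetical listings: number of guests and nightly price.
guests = [1, 2, 2, 3, 4, 4, 5, 6, 7, 8]
prices = [60, 75, 80, 95, 130, 120, 150, 170, 200, 230]

forest = fit_forest(guests, prices)
small = predict_forest(forest, 2)
large = predict_forest(forest, 7)
print(f"predicted price for 2 guests: {small:.1f}, for 7 guests: {large:.1f}")
```

Because each tree sees a slightly different sample, the averaged prediction is smoother and less sensitive to individual listings than any single tree.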

Conclusion

The linear regression model reveals some interesting attributes regarding the relationship between price and various characteristics of listings, which can be leveraged to help business strategy for Airbnb.

There is a big difference in the way the appreciation of home values impacts the price of an Airbnb listing over a 5-year and a 10-year timeframe. This trend is present in all the models we looked at, but let us use our ‘Popular Areas’ model as an example here (the second model we created). Net of all other variables, the price of a listing increases 45 dollars for every one-point increase in home value over five years. This is a stark contrast to the -301 coefficient for the same measure over 10 years. This suggests that as home values increase over a longer period of time, listing prices fall, possibly because supply spikes and drives down prices.

The 10-year figure may not be an accurate way to analyze the listing value, as the real estate market can change dramatically in this time. The 5-year may be a more reasonable indicator. Given that most listings are around 100 dollars a night, the fact that home values can impact the price by 45 dollars seems like a significant variable to investigate.

In terms of listing variables, the size of the listing seems to be one of the biggest influences on the price of a listing. The number of people the listing accommodates and the number of bathrooms have some of the largest regression coefficients in the model. This relationship demonstrates how hosts and guests value listings. One way Airbnb could use this data is to control the average price of listings in an area. If they want to create more listings at lower prices for certain areas, Airbnb could have hosts post listings that accommodate fewer people. This relationship suggests that it may be a good next step to segment the data based on the size of listings to determine if there are more relationships to explore.

The model and exploratory analysis conducted in this project provided an exciting look into the behavior of Airbnb listings for popular travel destinations in Brooklyn. However, the model and analysis could be greatly improved upon with data from more cities. Improved correlations between variables and clearer relationships could be established with more listing information from a wider variety of cities. The conclusions established from this analysis would also be more significant. Another expansion of this analysis would be to take it beyond the major travel destinations in Brooklyn. The data used here does not provide a look into the nature of the Airbnb market in mid-sized and smaller areas of Brooklyn. There could be an untapped market for short-term rentals or local travel in these areas that should be explored.

Sources

[1] Inside Airbnb. Accessed at:
http://insideairbnb.com/get-the-data.html

[2] Zillow Home Value Index. Accessed at:
https://www.zillow.com/research/data/

GitHub Code

Full GitHub code with explanation here