Final Project for CPLN 505 – Planning by Numbers, Spring 2017
Predicting Capital Bikeshare Trips by Low-Income Users

Introduction

Capital Bikeshare (CaBi) is a bike sharing system serving the Washington, DC region. The system has more than 3,700 bikes at 439 stations spread out across five jurisdictions: Washington, DC (249 stations); Montgomery County, MD (66 stations); Arlington County, VA (85 stations); Fairfax County, VA (29 stations); and Alexandria City, VA (32 stations). Since its inception in September 2010, CaBi’s 31,000+ members have taken 15.4 million trips on the system. Annual membership costs $85, but riders can also opt to pay by the ride or by the day. Annual members are allowed 30 minutes per ride before being charged overage fees, while single-use riders are allowed 60 minutes per ride before being charged overages.

Per the 2016 CaBi Member Survey Executive Report, 80% of all rides on Capital Bikeshare are taken by registered members.¹ The same report found that CaBi riders tended to be “considerably younger, more likely to be male, Caucasian, and slightly less affluent,” compared with all commuters in the region (this author posits that the lower income levels of CaBi riders are most directly related to their younger ages). Though the system has been a success in terms of financial solvency and growth rates, the 2016 Member Survey showed that ridership is lacking among some of DC’s less-privileged communities.

To grow CaBi’s ridership in underserved populations, the system launched a new membership program in January 2017, called the Community Partners Program (CPP), co-sponsored by Capital Bikeshare and a handful of participating local non-profit organizations. Participants in the CPP receive an annual membership for just $5, and are allowed rides of 60 minutes before being charged overage fees. To date, the CPP has about 200 participants (though only 76 appear in the data set analyzed). The program is currently limited to residents of Washington, DC and Arlington County, VA.
This analysis first aims to create a multivariate linear regression model to predict the number of bikeshare trips per bicycle dock that originate from a given census tract. It then assesses whether this model is a good fit for predicting trips per dock by CPP riders that originate from a given census tract. The analysis uses data from the DC Department of Transportation on the duration and start/end points of trips on Capital Bikeshare during February and March 2017, along with demographic data from the US Census.

¹ https://d21xlh2maitm24.cloudfront.net/wdc/CapitalBikeshare_2016MemberSurvey_Executive-Summary.pdf?mtime=20170303165533
Summary Statistics

Before creating the regression models, this analysis first looked at summary statistics for the data set. Key findings include:
• There were 471,569 trips on CaBi during February and March of 2017
• Of those trips, 1,567 (or 0.33%) were by CPP members
• CPP members took trips from only 192 of 439 total stations
• Each CPP member in the data set took an average of 10 trips during February and March 2017, with a maximum of 112 trips by a single CPP member
• 93% of trips by CPP members originated in DC, compared with 6% in Virginia and only 1% in Maryland (Fig. 1)
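As a quick check on the shares implied by the counts listed above, the arithmetic can be sketched in a few lines. The variable and function names are illustrative only; the counts are the ones reported in the summary statistics.

```python
# Counts reported in the summary statistics above.
total_trips = 471_569
cpp_trips = 1_567
stations_total = 439
stations_with_cpp = 192


def pct(part, whole):
    """Return part / whole expressed as a percentage."""
    return 100.0 * part / whole


cpp_trip_share = pct(cpp_trips, total_trips)
cpp_station_share = pct(stations_with_cpp, stations_total)

print(f"CPP share of all trips: {cpp_trip_share:.2f}%")
print(f"Stations with at least one CPP trip: {cpp_station_share:.1f}%")
```

CPP trips are a fraction of one percent of all trips, which is why the later models treat CPP ridership as a small-sample problem.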
This analysis then compared rides by CPP members to rides by Registered members and Casual members (single-ride or one-day pass) using a series of t-tests of means to determine differences between the subsets (Fig. 2). These t-tests showed that CPP members tend to take rides of longer duration, but of shorter average distance (for all trips starting and ending in different locations), than Registered members.² CPP riders are also more likely than Registered members to start and end trips at the same station, suggesting that they are more likely to use the system for recreation.
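The comparison of means described above can be sketched as follows. The text does not say which variant of the t-test was used, so this sketch assumes Welch's unequal-variance form; the duration values are invented for illustration and are not taken from the data set.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, approximate degrees of freedom) for Welch's t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each sample mean (sample variance / n).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df


# Hypothetical trip durations in minutes (NOT from the CaBi data).
cpp_durations = [22.0, 31.5, 18.0, 40.2, 27.3, 35.1]
registered_durations = [11.0, 9.5, 14.2, 12.8, 10.1, 13.4, 8.9]

t_stat, dof = welch_t(cpp_durations, registered_durations)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

A positive t statistic here means the first group has the larger mean; the p-value would then be read from a t distribution with the returned degrees of freedom.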
² Distances were calculated assuming a straight line between the start and end stations, so they do not fully reflect the actual distance traveled by the rider. For future studies, distance derived from observed GPS data for each ride, or measured as the shortest path along the street network, would be a superior measure.
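A straight-line distance between two stations can be approximated from their coordinates. The sketch below assumes the haversine (great-circle) formula; the original calculation was done in ArcGIS and may have used a projected planar distance instead. The coordinates are hypothetical.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


# Hypothetical start and end station coordinates (illustrative only).
start = (38.9000, -77.0300)
end = (38.9100, -77.0200)
print(f"{haversine_miles(*start, *end):.2f} miles")
```

Because this ignores the street network, it gives a lower bound on the distance actually ridden, and it is zero for trips that start and end at the same station.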
Methodology

The main source for this analysis was data on Capital Bikeshare trips, downloaded from the DC Open Data website. This dataset includes information on the start and end points of every trip, as well as the trip’s duration (rounded to the nearest second, to match the format of the CPP data). Each trip was aggregated to its corresponding census tract of origin using ArcGIS. This analysis was performed at the tract level rather than the station level due to a lack of sufficient data for CPP rides, as rides by CPP members originated at fewer than half of all CaBi stations during the two months analyzed here.

Based on the results of the summary statistics, this analysis compared trips by CPP members to trips taken by Registered members (excluding trips by CPP members), as Registered trips were far closer to CPP trips in average duration and distance than trips by Casual users were, and were thus most likely to be predictive of trips by CPP members.

This analysis also used spatial information on the location of Capital Bikeshare and Metro (subway) stations from DC Open Data. ArcGIS was used to calculate key variables including: the number of CaBi stations within the tract and within ½ mile of the tract; the number of bicycle docks within the tract; whether the tract has a Metro station; and the mean distance and duration of the trips originating from each tract.

Information for rides by Community Partners Program members in February and March 2017 was provided by Kim Lucas of the DC Department of Transportation, and includes the same information as the system-wide bikeshare trip data set.

The bikeshare trip data sets were then combined with demographic data from the 2015 American Community Survey 5-year estimates. This included information on: age of population; mode of commute to work; household income; poverty; and number of students living in the tract.
Variables were combined with each other to produce operationalized independent variables including: population density; percent of commuters by mode (walk, bike and public transit); and the ratio of males to females. For the dependent variable, the number of trips originating from each census tract was divided by the number of docks in that tract, to produce a metric for the number of trips generated per station dock. Normalizing by the number of docks helps to account for the fact that more trips are likely to originate where there are more potential bikes. The variables considered for the final model are listed in Fig. 3.

Fig. 3: Variables considered for the final model
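The trips-per-dock normalization described above can be sketched as follows. The tract identifiers and counts are hypothetical, and returning `None` for tracts without docks is an assumption of this sketch rather than something stated in the text.

```python
def trips_per_dock(trips, docks):
    """Trips originating in a tract divided by the docks in that tract.

    Returns None when the tract has no docks, since the rate is undefined.
    """
    if docks <= 0:
        return None
    return trips / docks


# Hypothetical tracts: (tract id, trips originating, docks in tract).
tracts = [("A", 1200, 60), ("B", 300, 15), ("C", 0, 0)]

rates = {tract_id: trips_per_dock(trips, docks)
         for tract_id, trips, docks in tracts}
print(rates)
```

Tracts A and B differ fourfold in raw trip counts but have the same rate per dock, which is the effect the normalization is meant to capture.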
The analysis proceeded with pairwise correlation tests between potentially relevant independent variables and the number of trips per dock, both to identify variables for testing in the final model and to identify collinearity between independent variables (Fig. 4). The pairwise correlation tests revealed that several variables lacked a statistically significant correlation to the number of trips per dock, including population in 2015, the number of students, and the average speed. They also revealed significant collinearity among the number of docks, the number of bikeshare stations within a census tract, and the number of bikeshare stations within ½ mi of the census tract, meaning that only one of these three variables should be considered in the final model. The highest correlations to the number of trips per dock were found for the number of bikeshare stations within ½ mi of a census tract (0.80) and the percent of workers who walk to work in the census tract (0.67).
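The pairwise screening described above can be sketched with a plain Pearson correlation. The column names, the values, and the 0.8 cutoff used to flag collinear pairs are all assumptions made for illustration; the text does not state the threshold that was applied.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


def collinear_pairs(columns, threshold=0.8):
    """Return pairs of column names whose |r| meets or exceeds the threshold."""
    names = list(columns)
    flagged = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if abs(pearson_r(columns[first], columns[second])) >= threshold:
                flagged.append((first, second))
    return flagged


# Hypothetical tract-level values (NOT from the study data).
columns = {
    "docks": [10, 20, 30, 40],
    "stations_in_tract": [1, 2, 3, 4],
    "pct_walk": [5.0, 3.0, 9.0, 4.0],
}
flagged = collinear_pairs(columns)
print(flagged)
```

When a pair is flagged, only one of the two variables would be carried forward into the regression.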
The independent variables that showed correlation to the dependent variable (number of non-CPP trips per dock by census tract) were included in a backward selection model to identify the variables with the strongest collective associations to the dependent variable. The intercept was suppressed in the model, to match the reality that no census tract starts with an inherent (negative) number of trips. The result is the first model shown in Fig. 5.

The backward selection model had an impressive r² value of 0.8815, meaning it explained about 88% of the variation in bikeshare trips per dock across tracts. However, the significance of some of the variables in the model was in doubt. Therefore, this analysis continued by manually removing individual variables to further explore the relationship between each variable and the number of trips per dock, starting with the variables with the smallest standardized coefficients (by absolute value). Any variables that did not meaningfully change the r² value, positively or negatively, were removed from the model, including median income, percent males, the sex ratio, and the number of residents between ages 18 and 24. The number of CaBi stations within a half mile of the census tract was added back into the model to control for induced demand from other nearby stations, based on the high correlation between this variable and the number of trips per dock identified by the pairwise correlation tests. The 95% confidence intervals for coefficients were calculated in Excel using the TINV function. The best model for predicting trips by non-CPP riders at the tract level is the second model shown in Fig. 5.

The key question in this analysis is whether the data for non-CPP trips is a good predictor of the number of CPP trips originating in the tract. Thus, upon settling on a best model to explain non-CPP trips, the analysis used the same variables to fit a model for the number of CPP trips per dock by census tract. To compare coefficients between models, the number of CPP trips was scaled by the ratio of non-CPP trips to CPP trips before running the model. The resulting model is the final model shown in Fig. 5. The models for non-CPP and CPP trips were compared by assessing whether the 95% confidence intervals around the coefficients overlapped.

Finally, to display the data in map format, the models were used to predict the number of non-CPP and CPP trips by census tract. Because the model for CPP trips had been scaled up for comparison’s sake, the resulting prediction for CPP trips was divided by the ratio of non-CPP trips to CPP trips to get a prediction of CPP trips per dock by tract. The amount of error by tract was calculated by taking the difference between predicted trips per dock and actual trips per dock for both non-CPP and CPP rides. The data was mapped using ArcGIS (Figs. 6–11).
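Fitting a regression with the intercept suppressed, as described above, can be sketched in plain Python by solving the normal equations. The software used for the original models is not named in the text, so this is only an illustration; the data are invented. One point worth noting: with no intercept, many statistics packages report an uncentered r² (residuals compared against the raw sum of squares of y rather than the mean-centered one), which tends to be higher than the usual value. Whether the r² values in Fig. 5 are of this kind is not stated.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_through_origin(rows, y):
    """Ordinary least squares with no intercept term."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def uncentered_r2(rows, y, coefs):
    """1 - SS_res / sum(y^2); y is not mean-centered because there is no intercept."""
    fitted = [sum(c * v for c, v in zip(coefs, r)) for r in rows]
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return 1 - ss_res / sum(yi ** 2 for yi in y)


# Hypothetical tracts: columns are two made-up predictors; y is trips per dock.
rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [7.0, 2.0, 9.0, 16.0]

coefs = fit_through_origin(rows, y)
r2 = uncentered_r2(rows, y, coefs)
print(coefs, r2)
```

The toy data were constructed to fit exactly, so the recovered coefficients match the values used to generate y; real tract data would leave residuals and an r² below 1.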
Discussion of Results

The r² value for the best model is 0.8556, suggesting that the model accounted for about 86% of the variation in non-CPP trips per dock. The strongest predictors in the best model for non-CPP trips, as judged by comparing the absolute value of standardized coefficients, are the percentages of residents walking and biking to work. For each unit increase in the percentage of residents biking to work, a census tract can expect approximately 7 additional trips per dock to originate in the tract. For each unit increase in the percentage of residents walking to work, a census tract can expect approximately 2 additional trips per dock to originate in the tract. Unlike the other independent variables, the percentage of population over the age of 65 is negatively associated with the number of trips per dock. For each unit increase in the percentage of population over the age of 65, a census tract can expect 92 fewer trips per dock to originate in the tract.

Given the strong correlation identified earlier between the number of trips per dock and the number of CaBi stations within a half mile of the tract, it is surprising that the number of CaBi stations within a half mile is of questionable significance, with the 95% confidence interval crossing 0. Due to the strong significance of both walking and biking to work, this analysis posits that these two factors are serving as a proxy for built environment factors that make bikeshare a more convenient mode of travel. In other words, the same environments that are favorable for residents to walk or bike to work also make a favorable environment for producing many bikeshare trips.

Using the same variables to predict CPP trips per dock by census tract produced a model with an r² of 0.3413, meaning that the model only accounted for about 34% of the variation in the number of CPP trips per dock by census tract. Of the five independent variables included in both regression models, only the percentage of residents age 65 or older was found to be significant in both models. Even so, the association with the percentage age 65 and older was negative in the non-CPP trip model and positive in the CPP trip model. Other studies have shown that people 65 and over are less likely to travel by bicycle, and thus a positive association with the number of bikeshare trips is the opposite of what should be expected. The confidence intervals for all four other variables straddled 0 and had p-values greater than 0.1, and thus these variables are not significantly associated with CPP bikeshare trips per dock by census tract.

To determine whether the model for non-CPP trips per dock by census tract was a good predictor of CPP trips per dock by census tract, the analysis compared the 95% confidence intervals around the regression coefficients (with CPP trips scaled by the ratio of non-CPP trips to CPP trips).
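The interval comparison described above can be sketched as follows. The coefficients and standard errors are invented for illustration, and the critical value of 1.98 assumes roughly 120 degrees of freedom; it is the kind of two-tailed value Excel's TINV(0.05, df) returns, but the actual degrees of freedom of the models are not given in the text.

```python
def confidence_interval(coef, std_err, t_crit):
    """Return (lower, upper) bounds of coef +/- t_crit * std_err."""
    half_width = t_crit * std_err
    return (coef - half_width, coef + half_width)


def intervals_overlap(a, b):
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def straddles_zero(interval):
    """True when the interval contains 0, i.e. the coefficient is not significant."""
    return interval[0] <= 0.0 <= interval[1]


T_CRIT = 1.98  # assumed two-tailed 5% critical value for about 120 degrees of freedom

# Hypothetical (coefficient, standard error) pairs for one variable in each model.
non_cpp_ci = confidence_interval(7.0, 1.5, T_CRIT)
cpp_ci = confidence_interval(3.0, 4.0, T_CRIT)

print(non_cpp_ci, cpp_ci)
print("overlap:", intervals_overlap(non_cpp_ci, cpp_ci))
print("CPP interval includes 0:", straddles_zero(cpp_ci))
```

In this made-up case the two intervals overlap only because the second one is very wide and includes 0, so overlap by itself is weak evidence that the two coefficients agree.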
The confidence intervals for every independent variable except the percentage over age 65 overlapped the confidence intervals of the corresponding regression coefficients for non-CPP trips. That overlap does not indicate agreement between the models, however: the CPP intervals for those four variables are wide enough to include 0, and the one variable that is significant in both models has the opposite sign. Taken together with the low r² of the CPP model, this suggests that the variables most associated with non-CPP trips are poor predictors of CPP trips per dock by census tract.

The map in Fig. 11 shows that the model tended to predict lower numbers of CPP trips per dock by tract in the Southeast quadrant of DC than were observed in the data set. This pattern is not surprising given the relative concentration of the DC area’s low-income residents in the Southeast quadrant of DC.

This analysis could benefit from the inclusion of data on certain missing factors. Most importantly, including a more direct inventory of built environment features, such as the length of bike lanes within the census tract, a “bike score”, or the topography of the tract, could separate the effect of the number of bikers and walkers in the tract from specific features of the built environment. While the characteristics of census tract residents have a significant association with bikeshare trips per dock originating in the tract, this analysis fails to account for commercial factors that may also be associated with bikeshare trips. Census tracts with no permanent residents were excluded from the model, including the tract containing the National Mall, which has one of the highest rates of bikeshare trip generation of any census tract in the Washington, DC area. Potential additional explanatory variables for further study include the number of jobs in the census tract, as well as the percentage of the land area dedicated to residential and commercial uses.

The fact that the CPP is currently limited to residents of Washington, DC and Arlington County, VA also complicates this analysis, as this likely skews CPP ridership toward those two jurisdictions.
Conclusion

This analysis determined that data for non-CPP trips per dock by census tract does not sufficiently explain the number of CPP trips per dock originating from a census tract. This means that factors not captured in the best regression model from this analysis are associated with trips by CPP members. Given the small sample size of CPP trips relative to the overall data set, this analysis could be improved by including more data on travel behavior by CPP members as it becomes available. It would also have more practical value for predicting CPP trips if run at the station level once sufficient data on those trips has been collected. Though this model has strong predictive power for non-CPP riders, it does not do a good job of predicting CPP ridership, and therefore more data is needed to account for CPP travel behavior in future models.

Fig. 6
Using multivariate regressions to predict Capital Bikeshare ridership for low-income users enrolled in the Capital Partners Program. Final p...
Published on Feb 9, 2018