Jakarta Smart City Traffic Safety Technical Report

Partner(s): Jakarta Smart City (JSC) and United Nations Pulse Lab Jakarta (PLJ)
Project Manager: Katy Dupre
Technical Mentor: Joe Walsh
Fellows: Alex Fout, Joao Caldeira, Aniket Kesari, Raesetje Sefala

Abstract
I. Problem Background and Social Impact
II. Data Science Problem Formulation
III. Data Summary
IV. Analytical Approach
V. Evaluation Methodology
VI. Deployment Approach
VII. Value Delivered
VIII. Limitations
IX. Next Steps
X. Appendix

Abstract

The Data Science for Social Good (DSSG) Fellowship at the University of Chicago partnered with Jakarta Smart City (JSC) and United Nations Pulse Lab Jakarta (PLJ) to deliver a video analysis pipeline for improving traffic safety in Jakarta. Over the summer, the fellows created a pipeline that transforms raw traffic video footage into a format ready for analysis. The results are stored in a database, which will help the city of Jakarta analyze traffic patterns throughout the city. By analyzing these patterns, our partners will better understand how human behavior and built infrastructure contribute to traffic challenges and safety risks. With an improved understanding of what is happening, when and where it is happening, why it is happening, and who is involved, the city will be able to design and implement informed medium- and long-term interventions to improve safety and reduce congestion throughout the city.

I. Problem Background and Social Impact

Jakarta Smart City, in cooperation with United Nations Global Pulse (Jakarta), aims to improve traffic safety by harnessing information gleaned from raw closed-circuit television (CCTV) video footage from cameras posted at various intersections throughout the city. To date, Jakarta has maintained these cameras, but the footage is too voluminous for manual monitoring. The current CCTV
network infrastructure limits the city’s ability to monitor roadway behaviors, analyze traffic patterns, and develop cost-effective and efficient solutions. To encourage effective and efficient resource allocation, the city’s CCTV network requires some degree of automation. Thus, the project partners wanted to transform this raw video footage into useful information. As such, the primary data science problem that we tackled was a machine learning problem. Specifically, we transformed unstructured data into a structured format so that it may be used for analysis.

II. Data Science Problem Formulation

Most common machine learning applications can be divided into two main classes: tasks whose main goal is to find complex patterns in large amounts of high-dimensional data, which are hard for humans to fully understand; and tasks that humans can do well, such as identifying objects in images, but that are labor-intensive and hard to scale. This project falls into the latter class. Our goal is therefore one of data creation: taking in a large amount of video data and transforming it into a format that more easily lends itself to analysis.

The core data science problem was converting unstructured video data into structured traffic data. Because our data source consisted of raw videos, we were not working with labeled data. More importantly, the fundamental question was motivated not by prediction but by identification. Our goal was to develop a machine learning pipeline incorporating object detection, classification, and description. We were interested in detecting cars, trucks, bikes, pedestrians, etc. in videos, and generating structured data that contained information about counts, direction, etc. Our primary goal was to identify object characteristics, capture additional details, assign structured values, and store that information in a database that can be easily integrated into the partner’s environment.
The motivation behind this goal was to provide our partner with a baseline foundation designed for scalability. Thus, our choice of algorithms used both supervised and unsupervised techniques. We deployed pre-trained models to classify objects into categories of interest. Our initial models were trained on general datasets that contained images of commonplace objects. We also made extensive use of techniques that segmented images, detected corners of objects, and calculated optical flow.

Overall, the key takeaway is that we were mainly interested in generating data tables that could be used for further analysis. The main machine learning components of our project came in the form of computer vision methods. These methods were helpful in distinguishing between objects in frames, calculating relevant statistics for those objects, and reporting this information in a comprehensible manner.
III. Data Summary

The primary data source was approximately 700GB of video footage taken from seven intersections in Jakarta. The relevant information can be broken down as follows:

● Metadata
  ○ Camera location/ID
  ○ Video file name and timestamps
  ○ Video subtitles to track frame-by-frame timestamps
  ○ Frame statistics (number, density, etc.)
● Extracted Features
  ○ Classes of objects (car, pedestrian, bike, etc.)
  ○ Positions of objects
  ○ Direction of objects (approximated by optical flow)
  ○ Counts of objects

IV. Analytical Approach

Our analytical approach can primarily be broken down into three categories:

1. Data Extraction and Descriptive Analysis
2. Computer Vision Tasks
3. Pipeline Development

Data Extraction and Descriptive Analysis

Early in the summer, much of our attention was devoted to creating a system that could manage videos quickly and efficiently. Our goals were to create scripts that could reliably download new videos, upload them to a server, and produce useful summary statistics. We used Amazon Web Services (AWS) infrastructure to manage these tasks.

Download New Videos

We provide a Python script that downloads new videos from the Bali Tower CCTV cameras across Jakarta. The script includes URL information for 18 cameras, including the 7 that were initially provided to us. It loops through the provided list of cameras and downloads videos from each. The script takes as its arguments the list of CCTV cameras, the amount of video the user would like to download (in hours), and the starting time. Please note that Bali Tower makes only the previous 48 hours of video footage available at any given time.
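The argument handling described for the download script can be sketched as follows. This is a minimal illustration, not the project's actual script: the function name `plan_downloads`, the camera dictionary, and the placeholder URLs are all hypothetical, and the only fact taken from the report is the 48-hour availability window. The sketch plans one download job per camera and rejects start times that fall outside the window.

```python
from datetime import datetime, timedelta

# From the report: only the previous 48 hours of footage are available.
RETENTION = timedelta(hours=48)


def plan_downloads(cameras, hours, start, now):
    """Return one (camera_id, url, start, end) job per camera.

    `cameras` maps a camera ID to a base URL. Both the IDs and the URLs
    are hypothetical placeholders for this sketch.
    """
    if hours <= 0:
        raise ValueError("hours must be positive")
    if start < now - RETENTION:
        raise ValueError("start time is older than the 48-hour window")
    # Footage from the future does not exist yet, so cap the end at `now`.
    end = min(start + timedelta(hours=hours), now)
    if end <= start:
        raise ValueError("start time must be in the past")
    # Loop through the provided cameras in a stable order.
    return [(cam_id, url, start, end) for cam_id, url in sorted(cameras.items())]
```

A caller would pass the current time explicitly (for example `datetime.now()`), which keeps the window check easy to test.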
Upload Videos to an AWS S3 Server and Postgres Database

We also provided the scripts necessary to work with server and database infrastructures. Because of the volume of videos that the partners will eventually deal with, we explored potential storage solutions. Eventually, we settled on Amazon Simple Storage Service (S3) as our preferred place for storing newly downloaded videos. S3 has several advantages, including effectively unlimited storage and faster transfer times than our Network File System. We provided scripts that upload video files from a local directory to S3, as well as scripts that can retrieve files from S3 and obtain information about them.

We supplied scripts that create the database infrastructure needed to manage the output of the pipeline. We implemented these solutions in Postgres, and provide the full scripts that can be used to create a new database, populate it with the results of the pipeline, and conduct preliminary queries and analyses. If a user wishes to port these results to another database system, this should be a straightforward undertaking, as the data produced by the pipeline are organized in a structured format.

Summary Statistics

A user who wishes to learn more about the videos themselves will also find the suite of descriptive statistics scripts useful. As we started the project, we explored issues surrounding video quality and the consistency of time coverage. Because these insights were critical in informing how we approached the data, we developed a series of scripts that generate descriptive statistics about the video metadata. For instance, we wrote a script that measures frame density across the length of a video. This measure is useful for understanding when a video’s frame rate increases or decreases dramatically. A slowdown in frame rate may indicate video corruption issues (though this is not necessarily the case).
An increase in frame rate, on the other hand, usually indicates some number of dropped frames, with the remaining frames glued together at a different frame rate. When frame density is compared against time, patterns might emerge about when cameras suffer drops in quality.

We also wrote a script that extracts subtitle information from videos, with those subtitles containing datetime information. However, we were only able to implement this method for videos that were directly supplied by the partner. Videos downloaded from the Bali Tower open data portals did not contain this subtitle information, so we could not implement a similar method for them. For videos with subtitle information, we were able to provide a script that measures “skips” in the video. In some videos, the footage might skip several seconds or minutes, and any resulting analysis will be unreliable. Insofar as an analyst can detect such videos before they are
dropped into a pipeline, the analyst will be able to save considerable time by not generating potentially nonsensical results. While helpful, this subtitle-based method is not enough to completely clean the videos of corrupted frames.

Computer Vision Tasks

The computer vision tasks form the backbone of the analysis that transforms video into structured data. Many of these tasks were implemented through commonly used tools such as OpenCV. For GPU-intensive tasks, we used both Keras with a TensorFlow backend and PyTorch. At a high level, our main tasks can be summarized as follows:

a. Object Detection
   i. Recognize the existence of various objects in a frame
   ii. Construct measurements of each object’s position and spatial extent to separate different objects in the same frame
   iii. Deployment Method: YOLO3
b. Object Classification
   i. Distinguish different objects and accurately categorize them (e.g. properly label cars as cars, distinguish people on motorcycles from pedestrians, and distinguish other common vehicle categories)
   ii. Deployment Method: YOLO3
c. Motion Detection
   i. Obtain an estimate for the direction and magnitude of an object’s displacement from the previous frame
   ii. Deployment Method: Lucas-Kanade Sparse Optical Flow
d. Semantic Segmentation
   i. Distinguish different surfaces from one another (e.g. separate roads from sidewalks)
   ii. Deployment Method: WideResNet38 + DeepLab3

In terms of particular algorithms and methods, we deployed the following:

Methods

● YOLO3 - We used YOLO3 for object detection and classification. The model outputs a bounding box for each object, as well as a predicted class. YOLO3 is not optimized for Jakarta in particular, and should be retrained with additional images that are Jakarta-specific (we include methods for generating such images).
● Lucas-Kanade Sparse Optical Flow - We used the Lucas-Kanade method for calculating optical flow. This method solves a least squares approximation of the motion in the neighborhood of a given point across two frames. Because the method is sparse, the algorithm returns a displacement vector for each tracked point (such as a detected corner) rather than for every pixel. We also provide a method for placing these points in their appropriate bounding boxes and calculating the average motion for each object. The major disadvantage of Lucas-Kanade is that it tends to fail when the magnitude of motion is large, which in our case can be a problem when traffic is moving quickly.
● WideResNet38 + DeepLab3 - We used the pre-trained WideResNet38 + DeepLab3 model to segment images. Because semantic segmentation is a slow process, we only use it for segmenting surfaces such as roads and sidewalks, rather than every object in an image. The model essentially makes a guess as to which class a particular pixel belongs to, and does so by borrowing information from surrounding pixels. The model was originally trained on the Mapillary dataset, and can be improved upon.
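The step of placing tracked points in bounding boxes and averaging their motion can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the project's code: the function names are hypothetical, boxes are assumed to be `(x_min, y_min, x_max, y_max)` tuples from the detector, and points and displacements are assumed to be parallel lists of `(x, y)` and `(dx, dy)` pairs from the optical flow step.

```python
import math


def average_motion_per_box(boxes, points, displacements):
    """Average the displacement of all tracked points inside each box.

    Returns one (mean_dx, mean_dy) tuple per box, or None for a box
    that contains no tracked points.
    """
    results = []
    for x_min, y_min, x_max, y_max in boxes:
        inside = [
            d
            for (x, y), d in zip(points, displacements)
            if x_min <= x <= x_max and y_min <= y <= y_max
        ]
        if not inside:
            results.append(None)
            continue
        n = len(inside)
        mean_dx = sum(dx for dx, _ in inside) / n
        mean_dy = sum(dy for _, dy in inside) / n
        results.append((mean_dx, mean_dy))
    return results


def direction_degrees(vector):
    """Direction of a displacement in degrees, in image coordinates
    (x to the right, y downward)."""
    return math.degrees(math.atan2(vector[1], vector[0]))
```

Averaging is a simple design choice: a single mistracked point inside a box can skew the mean, so a median or an outlier filter may be worth considering.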
Figure _: YOLO - Object Detection
Figure _: Lucas-Kanade Optical Flow - Motion Detection
Figure _: WideResNet38 + DeepLab3 - Semantic Segmentation

We also started developing several modules that did not make it into the pipeline but may be useful for future users. These would implement the following additional methods:

Methods Under Development

● Background Subtraction - We completed scripts that can apply either a Mixture of Gaussians (MOG2) or K-Nearest Neighbor (KNN) approach to background subtraction. A background subtraction process takes a range of previous frames (for example, 20) and subtracts any parts of the image that remain static across those frames. The resulting mask can improve corner detection in the Lucas-Kanade process, smooth frames with corrupted elements (e.g. ghost buses that persist in an image long after the object left the frame), and support other downstream computer vision tasks.
● Gunnar-Farneback Dense Optical Flow - We completed a script that can implement a version of Gunnar-Farneback Dense Optical Flow. Unlike the Lucas-Kanade method, Gunnar-Farneback does not have the same tendency to fail when dealing with fast moving points. However, dense methods solve optical flow equations for every point in an image, and are therefore computationally heavy and run much slower than real time. We chose to prioritize implementing Lucas-Kanade for this reason, but Gunnar-Farneback may be an appropriate choice when objects move quickly and the user is not concerned about computation speed.
● Faster R-CNN - We hoped to implement “Faster R-CNN,” where R-CNN stands for “Region-based Convolutional Neural Network.” Faster R-CNN improves on earlier R-CNN variants by replacing the “selective search” step for proposing regions with a learned region proposal network. The network places “anchors” (reference boxes of several sizes and shapes) across the image, scores how likely each is to contain an object, and refines the most promising ones into detections. We expect that a successful implementation would improve overall object detection, and therefore help with downstream processes such as classification and tracking.
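The idea behind background subtraction, as described in the list above, can be illustrated with a deliberately simplified stand-in. The sketch below is not MOG2 or KNN: it swaps in a per-pixel median over recent frames as the background model, uses plain nested lists as grayscale frames, and its class name, `history`, and `threshold` parameters are hypothetical. OpenCV provides the actual methods through `cv2.createBackgroundSubtractorMOG2` and `cv2.createBackgroundSubtractorKNN`.

```python
from collections import deque
from statistics import median


class MedianBackgroundSubtractor:
    """Simplified stand-in for MOG2/KNN: the background is the per-pixel
    median of the last `history` frames."""

    def __init__(self, history=20, threshold=25):
        self.frames = deque(maxlen=history)  # most recent frames only
        self.threshold = threshold           # minimum intensity change

    def apply(self, frame):
        """Return a binary mask (1 = foreground) for a 2D grayscale frame."""
        if self.frames:
            mask = [
                [
                    1
                    if abs(pixel - median(f[r][c] for f in self.frames))
                    > self.threshold
                    else 0
                    for c, pixel in enumerate(row)
                ]
                for r, row in enumerate(frame)
            ]
        else:
            # No history yet, so nothing can be called foreground.
            mask = [[0] * len(row) for row in frame]
        self.frames.append(frame)
        return mask
```

Pixels that stay close to their recent median are treated as static background, so only moving objects remain in the mask.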
Figure _: KNN Background Subtraction

Pipeline Development and Functionality

The core achievement of this summer was the development of an end-to-end pipeline capable of converting video frame information into a format ready for analysis. The major improvement that we offered in this regard was the implementation of a “streaming” approach to handling videos instead of a “batch” approach. Our implementation is faster than a batch approach and modular, therefore allowing a user to add and replace modules with ease.

The streaming approach achieves considerable efficiency gains over the batch approach. A batch approach would be the equivalent of processing an entire video in one script. While this sort of process is fairly easy to code initially, it creates a number of problems for future users. This type of code is easy to break with small changes, and even experienced users would struggle to add new modules without potentially compromising another part of the script. Moreover, a batch
process moves slowly, and it would be difficult to skip parts of the pipeline or determine where bottlenecks are occurring. In contrast, our streaming approach overcomes these challenges, thus future-proofing the pipeline for end users.

The streaming approach breaks a video into individual frames at the beginning of the pipeline. It then passes these frames through a system of workers and queues. Essentially, each worker is given a particular “task” (e.g. object detection) that it performs on each frame. Once it finishes a task, it sends that frame to the next queue, where the frame waits until the next worker is ready to process it. Frame order is preserved, and at the end, a worker puts the frames back together to output the original video with any new annotations or analysis. The workers also output quantitative information about object counts, direction, etc. that can be loaded into a database. See Figure __ for a conceptual overview of the streaming system, and Figure __ for an illustration of our particular pipeline workers.
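The worker-and-queue pattern described above can be sketched with Python's standard `queue` and `threading` modules. This is a minimal illustration under stated assumptions rather than the project's implementation: the names `run_worker` and `run_pipeline` are hypothetical, each stage has a single worker thread, and a task is any function that takes a frame and returns it with new information attached. With one worker per stage and first-in-first-out queues, frame order is preserved.

```python
import queue
import threading

_STOP = object()  # sentinel that marks the end of the frame stream


def run_worker(task, inbox, outbox):
    """Apply `task` to every frame from `inbox` and pass it to `outbox`."""
    while True:
        item = inbox.get()
        if item is _STOP:
            outbox.put(_STOP)  # tell the next stage to shut down too
            return
        outbox.put(task(item))


def run_pipeline(frames, tasks, queue_size=8):
    """Stream frames through one worker per task and collect the results."""
    queues = [queue.Queue(maxsize=queue_size) for _ in range(len(tasks) + 1)]
    workers = [
        threading.Thread(target=run_worker, args=(task, queues[i], queues[i + 1]))
        for i, task in enumerate(tasks)
    ]

    def feed():
        # Runs in its own thread so that bounded queues cannot deadlock
        # against the collector loop below.
        for frame in frames:
            queues[0].put(frame)
        queues[0].put(_STOP)

    feeder = threading.Thread(target=feed)
    for thread in workers + [feeder]:
        thread.start()

    results = []
    while True:
        item = queues[-1].get()
        if item is _STOP:
            break
        results.append(item)

    for thread in workers + [feeder]:
        thread.join()
    return results
```

Adding or replacing a module amounts to changing the `tasks` list, which is the modularity benefit described above. A production version would also need error handling, since an exception inside a task would stall this sketch.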
Figure _: Sample Stream Processing Pipeline Logic
Figure _: Current State of Our Pipeline
V. Evaluation Methodology

We evaluated object detection, classification, and motion detection by comparing our model outputs to the “ground truth.” In this case, the ground truth was hand-labeled objects. We used a tool called the “Computer Vision Annotation Tool” (CVAT) to collect object labels. CVAT presents the user with a video clip, and allows the user to draw bounding boxes and frame-by-frame trajectories for each object in a video. Our partners, Jakarta Smart City, assisted with the labeling of these videos.

For detection, we used precision and recall to determine whether the detector picked up objects. In this case, recall is the proportion of hand-labeled objects that the model detected. High recall means the model found most of the objects. Precision is the proportion of the model’s detections that correspond to a hand-labeled object. High precision means we can have confidence that an object detected by the model is actually an object. There is a tradeoff between precision and recall, where detecting more objects also leads to more mistakes.

Ideally, the box drawn by the model will exactly align with the box drawn by the human, but in practice there will be differences. We used an “Intersection Over Union” (IOU) approach to determine whether two boxes were the same. IOU takes the area of the intersection of two boxes and divides it by the area of their union, so it ranges from 0 (no overlap) to 1 (identical boxes). Two boxes are treated as the same object when their IOU exceeds a chosen threshold. In effect, we were interested in seeing how well our predicted boxes track actual objects in the frame. By varying the IOU threshold, we can also see how precision and recall change.

We used a similar approach for object classification. For each box, we are also interested in whether the predicted class matched its true label. For instance, if the model predicted that an object was a car, did that prediction match the hand-coded label? We combined our above metrics of precision/recall and IOU to calculate “Mean Average Precision” (mAP). mAP essentially gave us a single measure summarizing precision and recall across object types.
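The IOU and precision/recall calculations can be made concrete with a short sketch. It assumes boxes are `(x_min, y_min, x_max, y_max)` tuples and uses a simple greedy one-to-one matching between predicted and hand-labeled boxes; the function names and the matching strategy are assumptions for illustration, and the sketch stops short of computing mAP.

```python
def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    overlap_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    overlap_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = overlap_w * overlap_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def precision_recall(predicted, truth, iou_threshold=0.5):
    """Greedily match each predicted box to its best unmatched truth box."""
    unmatched = list(truth)
    true_positives = 0
    for box in predicted:
        best = max(unmatched, key=lambda t: iou(box, t), default=None)
        if best is not None and iou(box, best) >= iou_threshold:
            true_positives += 1
            unmatched.remove(best)  # each labeled object is matched once
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall
```

Raising `iou_threshold` demands tighter box alignment, which lowers both precision and recall for the same set of predictions.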
Figure _: Example Precision-Recall plot for car detection/classification in a frame
Figure _: PR Plot varying confidence threshold for class prediction. A score of 0 means the model is certain that it correctly classified the object, while a score of 1 means the model has no confidence in its classification.
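For motion, the evaluation compares an estimated displacement vector with a human-labeled one using cosine similarity. A minimal sketch of that comparison follows; the function name is hypothetical, and vectors are assumed to be `(dx, dy)` pairs. A value near 1 means the two directions agree, 0 means they are perpendicular, and -1 means they point in opposite directions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two 2D displacement vectors.

    Returns None when either vector has zero length, because the
    direction of a stationary object is undefined.
    """
    norm = math.hypot(u[0], u[1]) * math.hypot(v[0], v[1])
    if norm == 0:
        return None
    return (u[0] * v[0] + u[1] * v[1]) / norm
```

Cosine similarity ignores the magnitude of the motion, so it measures agreement in direction only.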
Figure ___: We use cosine similarity to evaluate movement. The smaller the angle between the direction assigned by our optical flow method and the direction identified by human labels, the better. Image source: Safari Books Online.

To validate movement, we used cosine similarity. The output from optical flow provides points and their displacement vectors. We provided a method to average these displacement vectors for each object, and then compared the estimated displacements to human-labeled displacements.

VI. Deployment Approach

Our primary motivation was to provide our partner with a baseline solution to one of data science’s newly emerging challenges. As technology advances, so must the data scientists tasked with obtaining, transforming, and making use of data. This project provides cities with an approach that can be applied horizontally across various Smart City initiatives. We hope that by creating a simple but reliable and effective approach to transforming unstructured video data into structured data, our efforts this summer can easily be replicated and deployed in cities across the globe.

Our approach resulted in a pipeline that converts raw, unstructured video frames into data ready for analysis. One of our main goals was to design a product that could be easily replicated by other cities facing similar challenges in making use of large, unstructured video data. We also wanted to arm our partners with baseline knowledge of how our approach to the pipeline design could be adapted for similar Smart City data challenges. The final, and most important, goal was to deliver a product that addresses a fundamental challenge of cities in today’s digital age: how can we use data science and technology to solve problems through data-informed decision making?
Assuming this pipeline is functional, the partners can then use it to merge these data with other sources that they have collected, and build on it by conducting traffic analyses that will help them in their overall goal of improving traffic safety conditions.

VII. Value Delivered
The main value added by this project was giving the partner a toolbox that can be used to analyze video data. Prior to this project, Jakarta did not systematically analyze its video data. To our knowledge, the videos were monitored primarily by human beings, but not used as data. In large part, this status quo was driven by limited resources and a lack of knowledge about how city employees could extract useful information from the footage. It also reflected open questions about how Jakarta Smart City’s constrained resources could best support the organization’s goals. To better understand how to use these resources effectively and efficiently, JSC partnered with Data Science for Social Good (DSSG) in the hope of creating scalable solutions.

Scaling manual video analysis is an enormously difficult undertaking. While one individual may be able to watch a video and occasionally detect wrongdoing, doing this across multiple videos would divide the person’s attention and drastically reduce their effectiveness. Having multiple people constantly watching videos would be cost prohibitive. And even if it were possible for multiple individuals to watch videos full time, this by itself would not guarantee that they develop macro-level insights about how behaviors and conditions impact overall traffic patterns.

Our final deliverable considerably mitigates this problem by providing the foundation to automatically convert raw, unstructured video frames into data ready for analysis. While it is not exhaustive, the delivered pipeline provides the essential building blocks for a fully realized product. We intentionally built the pipeline in a modular and highly configurable fashion so our partners can improve and adapt the pipeline to new cameras and problems for several years to come.

Our work provides two major contributions. Most immediately, the pipeline can produce high-quality traffic data at various intersections in Jakarta.
This will give city planners unprecedented insights, and inform their infrastructure planning, deployment of traffic management resources, and general traffic regulations. We expect more cities in Indonesia to deploy similar systems. Looking further ahead, our work may provide a template for other developing cities in Southeast Asia and globally that are interested in developing smart city initiatives. Our project helps illustrate the potential for using remote sensors to produce useful, interesting data and enhance data-driven insights. Such sensors are commonplace throughout the world’s major cities, and we designed the system with enough generality that Jakarta could easily become a model for other global cities to follow.

VIII. Limitations

The major limitation that any end user should be aware of is that time constraints over the summer prevented us from training and testing competing object detection and classification
models. We provided strong baseline models that worked well out of the box, but the partner will likely benefit from training models on Jakarta data. For example, our model generally fails to correctly classify “tuk-tuks.” These vehicles are quite common throughout Jakarta, but would not be counted in the default configuration. To help overcome this particular limitation, we strove to include as many tools as possible to aid with collecting more data and validating the training process. In particular, we provided detailed instructions about how to use the Computer Vision Annotation Tool (CVAT). We also provided much of the code necessary to run a model once it has been trained, and any improvements to our baseline models can be easily plugged into the pipeline.

Beyond this, the main limitation is that a number of modules could use improvement, and future users may need to add functionality that we did not envision. For instance, we chose to use sparse optical flow to approximate the direction of an object’s movement, but this choice may be inappropriate in two situations. First, if a vehicle is moving very quickly, our method may fail to pick up this motion because it relies on the assumption that changes are relatively small from frame to frame. Second, our method does not enable a user to track an object throughout its entire lifetime in a video. We explored implementing a method that tracked an object for its entire lifetime, but ultimately decided not to spend time on it, as our partners only needed information about direction at a given instant. There are potential improvements to several modules in the pipeline, and the user should be aware of these limitations before deployment.

IX. Next Steps

Given more time, we would have implemented more modules for a user to choose from. To start with, we would have fully implemented the “in progress” modules mentioned above (background subtraction, dense optical flow, and Faster R-CNN).
These modules would give a user more options to configure depending on the specific context that they are working with. Because the code for these modules was started, it would take minimal effort to integrate them into the pipeline. Thus, finishing these modules would be a natural first step.

Beyond these modules, the most obvious improvement would come from training object detection and classification algorithms that are specific to Jakarta. As mentioned earlier, there were several Jakarta-specific objects that our baseline models did not detect. Retraining these models with new examples of these sorts of objects would improve their accuracy and make them better suited to deployment in Jakarta. Similarly, if this pipeline were used in other cities, the algorithms might need to be retrained on images from those places.
Otherwise, an end user may find other modules and enhancements that they would like to plug into the pipeline. We provided a basis for a minimum viable product, but there are a number of potential enhancements that may prove useful, depending on context.

X. Appendix