
Jakarta Smart City Traffic Safety Technical Report

Partner(s): Jakarta Smart City (JSC) and United Nations Pulse Lab Jakarta (PLJ)
Project Manager: Katy Dupre
Technical Mentor: Joe Walsh
Fellows: Alex Fout, Joao Caldeira, Aniket Kesari, Raesetje Sefala

Abstract
I. Problem Background and Social Impact
II. Data Science Problem Formulation
III. Data Summary
IV. Analytical Approach
V. Evaluation Methodology
VI. Deployment Approach
VII. Value Delivered
VIII. Limitations
IX. Next Steps
X. Appendix

Abstract

The Data Science for Social Good (DSSG) Fellowship at the University of Chicago partnered with Jakarta Smart City (JSC) and UN Global Pulse (PLJ) to deliver a video analysis pipeline for the purpose of improving traffic safety in Jakarta. Over the summer, the fellows created a pipeline that transforms raw traffic video footage into a format ready for analysis. The results are stored in a database, which will help the city of Jakarta analyze traffic patterns throughout the city. By analyzing these patterns, our partners will better understand how human behavior and built infrastructure contribute to traffic challenges and safety risks. With an improved understanding of what is happening, when and where it is happening, why it is happening, and who is involved, the city will be able to design and implement informed medium- and long-term interventions to improve safety and reduce congestion throughout the city.

I. Problem Background and Social Impact

Jakarta Smart City, in cooperation with United Nations Global Pulse (Jakarta), aims to improve traffic safety by harnessing information gleaned from raw closed-circuit television (CCTV) video footage from cameras posted at various intersections throughout the city. To date, Jakarta has maintained these cameras, but the amount of footage is too voluminous for manual monitoring. The current CCTV

network infrastructure limits the city's ability to feasibly monitor roadway behaviors, analyze traffic patterns, and develop cost-effective and efficient solutions. To encourage effective and efficient resource allocation, the city's CCTV network requires some degree of automation. Thus, the project partners wanted to transform this raw video footage into useful information. As such, the primary data science problem that we tackled is a machine learning problem: specifically, we transformed unstructured data into a structured format so that it may be used for analysis.

II. Data Science Problem Formulation

Most common machine learning applications can be divided into two main classes: tasks whose main goal is to find complex patterns in large amounts of high-dimensional data, which are hard for humans to fully understand; and tasks which humans can do well, such as identifying objects in images, but which are labor-intensive and hard to scale. This project falls into the latter bucket. Our goal is then one of data creation: taking in a large amount of video data and transforming it into a format that more easily lends itself to analysis.

The core data science problem was converting unstructured video data into structured traffic data. Because our data source consisted of raw videos, we were not working with labeled data. More importantly, the fundamental question was not motivated by prediction, but rather identification. Our goal was to develop a machine learning pipeline incorporating object detection, classification, and description.

We were interested in detecting cars, trucks, bikes, pedestrians, etc. in videos, and generating structured data that contained information about counts, direction, etc. Our primary goal was to identify object characteristics, capture additional details, assign structured values, and store that information in a database which can be easily integrated into the partner's environment.
The motivation behind this goal was to provide our partner with a baseline foundation designed for scalability.

Thus, our choice of algorithms used both supervised and unsupervised techniques. We deployed pre-trained models to classify objects into categories of interest. Our initial models were trained on general datasets that contained images of commonplace objects. We also made extensive use of techniques that segmented images, detected corners of objects, and calculated optical flow.

Overall, the key takeaway is that we were mainly interested in generating data tables that could be used for further analysis. The main machine learning components of our project came in the form of computer vision methods. These methods were helpful in distinguishing between objects in frames, calculating relevant statistics for those objects, and reporting this information in a comprehensible manner.

III. Data Summary

The primary data source was approximately 700GB of video footage taken from seven intersections in Jakarta. The relevant information can be broken down as follows:
● Metadata
  ○ Camera location/ID
  ○ Video file name and timestamps
  ○ Video subtitles to track frame-by-frame timestamps
  ○ Frame statistics (number, density, etc.)
● Extracted Features
  ○ Classes of objects (car, pedestrian, bike, etc.)
  ○ Positions of objects
  ○ Direction of objects (approximated by optical flow)
  ○ Counts of objects

IV. Analytical Approach

Our analytical approach can primarily be broken down into three categories:
1. Data Extraction and Descriptive Analysis
2. Computer Vision Tasks
3. Pipeline Development

Data Extraction and Descriptive Analysis

Early in the summer, much of our attention was devoted to creating a system that could manage videos quickly and efficiently. Our goals were to create scripts that could adequately download new videos, upload them to a server, and produce useful summary statistics. We used Amazon Web Services (AWS) infrastructure to manage these particular tasks.

Download New Videos

We provide a Python script that downloads new videos from the Bali Tower CCTV cameras across Jakarta. The script includes web URL information for 18 cameras, including the 7 that were initially provided to us. It loops through the provided list of cameras and downloads videos from each. This script takes as its arguments the list of CCTV cameras, the amount of video the user would like to download (in hours), and the starting time. Please note that Bali Tower makes only the previous 48 hours of video footage available at any given time.
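The scheduling logic described above can be sketched as follows. This is a minimal illustration under stated assumptions: `build_download_jobs` is a hypothetical helper name, and the real script's camera identifiers and Bali Tower URL format are not reproduced here.

```python
from datetime import datetime, timedelta

# Bali Tower keeps only the previous 48 hours of footage available.
ARCHIVE_WINDOW_HOURS = 48

def build_download_jobs(cameras, hours, start_time, now=None):
    """Enumerate (camera, segment_start) pairs, one per requested hour of
    footage per camera, skipping segments outside the 48-hour archive window.

    Hypothetical sketch of the download script's loop; the actual script
    would turn each pair into a camera-specific download URL."""
    now = now or datetime.now()
    earliest = now - timedelta(hours=ARCHIVE_WINDOW_HOURS)
    jobs = []
    for camera in cameras:
        for h in range(hours):
            segment_start = start_time + timedelta(hours=h)
            # Only request footage the portal can still serve.
            if earliest <= segment_start <= now:
                jobs.append((camera, segment_start))
    return jobs
```

A caller would pass the camera list, the number of hours to fetch, and the starting time, mirroring the script's three arguments.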

Upload Videos to an AWS S3 Server and Postgres Database

We also provided the scripts necessary to work with server and database infrastructures. Because of the volume of videos that the partners will eventually deal with, we explored potential storage solutions. Eventually, we settled on Amazon Simple Storage Service (S3) as our preferred place for storing newly downloaded videos. S3 has several advantages, including effectively unlimited storage and faster transfer times than our Network File System. We provided scripts that will upload video files from a local directory to S3, as well as scripts that can retrieve files from S3 and obtain information about them.

We supplied scripts that create the database infrastructure needed to manage the output of the pipeline. We implemented these solutions in Postgres, and provide the full scripts that can be used to create a new database, populate it with the results of the pipeline, and conduct preliminary queries and analyses. If a user wishes to port these results to another database system, this should be a straightforward undertaking, as the data produced by the pipeline is organized in a structured format.

Summary Statistics

A user who wishes to learn more about the videos themselves will also find the suite of descriptive statistics scripts useful. As we started the project, we explored issues surrounding video quality and consistency of time coverage. Because these insights were critical in informing how we approached the data, we developed a series of scripts that generate descriptive statistics about the video metadata.

For instance, we wrote a script that measures frame density across the length of a video. This measure is useful for understanding when a video's frame rate increases or decreases dramatically. A slowdown in frame rate may indicate video corruption issues (though this is not necessarily the case).
An increase in frame rate, on the other hand, usually indicates some number of dropped frames, with the remaining frames glued together at a different frame rate. When compared against time, patterns might emerge about when cameras suffer drops in quality.

We also wrote a script that extracts subtitle information from videos, with those subtitles containing datetime information. However, we were only able to successfully implement this method for videos that were directly supplied by the partner. Videos downloaded from the Bali Tower open data portals did not contain this subtitle information, so we could not implement a similar method for them.

For frames with subtitle information, we were, critically, able to provide a script that measures "skips" in the video. In some videos, the footage might skip several seconds or minutes, and any resulting analysis will be broken. Insofar as an analyst can detect such videos before they are

dropped into a pipeline, the analyst will be able to save considerable time by not generating potentially nonsensical results. While helpful, it should be noted that this subtitle-based method is not enough to completely cleanse the videos of corrupted frames.

Computer Vision Tasks

The computer vision tasks form the backbone of the analysis that transforms video into structured data. Many of these tasks were implemented through commonly used tools such as OpenCV. For GPU-intensive tasks, we used both Keras (with a TensorFlow backend) and PyTorch. At a high level, our main tasks can be summarized as follows:

a. Object Detection
   i. Recognize the existence of various objects in a frame
   ii. Construct measurements of each object's position and spatial extent to separate different objects in the same frame
   iii. Deployment Method: YOLOv3

b. Object Classification
   i. Distinguish different objects and accurately categorize them (e.g., properly label cars as cars, distinguish people on motorcycles versus pedestrians, and distinguish other common vehicle categories)
   ii. Deployment Method: YOLOv3

c. Motion Detection
   i. Obtain an estimate for the direction and magnitude of an object's displacement from the previous frame
   ii. Deployment Method: Lucas-Kanade sparse optical flow

d. Semantic Segmentation
   i. Distinguish different surfaces from one another (e.g., separate roads from sidewalks)
   ii. Deployment Method: WideResNet38 + DeepLabv3

In terms of particular algorithms and methods, we deployed the following:

Methods
● YOLOv3 - We used YOLOv3 for object detection and classification. The model outputs bounding boxes for each object, as well as a predicted class. YOLOv3 is not optimized for Jakarta in particular, and should be retrained with additional images that are Jakarta-specific (we include methods for generating such images).

● Lucas-Kanade Sparse Optical Flow - We used the Lucas-Kanade method for calculating optical flow. This method solves a least-squares approximation of the motion in the neighborhood of a given pixel across two frames. The algorithm returns a displacement vector for each tracked point. We also provide a method for placing these points in their corresponding bounding boxes and calculating the average motion of each object. The major disadvantage associated with Lucas-Kanade is that it tends to fail when the magnitude of motion is large, which in our case can be a problem when traffic is moving quickly.
● WideResNet38 + DeepLabv3 - We used the pre-trained WideResNet38 + DeepLabv3 algorithm to segment images. Because semantic segmentation is a slow process, we only use it for segmenting surfaces such as roads and sidewalks, rather than every object in an image. The algorithm essentially makes a guess as to which class a particular pixel belongs to, and does so by borrowing information from surrounding pixels. The algorithm was originally trained on the Mapillary dataset, and can be improved upon.
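The "average motion per object" step mentioned above can be sketched as follows. This is a minimal illustration with a hypothetical function name; in the pipeline itself, the tracked points and displacements come from OpenCV's Lucas-Kanade implementation.

```python
def average_box_motion(points, displacements, box):
    """Average the displacement vectors of tracked points that fall inside
    a bounding box (x1, y1, x2, y2); returns None if no points land inside.

    `points` and `displacements` are parallel lists of (x, y) pairs, as a
    sparse optical-flow step such as Lucas-Kanade would produce."""
    x1, y1, x2, y2 = box
    inside = [d for p, d in zip(points, displacements)
              if x1 <= p[0] <= x2 and y1 <= p[1] <= y2]
    if not inside:
        return None
    n = len(inside)
    # Componentwise mean of the displacement vectors inside the box.
    return (sum(dx for dx, _ in inside) / n,
            sum(dy for _, dy in inside) / n)
```

Averaging smooths out per-point tracking noise, at the cost of blurring together objects whose boxes overlap.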

Figure _: YOLO - Object Detection     


Figure _: Lucas-Kanade Optical Flow - Motion Detection

Figure _: WideResNet38 + DeepLabv3 - Semantic Segmentation

We also started developing several modules that did not make it into the pipeline but may be useful for future users. These would implement the following additional methods:

Methods Under Development

● Background Subtraction - We completed scripts that can apply either a Mixture of Gaussians (MOG2) or K-Nearest Neighbors (KNN) approach to background subtraction. A background subtraction process takes a range of previous frames (for example, 20) and subtracts any static parts of the image across those frames. The resulting mask can improve corner detection in the Lucas-Kanade process, smooth frames with corrupted elements (e.g., "ghost" buses that persist in an image long after the object has left the frame), and aid other downstream computer vision tasks.


● Gunnar-Farneback Dense Optical Flow - We completed a script that can implement a version of Gunnar-Farneback dense optical flow. Unlike the Lucas-Kanade method, Gunnar-Farneback does not have the same tendency to fail when dealing with fast-moving points. However, dense methods solve the optical flow equations for ​every​ point in an image, and are therefore computationally heavy and run much slower than real time. We chose to prioritize implementing Lucas-Kanade for this reason, but Gunnar-Farneback may be an appropriate choice in situations where objects move quickly and the user is not concerned about computation speed.
● Faster R-CNN - We hoped to implement a "Faster R-CNN" method, which expands to "Faster Region-based Convolutional Neural Network." Faster R-CNN improves on earlier R-CNN variants by replacing their external selective-search region proposals with a learned Region Proposal Network. The network generates "anchors" that are likely to contain objects, and narrows down these anchors until an object is found. We expect that a successful implementation would improve overall object detection, and therefore help with downstream processes such as classification and tracking.
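To illustrate the idea behind background subtraction (the concept only, not the MOG2 or KNN algorithms themselves, which the scripts take from OpenCV), here is a toy model that treats each pixel's background as the median of its recent history:

```python
from collections import deque
from statistics import median

class MedianBackgroundSubtractor:
    """Toy illustration of background subtraction: model each pixel's
    background as the median of its values over the last `history` frames,
    and flag pixels that deviate by more than `threshold`.

    The pipeline's actual scripts use OpenCV's MOG2/KNN subtractors."""

    def __init__(self, history=20, threshold=25):
        self.history = deque(maxlen=history)  # sliding window of frames
        self.threshold = threshold

    def apply(self, frame):
        """`frame` is a 2D list of grayscale values; returns a 2D 0/1 mask
        where 1 marks a likely foreground (moving) pixel."""
        self.history.append(frame)
        mask = []
        for y, row in enumerate(frame):
            mask_row = []
            for x, value in enumerate(row):
                background = median(f[y][x] for f in self.history)
                mask_row.append(1 if abs(value - background) > self.threshold else 0)
            mask.append(mask_row)
        return mask
```

A static pixel matches its own median and is masked out; a vehicle passing through briefly perturbs the pixel away from the median and is flagged as foreground.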

Figure _: KNN Background Subtraction

Pipeline Development and Functionality

The core achievement of this summer was the development of an end-to-end pipeline capable of converting video frame information into a format ready for analysis. The major improvement that we offered in this regard was the implementation of a "streaming" approach to handling videos instead of a "batch" approach. Our implementation is modular, allowing a user to add and replace modules with ease, and achieves considerable efficiency gains over the batch approach.

A batch approach would be the equivalent of processing an entire video in one script. While this sort of process is fairly easy to code initially, it creates a number of problems for future users. This type of code is easy to break with small changes, and even experienced users would struggle to add new modules without potentially compromising another part of the script. Moreover, a batch

process moves slowly, and it would be difficult to skip parts of the pipeline or determine where bottlenecks are occurring.

In contrast, our streaming approach overcomes these challenges, thus future-proofing the pipeline for end users. The streaming approach breaks a video into individual frames at the beginning of the pipeline. It then passes these frames through a system of workers and queues. Essentially, each worker is given a particular "task" (e.g., object detection) that it performs on each frame. Once it finishes a task, it sends the frame to the next queue, where the frame waits until the next worker is ready to process it. Frame order is preserved, and at the end, a worker puts frames back together to output the original video with any new annotations or analysis. The workers also output quantitative information about object counts, direction, etc. that can be loaded into a database. See Figure __ for a conceptual overview of the streaming system, and Figure __ for an illustration of our particular pipeline workers.
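The worker-and-queue design just described can be sketched in a few lines. This is a simplified illustration with hypothetical names: one thread per stage connected by FIFO queues, where the real pipeline's stages would be the computer vision tasks above.

```python
import queue
import threading

SENTINEL = None  # signals the end of the frame stream

def worker(task, inbox, outbox):
    """Pull frames from `inbox`, apply `task`, and pass results to `outbox`.
    Forwards the sentinel so downstream workers also shut down."""
    while True:
        frame = inbox.get()
        if frame is SENTINEL:
            outbox.put(SENTINEL)
            return
        outbox.put(task(frame))

def run_pipeline(frames, tasks):
    """Chain one worker per task with queues between them. Frame order is
    preserved because each queue is FIFO and each stage is a single worker."""
    queues = [queue.Queue() for _ in range(len(tasks) + 1)]
    threads = [threading.Thread(target=worker, args=(t, queues[i], queues[i + 1]))
               for i, t in enumerate(tasks)]
    for t in threads:
        t.start()
    for frame in frames:
        queues[0].put(frame)
    queues[0].put(SENTINEL)
    results = []
    while True:
        out = queues[-1].get()
        if out is SENTINEL:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results
```

Because each stage is isolated behind a queue, a module can be swapped out, removed, or profiled for bottlenecks without touching the rest of the pipeline, which is the modularity advantage described above.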

  Figure _: Sample Stream Processing Pipeline Logic 

Figure _: Current State of Our Pipeline


V. Evaluation Methodology

We evaluated object detection, classification, and motion detection by comparing our model outputs to the "ground truth." In this case, the ground truth was hand-labeled objects. We used a tool called the "Computer Vision Annotation Tool" (CVAT) to collect object labels. CVAT presents the user with a video clip, and allows the user to draw bounding boxes and frame-by-frame trajectories for each object in a video. Our partners, Jakarta Smart City, assisted with the labeling of these videos.

For detection, we used precision and recall to determine whether the detector picked up objects. In this case, recall is the proportion of actual objects that the model detected: high recall means the model found most of the objects. Precision is the proportion of detections that correspond to actual objects: high precision means we can have confidence that an object detected by the model is actually an object. There is a tradeoff between precision and recall, where detecting more objects also leads to more mistakes.

Ideally, the box drawn by the model will exactly align with the box drawn by the human, but in practice there will be differences. We used an "Intersection Over Union" (IOU) approach to determine whether two boxes were the same. IOU takes the area of the intersection of two boxes and divides it by the area of their union. Basically, we were interested in seeing how well our predicted boxes track actual objects in the frame. By varying the IOU threshold, we can also see how precision and recall change.

We used a similar approach for object classification. For each box, we are also interested in whether the predicted class matched its true label. For instance, if the model predicted that an object was a car, did that prediction match the hand-coded label? We combined our above metrics of precision/recall and IOU to calculate "Mean Average Precision" (mAP). mAP essentially gave us a measure of precision/recall across object types.
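The IOU computation and the resulting precision/recall can be sketched as follows. This is an illustrative sketch: `precision_recall` uses a simple greedy one-to-one matching, which is one common convention rather than necessarily the exact matching rule our evaluation scripts use.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlapping region (zero if disjoint).
    iw = max(0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1) - inter)
    return inter / union if union else 0.0

def precision_recall(predicted, ground_truth, iou_threshold=0.5):
    """Greedily match predicted boxes to ground-truth boxes at a given IOU
    threshold; returns (precision, recall)."""
    unmatched = list(ground_truth)
    true_positives = 0
    for p in predicted:
        best = max(unmatched, key=lambda g: iou(p, g), default=None)
        if best is not None and iou(p, best) >= iou_threshold:
            true_positives += 1
            unmatched.remove(best)  # each ground-truth box matches at most once
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall
```

Sweeping `iou_threshold` (or the detector's confidence threshold) and recomputing these two numbers traces out the precision-recall curves shown in the figures below.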


Figure _: Example Precision-Recall plot for car detection/classification in a frame  

Figure _: PR plot varying the confidence threshold for class prediction. A score of 0 means the model is certain that it correctly classified the object, while a score of 1 means the model has no confidence in its classification.


Figure ___: We use cosine similarity to evaluate movement. The smaller the angle between the direction assigned by our optical flow method and the direction identified by human labels, the better. Image source: Safari Books Online.

To validate movement, we used cosine similarity. The output from optical flow provides points and their displacement vectors. We then provided a method that averages displacement vectors to measure motion detection performance, and compared these estimated displacements to human-labeled displacements.

VI. Deployment Approach

Our primary motivation was to provide our partner with a baseline solution to one of data science's newly created challenges. As technology advances, so must the data scientists tasked with obtaining, transforming, and making use of data. This project provides cities with an approach that can be applied horizontally across various Smart City initiatives. We hope that by creating a simple, but reliable and effective, approach to transforming unstructured video data into structured data, our efforts this summer can easily be replicated and deployed in cities across the globe.

Our approach resulted in a pipeline that converts raw, unstructured video frames into data ready for analysis. One of our main goals was to design a product that could be easily replicated by other cities experiencing similar challenges in making use of large amounts of unstructured video data. We also wanted to arm our partners with baseline knowledge of how our approach to the pipeline design could be easily adapted for similar Smart City data challenges. The final, and most important, goal was to deliver a product that addresses a fundamental challenge for cities in today's digital age: how can we use data science and technology to solve problems through data-informed decision making?
Assuming this pipeline is functional, the partners can then use it to merge these data with other sources that they have collected, and build on it by conducting traffic analyses that will help them in their overall goal of improving traffic safety conditions.

VII. Value Delivered

The main value added by this project was giving the partner a toolbox that can be used to analyze video data. Prior to this project, Jakarta did not systematically analyze its video data. To our knowledge, the videos were monitored primarily by human beings, but not used as data.

In large part, this status quo was driven by Jakarta Smart City's resource constraints and a lack of knowledge of how city employees could extract useful information from the footage. In order to better understand how to effectively and efficiently use these resources, Jakarta Smart City partnered with Data Science for Social Good (DSSG) in the hopes of creating scalable solutions.

Scaling manual video analysis is an enormously difficult undertaking. While one individual may be able to watch a video and occasionally detect wrongdoing, doing this across multiple videos would divide the person's attention and drastically reduce their effectiveness. Having multiple people constantly watching videos would be cost-prohibitive. And even if it were possible for multiple individuals to watch videos full time, this by itself would not guarantee that they develop macro-level insights about how behaviors and conditions impact overall traffic patterns.

Our final deliverable considerably mitigates this problem by providing the foundation to automatically convert raw, unstructured video frames into data ready for analysis. While it is not exhaustive, the delivered pipeline provides the essential building blocks for a fully realized product. We intentionally built the pipeline in a modular and highly configurable fashion so our partners can improve and adapt the pipeline to new cameras and problems for several years to come.

Our work provides two major contributions. Most immediately, the pipeline can produce high-quality traffic data at various intersections in Jakarta.
This will give city planners unprecedented insights, and inform their infrastructure planning, deployment of traffic management resources, and general traffic regulations. We expect more cities in Indonesia to deploy similar systems. Looking further ahead, our work may provide a template for other developing cities in Southeast Asia and globally that are interested in developing smart city initiatives. Our project helps illustrate the potential of using remote sensors to produce useful, interesting data and enhance data-driven insights. Such sensors are commonplace throughout the world's major cities, and we designed the system with enough generality that Jakarta could easily become a model for other global cities to follow.

VIII. Limitations

The major limitation that any end user should be cognizant of is that time constraints over the summer prevented us from training and testing competing object detection and classification

models. We provided strong baseline models that worked well out of the box, but the partner will likely benefit from training models on Jakarta data. For example, our model generally fails to correctly classify "tuk-tuks." These vehicles are quite common throughout Jakarta, but would not be counted in the default configuration.

To help overcome this particular limitation, we strove to include as many tools as possible to aid with collecting more data and validating the training process. In particular, we provided detailed instructions about how to use the Computer Vision Annotation Tool (CVAT). Otherwise, we provided much of the code necessary to run a model once it has been trained, and any improvements to our baseline models can be easily plugged into the pipeline.

Beyond this, the main limitation is that there are a number of modules that could use improvement, and future users may need to add functionality that we did not envision. For instance, we chose to use sparse optical flow to approximate the direction of an object's movement. But this choice may be inappropriate in two situations. First, if a vehicle is moving very quickly, our method may fail to pick up this motion because it relies on the assumption that changes are relatively small from frame to frame. Second, our method does not enable a user to track an object throughout its entire lifetime in a video. We explored implementing a method that tracked an object for its entire lifetime, but ultimately decided not to spend time on it, as our partners only needed information about direction in a given instant. There are potential improvements to several modules in the pipeline, and the user should be aware of these limitations before deployment.

IX. Next Steps

Given more time, we would have implemented more modules for a user to choose from.
To start with, we would have fully implemented the "in progress" modules mentioned above (background subtraction, dense optical flow, and Faster R-CNN). These modules would give a user more options to configure depending on the specific context that they are working with. Because the code for these modules was already started, it would take minimal effort to integrate them into the pipeline. Thus, finishing these modules would be a natural first step.

Beyond these modules, the most obvious improvement would come from training object detection and classification algorithms that are specific to Jakarta. As mentioned earlier, there were several Jakarta-specific objects that our baseline models did not detect. Retraining these models with new examples of these sorts of objects would improve their accuracy and make them deployable in Jakarta. Similarly, if this pipeline were used in other cities, the algorithms might need to be retrained on images from those places.

Otherwise, an end user may find other modules and enhancements that they would like to plug into the pipeline. We provided a basis for a minimum viable product, but there are a number of potential enhancements that may prove useful, depending on context.

X. Appendix

