Multi-view point cloud fusion for LiDAR-based cooperative environment detection

A key component for automated driving is 360° environment detection. The recognition capabilities of modern sensors are always limited to their direct field of view. In urban areas many objects occlude important areas of interest. The information captured by another sensor from another perspective could resolve such occluded situations. Furthermore, the capability to detect and classify various objects in the surrounding can be improved by taking multiple views into account. In order to combine the data of two sensors into one coordinate system, a rigid transformation matrix has to be derived. The accuracy of modern relative pose estimation systems, e.g. satellite based ones, is not sufficient to guarantee a suitable alignment. Therefore, a registration based approach is used in this work which aligns the captured environment data of two sensors from different positions. Thus their relative pose estimate obtained by traditional methods is improved and the data can be fused. To support this we present an approach which utilizes the uncertainty information of modern tracking systems to determine the possible field of view of the other sensor. Furthermore, it is estimated which parts of the captured data are directly visible to both sensors, taking occlusion and shadowing effects into account. Afterwards a registration method based on the iterative closest point (ICP) algorithm is applied to that data in order to obtain an accurate alignment. The contribution of the presented approach to the achievable accuracy is shown with the help of ground truth data from a LiDAR simulation within a 3-D crossroad model. Results show that a two-dimensional position and heading estimate is sufficient to initialize a successful 3-D registration process. Furthermore, it is shown which initial spatial alignment is necessary to obtain suitable registration results.


Introduction
Human drivers are able to use indirect signals, e.g. over mirrors or through the windows of other vehicles, to observe relevant areas at difficult crossroad situations. To achieve a similar understanding of the surrounding with depth sensors, one important aspect for future autonomous systems will be the cooperative exchange of environment information, denoted as car-to-car (C2C) communication. Together with the use of real-time cloud or local infrastructure based services, this will improve the recognition and reaction capabilities towards objects located outside the direct field of view (FoV) of the own sensors. For example, the green car of Fig. 1 could provide useful information about the blue truck for the red car. The foundation of such a dense data fusion is the availability of a highly accurate relative pose estimate which can be transferred into a homogeneous transformation matrix. This is hard to derive with traditional satellite based relative localization methods, but their estimate can be improved using the gathered environment data of both cars. A method for achieving this is the content of this paper, which is mainly based on the master thesis by Jähn (2014). For the rest of this paper we will refer to the red car as target and to the green car as source, and we will use the coordinate definitions of the right picture of Fig. 1.
The present paper is organized as follows: Sect. 2 provides relevant background information which is necessary to ensure comprehensibility. Afterwards the general outline and the novel ideas of the presented method are shown in Sect. 3. An evaluation of the presented approach is then given in Sect. 4. The conclusion in Sect. 5 contains a summary of all findings, its limitations as well as improvements and future work.

Depth sensors
Typical depth sensors, like LiDARs and stereo cameras, measure the surface of the surrounding environment. Using projective geometry, their range images can be converted into a point cloud which represents the surface shape of the surrounding. During this conversion the pixel-wise neighborhood information is maintained, which is useful for further processing steps like normal feature estimation. In the case that two of these sensors measure the same surface area from different perspectives, this data can be used to align them against each other if their relative pose estimate is inaccurate. The process of doing this is called registration or shape matching, which is the topic of Sect. 2.3.

Normal feature estimation
To gain information about the underlying surface which the point cloud data represents, it is very common in 3-D data processing to determine normal features, i.e. vectors assigned to each point in space representing a local plane patch of the approximated surface. We apply here the standard weighted normal averaging scheme which uses triangle combinations incorporating the neighbourhood points (Klasing et al., 2009). In this work we use a weighting factor w_j which is chosen as the reciprocal product of the distances between the neighbours q_{i,j} of p_i according to Eq. (1). We refer to this method as distance weighted cross product (DWCP).
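As an illustration, the DWCP scheme can be sketched as follows. This is a minimal interpretation only: Eq. (1) is not reproduced here, so we assume the weight of each cross product is the reciprocal product of the two neighbour distances to p_i, and we pair consecutive neighbours; the function name is ours.

```python
import numpy as np

def dwcp_normal(p, neighbors):
    """Sketch of a distance-weighted cross product (DWCP) normal estimate:
    average cross products over consecutive neighbour pairs of p, each
    weighted by the reciprocal product of the neighbour distances.
    The exact triangle selection of Eq. (1) may differ."""
    n = np.zeros(3)
    k = len(neighbors)
    for j in range(k):
        q1 = neighbors[j] - p
        q2 = neighbors[(j + 1) % k] - p
        # weight: reciprocal product of the distances to the two neighbours
        w = 1.0 / (np.linalg.norm(q1) * np.linalg.norm(q2))
        n += w * np.cross(q1, q2)
    norm = np.linalg.norm(n)
    return n / norm if norm > 0 else n
```

For points sampled from a plane, the estimate reproduces the plane normal; the weighting down-weights triangles built from distant (and thus less reliable) neighbours.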

Registration of point clouds
Registration, or shape matching, is the process of aligning two or more surfaces against each other in such a way that a mathematical metric is minimized. Although other registration methods exist (compare Pottmann et al., 2002), the most popular one is the so-called ICP algorithm introduced by Besl and McKay (1992).
The general idea is to first find point correspondences between two slightly misaligned data sets. This is usually done by searching for the currently closest point in the other data set. Secondly, a locally optimal rigid transformation matrix is calculated which minimizes a certain distance metric in order to align them against each other. Using the result of the previous iteration, this procedure is repeated: new correspondences are determined and the alignment is optimized again until some stopping criterion is fulfilled.
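The iteration loop above can be sketched in a few lines. This is a minimal point-to-point variant for illustration only (this work uses a point-to-plane metric with additional rejection steps, see below); the function names and the SVD-based transform estimate (Arun/Umeyama) are our choices.

```python
import numpy as np
from scipy.spatial import cKDTree

def best_rigid_transform(src, dst):
    """Least-squares rigid transform (R, t) mapping src onto dst via the
    SVD of the centered cross-covariance matrix."""
    cs, cd = src.mean(axis=0), dst.mean(axis=0)
    H = (src - cs).T @ (dst - cd)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:      # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, cd - R @ cs

def icp(source, target, iterations=30, tol=1e-6):
    """Minimal point-to-point ICP sketch: closest-point correspondences,
    locally optimal rigid transform, repeat until convergence."""
    tree = cKDTree(target)
    src = source.copy()
    prev_err = np.inf
    for _ in range(iterations):
        dist, idx = tree.query(src)            # closest-point correspondences
        R, t = best_rigid_transform(src, target[idx])
        src = src @ R.T + t                    # apply the local alignment
        err = dist.mean()
        if abs(prev_err - err) < tol:          # stopping criterion
            break
        prev_err = err
    return src
```

For a small initial misalignment, most closest-point correspondences are already correct and the loop converges within a few iterations.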
The original formulation of the algorithm by Besl and McKay (1992) delivers bad results if the incorporated data sets include outliers. This happens especially if the two surfaces do not overlap completely or if there are holes due to occlusion effects. As a consequence, a still growing number of different variations has been proposed. The main goals are to speed up the convergence rate and to improve the robustness against outliers. Therefore, several heuristics and modifications are applied and additional features, like normals, are incorporated. A comparison study of several methods can be found in Pomerleau et al. (2013).
In this work we utilize a point-to-plane (Chen and Medioni, 1991) based ICP method which employs additional rejection steps (Rusinkiewicz and Levoy, 2001). First, all correspondences where the angle between the corresponding normals exceeds 30° are ignored, and secondly we remove the worst 10 % according to the Euclidean distance (Pulli, 1999).
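The two rejection steps can be sketched as follows, vectorized over precomputed correspondences; the function and parameter names are ours, and the normals are assumed to be unit length.

```python
import numpy as np

def reject_pairs(src_normals, dst_normals, distances,
                 max_angle_deg=30.0, worst_frac=0.10):
    """Correspondence rejection sketch: drop pairs whose (unit) normals
    deviate by more than max_angle_deg, then drop the worst worst_frac of
    the remaining pairs by Euclidean distance. Returns kept indices."""
    # angle criterion via the dot product of the unit normals
    cosang = np.einsum('ij,ij->i', src_normals, dst_normals)
    keep = cosang >= np.cos(np.radians(max_angle_deg))
    # distance criterion: sort survivors by distance, cut the worst fraction
    idx = np.flatnonzero(keep)
    order = np.argsort(distances[idx])
    n_keep = int(np.ceil(len(idx) * (1.0 - worst_frac)))
    return idx[order[:n_keep]]
```

Both criteria are cheap and remove exactly the kind of outlier pairings that arise from partial overlap and occlusion holes.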

Method
The primary goal of this work is to include the captured point cloud data of a sensor platform, like the green car (compare Fig. 1), into the coordinate system of another, e.g. the red car. Each of them has its own coordinate system, indicated with the superscript prefix S and T respectively. Thus, an arbitrary point ^S p = (x_S, y_S, z_S)^T in the source coordinate system can be transformed into the target coordinate system using the homogeneous transformation matrix T^T_S. It follows that the corresponding point ^T p = (x_T, y_T, z_T)^T of ^S p can be calculated in homogeneous coordinates as (^T p, 1)^T = T^T_S · (^S p, 1)^T.
Unfortunately, the exact transformation matrix T^T_S is in general not known and has to be replaced by an estimate T̂^T_S. In order to achieve an initial guess of this matrix, it is further assumed that the target platform estimates the relative position and orientation of the source with a suitable, e.g. satellite based, relative localization system which delivers a state x = (x, y, z, φ, θ, ψ)^T and its uncertainty as a covariance matrix Cov. The accuracy of this transformation can be improved using a registration algorithm together with environment information captured by both sensor platforms simultaneously.
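Building the homogeneous matrix from the state vector can be sketched as follows. The Euler rotation order is not spelled out in the paper, so a ZYX (yaw-pitch-roll) convention is assumed here; the function names are ours.

```python
import numpy as np

def pose_to_matrix(state):
    """Build a homogeneous transform T^T_S from a relative pose state
    (x, y, z, phi, theta, psi). A ZYX (yaw-pitch-roll) Euler convention
    is assumed; the paper does not specify the rotation order."""
    x, y, z, phi, theta, psi = state
    cr, sr = np.cos(phi), np.sin(phi)      # roll
    cp, sp = np.cos(theta), np.sin(theta)  # pitch
    cy, sy = np.cos(psi), np.sin(psi)      # yaw / heading
    Rz = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
    Rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])
    T = np.eye(4)
    T[:3, :3] = Rz @ Ry @ Rx
    T[:3, 3] = [x, y, z]
    return T

def transform_point(T, p):
    """Map a source point into the target frame via homogeneous coordinates."""
    return (T @ np.append(p, 1.0))[:3]
```

Applying `pose_to_matrix` to the whole source cloud corresponds to the fusion step once an accurate relative pose is available.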
To achieve this, the environment data which was captured by both sensor platforms simultaneously is determined. Therefore the state and the corresponding uncertainty information, namely the variances diag(Cov) = (σ²_x, σ²_y, σ²_z, σ²_φ, σ²_θ, σ²_ψ), are used to estimate the theoretical field of view (FoV) of the source sensor within the target's coordinate system in order to determine the data within the overlapping FoV. Furthermore, occlusion effects are resolved to obtain the relevant points for the registration process within the target point cloud (denoted in the following as T) and the source point cloud (S).
Once an accurate transformation is determined, it can be applied to the whole source point cloud. Both data sets are then fused within the target coordinate system and available for further processing. An overview of the previously explained steps is given in Fig. 2, and some steps are explained in detail in the following.

Overlapping FoV estimation
The FoV is the geometrical area which is theoretically visible to a sensor, without taking occlusion effects into account. For the most typical range sensors, like automotive LiDARs, it can be approximated as a pyramid. All sensor data has to lie within this field, thus it can be used as a rough geometrical boundary. Such a pyramid can be defined for the source and the target sensor. In order to determine the overlapping part, it is necessary to transform them into the same coordinate system.

Uncertainty field of view
Problematic in this context is the accuracy of T̂^T_S: the exact origin and orientation of the source FoV are thus not known.
To overcome this problem, the track uncertainty is used to increase the source FoV such that it includes the error space, which is considered normally distributed, up to a certain confidence interval. How this can be achieved efficiently depends on its original shape. In this work a pyramid model is used, and therefore the basic idea to derive a proper uncertainty field approximation is described in the following. The full derivation can be found in Jähn (2014).
As already mentioned, the uncertainty space is assumed to be normally distributed with an expectation value of 0. Consequently, a surrounding confidence interval γ·σ would result in a six-dimensional elliptical error space. Transforming the pyramid model into this uncertainty space would result in an extremely complex mathematical shape. Therefore just the diagonal values of Cov are used to approximate it with a rectangular upper boundary space. With this simplification, the rectangular bound holds regardless of whether cross correlation values are taken into account or not. To support the understanding of the reader, a two-dimensional example is shown in Fig. 3.
Using this rectification, the confidence interval can be described just by the maximum divergence from the state space in positive and negative direction for each dimension. Consequently there are 2^6 = 64 different maximum error variations. If all of these error variations are applied to the FoV pyramid (compare Fig. 4, left panel), the resulting shape can be approximated by a frustum pyramid as shown in Fig. 4 (right panel).
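Enumerating the 2^6 = 64 corner variations can be sketched as follows. For brevity this sketch applies only the translational offsets and the heading error (a small-angle rotation about z) to the pyramid corners, and it returns the axis-aligned bounds of the union rather than the frustum pyramid the paper fits; the function name is ours.

```python
import numpy as np
from itertools import product

def uncertainty_fov_bounds(pyramid_corners, sigma, gamma=1.75):
    """Apply all 2^6 = 64 maximum error variations (+/- gamma*sigma per
    pose dimension) to the FoV pyramid corners and return the bounds of
    the union. sigma = (sx, sy, sz, s_phi, s_theta, s_psi); only
    translation and heading (psi) are applied in this sketch."""
    all_pts = []
    for signs in product((-1.0, 1.0), repeat=6):   # 64 sign combinations
        d = gamma * np.asarray(signs) * np.asarray(sigma)
        c, s = np.cos(d[5]), np.sin(d[5])          # small heading error
        Rz = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
        all_pts.append(pyramid_corners @ Rz.T + d[:3])
    all_pts = np.vstack(all_pts)
    return all_pts.min(axis=0), all_pts.max(axis=0)
```

The union over all 64 variations is what the frustum pyramid of Fig. 4 (right panel) approximates; any target point outside these bounds cannot have a corresponding source point.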
This approximation of the FoV can be used to clip those points of the target sensor which cannot have a corresponding partner within the source data set.

Point uncertainty ranges
In this section it is estimated which parts of both point clouds could be visible to both sensors, taking occlusion effects into account, e.g. if an object is located within both fields of view but is occluded by another object from just one perspective. To figure this out, the uncertainty of the initial transformation matrix T̂^T_S has to be taken into account again. Therefore the same assumptions as in the previous section, namely the rectification of the uncertainty space, are applied. This time the uncertainty space is determined for each point of the source point cloud separately, and the resulting space is approximated by a sphere around the point ^S p_i ∈ S with a radius r_i according to Eq. (3).
Transferred into the target coordinate system using T̂^T_S, these spheres represent the area where (^T p_i, 1)^T = T^T_S · (^S p_i, 1)^T could lie if T^T_S were known exactly. Further, they represent the area where corresponding points of T, captured from the same object, could be. Thus each source point which has no target point within its range defined by r_i is clipped. The other way around, each target point is clipped if it is not within at least one of the source uncertainty spheres (compare Fig. 5).
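The mutual clipping rule can be sketched as follows. The radii r_i follow the paper's Eq. (3), which is not reproduced here, so they are taken as given; checking only the nearest source point on the target side is a simplification of "within at least one sphere", and the function name is ours.

```python
import numpy as np
from scipy.spatial import cKDTree

def mutual_uncertainty_clip(source_t, target, radii):
    """Mutual clipping sketch: keep source points that have at least one
    target point within their uncertainty radius r_i, and target points
    lying inside a source sphere. source_t is assumed to be already
    transformed into the target frame by the estimated T^T_S."""
    # source side: nearest target point must lie within the sphere
    d_s, _ = cKDTree(target).query(source_t)
    keep_src = d_s <= radii
    # target side: testing only the nearest source sphere (simplification)
    d_t, idx = cKDTree(source_t).query(target)
    keep_tgt = d_t <= radii[idx]
    return keep_src, keep_tgt
```

Points failing either test cannot form a valid correspondence under the assumed pose uncertainty and are removed before the ICP step.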

Evaluation
The objective of the following evaluation is to show the applicability of the presented approach for cooperative environment recognition applications. Compared to the master thesis (Jähn, 2014) on which this paper is based, the data used for the evaluation was re-evaluated to fit the limited scope of this paper. For a more exhaustive evaluation please refer to Jähn (2014).

Evaluation strategy
To show the effects of the different steps of the presented approach, a detailed crossroad 3-D model was built with SketchUp®. This was used to simulate range data of two virtual LiDAR depth sensors, which were moved through the static model such that their fields of view overlap partially and enough non-parallel surface details for registration were included. The recorded data includes ground truth range data as well as position and orientation of both sensors. This was then superimposed by range and pose (position and orientation) noise. These point clouds were afterwards aligned, and the output was compared with the applied pose error vector. This procedure was then repeated 21 times with different data set combinations, each with 100 different range and tracking noise samples. Thus the quality of the registration process can be statistically appraised.

Range data generation
The range data was captured using a simple ray tracing approach. Each laser beam of the virtual LiDAR has been modelled as a single ray. In contrast to the simulation applied by Gabriel (2010), the complete sensor optic simulation and the 3-dimensional beam structure were not taken into account. This approach allows a much faster simulation, but it does not reflect the special properties of LiDAR sensors, e.g. that just the first echo is used. Anyhow, this simple sensor model can also be applied to other range sensors, like stereo and time-of-flight cameras, hence it is used here. For the evaluation in this work we used a LiDAR model with 40 layers, each with 451 beams, covering 72° horizontally and 10° vertically. The range image taken directly from the simulation is noiseless and therefore not very realistic. That is why it is superimposed by Gaussian noise. According to the manual (Ibeo Automotive Systems GmbH, 2008) of the Ibeo LUX laser scanner family, a typical standard deviation value σ_r is 10 cm, which is used here.
To simulate tracking noise, a random error sample from a normally distributed error space is superimposed onto x according to Eq. (8). The uncertainty standard deviation values σ_x, σ_y, σ_z, σ_φ, σ_θ, σ_ψ were taken from Table 1 to simulate different tracking conditions and qualities. As described in Sect. 3.1, this influences how the overlapping area of the sensor data is estimated.
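The combined noise model of the evaluation can be sketched as follows. Eq. (8) is not reproduced here, so additive Gaussian noise per pose dimension is assumed; the function and argument names are ours.

```python
import numpy as np

def add_simulation_noise(ranges, pose, pose_sigma, sigma_r=0.10, rng=None):
    """Noise model sketch for the evaluation: Gaussian range noise with
    sigma_r = 10 cm (as stated for the Ibeo LUX family) plus one pose
    noise sample per test cycle drawn from the diagonal tracking
    covariance (additive noise assumed, cf. Eq. 8)."""
    rng = np.random.default_rng() if rng is None else rng
    noisy_ranges = ranges + rng.normal(0.0, sigma_r, size=ranges.shape)
    pose_noise = rng.normal(0.0, np.asarray(pose_sigma))   # one x_Err sample
    return noisy_ranges, np.asarray(pose) + pose_noise
```

Drawing 100 such pose samples per data set combination reproduces the structure of the 21 × 100 test cycles described above.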

Analysis of the results
Within each test cycle the resulting alignment vector x_Alg is subtracted from the previously applied tracking noise vector x_Err, which results in the remaining error vector x_Rem (compare Eq. 9). Over all test cycles, 2100 in total, the expectation value of this vector is assumed to be 0. To allow an easy comparison between the input error space, which is normally distributed, and the output error space, which is in general not normally distributed, equivalent percentiles are used according to Table 2. These percentiles are calculated for each component of x_Rem. If the resulting value of the percentile is smaller than the equivalent interval of the input error space normal distribution, the alignment was improved during the registration.
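The equivalent-percentile comparison can be sketched as follows for one component of x_Rem; the function name is ours, and the γ·σ interval is mapped to its two-sided percentile via the normal error function.

```python
import numpy as np
from math import erf, sqrt

def percentile_improved(x_rem_component, sigma_in, gamma=1.0):
    """Equivalent-percentile comparison sketch (cf. Table 2): the mass
    inside +/- gamma*sigma of a normal distribution is erf(gamma/sqrt(2));
    the same percentile of the absolute output errors is compared against
    gamma*sigma_in of the normally distributed input error space."""
    p_inside = erf(gamma / sqrt(2.0))          # e.g. ~0.6827 for gamma = 1
    bound = np.percentile(np.abs(x_rem_component), 100.0 * p_inside)
    return bound, bound < gamma * sigma_in
```

If the returned bound is smaller than γ·σ of the input distribution, the registration tightened the error distribution for that component.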

Contribution of clipping & confidence interval
To figure out the contribution of the FoV and uncertainty range clipping from Sects. 3.1 and 3.1.2, the influence of the applied confidence interval will be investigated now. Therefore the confidence interval parameter γ is modified, and an additional test was done where both clipping methods were deactivated.
According to Table 3 the smallest remaining error is obtained for γ = 1.75; with larger confidence intervals, and without clipping at all, the error increases. This effect decreases with higher tracking noise, because the huge uncertainty ranges prevent occlusion and shadowing effects from being detected. Evidently the rejection step works very efficiently, such that bad correspondences are detected reliably. Consequently this pre-processing affects the alignment accuracy only slightly, but noticeably. Thus we conclude that the FoV and uncertainty range clipping stabilizes the registration process. Furthermore, it reduces the number of points which have to be aligned, and thus the computational complexity is decreased drastically. This holds true especially if the overlap is small compared to the rest of the data. A smaller confidence interval reduces the range in which potential correspondences can be found. This has positive effects if the error lies within this interval: in this case no potentially good corresponding partners are clipped. However, if the current error is located outside of the confidence interval, points are clipped which could have a good corresponding partner in the other set. Thus the number of details which can be used for the alignment is reduced, together with the overall possibility that ideal pairings can be found at all. This limits the alignment abilities drastically if the current initial error is high.

Table 3. The table shows the remaining error percentiles for the point-to-plane metric with different clipping confidence intervals γ. The light green colour indicates nearly perfect alignment results, whereas dark red stands for unsuitable results above 1° or 20 cm. The achieved accuracy and the necessary number of iterations are quite similar in all cases, especially for higher tracking noise levels where the uncertainty ranges are huge.

Thus we conclude that the uncertainty FoV and range clipping is especially useful for low tracking noise: there the number of points which have to be aligned is reduced significantly, and potential outliers are removed before they can negatively influence the registration process. The confidence interval should not be chosen too small, to avoid clipping potentially useful point pairings. For higher tracking noise the uncertainty range clipping is less useful because occlusion effects can only be detected roughly; therefore outliers within the overlapping area are not removed. The FoV clipping, on the other hand, is always useful because the number of points which have to be aligned is decreased at low computational cost and potential wrong pairings are avoided.

Conclusions
The results show that an alignment of two data sets can be achieved with the help of a pose tracking system although they were captured from completely different perspectives.During the previous evaluation the contribution of certain system design aspects have been examined and the findings are summarized in the following sections.
The primary goal of the FoV clipping was to determine the overlapping area of the sensors' fields of view under the pose estimation uncertainty. This reduces the number of points which have to be aligned significantly, especially if the overlapping part is just a small subset of the complete data. Thus it further reduces the number of possible outliers which are not detected by the applied heuristics. Therefore we conclude that the utilization of this method is always recommendable.
The idea of the uncertainty range clipping was to find the points which could have a corresponding partner in the other data set, taking occlusion effects into account. This works well if the tracking accuracy is already good, because in this case the uncertainty ranges are small and a lot of potential outliers are ignored. Additionally, the number of points which have to be aligned is reduced further, which improves the overall performance. However, if the tracking noise is high, the contribution of the uncertainty range clipping is reduced drastically: the uncertainty ranges are too big to effectively detect occlusion effects caused by small and medium sized objects like cars and trees. Nevertheless, occlusions caused by huge objects like buildings are detected properly, and thus points which cannot have a corresponding point in the other data set are ignored. However, it is heavily situation dependent whether the benefits justify the computational costs of this method.
According to Table 3, a rotation and translation accuracy below 0.5° and below 15 cm is attainable in most cases. Under good circumstances with a lot of details, this even falls below 0.1° and 10 cm. This should be sufficient for many practical applications. Using the improved transformation result, it is now possible to fuse the complete data of both sensors.

Limitations of the presented approach
The presented approach is suitable for scenarios where two depth sensors observe partly the same surface of an arbitrary scene from completely different viewpoints. Additionally, their relative pose has to be roughly known, within a known confidence interval. However, some additional conditions must be met. First of all, enough details should be included in the overlapping part such that an unambiguous registration result is possible. Therefore, it must contain at least three non-parallel surface patches. This further implies that the point density in this region is sufficient to estimate proper normal features and point correspondences. Additionally, the extent of this region has to be significantly larger than the measurement noise of the sensor data and the tracking error.

Further improvements and future work
So far no method is applied which checks whether the above mentioned requirements are fulfilled. For practical applications this has to be checked for each sensor data pair separately before the presented approach can be applied. If that is successfully done, the plausibility of the registration result should be checked. Currently no method is applied which is able to determine the quality of the registration result. A completely wrong alignment can be detected easily by evaluating whether the alignment result lies within the confidence interval of the applied tracking mechanism. For a further assessment of the achieved alignment accuracy, it is necessary to develop a suitable metric. This could be based on the remaining average distance between the correspondences found in the last ICP iteration. Furthermore, the similarity of the normal features between the point pairs could be a helpful measure.
Edited by: M. Chandra Reviewed by: two anonymous referees

Figure 1 .
Figure 1. The figure shows two cars which could exchange their environment information if an accurate enough pose estimation were available (left picture). If the pose estimation is inaccurate, this yields an alignment error (center picture). The right picture shows the coordinate system definitions which were used in the paper.

Figure 2 .
Figure 2. Overview of the processing steps within the presented approach.

Figure 3 .
Figure 3. Within this chart, three different 3σ (γ = 3) confidence intervals of the same two-dimensional correlated normally distributed sample set are shown. The red one uses the full covariance matrix, whereas the green one neglects the covariance values. The rectangular confidence interval includes both ellipses and thus acts as an upper bound.

Figure 4 .
Figure 4.The two pictures show the coordinate system definitions and the nomenclature of the original field of view (left) and the increased uncertainty field of view (right).

Figure 5 .
Figure 5. To each source point an uncertainty range is assigned. Within each sphere at least one target point has to be present, and each target point has to lie within at least one sphere.

Table 2 .
Percentiles from N(0, σ) intervals for direct comparison with non-normally distributed error spaces.
